SCADA Security and Fault Tolerance - A Beautiful Pairing!

Note from Eric Byres:  Oliver Kleineberg makes his debut today as a blogger for Practical SCADA Security and we welcome his expertise in the areas of fault tolerance and redundant networking.  He has recently joined Tofino Security from Hirschmann, our sister company, based in Germany (and both of our groups are part of Belden). Like me, Oliver is involved with standards groups, but in his case he works with IEEE 802.1 and 802.3. In particular he is working with the IEC SC65C WG15 high availability automation networks working group.

As a reader of this blog you are likely already convinced of the need for improved cyber security for industrial networks.  After all, the whole industrial automation world was stirred by the sudden appearance of the Stuxnet malware in 2010.   Besides this targeted need, however, there is another reason why cyber security technology like Tofino is needed.  That reason is the broader need for reliable networks that are used in mission-critical applications.

Good food paired with good wine brings out the sublime qualities of both. 

Industrial Ethernet Pervades Modern Life

Let me explain this by pausing to consider the stunning number of places where Industrial Ethernet is used in today’s society, often hidden in plain view. You may come across it or interact with it on a daily basis, for example in traffic lights. You likely don’t think “Industrial Ethernet” when stopped at a red light. That is a good thing, because it means that the Ethernet and IP technology is getting the job done properly! If it weren’t, you would certainly have noticed already.

The reliability of traffic systems is mostly due to the fact that the switches being used are designed for rugged environments, e.g. with a wide range of operating temperatures and with special resistance to strong electromagnetic fields or vibration. This is very different to the switches you can buy at an electronics store for your home network.

In addition to that, the switches also implement special redundancy protocols that help to recover from errors in the network, e.g. the Parallel Redundancy Protocol (PRP) or the Media Redundancy Protocol (MRP). If, for example, an excavator accidentally damages a cable, the switches will automatically compensate. Great, but what does this have to do with security?
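To make the idea concrete, here is a deliberately simplified sketch of the principle behind PRP: every frame is sent over two independent LANs, and the receiver keeps the first copy and discards the duplicate. The names Frame, DuplicateDiscardReceiver and send_over_two_lans are hypothetical, chosen for illustration only, and this is not the algorithm as specified in the standard (a real implementation would, for instance, limit how long sequence numbers are remembered).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Frame:
    source: str    # sender identifier (hypothetical field name)
    sequence: int  # per-sender sequence number carried by both copies
    payload: str


@dataclass
class DuplicateDiscardReceiver:
    """Accepts the first copy of each frame and drops the later duplicate."""
    seen: set = field(default_factory=set)
    delivered: list = field(default_factory=list)

    def receive(self, frame: Frame) -> bool:
        key = (frame.source, frame.sequence)
        if key in self.seen:
            return False  # copy already arrived via the other LAN: discard
        self.seen.add(key)
        self.delivered.append(frame.payload)
        return True


def send_over_two_lans(frames, receiver, lan_a_up=True, lan_b_up=True):
    """Send every frame on both LANs; a LAN that is down loses its copy."""
    for frame in frames:
        if lan_a_up:
            receiver.receive(frame)
        if lan_b_up:
            receiver.receive(frame)
    return receiver.delivered


frames = [Frame("plc-1", n, f"reading-{n}") for n in range(3)]

both_up = send_over_two_lans(frames, DuplicateDiscardReceiver())
lan_a_cut = send_over_two_lans(frames, DuplicateDiscardReceiver(), lan_a_up=False)
both_cut = send_over_two_lans(
    frames, DuplicateDiscardReceiver(), lan_a_up=False, lan_b_up=False
)

# The application sees the same data whether or not one LAN is cut.
print(both_up == lan_a_cut)
```

In this sketch the receiving application gets identical data with one LAN cut, because the second copy was already travelling on the other LAN. Only the loss of both LANs at once interrupts delivery.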

In a nutshell, industrial ruggedized networks are tough to bring down. But remember the story of David and Goliath? Even the strongest giant can be defeated with just a small pebble, if it is used correctly. That is where cyber security comes into play.

Fault Tolerance and Cyber Security: Separate Design Considerations?

When I talk to experts who design mission-critical networks, e.g. for power utility or industrial automation systems, I find that fault tolerance and security are usually addressed independently.  Totally separate solutions are developed and implemented. Since fault tolerance is generally used to increase the resilience of the network, and security is usually implemented to prevent unauthorized network access, this seems to be a prudent course of action. Or is it?

Implementing fault tolerant networks increases the total network availability because the network can automatically reconfigure to compensate for media or device failure. Security increases the total network availability because it protects from any downtime caused by a cyber attack or network incident.

In addition to that, security technology protects redundant systems from attackers tampering with their protocols and the redundancy technology assures that the secure systems are still available, even after physical failure or a physical attack.
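A back-of-the-envelope calculation shows why the two belong together. The sketch below uses the standard series and parallel availability formulas and assumes that failures are independent. The figures (a 99.9% available link, and a network that is free of cyber incidents 99.5% of the time) are purely illustrative assumptions, not measurements.

```python
HOURS_PER_YEAR = 8760


def series(*availabilities):
    """All elements must work: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """At least one element must work: 1 minus the product of unavailabilities."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable


def downtime_hours(availability):
    """Expected downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


link = 0.999          # assumed availability of one link against physical faults
not_attacked = 0.995  # assumed share of time not lost to a cyber incident

single = link
redundant = parallel(link, link)
# An attack on the redundancy protocol affects both paths at once,
# so it acts like a series element in front of the redundant pair.
redundant_unprotected = series(redundant, not_attacked)

for name, a in [("single link", single),
                ("redundant pair", redundant),
                ("redundant pair, unprotected", redundant_unprotected)]:
    print(f"{name}: {downtime_hours(a):.3f} h/year")
```

Under these assumptions the redundant pair cuts expected downtime from roughly 8.8 hours per year to well under a minute, but the unprotected redundant network is back at about 44 hours per year, because the cyber incident takes out both paths together. Redundancy alone does not help against a fault that both paths share.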

 Figure 1: This network is protected by redundant systems, but the protocols used for redundancy could be the target of a cyber attack, if the security appliances were not present.  Instead, no matter where the master link fails, the security appliance passes through the fail status and the redundant link takes over.  The result is high availability for a mission critical network.


High Availability Network Design Elements

Fault Tolerance | Cyber Security
The network automatically reconfigures to compensate for media or device failure | Protects from any downtime caused by a cyber attack or network incident
Assures that the secure systems are still available, even after a physical failure or a physical attack | Prevents attackers from tampering with redundancy protocols

  Figure 2: Summary of Fault Tolerance and Cyber Security Design Elements for High Availability Networks

Pairing Fault Tolerance and Cyber Security – Ah, just like Good Food with Good Wine!

In the future, network architects will have to rethink their traditional design approach because security and fault-tolerance are interdependent elements of high availability networks. 

Plus, just as good food paired with the right wine brings out the sublime quality of both, a design paradigm that includes both elements takes high availability mission-critical networks to a whole new level.

For more information on the state of industrial fault tolerance technologies, see the White Paper available at the end of this article.  For more information on the state of cyber security, see previous Practical SCADA Security blog posts.

Are you a forward thinking network engineer who is designing mission-critical networks for both high redundancy and high security? Or, perhaps your organization is not advanced in these areas.  Let me know your thoughts on pairing high redundancy with high security to achieve high availability.




Hi Oliver,

Thank you for your interesting blog, and for all the blogs from Eric containing useful information, which I will use for a presentation I will give at the conference organized by the independent French association "Club Automation" in Paris on June 19th, on the topic "Safety & Security: similarities and differences".
Would any representative of your organization be able to attend this meeting?


Hi Yannick,

Thank you for your message and I’m glad you like our blog.

Regarding your event: Thank you for letting me know, I will contact you directly to discuss it.


Hi Oliver,

Thanks for an interesting article. As a holder of certifications in both Functional Safety (FSEng) and Security (CISSP), permit me to inject a few additional comments.

In safety systems, we tend to introduce redundancy to mitigate the effect of random hardware failures by increasing "hardware fault tolerance". It does little to protect against common mode faults or systematic failures. The most common architecture is to duplicate the system with identical devices, together with some sort of synchronisation and watchdog function to determine which is the master system. This will protect against physical failure of one device but probably not against physical attack, as both systems are likely to be located in the same vicinity (a common mode fault). Furthermore, if one of the systems fails due to a systematic fault (e.g. a security vulnerability), it is quite probable that the redundant system will suffer the same fate.

Systematic faults can be addressed through diversity. If the redundant system implements the same functionality using different hardware, operating systems, compilers, and application software, each system is likely to have its own set of unique vulnerabilities. Unfortunately, outside of the space and nuclear sectors this approach is prohibitively expensive. The reality is that the move towards commercial off-the-shelf hardware and software and the widespread adoption of standards has significantly reduced diversity over the last 30 years.

Despite these observations, I do agree with the principal point of your article that security and fault tolerance are complementary but are generally designed in isolation.

It seems to me that security has traditionally been the preserve of the IT Department and fault tolerance has been the preserve of the Automation Department, and there is an unwritten rule which prevents these guys from talking to each other. There is certainly a huge gap in both culture and knowledge between these groups in most organisations. This is a form of organisational diversity which needs to be reduced!


Comments are my own and do not necessarily represent the views of the company I work for.

Hello Iain.

You make a very good point about diversity. It reminds me of an incident at the very beginning of my professional career several years ago, when I worked as an IT consultant. A customer had set up a very expensive SCSI RAID 5 + Hot Spare Hard Disk Array as server storage for a mission-critical system.

One day, one of the hard disks in the array failed. Not a problem at first; that is what the technology was designed for, right? But within one hour, well before on-site service could react, all the other hard disks in the array failed one after another and the whole array went down.
When we investigated later, it turned out that the manufacturer of the array had bought and shipped hard disks from the same manufacturer with consecutive serial numbers... so we had identical hard drives, all produced in the same production batch.

After this incident (and in all subsequent projects involving large disk arrays) we requested disks from at least two different vendors and from different batches. We never encountered the same problem again.

Coming back to automation systems: The only automation field outside of nuclear/aviation applications where you will frequently find product diversity implemented is power substation automation.

Here, at least in high voltage applications, and partially in medium voltage, you will still find a main bus A and main bus B setup. Not only the protection system (IEDs, Intelligent Electronic Devices) is designed with diversity, with IEDs from different manufacturers in bus A and bus B, but also the automation and network equipment. But this is, I agree with your observation, not the general case.

Thank you for this valuable contribution. In our very technical world of today, usually technology is evolving a lot faster than most humans are able to adapt. But I have high hopes that we will eventually eliminate our "unwanted diversity" in this matter. With more and more mission-critical systems moving towards the "cyber-physical" world, we will need this soon enough!

