Steve Flannagan of Citect advises how to protect production against failures in SCADA systems by using redundancy.
Supervisory control and data acquisition (SCADA) systems are widely used in plants and factories across the world; the advanced high-level control and monitoring features they provide are fundamental to improving plant efficiency and productivity. Generally, SCADA systems are highly reliable. However, one aspect of their operation that is often overlooked at the specification stage is redundancy or, put more simply, what happens if the system fails?
The question is particularly relevant when one considers that a control system without redundancy, whether for a single-node or a network application, has a single point of failure, meaning that it will break down entirely if one piece of hardware fails (such as the computer connected to the control and monitoring units). Granted, most modern computers are designed for reliability, but breakdowns still occur, especially with computers located in harsh environments. Consequently, if some or all of the plant processes are critical, or if the downtime costs are high, redundancy must be incorporated into the system so that the failure of one piece of equipment does not stop the whole system.
In the first instance it is important to determine what level of redundancy is required by considering the risks: is the concern hardware failure, catastrophic failure, power failure or a natural disaster? Mission-critical installations often have separate power sources in case of a power failure, and installations in areas prone to natural disasters or the threat of fire place the servers in different geographic locations. However, whatever type of disaster recovery is planned for, it is possible to greatly reduce lost data and downtime by designing the system properly, and by choosing a SCADA system with built-in redundancy.
Increased speed and efficiency
Citect pioneered built-in redundancy nearly 15 years ago, and its first redundant installation is still in operation today. In 1992, Citect for Windows started using a client/server architecture for plant monitoring and control. The benefit of this is that it increases the speed and efficiency of the system by distributing the processes in the control and monitoring application across two or more computers (using a LAN). In a simple application, the computer connected to the control and monitoring units becomes the server that is dedicated to communication with the plant control devices, while the display nodes are clients. When a client computer requires data for display, it requests data from the server and processes that data locally.
To provide redundancy, a second standby server can be added that is also dedicated to communication with plant control devices. If the primary server fails, the clients' requests for data are channelled to the standby server. In very large installations, host pairs of servers are used, with one host pair dedicated as a standby in a separate location from the primary host pair.
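The failover described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Citect's implementation: the class names IOServer and DisplayClient, the ServerUnavailable exception and the tank_level tag are all hypothetical.

```python
class ServerUnavailable(Exception):
    """Raised by a (hypothetical) I/O server that cannot answer."""


class IOServer:
    """Stand-in for a server that talks to the plant control devices."""

    def __init__(self, name, tags, online=True):
        self.name = name
        self.tags = tags        # tag name -> current value
        self.online = online

    def read(self, tag):
        if not self.online:
            raise ServerUnavailable(self.name)
        return self.tags[tag]


class DisplayClient:
    """Display node that asks the primary first, then the standby."""

    def __init__(self, primary, standby):
        self.servers = [primary, standby]

    def read(self, tag):
        for server in self.servers:
            try:
                return server.name, server.read(tag)
            except ServerUnavailable:
                continue  # channel the request to the next server
        raise ServerUnavailable("no I/O server reachable")


primary = IOServer("primary", {"tank_level": 71.5})
standby = IOServer("standby", {"tank_level": 71.5})
client = DisplayClient(primary, standby)

before = client.read("tank_level")   # answered by the primary
primary.online = False               # simulate a primary server failure
after = client.read("tank_level")    # answered by the standby
```

The display node never needs to know which server answered; it simply keeps receiving data.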
The standby server should not simply duplicate the primary server's functions. If it did, both servers would have to communicate with the PLCs, thus doubling the load on the PLC network and reducing system performance. A better alternative for a client/server system is one in which only the primary server communicates with the PLCs. The primary server also communicates with the standby server, continually updating it with the plant's status. If this communication is interrupted, the standby server assumes the primary server has failed and takes over the role of primary server. When the original primary server is repaired and returned to service, it reads the plant's status from the standby server and resumes its role as the primary server. Data is automatically backfilled and the two servers become synchronised again as the standby server reverts to its former role.
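As a rough illustration of this hand-over, the following Python sketch simulates a primary/standby pair with an explicit clock. The Server class, its method names and the three-second timeout are assumptions made for the example; a real SCADA package will have its own mechanism and timings.

```python
HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before takeover (assumed value)


class Server:
    def __init__(self, name, role):
        self.name = name
        self.role = role          # "primary" or "standby"
        self.status = {}          # last known plant status: tag -> value
        self.last_update = 0.0    # when the peer last sent a status update

    def poll_plc(self, plc):
        """Only the primary talks to the PLC network."""
        if self.role != "primary":
            raise RuntimeError("standby must not load the PLC network")
        self.status = dict(plc)

    def push_status(self, peer, now):
        """Primary-to-standby status update; doubles as a heartbeat."""
        peer.status = dict(self.status)
        peer.last_update = now

    def check_peer(self, now):
        """Standby promotes itself if the primary has gone quiet."""
        if self.role == "standby" and now - self.last_update > HEARTBEAT_TIMEOUT:
            self.role = "primary"

    def rejoin(self, peer, now):
        """Repaired server reads the status from its peer, resumes as primary."""
        self.status = dict(peer.status)
        self.role, peer.role = "primary", "standby"
        peer.last_update = now


plc = {"pump_1": "running", "tank_level": 40}
a, b = Server("A", "primary"), Server("B", "standby")

# Normal operation: A polls the PLC and keeps B up to date.
a.poll_plc(plc)
a.push_status(b, now=0.0)
b.check_peer(now=1.0)
role_during_normal = b.role

# A fails after t=0, so B hears nothing and takes over.
b.check_peer(now=5.0)
role_after_failure = b.role
plc["tank_level"] = 55
b.poll_plc(plc)

# A is repaired, reads the plant status from B and resumes as primary.
a.rejoin(b, now=10.0)
```

Note that the standby never polls the PLC while the primary is healthy, so the load on the PLC network is not doubled.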
Dual network paths
If a dedicated file server is also added to the SCADA system, the user can centralise the databases and display screens; continuity is then maintained if the primary server fails. Another benefit is that centralised databases are easier to manage and maintain: changes only need to be made to one database and are then automatically updated everywhere else. In addition, it is possible to support dual network paths to the centralised database, allowing dual file servers if required.
Having secured the system by removing the single point of failure (the I/O server), the user could be excused for thinking that all system eventualities have been covered. However, this is not the case: if the LAN in the newly configured system fails, then control and monitoring by the display nodes is lost. In view of this, a second LAN and a second file server are crucial to help ensure system stability, even in the event of a network failure.
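One way to check a proposed design for remaining weak spots is to model it as a graph and remove each component in turn. The Python sketch below does this for the path from a display node to the PLC only (it leaves out the file server); the component names and both topologies are hypothetical.

```python
from collections import deque


def reachable(links, start, goal, failed):
    """Breadth-first search that treats one component as failed."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(edges, start, goal):
    """Return components whose loss cuts the start node off from the goal."""
    links = {}
    for u, v in edges:
        links.setdefault(u, set()).add(v)
        links.setdefault(v, set()).add(u)
    parts = sorted(n for n in links if n not in (start, goal))
    return [p for p in parts if not reachable(links, start, goal, p)]


# Redundant I/O servers, but only one LAN.
one_lan = [
    ("display", "lan_a"),
    ("lan_a", "io_primary"), ("lan_a", "io_standby"),
    ("io_primary", "plc"), ("io_standby", "plc"),
]

# The same system with a second LAN added.
two_lans = one_lan + [
    ("display", "lan_b"),
    ("lan_b", "io_primary"), ("lan_b", "io_standby"),
]

weak_one_lan = single_points_of_failure(one_lan, "display", "plc")
weak_two_lans = single_points_of_failure(two_lans, "display", "plc")
```

With one LAN the check reports the LAN itself as the remaining weak spot; with the second LAN added, no single component on this path can cut the display node off.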
Where it is important to ensure a plant's uninterrupted operation, beyond duplicating hardware, the plant engineer can employ Split-Task Redundancy. This goes further than simply maintaining continuous communication with the plant-floor devices; it also ensures that all alarm and trend data is maintained in the event of a failure.
Split-Task Redundancy is aimed directly at the centralised part of the processing, and helps ensure that all available processing power is put to use. It is achieved by splitting the server's task into four subtasks: I/O (input/output), alarms, trends and reports. Each of these tasks manages its own database independently of the other tasks, enabling the system user to handle redundancy differently for each task. For example, the user can run the trend task in parallel on both servers (unlike the I/O task, which uses primary/standby processing) to maintain the integrity of the trends. When the primary returns to service after a failure, it can recover its missing trend data from the standby server. Both servers then have continuous, uninterrupted trend data. Nor does the I/O have to be a single layer: many installations use multiple I/O layers to ensure redundancy.
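The trend backfill can be pictured with a short Python sketch. It assumes, purely for illustration, that trend samples are stored as a mapping from timestamp to value and that the primary was out of service for two sample periods; the backfill function and the sample values are hypothetical.

```python
def backfill(local, peer):
    """Fill gaps in the local trend history with samples held by the peer."""
    merged = dict(peer)
    merged.update(local)  # keep the local sample where both servers have one
    return dict(sorted(merged.items()))


# Samples seen on the plant floor: timestamp -> value.
samples = {0: 20.1, 1: 20.4, 2: 20.9, 3: 21.3, 4: 21.0, 5: 20.7}
primary_outage = {2, 3}  # periods during which the primary was down

# The trend task runs in parallel, so the standby records every sample,
# while the primary misses those taken during its outage.
standby_trend = dict(samples)
primary_trend = {t: v for t, v in samples.items() if t not in primary_outage}
missing_before = sorted(set(standby_trend) - set(primary_trend))

# On returning to service, the primary recovers the gap from the standby.
primary_trend = backfill(primary_trend, standby_trend)
```

Because each task keeps its own database, this kind of merge can be applied to trends without changing how the I/O task handles redundancy.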
Finally, to ensure maximum system stability, where parallel PLCs are employed, these units should be connected to the same field devices. This ensures that any single hardware component in the system can fail without disrupting the control and monitoring of the plant.