Fault Tolerance - Application Software
Essay by review • December 15, 2010 • Research Paper • 1,623 Words (7 Pages) • 1,398 Views
Introduction
Today's business requirements drive the need for software applications that let organizations maintain continuous availability. We live in a 24/7 world. Brick-and-mortar businesses that once only needed to worry about the security of their hard assets after closing for the day now have websites that allow customers to shop online at their convenience. Simply establishing a web presence is not enough to compete in the market. Business-to-business (B2B) and business-to-customer (B2C) commerce requires a high-availability strategy, with planning and solutions for a fault-tolerant system with 24-hour access. An organization offering goods and services on the Internet needs fast-loading pictures, smooth transitions between pages, and quick links to services and forms. Stalls, hesitations, and lockups are unacceptable to customers and business partners who have neither the time nor the patience to contend with inefficiency. Of course, high availability does not come without a price tag. Online maintenance and upgrades in a 24/7 environment can triple the maintenance costs for applications and for the design of redundancy.
Unlike hardware reliability, software reliability is difficult to characterize. As software applications continue to increase in complexity, fault tolerance will remain a growing concern. Risk analysis is important in determining the 'pain level' a business can endure when software fault tolerance is improperly implemented.
Today's applications are quite complex, often requiring millions of lines of code. Because of this complexity, hardware components are generally known to be much more reliable than software systems. Laprie (1990) says, "a system failure occurs when the delivered service no longer complies with the specifications, the latter being an agreed description of the system's expected function and/or service." We expect the software to perform as needed for continuous availability. Software faults can severely degrade the performance of the systems that run business operations, so it is essential to limit, control, and contain the possible damage caused by faulty software.
For some applications, software fault tolerance is more a safety issue than a reliability issue. There can be legal or regulatory requirements for fault tolerance. For many B2B and B2C transactions, fault tolerance is a business decision. Other organizations have charters that demand a fault-tolerant system. For example, the aviation industry is required to meet strict specifications for hardware and software. Airlines board over 500 million passengers per year. Transporting so many people demands fault-tolerant environments with robust safety and reliability. In the nuclear industry, the U.S. Nuclear Regulatory Commission developed a tool to assess the reliability of software systems by modeling redundant systems, analyzing failure data, and calculating availability. The International Society for Pharmaceutical Engineering provides risk-based guidance to encourage cost-effective error detection and failure prevention. Companies that strive to maintain accurate and comprehensive environmental data are subject to standards and guidelines from the Environmental Protection Agency. Additionally, emergency services (e.g., police, fire, and 911) require fault-tolerant systems that prevent failures and increase system dependability. In the automotive industry, a minor software error in one of the 50 microprocessors in an automobile could cost many lives before the manufacturer recalls the defective vehicles. The resulting loss of reputation and consumer confidence could cripple the manufacturer's business.
Analyze Requirements
In an ideal world, there would be no downtime for maintenance. Some B2B and B2C organizations schedule downtime, avoiding peak hours and seasons. These organizations usually alert their users to the scheduled downtime to reduce inconvenience and customer dissatisfaction. For example, my bank is unavailable online from 11PM to 4AM on Saturday nights. I know this, so I do not attempt to access the site for transactions during that window. A newcomer to the bank's website may not be aware of the policy and may therefore be discouraged from using the bank's services. Whenever an organization plans a scheduled shutdown, it is advisable to give customers notice of the downtime. Other organizations require continuous availability, even during planned maintenance. High availability requires a strategy that mitigates the risk of interruption, both internal and external. An organization with a high-availability strategy will adopt a selection of technologies to address interruption challenges and will review its availability requirements regularly, treating the system as dynamic. High availability costs money. The cost of fault-tolerant software grows in this order: reliable (least expensive), resilient (less expensive), highly available (more expensive), and continuously available (most expensive). Application services that require continuous availability use multi-site architectures. However, the requirement for multiple sites complicates the design of the application and raises database issues such as partitioning, load balancing, and replication. The problem of designing the software correctly is compounded by the difficulty of assessing its reliability and dependability while it is still in development, especially when millions of lines of code are written.
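The bank example above can be turned into a rough availability figure. The sketch below is illustrative only: the function names are hypothetical, the five-hour weekly window is read from the 11PM to 4AM example, and the multi-site formula assumes that sites fail independently of one another, which the essay does not claim.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def availability_from_weekly_downtime(downtime_hours: float) -> float:
    """Return the fraction of the week during which the service is reachable."""
    if not 0 <= downtime_hours <= HOURS_PER_WEEK:
        raise ValueError("downtime must be between 0 and 168 hours")
    return (HOURS_PER_WEEK - downtime_hours) / HOURS_PER_WEEK


def redundant_availability(site_availability: float, sites: int) -> float:
    """Availability when any one of `sites` sites can serve requests.

    Assumption (not from the essay): sites fail independently, so the
    service is down only when every site is down at the same time.
    """
    if sites < 1:
        raise ValueError("at least one site is required")
    return 1 - (1 - site_availability) ** sites


# The bank's scheduled window: 11PM to 4AM once a week, i.e. 5 hours.
bank = availability_from_weekly_downtime(5)
print(f"single site: {bank:.4f}")
print(f"two sites:   {redundant_availability(bank, 2):.4f}")
```

With five hours of scheduled downtime per week, a single site is reachable roughly 97% of the time. Adding sites raises the figure only to the extent that their outages do not coincide, which is why the partitioning and replication issues mentioned above matter in practice.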
Fault Classification
Software faults fall into many categories. Some are permanent design faults, which are easier to identify and eliminate in the testing phase. Others are temporary: transient and intermittent faults occur only occasionally and are difficult to troubleshoot because they cannot be reproduced at will. They are not readily identified in testing and usually manifest after deployment. Even a mature program in a fault-tolerant computer system may experience both transient and intermittent faults. The difference between the two is reparability. Transient faults arise from a temporary environmental condition and are deemed irreparable. Intermittent faults are caused by unstable software states and are reparable.
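Because a transient fault stems from a temporary condition, one common way to tolerate it (not a technique this essay itself describes) is to retry the operation after a short pause. The following is a minimal sketch, assuming a hypothetical TransientFault exception that the caller's operation raises; the names, attempt count, and delays are illustrative choices.

```python
import time


class TransientFault(Exception):
    """Hypothetical marker for a failure caused by a temporary condition."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1):
    """Call `operation`, retrying when it raises TransientFault.

    The pause doubles after each failed attempt. Any other exception is
    not retried, and the last TransientFault is re-raised once the
    attempts are exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == max_attempts:
                raise  # give up: the condition did not clear in time
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A retry does nothing for a permanent design fault, which fails the same way on every attempt, so the sketch retries only the exception type that marks a temporary condition and lets everything else propagate.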
Fault Reliability Techniques
System reliability can be increased through fault avoidance and fault tolerance. Fault avoidance is much more difficult to achieve as it will reduce, but not eliminate the chance of
...
...