Software Reliability Engineering

Reliability

Reliability is a measure of the continuous delivery of correct service or, equivalently, of the time to failure. In other words, it is the probability that the software will work without failure for a specified period of time (Far, 2006). Reliability is a necessary part of a healthy system design and development process: it ensures that the right user gets the right result for a given input to the system.

Let's say RMS is a web-based realty management system for home buyers and sellers. RMS has been designed so that the system executes without redundancy and uses the most effective design and search criteria. It consists of several useful functions and interfaces, and it provides the right information to the right user within a very short processing period. The rest of this post discusses the reliability of the system.

For the reliability component, we will answer the following six questions that will cover all aspects of reliability in the system.

  1. How will you define failure for the product?
  2. Choose the natural or time unit you will use for the product.
  3. Set the product failure intensity objective (FIO).
  4. Find the expected product acquired failure intensity, based on the failure intensities of the hardware and acquired software components.
  5. Determine the product-developed software failure intensity objective.
  6. How will you balance fault prevention, fault-removal, and fault tolerance strategies?

How will you define failure for the product?

A system failure is an event that occurs when the delivered service deviates from correct service. Thus, a failure is a transition from correct service to incorrect service: the user gets an incorrect result or no result at all. Therefore, to design reliable software, we have to define the failure severity classes of the product.

A severity class is a set of failures that share the same degree of impact on users. Common classification criteria include impact on human life, cost impact, and service impact. Each of these classes may be further subdivided, but classes should be separated by an order of magnitude in impact. We define three severity levels. The first is Mild: the failure does little damage and the system is able to recover from it. The second is Mediocre: the failure should not be seen by the user and needs to be corrected before system delivery. The last and most critical is Severe: if the failure occurs, the system may collapse and deviate from correct service, possibly failing catastrophically and producing no result or an incorrect result.

  1. Incorrect user input/selection/search criteria may result in incorrect or invalid property information, or end in an error message or warning. This is not critical, and the system is able to limit the user's choices and preferences (Mild)
  2. Correct user selections but incorrect input registered by the system may result in incorrect, invalid, or missing property information (Mediocre to Severe)
  3. Correct user access information but incorrect system access, or access to a different user type's preferences/information, which may severely damage existing system information (Severe)
  4. Failure in the initial connection to the online RMS (Severe)
  5. Failure to print correct property information (Mediocre to Severe)
  6. Failure of administrative access for system management (Severe)
  7. Failure to add new property information (Severe)
  8. Failure to update existing property information [such as Offer Made or Cancelled], which may result in access to invalid information (Severe)

Choose the natural or time unit you will use for the product

RMS is basically a real-time information system that reflects updated information at any point in time. It should serve customers on a 24x7 basis, that is, 24 hours/day and 7 days/week, so the operational time of RMS is 168 hours per week and 8760 hours per year.

The RMS is used for two purposes. When customers/realtors query property information based on their preferences, the unit of measurement is hours in operation. When property information is updated or uploaded, we measure in natural units: the number of jobs per property. The upload process consists of two parts: uploading static, one-time information [such as Provinces/Districts/Communities/other home preferences and choices] and uploading the actual information of a specific property. Administrator activity (viewing, adding, modifying, and deleting information) is measured in operational hours.

Set the product failure intensity objective (FIO)

The RMS needs to be up and running almost all the time, that is, 24 hours a day and 7 days a week [168 hours/week]. Customers and realtors may search for a property at any time, but most transactions are expected between 09:00 AM and 09:00 PM [84 hours/week], with Saturdays and Sundays the busiest days. We expect 100 operational hours/week. RMS can accommodate 600 realtors/customers searching for properties, and each property search may take half an hour when every step of the operation is followed. The administrator would add approximately 50 properties/day, and each property could require 5 query processes. At maximum capacity there could therefore be 120,000 + 1,750 = 121,750, or approximately 122,000, transactions/week.
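The weekly transaction estimate can be reproduced with a short Python sketch. The breakdown is an assumption, since the post only gives the two subtotals: 120,000 is read here as 600 users each running back-to-back half-hour searches over 100 operational hours, and 1,750 as 50 properties/day x 5 queries x 7 days. All names are illustrative.

```python
# Sketch of the weekly transaction estimate for RMS.
# The decomposition below is an assumed reading of the 120,000 + 1,750 figures.
OPERATIONAL_HOURS_PER_WEEK = 100
USERS = 600
SEARCH_DURATION_HOURS = 0.5
PROPERTIES_ADDED_PER_DAY = 50
QUERIES_PER_PROPERTY = 5
DAYS_PER_WEEK = 7


def weekly_transactions():
    """Return (search, admin, total) transactions per week at full capacity."""
    searches_per_user = OPERATIONAL_HOURS_PER_WEEK / SEARCH_DURATION_HOURS
    search_tx = int(USERS * searches_per_user)
    admin_tx = PROPERTIES_ADDED_PER_DAY * QUERIES_PER_PROPERTY * DAYS_PER_WEEK
    return search_tx, admin_tx, search_tx + admin_tx


if __name__ == "__main__":
    search_tx, admin_tx, total = weekly_transactions()
    print(f"search={search_tx} admin={admin_tx} total={total}")
```

Rounding the total up gives the figure of about 122,000 transactions/week used in the later calculations.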

There are two ideal FIO targets to consider for the RMS. The first target covers customer-facing use and has two categories. For querying properties, the FIO is roughly 1 failure per 1000 operational hours; at 100 operational hours/week, 1000 operational hours is roughly 10 weeks. For printing property information (sending information to the printer), the FIO is about 1 failure per 70,000 print jobs; with 1000 prints/day expected, 70,000 print jobs is equivalent to 10 weeks in full operation. The second target covers adding and updating property information and also has two categories. For updating property information, the ideal is that 99.5% of transactions complete without failure. For sending automatic notifications to registered realtors, 99% of transactions are expected to complete without failure.
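Because these targets are stated in different units (hours, print jobs, percentages), it can help to put them on a common scale. The Python sketch below converts the quoted figures to failures per 1,000,000 events and checks the two "10 weeks" equivalences. The helper names are illustrative, and the 7-day print week is an assumption consistent with the 24x7 operation described earlier.

```python
# Sketch: express the quoted FIO targets on a common scale.
PRINTS_PER_DAY = 1000
DAYS_PER_WEEK = 7  # assumed: printing happens every day of the week
OPERATIONAL_HOURS_PER_WEEK = 100


def per_million_from_success_percent(success_percent):
    """Failures per 1,000,000 transactions implied by a success percentage."""
    return (100.0 - success_percent) / 100.0 * 1_000_000


def per_million_from_one_in(n_events):
    """Failures per 1,000,000 events when 1 failure is allowed per n_events."""
    return 1_000_000 / n_events


def weeks_to_reach(n_events, events_per_week):
    """Weeks of operation needed to accumulate n_events."""
    return n_events / events_per_week


if __name__ == "__main__":
    print(round(per_million_from_success_percent(99.5), 2))
    print(round(per_million_from_success_percent(99.0), 2))
    print(round(per_million_from_one_in(70_000), 2))
    print(weeks_to_reach(70_000, PRINTS_PER_DAY * DAYS_PER_WEEK))
    print(weeks_to_reach(1000, OPERATIONAL_HOURS_PER_WEEK))
```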

For administrative usage, the RMS should achieve the same target FIO as the RMS operating system. The main reason is that customers and realtors must be able to query properties at all times, and the administrator must be able to access the database to add, delete, or modify property information at any time.

Find the expected product acquired failure intensity, based on the failure intensity of the hardware and acquired software components:

The proposed system would be developed in C#. Since the C# compiler runs on the Windows platform, RMS is expected to be deployed in a Windows environment. The detailed specification of the software system is given below:

| Type | Platform |
|---|---|
| Language | C# |
| Platform | .NET Framework 1.2 |
| Database | SQL Server 2000 |
| Server System | Windows 2000 Server |
| Web Server | Internet Information Server (IIS) |
| Mail Server | SMTP Mail Server |

The hardware specification of the whole system would be:

| Component | Web server | SQL server |
|---|---|---|
| CPU | 2 x 2.8 GHz Pentium 4 | 4 x 2.8 GHz Pentium 4 |
| Memory | 2 GB | 4 GB |
| Disk | 80 GB | 4 x 40 GB RAID 0 |
| Network | 100BaseT | 100BaseT |

From a Microsoft TechNet Web site, we get the following table which summarizes the overall availability of the hardware platforms using the same MTTF vectors.

Table 1: Availability of Hardware Platform

| Cluster | One week | One month |
|---|---|---|
| Internet-facing firewall NLB cluster | 0.998867 | 0.999735 |
| Web NLB cluster | 0.999092 | 0.999788 |
| Search NLB cluster | 0.999498 | 0.999883 |
| Internal firewall NLB cluster | 0.999197 | 0.999813 |
| SQL Server cluster | 0.999504 | 0.999884 |
| Total availability | 0.996164 | 0.999103 |
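The total availability row in Table 1 is the product of the five cluster availabilities, which is what one expects when every cluster must be up for the service to be up. A minimal Python check of that reading, using the table values (the variable and function names are illustrative):

```python
# Sketch: total availability of clusters that must all be up (series arrangement)
# is the product of the individual availabilities. Values are from Table 1.
from math import prod

ONE_WEEK = [0.998867, 0.999092, 0.999498, 0.999197, 0.999504]
ONE_MONTH = [0.999735, 0.999788, 0.999883, 0.999813, 0.999884]


def total_availability(availabilities):
    """Availability of a chain of components that must all be available."""
    return prod(availabilities)


if __name__ == "__main__":
    print(round(total_availability(ONE_WEEK), 6))
    print(round(total_availability(ONE_MONTH), 6))
```

Rounded to six decimals, both products match the totals quoted in the table.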

Table 2: Availability vs. Downtime

| Acceptable uptime percentage | Downtime per day | Downtime per month | Downtime per year |
|---|---|---|---|
| 95 | 72.00 minutes | 36 hours | 18.26 days |
| 99 | 14.40 minutes | 7 hours | 3.65 days |
| 99.9 | 86.40 seconds | 43 minutes | 8.77 hours |
| 99.99 | 8.64 seconds | 4 minutes | 52.60 minutes |
| 99.999 | 0.86 seconds | 26 seconds | 5.26 minutes |
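The downtime figures in Table 2 follow directly from the uptime percentage. The Python sketch below reproduces them; the 30-day month and 365.25-day year are assumptions inferred from the table values, not stated in the source.

```python
# Sketch: downtime implied by an uptime percentage over a given period.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30      # assumed: matches the "36 hours" entry for 95%
DAYS_PER_YEAR = 365.25   # assumed: matches the "18.26 days" entry for 95%


def downtime_seconds(uptime_percent, period_days):
    """Seconds of downtime allowed in period_days at the given uptime."""
    unavailability = 1.0 - uptime_percent / 100.0
    return unavailability * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    for uptime in (95, 99, 99.9, 99.99, 99.999):
        per_day = downtime_seconds(uptime, 1)
        per_month = downtime_seconds(uptime, DAYS_PER_MONTH)
        per_year = downtime_seconds(uptime, DAYS_PER_YEAR)
        print(f"{uptime}%: {per_day:.2f} s/day, "
              f"{per_month / 60:.1f} min/month, {per_year / 3600:.2f} h/year")
```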

Summarizing Table 1 and Table 2, we get the following results:
Availability (A) = 0.996164 over one week (about 99.6%) and 0.999103 over one month (about 99.9%)
Downtime/week (tm) ≈ 3 minutes x 7 = 21 minutes = 0.35 hours [Table 2 gives 86.40 seconds/day for 99.9% uptime, so we assume a somewhat higher figure of about 3 minutes/day]

According to the formula for failure intensity, using the one-month availability, for which 1 - A = 0.000897:

$$\lambda = \frac{1 - A}{A \, t_m} = \frac{1 - 0.999103}{0.999103 \times 0.35} \approx \frac{0.000897}{0.35} \approx 0.00256 \text{ per week}$$

For RMS, about 122,000 transactions would happen per week. So the failure intensity of the hardware/OS/IIS/SQL Server platform described above is 0.00256 failures per 122,000 transactions, or about 0.02 failures per 1,000,000 transactions.
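The platform calculation can be sketched in Python as follows. It uses the one-month total availability from Table 1 (0.999103), whose complement is the 0.000897 that appears in the numerator above, together with the assumed downtime of 0.35 hours. As in the text, the result is treated as failures per week of about 122,000 transactions; the function names are illustrative.

```python
# Sketch: failure intensity from availability and downtime,
# lambda = (1 - A) / (A * t_m), then rescaled to failures per million transactions.
A_ONE_MONTH = 0.999103        # total availability, one-month column of Table 1
T_M_HOURS = 0.35              # assumed downtime per week, from the text
WEEKLY_TRANSACTIONS = 122_000


def failure_intensity(availability, downtime):
    """Failure intensity implied by availability A and downtime t_m."""
    return (1.0 - availability) / (availability * downtime)


def per_million(failures, transactions):
    """Rescale a failure count over `transactions` to 1,000,000 transactions."""
    return failures / transactions * 1_000_000


if __name__ == "__main__":
    lam = failure_intensity(A_ONE_MONTH, T_M_HOURS)
    print(f"lambda = {lam:.5f}")
    print(f"per million transactions = {per_million(lam, WEEKLY_TRANSACTIONS):.3f}")
```

Keeping the factor A in the denominator gives a value slightly above 0.00256; the difference disappears once the result is rounded to 0.02 failures per 1,000,000 transactions.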

Table 3: Calculating total FIO of Hardware

| System/Platform | Failure intensity (λ) per 10 million transactions |
|---|---|
| Hardware (λHW) | 0.2 |
| Email server (λEML) | 20 [approximated] |
| Other systems (λOTH) | 30 |
| Total failure intensity (λsys = λHW + λEML + λOTH) | ≈ 50.20 |

So, the expected failure intensity of the total system is 5 failures for 1,000,000 transactions or 50 failures for 10 million transactions.
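As a quick check of Table 3, the Python sketch below sums the acquired-component failure intensities and rescales the total from 10 million to 1 million transactions. The dictionary keys are illustrative labels for the table rows.

```python
# Sketch: total acquired failure intensity as the sum of the Table 3 rows
# (all values are failures per 10 million transactions).
ACQUIRED_PER_10_MILLION = {
    "hardware": 0.2,
    "email_server": 20.0,   # approximated in the source
    "other_systems": 30.0,
}


def total_per_10_million(components):
    """Sum the component failure intensities."""
    return sum(components.values())


def to_per_million(per_10_million):
    """Rescale failures per 10 million transactions to failures per million."""
    return per_10_million / 10.0


if __name__ == "__main__":
    total = total_per_10_million(ACQUIRED_PER_10_MILLION)
    print(f"{total:.2f} per 10 million, {to_per_million(total):.2f} per million")
```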

Determine the product-developed software failure intensity objective

RMS consists of five system components. These five components have been subdivided into different sub components to form several different classes. We use these components to estimate the target FIO for the whole software (Vuong et al., 2002). The five components with estimated failure intensity objectives are:

1. External Inputs:

Any input provided by the customer/realtor or system administrator that the software must process. (Estimated FIO = range between most limiting factor and greatest factor = 10 failures / 1,000,000 transactions to 20 failures /1,000,000 transactions.)

  1. Auto fill Combo Boxes according to user selections (Province, District, Community, Property Type, Property Category, Property Sub Category, Price Range, Realtor Company, Realtor etc.): 10 failures/1,000,000 transactions
  2. Search (Properties): 20 failures / 1,000,000 transactions

2. External Outputs:

Any message provided to the customer, realtor, system administrator, or outside systems. (Estimated FIO = range between most limiting factor and greatest factor = 10 failures / 1,000,000 transactions to 20 failures /1,000,000 transactions.)

  1. Print Property Information: 10 failures / 1,000,000 transactions
  2. Confirmation of Update/Add information : 20 failures / 1,000,000 transactions
  3. Automatic email notification to the Realtor: 10 failures / 1,000,000 transactions

3. External Interface:

Any external interface that is not a part of RMS. (Estimated FIO = range between most limiting factor and greatest factor = 10 failures / 1,000,000 transactions to 50 failures /1,000,000 transactions.)

  1. Property Information Database: 10 failures / 1,000,000 transactions
  2. Picture Upload Interface integrated with OS File System: 20 failures / 1,000,000 transactions

4. External Inquiries:

Any interactive request by the customer, realtor, system administrator, or outside systems that requires a response from our system. (Estimated FIO = range between most limiting factor and greatest factor = 10 failures / 1,000,000 transactions to 20 failures /1,000,000 transactions.)

  1. Search Properties: 20 failures / 1,000,000 transactions
  2. Printout Properties: 10 failures / 1,000,000 transactions
  3. Auto fill Combo Boxes according to user selections: 10 failures / 1,000,000 transactions

5. Internal System Access:

Any physical file in RMS. (Estimated FIO = range between most limiting factor and greatest factor = 10 failures / 1,000,000 transactions to 20 failures /1,000,000 transactions.)

  1. RMS Database information access: 10 failures / 1,000,000 transactions
  2. Access to System Resources: 10 failures / 1,000,000 transactions
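The per-item estimates listed for the five components can be collected and summarized in a few lines of Python. This is only a sketch of the bookkeeping: the dictionary reproduces the item values listed above (failures per 1,000,000 transactions), and the function names are illustrative.

```python
# Sketch: item-level FIO estimates per component, in failures per 1,000,000
# transactions, and the range they span.
COMPONENT_FIO = {
    "External Inputs": [10, 20],
    "External Outputs": [10, 20, 10],
    "External Interface": [10, 20],
    "External Inquiries": [20, 10, 10],
    "Internal System Access": [10, 10],
}


def fio_range(values):
    """Return (lowest, highest) estimate in a list of item FIOs."""
    return min(values), max(values)


def overall_range(components):
    """Return the (lowest, highest) item estimate across all components."""
    all_values = [v for values in components.values() for v in values]
    return fio_range(all_values)


if __name__ == "__main__":
    for name, values in COMPONENT_FIO.items():
        print(name, fio_range(values))
    print("overall", overall_range(COMPONENT_FIO))
```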

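From the above discussion, we can find the FIO that each component contributes to the live system.

For components connected in series, where a failure in any one component causes a system failure, the failure intensities add:

$$\lambda = \lambda_1 + \lambda_2 + \dots + \lambda_n = \sum_{i=1}^{n} \lambda_i$$

This is the formula that applies to the RMS, and it is the same sum used for the acquired components in Table 3. The failure intensity can be reduced by putting components in parallel, so that the system fails only if every redundant component fails. In that case the contributions multiply instead, with each $\lambda_i$ read as a failure probability over the same interval:

$$\lambda = \lambda_1 \times \lambda_2 \times \dots \times \lambda_n = \prod_{i=1}^{n} \lambda_i$$

Realistically, however, parallel component design is hard to achieve in software.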

Thus the target FIO of the system ranges from 10 failures / 1,000,000 transactions to 20 failures / 1,000,000 transactions. The product-developed software failure intensity objective is the target FIO minus the platform FIO of 5 failures / 1,000,000 transactions, so it ranges from 5 to 15 failures / 1,000,000 transactions (10 - 5 = 5 and 20 - 5 = 15).
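The subtraction can be written out as a small Python sketch, applied to both ends of the target range. The floor at zero is an added safeguard (an assumption, not something the source states) for the case where the acquired components alone would exceed the target.

```python
# Sketch: developed-software FIO = system target FIO - acquired (platform) FIO,
# all in failures per 1,000,000 transactions.
TARGET_FIO_RANGE = (10, 20)   # system target range from the component estimates
ACQUIRED_FIO = 5              # hardware and acquired software, rounded


def developed_software_fio(target_fio, acquired_fio):
    """FIO left for the developed software; floored at zero (assumption)."""
    return max(0, target_fio - acquired_fio)


if __name__ == "__main__":
    low, high = (developed_software_fio(t, ACQUIRED_FIO) for t in TARGET_FIO_RANGE)
    print(f"developed software FIO: {low} to {high} failures per 1,000,000")
```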

  1. Let's say the one-time cost of the software is $50,000. According to COCOMO's design, this maps to 1 failure/50,000 transactions, or 20 failures/1,000,000 transactions [based on approximated cost and transaction mapping criteria]. So, according to our research and design, we can meet the FIO of the product.

How will you balance the fault prevention, fault removal and fault tolerance strategies:

Fault Prevention:

  1. Initial investigation and requirement review: 30%. The realtor/customer requirements and the required time per search give us a clear idea of what the output should be and how it should be produced.
  2. Design review: 30%. An efficient and mature design is the heart of the system. An efficient database and system design results in a stable and efficient system process. A better security approach, enforced integrity constraints, and simple code are the keys to preventing system disasters. The System Flow Diagram and Data Flow Diagram should be clearly documented before development begins.

Fault removal:

  1. Code inspection: 20%. Early detection of errors and step-by-step debugging of the code may save the system from severe failure at the operation stage. It is really hard to find the faulty code once it is integrated with several internal and external components of the system.
  2. Unit test: 5%. An efficient system design requires very few unit tests, so they can be kept minimal in this case.
  3. System test: 5%. If the design is good and a personal review process detects and removes most errors, then testing of the combined system can also be kept to a minimum.

Fault tolerance:

10%. A well-designed software system might not require many fault tolerance techniques. However, good error handling and error recovery procedures need to be put in place. By enforcing object-oriented programming techniques, we may reduce redundancy and faults in the system.
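The percentages above can be totalled per strategy to make the balance explicit. The short Python sketch below does only that arithmetic; the grouping of activities follows the headings above, and the names are illustrative.

```python
# Sketch: effort share per reliability strategy, from the percentages above.
EFFORT_PERCENT = {
    "fault prevention": {"requirement review": 30, "design review": 30},
    "fault removal": {"code inspection": 20, "unit test": 5, "system test": 5},
    "fault tolerance": {"error handling and recovery": 10},
}


def share_per_strategy(effort):
    """Return the total percentage assigned to each strategy."""
    return {strategy: sum(items.values()) for strategy, items in effort.items()}


if __name__ == "__main__":
    shares = share_per_strategy(EFFORT_PERCENT)
    print(shares)
    print("total:", sum(shares.values()))
```

Under this split, fault prevention takes 60% of the effort, fault removal 30%, and fault tolerance 10%.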

The main strategy is to rely on fault prevention in our project rather than to build fault tolerance into the design. Therefore we will concentrate on the ISO 9000-3 design guidelines (Praxiom Research Group Limited, 2003):

  1. Commit to quality by stating the quality objective, understanding the client's requirements, and developing strategies to accomplish quality. (ISO 9000-3: 4.1.1) This means giving our best effort and taking the time to review the specifications of the system with the client and the team. It also means having a concise idea of what a quality system is to the client. We will also be thorough in our testing to ensure quality.
  2. Monitor and report all problems and system performance statistics so that quality can be measured and used to further develop improvement strategies. (ISO 9000-3: 4.1.2.3) This means keeping accurate records of any system problems and analyzing the statistics as they are collected, both to improve the current project and to keep them for future projects.
  3. Incorporate due diligence into the software design. Make sure that the design follows all specifications and can be implemented systematically. Try to avoid the pitfalls of previous projects. Arrive at a consensus on coding conventions and norms so that all development team members can understand each other's code. (ISO 9000-3: 4.4.1)

Therefore, we will design a product that fulfills the expectations of the user, and develop code that is free of redundancy and as accurate as possible. We will also establish standard coding conventions so that the code can be shared among teammates.

Conclusion

The Reliability of the system has been measured to produce target system performance. The Failure Intensity Objective (FIO) of RMS has been set at 15 failures/1,000,000 transactions.

References:

[1] Dr. Behrouz Far, University of Calgary, 2006. http://www.enel.ucalgary.ca/People/far
[2] Handbook of Software Reliability Engineering, edited by Michael R. Lyu.
[3] Software Reliability Engineering, John D. Musa.
[4] Software-Reliability-Engineered Testing, John D. Musa (AT&T Bell Laboratories) and James Widmaier (National Security Agency). http://www.stsc.hill.af.mil/crosstalk/1996/06/reliabil.asp
[5] MCSE Training Kit (Exam 70-226): Designing Highly Available Web Solutions with Microsoft® Windows® 2000 Server Technologies. http://www.microsoft.com/mspress/books/sampchap/5357a.asp#110
[6] Microsoft Solution for Internet Business: Performance and Capacity Planning. http://www.microsoft.com/technet/itsolutions/citsrv/ib/msib2tca.mspx
[7] Planning Fault Tolerance and Avoidance, by Charlie Russel and Sharon Crawford. Chapter 7 of Microsoft Windows 2000 Server Administrator's Companion, Microsoft Press. http://www.microsoft.com/technet/prodtechnol/windows2000serv/plan/planning.mspx
[8] ISO IEC 90003 2004 Software Standard. http://www.praxiom.com/iso-90003.htm
