Background:
A specialized systems vendor used multi-level hardware redundancy to meet customer needs for ultra-high availability. The firm was quite proud of its innovative systems architecture and the (relatively) low ancillary and maintenance costs that resulted from the design. Because the redundancy design required four physical components, arranged as two twin-component units, for each logical component, initial system costs were higher than those of other systems on the market. However, no other system on the market could match these systems in ultra-high availability.
In normal operations, all components performed as usual. When a failure occurred, the Operating System (OS) took the failing component set offline while the functioning set continued normally. The OS immediately notified HQ Logistics of the failure, and HQ Logistics immediately shipped a replacement to arrive the next day, when the systems administration staff would replace the failing twin component. The ultra-high availability resulted from the fact that, given projected failure rates, the working twin set was highly unlikely to fail while the replacement was on its way to the site.
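The availability argument can be illustrated with a small calculation. The sketch below assumes independent failures at a constant rate (an exponential model). The MTBF figure and the 24-hour replacement window are hypothetical placeholders, not figures from the case.

```python
import math

def prob_failure_within(window_hours: float, mtbf_hours: float) -> float:
    """Probability that a unit fails within the window,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-window_hours / mtbf_hours)

# Hypothetical values, for illustration only.
MTBF_HOURS = 50_000.0            # assumed mean time between failures of the surviving set
REPLACEMENT_WINDOW_HOURS = 24.0  # assumed next-day replacement window

p_second_failure = prob_failure_within(REPLACEMENT_WINDOW_HOURS, MTBF_HOURS)
print(f"Chance the surviving set fails before the replacement arrives: {p_second_failure:.4%}")
```

Under these assumed numbers, the exposure during the replacement window is a small fraction of one percent, which is the intuition behind the vendor's availability claim.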
Simplified applications programming was another benefit accruing from the design. Except in very special cases, applications programmers did not need to worry about hardware error handling, as the hardware design and the OS made this transparent to them.
These features made this equipment very attractive to the company’s target markets.
Current situation:
A potentially large customer approached the firm with a special request. A proposed new application needed ultra-high availability for the core systems at headquarters, but the branch office systems could tolerate some unscheduled downtime. While the customer would prefer a single vendor for the entire project and preferred this vendor’s approach to fault tolerance, the budget could not manage the projected total systems cost. Since ultra-high availability was a critical component of the customer’s proposed new configuration, the customer approached the vendor with a proposal:
The customer would purchase standard high-end systems for the back end and for HQ use but, to lower total costs, the customer requested significant changes to the standard system architecture for the branch office systems where fault tolerance requirements were not as stringent. The customer proposed the vendor modify the low-end systems by removing the redundant hardware – a process known internally to the vendor as “simplexing.”
Since all systems temporarily ran “simplexed” when a component failed and the OS took it offline until a replacement arrived, the engineering development and manufacturing functions saw no impediments to the sale. Sales and other functions were quite happy about the size of the order.
What Happened Next:
Service agreed to support these modified systems, but had some concerns about the increased needs in the field for spares and standby labour. Roy Sequeira agreed to study these costs and estimate their impact on the Service organization.
Roy had just completed a Service model to forecast how new products would affect Service resources, revenues, and costs. This model showed that the proposed modified design had some rather severe operational implications for Service that affected service profitability and had a potentially devastating impact on the modified product’s field usability. The model also made it easy for others to follow and understand the analysis.
Normal Failure Mode:
As discussed earlier, a normal system, on detecting a failure in any hardware unit, would shut down the failing unit immediately and flag the OS, leaving its partner unit to provide logical continuity. Diagnostic routines within the OS would then test the failing unit to determine whether the failure was transient or permanent. If the failure was transient, and the transient failure count was less than a threshold value, the failing unit would be resynchronized with its partner and everything was back to normal.
On permanent failures, the OS would leave the failing unit offline and flag HQ, who would then send out a replacement to arrive the next morning, when the customer would remove the bad unit and replace it with the new one. The OS would then synchronize the two units and things would be back to normal again.
The systems operator could handle most failures with no need for a field engineer to intervene.
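The decision logic described above can be sketched in a few lines. This is an illustrative model only, not the vendor's OS code: the names, the threshold value, and the choice to treat a transient failure at or above the threshold as a replacement case are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    RESYNCHRONIZE = "resynchronize with partner unit"
    REPLACE = "leave offline and request a replacement from HQ"

@dataclass
class Unit:
    name: str                    # hypothetical unit label
    transient_failures: int = 0  # running count of transient failures
    online: bool = True

TRANSIENT_THRESHOLD = 3  # hypothetical threshold; the case gives no value

def handle_failure(unit: Unit, is_transient: bool,
                   threshold: int = TRANSIENT_THRESHOLD) -> Action:
    """Decide what to do with a failing unit while its partner carries on."""
    unit.online = False  # the failing unit is shut down immediately
    if is_transient:
        unit.transient_failures += 1
        if unit.transient_failures < threshold:
            unit.online = True  # brought back in step with its partner
            return Action.RESYNCHRONIZE
    # Permanent failure, or (by assumption) too many transient failures.
    return Action.REPLACE
```

For example, with the assumed threshold of 3, a unit survives two transient failures by resynchronizing and is flagged for replacement on the third.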
“Modified” Failure Mode:
The modified systems would behave very differently. With no partner unit to continue system operation, any failure in the modified architecture would cause the system to lock up, even on transient failures that the regular systems rode out with no difficulty. The customer had requested the industry-standard 4-hour response for such failures at branch level. This meant dispatching a field engineer with the required part to make the system operational again.
Since customer operations staff handled most failures on “normal” systems, the company had relatively few field engineers on staff. Responding to this proposal required deploying additional field resources to respond to the customer in a timely manner.
While maintenance on the modified systems would bring in only 60 percent of the MMC on unmodified systems, total maintenance costs (in actual dollars) would stay the same as those for standard systems. This would depress service margins by about 43%, forcing them into negative territory, and result in heavy service losses.
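The margin effect can be checked with simple arithmetic. The sketch below reads the 43% as a drop in percentage points, which is only one possible reading. The cost-to-revenue ratio is a hypothetical value chosen for illustration; the case states only the 60 percent revenue figure and the unchanged cost.

```python
def service_margin(revenue: float, cost: float) -> float:
    """Service margin as a fraction of revenue."""
    return (revenue - cost) / revenue

STANDARD_REVENUE = 100.0  # normalized maintenance revenue on a standard system
REVENUE_FACTOR = 0.60     # from the case: modified systems bring in 60 percent
SERVICE_COST = 64.5       # hypothetical cost per 100 of standard revenue

standard_margin = service_margin(STANDARD_REVENUE, SERVICE_COST)
modified_margin = service_margin(STANDARD_REVENUE * REVENUE_FACTOR, SERVICE_COST)
drop_points = (standard_margin - modified_margin) * 100

print(f"Standard margin: {standard_margin:.1%}")
print(f"Modified margin: {modified_margin:.1%}")
print(f"Drop: {drop_points:.1f} percentage points")
```

With the same cost spread over 60 percent of the revenue, any standard margin below 40 percent turns negative, which matches the direction of the finding reported above.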
Unscheduled downtime would be quite high. The added field resources could not fully compensate for the increased downtime. This would adversely affect market perception of product reliability and availability – a major sales and marketing feature.
Results:
Roy showed the overall negative aspects of the proposal outweighed the benefits. The large size of the potential order was definitely very attractive. However, Roy’s analysis clearly and conclusively showed the company could not live with the Service consequences. The data and accompanying analysis convinced doubters, minimized the emotional level of further discussions, and enabled a realistic decision the company could live with.
The company regretfully declined the order.
Summary
The contrarian approach taken by the analysis, taking nothing for granted and challenging some of the company’s fondest assumptions, was essential to finding the problem with this “opportunity.” There was a lot of initial resistance to the findings, especially when some generally accepted operational assumptions were proved wrong, but the dispassionate focus on facts carried the day.