Executive Summary:
This is an excellent example of how:
- Starting from fundamentals
- A knowledge of how various functions operate under pressure, and
- The ability to integrate knowledge from different disciplines
Introduction:
A multi-national computer manufacturer was getting ready to launch a new CPU to power a top-of-the-line flagship system. Marketing estimated sales of well over 19,000 systems within five years. It was to be the replacement for the current top product line.
Using advanced technology and a state-of-the-art maintenance architecture, the new system aimed, among other things, at simplifying diagnostic processes so that it would be essentially self-diagnosing. As planned, all that field engineers would have to do was follow the instructions issued by the self-diagnostic process.
Note on recommended diagnostic process:
The internal self-diagnostic hardware and firmware could isolate about 75% of all failures to one module out of 17 in the CPU, with an incremental percentage of failures isolated to two or three modules.
However, in one failure mode, the self-diagnosing process could not isolate the failure at all. Failures of this kind were also rather elusive, occurring at lengthy intervals.
For handling this failure mode the design team planned an iterative “divide and conquer” strategy to isolate the failing module. The failure isolation technique iteratively cut down the number of suspect modules until problem resolution. See Appendix A for details.
Background:
A number of groups were intimately involved with the preparation for launch.
As always, Engineering designed the CPU to meet specifications and the requirements of other groups such as Marketing, Sales, and Service. Service had an internal engineering function that worked closely with design and development to specify an appropriate maintenance strategy and ensure the final product met Service needs. This group had designed the maintenance process and collaborated with Engineering in implementing it. Consequently, group members had a strong vested interest in the maintenance design and strategy.
Another function (NPI) was directly responsible for interfacing with field operational groups and representing their interests in design and development meetings. Roy Sequeira, with his strong technical expertise, managed the field interface for this new system. By the time he came on board, the design was complete and accepted and everything moved towards the release date.
A key policy that significantly affected what came later was that of scrapping all modules that went through the repair process three consecutive times with no fault found (NFF).
Initial misgivings:
Roy studied the maintenance strategy quite carefully and felt uncomfortable about including the iterative isolation process in the recommended maintenance strategy. Normally, field engineers used this technique as a desperate final attempt at problem resolution when all other methods of problem isolation had failed. By the time this method was invoked, the customer was usually quite annoyed, as the system had been unavailable for an extended period and the iterative process would extend the “down time” even further. Roy knew customers would be unwilling to allow field engineers the time needed for ultimate problem isolation; once the CPU functioned properly, they would not allow time for executing the rest of the recommended process.
With first-hand experience in how field engineers operated, Roy also knew they would resort to massive swapping when faced with elusive problems, especially since it was part of the recommended maintenance strategy.
However, the engineering functions believed management could contain and control any fallout by imposing strict discipline on the field.
General operational scenario:
Roy decided to model the expected scenario when using the iterative process modified for expected customer reactions.
If the first pass through the iterative process with half the modules replaced eliminated the problem, the customer, now with an operational CPU available, was highly unlikely to permit field engineers to continue with the process. If it did not eliminate the problem, the customer would insist that all the remaining modules be replaced at once.
Thus, in either case, each time this failure mode occurred, half the CPU modules went back to the repair function, which then had to isolate the failure and repair the faulty module. Even if the repair function succeeded in isolating and repairing the failure, the remaining modules would be marked “No Fault Found” and put back on the shelves as good modules for return to the field.
As noted above, the policy called for scrapping modules marked “No Fault Found” three consecutive times. A large number of perfectly good modules would therefore end up on the scrap heap.
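The flow described above can be sketched in a few lines of Python. This is only an illustration and not Roy's actual model: rounding "half of 17 modules" up to 9, assuming exactly one genuinely faulty module per incident, and all function names are assumptions made for the example.

```python
import math
from typing import Sequence

MODULES_PER_CPU = 17  # from the diagnostic note: one module out of 17
NFF_SCRAP_LIMIT = 3   # policy: scrap after three consecutive NFF results


def modules_returned_per_incident(modules_per_cpu: int = MODULES_PER_CPU) -> int:
    """Modules sent back to repair per incident (assumption: half, rounded up)."""
    return math.ceil(modules_per_cpu / 2)


def nff_modules_per_incident(modules_per_cpu: int = MODULES_PER_CPU,
                             faulty: int = 1) -> int:
    """Returned modules that repair will mark NFF (assumption: one is truly faulty)."""
    return modules_returned_per_incident(modules_per_cpu) - faulty


def is_scrapped(repair_results: Sequence[str], limit: int = NFF_SCRAP_LIMIT) -> bool:
    """True once a module's repair history shows `limit` consecutive "NFF" results."""
    streak = 0
    for result in repair_results:
        streak = streak + 1 if result == "NFF" else 0
        if streak >= limit:
            return True
    return False
```

Under these assumptions, each incident returns 9 modules, of which 8 are good modules that each move one step closer to the scrap limit.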
Analysis:
Roy modeled the cost associated with this failure mode. When he saw the preliminary results, he had his model and assumptions checked and verified by the internal management consulting group. They concurred with his findings and supported his next moves.
Roy’s model showed that, given marketing’s expected number of units sold, the predicted Mean Time Between Failure (MTBF) for this failure mode, and an average cost per module of $3,000, these scrapped modules would cost the company over $440 million. Given this staggering cost, Roy felt it unnecessary to model labour, logistics, and transport costs as well.
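A rough sanity check on the scale of that figure can be done with the three numbers the text does give. The sketch below is not Roy's model (the MTBF and the other inputs are not stated here); it only works backwards from the quoted totals, and the function names are hypothetical.

```python
COST_PER_MODULE = 3_000         # dollars, average cost per module (from the text)
SCRAP_COST_FLOOR = 440_000_000  # dollars; the text says "over $440 million"
SYSTEMS_SOLD = 19_000           # marketing estimate; the text says "well over 19,000"


def implied_scrapped_modules(total_cost: float = SCRAP_COST_FLOOR,
                             cost_per_module: float = COST_PER_MODULE) -> float:
    """Number of scrapped modules implied by the total scrap cost."""
    return total_cost / cost_per_module


def implied_scrap_per_system(total_cost: float = SCRAP_COST_FLOOR,
                             cost_per_module: float = COST_PER_MODULE,
                             systems: int = SYSTEMS_SOLD) -> float:
    """Scrapped modules per system sold, using the marketing estimate."""
    return implied_scrapped_modules(total_cost, cost_per_module) / systems


print(round(implied_scrapped_modules()))     # 146667
print(round(implied_scrap_per_system(), 1))  # 7.7
```

Because both the cost and the sales volume are quoted as lower bounds, these values are approximate: roughly 147,000 good modules scrapped, or on the order of seven to eight per system sold.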
This analysis and information burst like a bombshell when presented to service management and the development community.
Follow up:
Roy presented his model and findings to the main Service Management community. After considerable probing and verification, they agreed to present the information to the main corporate product management committee.
Corporate management finally decided to send the CPU and system back to the drawing board.
Appendix A: Fault Isolation Process:
The problem isolation process was iterative and worked as follows:
Initially all CPU modules are suspect.
- Divide the suspect modules into two equal groups and designate one as the “target.”
- Replace the target group with modules from known good spares.
- If the problem recurs, the problem is not in the target group.
  - Reinstall the original target modules in the CPU.
  - Designate the remaining modules as suspect.
  - Repeat from Step 1 above.
- If the problem disappears, the modules just replaced are now the suspect group.
  - Repeat from Step 1 above.
See following flowchart for details
Note:
Failure resolution is deemed to occur when the CPU operates flawlessly for at least three times the longest period between failures.
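The isolation process in this appendix is essentially a binary search over the suspect modules. A minimal Python sketch follows, assuming exactly one failing module. The callback `problem_recurs_after_swap` is hypothetical: it stands for swapping the target group with known good spares and then observing the CPU for the resolution period described in the note above. With an odd number of suspects, the two groups cannot be exactly equal, so the sketch makes the target the smaller group.

```python
from typing import Callable, List, Sequence


def isolate_failing_module(
    modules: Sequence[str],
    problem_recurs_after_swap: Callable[[List[str]], bool],
) -> str:
    """Narrow the suspect list by halves until one module remains.

    Assumes a single failing module among `modules`.
    """
    suspects = list(modules)  # initially all CPU modules are suspect
    while len(suspects) > 1:
        half = len(suspects) // 2
        target, remaining = suspects[:half], suspects[half:]
        if problem_recurs_after_swap(target):
            # Fault is not in the target group: the originals go back in
            # and the remaining modules become the suspects.
            suspects = remaining
        else:
            # Problem disappeared: the modules just replaced are suspect.
            suspects = target
    return suspects[0]


# Usage with a simulated fault in a hypothetical module "M11" of 17.
cpu_modules = [f"M{i}" for i in range(1, 18)]
found = isolate_failing_module(cpu_modules, lambda target: "M11" not in target)
print(found)  # M11
```

For 17 modules this takes at most five passes, and each pass includes a full observation period, which is why the process extends down time so much in practice.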