Figure 3.1. Sightings-to-bugs flowchart, part 1.
Figure 3.2. Sightings-to-bugs flowchart, part 2.
We deliberately withheld the authority to mark the issue as resolved, saving that role for the validation engineer who filed the issue in the first place. That validation engineer then had to demonstrate that the previously failing test now passed, and that the newly fixed model also passed some minimal regression tests, so as to catch any obvious system-wide cases where the fix broke something that used to work.
We found this process extremely valuable because it guaranteed that at least two project engineers had deep knowledge of both the bug and its cure. It also removed the temptation for the design engineer to continually short-circuit the process by simply unilaterally deciding it really was not a bug or he “thought he had fixed it” or any of the thousand other delusions a creative person can rationalize his way into.
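A minimal sketch of that resolution handshake appears below. It is illustrative only: the class, state names, and fields are hypothetical, not the actual tracking tool we used, but it captures the rule that only the filing validation engineer can close an issue, and only with evidence in hand.

```python
from dataclasses import dataclass

# Hypothetical lifecycle states for a tracked issue.
SIGHTING, CONFIRMED_BUG, FIX_PROPOSED, RESOLVED = "sighting", "bug", "fix proposed", "resolved"

@dataclass
class TrackedIssue:
    filed_by: str       # validation engineer who reported the sighting
    description: str
    state: str = SIGHTING
    fixed_by: str = ""

    def confirm(self) -> None:
        """Triage agrees the sighting is a real design error."""
        self.state = CONFIRMED_BUG

    def propose_fix(self, design_engineer: str) -> None:
        """The design engineer checks in a candidate fix; the issue is not yet closed."""
        self.fixed_by = design_engineer
        self.state = FIX_PROPOSED

    def resolve(self, requester: str, original_test_passes: bool, regressions_pass: bool) -> None:
        """Only the original filer may close the issue, and only with evidence."""
        if requester != self.filed_by:
            raise PermissionError("Only the validation engineer who filed the issue may resolve it.")
        if not (original_test_passes and regressions_pass):
            raise ValueError("Closing requires the original test passing plus a clean mini-regression.")
        self.state = RESOLVED
```

The permission check is the whole point: it forces the filer and the fixer to both understand the bug and its cure before the issue can disappear from the list.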
MANAGING VALIDATION. Inherent in both the mismanagement of design errors and our process for successfully handling them is a critical requirement: successfully manage the validation effort. This takes sublime judgment, which I have been able to distill into four don'ts (since the first step in doing something right is knowing what not to do).
First, don’t use the validation plan as a performance measure. A validation manager finds herself embedded in a design project that values objectivity and measurement. She is asked to provide indicators and metrics by which her superiors can gauge her team’s overall progress. Because the validation plan most closely resembles a list of tests to be run, an obvious metric is to measure how much of the plan the team has accomplished at any given time. The flaw in this thinking is that the validation crew conceived the plan while carefully considering all the technical corners of the design it must cover. You cannot reasonably expect them to anticipate how the sequence of those tests will jibe with what the design is capable of in any given week, nor with what the design team may actually need that week.
More important, validation teams learn as they go. And the main thing they learn is where their own plan had holes and weaknesses. If a validation team is being managed strictly to the fraction of the plan they have completed, they may become fatally discouraged about adding any more tasks to it.
Second, don’t use the number of bugs found as a performance measure. Late in the project, after most of the RTL has been written and built into a chip model, validation applies their accumulated mass of tests, along with randomly generated code and other methods, to try to find any design errata. The rate at which they find bugs depends on a mix of the design’s innate bugginess, how close to the surface the bugs are, how many computing resources are available to the validation team, and how quickly the team can resolve the list of currently open sightings. Measuring validation’s output strictly in terms of bugs found per week can quickly distort the entire validation process. Coverage matters, too. If validation finds that all their testing has turned up very few bugs in functional unit X, but revealed a veritable bug killing field in functional unit Y, they must be allowed to increase their pressure on Y without completely giving up on X.
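One way to act on that principle is sketched below, with entirely hypothetical numbers: weight each unit's share of the available test cycles by its recent bug yield, but reserve a fixed floor for every unit so that coverage of the quiet ones never drops to zero.

```python
def allocate_cycles(total_cycles: int, bugs_last_week: dict[str, int],
                    floor_fraction: float = 0.10) -> dict[str, int]:
    """Split validation compute across units by recent bug yield,
    reserving a minimum share for every unit so coverage never drops to zero."""
    units = list(bugs_last_week)
    floor = int(total_cycles * floor_fraction)          # guaranteed cycles per unit
    remaining = total_cycles - floor * len(units)
    total_bugs = sum(bugs_last_week.values()) or 1      # avoid divide-by-zero in a quiet week
    return {u: floor + remaining * bugs_last_week[u] // total_bugs for u in units}

# Unit Y is a bug killing field, unit X looks quiet: Y gets more pressure, X keeps its floor.
print(allocate_cycles(10_000, {"X": 2, "Y": 38}))
```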
Third, don’t use the number of tests passed as a measure of project health. Design projects running at full speed can be intimidating to upper management. They see the schedule deadline looming large, design errata being found at a steady or increasing rate, and a steady stream of bad news (higher power needed, performance deviations, larger die size), with no guarantees that the stream will not turn into a flood. They are tempted to ask the crucial question, “Is this project lurching toward a tapeout convergence, or is it exploding right before my eyes?” How can they tell?
One indicator they look at is the rate at which new bugs are manifesting in the design. Managers want the design errata rate to decrease steadily toward the tapeout window and then essentially hit zero per week for several weeks before taping out. That is what they would like, but that is not what happens. What happens is that by the time the RTL has matured enough to run the really tough tests and the validation crew has disposed of the easier bugs, not much time is left before formal tapeout. In fact, as the validation team gets more expert at wielding their tests and debugging the failures, the overall errata rate may well go up in the project’s last few weeks.
To avoid upper management’s wrath, the validation team might choose to accentuate the positive. It is easy to rationalize. After all, the validation plan called for every combination of opcode and addressing mode to be checked, so it is not necessarily duplicitous to report that all those seem to work, instead of concentrating your overall validation firepower on the areas yielding the most bugs.
Resist the urge to even think in that direction. Instead, let the project’s indicators tell you the truth and guide whatever actions are appropriate. If some particular part of the design is generating more than its share of bugs, increase the design and validation resources assigned to it. Pay attention to the type as well as the number of bugs; the bug pattern may reveal an important lesson along either dimension. Once you have formed as accurate a picture as possible of the project’s health, relay that picture to your management, the bad with the good. (Then take your licking with aplomb. That is why they pay you the big bucks.)
Fourth, don’t forget the embarrassment factor. Imagine a design error such that the RTL model is yielding this result: 2 + 2 = 5. (Ignore for the moment that a bug that simple would never make it into the wild, since the operating system would never successfully boot on a machine broken this badly. In fact, for a bug this egregious, the rest of the design team would probably invent some unique punishment that is too terrible to put in print.) There is a class of bugs that would prove beyond reasonable doubt that the design team simply hadn’t ever tested whatever function they are found in. Worse, the implication to the buyer is “If they didn’t test that, what else didn’t they test?” For this reason, every opcode and every addressing mode really should be tested at some point in chip development, and anything else with a high embarrassment factor, whether or not this “coverage” testing is detecting design errata at a high rate.4
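A coverage matrix for that kind of embarrassment testing can be as simple as the sketch below; the opcode and addressing-mode lists are hypothetical stand-ins, since a real instruction set has far more of both.

```python
from itertools import product

# Hypothetical opcode and addressing-mode lists; a real ISA has far more of both.
OPCODES = ["ADD", "SUB", "MUL", "DIV", "MOV", "CMP"]
ADDR_MODES = ["register", "immediate", "memory-direct", "memory-indexed"]

# Every (opcode, addressing mode) pair starts out uncovered.
coverage = {pair: False for pair in product(OPCODES, ADDR_MODES)}

def record_test(opcode: str, mode: str) -> None:
    """Mark a combination as exercised by at least one passing test."""
    coverage[(opcode, mode)] = True

def embarrassment_report() -> list[tuple[str, str]]:
    """Combinations nobody has ever tested -- the '2 + 2 = 5' class of exposure."""
    return [pair for pair, hit in coverage.items() if not hit]
```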
Plan to Survive Bugs that Make It to Production. Those who are not engineers know two certainties: death and taxes. Engineers know a third: there are no perfect designs. Every single product ever made has had shortcomings, weaknesses, and outright design errata [6].
The goal of the first two parts of my avoid/find/survive product-quality algorithm is to prevent design errata in the final product. The focus of the third part is to minimize the severity and impact of the errata that make it into the product. Some bugs are hardy or well disguised enough to complete that journey. The FDIV flaw was a subtle, complex bug that survived both design and validation.
I once worked on a microprocessor design team (not at Intel) whose motto was “From the Beginning: Perfection.” We even had that slogan emblazoned on our T-shirts. But the chip, far from being perfect, underwent an aggravating 13 steppings (design revisions) before it was really production ready. I believe a significant contributor to this was that the project team took the slogan too seriously. Once the project team convinces itself that a perfect product is an achievable goal, that mindset absolves the team from considering what they will face when the silicon comes back from the fabrication plant. Then, when it comes back from the fab in an obviously imperfect state, the design team will not have provided themselves any resources with which to debug the chip.
NASA has a well-tested methodology for dealing with unforeseen eventualities. For mission-critical facilities on a spacecraft, for example, NASA provides backups and sometimes backups for the backups. But merely providing the additional hardware is not enough. You must also try to anticipate all possible failure modes to make sure that the backup is usable, no matter what has happened.
Malfunctioning microprocessors are infuriating devices to debug in the lab. Computers in the 1980s were of medium-scale integration, and you could usually directly measure whatever signals or buses turned out to be of interest. With microprocessors, especially those with caches and branch-prediction tables, an awful lot of activity can occur on the chip with no outwardly visible sign. By the time it is externally visible that things have gone awry, many millions or even billions of clock cycles may have transpired. You could be chasing a software bug in the test, an operating system bug, a transient electrical issue on the chip or on the bus connected to it, a manufacturing defect (a stuck-at fault) inside the microprocessor, or a design error. At this instant, as you stand there helpless and befuddled in the debug lab, the scales fall from your eyes, and you see clearly and ruefully that during design you should have provided an entire array of debug and monitoring facilities, with enough flexibility to cover all the internal facilities you wish you could observe right now.
Having learned this lesson on previous machines, we architects imbued the P6 and Pentium 4 microprocessors with a panoply of debug hooks and system monitoring features. If a human debugger has access to the code being executed and wants to see the exact path the processor is taking through that code, all she needs from the microprocessor is an indication of every branch taken. When that sequence of branches diverges from what was expected, she has a pretty good idea of the bug’s general vicinity and can begin to zero in on it.
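The sketch below shows the kind of reconstruction that branch indication makes possible. The record format and addresses are hypothetical, not the actual P6 or Pentium 4 debug-hook output.

```python
# A sketch of path reconstruction from a taken-branch trace. The record format and
# addresses are hypothetical, not the actual P6/Pentium 4 debug-hook output.

def reconstruct_path(entry_point: int, taken_branches: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Given the program entry point and a chip-reported list of
    (branch address, target address) pairs, return the straight-line
    regions the processor executed, as (start, end) address ranges."""
    regions = []
    pc = entry_point
    for branch_addr, target in taken_branches:
        regions.append((pc, branch_addr))   # fell through sequentially until this branch
        pc = target                         # then jumped to the branch target
    return regions

# The debugger compares these regions against the path the test was expected to take;
# the first divergence marks the general vicinity of the bug.
trace = [(0x1010, 0x2000), (0x2030, 0x1080)]
print(reconstruct_path(0x1000, trace))
```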
There are two subtleties in providing debug hooks. The first is not to use their existence as a crutch to do a poorer job at presilicon validation (a fear upper management often expresses). The second is to take validation of these debug hooks as seriously as you do the chip’s basic functionality. After all, “If you didn’t test it, it doesn’t work” applies to all aspects of the chip, not just the mainstream instruction-set architecture stuff.
A Six-Step Plan for High Product Quality
There are no sure-fire recipes for getting a product right. But there are effective tactics and attitudes. Here are six good ones.
1. Incorporate only the minimum necessary complexity in the project in the first place. Do not oversubscribe the team.
2. Make sure the design team and corporate management agree on the right level of product quality. Empty aphorisms like “achieve highest possible quality,” “no product flaws,” or “20% better than competitor X” are worse than useless because they lead to insidious cases in which the design team is invisibly and actively working at cross-purposes to project management, executive management, or itself.
3. Do not let the design team think of themselves as the creative artists and the validation team as the janitorial staff that cleans up after them. There is one team and one product, and we all succeed or fail together.
4. Foster a design culture that encourages an emotional attachment by the designers to the product they are designing (not just their part of that product). But engineers must also be able to emotionally distance themselves from their work when it is in the project’s best interests.
5. Make sure the validation effort is comprehensive, well planned, and adequately staffed and funded, with the goal of “continuously measuring the distance from the target,” thus ensuring product quality is converging toward success.
6. Design and religiously adhere to a bug-tracking method that will not let sightings or confirmed bugs fall through the cracks.
The idea that the design culture should encourage emotional attachments, yet the engineers must be able to sometimes turn that emotion off, may seem inconsistent or even mystical, but it is actually quite simple. The emotional attachment just means the engineer cares about what she is designing. She wants to get it right, and she wants the product to succeed. To do a proper design review, however, the designer must check her ego at the door and realize that it is in the best interests of the overall goal (a successful product) that her design undergo some rigorous scrutiny [28]. The commitment to the overarching goal is what will guide the engineer in knowing when to override the emotional attachment.
The Design Review
Designers make mistakes, but good designers strive to avoid making the same mistake twice, and they usually succeed, particularly if another equally talented designer is diligently checking what they have created. Software engineers have learned this as well [29]. Practitioners of extreme programming also practice extreme reviewing, calling for the checker to sit at the elbow of the coder to check the code in real time as the coder first enters it [23]. Hardware design practices a less extreme form of cross-checking, called the design review (although “extreme designing” might be worth trying some day).
In most cases, a design review is a formal process whereby the engineer of the unit being reviewed presents her design to a panel of peers who then try to find any holes or errors in that unit. Design reviews have many styles, but they all include some kind of panel of peers. For the P6 and Willamette projects, our panel consisted of formal reviewers who were expected to do preparatory reading and study before the event, and 10 to 20 other interested designers and observers. The observers were not necessarily expected to actively contribute, although they sometimes did. Their role was to learn about their neighbor’s unit design and to reinforce in everyone’s mind that this project took its design reviews seriously.
The reviewers must have enough information to be able to follow the review and contribute to it. The unit designer has to furnish this information, particularly that which establishes the design’s context: its place in the overall system; the function it is expected to fulfill; its constraints in terms of power, performance, die size, and schedule; and early alternatives to the approach that was eventually selected. Depending on the design itself, the designer might also need to provide block diagrams, pipeline and timing diagrams, and protocols, as well as describe finite-state-machine controllers and possibly furnish the actual source code or schematic diagrams.
When design reviews are done properly, the outcome is a list of ideas, objections, concerns, and issues identified by the collective intellect of the review panel. The reviewee then e-mails that list around to all attendees and project management, along with plans for addressing the issues and tentative schedules for resolving all open items. The overall project managers incorporate these issues and schedules into the overall project schedule and begin tracking the new open items until they are resolved.
At the review’s outset, the team should designate a scribe to capture the ideas, suggestions, proposals, and any questions the review team asks. These will range from observations to pointed queries to suggestions for improvements or further study. These ideas and the interchanges between the presenter and the reviewers are the very essence of the review. The job of distilling this important but sometimes unruly stream of information is critical. The project should already have a central Web site accessible to all project engineers. After the meeting, the scribe should not only distribute minutes to the review participants, but also archive them onto the project Web site. Engineers who could not attend the review can then check its conclusions and spot surprises or inconsistencies.
The scribe should strive to be a neutral observer as much as possible. He or she will probably have to fight the urge to put a personal spin on what was said (and especially why it might have been said). Some value judgments are inevitable, of course, and many such judgments will turn out to be extremely useful. But one of the reasons the scribe should circulate the meeting minutes is precisely so that other attendees can check them for objectivity and completeness. These other attendees will want to check them, because they may well find their own name on the list of follow-on work to be done.
The scribe cannot be the presenter. Try that, and you will not get either job right. The presenter is utterly engaged in making sure the reviewers are getting the technical information they need to do an adequate
job of covering the design. Presenting the design itself to people who are trying to poke holes in it is an intellectual exercise not for the faint of heart; it causes the presenter to make constant mental leaps from “This is how it works” to “This is why it doesn’t work some other way” to “My intuition says your wild idea won’t work, but I can’t fully articulate why at the moment,” and so on. The presenter is the one person at a design review who is 100% occupied and cannot take on any other roles.
Many design review meeting notes are appropriate for wider distribution than just to the attendees. It is not uncommon, for example, for the reviewers to suddenly realize that an issue is of far wider scope than just the unit under review. The reviewee must ensure that such issues reach the proper forum for global resolution.
How Not to Do a Review. A design review can also be counterproductive, whether because of the reviewee’s defensive attitude or management’s lack of monitoring. Since I have already established that people often learn from their mistakes, here are a few ways to get a first-class education:
Have everyone deliberate on the world’s greatest solution to a minor issue. Reviews are not a great place for group thinking. Engineers like living in the solution space; give them a problem, and they will smile as they do an irreversible swan dive into that space. Multiply that by 30 people in a design review setting and you have an enjoyable but useless afternoon. It is more important that the entire unit be reviewed to at least an acceptable level of thoroughness than to hammer one item flat and miss several others entirely.
Measure the design’s success inversely to the number of issues identified. In this scenario, the reviewee exits the review gloating, “They didn’t lay a glove on me,” and actually believes the review’s success is inversely proportional to the length of the issues list identified in the review. That person should immediately be fired for incompetence and eminent stupidity for having wasted a lot of very busy people’s time and gotten nothing for it. My experience is that the people whose designs are so outstanding that they have little need for design reviews are the same ones who least mind having their work checked by their peers. I do not think that is a coincidence.