The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners)

by Robert P. Colwell


  This strategy has the same flaw as “hire only geniuses.” Smart validators will take risks to ensure product quality and, being human, they will make mistakes. In the end, a validation team motivated by the final product quality is much more productive than a team simply trying to avoid the boss’s tongue-lashing.

  Avoid/Find/Survive

  I believe that nature has a set of immutable laws wired directly into its fabric, and engineers must observe those laws or suffer the consequences. One of these laws is that no matter how assiduously we work, no matter how motivated and inspired our validation effort, design errors will appear in the final product. This suggests three things:

  1. Design errors will appear in a design by default, and we must strive mightily to prevent them.

  2. Some errata will elude us and we must find them in the design before they “get out the door.”

  3. Some errata will hide so well that they will make it into production and be found by a paying customer.

  If part of a system architecture has a high probability of incurring errors, design and production processes must take that into account. When NASA sends a deep-space probe that can transmit with only a few watts of power, it knows that digging the signal out of the noise will be a challenge, and the radio protocols it adopts reflect that error tolerance. Likewise, CD-ROMs and DVDs, as well as hard drives, incorporate extensive error detection and recovery, because in any storage media (including the brain) errors will naturally arise during access.

  Similarly, project managers have to know that mistakes will end up in their design and take appropriate measures before, during, and after they have manifested. A defense-in-depth strategy is the best approach to design flaws I have found: avoid, find, survive.

  Design to Avoid Bugs. Design engineers must constantly juggle many conflicting demands: schedule, performance, power dissipation, features, testing, documentation, training, and hiring. They may intuitively know that if they spent the next two weeks simply thinking long and hard about their design, they would produce a far better design with fewer errors than if they had spent the two weeks coding. But their reality is a project manager who must ensure that the project stays on track and on schedule and whose favorite way to do that is to measure everything. It is hard to quantify the project benefits of meditation, but a manager can tally how many lines of RTL the designer could have generated had she not spent two weeks simply thinking. This translates into subtle pressure to favor immediate schedule deadlines now, and if more design errors creep in as a result, then so be it. Someone can attend to those later. For now, the project is measurably on schedule and tomorrow may never come.

  Management must undo this mindset by emphasizing how much less expensive it is to get a design right in the first place than to have to create a test, debug it when it fails, and then fix the design without breaking something else in the process. Managers can also help by being sensitive to the huge array of issues designers face. Balancing schedule pressure is tricky. Too little, and the project might slip irreparably. Too much, and the pending trouble moves off the schedule, where it is visible, and into some other area, where it becomes design errata, eventually becomes visible, and wreaks havoc on the project schedule anyway.

  When I think about finding this balance, I look at what scientists have found in dealing with natural phenomena. The point of an oil pipeline’s maximal throughput is the amount and velocity of oil that flows fast, but smoothly, just before the point where a little more liquid in the pipe would cause turbulence to occur. If that turbulence does occur, the overall flow does not just diminish slightly; it falls a lot.

  Likewise, the efficiency of an airfoil (as well as the generated lift) depends critically on its angle of attack. And the point of highest efficiency is that angle just before the onset of turbulence. A little beyond that, turbulence occurs, and lift falls drastically. (When this happens to one wing of an airplane but not the other, it can cause a flat spin from which recovery is very difficult.) As with the oil pipeline, the optimal policy is probably to carefully edge the project up to the onset of turbulence, where efficiency is at its peak, and then back off a little, on the grounds that going any further gains only a little and risks a lot.

  Astute project leaders such as Randy Steck were particularly adept at finding this balance point, especially when accumulated project slippage was pointing toward a formal schedule slip. Few tasks are more unpleasant than having to officially request your boss’s approval on a schedule slip. And the boss likes it that way; part of his effectiveness is to motivate you to do everything in your power to avoid this eventuality. But in that exact sense, a project manager who pushes back hard on her subordinates’ requests for more time must still be sensitive to the possibility that unless more time is granted, the project’s quality will slip below acceptable levels.

  Architects have a first-order impact on design errata because they are the ones who imbued the design with its intrinsic complexity. They have ameliorated (or exacerbated) this complexity by clearly (or not clearly) communicating their design to the engineers who are reducing it to practice. Architects write the checks that the design engineers have to cash. If the amount is too high, the whole project goes bankrupt.

  One day during the P6 project, the three of us who had already designed substantial machines (Dave Papworth, Glenn Hinton, and I) were comparing notes on where design errata tended to appear. We realized that even though the three of us had worked on far different machines, the errata had followed similar patterns. Datapath design tended to just work, for example, probably because datapath patterns repeat everywhere and so lend themselves to a mechanical treatment at a higher abstraction level. If I refer to a bus as Utility_Bus[31:0], I don’t have to tell you what hardware description language I am using. You know immediately how wide the bus is and that no bits of that bus have inadvertently been left out.

  Control paths are where the system complexity lives. Bugs spawned from control path design errors reside in the microcode flows, the finite-state machines, and all the special exceptions that inevitably spring up in a machine design like thistles in a flower garden. Insidiously, the most complex bugs, which therefore have a higher likelihood of remaining undetected until they inflict real damage, live mostly “between” finite-state machines. Thus, anyone studying an isolated finite-state machine will likely see a clean, self-consistent design. Only analysts well versed in studying several finite-state machines operating simultaneously have any chance of noticing a joint malfunction. And even then it would be an intellectual feat of the first order.
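
  The idea of a bug living “between” finite-state machines can be illustrated with a toy sketch. Everything here is hypothetical and far simpler than any real control path: two identical three-state machines (the state names and the `step` rule are invented for illustration) each look fine in isolation, because every state can be exited for some peer input. Stepped together, they reach a joint state that neither can leave.

```python
STATES = ("IDLE", "WAIT", "DONE")

def step(own, peer):
    """Next state of one machine, given its own state and its peer's state."""
    if own == "IDLE":
        return "WAIT"
    if own == "WAIT":
        # Proceeds only once the peer has finished.
        return "DONE" if peer == "DONE" else "WAIT"
    return "IDLE"  # DONE returns to IDLE

def exits_every_state():
    """Isolated check: each state can be left for at least one peer input."""
    return all(any(step(s, p) != s for p in STATES) for s in STATES)

def joint_stuck_states(start=("IDLE", "IDLE")):
    """Explore the reachable joint states and return those that never change."""
    seen, frontier, stuck = {start}, [start], []
    while frontier:
        a, b = frontier.pop()
        nxt = (step(a, b), step(b, a))  # both machines step on the same clock
        if nxt == (a, b):
            stuck.append((a, b))
        elif nxt not in seen:
            seen.add(nxt)
            frontier.append(nxt)
    return stuck

print(exits_every_state())   # the single machine looks clean
print(joint_stuck_states())  # the pair hangs with both machines in WAIT
```

  The isolated check passes, yet the joint walk finds both machines waiting on each other. Exhaustive exploration is feasible only because this toy has nine joint states; it is not a claim about how real machines of this kind were analyzed.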

  In light of our discovery, we surmised that by careful architecting, we might be able to rule out a whole class of potential design errata. We began looking for ways to simplify the P6 core’s exception handling. We noticed that we could implement many of these exceptions on top of the branch misprediction mechanism, which was complicated itself, but so intrinsic to the machine’s basic operation that it got a huge amount of exercise and testing.

  Our strategy worked, and for the first time in our collective experience, we ended up with a machine that had essentially no important errata associated with traps, faults, and breakpoints. We also found that ruling out a class of design errata this way was by far the most cost-effective strategy for realizing a high-quality machine.

  This strategy also makes sense in light of the design’s complexity. Complexity breeds design errata like stagnant pond water breeds mosquitoes. Some bugs on the original P6 chip, for example, required many clock cycles and the complex interactions of six major functional units to manifest. In such cases, it is not reasonable to blame any of the six functional unit designers, and it will probably be unavailing to ask why validation did not catch the error. Such bugs lie squarely with the architects, who need to think through every corner case and make sure their basic design precludes such problems. We should have applied our good idea everywhere in the design, not just to those items that were already on our worry list.

  When Bugs Get In Anyway, Find Them Before Production. No amount of management attention to presilicon testing and no degree of designer diligence and dedication will avoid all mistakes. Design and validation errors are inevitable (even if you still harbor some hope that perhaps there might be a way to avoid them, however theoretical, it’s still better to plan and execute your project as if there weren’t). It falls to the validation crew to find these mistakes before production and to work with the designers to fix them without breaking anything else.

  Validation teams have the same kinds of crushing pressure as the design teams, and then some. Validators must perform their task without noticeably changing the tapeout schedule, but until the RTL model is reasonably mature, they are limited to trivial tests. The design team typically has several months’ head start, but from that time to tapeout, validation is supposed to identify every design flaw and verify that their remedies are correct.

  By the very nature of validation, that expectation is doomed to nonfulfillment. The validation plan may be thorough, but it is of necessity incomplete. How can validation test everything with finite time and finite simulation cycles? Given the combinatorial-state explosion that today’s complex machines imply, they cannot even get close.

  The testing team also learns as it goes along. If a chip unit is behaving in stellar fashion and yielding almost no bugs, while another unit is behaving very badly across a range of tests, the validation team will shift resources to the flaky unit. Some handwritten tests may be finding no bugs, while others find one after another. The testers will do the obvious: Extend the use of the efficacious test, and if they are really on the ball, at least inspect the no-bugs-found test to see if the test itself could be defective. (Yes, it happens. I have personally written at least one validation test that saw routine use for quite some time before I noticed that it would indicate success no matter how broken the machine it was supposedly testing actually was. This kind of revelation severely taxes your confidence in the overall project, not to mention in your own skills!)

  IDENTIFYING BUGS. Design errata, bugs, errors, mistakes; whatever you call them, they are exasperatingly elusive because they can come from just about anywhere. Some are caused by miscommunication: inaccuracies in documentation, design changes that were not fully disseminated, and misunderstandings about when mutually dependent design changes were to be made. Others are caused by the designer’s muddy thinking, often exacerbated by short schedules or management pressure. Sometimes, a designer simply did not think through all circumstances under which his design must correctly operate or did not identify what actually constituted correct operation in every possible situation. And there are always the “oops” bugs that engineers and nonengineers alike instantly recognize: “I have no idea why I did that. What was I thinking?”

  I do not know any comprehensive, useful theory that consistently helps identify any type of bug, but I can offer some rules of thumb.

  First, be careful how you measure bugs if you want to know about them. Well-run projects are tracked by data the project managers collect. We managers have a good feel for when the project will tape out, its expected performance, and for how much power the chip will dissipate, because we collect a lot of data on those items and track their trends carefully each week.

  But the measure-and-extrapolate managerial instinct can backfire when applied to design errata. Many bugs are found close to home by the designers themselves. If designers sense even the tiniest amount of pressure to minimize design errata or to reward those with fewer bugs, they will instantly and directly translate that into “Quietly fix the bug and don’t tell anyone else about it.” And you will not know they did it. In fact, for a while it will seem as though the project is doing better than before, with less effort going to fix errata.

  The same concern applies to validation, in which test coverage is every bit as important as sheer numbers of bugs detected. I will revisit this idea later in the chapter (see “Managing Validation” on page 65).

  Second, look closely at the microcode. Although microcode seems to have disproportionately more bugs than anything else in the chip, if you understand the x86 and the design process, that outcome is understandable. The x86 is an extremely complex instruction-set architecture, and most of that complexity is embedded in the microcode. You might think that a company that has successfully implemented x86 chips for close to 30 years would have “solved” the microcode problem long ago, but they haven’t, because no such solution exists. Every time the microarchitecture must be changed, so must the microcode, and all fundamental changes to either will expose new areas in the microcode, for which the past is a poor guide to correctness.

  Microcode also tends to be buggy because, unlike hardware, it tends to remain changeable right up to a few weeks before tapeout. Late in project development, if a significant bug is found in a functional unit, the unit’s owner will often ask to make a small change to the microcode that would ameliorate the conditions under which this particular bug would manifest. And very often, such a change is indeed possible and does what the functional unit’s owner wanted. The unit owner is now happy, but as he walks away smiling, another unit’s owner comes in looking worried and contrite, and the process repeats. A project’s last few months can see a lot of such changes, and taken together they can make a hash out of previously respectable-looking microcode source. These microcode fixes are also uniquely susceptible to the late-change syndrome, common to all engineering endeavors: If you make a change to an engineering design late in the project, that change is at much higher risk of bugginess than other design aspects. The team is tired, the pressure is high, and there is not enough time left to redo the past two or three years of testing and to incorporate the effects of this new change on the design. Although unrelated to its microcode, the infamous Pentium FDIV flaw was due to exactly such a late change to an existing design.

  TRACKING BUGS. At first glance, it seems pretty clear what validation needs to do: Create a comprehensive list of tests such that a model that passes them all is considered to be of production quality. Then test the RTL models, identify the tests that do not pass, find out why, get the designers to fix the errors, and repeat until finished.

  In the real world, however, the validator’s life is often messy. The tools have bugs, the tests can be faulty, and because the RTL model under test is not a stationary target, tests that passed last week might fail this week. One design bug might prevent several tests from passing, and one of those tests might have found a quite different bug. Or a validation test might have assumed correct functioning was of one form, but the design engineer might have assumed something different. Is that a design bug? You cannot tell because you need more information.

  Also, a validator chasing one specific test commonly stumbles across something else quite by accident. Said validator would be within her rights to ignore New Phenomenon B, on the grounds that she should not allow herself to get distracted while trying to nail Presumed Bug A. But experienced designers and validators have learned that if something serendipitously pops up during testing with “I’m probably a bug” written on its forehead, you have two choices. You could let it go back into hiding or deal with it now. If you let it go, the odds are extremely high that it will come back and bite you later. Given that you will almost certainly have to deal with it sometime, it is best to capture what you can about it now so that you can reproduce it later when you can focus on it. Then go back to chasing Presumed Bug A.

  We dubbed any incident found during testing that could have a bug as its root a “sighting,” and we learned to be very dogmatic about these incidents from our experience in other design projects. The rule was that anyone had to report a sighting to a general database, along with the specific RTL model, the test that generated the sighting, the general error syndrome, and any information a validation engineer might need to try to reproduce the sighting. See Figures 3.1 and 3.2.

  A validation engineer, typically but not always the one who filed the sighting, would then attempt to reproduce it and find its root cause. Were other bugs recently found that might explain this sighting? Have any tools or test deficiencies been found that might be relevant? After checking the easiest answers first, the validator would then begin zeroing in on exactly what wasn’t going according to plan and what might be causing the problem. In many cases, he could collect evidence in a few hours that a real bug in the design was the culprit, and would then officially change the issue’s status from “sighting” to “bug.”

  Once an issue attained the status of bug, it was assigned to the most appropriate design engineer: whenever possible, the person who had put it into the design in the first place. This was not a punishment. Rather, the idea was to let the person who owned that part of the design be the person who fixed it and, thus, minimize the odds of the fix-one-break-two problem occurring (an all-too-common occurrence, in which the fixer introduces two new bugs while attempting to fix the first one).

  A few hours or days later, depending on the bug’s severity and the designer’s workload, the design engineer would come up with a fix and was expected to build and check a model that embodied it. This sanity checking had a three-pronged goal: (1) determine that the bug was really gone, (2) establish that nothing had broken as a result of the fix, and (3) ensure that the previous hole in the designer’s unit tests was now filled so that this bug could never come back. Once the design engineer’s model passed this sanity check, he or she could mark the official issue status as “fix pending.”
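
  The issue lifecycle described so far (sighting, then bug, then fix pending) can be sketched as a small record with guarded status changes. This is an illustrative assumption, not the P6 team’s actual database: the class name, field names, and method names are all hypothetical, and only the three statuses and the three-part sanity check come from the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Issue:
    # Information a sighting report had to carry, per the text.
    rtl_model: str
    test: str
    syndrome: str
    repro_notes: str
    status: str = "sighting"
    owner: Optional[str] = None

    def promote_to_bug(self, owner: str) -> None:
        """A validator found evidence of a real design bug; assign an owner."""
        if self.status != "sighting":
            raise ValueError("only a sighting can be promoted to a bug")
        self.status = "bug"
        self.owner = owner

    def mark_fix_pending(self, bug_gone: bool, nothing_broken: bool,
                         test_hole_filled: bool) -> None:
        """All three prongs of the sanity check must hold before the change."""
        if self.status != "bug":
            raise ValueError("only a bug can move to fix pending")
        if not (bug_gone and nothing_broken and test_hole_filled):
            raise ValueError("sanity check incomplete")
        self.status = "fix pending"

# Hypothetical usage with invented values.
issue = Issue("model_0412", "fp_stress_17", "wrong flags after trap",
              "seed 99, fails at cycle 18,204")
issue.promote_to_bug(owner="unit designer")
issue.mark_fix_pending(bug_gone=True, nothing_broken=True, test_hole_filled=True)
print(issue.status)
```

  Refusing the status change unless all three checks hold mirrors the stated goal that a fixed bug should never be able to come back unnoticed.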

 
