The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners)
SUCCESS FACTORS
When the realization phase begins, the team has settled on one direction and has made some progress toward the RTL model that describes the actual product. The classical output of the realization phase is a prototype, a single instance of the product being developed. For silicon chips, the production engineers actually make several wafers’ worth of chips on that first run, but the principle is the same: build a few, test them, fix what’s broken, and when the time is right, move the project to the production phase (see next chapter).
The realization phase puts the pedal to the floor and designs a real product that conforms to the ideas from the earlier project phases. Whereas errors in the earlier phases could impact schedule, an error in the realization phase may directly affect hundreds of design engineers who are working concurrently. Assiduous project management and communication are key to the realization phase.
Balanced Decision Making
As a design proceeds, design engineers make dozens of decisions every day about their unit’s implementation. They strive to balance their decisions in such a way that they meet or exceed every major project goal (performance, power dissipation, feature set, die size) within the schedule allotted. The very essence of engineering is the art of compromise, of trading off one thing for another. In that sense, any decision by any design engineer might impact performance.
At the beginning of the P6 project, we thought the relevant architect should bless all performance-related engineering decisions, but this requirement, though ideal in theory, quickly became impractical. There were simply too many such decisions and not enough architects to implement them and stay on schedule. We eventually established a rule, proposed by Sunil Shenoy: If you (the design engineer) believe that the performance impact of the choice you are considering is less than 1% on the designated benchmark regression suite, you are free to make the choice on your own. For anything higher than a 1% performance hit, you must include the architects in the decision.
This 1% rule generally worked well and certainly salvaged the overall project schedule. On the downside, the model ended up absorbing quite a few < 1% performance hits. Dave Papworth likened this to mice nibbling at a block of cheese: Mice do not eat much per bite, but over time the cheese keeps getting smaller.
The performance hits were usually independent of one another, and their combination at the system performance level was typically benign. But occasionally, the combination would be malignant, noticeably dragging down overall performance and drawing the ire of architects and performance analysts. Over time, the cumulative effect of these minor performance losses would cause aggregate system performance to sag imperceptibly lower almost daily until performance engineers would become alarmed enough to notify project management that performance projections were outside the targeted range. We would then essentially stop the project for a week, intensively study the sources of the performance loss, repair those areas, and reanalyze. We repeated the process until performance was back where we expected it to be.
Why then was the 1% rule important or even desirable? The simple answer is time. There weren’t enough architect-hours to try to oversee every design decision that might affect performance. Picture a team of 200 designers, each making 10 decisions a day that might affect performance: that is 2,000 decisions a day. We used the 1% rule, not because it was perfect, but because the alternative (utter chaos) was unworkable.
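The mice-and-cheese effect can be made concrete with a little arithmetic. The following is a minimal sketch, not anything taken from the P6 project: it assumes the individual hits are independent and combine multiplicatively, and the hit sizes and thresholds are hypothetical.

```python
def remaining_performance(hit_fractions):
    """Fraction of the original performance left after a series of hits.

    Assumes the hits are independent and compound multiplicatively,
    which is a simplification of how real performance losses combine.
    """
    remaining = 1.0
    for hit in hit_fractions:
        remaining *= (1.0 - hit)
    return remaining


def hits_until_below(hit, threshold):
    """Number of equal-sized hits needed to push performance below threshold."""
    count = 0
    remaining = 1.0
    while remaining >= threshold:
        remaining *= (1.0 - hit)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical scenario: twenty separate 0.5% hits, each of which
    # passes the 1% rule on its own.
    hits = [0.005] * 20
    print(f"performance remaining: {remaining_performance(hits):.3f}")
    # How many 0.5% nibbles before 5% of the cheese is gone?
    print(f"hits to fall below 95%: {hits_until_below(0.005, 0.95)}")
```

Under these assumptions, twenty nibbles of half a percent each leave roughly 90% of the original performance, even though no single decision ever required an architect's sign-off.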
Documentation and Communication
We architects were creatively relentless in our attempts to transfer information. The first document written as a general introduction to the P6 was “The Kinder Gentler Introduction to the P6,” an internal white paper intended to convey the general philosophy of the P6 microarchitecture. Next was “The P6 Microarchitecture Specification,” or MAS, the first of what became a large set of documents detailing the operation of every unit on the chip, including new features, basic operation and pipelining, and x86 corner cases that our proposed design might affect.
Capturing Unit Decisions. Once the realization phase is underway, each unit group must begin to record their plans and decisions. For the P6 project, these records were the microarchitecture specifications. Each MAS described how the unit would execute the behaviors outlined in the unit’s behavioral specification. Each unit group maintained their own MAS, which grew with the design, and distributed it to all other units. All MASs were written to a common template, so designers from one unit could easily review the MAS from another unit.
Each MAS included
• Pipeline and block diagrams
• Textual description of the theory of operation
• Unit inputs and outputs and protocols governing data transfers
• Corner cases of the design that were especially tricky
• New circuits required for implementation
• Notes on testing and validation, which we required so that design engineers could think about such things during the design, when it is easiest to address them
MASs were begun early in the realization phase, well before all the major design decisions had been collectively rendered. This timing was purposeful: the act of writing this documentation helped identify conceptual holes in the project. We began each MAS as early as we could, but only when we knew enough about the design that it was not likely we would have to tear the MAS document up and start over.
Integrating Architects and Design Engineers. Perhaps P6’s strongest communication mechanism was not a videotape or document at all, but the architects’ participation in the RTL model’s debugging. Some engineering teams have the philosophy that the architect’s job is finished when the concept phase documentation is complete. That is, once the project’s basic direction is set, the architects are free to leave and start another project. This idea is reminiscent of the pipelining concept that underlies all modern CPU designs. It works brilliantly there, so why not apply it to people too?
I am sorry, but pipelining people, especially architects, is a monumentally bad idea. The architects conceived the machine’s organization and feature sets and invented or borrowed the names for the various units and functions. They know how the machine is supposed to work at the deepest possible level of understanding, and in a way that other engineers cannot duplicate later, no matter how smart or experienced they are. The software industry is famous for inadvertently introducing one bug while fixing another one. Exactly the same malady will strike a chip development. It is all too easy to forget some subtle ramification of a basic microarchitecture canon and design yourself into a corner that will remain shrouded in mystery until chip validation, when the only feasible cures are painful to implement.
The architects are the major corporate repository of a critical knowledge category. Every design decision reflected in chip implementation is the result of considering and rejecting several alternative paths, some of which might have been nearly as good as the one chosen and some of which might have been obviously unworkable. The point is that the alternatives probably looked very appealing until the architect realized some subtle, insidious reason that the particular choice would have been disastrous later on.
If no one retains that crucial information, future proliferation efforts will suffer. Downstream design teams will have to change the design in some way, and if they stumble across one of these Venus flytrap alternatives, their product may be delayed or worse. The original architects are the best ones to tend those exotic plants and instruct others in their care and feeding.
Another reason not to pipeline architects is that architects, like all engineers, must use what they have created to solidly inform their own intuitions about which of their design ideas worked and how well, and about which ideas turned out not to be worth the trouble. Pipelining the architects is equivalent to sending them driving down a highway with their eyes closed. They may steer straight for a short time, but without corrective feedback, they will soon exit the highway in some fashion that will not do anyone any good.
PERFORMANCE AND FEATURE TRADEOFFS
P6 was an x86 design from the heyday of x86 designs. High performance was the goal, and the reason our chip would command high selling prices. But performance was not the only goal; tradeoffs were necessary. Making sure that the design team had the information needed to make these tradeoffs consistently and in an organized fashion was both a technical and a communication problem.
(Over-) Optimizing Performance
In the preface to this book, I used the analogy of a sailing ship to describe some aspects of chip design. The project’s realization phase also fits with the ship metaphor, except that this ship is attempting to navigate its cargo through tricky waters. While most of the vessel’s crew go about their daily tasks of keeping the engines running and the ship clean, maintaining supplies, and doing most of the navigation, the captain must keep an eye on the overall voyage. Each crew member’s individual task is difficult and pressure-filled, and the human tendency is for each to become accustomed to thinking of their task as something of an end in itself. It is up to the people on the bridge to constantly remind themselves and everyone else that they did not set out on this journey just to cruise around. Success will ultimately be measured by whether the cargo gets to the right port at the right time.
I will not presume to tell you how to run your particular ship, but I will point out some of the hazards along the shipping channel.
Perfect A; Mediocre B, C, and D. In this type of overoptimizing, the architect perfects one design aspect to the near exclusion of the others. This shortchanges product quality because architects working on idea A are, therefore, not working on ideas B, C, and D, and as the project lengthens, the odds of including B, C, and D go down. And idea A often has an Amdahl’s Law ceiling that is easily overlooked in the heat of battle: Idea A may have been conceived as a solution to a pressing, specific performance problem, but any single idea may help that problem only so far, and to improve it further would require much more sweeping changes to the design, thus incurring further development costs and project risks. One must not become so myopically fixated on one project goal that other goals are neglected. B, C, and D will not be achieved as a by-product of achieving A.
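The Amdahl's Law ceiling on idea A can be stated numerically. This is a small sketch with hypothetical numbers: it assumes idea A speeds up only the fraction of execution time it addresses and leaves everything else untouched. The function names are illustrative.

```python
def amdahl_speedup(fraction_improved, local_speedup):
    """Overall speedup when only part of the execution time is accelerated.

    fraction_improved: share of execution time that idea A touches (0..1).
    local_speedup: how much faster that share becomes (> 0).
    """
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


def amdahl_ceiling(fraction_improved):
    """Best possible overall speedup, even if idea A were made infinitely fast."""
    if fraction_improved >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_improved)


if __name__ == "__main__":
    # Hypothetical: idea A addresses 20% of execution time.
    for local in (2.0, 10.0, 1000.0):
        print(f"{local:>7.0f}x on A -> {amdahl_speedup(0.2, local):.3f}x overall")
    print(f"ceiling: {amdahl_ceiling(0.2):.2f}x")
```

With these assumed numbers, doubling the speed of A yields about an 11% overall gain, and no amount of further polishing on A can exceed 1.25x overall, which is why the effort is often better spent on B, C, and D.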
The Technical Purity Trap. A common tendency, especially among inexperienced engineers, is to approach a development project as a sequence of isolated technical challenges. Rookies sometimes think the goal is to solve each challenge in succession, and that once the last problem has been surmounted, the project will have ended successfully. Experienced engineers know better. Subtleties abound, circumstances change, buyers’ needs change, technical surprises arise that require design compromises, and schedule pressure only gets worse. Truly great designs are not simply those that post the highest performance scores, regardless of the costs. Great designs are those for which the engineers had a clear vision of their priorities and could make intelligent, informed compromises along the way.
Why do engineers tend to focus on performance to the exclusion of other factors? To answer this, consider the responses of two engineers, one new and one experienced, to the question, “Which is a better car, a Mercedes S-class sedan or a Ford Taurus?” The new engineer will eagerly compare the technical details of each and easily reach the conclusion that Mercedes is by far the superior vehicle. The experienced engineer would then ask, “If the S-class sedan is so superior, why does Ford sell 15 Taurus models for every S-class car that Mercedes sells?” because she knows the comparison involves more than simply sorting horsepower ratings. Cost, in particular, is key. If these cars cost the same, the sales ratio would likely be considerably different.
The lesson is to follow the money. If Ford engineers forget why people are buying the Taurus, they may err in designing new cars, pricing their vehicles so high that the buyer either can’t afford the car or realizes that at that price, he can shop at quite a few more car manufacturers. Likewise, if Mercedes designers lose track of why people buy S-class luxury cars they will alienate their customer base. Suffice it to say, rich people do not like to waste money any more than the less fortunate.
This lesson translates well to computers. Designing the “fastest computer in the world” is a great deal of fun for the designers, but it is an engineering joyride reserved for very few. The rest of us must design machines that accomplish their tasks within first-order economic constraints.
In an insidious way, microprocessor vendors who succumb to the allure of trying to build the fastest computer will win in the near term, but they will lose in the long run, a decade or more down the road. The reason is simple: Only a small market (at most a couple of million units a year) will pay large premiums to keep a niche vendor afloat. A user base that small cannot support the design costs of world-class microprocessors, not to mention the cost of state-of-the-art IC processing plants (fabrication plants, or fabs). When that vendor is inexorably driven out of business by these extraordinarily high costs, the mainstream, cost-constrained vendor is still there. And with one or two more turns of the Moore’s Law wheel, that mainstream vendor inherits the mantle of “world’s fastest” without having even tried for it.
Voltaire is often credited with the saying, “The Best is the enemy of Good,” which means that myopic striving toward an unreachable perfection may, in fact, yield a worse final result than accepting compromises on the way to a successful product. Everyone wants their product to be the best; achieving that is great for both your career and your bank account. The trap is that taking what appears to be the shortest path to that goal (technical excellence to the exclusion of all else) can easily prevent you from reaching it. In plainer terms, if you do not make money at this game, you do not get to keep playing.
The Unbreakable Computer
Several of us P6 architects were interested in designing a computer that would never crash. Anyone who has experienced the utter frustration of a particularly inopportune system crash will empathize. As computer systems visions go, this one is killer. No matter what happens, no matter what breaks in the hardware or the software, the machine can slow down, but it can’t stop working. That would be nice, wouldn’t it?
That hardware designers have no control over the operating system or the applications should have shown us an upper bound on system stability that fell far short of our “unbreakable” vision. But even things we could control stubbornly refused to configure into anything approaching unbreakability. Permit me a Bill Nye moment (you know, the science guy with the great jingle): Electrically speaking, we live in a noisy, hostile universe. Electromagnetic waves of all frequencies and amplitudes are constantly bombarding people and computing equipment. Very energetic charged particles from the Big Bang or cosmic events collide with atoms in the atmosphere to generate streams of high-energy neutrons, some of which end up smashing into the silicon of microprocessors and generating unexpected electrical currents. Temperatures and power supplies fluctuate. Internal electrical currents generate capacitive and inductive sympathetic currents in adjacent wires. The universe really does conspire against us.
On the basis of the statistics we observe from these and other events, we design recovery mechanisms into our microprocessors. If one of these unfortunate events occurs, the machine can detect the anomaly and correct the resulting error before it can propagate and cause erroneous data to enter the computation stream.
Error detection and correction schemes have their dark side, however. They impose an overhead in performance and complexity and a real cost in die size. Worse, although they help make the machine more reliable, they are not foolproof. For example, if an error-correcting code is applied across a section of memory, then, typically, a single-bit error will be correctable, but if two bits are defective, our scheme will note that fact but be unable to correct it. And if more than two bits are erroneous, our scheme may not even notice that any of them are wrong.
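The three behaviors just described (correct one bit, detect two, possibly miss three) can be seen in a toy example. The sketch below assumes an extended Hamming (8,4) code, chosen only because it is small enough to read; it is not a description of the code used in any particular processor or memory system.

```python
def encode(data_bits):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds an overall parity bit; indices 1..7 form Hamming(7,4)
    with check bits at positions 1, 2, and 4.
    """
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data_bits
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2  # makes the total number of 1s even
    return word


def decode(word):
    """Return (status, data_bits); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    odd_parity = sum(word) % 2 == 1  # an odd number of bits were flipped
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Treated as a single-bit error at position `syndrome`
        # (syndrome 0 means the overall parity bit itself).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        status = "uncorrectable"     # even flip count, nonzero syndrome
    return status, [word[3], word[5], word[6], word[7]]


def flip(word, positions):
    """Return a copy of word with the given bit positions inverted."""
    damaged = list(word)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    codeword = encode(data)
    print(decode(flip(codeword, [5])))        # one flipped bit
    print(decode(flip(codeword, [2, 6])))     # two flipped bits
    print(decode(flip(codeword, [1, 2, 3])))  # three flipped bits
```

In this toy code, the three-bit case is the troubling one: the decoder reports a successful correction while handing back data that differs from what was stored.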
The most stringent constraint on the ability to design an unbreakable engine, though, is that while the “state space” of a correctly functioning microprocessor is enormous, the possibility space of a malfunctioning machine is many orders of magnitude larger. Basically, unless you are designing an extremely simple machine, you cannot practically anticipate every way in which the machine might fail, which is what you need for a detection and recovery scheme. Moreover, even if you could somehow catalog and sandbag every single-event failure, it still would not be good enough. Failures can and do occur in pairs, or triples, or n-tuples.
Perhaps the day will come when a very different approach to this problem will present some affordable solutions, but today the best we can do is buttress the machine against its clearest threats and test it extensively and as exhaustively as human resources will permit.
Machine reliability raises some interesting philosophical issues, though. If an error is detected, should the machine attempt to roll back to a previous saved (presumably correct) state and restart from there, hoping that this time the error will not manifest? Many databases have this capability. Indeed, the Pentium 4 has an equivalent, in that forward progress of the engine is self-monitored, and if too much time has elapsed since forward progress was last detected, a watchdog timer will flush the machine and restart it from a known-good spot.
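The forward-progress idea can be illustrated with a toy model. This is only a sketch of the general watchdog pattern, assuming a simple per-cycle "made progress" signal; the class name, the cycle limit, and the restart counter are hypothetical and do not describe the Pentium 4's actual mechanism.

```python
class ForwardProgressWatchdog:
    """Fires when too many consecutive cycles pass without forward progress."""

    def __init__(self, limit_cycles):
        if limit_cycles < 1:
            raise ValueError("limit_cycles must be at least 1")
        self.limit_cycles = limit_cycles
        self.idle_cycles = 0   # cycles since progress was last observed
        self.restarts = 0      # how many times the watchdog has fired

    def tick(self, made_progress):
        """Advance one cycle; return True if a flush-and-restart is triggered."""
        if made_progress:
            self.idle_cycles = 0
            return False
        self.idle_cycles += 1
        if self.idle_cycles >= self.limit_cycles:
            # In this model, "restart" just resets the counter; a real
            # machine would flush state and resume from a known-good point.
            self.idle_cycles = 0
            self.restarts += 1
            return True
        return False


if __name__ == "__main__":
    watchdog = ForwardProgressWatchdog(limit_cycles=3)
    progress_trace = [True, False, False, False, True, False]
    for cycle, progressed in enumerate(progress_trace):
        if watchdog.tick(progressed):
            print(f"cycle {cycle}: no progress for 3 cycles, restarting")
    print(f"restarts: {watchdog.restarts}")
```

The design choice worth noting is that the watchdog needs no knowledge of why progress stopped; it only observes that it did, which is what makes it useful against failures nobody anticipated.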