Meanwhile, the marketing team is preparing the customers so that they will be ready when early production units arrive. These customers are preparing their own systems around this new chip, so they often have questions, suggestions, and concerns that require technical expertise to resolve. Technical documents must be updated and distributed. Collateral such as tools, performance models, and reference designs must be tuned to represent the chip’s final production version.
Although not new to the production phase, the job of properly managing the sightings, bugs, and engineering change orders (ECOs) takes on a new urgency and importance. If the technologists who were judging bug dispositions during earlier project phases made an error, the project might suffer a schedule hit as the bug was later rediscovered and redisposed of. But in the production phase, bugs that are not properly handled have a high likelihood of escaping into the wild, meaning into a customer’s system, raising the ever-present specter of an FDIV-style mass recall. (See the section, “Was the P6 project affected by the Pentium’s floating point divider bug?” in Chapter 7.)
The production team’s responsibility then is to finish the job the design team has begun. In much too short a time, they must polish a raw, brand-new design into a form suitable for the safe shipment of tens of millions of copies. Tension is constant and comes from all sides: Management screams about schedules and product cost; validation constantly lectures on the dangers of another FDIV; marketing and field sales remind you that your project is late and that only an inspired performance by them will stave off the unbelievably strong competition that you have ineptly allowed to flourish; and the fab plant managers remind everyone of how many millions of dollars a day are lost if the chip isn’t ready on time and their plants run out of things to make. The last source of tension is not a minor concern. Modern silicon fab plants cost almost as much to leave idle as they do to run at full production. Nothing out = No revenue = Very unhappy fab managers.
Validation is a good example of how much pressure is brought to bear in the production phase. Despite the millions spent on fast validation servers and the validation engineers who keep them busy, there is nothing like real silicon. In the first few seconds of running it, far more code is executed than in all the presilicon validation exercises combined. We are talking of ratios of 10 or 100 cycles per second for presilicon simulation to 3,000,000,000 cycles per second for real silicon. A lot of headaches can crop up in that many cycles.
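To make that gap concrete, here is a back-of-the-envelope calculation using the figures quoted above (the rates are illustrative only; actual simulation and silicon speeds vary widely):

    # Back-of-the-envelope comparison using the figures quoted above.
    # (Illustrative only; actual simulation and silicon speeds vary widely.)
    presilicon_cycles_per_sec = 100            # an optimistic RTL simulation rate
    silicon_cycles_per_sec = 3_000_000_000     # a roughly 3 GHz production part

    # Wall-clock simulator time needed to cover one second of real silicon:
    sim_seconds = silicon_cycles_per_sec / presilicon_cycles_per_sec
    sim_years = sim_seconds / (3600 * 24 * 365)
    print(f"{sim_years:.1f} years of simulation per second of silicon")   # roughly one year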
OF IMMEDIATE CONCERN
The only job in the production phase is to bring the product to market. Unfortunately, that simple idea translates into knocking down every barrier that might otherwise prevent ramping up to large shippable volumes in a few months. In this phase, cleverness gets you only so far; brute force in the form of a lot of hard work by a lot of people must take you the rest of the way.
Functional Correctness
The first barrier to knock down is any remaining functional inaccuracy. It is a mistake to think that postsilicon validation is just an extension of presilicon testing. The two have unique advantages and disadvantages. Suppose you have a cache with 4 Mbytes of storage, arranged in 128 K lines of 32 bytes each. The RTL that instantiates that cache in the chip model will have a loop, essentially telling the RTL compiler each line’s structure, causing that compiler to mechanically replicate that structure. Thus, in the RTL simulation universe, every line will behave identically. Knowing that this is how the cache was implemented in RTL, a presilicon validator can reasonably expect that if she proves line N correct, then line N + 1 is very likely functionally correct as well. Of course, good validators will always check the end conditions, such as the first and last lines, and possibly other internal boundaries, but at the functional testing level, validators can and do exploit regularities to guide their testing to the areas of highest payoff. They can divide and conquer, testing each abstraction layer in isolation.
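As a rough illustration of the regularity being exploited, the following sketch mimics in Python what the RTL loop does; a real design would express this in Verilog or VHDL, and the field names here are purely illustrative:

    # Illustrative Python analogue of the RTL loop described above; a real
    # design would express this in Verilog or VHDL. The point is only that
    # every cache line is a mechanical copy of one identical structure.
    LINE_BYTES = 32
    NUM_LINES = 128 * 1024            # 128 K lines x 32 bytes = 4 Mbytes

    def make_line():
        # One line: a tag, a valid bit, and 32 bytes of data (fields are illustrative).
        return {"tag": 0, "valid": False, "data": bytearray(LINE_BYTES)}

    cache = [make_line() for _ in range(NUM_LINES)]
    # Because every entry is an instance of the same replicated structure,
    # proving line N correct in simulation strongly suggests line N + 1 is too.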
For a real silicon chip, however, all abstraction levels must simultaneously work correctly. If the RTL functionality contains an error, the microprocessor will not perform to satisfaction, even with perfect circuits and no manufacturing defects. If the cache line drivers have a circuit marginality, for example, access to lines close to the cache unit’s edge might work at full speed, but those buried deep inside the unit or at its opposite edge might need more time than the clock rate allows. In this scenario, the presilicon tester’s assumption about the mutual correctness of cache lines N and N + 1 would be dead wrong.
Manufacturing and physical characteristics cannot be ruled out in postsilicon either. When a postsilicon validator approaches a new chip, she knows very little about it, especially for the earliest chips. Fatal design errors could be preventing the chip from running any code out of main memory, for all she knows. Whole sections of this chip, or all chips at this revision level, could be completely unusable. Power-to-ground shorts might be preventing the chip from even powering up.
Fortunately, the errors are typically not on that scale. As validation and design engineers test more and more of the design over the first few weeks of new silicon, they sometimes find outright design errors, but more commonly they find areas that just need improvement. Meanwhile, the presilicon validation team is completing its validation plan and probably finding more design errata.
Speed Paths
Another important obstacle is any design artifact that might be keeping the chip from reaching its clock target. The maximum clock rate of a microprocessor has been its principal marketing feature and a first-order determinant of its eventual performance.
Even in time-tested designs, silicon manufacturing has a speed curve. Multiple fab plants worldwide built to the same standards and using identical equipment and processes will not produce the same chips. Some will function at faster clock rates than expected; others will be slower. If everything goes well, the median will be close to what the earlier circuit simulations predicted.
With a cutting-edge, flagship design, however, odds are that the chips will not be as fast as the design team intended. This is hardly surprising, since it takes only one circuit-path detour or some overlooked design corner to limit the entire chip’s speed. Typically, tens to hundreds of these speed paths are in a chip’s initial build, and the production engineering team must identify and fix them before mass production can begin.
Speed-path debugging can be extremely tricky. The presilicon RTL simulation involved tremendous complexity, but at least the engineer could monitor any set of internal nodes of interest as the simulator was running. With postsilicon debugging, you simply cannot reach every set of internal electrical connections. Instead, you end up making some inspired guess, inferring what the problem might be, and then making even more inspired proposals about how to prove your conjecture.
In the past, engineers addressed speed-path debugging by increasing the supply voltage, which tended to make all circuits faster. This strategy quickly lost its appeal when the chip’s power dissipation became untenable, since dynamic power increases roughly as the square of the supply voltage. In effect, when the voltage goes up a little, the clock goes up proportionately, but thermal power goes through the roof. Even without the thermal power concern, you can raise the supply voltage only so high without running afoul of physical constraints such as oxide thickness and long-term chip reliability.
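A quick sketch of the textbook dynamic-power relationship (power roughly proportional to capacitance times voltage squared times frequency) shows why a modest voltage bump is such a poor trade; the numbers below are made up purely for illustration:

    # Textbook dynamic-power scaling, P ~ C * V^2 * f (numbers are made up).
    def dynamic_power(c_eff, v_supply, freq_hz):
        return c_eff * v_supply ** 2 * freq_hz

    base = dynamic_power(c_eff=1e-9, v_supply=1.50, freq_hz=1.0e9)
    # Raise the supply voltage 10% and assume the clock scales up in proportion:
    boosted = dynamic_power(c_eff=1e-9, v_supply=1.65, freq_hz=1.1e9)
    print(f"{boosted / base:.2f}x the power for a 1.10x clock gain")   # ~1.33x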
The production team uses a host of techniques, nearly all proprietary, to drive aggregate chip speed up to its target. I can generalize and say that the process is human-intensive. The circuits effort proceeds around the clock with multiple tag teams of experts, and it is punctuated by new tapeouts followed by weeks of waiting for the “new improved” chip to come back from the fab plant, so that the process can start all over again. Eventually, the fab plant begins reporting high enough yield at the desired clock frequency to warrant higher volume production, and the speed-path team can wind down.
Chip Sets and Platforms
For the same reasons that thoroughly testing a CPU design’s performance presilicon is impractical, it is even harder to test its performance with its accompanying chip set and platform (motherboard) design. Chip sets are designed on a very different timeline from CPUs; 9-12 months instead of 48, for instance. This means that even if you could hook the CPU and chip set simulations together presilicon, the chip set does not even exist until the final year of CPU development.
This increases the odds of more surprises when the actual chip set and CPU silicon first meet in the debug lab. The bus interfaces for both the CPU and chip set reflect certain policy decisions. In isolation, these decisions may seem innocuous, but when the two interfaces begin communicating, these decisions could easily turn out to have unfortunate consequences in certain corner cases.
There are some remedies for such interaction problems and we did use some of them. For the P6, we emulated the entire CPU, connected to a real chip set (although not the chip set being designed for it) in a large reconfigurable engine from Quickturn, Inc. (now owned by Cadence Design Systems). On the Pentium 4, we simulated the CPU and a mockup of its chip set as extensively as we could to help drive out mutual incompatibilities.
In the end, the one thing you can count on is that surprises are inevitable, and the production team will have to find them, identify them, decide which must be fixed (and how), and which can be lived with. This takes time, people, and a good working relationship with the early development partners.
SUCCESS FACTORS
Because the production engineering team comprises design, validation, management, marketing, and product engineers, it must balance a variety of concerns on a very tight schedule. There is no time to explore nuances in someone’s point of view. Communication must be direct, succinct, and frequent.
To satisfy this requirement, we created the “war room,” a designated room for daily meetings, during which the team assimilated new data and decisions, planned out the next day’s events, and coordinated with management. (Some politically overcorrect team members attempted to call it the peace room, but after a few months of tie-dyed shirts and responses of “Hey, chill, dude” instead of “I’ll get right on it,” we reverted to the original name.)
The war room was fundamental to the production engineering team’s main job, which was to nurture the product to a marketable state, an effort that entails such unenviable tasks as writing test vectors and managing performance and feature surprises.
Prioritizing War Room Issues
The war room team must successfully juggle a steady stream of sightings, confirmed bugs, new features, marketing issues, upper management directives, and the constant crushing pressure of a tight schedule. On a daily basis, however, its most important function is to prioritize the list of open items to ensure that they can be disposed of within the required schedule.
Some issues on the war room list will be quite clear: “This function does not work correctly, and it must. Fix at earliest opportunity.” The priority is clear when the team knows that the errant function is preventing a lot of important validation. In that case, they might decide that a new part stepping is appropriate. New steppings cost at least a million dollars each, and they require more coordination with external validation partners. Every project has a budget for some number of postsilicon steppings, so commissioning a stepping effectively fires one of the project’s silver bullets. Because steppings also take a few weeks for the fab plant to turn around, it makes sense to combine multiple fixes into one stepping whenever possible.
Other war room issues are not as black or white. Sometimes, the team finds that new features with no correspondence to any previous chips have some corner case that does not quite work as intended. Such features have no x86 compatibility issues, and waiting for a stepping or two could possibly save money and time. The war room team faces some tough decisions in these gray areas, but they must make them or risk losing the schedule.
Perennial gray-zone cases are performance counters and test modes. Performance counters are a set of non-architecturally defined features that let programmers tune their code for high performance. There is more than one way to measure some complex system phenomenon, so if one way doesn’t work or produces results that are slightly off, it is probably not worth a new stepping just to fix it. In fact, depending on the schedule status and many other factors, some issues are simply marked “future fix” or even “won’t fix.”
Upper management can and does get directly involved, and not always for the reasons you would think. There is a silent period around any official corporate disclosure that might affect the company’s stock price. During that time, all corporate officers must avoid any stock transactions because they know more than external investors and cannot trade on that knowledge. If an important bug comes to their attention during times when the trading window is open, the window shuts abruptly and remains closed until the bug has been resolved. Typically, most executives wait this out with relative patience, but those who were planning to pay for a boat or house through sale of stock become maniacally interested in resolving the errata. In these cases, the team might wish the war room had been encased in lead under some unspecified hunk of rock.
Managing the Microcode Patch Space
The microcode patch facility is an example of what the war room had to deal with. The facility is an important feature of Intel Pentium microprocessors starting with the P6, and to understand its importance to the production phase, some history is in order.
Minicomputers of the 1970s and 1980s had reloadable microcode, and when design errata were discovered, they were fixed by someone writing new microcode and distributing it to the field. Beginning with the P6, Intel mimicked this process with the patch facility, a new microcode facility that provided the ability to fetch microcode from an alternative source. The patch facility can handle a fairly arbitrary subset of the microcode ROM because every time the microprocessor powers up, one of its tasks is to read from a predesignated area of system memory to see if what is there looks like a valid microcode patch. After some nontrivial checking (after all, we want the machine to accept only legitimate Intel-supplied patches, not some hacker’s evil output, and we want to make sure the revision level of any patch found is appropriate to the CPU’s stepping), the patch is accepted, and whatever bug or bugs that patch was aimed at manifest no more.
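The power-up check just described can be sketched roughly as follows (in Python, purely as an illustration; the real patch format, authenticity checks, and load mechanism are proprietary, and every name and field below is hypothetical):

    # Rough sketch of the power-up patch check described above. The real
    # patch format, authenticity checks, and load path are proprietary;
    # every name and field here is hypothetical.
    PATCH_AREA = {"signature_ok": True, "target_stepping": "B0", "body": b"..."}

    def try_load_patch(patch_area, cpu_stepping):
        """Return True if a valid patch matching this CPU stepping was applied."""
        if patch_area is None:
            return False                  # nothing resembling a patch is present
        if not patch_area["signature_ok"]:
            return False                  # accept only legitimate Intel-supplied patches
        if patch_area["target_stepping"] != cpu_stepping:
            return False                  # patch revision must match this stepping
        # Real hardware would now redirect the affected microcode ROM entries
        # to the patch; here we simply report success.
        return True

    print(try_load_patch(PATCH_AREA, cpu_stepping="B0"))   # True
    print(try_load_patch(PATCH_AREA, cpu_stepping="C1"))   # False: wrong stepping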
The catch is that the microcode patch space is limited, which means that only so many patches can be applied. If someone finds a design error after that, the choices are bleak: evict a previous patch, forget about patching the new error, or recall the silicon (the latter usually at huge expense and inconvenience for both the customer and Intel).
Obviously, then, someone must zealously guard the microcode patch space. If a bug is merely an irritation or lends itself to a workaround, it should not be allowed to take up permanent residence in that space.
This is not as easy as it sounds, because microcode is a versatile medium in a microprocessor such as Intel’s, and almost anything beyond a certain complexity threshold will involve microcode. Translation: A suitably clever microcode patch can, in principle, fix a surprising number of bugs that have nothing directly to do with microcode.
You can also ameliorate certain chip set bugs that involve bus modes in this way. On the other hand, is that wise? If a small microcode patch can prevent the recall of a large number of motherboards, it is justified, but if the patch uses up half the CPU’s patch space and accomplishes nothing more than shaving two weeks off the chip set’s development schedule, it is probably not worth a permanent patch.
It takes exquisite technical judgment to be able to weigh what is known about a bug, its severity, how long it will take for a future stepping to arrive that fixes it, the volumes of all affected chips, how full the patch space already is, and how many future bugs are likely to appear that would use up more of it. No formula, algorithm, or algebraic equation, no matter how cleverly applied, can tell you how to wield the microcode patch space.
Your best resource for that is the most senior technical person, who in our case was Dave Papworth. Intel owes Dave an eternal debt of gratitude for the stellar job he did on this task for the first six years. Validation would inform the war room of the important sightings and bugs, and Dave would essentially tell the war room how those bugs were to be disposed of. It was like watching Michael Jordan play basketball or Tiger Woods play golf. Some people are so good at what they do, they make it look easy, no matter how hard it is, and it is only when you try it yourself that the amazement takes over.
PRODUCT CARE AND FEEDING
As every parent knows, or will find out when the time comes, you don’t just actively raise your children until they hit the magical age of 18 and then launch them into the world, reducing your parenthood to mere observation of their progress. Do launch them (a 30-year-old camping on the sofa is not a pretty sight), but the process of guiding, correcting, nudging, fixing, helping, and otherwise adding value must go on.
The same is true of a product. As with your children, you invest heavily in preparing them for the world and you hope that they are so ready that they will be a raving success with or without you. But they have strengths and weaknesses, and areas where some inspired collaboration can make a big difference. Your “product” continues to need you even after production has commenced.