So although a rollback approach will work in some scenarios, in many others its success will be questionable. A four-CPU system, for example, has up to 12 caches among the CPUs, all controlled via the MESI (Modified-Exclusive-Shared-Invalid) protocol, a coherence scheme in which caches trade information as necessary to ensure overall consistency. But if just one of these caches were to fail in a way that its local error-correction mechanism could not fix, all four CPUs (the entire machine) would be potentially compromised, because the broken cache could have contained data that simply existed nowhere else in the system.
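For readers who have not met MESI, a minimal sketch of the four line states and one representative transition follows, in C. The state definitions follow the protocol, but the transition function is deliberately simplified and illustrative only; a real cache controller also drives the bus signaling (snoops, writebacks, invalidations) that this sketch omits.

```c
/* Illustrative sketch of MESI cache-line states. A real controller
 * also performs bus transactions (snoops, writebacks, invalidations)
 * that are not shown here. */
typedef enum {
    MESI_MODIFIED,   /* this cache holds the only, dirty copy        */
    MESI_EXCLUSIVE,  /* this cache holds the only copy; memory is current */
    MESI_SHARED,     /* other caches may also hold clean copies      */
    MESI_INVALID     /* line holds no valid data                     */
} mesi_state_t;

/* Simplified local response when another CPU's read of this line
 * is snooped on the bus. */
mesi_state_t on_remote_read(mesi_state_t s)
{
    switch (s) {
    case MESI_MODIFIED:   /* must supply (or write back) the dirty data, */
    case MESI_EXCLUSIVE:  /* then demote to Shared                       */
        return MESI_SHARED;
    default:
        return s;         /* Shared and Invalid are unchanged */
    }
}
```

The Modified state is the crux of the reliability problem above: a Modified line is, by definition, the only current copy of that data anywhere in the system.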
After considering some fairly aggressive schemes to improve system reliability, we finally decided to make conservative improvements to Intel’s previous “machine check architectures,” checking and reporting on errors, but making no heroic attempts to roll back to presumably safe checkpoints, and making no promises to users along those lines. Maybe some day we will.
Performance-Monitoring Facilities
As recently as the 1980s, computers were still being implemented with ICs of such limited complexity that it took many hundreds of them to make up a system. Interconnecting so many separate chips meant that the printed circuit boards (PCBs) had to carry thousands of wires. And while these chips and interconnects meant higher manufacturing cost and possibly an impact on system reliability, having easy access to them during debug and test was an unalloyed blessing. Test equipment vendors such as Hewlett-Packard and Tektronix sold logic analyzers that could collect digital information from anywhere in the machine. If the system was malfunctioning, a technician could track down the problem by following the symptoms wherever they led. Likewise, if a performance analyst wanted to know how often a given functional unit was being used, she could attach her test equipment directly to that functional unit and simply monitor it.
Then, along came microprocessors. Early microprocessors were not so bad. Lacking substantial internal state such as caches and branch prediction tables, these CPUs had to signal all their activities on their buses. If they needed the next instruction, you could see the request on the bus and the memory’s response to that request. True, you could no longer just attach your logic analyzer and watch the program counters sequencing merrily by, but the machine could not go too far awry before giving unmistakable clues on the bus.
To enhance performance, microprocessors began including internal caches in the late 1980s and branch prediction tables shortly thereafter. Modern microprocessors can have several megabytes of cache, more than enough for the processor to execute thousands or even millions of instructions before any external evidence of what the processor is actually doing appears on the frontside bus. If something goes wrong at instruction N, and the engineer trying to debug it does not learn of the malfunction until instruction N plus one million, that engineer is about to have a bad day.
At Multiflow Computer, we had included a set of performance-monitoring facilities directly in the computer itself. With no recourse to logic analyzers, you could get the machine’s diagnostic processor to “scan out” the performance-monitoring information and present it in a variety of useful ways. After having used and loved that facility for several years, Dave Papworth and I resolved to provide something similar in any future designs, especially microprocessors, in which overall visibility is the most restricted.
Counters and Triggers. For the P6, we therefore proposed and implemented a set of hardware counters and trigger mechanisms. Our intention was to provide enough flexibility so that the performance analyst could set up the performance counter conditions in many ways to help zero in on whatever microarchitectural corner case was turning out to be the bottleneck in his code. But we could not spend a lot of die area on the facility, and we absolutely wanted to avoid introducing any functionality bugs associated with the performance counter apparatus.
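As a concrete illustration of what such counters look like from software, the sketch below reads a P6-family performance counter with the RDPMC instruction, which the P6 generation introduced. The GCC-style inline assembly is mine, not Intel’s, and event selection (done from kernel mode via WRMSR to the event-select registers) is omitted because the encodings vary by processor generation.

```c
#include <stdint.h>

/* Read performance-monitoring counter `ctr` (0 or 1 on the original P6)
 * with the RDPMC instruction, which loads the counter into EDX:EAX.
 * RDPMC requires ring 0 unless the OS has set CR4.PCE. Programming
 * which event the counter tracks happens separately, via WRMSR to the
 * event-select MSRs; those encodings are not shown here. */
static inline uint64_t read_pmc(uint32_t ctr)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ctr));
    return ((uint64_t)hi << 32) | lo;
}

/* Typical usage: difference two readings around the code under study.
 *   uint64_t start = read_pmc(0);
 *   ... workload ...
 *   uint64_t events = read_pmc(0) - start;
 */
```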
Intel eventually commissioned a software group to create a good user interface to these performance counter facilities. Their program is called VTUNE, and is of tremendous value to programmers tuning their code for maximum performance.
We designed the performance counter facility originally for ourselves, with an eye toward minimizing die size and designer time. (Yes, we had to sneak these features in, and you can only do that if overall impact is very small.) We asked the designers of various chip units to provide for measurements of certain features, which they dutifully did. Unfortunately, sometimes they had to use signals that had already been qualified with other signals; some measurements would include certain global stalling conditions, whereas others might not, for example. This sometimes led to user confusion.
Some of the items we architects wanted to measure were so detailed and deep in the machine that, for users to actually understand them, we would have had to publish much more than prudent intellectual property guardianship would allow. We chose not to publicize these because they were simply not going to be of much use to anyone outside Intel.
Protecting the Family Jewels. Another problem concerned one of the major performance-related things any programmer would want to do: correlate the microop stream to the assembly language stream that spawned it. That correlation is fundamental to the way P6 works, and the performance-monitoring facility could provide the microop information required. The only problem was that this was tantamount to publishing our microcode. The programmer merely had to generate a program with every x86 instruction in it and VTUNE would report the microcode streams corresponding to each. But Intel’s microcode was considered part of the family jewels, so exposing it at this level was out of the question.
I proposed a compromise. For any x86 instruction that mapped into a single microop, VTUNE would accurately report the mapping. After all, for these simple operations, what the microop must do was no mystery. But for complex instructions, VTUNE would report only the first four microops and how many others were required to realize a given complex x86 instruction. Performance-sensitive code would not normally use these complex x86 instructions anyway, and if it did, programmers would be unable to change anything about the microop stream even if they wanted to. This is still how VTUNE operates today.
Testability Hooks
In the late 1990s, the Focused Ion Beam (FIB) tool debuted, and for the first time silicon debuggers could make limited changes to the chip. This capability was and still is supremely important because design bugs are extremely good at hiding other bugs. Suppose, for example, that a bug is in the floating-point load-converter unit (FPLCU). Until that unit performs its job properly, you cannot test the floating-point adder, multiplier, divider, and so on. Without an FIB tool, a failing diagnostic would have led the technician to the FPLCU, but he would have had to stop there and place all floating-point tests offline until the FPLCU was fixed on a new silicon stepping. Getting that new stepping could easily take three to six weeks, depending on how busy the fab plant was with current production.
Now suppose the floating-point multiplier also had a bug. The time to get silicon with a working floating-point multiply would probably be closer to 17 weeks: two weeks to identify the FPLCU bug, a week to conceive and implement the fix for it in RTL, six weeks to get the new stepping back from the fab plant, a week for further testing, a week to find and resolve the new floating-point multiply bug, and six more weeks to get that stepping back. An FIB machine lets you reach in and fix or work around the FPLCU bug, so the floating-point multiply bug hiding behind it becomes apparent.
We were not sure we would have an FIB tool in the P6 generation, so to help expose bugs hiding behind other bugs and generally give engineers more tools during debugging, we gave the P6 an extensive set of debug/testability hooks. Our intention was that for any feature that was not crucial to the chip’s basic functioning, there should be a way to turn it off and run without it.
An example is the “fast string move” capability in the P6 microcode, which had an extremely complex implementation. We worried that, despite all our presilicon testing, some piece of real code might cause it to fail, so we wanted a way to turn it off and revert to the existing known-correct string-handling microcode if necessary.
We tried to minimize power dissipation in the P6 by not clocking any units that did not absolutely need it, but we worried that the designers could easily have missed something in the logic that handles the power-down circumstances. To make us feel better, we provided a separate power-down-disable control bit for every unit.
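A minimal sketch of what such a facility might look like to a debugger follows, assuming a hypothetical control register with one disable bit per unit. The bit names and layout here are invented for illustration; the actual P6 control bits were never published.

```c
#include <stdint.h>

/* Hypothetical power-down-disable control register: one bit per unit.
 * Setting a bit forces that unit's clocks to stay on, taking its
 * (possibly buggy) clock-gating logic out of the picture while a bug
 * is being chased. Bit assignments are invented for this sketch. */
#define PDD_FP_CLUSTER  (1u << 0)  /* floating-point cluster    */
#define PDD_L2_IFACE    (1u << 1)  /* L2 cache interface        */
#define PDD_BTB         (1u << 2)  /* branch target buffer      */

static uint32_t pdd_reg;  /* stand-in for the real control register */

/* Keep the named unit's clocks running unconditionally. */
static inline void disable_power_down(uint32_t unit_bit)
{
    pdd_reg |= unit_bit;
}
```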
A truly insidious psychological artifact rears its ugly head in designing performance-monitoring facilities. Designers and validators are very, very busy people. They routinely miss important family events and give up holidays, weekends, and evenings attempting to keep their part of the overall design on schedule. They therefore value their time and energy highly, and ruthlessly triage their to-do lists. When items appear on those lists that are not clearly and directly tied to any of the major project goals, those items inevitably become bottom feeders. Consequently, they get less time during design, which implies that they are probably less well thought out than the mainstream functionality. By then there is less time to validate them, and the validators are always way behind schedule at this point, so these items also get less testing than they deserve.
Moreover, during debugging, performance-monitoring facilities that are not working quite right will not hold up the activities of very many people. That’s not to say that debuggers do not depend on their tools. If the testability hooks intended to speed debugging are themselves buggy, the confusion they generate can easily outweigh the value they bring.
The only way I have ever found to ensure that performance monitoring and testability hooks get properly implemented and tested is to anoint a special czar, whose job is to accomplish that. Without that czar, the shoemaker’s children will still go barefoot.
GRATUITOUS INNOVATION CONSIDERED HARMFUL
Engineers fresh from college are uniformly bright, inquisitive, enthusiastic, and clue-challenged, in the sense that they are somewhat preconditioned to the wrong things. Perhaps at some point in the college education of every engineer (myself included), someone put us in a deep hypnotic trance while a voice chanted, “You are a creative individual. No matter what someone else has designed, you can do it better, and you will be wildly rewarded for it.” Or maybe new engineers just lack the experience to know what has been done before and can be successfully reused, versus what is no longer appropriate and must be redesigned. Whatever the reason, almost all new engineers tend to err on the side of designing everything themselves, from scratch, unless schedule or an attentive boss stops them from doing so.
This disease, which I call “gratuitous innovation,” stems from confusion in a designer’s mind as to why he is being paid. New engineers think they are paid to bring new ideas and infuse stodgy design teams with fresh thinking, and they do contribute a great deal. But many of them lose sight of an important bottom line: They are paid to help create profitable products, period. From a corporate perspective, creating a wildly profitable product with little new innovation is a wonderful idea because it minimizes both risk and investment.
Engineers who understand that the goal is a profitable product, not self-serving new-patent lists or gratuitous innovation, will spend much more time dwelling on the real problems the company and product face. To be sure, in some cases, new ideas will bring about a much better final product than simply tweaking what has already been done, but my experience is that unless you restrain the engineers somehow, they will migrate en masse to the innovation buffet.
An example is the design of microprocessor circuits. Integral to any such design are catalogs of cell libraries, existing designs that accomplish some list of desired functions. The library may contain more than one entry for accomplishing a given function, with one entry optimized for speed and another for die size. Engineers who are optimizing their design for reuse can often find existing cell-library components. Engineers who are optimizing their design for their career (a benefit that may be real, or only perceived if their managers are wise to this game) will tend to insist on creating custom cell libraries, thinking that only in that way will they end up with their names on patents.
It can be fun to get wooden plaques bearing the seal of the U.S. Patent Office and your name, but it’s much more fun, and ultimately much more lucrative for all concerned, to concentrate on the product and its needs. Real innovation is what attracts many of us to engineering in the first place. Never confuse it with the gratuitous version, which only adds risk to the overall endeavor.
VALIDATION AND MODEL HEALTH
So far, I have devoted this chapter to the architects and design engineers with a few validation mentions here and there. I am about to correct any misunderstanding about that: The real work of validation is concentrated in the realization phase, and the primary task is to drive the RTL model to good health and then keep it there.
A Thankless Job
Corporate management has a difficult job overseeing microprocessor development. Projects like P6 and Willamette take more than four years of effort by hundreds of people, cost hundreds of millions of dollars, and potentially affect company revenues by many tens of billions of dollars. If things are not going well in a development effort, managers want to know as early as possible so that corrective actions can be taken.
As the RTL model develops, upper management can collect statistics such as the number of new RTL lines written per week and track that running total against the original estimates of how many lines are needed to fully implement the committed functionality. The original estimates are usually too low and managers continuously revise them upward as the project matures. These revisions tend to be predictable, and you can make reasonably accurate extrapolations for final RTL size surprisingly early in the project.
But when the same upper management focus turns to presilicon validation, difficulties abound. The validation plan shows all the tests that must be run successfully before taping out, and there is a running total of all tests that have run successfully, but neither is terribly helpful. You cannot simply measure the difference between them, nor can you simply extrapolate from the improvement trend.
When the validation plan is conceived at the project’s beginning, its designers try to account for all that is known at that time by asking questions such as
• Which units will be new, and which will be inherited unchanged from a previous design?
• What is each new unit’s intrinsic degree of difficulty?
• What is the most effective ratio of handwritten torture tests versus the number of less-efficient but much more voluminous random-test cycles on each unit, and on the chip as a whole?
• What surprises are likely to arise during RTL development, and what fraction of overall validation effort should be held in reserve against such an eventuality?
• How long will each bug sighting take to analyze, and how long will the resulting bugs take to resolve?
• What role will new validation techniques play in the new chip (formal verification, for example)?
During RTL development, upper management was clearly unhappy about how their quick-and-dirty validation-progress metric was behaving. Perhaps more to the point, they were unhappy that the unfinished fraction of the original validation plan was not shrinking on any acceptable trendline; in fact, the fraction of the plan that had been accomplished appeared to be shrinking, not growing, because the validation team was alertly adding new testing to the plan as they learned more about the design, and the plan was growing faster than the list of now-running tests. In effect, for a while it looked as though the validation effort was falling behind by 1.1 days for every day that went by.
After a week of unproductive meetings on the topic, management asked us to conceive a metric that we would be willing to work toward, one that would show constant (if not linear) progress toward the quality metric required for the chip to tape out.
Choosing a Metric
We proposed a “health-of-the-model” (HOTM) metric that took into account what seemed to me, Bob Bentley, and his team, to be the five most important indicators of model development, and we weighted them as seemed appropriate:
1. Regression results. How successful were the most recent regression runs?
2. Time to debug. How many different failure modes were present, and how long did it take to analyze them?
3. Forward progress. To what extent was previously untried functionality tested in the latest model?
4. High-priority bugs. How many bugs of high or fatal severity were open?
5. Age of open bugs. Are bugs languishing?
We then began tracking and reporting this HOTM metric for the rest of the project.
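To make the weighting concrete, here is a minimal sketch of how such a composite score might be computed, with each indicator normalized to a 0-to-1 health value. The weights shown are invented for illustration; the actual weighting the team used is not published here.

```c
/* Illustrative health-of-the-model (HOTM) score: a weighted sum of the
 * five indicators, each pre-normalized to 0.0 (unhealthy) .. 1.0
 * (healthy). Weights are hypothetical and sum to 1.0. */
struct hotm_inputs {
    double regression_pass_rate;  /* fraction of recent regressions passing */
    double debug_turnaround;      /* 1.0 = failure modes analyzed quickly   */
    double forward_progress;      /* new functionality exercised this model */
    double high_priority_bugs;    /* 1.0 = no open high/fatal bugs          */
    double bug_age;               /* 1.0 = no languishing open bugs         */
};

double hotm_score(const struct hotm_inputs *in)
{
    return 0.30 * in->regression_pass_rate
         + 0.20 * in->debug_turnaround
         + 0.20 * in->forward_progress
         + 0.20 * in->high_priority_bugs
         + 0.10 * in->bug_age;
}
```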
The fair amount of subjectivity in these indicators was intentional. We recognized that the strong tendency is to “get what you measure,” and we did not want the HOTM metric to distort the validation team’s priorities until we had accumulated enough experience with it to know if it was leading us in the right direction. Because we were the ones who had conceived the validation plan, we knew it was a very valuable, yet necessarily limited, document. Despite our best efforts to be comprehensive and farsighted, if history was any guide, we would discover that some parts of the validation plan would place too much emphasis on some part of the design that turned out not to need it, while other parts would turn out to be the most problematic and require much more validation effort than we had expected. We did not want to find ourselves unable to respond appropriately to such exigencies on the sole basis of some document we ourselves had written, knowing only what we knew two or three years ago.