Test Vectors
You were not sure you could run an entire marathon. Sure, you trained for it, hitting the pavement every night until your shoes wore out and your knees begged for mercy. But until you were actually part of the sweating, groaning, moving mass, you could not be sure if the training was enough. So you set a modest goal: Finish the race with some dignity, at least at a slow jog.
Finally, the big day arrives, and you run and run, and run some more. When you are sorely tempted to walk, some disembodied voice calls out, “Keep it up! Good job! You’re halfway there.” So you pick up the pace, persevere through the pain, keep your eyes on the backs of the people ahead of you, and try not to notice that your watch seems to have stopped. After agonizing hours, you hear shouts and see crowds leaning into the road, urging you on. Yards ahead is your goal and you are still on your feet. You lurch across the finish line in a kind of dream state, certain that you have never been this tired in your life and convinced that no one with a shred of sanity does this more than once.
And then a man rushes up to you, all business. “That was great,” he says, “but you’re already late, so change into your biking gear, and remember your target speed is 179 km/hr! Oh, and make sure your swimsuit is handy. Come on, buck up! This is only the first leg of your triathlon!”
The sinking feeling of this spent marathoner only approximates what design engineers feel when they have made it to tapeout only to be told to hurry up and generate production test vectors, which everyone needed yesterday.
Test vectors are an artifact of the silicon manufacturing process. Through an incredibly elaborate series of chemical, mechanical, and photolithographic processes, silicon is transformed from being little more than sand (silicon dioxide) into microprocessors with hundreds of millions of transistors and billions of wires. Before they are even cut away from the wafer, each die is individually powered up and given rudimentary testing. Testing at this stage saves work and expense for packaging, since it makes no sense to package a die you know is nonfunctional.
The testing procedure is relatively straightforward. A robotic arm positions an array of metallic whiskers over each die. The whiskers are lowered onto the bond pads of the die under test, the die is powered up, and its inputs are driven in sequences of logic values that cause good chips to respond in certain predictable ways.
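For readers who have never seen one, the sketch below shows the general idea in Python. The pin names, vector values, and the drive_pins/sample_pins hooks are invented for illustration; they do not reflect any actual tester interface.

```python
# A minimal sketch of the test-vector idea: each vector pairs the logic
# values driven onto the die's pins with the response a good die must give.
# Pin names and values here are illustrative assumptions only.

TEST_VECTORS = [
    # (inputs driven onto the pads, expected outputs sampled from the pads)
    ({"reset": 1, "data_in": 0x00}, {"data_out": 0x00, "ready": 0}),
    ({"reset": 0, "data_in": 0x3A}, {"data_out": 0x3A, "ready": 1}),
    ({"reset": 0, "data_in": 0xFF}, {"data_out": 0xFF, "ready": 1}),
]

def test_die(drive_pins, sample_pins):
    """Apply each vector in order; reject the die on the first mismatch."""
    for cycle, (inputs, expected) in enumerate(TEST_VECTORS):
        drive_pins(inputs)            # tester forces these logic values
        observed = sample_pins()      # tester samples the die's outputs
        if observed != expected:
            return False, cycle       # bad die; record the failing vector
    return True, None                 # die passed this rudimentary screen
```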
Less intuitive are the tasks of conjuring up the required inputs and defining the expected outputs, which together make up the largely unrewarding job of writing test vectors. As you might guess, nobody is eager to volunteer for this. It is a deadly dull, error-prone, and open-ended process. You cannot test everything; there is not enough time and you are allocated only a certain number of input vectors. For most of the design project, design engineers spend each new day building on what came before. If they did a great job preparing for part of their design, that part will be commensurately easier to complete, so they are motivated to do a great job. With test vectors, there is always a sense that you are doing it for someone else, someone you do not even really know. It can be challenging to keep someone interested in a task that they view as forced charity.
On any mandatory march like test vectors, it is imperative to properly supervise the design team. Their leaders must make sure that everybody contributes equally. If whiners can get off the test vector writing hook, word will circulate that whining gets you what you want, and you can guess what happens after that. Project leadership must clearly show how much this task is valued and ensure that those contributing to it are being noticed and recognized. The engineers will still grumble, but the task will be completed within the weeks or months allocated.
Performance Surprises
Early silicon is scarce, so it must be allocated to customers that are best prepared to help nurse it through its infancy. Companies like IBM, Dell, and Compaq/HP have traditionally been extremely valuable partners in testing early silicon and reporting any observed anomalies in functionality or performance.
As I hope the earlier chapters of this book made clear, tremendous effort and cost are expended presilicon to ensure that basic functionality is correct. Most of this effort is in the form of RTL simulations, which are cycle-by-cycle accurate with respect to the final silicon.
Correctly executing code is only a prerequisite. The final customer assumes the code will execute properly, but that is not the only reason why they bought the chip. Customers buy new computers because they believe that new computer will do things their previous one could not, such as run new applications, or old ones much faster or more reliably. The problem with performance is that it is much more difficult than functionality to simulate presilicon. The basic reason is “state,” the amount of information that must be accumulated before the machine is ready to do some particular task. A modern processor has caches, branch predictors, and other internal databases that must be warmed up over many millions of clock cycles.
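To make the "state" problem concrete, here is a toy sketch in Python. It is my own illustration, not anything from the P6 performance methodology, and the cache size is an arbitrary assumption: a cold cache misses constantly until enough accesses have accumulated, so a short simulation badly misjudges steady-state behavior.

```python
# Toy illustration of why performance simulation needs long warm-up:
# a cold cache misses heavily until its state fills in.
import random

CACHE_LINES = 1024                      # hypothetical cache capacity, in lines

def hit_rate(num_accesses, seed=0):
    rng = random.Random(seed)
    resident = set()                    # lines currently held in the cache
    hits = 0
    for _ in range(num_accesses):
        line = rng.randrange(CACHE_LINES)   # workload fits in the cache...
        if line in resident:
            hits += 1
        else:
            resident.add(line)          # ...but every first touch is a cold miss
    return hits / num_accesses

# Short simulations are dominated by cold misses; only long runs approach
# the warmed-up behavior real silicon will show.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} accesses: hit rate {hit_rate(n):.2f}")
```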
So even though presilicon testing has reasonably wrung out the initial silicon, the SRTL that defined that silicon has had relatively minimal performance verification. Therefore, performance surprises await early silicon, and such surprises are never in your favor.
Benchmarks are the only plausible way to tune a developing microarchitecture and find at least some of those surprises. The real code that will run on the designers’ new system is intractable. It is too big, uses libraries that do not yet exist, and is aimed at an operating system that is not even accessible yet. It also makes no sense to run existing applications. You would have to run thousands of them, which is hardly practical, and even then, they cannot cover all the design corners.
Benchmarks are supposed to represent the real code in all the important performance-related ways, while being much more manageable given slow presilicon simulation speeds. But which benchmarks? And who chooses them?
In yet another feat of deft judgment, the design team (in particular the performance analysts) must consciously predict which software applications will be the ones that really matter in a few years, and then find or create benchmarks and a performance-prediction method on which they can predicate the entire project’s success or failure. The trick is to fit all the benchmarks into the acceptable simulation overhead space, so one benchmark cannot take up so much space that it becomes the only one that gets analyzed presilicon.
All these uncertainties leave a lot of room for errors, changes, and misjudgments. A very common error is to fixate on the benchmarks that the team relied on for the previous chip generation. Another is to incorrectly anticipate how the buyer’s usage model will change. We were somewhat guilty of this with P6, having conceived and designed a workstation engine in 1990 that ended up being used as a Web server in 1996. We got lucky, however, because our basic design turned out to be compatible with the performance rules of an I/O-intensive Web server workload. But it is always better to correctly anticipate the workloads that will be of interest and then find ways to model those workloads that are compatible with the methods to analyze presilicon performance.
When the chip goes into production, the pool of people able to run real code on these chips grows enormously, as does the number of software applications. Consequently, chances increase that some of those applications will stray outside the expected performance envelope. Excursions outside the envelope tend to be asymmetric, however. The applications that run faster than expected will delight their users, but delighted users rarely call to tell you how happy they are with their applications. Disgruntled users, on the other hand, will forcibly bring slower-than-expected applications to your attention, along with their deepest convictions that you have wronged them. Analyzing the software in question and working with the application developer will resolve most, if not all, of these slowpoke applications. Often, the developer has used a suboptimal set of compiler switches, or the user’s system was not configured for highest performance, or the software vendors did not follow the performance-tuning rules (or at least the rules were not clear in some area crucial to this particular application).
Many performance issues in early silicon can be dealt with via compiler changes, operating system changes, or software vendor education. An example is the original Pentium Pro’s handling of partial register writes. Because of historical artifacts in the x86 register definitions, it is possible to directly address various subfields of a nominally 32-bit register like EAX. But the P6 microarchitecture renames registers by keeping track of register writes and register reads, and rather than triple the complexity of the register renamer, we chose instead to provide only enough functionality to correctly handle partial register writes. Used as we intended, this scheme was reasonably inexpensive and yielded very good performance. P6 marketing had a campaign to make this and several other performance tuning hints available to the software vendors well before P6 silicon arrived.
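As a rough illustration of why partial writes complicate renaming, consider this toy sketch, a deliberate simplification and not the actual P6 renamer: once AL has been written separately from EAX, a later full-width read of EAX needs bits from two different physical registers and must merge them, which is the costly case the P6 scheme declined to make fast.

```python
# Toy register renamer showing the partial-register problem. This is an
# illustrative simplification, not the P6 implementation.

class Renamer:
    def __init__(self):
        self.count = 0
        self.live = {}                 # architectural field -> physical reg

    def _alloc(self):
        self.count += 1
        return f"p{self.count}"

    def write(self, field):
        preg = self._alloc()
        if field == "eax":
            self.live = {"eax": preg}  # full-width write supersedes partials
        else:
            self.live[field] = preg    # partial write: only that field moves
        return preg

    def read_eax(self):
        # A full-width read must gather every physical register that still
        # holds live bits of EAX; more than one source means a costly merge.
        return {self.live[f] for f in ("eax", "ax", "ah", "al") if f in self.live}

r = Renamer()
r.write("eax")                 # e.g. mov eax, ... -> a read needs one source
print(r.read_eax())            # {'p1'}
r.write("al")                  # e.g. mov al, ...  -> low byte now lives elsewhere
print(r.read_eax())            # {'p1', 'p2'}: merge required (the slow case)
```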
We received confirmation that this tuning scheme worked when, shortly after we sent out early P6 production units, a report came back from the field support engineers that the developers of the game Monster Truck Madness were unhappy with the performance they were seeing. A cursory inspection of their code showed it was riddled with partial register usage, which a compiler change quickly fixed.
Feature Surprises
Geologists can read the Earth’s local history by interpreting the various rock strata. If there were such a thing as a computer geologist, Intel’s x86 architectural origins would be somewhere around the Paleozoic Era, as Figure 5.1 shows.
One such era relied heavily on 16-bit code, a mode in which the CPU used only 16-bit data registers; it coincides with when dinosaurs first roamed the Earth.
As I described in the section, “Legacy Code Performance” in Chapter 2, we P6 architects had to make an explicit choice as to what kind of code our new machine would optimize: older 16-bit or newer 32-bit. This was not a question of running one and jettisoning the other; P6 had to get correct answers no matter what kind of x86 code it encountered. The choice confronting us was how to strike the best performance balance between old and new.
After not much debate, we decided that 32-bit code was clearly the future and that if x86 were going to keep up with 32-bit-only competitors such as MIPS, Sun, HP, and IBM, we needed to go after that target directly. If we lost that war, we might sell one more generation of x86s but we would fall further and further behind everyone else. If we were going to catch up to our non-x86 competitors, it was now or never. For the long-term good of the x86 franchise, we adopted a formal target of 2x for 32-bit code and 1.1x for 16-bit code, and proceeded to optimize P6 for 32 bits.
We had many meetings with Microsoft, swapping notes and ideas about the future of DRAM sizes and speeds, hard-drive evolution trends, and our respective product road maps. From 1991 to 1993, Microsoft made it clear that Chicago, which eventually became Windows 95, would be their first 32-bit-only operating system. Then, in 1994, they announced that a key piece of Chicago would, regrettably, remain 16 bits: the graphics display interface.
This was unhappy news. It was far too late in Pentium Pro’s development to change something as fundamental as our overall optimization scheme. But we were not too worried; after all, Windows 95 would be a desktop operating system, and the Pentium Pro was not intended as a desktop CPU. (Its two-die-in-package physical partitioning afforded much higher performance, but at a much greater manufacturing cost and, thus, with constrained production volumes.) Unix dominated servers and workstations, and Unix was strictly 32 bits. We knew that proliferations of the original P6 chip would be aimed at desktops, but we bet that the original Pentium could hold that market segment for the year it would take the Pentium II team to put in some 16-bit tweaks and retarget the P6 for desktops.
That is exactly what happened, although, in retrospect, it is clear that we made a riskier bet than we had intended. The Pentium chip did, in fact, hold its own on the desktop, but luck played a larger role than we foresaw: AMD’s K5 chip turned out to be late and very slow. Lack of competition, combined with a vigorous marketing campaign centered around dancers wearing colorful bunny suits, made a success of Pentium long enough for the P6-based Pentium II to become ready, with its improvements to 16-bit code performance.
Figure 5.1. Intel history of an ancient architecture. Photo by John Miranda (www.johnmirandaphoto.com).
Making Hard Decisions
Eventually, I would find myself excoriated for the decision not to emphasize 16-bit code performance, but looking back at the experience, I am unrepentant. You cannot explore every idea to equal depth, cover every base, hedge every bet, and refuse to make any decisions until all the data is available. All the data is never available. This is true not only in engineering, but in every important human endeavor, like marriage, family, and choosing a job or home. To choose one path among several is to fundamentally exclude other sets of possibilities; you cannot have it both ways. With perfect hindsight, our decision to optimize 32-bit performance was exactly the right one. A subsequent P6 derivative microprocessor added 16-bit performance tweaks, and since that particular part was aimed at desktops, where 16-bit code was still important (legacy and Windows 95), we were covered in that market. Meanwhile, our 32-bit optimized part was opening new markets for Intel in workstations and servers, where 16-bit code was irrelevant, and those markets became major new revenue streams.
LDS Elder Robert D. Hales says, “The wrong course of action, vigorously pursued, is preferable to the right course pursued in a weak or vacillating manner.” Engineering is about taking calculated risks, pushing technology into new areas where knowledge is imperfect, and if you take enough risks, some of them will go against you. The trick in a project’s concept phase is to know when and where you are taking risks and to make sure you can either live with a failure (by permanently disabling an optional but hoped-for new feature, for example) or have viable backup plans in place. And never forget Will Swope’s dictum: Your backup plan must be taken as seriously as your primary plan; otherwise it is not really a backup plan. Thinking you have a backup plan when you really do not is much more dangerous than purposely having none. The Space Shuttle Challenger’s second O-ring exemplifies this trap [19].
To choose A is not to choose B. People who try too hard to get both, as a way of avoiding the difficult choice between them, will end up with neither.
I consider our 16-bit choice on the original P6 to be among our best decisions. We recognized an important issue, considered all of its ramifications, placed an intelligent bet on the table, and won. To me, making such compromises openly and rationally is the essence of great engineering.
Not everyone saw it that way. In 1996, under the guise of conducting a routine interview, an industry analyst took me to task in no uncertain terms over our 16-bit choice. How could we have been so expressly incompetent in failing to recognize how much 16-bit code remained in the world? Did we not realize we were threatening not only Intel’s immediate future, but also the future of the industry as a whole? What unmitigated hubris, what fatal ignorance! As far as this man was concerned, all the people involved should be chained to the same rock as Prometheus and undergo the same liver surgery.
Over the next few years, different internal groups within Intel would sporadically “discover” the Pentium Pro’s perceived weakness. E-mail flew, presentations were given, and task forces were formed. If there ever is a Monday Morning Quarterback league, it will not lack for players.
What all these objectors fail to see is that design is the art of compromise. You cannot have it all. In fact, you cannot have even most of it. The Iron Law of Design says that, at best, you can have a well-executed product that corresponds to a well-conceived vision of technological capability/feasibility and emerging user demand, and that if you strike a better compromise than your competitors, that is as good as it gets. If you forget this law, your design will achieve uniform mediocrity across most of its targets, fail utterly at some of them, and die a well-deserved and unlamented death.
Executive Pedagogy and Shopping Carts
An important part of product care and feeding is to expect the unexpected, and I had many opportunities to exercise that philosophy during the P6 project. Corporate management was beginning to show interest in our microarchitecture, and requested that I give them a tutorial on how it worked.
This was a frightening prospect. I knew how it worked, but I also knew how unprepared our executives were to grasp out-of-order, speculative, superpipelined microarchitectures. How could I give them some intuitions about our motivations and choices in the design of this complicated engine?
The night before I was to give this talk to our executive VP, I still had not solved this conundrum. I knew I could always be boring and pedantic, and give the listener a straight-up data dump. If they could not keep up, too bad.
But that is just not my style and, anyway, I was proud of our design and I really wanted our executives to understand, at least to some extent, how thoroughly cool it was. So I decided to pitch the talk at my mother, a smart person, but one with no technical background. If she could follow the discussion, then Intel execs might, too.
I did not think my mother would have the patience or interest to learn enough about microarchitectures to approach this topic on its own terms. So I started thinking about analogies, and I finally came up with one involving shopping carts and supermarkets, one with enough parallels to an actual microarchitecture to illuminate the concepts without too much distortion.
I also happen to like this analogy because it supports one of my pet conjectures: Computer design looks a lot more mysterious than it is because familiar ideas tend to be hidden by engineers who rely heavily on the passive voice and routinely forget to eschew obfuscation. Actually, computer science has very few original concepts. Once you get past the buzzwords and acronyms, you can fairly easily explain the ideas using a range of familiar contexts. I happened to pick grocery shopping, which nearly everyone has done at some time.