My part of the game was to think through the conditions he specified, and let my intuition provide my best guess (which is exactly what he had done the previous night). Of course, I would cheat a little bit, because I knew that if the answer were obvious, Dave wouldn’t be smiling, so I offered him the most extreme among the possibilities my intuition said were at all feasible. Then, when he told me what he had found, I would realize how far my experience had led me astray, which then motivated me to apply the right amount of corrective mental pressure. Consequently, the next time we played this game, I would get closer on my first try. Occasionally, my intuition even turned out to be right, but many times that meant I had just found a bug in DFA.
This DFA-induced intuition retuning was key to P6’s overall success in the concept phase. Data has a way of making you ask the right questions.
DFA Frontiers
DFA was a great success. It was a conceptual flashlight that we used to illuminate many dark corners of the design space, where lurked subtle interactions among multiple design choices. It allowed us to quantify our ideas, to show which ideas were better than others and by how much. DFA guided us to places where performance was being lost so that we could concentrate our creativity on those areas.
One of our more interesting forays was a pursuit we dubbed “walking the rim of the known universe.” If DFA was not constrained, the results it returned essentially answered the question, “How much intrinsic instruction-level parallelism is in this program?” For certain programs, such as vectorized, floating-point codes, intrinsic program parallelism is known to be essentially unbounded. But for most programs, the academic literature varies wildly in its estimates, from numbers as low as 2 to as high as 100. If intrinsic parallelism really were only 2, then investing a huge amount of design time and die space on an out-of-order microarchitecture would never be repaid in performance. It would be like drilling a deep well somewhere you knew had no water.
DFA quickly resolved the theoretical limits for x86 parallelism. While touring the galaxy with it, however, it dawned on us that it would be straightforward to also make it answer an important follow-on question, “How much of the theoretical maximum parallelism is still available when running on hardware with real constraints on functional unit numbers, bus bandwidth, memory bandwidth, branch prediction, and instruction decoder width?” Within a few weeks, we had rigged DFA with default values for a representative hardware configuration, with command-line options to override them. This arrangement let us identify out-of-order-isms, areas where this still-new out-of-order universe collided with our intuitions. We ran a series of experiments with DFA set to assume an infinitely capable microarchitecture, except for one important subsystem that was constrained by reality.
The first thing these experiments showed us was that the weakest-link-in-a-chain model still applied: An infinitely capable machine might show that intrinsic parallelism on a given program would be, say, 35x, but constrain anything in the machine, and even if 90% of the engine is still infinitely capable, the overall performance will slow drastically. Constrain any two things, and performance falls even more, although not as precipitously. Constrain everything, as in a real machine, and DFA’s results begin approaching reality.
By its very nature, however, DFA would tend to overstate parallelism because anything left out of the run would tend to unconstrain it, thus (artificially) boosting parallelism. Much of DFA’s subsequent development over the next several years was devoted to installing ever more accurate constraints.
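To make the flavor of such a limit study concrete, here is a minimal sketch, in Python, of the general technique of trace-driven dataflow scheduling: assign every instruction the earliest cycle its operands allow, optionally capped by a single resource constraint such as issue width. This is not DFA itself; the trace format, the dataflow_ilp function, the unit-latency assumption, and the issue-width knob are all invented for illustration.

    # Trace-driven dataflow limit study (illustrative sketch only, not DFA).
    # Each trace entry is (destination register, list of source registers);
    # every instruction is assumed to have unit latency.
    from collections import defaultdict

    def dataflow_ilp(trace, issue_width=float("inf")):
        """Return instructions per cycle when each instruction executes at the
        earliest cycle its operands allow, capped by a per-cycle issue width."""
        ready = defaultdict(int)    # cycle at which each register's value is available
        issued = defaultdict(int)   # instructions already placed in each cycle
        last_cycle = 0
        for dest, sources in trace:
            cycle = max((ready[s] for s in sources), default=0)
            while issued[cycle] >= issue_width:   # slip past cycles with full issue slots
                cycle += 1
            issued[cycle] += 1
            ready[dest] = cycle + 1               # result usable the next cycle
            last_cycle = max(last_cycle, cycle)
        return len(trace) / (last_cycle + 1)

    # Toy trace: two independent dependence chains interleaved.
    trace = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"]),
             ("r5", ["r2"]), ("r6", ["r4"])]
    print(dataflow_ilp(trace))                  # unconstrained: intrinsic parallelism of 2.0
    print(dataflow_ilp(trace, issue_width=1))   # constrain one resource and it drops to 1.0

Even this toy shows the weakest-link effect described above: leave everything unconstrained and the trace’s intrinsic parallelism emerges; constrain a single resource and the reported parallelism collapses toward what a real machine would see.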
DFA had one other major limitation. Besides being out-of-order and superscalar, P6 performed speculative execution. The “hello world” example given in “Out-of-Order, Superscalar Microarchitecture: A Primer” in the Appendix is contrived, but it demonstrates a realistic effect: A conditional branch occurs every few program instructions. If a microarchitecture cannot reorder code beyond a conditional branch, the supply of available work will dry up to the point where even designing a superscalar engine would be moot. If the engine does permit code to be reorganized around conditional branches, the supply of concurrent work is much bigger. This larger supply comes at the price of much bookkeeping hardware, however, which must notice when a branch did not go the expected way, repair the damage to machine state, and restart from that point. DFA had a built-in weakness with respect to speculative execution. Because it could see only the trace of what an actual program did, it did not have any way to go down the wrong path occasionally and then have to recover from that, as a real speculative engine would.
We could have addressed this shortcoming by giving DFA access to the program object code and datasets, as well as the execution trace, but it seemed too much work for the improvement we could expect. We ended up collecting enough heuristics to reliably cover the tool’s shortcomings, but even so, under certain conditions, it was relatively easy to fool ourselves.
A subtle benefit of a handcrafted tool such as DFA is that it is written directly by the architects themselves. The act of codifying one’s thinking unfailingly reveals conceptual holes, mental vagueness, and outright errors that would otherwise infest the project for a much longer time, thus driving up the cost of finding and fixing them later. Having the architects write the tool also forces them to work together in a very close way, which pushes them toward a common understanding of the machine they are creating.
We probably got too enamored of this tool, however. As we finished P6 and were beginning the Willamette project (which would eventually become known as the Pentium 4 microprocessor) we once again needed an early behavioral model. We could start from scratch, buy something off-the-shelf, or stretch DFA (despite some misgivings that DFA did not appear to be a great fit for the Willamette machine). Sticking with what worked in the past, we chose to extend DFA, and quickly found that it was taking exponentially more work to get marginal increments in usefulness out of it. The moral of the story is, don’t fall in love with your tools but rather use or make the right tool for the job.
PERFORMANCE
As I described earlier, our immutable concept phase goal was performance. Doubling the P5’s performance was a simple, straightforward mission, exactly the kind you could rally the troops around.
Benchmark Selection
It didn’t end up being that simple. The troops had a lot of questions. Performance on which of the dozen important benchmarks? Did we have to get 2x performance or more on all of them? Or did the global average have to be 2x, with some benchmarks substantially better than that, and perhaps a few that were much slower? And if some programs could be less than 2x, how much less could they be? Was it okay for the slowest chip of the P6 generation to be slightly slower than the fastest chip of the previous generation? Were some benchmarks more important than others, and therefore to be weighted more heavily?
We also had to determine what methods we would use to predict P6 performance. Would benchmark comparisons be made against the best of the current generation, or today’s best extrapolated out a few years? In some cases, such as transaction processing, actual measurements are extremely difficult to make, and theoretical predictions are even harder. Yet that type of programming is important in servers, and you cannot just ignore it.
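One way to picture how such questions boil down to arithmetic is a weighted geometric mean of per-benchmark speedups against the previous generation; the benchmark names, weights, and numbers below are invented purely for illustration, not actual P6 targets or results.

    # Hypothetical benchmark aggregation: weighted geometric mean of speedups
    # versus the previous generation. Benchmarks, weights, and values are invented.
    from math import prod

    speedups = {"int_a": 2.3, "int_b": 1.8, "txn_like": 1.6, "fp_kernel": 2.6}
    weights  = {"int_a": 0.4, "int_b": 0.3, "txn_like": 0.2, "fp_kernel": 0.1}

    geo_mean = prod(speedups[b] ** weights[b] for b in speedups)   # weights sum to 1.0
    print(f"weighted geometric-mean speedup: {geo_mean:.2f}x")
    print("slowest individual benchmark:", min(speedups, key=speedups.get))
    print("meets 2x goal overall" if geo_mean >= 2.0 else "misses 2x goal overall")

Even in this toy, the aggregate can clear 2x while an individual benchmark sits well below it, which is exactly the kind of tension the questions above were probing.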
Avoiding Myopic Do-Loops. It is dangerous to use benchmarks in designing new machines, but not as dangerous as not using them, since the best you can say about computer systems designed solely according to qualitative precepts instead of data is that they are slow and unsuccessful. But relying too heavily on a benchmark has its own pitfalls. The most serious is myopia: forgetting that it is, after all, just a benchmark, and therefore represents only one kind of programming and only to some limited degree.
Myopia develops in a manner something like this. After driving DFA all night, an architect announces that benchmark X is showing a performance improvement of only 1.6x over P5 instead of the targeted 2x, and the reason is a combination of memory stack references, which are inherently sequential and thus resistant to some kinds of out-of-order improvements. The architect proposes that to fix this problem, the microarchitecture must now include heroic measures associated with stacks. After several days of intense effort, the architects believe they may have a palliative measure that they can actually implement, and that it should help benchmark X. They modify DFA to include the new design and reanalyze the benchmark. Not wanting to seem ungrateful for everyone’s hard work, DFA reports that benchmark X is now 1.93x, which the happy architects agree is close enough for now.
The next night, however, when the same architects run another performance regression (which they do because the project is so well managed) they are not as happy. DFA reports that the new changes have helped 70% of all benchmarks, some a little, some a lot, but the new changes have also slowed the other 30%, and in benchmark Y’s case the slowdown is rather alarming. Now what? The architects have several choices. They could
1. Back out the heroic measures and try to fix benchmark X some other way
2. Leave the heroic measures in and try to fix only the worst of the collateral damage
3. Reconsider their benchmark suite and ask how important benchmark X is versus benchmark Y
4. Intensively analyze benchmark Y to find out why a seemingly innocuous change like the fix for X should have had such a surprising impact on Y (a choice that invariably leads back to choice 1 or 2)
Such do-loops can rapidly become demoralizing, as architects spend all their time pleasing the benchmark and lose sight of the project goal: truly doubling performance over the previous generation. Moreover, given the difficulty and intellectual immersion the do-loop demands, architects must make superhuman efforts to rise above it and ascertain that project performance is really still on track.
Floating-Point Chickens and Eggs. Intel Architecture x86 chips had not been known for industry-leading floating-point performance in the 1980s and early 1990s. In fact, their relative anemia on floating-point code was a major reason that RISC vendors had the engineering workstation market to themselves at that time. But because of that history, we were not being asked for much floating-point performance from the original P6 chip. Essentially, two times the P5’s floating-point performance was not very hard to achieve, and because x86 floating point had been slow for so long, no x86 applications needed fast floating point.
This was a classic chicken-and-egg scenario. Nonetheless, we believed that the P6 family would find a ready home in workstation and server markets if it had competitive floating-point performance, so we included it in our design decisions even though normal benchmarking would have excluded it.
Floating-point performance also raises some subtle issues in project goal setting. Intel had a large market share in the 1990s; it supplied the microprocessors for over 80% of all desktops. Knowing this, an insidious mental distortion can easily set in. One could start thinking that whatever Intel does is de facto good enough, and that whatever floating-point performance our chips supplied, the industry would simply adjust its software demands accordingly.
But that is no way to keep an architectural franchise healthy. It is a much better idea to seek a balance between what the technology makes possible, the cost of various design options, and what buyers can afford. If one makes that tradeoff optimally, one’s products will be competitive independent of existing market share. That is what we tried to achieve with P6.
Legacy Code Performance
The hardware design community is often guilty of overlooking a glaringly obvious market truth: Buyers spend their money to run software; hardware is only a necessary means to that end. Yes, some people do buy computers (or at least they did at one time) just to have bragging rights to the fastest machine on their block. And some have bought computers just to run some new application, not much caring about the existing software legacy base. In the late 1990s, about one to three million new microprocessors could be expected to be sold at an exciting markup to people playing leading-edge games. For these users, the additional performance edge was tangible and valuable.
But the economics of microprocessors reward much higher volumes. Mainstream x86 designs can expect to ship upward of 100 million units, which is where their true return on investment lies. For P6, this meant we had to pay attention to the existing code base during the concept phase, not just new benchmarks.
In many ways, legacy code is a more difficult target for a new microarchitecture. A mountain of x86 code has accumulated since the 1970s. Designing an engine to correctly run both legacy code and modern compiled code at world-class performance levels is like designing a new jet fighter to run on anything from jet-A fuel to whale blubber and coal.
The compiler group could not help us here. The legacy code was compiled some indeterminate number of years earlier, and its target machine was now considered thoroughly obsolete. The compiler probably optimized for all the wrong things compared with more modern designs.
For us, the most difficult aspect of legacy code was judging how important relative performance would be. Would this software continue to be heavily used? Would anyone run it on the new platform? If the new microprocessor sped up this code, would anybody care or even notice? Even if our best judgment said that existing code was in heavy use and would be used on our new machine, and that performance was potentially important, we still had to contend with the expense of accommodating changes aimed specifically at the legacy code.
Early in P6’s microarchitecture development, for example, we realized that general computing’s future would be 32 bits, not the 16-bit code characteristic of the i386 and i486 era. Eventually, it would migrate to 64 bits, but our collective best judgment in 1991 was that the final move to 64 bits would not impact the P6 family, so we decided to focus our efforts on 32-bit performance. We checked with Microsoft about their plans for migrating Windows operating systems and were assured that their “Chicago” release (eventually renamed Windows 95) would be 32 bits only, all legacy 16-bit code having been excised. That was great news for us, until late in 1993, when Microsoft admitted that they had been unable to replace the old 16-bit graphics display drivers with the new 32-bit versions they had created for Windows NT and would have to retain the old code to maximize hardware compatibility with old monitors and video cards. I describe the scramble that ensued in “Performance Surprises” in Chapter 5.
Intelligent Projections
Performance projections are extremely important to a design project, and no less so to the corporation itself. Big design companies such as Intel will usually have several CPU design projects going at any time, plus their associated chip-set developments, and many other related projects, such as new bus standard developments, electrical signaling research efforts, compiler development, performance and system monitoring tools, marketing initiatives, market branding development, and corporate strategic direction setting.
All these must be coordinated. It is hard enough to convey information accurately to a customer about a new design’s schedule, performance, power dissipation, and electrical characteristics, without also having to explain why the design appears to compete with another of your company’s products. Customers have the right to assume (and they will) that their vendor has its act together and is actively coordinating its internal development projects so as to achieve a seamless, comprehensive product line that makes sense to the customer and is something the customer can rely on in making their own product road maps. That customer’s customer will assume the same thing of them, after all.
This is a challenge in any venue. Have you ever watched a duck languidly gliding across a pond? That placid appearance doesn’t come without furious churning below the surface. A seamless product line is no different, except for the feathers and webbed feet.
All Design Teams Are Not Equal. Many executives approach the task of managing multiple design teams by reasoning that if all design projects do things the same way and use the same tools to do their performance projections, comparing their expected outcomes should be rather trivial. When such grand unification attempts inevitably fall short, they blame “not invented here” and interproject rivalries. Those can be major influences to be sure, but a more important reason all projects can’t do things the same way is that they aren’t designing the same things, and the teams do not comprise the same designers.
Sports teams provide a persuasive analogy. One basketball team might have two very tall players, and base their game strategies on getting the ball to them. Another might have fine outside shooters. A third could be exceptional at passing and running. Design teams, too, are a collection of all the individual talents and abilities of their members, amplified (or diminished) by interpersonal team dynamics and overall management skill. To treat all teams the same is to cripple the exceptional teams while implicitly insisting that the weaker design teams somehow perform above their capability.
I believe a company’s strongest design team must be allowed to do what its leaders believe is necessary to achieve the finest possible results. Other teams in the company may find it helpful to follow their lead, or may themselves be strong enough to strike out in new directions. Upper management must make this judgment call; if they refuse, and instead end up with a one-size-fits-all direction, it’s almost certainly wrong.