I think the future is mobile. In particular, I think the future is battery- or fuel-cell-operated, and that what will be valued in the mainstream a decade or two from now will be sufficient performance to accomplish some desired end goal (like HDTV playback or GPS reception) at very long battery life. This is a daunting goal, and so far removed from historical CPU design goals (especially the highest possible performance at all costs) that it will take a minor miracle to get existing teams to achieve it. Without inspired direction from corporate leadership, it is not going to happen. Business as usual is no longer going to work. I left Intel because I did not want to be the one who proved that.
AND IN CLOSING I’D JUST LIKE TO SAY ...
You can spend your entire life as a design engineer and never have the good fortune to find yourself on a team of incredible people like I did on the P6 project, but I hope you do. It is an experience that will elevate your own abilities, give you a feeling of intense acceleration, like a roller coaster heading over the steepest drop, and the incredible sensation of having your own intellect amplified beyond reason by proximity to so many brilliant people. Engineering is fun, but engineering in a project like P6 is sublime.
Years later, when you think back over the project and the great products it created, there will be a feeling of immense satisfaction combined with awe at the sheer talent and dedication that were there in such abundance. In all your design projects, do your part to arrive at such a happy circumstance: Hold every project you find yourself in to the highest standards. Expect your team to attain greatness and never settle for less. If a particular project does not turn out the way you wanted, remember that mistakes are the learning curves with the steepest slopes, and redouble your commitment to the next project. As Goethe said, “Whatever you can do, or dream you can, begin it. Boldness has genius, power, and magic in it.”
APPENDIX
OUT-OF-ORDER, SUPERSCALAR MICROARCHITECTURE: A PRIMER
Assume that you are a programmer and your job is to write an assembly program that will cause the computer to calculate the sum of the integers 2 and 3. You fire up your text editor, and you type in something like this:
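(Program 1, in a representative notation; the exact mnemonics and register names are illustrative.)

    I1:  Load  R0, #2        ; fetch the constant 2 into register R0
    I2:  Load  R1, #3        ; fetch the constant 3 into register R1
    I3:  Add   R2, R0, R1    ; R2 = R0 + R1
    I4:  Store Output, R2    ; send the sum in R2 to the output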
Clearly, if these four instructions execute in the order specified {I1, I2, I3, I4}, the correct answer will appear on the output. It is also clear that if I1 and I2 are done in the reverse order, the program still gets the right answer. The Add in I3 does not care which value, 2 or 3, was loaded into its register first.
But instructions I3 and I4 cannot be done in any order other than that shown. If the Add occurs before both values have been loaded into their registers, then it will add whatever happens to be in those registers and cheerfully write that sum into R2. Likewise, if I4 occurs before I3 has executed, I4 will report whatever happened to be in R2 before this program even ran. There is a dependency chain here: I4 depends on the result of I3; I3 depends on the results from both I2 and I1. Therefore, I4 cannot execute until all three prior instructions have produced their results. There is no dependency between I1 and I2; they are completely independent instructions and, therefore, can execute in any order desired.
An out-of-order microarchitecture like the P6 is designed in such a way that it can detect instructions that must execute in program order (because of true data dependencies, like the one between I4 and I3 in the example). There is no magic here; if instructions must execute in a particular order to get the correct result, then the P6 carefully observes that order. But if (as often happens) instructions do not have to execute in the order in which they happen to be listed in the program, the P6 will execute them in whatever order it deems best.
P6’s microarchitecture is also superscalar. This means that it has enough hardware resources to execute multiple instructions at the same time, a potentially very large performance gain, if those resources can be kept usefully busy. For Program 1, the two load instructions can run at the same time, because neither depends on the other. But I3 and I4 must wait until both loads have completed, and I4 must wait until I3 has completed. True data dependencies prevent out-of-order execution, as well as simultaneous superscalar execution.
Running Program 1 out of order probably does not seem like a very promising performance win, although shrinking a four-cycle program to three by performing the two loads in parallel actually is a significant gain. But consider a slightly more complicated example. Suppose your task is to sum the numbers from 1 to 10. Ignore Gauss’s theorem for solving this immediately; think about a straightforward extension of the code in Program 1. Program 2 has to do ten loads to get the numbers being summed. Our computer can do only pairwise additions, so five adds produce five intermediate results, and four more adds complete the calculation.
Program 2 thus has considerably more leeway for executing out of order or concurrently. In this example, I have assumed a microarchitecture with at least 18 registers, and I have not reused any of them, to make the example easier to understand. Program 2 needs to do ten loads, which are independent of each other. Assume that the out-of-order machine can identify those and run them two at a time, and, to keep things simple, that all instructions take one clock cycle to complete. The loads in I1 and I2 will execute in clock cycle 1 and, therefore, I3 can execute in clock cycle 2 (both of its operands being ready). The loads in I4 and I5 can execute as a pair in clock cycle 3, followed by I6’s add in clock 4, and so on, down to the series of adds starting with I13.
There are some data dependencies in I13-I16. I16 needs the result of I15 before I16 can execute. I15 needs I14’s result. I13 needs I2’s and I6’s results.
With this judicious code rearrangement, the program still gets the correct result but takes 12 clock cycles instead of the original 20.
Part of the motivation for designing an out-of-order, superscalar engine is to exploit these opportunities for concurrent instruction execution. Equally important is the out-of-order engine’s tolerance of cache latency. Caches are fast, physically local hardware structures that maintain copies of what is in main memory. When the processor executes a memory load to an address for which the cache happens to have a copy, that load is said to hit the cache and might take only one or two clock cycles. If the cache does not have a copy, the load misses the cache, and the load instruction incurring the miss must now wait until the hardware contacts main memory and transfers a copy of the missing data into the cache.
In P6’s era, cache misses required a few dozen clock cycles to service. With the Pentium 4 generation, a cache miss could take several hundred clock cycles, and the general trend is worsening with each new process generation. As in Program 2, most code must perform a series of loads to get data, then operate on that data, and, finally, send out the data just created. Until the loads have completed, nothing else can happen; the program stalls for the duration of a cache miss.
An out-of-order engine partially circumvents this limitation: If one load misses the cache, all instructions that have a true data dependency on that load must themselves wait. But any instructions that are not data-dependent on that load, including other loads, can “go around” the stalled load and try their luck.
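As a small sketch, in the same illustrative notation as Program 1, suppose the first load misses the cache while a later, independent load does not:

    I1:  Load R0, [A]        ; misses the cache; stalls for the full miss latency
    I2:  Add  R2, R0, R1     ; needs R0, so it must wait along with I1
    I3:  Load R3, [B]        ; independent of I1; can issue immediately and may hit the cache
    I4:  Add  R4, R3, R1     ; depends only on I3, so it too can complete ahead of I1 and I2

An in-order machine would sit idle behind I1; the out-of-order engine keeps I3 and I4 usefully busy during the miss.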
PLAUSIBILITY CHECKING
Initially, we intended the P6 DFA to answer one simple question: Did the published performance results for out-of-order engines apply to an Intel Architecture design? Was it at all plausible that an out-of-order design would speed up x86 code execution?
We had grounds to worry that it wouldn’t. All modern instruction-set architectures have reasonable numbers of registers, fast temporary storage that instructions access directly. The Intel Architecture has only four general registers, EAX, EBX, ECX, and EDX, plus three others that can sometimes be used. When these are not enough to support a given computation, the rest of the storage must reside in main memory, increasing the load on the caches, bus, and memory. Before we bet the company’s future on an out-of-order microarchitecture, we wanted some evidence that it would work.
The best model for an out-of-order microarchitecture would be a program that includes all the relevant structures, programmatically interconnected the same way they would be in the actual design. All unit interfaces would be present just as in the actual design, and units would interact (logically) exactly as their silicon counterparts would.
The trouble with this kind of model is that you have to design the machine before you can model it. By the time you have finished writing the model, a year has gone by, and you are committed to the design before you have ever really used the model to verify the approach. This is not project risk mitigation; it is prototyping.
We needed a way to evaluate design choices that did not require completing the entire design before obtaining useful results. To illustrate my point, consider real code (which is exactly what we did in the P6 project, although it was a bit more complex). This C code is for a slightly modified “hello world,” with some extraneous integer manipulations thrown in to make things more interesting:
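(A representative rendering; the exact printf string is incidental.)

    #include <stdio.h>

    int main(void)
    {
        int i, j;

        for (i = j = 0; i < 25; i++)
            j += i;

        printf("hello, world %d\n", j);
        return 0;
    }

Compiled for the Intel Architecture (AT&T syntax), the heart of the function comes out along these lines; this is a plausible rendering, not a specific compiler’s verbatim output:

    main:
        pushl  %ebp                 # save the caller's frame pointer
        movl   %esp, %ebp           # set up our own stack frame
        xorl   %eax, %eax           # i = 0
        xorl   %edx, %edx           # j = 0
    .L5:
        leal   (%edx,%eax), %edx    # j += i
        incl   %eax                 # i++
        cmpl   $24, %eax            # compare i with 24 ...
        jle    .L5                  # ... and loop while i <= 24
        ...                         # (call to printf and function exit omitted)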
Main initializes the registers, then we enter the loop at .L5, corresponding to for (i = j = 0; i < 25; i++) j += i; in the C program. The lea instruction uses registers eax and edx as inputs, and writes edx with its output. lea is an Intel Architecture artifact that must support some extremely complicated addressing modes. Implementers are thus forced to provide address-generation units that are so capable they can be used as general-purpose integer computation functional units in some cases. This is one of those cases.
The incl instruction increments register eax, which implies that eax is both an input and an output for this instruction. The cmpl instruction compares the value in eax to the constant 24, and (implicitly) writes the flags register, which the conditional branch jle .L5 then checks.
Even this small code slice contains multiple chains of dependencies, which are the first-order determinant of the answer to the question, “If you had infinite hardware, what is the upper limit on code speedup you could get?” If you know the answer, you can then ask corollary questions such as, “What if the hardware weren’t infinite; instead of 3,000 integer ALUs, what if I only had four, but all else was as before?” It was DFA’s job to traverse a “trace” of the program execution, observe all true data dependencies, and then move all instructions to the earliest clock cycle possible with the specified hardware constraints.
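The core of that analysis is small enough to sketch in a few lines of C. The fragment below is only a schematic stand-in for DFA, not the real tool (which also handled memory dependencies, flags, and many machine configurations); it assumes unit-latency instructions, a fixed two-wide issue limit, and register-carried dependencies only, and it uses Program 1 as its input trace:

    /* A toy DFA-style scheduler: place each instruction in the earliest
       clock cycle in which all of its input registers are ready and an
       issue slot is still free.  Purely illustrative. */
    #include <stdio.h>

    #define NREGS   32
    #define WIDTH    2          /* assumed issue slots per clock cycle */
    #define MAXCYC 256

    typedef struct {
        const char *text;       /* disassembly, for printing            */
        int src1, src2;         /* input register numbers, -1 if unused */
        int dst;                /* output register number, -1 if none   */
    } Insn;

    int main(void)
    {
        /* Program 1, encoded as register read/write sets. */
        Insn trace[] = {
            { "Load  R0, #2",      -1, -1,  0 },   /* I1 */
            { "Load  R1, #3",      -1, -1,  1 },   /* I2 */
            { "Add   R2, R0, R1",   0,  1,  2 },   /* I3 */
            { "Store Output, R2",   2, -1, -1 },   /* I4 */
        };
        int ready[NREGS]  = { 0 };  /* cycle in which each register becomes valid */
        int slots[MAXCYC] = { 0 };  /* issue slots already used in each cycle     */

        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            Insn *in = &trace[i];
            int cycle = 0;

            /* True data dependencies: wait until both source registers are ready. */
            if (in->src1 >= 0 && ready[in->src1] > cycle) cycle = ready[in->src1];
            if (in->src2 >= 0 && ready[in->src2] > cycle) cycle = ready[in->src2];

            /* Finite hardware: wait for a free issue slot. */
            while (slots[cycle] >= WIDTH) cycle++;
            slots[cycle]++;

            if (in->dst >= 0) ready[in->dst] = cycle + 1;  /* one-cycle latency */
            printf("(%d) %s\n", cycle, in->text);
        }
        return 0;
    }

Run as is, the sketch prints cycles 0, 0, 1, and 2 for I1 through I4: the three-clock schedule described earlier. DFA did the equivalent bookkeeping over real x86 traces, with command-line switches selecting the hardware configuration.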
For example, the first instruction in this program is pushl %ebp. This x86 instruction is actually a sequence of two operations, one to copy the value of ebp onto the program stack, and the other to adjust the top-of-stack (TOS) pointer appropriately. Copying ebp to the right place on the stack requires the value of the top-of-stack pointer so, clearly, that copy must precede the TOS pointer adjustment. But the two xor instructions that follow do not depend on either of the two operations from the pushl instruction; both can proceed in parallel with the ebp copy in clock cycle 0, if the hardware configuration is sufficiently robust, as specified on the command line to DFA.
What do you get when you xor a register with itself, as in xor eax, eax? You get a zeroed register. If the goal was to clear a register, why would you use this convoluted method instead of simply moving a constant 0 into the register? Because xor eax, eax happens to be more efficient object code, and became popular in the Intel 486 era, when disk and memory were extremely limited. In our out-of-order world, however, this construct could be problematical; it appears that this instruction cannot execute until the most recent instruction preceding it that wrote eax has completed. We eventually put hardware in the P6 decoder to notice this special construct and allow this instruction to proceed without (pointlessly) waiting for any previous register writers.
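The attraction is code size; the standard 32-bit encodings make the tradeoff plain:

    movl  $0, %eax       # 5 bytes of object code (b8 00 00 00 00)
    xorl  %eax, %eax     # 2 bytes (31 c0), with the same architectural result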
The output of early DFA runs looked like the following:
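(A representative reconstruction; the push’s address and the disassembly column are illustrative, while the cycle assignments and the remaining addresses follow the discussion below.)

    (0)  0x14e8   pushl  %ebp
    (1)  0x14e9   movl   %esp, %ebp
    (0)  0x14eb   xorl   %eax, %eax
    (0)  0x14ed   xorl   %edx, %edx
    (1)  0x14ef   leal   (%edx,%eax), %edx
    ...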
The first number (in parentheses) is the earliest clock cycle in which DFA believes (subject to hardware constraints such as the number of ALUs) a given instruction could be executed while still observing true data dependencies. The first instruction seen is assigned clock cycle 0. The next one, the mov at 0x14e9, has an input that depends on the outcome of the push (through the esp register). So DFA schedules the mov for cycle 1. The xors at 0x14eb and 0x14ed don’t depend on any previous instruction (being an x86-ism for clearing the registers), so they are scheduled as early as possible, during clock cycle 0. The lea at 0x14ef requires register edx as an input, so it must be scheduled after the previous write to edx has completed; that would be the xor edx that was just scheduled for clock cycle 0, so this lea is scheduled for clock cycle 1, and so on.
With its ability to show how existing x86 traces would map onto a wide range of theoretical out-of-order engines, DFA showed us which microarchitectural features were important. It helped us decide the sizes of temporary staging areas and predicted the resulting performance. This support helped establish that the project’s performance goals were reachable within the silicon die area constraints we had.