The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners)


by Robert P. Colwell


  A design project will always have unquantifiable issues.

  My point is that a design project will always have unquantifiable issues. To resolve them, about the best you can do is go with the intuition of your best experts. Don’t let them off the hook too easily, though. Pure intuition should be a last resort, after the experts have tried to find ways to first quantify that intuition. For P6’s technical issues that resisted quantification, we found that even a failed effort to quantify them was invariably instructive in its own right and well worth attempting. It also helped cement the data-driven culture we wanted.

  Managing New Perspectives

  Validation is not an active part of the concept phase, generally because validators are still immersed in testing the previous project. It should, however, be included in the refinement phase, partly to improve the validation team’s morale and partly to help spot the cases in which seemingly innocuous choices made now will have major validation impact later. Among the hundreds, sometimes thousands, of major unknowns still in this phase, the only difference between two workable approaches might be that one makes validation much easier.

  Architects must constantly fight the urge to put both hands in the cookie jar.

  Product engineering must also start to get involved in the refinement phase. In P6, we had been tantalized by a packaging engineer’s offhand comment that “if we wanted it, they could actually stick two silicon dies into the same package and run bond wires between them.” This looked like a great solution to the problem of keeping the L2 cache close to the CPU. At that time, process technology was not advanced enough to put the cache onto the CPU die itself. But if the L2 cache were a separate chip on the motherboard, electrical signaling constraints would make it too difficult to run the cache at the full CPU speed, substantially cutting overall system performance. We took the two-die-in-package approach and got the performance we wanted, as well as the desired MP capability. Years later, we were told that production of these two-die-in-package parts required manual intervention, which was fine for the limited volumes of servers and workstations, but would have been a crippling constraint on desktop volumes.

  Architects must constantly fight the urge to put both hands in the cookie jar. If they can resist, they’ll see great benefit in keeping production engineering in the loop, starting with the refinement phase, which is the perfect time to bring them up to speed on general project directions. Do not be surprised if what they say has a measurable impact on the project. As with validation, product engineering knows things that the technical development leadership needs to learn.

  For example, having decided that P6 would be capable of multiprocessor operation, the obvious question was how many microprocessors could be on the same bus. Simulations would readily tell us how much bus bandwidth we needed to reasonably support four CPUs on a bus, and accurate simulations would even illustrate the effects of bus protocol overhead. But bus bandwidth cannot be increased without limit because fast buses are expensive. The architects knew how much bandwidth a four-CPU system needed, the circuit designers knew how longer wires on the motherboard would affect the bus clock frequency (which is a first-order determinant of the bus bandwidth), but only the product engineering team knew how large the CPU packaging had to be. They also knew how many layers of signals would be available in the motherboard, the impedance and variability of each, the stub lengths to the CPU sockets, and the relative costs of the alternatives being considered.

  Planning for Complexity. It is important to plan ahead for the complex issues that will become part of the POR. When we chose to make P6 an MP design, I realized that, as a team, we were not deep in expertise on that topic, so I did some checking around. It was time well spent, because I found that all the CPUs reported in the industry that had attempted this feat had had major problems. Either they were late or they had design errata that prevented initial production from being usable in a multiprocessing mode.

  I proposed to the division’s general manager that the project should bring in new engineers, whose job would be to get P6’s MP story right on the first try. We recruited and hired a team of five, led by Shreekant (Ticky) Thakkar, an ex-Sequent Computer architect, to make sure that our CPU cache coherence schemes, our system boot-up plan, our frontside bus architecture, and our chip set would work together properly in an MP mode.

  BEHAVIORAL MODELS. Prior to the P6, Intel did not model its microarchitectures before building them. These earlier efforts began with block diagrams on whiteboards and leftover unit designs from previous projects. Some of the design team would start working on circuits, some would write a register-transfer logic (RTL) description, and the rest would devote themselves to tools, microcode, placement, routing, and physical design. Before joining Intel, I was one of seven engineers who designed the entire Very Long Instruction Word (VLIW) minisupercomputer at Multiflow Computer, a mid-1980s startup that went out of business in 1990 after shipping over 100 machines. At Multiflow, we didn’t even have units to borrow from, but we still didn’t do any modeling. We designed straight to hardware and did our testing and learning from the circuit boards themselves. If you saw one of our early memory controller boards, you would immediately grasp the dark side of that strategy: the early boards were festooned with over 300 “blue-wire” modifications, the fallout of a design oversight.

  The oversight was pretty fundamental and likely something that modeling would have revealed. Multiflow’s main memory incorporated error-correction coding (ECC). Along with the data being fetched, a set of syndrome bits was accessed, which could identify a single bit as erroneous. ECC bit flips are rare, and when one was detected, the machine required three more clock cycles to do the ECC check and correction. To maintain performance, we did not want to routinely run all memory accesses through the ECC circuits; instead, our plan was to stall the entire machine during the rare times an ECC flip had to be corrected. Unfortunately, we forgot to provide a way to stall everything and steer the data through the ECC circuit on a bit-flip detection. Oops!
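
  To make the intended mechanism concrete, here is a minimal C sketch of that fast-path/slow-path split. It is only an illustration: the names are hypothetical, the ECC is grossly simplified (the syndrome here directly names the flipped bit, where a real SECDED syndrome would need decoding), and in the real machine this was pipeline control logic, not software. The stall function is exactly the piece the early boards lacked.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: bypass the ECC circuit on the common path, stall and
 * correct only when the syndrome flags an error. */

static int stalled_cycles = 0;              /* stand-in for a hardware stall counter */

static void stall_pipeline(int cycles)      /* the mechanism the early boards lacked */
{
    stalled_cycles += cycles;
}

static uint64_t ecc_correct(uint64_t data, unsigned syndrome)
{
    return data ^ (1ULL << (syndrome - 1)); /* flip the bit the syndrome identifies */
}

static uint64_t memory_return(uint64_t data, unsigned syndrome)
{
    if (syndrome == 0)
        return data;                        /* common case: skip the ECC circuit    */

    stall_pipeline(3);                      /* rare case: freeze the machine ...    */
    return ecc_correct(data, syndrome);     /* ... and steer through the corrector  */
}

int main(void)
{
    printf("%llx\n", (unsigned long long)memory_return(0xF0F0, 0)); /* clean read    */
    printf("%llx\n", (unsigned long long)memory_return(0xF0F1, 1)); /* bit 0 flipped */
    printf("stalled %d cycles\n", stalled_cycles);
    return 0;
}
```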

  For the most part, seat-of-the-pants computer design seemed to be extremely efficient, especially given an experienced team with high odds of making compatible implicit assumptions. But there are limits to how much complexity can or should be tackled that way. For an in-order, two-way superscalar design such as the original P5, seat-of-the-pants was doable, especially since the P5 design team had just completed what amounted to a one-pipeline version of the same chip, the Intel 486. (I’m not advocating this design approach, mind you; I’m just relaying history and explaining why the P5 design team succeeded despite the lack of performance or behavioral models to guide their decisions.)

  Without a performance model to keep us honest and expose where our intuitions were wrong, we could have easily created a P6 with all the complexity and none of the performance advantage we had envisioned.

  We could not have done P6 this way. In a section of the previous chapter (The “What Would Happen If” Game in Chapter 2), I described the process we followed in recalibrating our intuitions about out-of-order microarchitectures. Without a performance model to keep us honest and expose where our intuitions were wrong, we could have easily created a P6 with all the complexity and none of the performance advantage we had envisioned.

  In the same section, I also said that DFA was a peculiar model. Wielded properly, it could tell you the performance impact of various microarchitectural choices, but because it was arriving at its answers via a path other than mimicking the actual design, there were limits to how deeply it could help you peer into a design’s inner workings.

  So, somehow, we had to get the project from a DFA basis to a structured RTL (SRTL) model, because SRTL is the model all downstream design tools use and the one that is eventually taped out. In some form, nearly all design activities of a microprocessor development are aimed at getting the SRTL model right. The question was how to translate our general DFA-inspired ideas for the design into the mountains of detail that comprised an SRTL model.

  It seemed to me that we needed a way to describe, at a high level, what any given unit actually did, without having to worry too much about how that unit would be implemented. In other words, we wanted a behavioral model that would be a stepping stone to the SRTL model. This behavioral model would be easy to write, since most of the difficult choices surrounding the detailed design would be elided, yet the model would still account for performance-related interactions among units. Rather than describe a 32-bit adder in terms of dozens of two-input logic gates, for example, we could write a placeholder model for that unit with code like OUT := IN1 + IN2. Because we could create such behavioral code quickly, we could build a working performance model without having to delay all performance testing until the most difficult unit worked through all its design conundrums.
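
  To show the level of abstraction we were after, here is a hypothetical placeholder for such an adder unit, sketched in C rather than iHDL (the struct and names are mine, not the project’s): no carry trees, no gate-level structure, just the function itself plus the latency the performance model needs to charge. A structural version could later replace the body behind the same interface.

```c
#include <stdint.h>

/* Hypothetical behavioral placeholder for a 32-bit adder unit. */

#define ADDER_LATENCY_CYCLES 1

typedef struct {
    uint32_t out;       /* result of OUT := IN1 + IN2       */
    int      latency;   /* cycles the scheduler must charge */
} unit_result_t;

unit_result_t adder_behavioral(uint32_t in1, uint32_t in2)
{
    unit_result_t r = { in1 + in2, ADDER_LATENCY_CYCLES };
    return r;
}
```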

  Just to get our ideas onto firmer footing, we started writing a C program to model the essentials of the P6 engine we had conceived. We called this program the grassman, because it wasn’t mature enough even for a strawman. After a few weeks, we had coded the basics, but we had also realized that the coding would take several more months to complete and that the result would not translate easily to SRTL.

  While reading the SRTL manual one day, I noticed that Intel’s hardware description language (iHDL) referred to a behavioral capability. I checked around and found that the language designers had anticipated our quick prototyping needs and had made provision for behavioral code. “Just what we need!” I thought, and went about collecting enough RTL coders to begin the P6 behavioral model development. Equally exciting was the prospect that we would not have to translate our behavioral model into SRTL when we completed it. We would write everything behaviorally at first, and then gradually substitute the SRTL versions in whatever order they became available. This process also promised to free the project of the “last unit” effect on performance testing I described earlier.
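
  The substitution strategy is easy to picture as a common interface that both a behavioral and a structural implementation can satisfy, so units get swapped in one at a time as their detailed versions come online. Here is a rough sketch of that idea in C, with function pointers standing in for iHDL’s module binding; the names and mechanism are hypothetical, not how iHDL actually expressed it.

```c
#include <stdint.h>
#include <stdio.h>

/* The full-chip model calls every unit through one interface; each unit slot
 * is bound to either its quick behavioral version or its detailed one. */

typedef uint32_t (*alu_model_fn)(uint32_t a, uint32_t b);

/* Day-one behavioral placeholder. */
static uint32_t alu_behavioral(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Later, detailed structural model (here, a toy ripple-carry adder). */
static uint32_t alu_structural(uint32_t a, uint32_t b)
{
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum   |= (ai ^ bi ^ carry) << i;
        carry  = (ai & bi) | (carry & (ai ^ bi));
    }
    return sum;
}

int main(void)
{
    alu_model_fn alu = alu_behavioral;   /* start fully behavioral            */
    printf("%u\n", alu(40, 2));

    alu = alu_structural;                /* swap in the detailed unit when ready */
    printf("%u\n", alu(40, 2));
    return 0;
}
```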

  The only problem was that we were the first to ever try using iHDL’s behavioral mode. Undaunted (no one had ever tried to do an out-of-order x86 engine, either), we launched into creating the behavioral model with a few designers and five architects. I estimated to my management that we would have it running code within six months: Christmas or bust. I wrote three quick C programs, carefully tailored to need only what would be running by then, and posted them as the immediate targets.

  On the second day of the behavioral RTL (BRTL) effort, it was clear we were months off the schedule I had conceived only yesterday. The behavioral mode of iHDL had major shortcomings. Some were built into the language, some were the growth pains of any new tool technology, and some were just the inevitable surprises that are always present at the meeting of reality and ideas. But the main reason my schedule was so far off was that I had built it assuming that 100% of the BRTL effort would go into coding. I had not accounted for so many unresolved microarchitectural issues, some of which were quite substantial. In my defense, I wanted to begin the BRTL in part to force our microarchitectural unknowns to the surface. I had just grossly underestimated how many unknowns there were. I wasn’t even close.

  I sidled abashedly into my boss’s office and told him it was already clear that yesterday’s schedule estimate had been wildly off, and please could I have a few dozen really bright design engineers to help me write the BRTL. After an obligatory and well-deserved upbraiding, Fred Pollack and design manager Randy Steck agreed to commit the core of the design team to helping create the behavioral model.

  There ensued an interesting symbiotic (and unanticipated) connection between the newly deputized BRTL coders and the microarchitectural focus groups in which they were already working. Looking back, the pattern is much clearer than it seemed when we stumbled into this arrangement. Writing a computer program to model something will always reveal areas of uncertainty and issues with no clear answers. Likewise, writing a model forces subtle problems into the open. A common syndrome when working in abstract conceptual space is to believe that you have a solution for issue A, another for issue B, and so on. When you look at each solution in isolation, it seems feasible, but writing a model forces you to consider solutions A, B, and so on together. Only then do you realize that the workable solutions are mutually exclusive at a deep, irreconcilable level. Writing the behavioral model raised questions, and the focus groups set about resolving them. This virtuous loop continued throughout the behavioral model coding, for approximately 18 months.

  We ended up meeting our Christmas goal by sprinting from Thanksgiving on. The core BRTL team worked essentially continuously for six weeks on nights, weekends, and holidays. Two things became clear during this time. First, iHDL’s behavioral mode was only marginally higher in abstraction than the usual structural iHDL. Second, because we were committed to our initial concept of how P6’s out-of-order engine would work, we would not have enough schedule slack to make any substantial changes to that concept.

  In other words, we were stuck. So much for risk-reducing, quick performance models. Even with BRTL, we still had to identify every signal in and out of any given unit with proper unit pipelining and staging. That level of detail implies the same detailed designing required for SRTL, and that process takes time. Luckily (both in the sense of capricious luck and in the sense of luck related to hard work), our initial concepts proved to be workable and the BRTL helped us iron out the details.

  When we started the BRTL effort, we had hoped that having the BRTL as an intermediate project milestone would make it easier to get the SRTL going later. Having now complained that the abstraction level of iHDL’s behavioral mode was too low, it’s only fair to add that this made conversion to SRTL vastly simpler. And since the same designers who wrote the behavioral description of their unit were responsible for writing the SRTL, we avoided an entire class of potential translation errors. In essence, BRTL took too long, but SRTL was shortened and was of much higher quality in terms of functional validation.

  We purposely limited these new BRTL teams to fewer than ten people each, and through Randy’s customary foresight, we “rolled them up” through the lead architects, not their usual management chains. This combination of organizational tactics meant that the architects were firmly and visibly ensconced as intellectual leaders of the various refinement efforts, thus encouraging them to lead their new subteams directly (and quickly).

  I had hoped to keep the behavioral model up to date as project emphasis shifted to SRTL, but that wasn’t practical. The BRTL gradually morphed into the SRTL, and asking every designer to maintain both models would have cost too much project effort and time for the return on that investment. We now had the butterfly, but sadly, the caterpillar was gone.

  For years after the P6 had gone into production, I would occasionally hear comments from BRTL participants that the “behavioral model wasn’t worth it.” I think they are wrong. They are remembering only the price paid, without carefully considering what was achieved or what the alternative would have been without the behavioral model. The P6’s BRTL was expensive to create, and there is still much room to improve the language and how we used it, but it still taught us that computer design has moved far past the point where people can keep complex microarchitectures in their heads. Behavioral modeling is mandatory. The only question is how best to accomplish it.

  MANAGING A CHANGING POR

  Behavioral modeling is mandatory. The only question is how best to accomplish it.

  The only constant in life is change, and this rule is omnipresent in an engineering project. At any moment, a project’s state comprises a vast amount of information, opinions, decisions made, decisions pending, the history of previous changes to the POR, and so on. One day, the project’s official plan might be to implement the reorder buffer as two separate integer and floating-point sections; the next day, it might be to combine the two sections. (For the Pentium Pro, that very decision was made over a napkin in a Chinese restaurant in Santa Clara, where General Tso’s chicken was the computer architect’s sustenance of choice.) It is imperative that projects have a means for unequivocally establishing the POR, as well as an organized, well-understood method for changing it.

  In the early 1990s, during P6 development, there were no Web sites or browsers. Files could be shared, and we briefly entertained the idea that the project POR would be kept in a file that all project members could access. Regrettably, we reverted to the familiarity of classical engineering practice. We kept the POR in a red-cover document, so it was sure to become dated within days of its release, and none of the current document holders could ever be sure they had the latest-and-greatest version.

  Some projects deal with this version problem by collecting all old documents and issuing the new one to the list of previous recipients. Ugh! A more reliable solution involving far less labor is to keep the POR online and rigorously control the means by which people can change it or access it.

  Humans are supposed to learn from their mistakes. If the project POR isn’t rigorously maintained in a place all design team members can access, they quickly learn that the only way to be sure of its present status is to go ask the person responsible for it. In the P6 project, I was that person, and it often felt as though the project POR was an amorphous cloud floating above my cubicle. I was determined to do better on the Pentium 4 project, but I don’t think I succeeded, and for a very dubious reason: an intra-project political power struggle, which was so unnecessary that the account deserves its own subsection (“ECO Control and the Project POR” on page 57).

  The Wrong Way to Plan

  While it seemed very large at the time, the P6 was a much smaller project than the Pentium 4 (Willamette) development, in every respect. It had far fewer designers and architects and had to satisfy only a handful of marketing engineers. (The P6 POR had to be acceptable to the project’s general manager and chief marketing engineer, as well as to the designers and architects.) The Willamette POR was an unstable compromise between mutually exclusive requests from mobile, desktop, workstation, server, and Itanium Processor Family proponents. Early in the Willamette project, our general manager appointed marketing as the POR’s owner, and marketing did what marketers always do: They called a series of meetings to discuss the issues. Have PowerPoint, will travel.

 
