
The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners)


by Robert P. Colwell


  Figure 4.2. Project tracking extrapolation as of 1991.

  The Experiment. Still, Randy was concerned. Why did a constant inflation factor seem to be built into the estimate of effort remaining? Unless we understood where that inflation was coming from, we could not be sure it would remain linear. It just seemed somehow counterintuitive that so many people were discovering new work at a substantial fraction of the rate at which they were finishing existing work. Surely we could do a better estimation job if we put our minds to it.
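  The effect of such a constant inflation factor can be sketched as a toy simulation. This is purely illustrative (the function and every number below are invented, not drawn from the project's actual tracking data), but it shows why discovering new work at a fixed fraction of the completion rate keeps pushing the projected finish date out past the naive extrapolation.

```python
# Toy model (numbers invented for illustration) of estimate inflation:
# each week the team completes some work, but also discovers new work
# at a constant fraction of the rate at which work is being finished.

def weeks_to_finish(work_remaining, rate_per_week, inflation):
    """Simulate weekly tracking until less than one work unit remains.

    work_remaining -- current estimate of work units left
    rate_per_week  -- work units the team can complete per week
    inflation      -- newly discovered work per unit of work completed
    """
    weeks = 0
    while work_remaining >= 1:
        done = min(rate_per_week, work_remaining)
        work_remaining -= done              # work finished this week
        work_remaining += inflation * done  # work discovered this week
        weeks += 1
        if weeks > 10_000:                  # inflation >= 1 never converges
            return None
    return weeks

# A naive extrapolation (no inflation) predicts 100 / 10 = 10 weeks,
# but a 30% discovery rate stretches the same project to 15 weeks.
assert weeks_to_finish(100, 10, 0.0) == 10
assert weeks_to_finish(100, 10, 0.3) == 15
```

  As long as the inflation stays below 1.0 the simulated project does converge, just later than the rollups predict; the worry in the text is precisely whether that factor would stay constant.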

  Randy decided to experiment. He spent a week or two working with all the design supervisors, showing them the overall rollups and the inflation factor, soliciting their inputs on its source and asking them to rethink the methods they were using when creating their estimates of work remaining. One hypothesis was that the engineers were only looking ahead one task at a time, consistently underestimating each task by the same amount. Perhaps all they needed was to extend their planning horizons. If we could get all of them to cast their nets more widely and to apply the fudge factors that the data had implied so far, perhaps we could remove the inflation once and for all.

  After Randy’s exhortations, the team did indeed look harder and increment their fudge factors. The result was an immediate, noticeable, but not overwhelming increase in the estimated amount of project work remaining. Over the next few months, the new aggregate estimate of work remaining stayed relatively constant. The inflation seemed to have been vanquished.

  But as you can see from Figure 4.3, relative to Figure 4.1 and Figure 4.2, the inflation was not really gone. It was just hidden by the artificially increased work estimates the experiment induced. In 1993, the inflation was back, and had picked up exactly where The Experiment had temporarily obscured it.

  The Mystery of the Unchanging Curve. I have always thought it a little spooky that the accomplishment curve, which combines so much data estimated from so many people, would revert to its original shape after a perturbation like The Experiment. It suggests the presence of an intrinsic “blindness” factor with respect to estimating work remaining, and that this factor is reasonably common across people. If so, it probably stems from the same source that influences so many project management books, which almost always recommend that project planners strive mightily to create an accurate estimate and then double or triple it.

  The accomplishment curve’s nonlinear upward trend is likely the result of two large-project realities: fluctuating team size and the accumulation of expertise. Project teams begin with perhaps a dozen people, and they ramp up to a peak that can number (as in P6’s case) in the hundreds, a peak hit several weeks before tapeout. Because the team is growing over the project’s lifetime, the amount it can accomplish per week grows as well. The peak-to-tapeout phase of a chip design, by the way, is exactly like standing on shore as a major hurricane hits land. The tapeout itself is the hurricane’s eye. As it passes, everything is temporarily calm while the fabrication team does its work for a few weeks, and there is comparatively little you can do until the silicon comes back.
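  The team-size effect alone is enough to bend the curve. As a hypothetical illustration (the ramp below is invented, not P6’s actual staffing history), if weekly output is simply proportional to headcount and headcount grows from a dozen people to several hundred, the cumulative accomplishment curve comes out convex:

```python
# Hypothetical illustration of the team-size effect: weekly output
# proportional to headcount, with headcount ramping from 12 people to
# 300 over 100 weeks. (Invented figures, not actual staffing data.)

def cumulative_output(team_size_by_week, output_per_person=1.0):
    """Return the running total of work completed, week by week."""
    total, curve = 0.0, []
    for headcount in team_size_by_week:
        total += headcount * output_per_person
        curve.append(total)
    return curve

# Linear ramp from 12 to 300 people over 100 weeks.
ramp = [12 + (300 - 12) * week // 99 for week in range(100)]
curve = cumulative_output(ramp)

# Convexity: the final week adds far more output than the first did.
assert curve[-1] - curve[-2] == 300.0
assert curve[1] - curve[0] == 14.0
```

  The expertise effect described next compounds this, since output per person also rises as the team learns the tools.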

  Expertise influences the curve because, after working on a design for a year or two, the design team becomes expert on the tools’ use, weaknesses, and workarounds. Errors and failures that would have cost an engineer three days early in the project will cost her very little later because she will have learned how to stay out of trouble in the first place.

  The work-remaining curve never actually did intersect with the accomplishment curve, mostly because it is no longer worth tracking when the project gets to within a few weeks of tapeout. We had always expected that when the difference between those curves reached zero we would have tapeout day. But that does not happen. Instead, as the engineers accumulate experience, they find more and more things they wish they had time to do. Validation learns a lot of very useful information as the project proceeds: which tests are finding lots of bugs and which are not, and which functional units are clean and which seem to be bug-infested. Validators will also pay attention to design errata being found in other chips in the company to make absolutely sure the company is never subjected to the agony of finding essentially the same embarrassing bug in two different chips. In general, validators will always be able to think up many more testing scenarios than any practical project will have the time and resources to carry out. And since their careers and reputations are on the line, they want to perform those tests. Project managers often have to modulate such work inflation, or the chip will tape out too late.

  Flexibility Is Required of All

  At the end of every engineering project is a crunch phase, during which the team is working at or beyond its maximum sustainable output and the mindset is very much that of a fight to the death. At that point, the engineers have been working on the project for years. All the fun and limitless-horizons “This is going to be the greatest chip ever made” thinking has now given way to the cold realities of compromise. Or worse, by now all the project engineers have become painfully aware of whatever shortcomings the project now includes. Years have elapsed since the early days of boundless optimism, and the competition has long since leaked details of what they are working on. Their designs are not as close to fruition as the press believes (and maybe as the competition believes), but the engineers on this project cannot help but compare that paper tiger to their almost finished creation and wince.

  Figure 4.3. Project tracking extrapolation as of 1993.

  Everyone is tired by this point: tired of working on this chip, tired of working next to these people, tired of doing nothing but working. Their spouses are tired of carrying the entire burden of running the home and family, and they may not be as inspired with the engineer’s spirit of enterprise. Design engineers are always a cantankerous bunch, but with all this in the background, they are now positively cranky.

  It is wise, nay essential, to manage projects so as to minimize this crunch phase. Six months in such a pressure cooker is about the most you can expect; longer than that and the project was not planned properly. Under no circumstances should an understaffed team be expected to make up for that management error by simply having a longer crunch phase. Design engineers who are dumb enough to fall for that ploy are not smart enough to design world-class products. You cannot have this one both ways.

  Having said that, the P6 project’s crunch phase lasted about seven months. Fortunately, the team came through it in fine form. I know that with several hundred people under this much pressure, some divorces and separations are statistically likely, but I still wonder if we could have done anything to improve that part of the equation.

  One of the key design managers on the P6 had observed the overall schedule closely and noticed that the team began with a very few people and ballooned to hundreds. If the initial small group were to settle on a microarchitecture earlier, he reasoned, he could trade the wall-clock time saved for the same amount of time later and still hit the overall project schedule. If the architects would just get their acts together, he reasoned, the project would save a lot of money (since time saved multiplied by the number of heads affected is so different in the two cases). He believed the absence of late changes to the design would obviate the crunch phase. He unilaterally announced this plan at the beginning of the Pentium 4 development.

  That was the start of two years of unpleasantness for the architects. Obviously, if we knew how to conceive a bulletproof, guaranteed-to-work microarchitecture on day 1, we would not have to spend days 2 through 730 continuing to work on it. We do not know how to do miraculous conceptions like that. What we know how to do is this: conceive promising approaches to problems, refine them until our confidence is high, combine them with other good ideas, and stop when we believe the final product will hit the project’s targets.

  The manager in question did not like that answer. He proposed that if the project’s architecture phase really had to proceed along those lines, then at least we could be honest about how expensive that part was. He was suggesting that we all spend more effort alone until such time as we could come up with a microarchitecture for which we could reasonably guarantee no more changes. If that made a 1.5-to-2-year architecture phase into a 2.5-year phase, so be it. At least the large design team would not be whipsawed by late changes. And he still believed he could avoid any crunch phase that way.

  I think he also believed that if upper management saw the “true cost” of designing the microarchitecture, much more pressure would be applied to the architects (earlier) instead of to the design team (later). In that sense, he was probably correct. By the time upper management got serious about applying schedule pressure, the project was in its final one to two years and the architects were no longer the critical path.

  I don’t know how to prove this, but I believe that doing microarchitecture development the way we did on the P6 (and to an extent on the Pentium 4) is the optimum method. The architects get some time to think, and they tell management and the design team when their ideas have matured to the point of usability. Like everyone else, architects learn as they go. They build models, they test their theories, and they try out their ideas any way they can. If they are taking the proper types and number of risks, then some of these ideas will not work. Most ideas will work well in isolation, or the architect’s competence is in question, but many ideas that look good in isolation or on a single benchmark do not play well with others. It’s not uncommon to find two ideas that each generate a few percent performance increase by themselves, but when implemented together jointly degrade performance by a percent or two.

  Designing at the limits of human intellect is a messy affair.

  Designing at the limits of human intellect is a messy affair and I believe it has to be. The danger and schedule pain of design changes are real, but so are competition and learning. Projects must trust their senior technical leadership and project managers to make good judgments about when a change is worth making. Attempting to shut this process off by applying more up-front pressure to the architects does nothing useful, and I can testify from personal experience that it damages working relationships all around.

  The Simplification Effort

  Complexity is a living, growing monster lurking in the corridors of your project.

  The dictionary says one of the meanings of the word “complex” is complicated or intricate. But those characterizations do not do justice to modern microprocessors. Trying to get your driver’s license renewed can be complicated; complexity, in the context of microprocessor design, is a living, growing monster lurking in the corridors of your project, intent on simultaneously degrading your product’s key goals while also hiding the fact that it has done so.

  That is how it feels to a project manager, anyway. For large projects like P6, hundreds of design engineers and architects make dozens of decisions each day, trying to properly balance their part of the design so that it meets all of its goals: silicon die area, performance, schedule, and power. Every decision they make affects multiple design goals. On any given day, simulations may be telling a designer that her circuit is too slow; she redesigns it for higher speed, but now the power is too high. She fixes the power but it now requires more area than her unit was allotted.

  Almost always, a designer’s desire to fix something about a design results in a more complicated design. Part of the reason is that the designer would like to fix, say, circuit speed without affecting any other goals. So rather than rethink the design from scratch, the designer considers ways to alter the existing design, generally by adding something else to it.

  On any given unit, this incremental accumulation of complexity is noticeable but not particularly alarming. It is only when you notice that hundreds of people are doing this in parallel, and you roll up the aggregate result, that the true size of the complexity monster lumbering around the project’s corridors becomes apparent.

  This complexity has many costs, but among the worst is the impact on final product quality in terms of the number and severity of bugs (errata). The more complicated a design, the more difficult is the task of the design team to get that design right, and the larger the challenge facing the validation team. While it is hard to quantify, it feels as though the validation team’s job grows as some exponential function of the design complexity.

  I used to go to bed at night thinking about what aspects of the P6 project I might be overlooking. One such night in 1992, I realized that this daydreaming had developed a pattern: It kept returning to the topic of overall project complexity. I knew that we were accumulating complexity on a daily basis and I knew that this complexity would cost us in design time, validation effort, and possibly in the number of design errata that might appear in the final product. What could be done about it? I briefly pondered going on a one-man crusade to ferret out places in the design where the injected complexity was not worth the cost, but there was too much work to do and not enough time.

  The last time in the project that I had found myself facing a task too big to handle alone, I had successfully enlisted dozens of other people on the project and together we got it done. Was there a way to do that again? Back in the BRTL days, the design engineers did not have more pressing concerns and were relatively easy to conscript, but the project had since found its groove and everyone was incredibly busy all of the time. So asking them to put down their design tasks and help me with a possible simplification mission would not be a low-cost effort.

  On the other hand, enlisting the design engineers themselves might have some tangible benefits besides additional sheer “horsepower” devoted to the task. They were the source of some of the added complexity, so they knew where to look. That could save considerable time and effort. And once they saw that their project leadership felt so strongly about this topic that we were willing to suspend the project for a couple of weeks in order to tackle it, perhaps the engineers would find ways to avoid adding unnecessary complexity thereafter.

  We launched the P6 Simplification Effort, explaining to all why some complexity is necessary but anything beyond the necessary is an outright loss, and got very good cooperation from the engineering ranks. Within two weeks we had constructed a list of design changes that looked as though they would be either neutral or positive in terms of project goals and would also make noticeable improvements to our overall product complexity. This experiment was widely considered to be a success.

  Just as I always do with die diets (mid-project, forced marches to make the silicon die smaller, mostly by throwing features off the chip), I wondered if some better up-front project management might have avoided the need for the Simplification Effort. I don’t think so. I think it is useful at a certain stage of a design project to remind everyone that there are goals that are not stated and are not easy to measure but are still worth pursuing. Perhaps stopping the project periodically has the same effect that “off-site” events have on corporate groups: it gives people time to take a fresh look at what they are doing and the direction in which they are going, and this is very often a surprisingly high-leverage activity.

  5

  THE PRODUCTION PHASE

  Two male engineering students were crossing the campus when one said, “Where did you get such a great bike?” The second engineer replied, “Well, I was walking along yesterday minding my own business when a beautiful woman rode up on this bike. She threw the bike to the ground, took off all her clothes and said, ‘Take what you want.’” The first engineer nodded approvingly and said, “Good choice; the clothes probably wouldn’t have fit.”

  I like this account of the overly focused engineering students for several reasons. First, it’s funny because it twists the reader’s expectations, playing to a stereotype of engineers as humorless optimization engines, while simultaneously rebutting that stereotype with the wry smiles of the engineers who read it.

  But I also like it because it graphically demonstrates the single-mindedness required of the engineering profession. Early in the design program, ideas flowed like water; the more the better. Architects, managers, and marketing people were encouraged to roam the product space, the technology possibilities, and user models to find compelling product features and breakthroughs. As the project evolved through its refinement phase, the vast sweep of possibilities was winnowed to only a few of the most promising. The realization phase settled on one of those semifinalists and developed it to a prototype stage. This is a nice, logical sequence that makes sense to most technical folks, who are often not prepared for what happens next, even though they think they are: production.


  After having spent 4+ years on the project, many engineers feel that their responsibilities have pretty much ended when they finish the structured register-transfer logic (SRTL) model. Their mindset is that they have designed the product they set out to create and now it is someone else’s job to make tens of millions of them. How hard can that be, compared to the intellectually Herculean task that has now been accomplished? The answer is, very hard, and it requires a whole new set of skills.

  To the eternal consternation of executives throughout Intel, it takes approximately one year to drive a new flagship design into production. A huge amount of work follows the initial tapeout, much of which cannot be done earlier, even in the unlikely event that engineers were available. The production engineering team (the corporate production engineers plus a substantial fraction of the original design team) must prove silicon functionality and show that circuits and new features work as intended with the compilers and other tools. The chip power dissipation must be within the expected range, the clock rate must hit the product goal, the chip must operate correctly over the entire target temperature and voltage ranges, the system must demonstrate the expected performance, and testers must create the test vectors to help drive production yield to its intended range. Any of these requirements could become problematic, so a great deal of highly creative engineering must literally be on call.

 
