The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners)

by Robert P. Colwell

Considering that the manager in question was me, I had a decidedly different viewpoint on the whole affair and reached a substantially different conclusion. First of all, it was not my product, it was the Santa Clara x86 team’s chip, and it was now their management’s responsibility to deal with it. Second, whether this Net denizen knew it or not, all technical products (not just Intel’s and not just Pentium) embody errors, inadvertent deviations from the published specifications. There has to be a formal communication channel between manufacturers and their customers in order to disseminate new information about product errata, and it cannot possibly be based on random Intel employees posting whatever they want.

This incident proved to Intel that the rules had somehow changed. In the past, people did not really know what a computer did, what it was good for, or why they should care about the answers to those questions. But now, even an obscure design erratum like FDIV had become cocktail party knowledge. Microprocessors had entered the popular culture. I was sure this was an important change in the rules, but as yet had no idea of what the new rules were.

Neither did Intel’s corporate management. For a few months, they pursued a rational approach to the problem that had historically worked very well: Commission an internal technical team to assess the problem, reassure Intel’s customers that we were looking into it, and if we concluded that there was a real problem, we would get back to them with our plan. After all, this is what the company had done for design errata on previous chips and this process had worked fairly well overall.

What nobody noticed was that the rules really had changed. Until the early 1990s, computer users tended to be relatively technically sophisticated, using the machines to pursue science, engineering, or dedicated tasks such as running spreadsheets or doing documentation. The microprocessors were much, much simpler, as was the software running on them. And if there were bugs, resolving them was an issue between the microprocessor vendor and that particular customer.

As Intel’s tech teams investigated the source and implications of the bug, Pentium system users found themselves, for the first time in history, with the ability to easily compare notes with each other. Somebody wrote a short application that could be easily downloaded and run that tested the floating point unit of the chip. If the program found a Pentium with the erratum, it informed the user that she had a defective microprocessor and should contact Intel for a replacement. Very few of those users could have written that test, but they could now all get a copy and run it.
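The text does not say what the downloadable test program actually computed. As an illustration only, the sketch below uses a widely circulated check from that period: divide 4195835 by 3145727, multiply the quotient back, and subtract from the original number. A correct divider leaves a residual of zero, while flawed Pentiums were reported to leave 256. The function names and the threshold are assumptions made for this sketch, not taken from any original tool.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y, which is 0.0 when the division is correct."""
    return x - (x / y) * y


def looks_flawed(residual, threshold=1.0):
    """Flag a divider whose residual is far larger than rounding error.

    The threshold of 1.0 is an arbitrary choice for this sketch: it sits well
    above any rounding noise and well below the reported residual of 256.
    """
    return abs(residual) > threshold


residual = fdiv_residual()
print("residual:", residual)
print("divider looks flawed" if looks_flawed(residual) else "divider looks fine")
```

A single operand pair like this can only confirm the flaw, not rule it out, since the erratum affected only certain operand patterns.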

From Intel’s perspective, this erratum was not all that serious. Most users did not use much floating point, the bug was very sporadic and manifested very infrequently, and a substantial amount of floating point code tended to be of the convergent kind (which could still be expected to reach a correct answer even if the bug did manifest; it just might take a few iterations longer to get there).
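To make the point about convergent code concrete, here is a minimal sketch of Newton's iteration for a square root in which one division returns a slightly wrong quotient. The injected fault (a relative error of 1e-4 on the fifth division) is an arbitrary stand-in chosen for illustration and does not model the real FDIV behavior. All names are hypothetical.

```python
def newton_sqrt(a, divide, tol=1e-14, max_iter=100):
    """Approximate sqrt(a) with Newton's method, using the supplied divide().

    Returns (root, iterations). Stops when successive estimates agree to
    within a relative tolerance of tol.
    """
    x = a
    for i in range(1, max_iter + 1):
        x_new = 0.5 * (x + divide(a, x))
        if abs(x_new - x) <= tol * abs(x_new):
            return x_new, i
        x = x_new
    raise RuntimeError("did not converge")


def exact_divide(n, d):
    return n / d


def make_flaky_divide(bad_call, rel_error):
    """Build a divide() whose bad_call-th result is off by rel_error."""
    calls = 0

    def divide(n, d):
        nonlocal calls
        calls += 1
        q = n / d
        if calls == bad_call:
            q *= 1.0 - rel_error  # inject one wrong quotient
        return q

    return divide


clean_root, clean_iters = newton_sqrt(2.0, exact_divide)
flaky_root, flaky_iters = newton_sqrt(2.0, make_flaky_divide(5, 1e-4))
print(clean_root, clean_iters)
print(flaky_root, flaky_iters)
```

Both runs settle on the same root, but the run with the bad quotient needs extra iterations to recover. Code that uses a division result once, with no later correction step, gets no such protection.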

The buying public was having none of it. In effect, they were holding Intel to the same standard as they had held the makers of Tylenol or packaged chicken in the supermarket: If the product isn’t perfect, as judged by the buyers, not by the producers, then the product should be replaced at the manufacturer’s expense. Intel was horrified by this prospect, because they knew that no high-tech products (or for that matter, any products of any kind) were perfect, and the vast majority of buyers lacked the necessary technical skill to differentiate bugs from normal design compromises, or to judge which bugs were of recall severity and which could simply be worked around.

Eventually, Intel’s chief marketing VP had an epiphany. He told Andy Grove, “We have been going about this all wrong. The rules really have changed, and here’s how. Suppose you went to the car dealership to pick up your new Mercedes Benz SLE500 sedan. The car’s up on a pedestal, rotating around, and you perform your final inspection before driving away in it. To your horror, you see a big scratch on the driver’s door. The engineer in you realizes that the cargo capacity is unchanged, the fuel economy is as before, the number of passengers is the same, the engine horsepower is undiminished, and so on. But do you drive the car away? No, you do not. Instead, you tell the dealer that you paid for a perfect car and that is all you’ll accept. That’s essentially what Intel’s customers are now telling us. They now set the rules.” Within a week the company had announced a formal recall program, with all of its design engineers assigned to phone banks to reach all 5,000,000 Pentium customers and arrange a replacement for their chips.

Ironically, the entire episode may have been a net benefit to Intel, even with the $475,000,000 price tag. After all, Intel often spends more than that on marketing campaigns. And in the end, just as celebrities and politicians sometimes request, “Say what you want about me, but spell my name right,” when the FDIV issue finally subsided, the Pentium name had achieved a remarkable recognition rate among the general population worldwide. It had provided an opportunity for Intel to demonstrate its willingness to back up that brand name, and it taught Intel an extremely valuable lesson about how the rules had changed, and who now owned the job of making those rules.

Why did Pentium have a flawed floating point divider, when its predecessor, the i486, did not?

For most of the Pentium design project, the floating point divider was exactly the same as the 486’s. But late in the Pentium project, upper management requested that the entire project search for ways to make the die smaller. (Smaller is always better in silicon chips, because you can cram more chips onto each wafer and the odds of each one working after fabrication go up as well.) Not all designers took this request very seriously, but the engineers working on the floating point divider did. They came up with an idea to save some space in a lookup table and one of them performed an analytical proof that the idea was sound. That proof turned out to be flawed, but the insidious side effect of having performed a “proof” was that it misled the Pentium validation team into thinking there was no real threat of new bugs due to the late change to the FP divider. (Normally, validators get very suspicious about late changes, because they know that where they see human fingerprints they will find design errata, and they have learned from harsh experience that late changes are the most dangerous of all.) So the bug came from the die size reduction effort, and it got past validation due to the flawed proof, which seems not to have been checked by anyone. (This was not surprising, because validation in 1993 did not generally have to check formal proofs.)

Here is the punch line: The smaller FP divider unit did not make the Pentium chip any smaller. Why? For the same reason that if you somehow removed Kansas from the United States, it would not make Canada and Mexico any closer together (see Figure 7.1). To make a chip smaller, you would have to save area across the entire X dimension, or across the entire Y dimension, or both. Saving a little in the X dimension, when a neighboring unit is the same size as before, simply makes no difference to the overall size of the die.

Figure 7.1. Removing the state of Kansas does not make the perimeter of the United States any smaller.
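The Kansas argument can be checked with a toy floorplan. The sketch below assumes, purely for illustration, that a die is laid out as horizontal rows of rectangular blocks and that the die is the bounding box around them. Every block name and dimension is made up and has nothing to do with the real Pentium layout.

```python
def die_size(rows):
    """Bounding box of a floorplan made of horizontal rows of blocks.

    Each row is a list of (name, width, height) tuples. The die must be as
    wide as its widest row and as tall as the sum of the row heights.
    """
    width = max(sum(w for _, w, _ in row) for row in rows)
    height = sum(max(h for _, _, h in row) for row in rows)
    return width, height


# Hypothetical floorplan: both rows are 10.0 units wide.
before = [
    [("cache", 6.0, 4.0), ("bus_unit", 4.0, 4.0)],
    [("integer", 7.0, 3.0), ("fp_divider", 3.0, 3.0)],
]

# Same floorplan with the FP divider narrowed by 0.5 in the X dimension.
after = [
    before[0],
    [("integer", 7.0, 3.0), ("fp_divider", 2.5, 3.0)],
]

print(die_size(before))
print(die_size(after))
```

Both calls report the same width and height, because the untouched row still sets the die width. The saving in the divider's row just becomes empty silicon.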

So FDIV was not just a design error. It was not just a design error plus a broken formal proof. It was not even just a design error plus a bad proof plus a validation oversight. It was, first and foremost, a conceptual error at the project management level, because one simply must not take chances like this late in a project, unless one’s back truly is against the wall. Such changes should not be allowed, much less encouraged, when there is no possibility of payback from them; they are all risk and no reward.

How would you respond to the claim that the P6 is built on ideas stolen from Digital Equipment Corp.?

It still rankles me that some people in my field think P6 was a great success mostly because we stole the good stuff from others. As far as I can tell, this idea got its start in 1997, when some Intel executives gave interviews introducing Intel’s new internal research groups. Unfortunately, the Intel executives were quoted [26] as having said the new research groups were an appropriate investment, because “we have to stop borrowing all our ideas from other people.” I was outraged at this mischaracterization of my design team’s creativity by my own management, and I stormed into my boss’s office and demanded that the next time these executives were in town for a review, I get the first 10 minutes of the meeting to redress this grievance.

Digital Equipment Corp. (DEC) officials had also read that interview, and they were apparently just as angry as I was, but for a different reason: They thought they were the ones getting ripped off. So they sued. Intel eventually settled the suit for $600,000,000, which convinced some people that we really had stolen DEC’s best ideas and that is why P6 was so good.

I did eventually get my chance to present my point of view on this to the executive in question, and I made it as clear as the English language and professional deportment allowed that P6 was a creation of Intel’s Oregon design team. For the same reasons that every new design incorporates the best of the art, there were features in the P6 for which one could trace a plausible ancestry. An example of this is a paper published by Yale Patt and Tse-Yu Yeh in 1991 on two-level adaptive branch predictors [35]. This paper appeared exactly when we had just realized that we needed a much better predictor than anything currently known. The general direction proposed by this paper helped us refine what eventually became our own branch predictor design. Likewise, certain functional blocks in the P6 were (purposely) named similarly to those in the original out-of-order design by Tomasulo at IBM in the 1960s, as a way of honoring the field’s pioneers. And as Prof. Wen-mei Hwu described in the foreword to this book, the rough outline of how one might go about designing a viable out-of-order engine had been trailblazed in the 1980s. This kind of idea sharing represents the best of what our academia/industry arrangement can achieve and I think both groups can be proud of their achievements. But in no case did we ever look at DEC patents or borrow any ideas from them or their products.

A similar case arose when Professor H. C. Torng of Cornell University somehow concluded that the P6 must have borrowed some of his patented ideas, and Cornell began sending annual licensing letters to Intel’s legal department. Every year, I would explain the situation: None of the P6 engineers had ever read the patents in question, so literally borrowing anything from them was impossible. True, since we had not read the patents, none of us could be sure what was in them, but I had read all of Prof. Torng’s publications while in graduate school (and I was the only one on the team who had). While his ideas were novel at the time they were published, they were clearly not relevant to what we were implementing on P6, and at no point were they even under active consideration for inclusion.

After several years of this ritual sword-rattling, imagine my astonishment when a colleague emailed me a URL that showed a smiling Prof. Torng holding a plaque and a check to Cornell for $2,000,000, awarded to him by an Intel VP and an Intel Fellow. The plaque said, “Thank you for your fundamental contributions to the P6 microprocessor.” I was nearly apoplectic at the thought that not only had my company failed to stand up for my design team, not only had people been awarded $2,000,000 for work that had nothing whatever to do with P6, but Intel’s legal department and upper management had not even bothered to inform us that they had found it more expedient to buy Cornell off than to fight the issue in court. We had to find out in the newspapers that we had once again been implicitly accused (by our own company) of illicitly appropriating other people’s ideas.

Having helped Intel deal with several lawsuits over the years, I do understand that the U.S. court system has a very difficult task in terms of providing justice in high-tech, complex cases in which a jury of one’s peers really lacks the background for grasping the necessary subtleties, and that situations sometimes dictate that a company occasionally has to treat such cases as business decisions, independent of what is truly right or wrong. There is a time and a place for business expediency. But there is never an excuse for treating one’s own design team this shabbily. It felt like we were stabbed in the back by members of our own team.

I eventually sent a very surly note to our executives along the lines of “If you intend to award two million dollars to every person on earth who did not contribute to the P6, the total is going to come to approximately twelve trillion dollars.” I never got a reply to that e-mail.

Whenever I hear an innocent athlete being accused of having succeeded only because he cheated by using steroids, I remember the DEC and Cornell incidents and I think I know exactly how he feels.

What did the P6 team think about Intel’s Itanium Processor Family?

When I joined Intel in 1990 to start a new design team in Oregon, I was prepared to contend with some understandable hostility from the existing x86 team in Santa Clara. After all, that team built Intel; they did the 286, the 386, and were finishing the 486. They were the chip design authorities in the company and if they had wanted to pick on the new kid on the block, they would have been within their rights.

But they didn’t. Every time I called or visited the P5/Pentium team, they were the very model of professionalism: helpful, engaged, interested, and forthcoming. Superb engineers such as Robert Dreyer, Jack Mills, Ed Grochowski, Don Alpert, Gary Hammond, John Crawford, Ken Shoemaker, and Ken Arora were not only knowledgeable, but a genuine pleasure to work with. If we needed advice or information about x86 compatibility, or design tools, we got it. The relationship between these (almost-rival) teams could hardly have been better.

Unfortunately, things always change, and if they are already good, there are few ways for them to get better and many ways for them to get worse. Intel found one of the many ways when it commissioned the Santa Clara team to conceive and implement a new 64-bit instruction set architecture, eventually known as the Itanium Processor Family (IPF), and then gave them (what I believe were) confused and conflicting project goals.

The first problem stemmed from the decision to form a partnership with Hewlett-Packard to jointly conceive and specify the new architecture. Understandably, HP wanted to contribute their best technical ideas to this new architecture, but also wanted to make sure that in no case would they end up competing against any of them. So an intercompany firewall was required as part of the deal between the IPF team and the IA32 team in Oregon. We in the Oregon design team did not care about whatever HP technology was being protected thereby, but we did care about the fact that this firewall had a strong tendency to compartmentalize the Santa Clara team from the rest of the Intel design community. Us-versus-them psychology creeps into such situations by default, but this firewall requirement made it much worse. In cases of profound disagreement, we could no longer mutually revert to data and simulations to resolve the issue.

The second problem was, I believe, intrinsic to the charter required of the Intel IPF team. In essence, they were told that their mission was to jointly conceive the world’s greatest instruction set architecture with HP, and then realize that architecture in a chip called Merced by 1997, with performance second to no other processor, for any benchmark you like. The justification for this blanket performance requirement was that if a new architecture was not faster than its predecessors, then why bother with it? Besides, HP’s studies had indicated that the new architecture was so much faster than any other, that even if some performance was lost to initial implementation naivete, Merced would still be so fast that it would easily establish the new Itanium Processor Family.

This plan did not go over well with the Oregon design team. At one point, I objected to the executive VP in charge that no company had ever achieved anything like what he was blithely insisting on for Merced. No matter how advanced an architecture, the implementation artifacts of an actual chip would offset them until the design team learned the proper balances between features, design effort, compiler techniques, and so on. Moreover, there are always uncertainties in complex designs and new designs most of all. The one thing you do not do with uncertainties is to stack them all end to end and judge them all toward the hoped-for end of the range. With any one issue, you can make an argument that it is likely to turn out at the high end of the desirability range, but you must not do that with every issue simultaneously. Nature does not work that way but, in effect, that is what Merced was assuming.
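A small piece of arithmetic shows why stacking optimistic assumptions fails. The sketch assumes, for illustration only, that the design issues are independent and that each has the same chance of landing at the favorable end of its range. Neither the 80 percent figure nor the issue counts come from the Merced project.

```python
def chance_all_favorable(p_each, n_issues):
    """Probability that n independent issues all land favorably."""
    return p_each ** n_issues


# Hypothetical numbers: each issue has an 80% chance of turning out well.
for n in (1, 5, 10, 20):
    print(n, round(chance_all_favorable(0.8, n), 3))
```

Under these assumptions, a plan that needs ten such issues to go well succeeds about 11 percent of the time, even though each bet looks safe on its own.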

My criticisms were not accepted on a technical or rational level. Instead, I was accused of criticizing another project just to make my own look better. I said, “I would say exactly the same thing even if it were my own team designing Merced. You cannot expect any design team in the world to get so many things right on their first try. And with the number of new ideas in IPF, what Intel should be doing is designing Merced as a research chip, not to be sold or advertised. Take the silicon back to the lab and experiment with it for 18 months. At the end of those 18 months, you’ll know what new ideas worked and which ones weren’t worth the cost of implementation. Then design a follow-on, keeping the good ideas and tossing out the rest, and that second chip and its follow-ons have a chance to be great. It’s worth investing a year or two at the beginning of a new instruction set architecture that you hope will last for 25 years.” The response: “I hope you’re wrong. I cannot afford to design a chip that I cannot sell upon completion.” I said, “Then we have no business trying to design a new instruction set architecture.” The result was a stalemate.

The Oregon team also did not like the idea that with P6, we had opened the server markets for Intel and now were being told “Oregon is not even allowed to say the word server” for fear of too much internal competition. Given our concerns about the viability of any first-time chip like Merced, we felt Intel was unnecessarily risking its presence in the server space, since it was relatively easy to continue designing variations of our x86 desktop chips for servers for as long as necessary until IPF was ready to take over.
