The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners)
Managing Mechanics
The concept phase could also be called the “blue sky” phase because all things seem feasible, the horizons are very far away, and the possibilities are infinite. Many system architects enjoy this phase of a project so much that they are loath to leave it. After all, the crystal castle they’re building in the clouds of their gray matter has only good features, unsullied by compromises, physics, economics, roadmap choices, customer preferences, and schedule pressures. Every university has some PhD students who enjoy the process so much that they never feel compelled to actually finish the work, and these people could eventually find themselves on concept phase architecture teams. The project leader must be able to sense when the project’s concept phase is nearing the end of its usefulness and have the fortitude to push the team to the next phase. (Conversely, project leaders must also be able to accurately discern when the project has not yet come up with ideas strong enough, and prevent the project from committing to a plan that cannot produce a successful product.)
Paying attention to some basic mechanics can make it easier to manage this phase and its transitions.
Physical Context Matters. When we first began working out the fundamentals of P6’s microarchitecture, we met informally in each other’s cubicles and quit each day when we were too tired to think straight. Unfortunately, we were also disturbing neighboring cubicles, so we started using conference rooms. But that didn’t work well, either; the whiteboards were cleaned nightly, and every time we met, we felt as if we were starting over rather than picking up where we left off. More to the point, other people knew we were in the conference room, which meant we could be interrupted. We discovered by trial and error that it usually took us about an hour to reliably get back to the point where we had left off the previous day. As Figure 2.2 shows, we also found that our period of peak brainstorming efficiency was about two hours, so it took about half of each brainstorm session just to reconstruct what we had done the day before. If we were interrupted, the one-hour startup transient would begin again, causing a severe loss in efficiency.
The startup transient was a function of several things. Which decisions had we tentatively made? Which open issues had we chosen to be in the “must attack this immediately” versus “probably has a workable solution, so leave until later” category? Which previously open ideas had we suddenly realized were probably not going to work and should be permanently abandoned? We were chagrined at how often we found ourselves groupthinking through an extremely complicated design scenario, only to suddenly realize that we’d already had this exact discussion two weeks ago, concluded it was all a dead end, and agreed not to revisit it.
When we did manage to get the same conference room, and the janitorial staff had observed our request to not clean the whiteboards (at least this time), we discovered that we could get back to our previous conceptual point much more quickly, because all our prior scribblings were still on the wall in our own handwriting. That context was invaluable in reminding us of the twists and turns of our previous meeting.
The Storage Room Solution. On one memorable day, when we again found our conference room unavailable, we solved all three problems (disturbing others, loss of context, and interruptions) simultaneously. Dave demonstrated one of his many surprising skills by using his Intel badge to jimmy open a nearby storage room, which happened to be fairly large with lots of unused wall space. On the grounds that it’s easier to ask forgiveness than permission and with the same authorization that Dave used to get in the room, we procured lots of whiteboard units and installed them into our new meeting space. We were ecstatic! Nobody ever cleaned the storage room, nobody else could preempt us from using it, and we were never interrupted because no one knew where we were. Several months later, the room’s owner stumbled across our think session and evicted us, but by then the project was well on its way and we didn’t really need it anymore.
Figure 2.2. Brainstorming time segments and sequence.
The idea that physical context matters is hardly new. When musicians are trying to combat the debilitating effects of nervousness on their virtuosity, they often practice in the same venue in which they will deliver their recital. Likewise, students who are anxious about performing well on a test may benefit from studying in the same room in which the test will be administered.
The converse of this idea is also valuable. If a particular physical context has contributed to the current project state, it makes sense to change that context when the project occasionally stalls and needs a really good new idea. Sometimes, just meeting temporarily in a different room can break the logjam.
Beyond the Whiteboard. Although no one can dispute the usefulness of whiteboards, no matter how many you have, you will run out of space and have to erase some of your ideas to make room for new ones. Our solution to whiteboard limitations was to create a meeting record, or log, that captured our key decisions as we made them. The honor of record keeping fell to a designated “scribe,” whose job was to notice ideas and decisions worth writing down. The scribe would then circulate the meeting minutes among the rest of the concept phase team for error-checking as soon as possible after the meeting. The scribe not only captured decisions made and directions set, but also tried to document roads not taken and why. Brainstorming in the meetings commonly ended with an Aha! moment, in which everyone suddenly realized why an idea we had all thought had some potential would in fact not work. Or we might come upon a related idea that we all felt was a good one but not appropriate for our project. If no one had documented such moments as they occurred, they might have been lost forever.
Recording the roads not taken turned out to be particularly valuable for another critical concept phase task: accurately transferring the project plan into the design team’s many heads (an art form in its own right, as I describe later). Transferring the roads not taken helps the designers understand the project direction, since they then have a clearer picture of what it is not. It also helps them avoid making lower-level design decisions later that might not be congruent with the project direction.
“Kooshing” the Talkers. You want doers, not talkers, on a concept phase team. Doers write programs, run experiments, look up half-remembered papers, ask expert friends, and come up with boundary-case questions to help illuminate the issue. Talkers generate endless streams of rhetoric that seem to enumerate every possibility in the neighboring technical space. When you become exasperated and cut them off, however, you will discover that this long list has brought you no closer to resolving anything.
With dozens of complex concepts being invoked or invented, and a few overall project directions still on the table, the possibility space becomes enormous. Each choice you tentatively make generates five new issues, each of which embodies some number of knowns and unknowns. The team will not have much data to guide them in this phase. They must find their way through these choices with a combination of
1. Knowledge (what is in the literature or was proven at last week’s conference)
2. Experience (we did it this way eight years ago, and it worked or didn’t work)
3. Intuition (this way looks promising to me; that way looks scary)
Talkers monopolize the room’s extremely limited bandwidth. We had natural or rapidly converted doers on the P6 project, but if we had had a talker, there would have been one person expounding on every idea that popped into his head, while four others frantically wrote down the ideas that were flashing through their minds before they were forgotten or steamrollered by the next conceptual bit of flotsam bobbing along on the audible stream of consciousness. One technique we stumbled onto quite by accident was the Koosh-ball gambit. When brainstorming, many engineers like to have something to play with in their hands: a pen, keys, their cell phone. One of us brought a Koosh ball to a meeting, a small rubber ball with colorful rubber-band-hair sticking out in all directions, and absentmindedly tossed it up and down throughout the meeting. Within a week, we all had Koosh balls, and a new phenomenon emerged: If someone threatened to monopolize the discussion, the others had the right to register their displeasure by tossing their Koosh ball at the offender. Before long, all someone had to do was pick up the Koosh ball and look menacingly at the speaker for that person to realize it was time to let others speak.
A DATA-DRIVEN CULTURE
An important part of the concept phase is to establish the project’s technical roots. For P6, we had to choose some basic project directions such as 32-bit virtual addressing versus 64-bit, 16-bit performance, frontside bus architecture, and out-of-order microarchitecture versus the P5’s in-order architecture. Somewhat later in the project, we faced similar questions about clock frequency, superpipelining, and multiple processors on a bus.
These directions might not have come as quickly (or at all) had we not had a team of doers who actively sought project-wide, high-level decisions, the points of highest leverage in a project. We knew from the beginning that deciding on an out-of-order microarchitecture was the number one conceptual priority, since that choice would drive most of the other questions. An out-of-order core would imply a much more complicated engine, which would tend to increase the number of pipeline stages, which would impact the clock frequency (making it either higher or lower, we were not entirely sure which), and so on.
In our early concept phase discussions, we briefly considered all the ways that we knew might give us the required performance, including a very long instruction word (VLIW) approach, since Dave and I had worked on VLIW designs while at Multiflow. But we kept coming back to out-of-order, despite the dearth of successful out-of-order designs before P6. Glenn felt that there were solutions to the obvious problems of an out-of-order Intel Architecture chip, and Andy had worked on out-of-order ideas in his master’s thesis and was enthusiastic about the concept’s performance potential. Dave and I knew from our VLIW experience that the implicit code reordering of an out-of-order engine had real potential, but also substantial complexity, and it was not out of the question for an out-of-order engine to lose more performance to overhead than it gained in cleverness. We also knew how easy it was to make global performance-driven decisions early in a project that would later turn out to have spawned whole colonies of product bugs.
Despite these drawbacks, we settled on out-of-order as the POR in only a few days. After a few hours of discussion, we had generated a list of open items, things we knew were important and could make or break the project, but we did not yet know how to resolve them. Again, had our team been talker-dominated, we would have spent the next several months debating every issue without a single resolution, because there is no religious demagogue worse than an engineer arguing from intuition instead of data. If any two concept phase engineers have opposing intuitions, and neither collects data sufficient to convince the other, the debate cannot be resolved and quickly degenerates into a loud, destructive ego contest.
We avoided most such encounters in the P6 by allocating the list of open items to the five participants and then requiring that the item be argued at the next meeting on the basis of data to be collected by then. In this way, we established the project’s data-driven culture. Each person had to think of and then execute an experiment to help resolve a technical issue or unknown in sufficient detail and veracity to convince the rest of us that his conclusion had merit. The lack of time and resources actually worked mostly to our advantage. You might think that there was not enough time to do a really good experiment between Tuesday noon and Thursday 8 A.M., but the short deadlines forced us to think carefully about exactly what was being asked. The thinking was the important part; if we got that part right, then writing up some code or researching the literature was usually simple. And the focus it provided our answer tended to make that answer right, or at least useful.
Not being able to quantify the answer to a particular issue was sometimes the first indication that we were not chasing some technical corner case but, in fact, had possibly stumbled onto an important, potentially difficult problem that might need everyone’s immediate, concerted attention. I knew very little in those days about the Intel Architecture, and was acutely sensitive to signs that something fundamental in the instruction-set architecture would prevent any reasonable out-of-order implementations. My nightmare was that we would build a very fast computer that could not, no matter how clever we were, run all x86 code properly. I knew, for example, that to ensure high performance, we had to be able to service loads in a temporal order other than what the program code implied. Would any examples of Intel Architecture code break if some loads were executed out-of-order?
We eventually invented some mechanisms by which loads that needed strict sequentiality could be identified by hardware and handled separately, but early in the project there were dozens of such worries. Those that we could not quickly dispatch via experiment, we added to a watch list for further scrutiny, and in some cases initiated longer exploratory work.
Of course, insisting on data for all decisions will lead to analysis paralysis. In this case, schedule pressure and proactive leadership are the antidotes. Legendary NASA designer Bill Layman said:
If you force people to take bigger intuitive leaps, and their intuition is sound, then you can get more and more efficient as you force larger and larger leaps. There’s some optimum point where they succeed with ninety percent of their leaps. For the ten percent they don’t make, there’s time to remake that leap or go a more meticulous route to the solution of that particular problem…. There’s an optimal level of stress. [15]
The Right Tool
Very early on in P6 development, when only Dave, Glenn, and I were on the project, we had noticed that all out-of-order performance results in the literature were RISC-based. Was there something about the x86 architecture that obviated those results and would not let us achieve the performance predicted? Dave and I wrote the dataflow analyzer, or DFA, to get a quick answer to this important question.
We purposely called this tool a dataflow “analyzer” because it was most emphatically not a simulator of any kind. DFA was aimed solely at answering the question of how much parallelism was intrinsic to general x86 code, and it operated from execution traces. An execution trace is a record of exactly how a processor executed a program, instruction by instruction, including memory addresses generated and branches taken. After DFA had walked over the entire trace, we could compare how many clock cycles DFA’s rescheduling would have taken to how many instructions were executed (again, assuming one clock cycle per instruction for ease of analysis). We would then know the theoretical speedup limit.
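To make the idea concrete, here is a minimal Python sketch of this kind of dataflow-limit calculation. It is not the P6 team's DFA code. The `TraceRecord` format, the `dataflow_limit` function, and the sample trace are all hypothetical, and the sketch assumes unlimited execution resources, one clock cycle per instruction, and that only true data dependences (an instruction needing a value written by an earlier one) constrain the schedule.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class TraceRecord:
    """One executed instruction (hypothetical trace format).

    `reads` and `writes` name the registers or memory addresses the
    instruction consumed and produced.
    """
    reads: Tuple[str, ...]
    writes: Tuple[str, ...]


def dataflow_limit(trace: Iterable[TraceRecord]):
    """Return (instruction_count, dataflow_cycles, speedup_limit).

    Assumptions of this sketch: one clock per instruction, unlimited
    execution resources, and only read-after-write dependences matter.
    """
    ready_at = {}       # location -> cycle when its newest value is available
    dataflow_cycles = 0  # length of the longest dependence chain seen so far
    count = 0
    for rec in trace:
        count += 1
        # An instruction may start once every value it reads has been produced.
        start = max((ready_at.get(loc, 0) for loc in rec.reads), default=0)
        finish = start + 1
        for loc in rec.writes:
            ready_at[loc] = finish
        dataflow_cycles = max(dataflow_cycles, finish)
    speedup = count / dataflow_cycles if dataflow_cycles else 0.0
    return count, dataflow_cycles, speedup


# Illustrative six-instruction trace (register names are for flavor only).
example_trace = [
    TraceRecord(reads=(), writes=("eax",)),
    TraceRecord(reads=(), writes=("ebx",)),
    TraceRecord(reads=("eax", "ebx"), writes=("ecx",)),
    TraceRecord(reads=(), writes=("edx",)),
    TraceRecord(reads=("ecx", "edx"), writes=("eax",)),
    TraceRecord(reads=("ebx",), writes=("esi",)),
]

print(dataflow_limit(example_trace))  # (6, 3, 2.0)
```

In this toy trace the longest dependence chain is three instructions long, so six instructions could in principle finish in three cycles, a theoretical speedup limit of 2. A real trace would be far longer, but the bookkeeping is the same.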
In a few weeks, results from DFA confirmed that the published results should apply quite well to x86. Glenn added a graphical interface and DFA became a mainstay tool for the rest of the P6 project. It also helped define the project’s data-driven culture because it was general and open source. Thus, we could use it to attack a wide variety of questions and any of us could modify it as needed.
Dave, Glenn, and I were hardware engineers, not professional programmers. We could write code, and it would work when we were finished, but it never surprised us when others would later point out better ways to do things. (Andy and Michael were outstanding programmers, but they didn’t join the project until a few months after DFA’s creation.) As an example of Michael’s diplomacy in the face of egregious overstimulation, he once pointed out to me that a set of command line parsing routines was available in the standard library, so I had not needed to write my own. He didn’t even crack a smile, even though outright guffaws would have been warranted. In my defense, I hasten to point out my command line parsing routines worked, and nobody ever found a bug in them.
The important thing was not how pretty the DFA code was; it was what we learned in creating and using it. The lesson for all chip architects at this phase is to get your hands dirty. What you learn now will pay off later when all the decisions become painful tradeoffs among must-haves.
The “What Would Happen If” Game. Having a data-driven culture in a design organization is always a good idea, but never more so than when the team’s leaders are learning new technology. When you have spent 10 or 15 years designing computers, as Glenn, Dave, and I had, you will have absorbed a subliminal set of intuitions about what seems reasonable. For example, this size for an L2 cache should mean that the L1 cache is that size; if the CPU is this fast, then the frontside bus must have at least that bandwidth. This kind of experience base can be immensely valuable to a design team, since it focuses the team’s efforts on the few solution sets that might actually work, instead of allowing them to wander in the much larger general possibility space.
But DFA quickly taught us that much of what we had subconsciously learned was wrong; it just didn’t apply to out-of-order engines. Undiscouraged, Dave and I started a little prediction game that went something like this: I would show up at my office and start work. Within the hour, Dave would arrive, sporting an “I know something you don’t” grin, and say, “What do you suppose would happen if you had an out-of-order x86 engine with 300 arithmetic pipelines that could do 200 simultaneous loads but only one store per clock?” Now, Dave didn’t just make up that question on the spot, and he wasn’t asking because he hoped I would know the answer. He was asking because, just last night, he didn’t know, and he stayed up all night experimenting until he found the answer, an answer that apparently delighted him with its unexpected nature, and taught him something worth knowing about this strange out-of-order universe. He could just tell me the answer, but where was the fun in that?