Managing Concurrency: Latent Futures, Parallel Lives
By Kyle Wilson
Sunday, January 29, 2006
I bought a dual-core computer two weeks ago. Last week I attended Sony's PlayStation 3 developer conference in San Jose. Between those two events, I've had concurrency much on the brain lately.
In what follows, I'm necessarily going to talk about the Xbox 360 and the PlayStation 3. I don't want to violate any NDAs, so I'll limit myself to discussing the stats available on Wikipedia -- which are often more complete than what the console manufacturers tell us anyway.
Software development has reached an inflection point. The code that we write, and the hardware that it runs on, are undergoing a revolutionary change. Herb Sutter, chair of the ISO C++ standards committee, writes that "The Free Lunch Is Over": heat production and power requirements have stalled CPU speeds for several years now, and even if they hadn't, we're running awfully close to the clock speed limit imposed by the speed of light. Hardware manufacturers can keep adding transistors to CPUs, but they can't make traditional sequential programming any faster. The future is multi-core architectures and, for anyone writing performance-critical applications, multi-threaded programming.
Game developers have to deal with this sooner than others for two reasons:
First, we care about performance, while most other applications don't come close to using all the processing power available to modern PCs. Most applications will be happy to run single-threaded for some time to come. To paraphrase Herb Sutter again, if those apps get any benefit from new multi-core hardware, it's just that the OS will offload to other cores all the spyware and spyware killers, viruses and virus shields, spambots and spam filters that are quietly fighting it out in the background of most PCs.
The second reason game developers have to deal with multi-core architectures is that multi-core game hardware is already here. Intel and AMD have been slow to jump on the multi-core bandwagon because they know that home users buying new computers expect their existing applications to run more quickly on the new hardware. So Intel and AMD go to ever more heroic/absurd lengths -- squeezing out tiny increments of clock speed, doing out-of-order and speculative execution, and so forth -- just to make those single-threaded legacy apps run a tiny bit faster.
Microsoft and Sony don't have that problem. Instead, they have a pressing need to control costs on ~$400 consoles, and if they can buy processing power more cheaply by buying multi-core CPUs, they will. Of course, that offloads the hard work of utilizing all that processing power onto the game developers, who have to try to distribute some inherently sequential logic across multiple threads. This drives Chris Hecker to rant at GDC that Sony and Microsoft are screwing us over. But they're really just trying to lose a little less money per console. They're slaves to the physical limitations that processors are hitting, just like the rest of us. Console developers get to be the canaries in the mine shaft on this one, but in five years I don't think we'll see single-core PCs being sold anymore, either.
Divided We Stand
Modern game console CPUs are at that awkward adolescent stage that graphics processors went through nearly a decade ago. It's obvious that they're powerful, and that they're going to change everything, but it's not yet obvious what they'll become, and it's clear that we're going to go through some amount of turmoil and upset in the process of finding out.
The Xbox 360 runs on a three-core 3.2 GHz PowerPC processor, with two hardware threads per core. The different cores run in a shared memory architecture, and in fact share a 1 MB L2 cache. This design is a relatively gentle introduction to the new concurrent landscape. Although a single-threaded application will only use one-sixth of the power of the Xbox 360, multi-threaded programs can run transparently on multiple cores: multi-threaded applications can execute the same code in different threads, can access data in a shared address space, and can use the common synchronization primitives that most programmers learned about in school a long time ago. 
The PlayStation 3's Cell processor is a rather different approach to concurrency. The heart of the PS3 is, like the heart of the Xbox 360, a 3.2 GHz PowerPC chip with two hardware threads. But instead of having three similar cores, the Cell has one PowerPC core augmented by eight "Synergistic Processing Elements," or SPEs. An SPE is a RISC processor specialized for vector math, with its own 256 KB local memory for code and data. An SPE communicates with the main processor via DMA transfers.
These are radically different architectures. This yawning chasm between Xbox 360 and PS3 hardware designs creates something of a headache for those of us who are doing cross-platform development. Generally, we like to design abstract interfaces that hide the platform-specific details of how particular operations are performed. But it's hard to see the common abstraction between the Xbox 360's six similar hardware threads and the PS3's eight SPEs.
The lowest common denominator is the SPE interface. If we design lightweight jobs that can run in less than 256 KB of memory, and that communicate with the main game thread through DMA transfers, then we've created perfect jobs for SPEs. But the more flexible Xbox 360 processor can emulate the SPEs' handling of those jobs by running them in the main game address space. (Emulated SPEs can even be allocated 256 KB blocks of contiguous memory, for better cache locality.) These "virtual" SPEs can communicate with the main game thread by memcpy instead of DMA transfers, or can forgo memcpy overhead where only const access to main memory is needed.
This suggests -- and, in fact, the Cell processor wizards at IBM suggest -- a job queue, in which the main thread of an application passes SPE-sized jobs off to a system that monitors available processors, assigning jobs to processors as earlier jobs finish. Ideally, the main thread would enqueue a large number of jobs at the beginning of each frame and gather results from job target locations at the end of the frame, blocking on any necessary jobs that hadn't yet finished. Unfortunately, the various needs of a game rarely fit this ideal.
Emulating SPEs doesn't really play to the Xbox 360's strengths. Jobs are likely to execute faster on the PS3 than on the Xbox 360: the Cell processor's SPEs have fast local memory and greater vector processing power. However, the execution time of most jobs is likely to be dwarfed by memory latency, and the Xbox 360 may make up the difference through its ability to read data directly from main memory. At this point, any guesses about the relative performance of the PS3 and the Xbox 360 would be premature.
The job queue model seems necessary, but is it sufficient? We might benefit from a more traditional approach to concurrency as well. There are still two hardware threads available on the PS3, and there may be tasks that run more efficiently in more than 256 KB of memory or sharing access to game code or resources. The job queue model could be extended to support different types of jobs. Or we might be better off with a more abstract model of shared-address-space concurrency, such as that suggested by Herb Sutter in his Concur project language extensions.
The popular version of Moore's Law holds that the number of transistors on a chip doubles every eighteen months. Hardware manufacturers claim a doubling every twenty-four months is more realistic. I'll split the difference and assume three doublings, an eight-fold increase, occur every five years. Five years is the length of a console life cycle.
How well does this prediction hold, historically? The first Xbox had 64 MB of memory and a 733 MHz processor. The PlayStation 2 had 32 MB of memory and a 294 MHz processor. Scaling memory and clock speed by a factor of eight, I would have expected this generation of consoles to have between 256 MB and 512 MB and single-core processors running between 2.4 GHz and 5.9 GHz. The amount of memory is on the high end of my predicted range. Processor clock speeds are within the range I predicted, though the move away from out-of-order execution and from CISC to RISC instruction sets makes the comparison somewhat apples-to-oranges.
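For the curious, the scaling arithmetic works out like this (the factor of eight is the three-doublings-per-five-year-cycle assumption above; the base specs are the ones just quoted):

```cpp
// Three doublings per five-year console cycle: a factor of eight.
constexpr double kFiveYearFactor = 8.0;

// Scale a previous-generation spec up one console cycle.
constexpr double scale(double value) { return value * kFiveYearFactor; }

// Xbox:          64 MB -> 512 MB;  733 MHz -> 5.864 GHz
// PlayStation 2: 32 MB -> 256 MB;  294 MHz -> 2.352 GHz
```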
If processor speeds truly have plateaued, and if I take Moore's Law very literally, then the Xbox 720 should have 24 cores, each running two hardware threads. The PlayStation 4 should have a mere 8 cores driving an absurd 64 synergistic processing elements. Multiply by eight again for the next console generation and... well, rapidly the limiting factor in what any game can do isn't the raw processing power available to it, but the effort of marshalling data, distributing state information to different processors and gathering results. We're going to have more bandwidth than we know what to do with, but the latency is going to kill us.
But that's a problem for five or ten years hence. In the meantime, the shift toward concurrency is going to have interesting consequences. Let me play Nostradamus for a minute:
- 60 Hertz games are soon going to be a thing of the past. Imagine that you need to execute game logic, physics, graphics, etc. for a game world on multiple processing units. Further imagine that on some hypothetical massively-parallel future console hardware, the latency in distributing or gathering data for those operations totals 8 ms. Once data's been distributed, and before you gather it again, you can process 100 objects per ms. If your frame time is 16 ms, you can process 800 objects. If your frame time is 33 ms, you can simulate 2500 objects. It used to be that you could run twice as fast if you did half as much. Now you can run twice as fast if you do a third as much.
- The long-promised death of PC games is finally coming. There just aren't enough non-game PC applications that need maximum processing power to keep driving the sales of new hardware, especially not if that new hardware complicates application development as much as thread-safe concurrent programming does. The future I expect is one where hardware branches in a variety of specialized directions:
- Game consoles will put a modestly multi-core CPU, a GPU, and various ancillary vector math co-processors all on one chip
- Servers, render farms and programmer's workstations will run as many cores on a single chip as they possibly can
- Home computers will become commoditized modestly multi-core creatures, differentiated more by brand and aesthetics than by capability (cf. iPods)
- Middleware, middleware, middleware. It used to be that if you wanted something up and running quickly, you licensed middleware, and if you wanted something that performed optimally, you rolled your own engine. That's not going to be true anymore. Writing safe and efficient concurrent code is hard. The people who can do it are going to be worth their weight in gold to middleware providers, and game companies are going to find that licensing an engine makes more and more economic sense. Middleware providers, on the other hand, are going to find that multiple concurrency solutions in the same game step on each other's toes, and that developers benefit from integrated solutions. Don't be surprised if Epic Games merges with Ageia, if Havok comes out with a graphics library and scene graph implementation, or if Emergent Game Technologies adds a physics package to Gamebryo.
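The frame-budget arithmetic in the first prediction above can be checked directly (the 8 ms latency and the 100-objects-per-ms rate are the hypothetical figures from that bullet):

```cpp
// Objects simulated per frame when a fixed chunk of the frame goes to
// distributing and gathering data, and the remainder to processing at a
// fixed rate.
constexpr int objects_per_frame(int frame_ms, int latency_ms,
                                int objects_per_ms) {
    return (frame_ms - latency_ms) * objects_per_ms;
}

// 60 Hz (16 ms frames): (16 - 8) * 100 =  800 objects
// 30 Hz (33 ms frames): (33 - 8) * 100 = 2500 objects
```

Halving the frame rate doubles the frame time but more than triples the usable budget, which is exactly why "twice as fast" now means "a third as much."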
O brave new world, that has such hardware in it!
Herb Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Dr. Dobb's Journal, March 2005.
Power Architecture editors, "Unleashing the power of the Cell Broadband Engine." IBM developerWorks.
Any opinions expressed herein are in no way representative of those of my employers.