The Evolution of Parallel Processing: The Rise of Throughput Computing
Stanford University and NVIDIA
Advances in semiconductor technology over the past several decades have opened up many new possibilities in parallel processing architecture. The technologies and architectures of the early supercomputers have evolved toward machines that support a combination of data- and thread-level parallelism and have a deep memory hierarchy. There is a tension between these emerging parallel machines and today's commodity processor architecture.
Most commodity processors have limitations in two critical aspects of machine organization: parallel execution and hierarchical memory organization. These processors present to the programmer an illusion of sequential execution and uniform, flat memory. The evolution of these sequential, latency-optimized processors is at an end, and their performance is increasing only slowly over time. In contrast, the performance of throughput-optimized processors, such as Graphics Processing Units (GPUs), continues to scale at historical rates. Throughput processors embrace, rather than deny, parallelism and memory hierarchy to realize their performance and efficiency advantage over conventional processors. Throughput processors have hundreds of cores today and will have thousands of cores by 2015. They will deliver most of the performance, and most of the user value, in future computer systems.
This talk will discuss some of the challenges and opportunities in the architecture and programming of future throughput processors. In these processors, performance derives from parallelism, and efficiency derives from tight local coupling of computing and memory resources. Parallelism can take advantage of the plentiful and inexpensive arithmetic units in a throughput processor. Without locality, however, bandwidth quickly becomes a bottleneck. Communication bandwidth, not arithmetic power, is the critical resource in a modern computing system: it dominates cost, performance, and power. This talk will discuss exploitation of parallelism and locality with examples drawn from the Imagine and Merrimac projects, from NVIDIA GPUs, and from three generations of stream programming systems.
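The claim that bandwidth, not arithmetic, limits performance can be made concrete with a roofline-style back-of-the-envelope model. The sketch below is illustrative only: the peak arithmetic rate and memory bandwidth are assumed round numbers, not the specifications of any particular processor, and `attainable_flops` is a hypothetical helper for the calculation.

```python
# Roofline-style model: a kernel's attainable arithmetic rate is capped
# either by the machine's peak FLOP rate or by how fast memory can feed
# it, whichever is lower. Hardware numbers below are assumed for
# illustration, not specs of any real processor.

PEAK_FLOPS = 1.0e12   # assumed peak arithmetic rate: 1 TFLOP/s
PEAK_BW = 100.0e9     # assumed peak memory bandwidth: 100 GB/s

def attainable_flops(arithmetic_intensity):
    """Attainable FLOP/s for a kernel with the given arithmetic
    intensity (useful FLOPs per byte moved to/from memory)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

# Example: SAXPY computes y[i] = a*x[i] + y[i], i.e. 2 FLOPs per
# element while moving 12 bytes (read x[i] and y[i], write y[i],
# 4-byte floats) -- an intensity of 2/12 FLOPs per byte.
saxpy = attainable_flops(2.0 / 12.0)
print(f"SAXPY bound: {saxpy / 1e9:.1f} GFLOP/s of {PEAK_FLOPS / 1e9:.0f} peak")
```

With these assumed numbers, a low-locality kernel like SAXPY reaches only a small fraction of peak arithmetic throughput; only by raising arithmetic intensity (reusing data held close to the arithmetic units) can a kernel approach the machine's peak, which is the locality argument the talk develops.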