ACM Queue - Processors http://queue.acm.org/listing.cfm?item_topic=Processors&qc_type=topics_list&filter=Processors&page_title=Processors&order=desc

NUMA (Non-Uniform Memory Access): An Overview http://queue.acm.org/detail.cfm?id=2513149 NUMA (non-uniform memory access) is the phenomenon that memory at various points in the address space of a processor has different performance characteristics. At current processor speeds, the signal path length from the processor to memory plays a significant role. Increased signal path length not only increases latency to memory but also quickly becomes a throughput bottleneck if the signal path is shared by multiple processors. Differences in memory performance first became noticeable on large-scale systems whose data paths spanned motherboards or chassis. These systems required modified operating-system kernels with NUMA support that explicitly understood the topological properties of the system's memory (such as the chassis in which a region of memory was located) in order to avoid excessively long signal path lengths. (Altix and UV, SGI's large address space systems, are examples. The designers of these products had to modify the Linux kernel to support NUMA; in these machines, processors in multiple chassis are linked via a proprietary interconnect called NUMALINK.) Processors Fri, 09 Aug 2013 12:36:49 GMT Christoph Lameter 2513149

Realtime GPU Audio http://queue.acm.org/detail.cfm?id=2484010 Today's CPUs are capable of supporting realtime audio for many popular applications, but some compute-intensive audio applications require hardware acceleration. This article looks at some realtime sound-synthesis applications and shares the authors' experiences implementing them on GPUs (graphics processing units).
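The data-parallel structure that GPU sound synthesis exploits can be sketched in plain Python: in additive synthesis, every (oscillator, sample) pair is computed independently, which is exactly the kind of work that maps onto thousands of GPU threads. The sketch below is illustrative only, not code from the article; all names and parameters are assumptions.

```python
import math

def synthesize_block(freqs, amps, start, n_samples, sr=44100):
    """Additive synthesis of one audio block as a sum of sine oscillators.

    Every (oscillator, sample) pair is independent, which is the
    data-parallel structure a GPU implementation would exploit by
    assigning oscillators and samples to separate GPU threads.
    """
    block = [0.0] * n_samples
    for f, a in zip(freqs, amps):
        w = 2.0 * math.pi * f / sr  # angular increment per sample
        for i in range(n_samples):
            block[i] += a * math.sin(w * (start + i))
    return block

# One 64-sample block of a simple three-partial tone.
block = synthesize_block([220.0, 440.0, 660.0], [0.5, 0.3, 0.2], 0, 64)
```

On a CPU the doubly nested loop runs serially; on a GPU the inner work becomes a kernel launched over an oscillators-by-samples grid, followed by a reduction over oscillators.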
Processors Wed, 08 May 2013 21:15:18 GMT Bill Hsu, Marc Sosnick-P&#233;rez 2484010

FPGA Programming for the Masses http://queue.acm.org/detail.cfm?id=2443836 When looking at how hardware influences computing performance, we have GPPs (general-purpose processors) on one end of the spectrum and ASICs (application-specific integrated circuits) on the other. GPPs are highly programmable but often inefficient in terms of power and performance. ASICs implement a dedicated and fixed function and provide the best power and performance characteristics, but any functional change requires a complete (and extremely expensive) re-spinning of the circuits. Processors Sat, 23 Feb 2013 03:43:14 GMT David Bacon, Rodric Rabbah, Sunil Shukla 2443836

CPU DB: Recording Microprocessor History http://queue.acm.org/detail.cfm?id=2181798 <h1 class='hidetitle'>CPU DB: Recording Microprocessor History</h1> <h2>With this open database, you can mine microprocessor trends over the past 40 years.</h2> <br /> <h3>Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University</h3> <br /> <p>In November 1971, Intel introduced the world&rsquo;s first single-chip microprocessor, the Intel 4004. It had 2,300 transistors, ran at a clock speed of up to 740 kHz, and delivered <i>60,000</i> instructions per second while dissipating 0.5 watts. The following four decades witnessed exponential growth in compute power, a trend that has enabled applications as diverse as climate modeling, protein folding, and computing real-time ballistic trajectories of angry birds. Today&rsquo;s microprocessor chips employ billions of transistors, include multiple processor cores on a single silicon die, run at clock speeds measured in gigahertz, and deliver more than 4 million times the performance of the original 4004.</p> <p>Where did these incredible gains come from?
This article sheds some light on this question by introducing CPU DB (cpudb.stanford.edu), an open and extensible database collected by Stanford&rsquo;s VLSI (very large-scale integration) Research Group over several generations of processors (and students). We gathered information on commercial processors from 17 manufacturers and placed it in CPU DB, which now contains data on 790 processors spanning the past 40 years.</p> Processors Fri, 06 Apr 2012 22:17:03 GMT Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz 2181798

Managing Contention for Shared Resources on Multicore Processors http://queue.acm.org/detail.cfm?id=1709862 <h2>Managing Contention for Shared Resources on Multicore Processors</h2> <h4>Alexandra Fedorova, Sergey Blagodurov, Sergey Zhuravlev; Simon Fraser University</h4> <h3>Contention for caches, memory controllers, and interconnects can be alleviated by contention-aware scheduling algorithms.</h3> <p>Modern multicore systems are designed to allow clusters of cores to share various hardware structures, such as LLCs (last-level caches; for example, L2 or L3), memory controllers, and interconnects, as well as prefetching hardware. We refer to these resource-sharing clusters as <i>memory domains</i>, because the shared resources mostly have to do with the memory hierarchy. Figure 1 provides an illustration of a system with two memory domains and two cores per domain.</p> <p>Threads running on cores in the same memory domain may compete for the shared resources, and this contention can significantly degrade their performance relative to what they could achieve running in a contention-free environment. Consider an example demonstrating how contention for shared resources can affect application performance.
In this example, four applications&mdash;Soplex, Sphinx, Gamess, and Namd, from the SPEC (Standard Performance Evaluation Corporation) CPU 2006 benchmark suite<sup>6</sup>&mdash;run simultaneously on an Intel Quad-Core Xeon system similar to the one depicted in figure 1.</p> Processors Wed, 20 Jan 2010 22:46:23 GMT Alexandra Fedorova, Sergey Blagodurov, Sergey Zhuravlev 1709862

Reconfigurable Future http://queue.acm.org/detail.cfm?id=1388771 <h3>Reconfigurable Future</h3> <h4>The ability to produce cheaper, more compact chips is a double-edged sword.</h4> <h4>Mark Horowitz, Stanford University</h4> <p>Predicting the future is notoriously hard. Sometimes I feel that the only real guarantee is that the future will happen, and that someone will point out how it's not like what was predicted. Nevertheless, we seem intent on trying to figure out what will happen, and worse yet, recording these views so they can be later used against us. So here I go...</p> <p>Scaling has been driving the whole electronics industry, allowing it to produce chips with more transistors at a lower cost. But this trend is a double-edged sword: we not only need to figure out more complex devices, which people want, but we also must determine which complex devices lots of people want, as we have to sell many, many chips to amortize the significant design cost.</p> Processors Mon, 14 Jul 2008 15:03:29 GMT Mark Horowitz 1388771

The Price of Performance http://queue.acm.org/detail.cfm?id=1095420 <h1>The Price of Performance</h1> <h3>An Economic Case for Chip Multiprocessing</h3> <h4>LUIZ ANDR&Eacute; BARROSO, GOOGLE</h4> <p>In the late 1990s, our research group at DEC was one of a growing number of teams advocating the CMP (chip multiprocessor) as an alternative to highly complex single-threaded CPUs.
We were designing the Piranha system,<sup>1</sup> which was a radical point in the CMP design space in that we used very simple cores (similar to the early RISC designs of the late &rsquo;80s) to provide a higher level of thread-level parallelism. Our main goal was to achieve the best commercial workload performance for a given silicon budget.</p> <p>Today, in developing Google&rsquo;s computing infrastructure, our focus is broader than performance alone. The merits of a particular architecture are measured by answering the following question: Are you able to afford the computational capacity you need? The high computational demands inherent in most of Google&rsquo;s services have led us to develop a deep understanding of the overall cost of computing, and to look continually for hardware/software designs that optimize performance per unit of cost.</p> Processors Tue, 18 Oct 2005 14:14:21 GMT Luiz Andr&#233; Barroso 1095420

Extreme Software Scaling http://queue.acm.org/detail.cfm?id=1095419 <h1>Extreme Software Scaling</h1> <h3>Chip multiprocessors have introduced a new dimension in scaling for application developers, operating system designers, and deployment specialists.</h3> <h4>RICHARD MCDOUGALL, SUN MICROSYSTEMS</h4> <p>The advent of SMP (symmetric multiprocessing) added a new degree of scalability to computer systems. Rather than deriving additional performance from an incrementally faster microprocessor, an SMP system leverages multiple processors to obtain large gains in total system performance. Parallelism in software allows multiple jobs to execute concurrently on the system, increasing system throughput accordingly. Given sufficient software parallelism, these systems have proved to scale to several hundred processors.</p> <p>More recently, a similar phenomenon is occurring at the chip level.
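The claim above that SMP systems scale to several hundred processors "given sufficient software parallelism" is usually quantified with Amdahl's law, which bounds speedup by the serial fraction of the workload. A minimal sketch (the function name and example numbers are illustrative, not from the article):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: speedup of a workload whose parallel_fraction can
    be spread across n_processors while the remainder stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even a 99%-parallel job on 256 processors is capped by its 1% serial part.
print(round(amdahl_speedup(0.99, 256), 1))  # → 72.1
```

This is why the passage stresses software parallelism rather than raw processor count: the serial fraction, not the hardware, sets the ceiling.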
Rather than pursue diminishing returns by increasing individual processor performance, manufacturers are producing chips with multiple processor cores on a single die. (See &ldquo;The Future of Microprocessors,&rdquo; by Kunle Olukotun and Lance Hammond, in this issue.) For example, the AMD Opteron<sup>1</sup> processor now uses two entire processor cores per die, providing almost double the performance of a single-core chip. The Sun Niagara<sup>2</sup> processor, shown in figure 1, uses eight cores per die, with each core further multiplexed across four hardware threads.</p> Processors Tue, 18 Oct 2005 14:14:01 GMT Richard McDougall 1095419

The Future of Microprocessors http://queue.acm.org/detail.cfm?id=1095418 <h1>The Future of Microprocessors</h1> <h3>Chip multiprocessors&rsquo; promise of huge performance gains is now a reality.</h3> <h4>KUNLE OLUKOTUN AND LANCE HAMMOND, STANFORD UNIVERSITY</h4> <p>The performance of microprocessors that power modern computers has continued to increase exponentially over the years for two main reasons. First, the transistors that are the heart of the circuits in all processors and memory chips have simply become faster over time on a course described by Moore&rsquo;s law,<sup>1</sup> and this directly affects the performance of processors built with those transistors. Moreover, actual processor performance has increased faster than Moore&rsquo;s law would predict,<sup>2</sup> because processor designers have been able to harness the increasing numbers of transistors available on modern chips to extract more parallelism from software.
This is depicted in figure 1 for Intel&rsquo;s processors.</p> Processors Tue, 18 Oct 2005 14:13:42 GMT Kunle Olukotun, Lance Hammond 1095418

Digitally Assisted Analog Integrated Circuits http://queue.acm.org/detail.cfm?id=984494 <h3>Digitally Assisted Analog Integrated Circuits<br> <em>BORIS MURMANN, STANFORD UNIVERSITY<br> BERNHARD BOSER, UC BERKELEY</em></h3> <h4>Closing the gap between analog and digital</h4> <p>In past decades, &#8220;Moore&#8217;s law&#8221;<sup>1</sup> has governed the revolution in microelectronics. Through continuous advancements in device and fabrication technology, the industry has maintained exponential progress rates in transistor miniaturization and integration density. As a result, microchips have become cheaper, faster, more complex, and more power efficient.</p> <p>We will show, however, that digital performance metrics have grown significantly faster than corresponding measures for analog circuits, especially ADCs (analog-to-digital converters). Since most DSP (digital signal processor) projects depend on A/D conversion at their interfaces, this growing disparity in performance threatens to limit the rate of progress of DSP hardware.</p> Processors Fri, 16 Apr 2004 10:14:19 GMT Boris Murmann, Bernhard Boser 984494
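One standard way to make the analog/digital gap concrete is the quantization-limited SNR of an ideal N-bit ADC, approximately 6.02N + 1.76 dB for a full-scale sine input. This textbook relation is not taken from the article; the sketch below just shows the roughly 6 dB of dynamic range that every additional bit of converter resolution must deliver on the analog side.

```python
def ideal_adc_snr_db(bits):
    """Quantization-limited SNR (in dB) of an ideal N-bit ADC driven
    by a full-scale sine input: SNR = 6.02*N + 1.76."""
    return 6.02 * bits + 1.76

# Moving from a 12-bit to a 16-bit converter demands about 24 dB more
# analog precision -- gains that digital logic gets from scaling are
# hard-won in analog circuits.
for n in (8, 12, 16):
    print(n, "bits:", round(ideal_adc_snr_db(n), 2), "dB")
```

Each bit doubles the number of quantization levels, so the required analog accuracy grows exponentially with resolution while digital density grows exponentially with time, which is the disparity the authors highlight.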