Why GPUs?
NVIDIA Tesla Fermi C2050 GPU
Graphics Processing Units (GPUs) were originally designed to accelerate the large number of multiply and add
computations performed in graphics rendering. Packaged as a video card attached to the PCI bus, they
offloaded this numerically intensive computation from the central processing unit (CPU). As the demand for
high performance graphics grew, so did the GPU, eventually becoming far more powerful than the CPU for this kind of arithmetic.
The simulation of engineering and scientific problems is very closely related to the type
of computation performed for graphics rendering. Both perform a large number of floating point
multiply-add computations. However, there are two significant differences:
- In general purpose science and engineering, each data point is typically stored and processed with
64 bits (8 bytes) of precision. For graphics applications, 32 bits (4 bytes) usually suffice.
The extra precision is needed because solving differential equations depends on the differences between
data points, which are small compared with the values themselves. For graphics applications only the value at each data point is required.
- For graphics, the data is often recomputed several times a second, so an error in a data point is usually not
noticeable. For scientific and engineering applications, the results of one computation are used
for the next computation. As a result, an error in computing the value of one data point will usually
render the analysis useless.
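The effect of precision on the difference between neighboring data points can be illustrated with a small sketch. It is not taken from FMS: the helper `to_float32` and the two sample values are assumptions chosen purely for illustration. The sketch rounds two nearby values to 32-bit storage and compares their difference with the 64-bit result.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Two neighboring data points (illustrative values, not from a real model).
a, b = 1.0000001, 1.0000002
exact_diff = 1.0e-7

# Difference computed from 64-bit values.
diff64 = b - a
# Difference computed from values held in 32-bit storage.
diff32 = to_float32(to_float32(b) - to_float32(a))

rel_err64 = abs(diff64 - exact_diff) / exact_diff
rel_err32 = abs(diff32 - exact_diff) / exact_diff

print(f"64-bit difference: {diff64:.10e}  relative error: {rel_err64:.1e}")
print(f"32-bit difference: {diff32:.10e}  relative error: {rel_err32:.1e}")
```

Each value on its own is accurate to about 7 significant digits in 32-bit storage, yet the 32-bit difference is off by about 19%, because the two values agree in nearly all of the digits that 32-bit storage can hold.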
The NVIDIA Fermi and later GPUs in the Tesla product line are designed specifically for the engineering
and scientific marketplace.
They include native 64-bit precision in data storage, data paths and arithmetic units. In addition, they
have error-correcting memory, which provides the reliability required for long simulations.
The following table compares the processing capability of a current general purpose CPU with
that of a GPU.
| | CPU | GPU |
|---|---|---|
| Number of cores | 4 | 448 |
| Flops per core | 4 | 1 |
| Clock Speed (GHz) | 2.5 | 1.15 |
| Performance (Gflops) | 40 | 515 |
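The bottom row of the table can be reproduced from the rows above it, assuming that "Flops per core" means floating point operations per core per clock cycle (the table does not state this explicitly). The function name `peak_gflops` is hypothetical and used only for this sketch.

```python
def peak_gflops(cores: int, flops_per_core_per_cycle: float, clock_ghz: float) -> float:
    """Peak rate = cores x flops per core per cycle x clock (GHz), in Gflops."""
    return cores * flops_per_core_per_cycle * clock_ghz


# Values copied from the table above.
cpu = peak_gflops(cores=4, flops_per_core_per_cycle=4, clock_ghz=2.5)
gpu = peak_gflops(cores=448, flops_per_core_per_cycle=1, clock_ghz=1.15)

print(f"CPU peak: {cpu:.0f} Gflops")
print(f"GPU peak: {gpu:.0f} Gflops")
print(f"GPU/CPU ratio: {gpu / cpu:.1f}")
```

These are peak figures. The rate sustained by a real application depends on how well it keeps all of the cores busy.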
The large number of processing cores is the key to GPU performance. At the heart is a streaming multiprocessor,
which performs parallel computation on 32 data streams. Each GPU contains 14 to 16 such multiprocessors, for
a total of 448 to 512 cores, depending on the model. Matrix algebra applications, including
FMS, are ideal candidates for this architecture.
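The core counts quoted above follow directly from the number of multiprocessors, as this short sketch shows. It assumes one core per data stream, which is what the quoted totals imply; the names used are hypothetical.

```python
# Each multiprocessor works on 32 data streams in parallel,
# assumed here to mean 32 cores per multiprocessor.
CORES_PER_MULTIPROCESSOR = 32


def total_cores(multiprocessors: int) -> int:
    """Total cores on a GPU with the given number of multiprocessors."""
    return multiprocessors * CORES_PER_MULTIPROCESSOR


for count in (14, 16):
    print(f"{count} multiprocessors -> {total_cores(count)} cores")
```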
Currently, 8 GPUs can be installed in a single system (node). Systems containing thousands of nodes, each with
GPUs, form the architecture of the world's fastest supercomputers.
The architecture of GPUs offers the following benefits:
- Faster processing
Each GPU provides an order of magnitude or more in performance over general purpose CPUs. The result
is faster solution times and the ability to solve larger problems.
- Lower capital cost
GPUs provide an order of magnitude or more in processing power for the same capital cost.
- Reduced power consumption
The efficient architecture of GPUs performs more floating point operations per watt of power consumed.
A Workstation Example
Two NVIDIA GPUs were benchmarked in a workstation. Based on actual performance and costs, the following
chart shows the performance and cost/performance of adding GPUs to a system.
The chart above illustrates two key points:
- GPUs lower computational cost.
For scientific computing, the metric for useful work is the number of floating point operations
(adds or multiplies) performed per second (Flops). A Gigaflop (Gflop) is one billion Flops. A typical workstation
configured for FMS computation costs about $200 per Gflop of performance. GPUs, however,
cost less than $9 per Gflop of performance. The difference is due to the large number of multiply-add
units on the GPU processor. Adding 2 GPUs to the workstation lowered the cost of a Gflop of performance from
$200 to $25. For FMS applications, this can lower machine cost by a factor of 8 or provide
8 times the performance for the same cost. GPUs provide a similar reduction in power consumption, cooling
and space requirements.
- GPUs increase performance.
The performance of the workstation without GPUs was 80 Gflops. The performance with 2 GPUs was 660 Gflops, an
increase of more than a factor of 8. The GPUs extended the performance beyond what is possible with CPUs alone at
any cost. Note that FMS operates the CPUs and GPUs in parallel, so the total performance
includes the contribution from both types of processors.
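The two factors of 8 quoted above can be checked from the figures in the text. The sketch below uses only those figures; the variable names are hypothetical, and no system prices are derived because the text does not give them.

```python
# Figures quoted in the text for the workstation benchmark.
cpu_only = {"gflops": 80.0, "dollars_per_gflop": 200.0}
with_two_gpus = {"gflops": 660.0, "dollars_per_gflop": 25.0}

# How many times faster the workstation became.
speedup = with_two_gpus["gflops"] / cpu_only["gflops"]
# How many times cheaper each Gflop became.
cost_reduction = cpu_only["dollars_per_gflop"] / with_two_gpus["dollars_per_gflop"]

print(f"Performance increase: {speedup:.2f}x")
print(f"Cost per Gflop reduced by: {cost_reduction:.0f}x")
```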
A Server Example
GPUs can extend server performance far beyond that which can be obtained with CPUs alone. The following example
uses a server with 8 CPUs. While several CPU options are available, the numbers shown are an average. The server
achieved 435 Gflops of performance at a cost of $211 per Gflop.
First, 2 GPUs were installed in the PCI slots inside the server. The performance increased to over 1,000 Gflops
(1 Tflop), while the cost/performance improved to $90 per Gflop.
Next, two 1U expansion chassis were added, with 4 GPUs each. These chassis provide the power and cooling required
by the GPUs. The server interface was provided by PCI expansion cards. The resulting performance increased to
2,800 Gflops (2.8 Tflops) and the price/performance improved to $42 per Gflop.
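The three server configurations can be compared side by side with a short sketch. It uses only the figures quoted above. The `Config` type and the configuration labels are hypothetical, and 1,000 Gflops is taken as a lower bound for the second configuration because the text says "over 1,000".

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    gflops: float
    dollars_per_gflop: float


configs = [
    Config("8 CPUs only", 435.0, 211.0),
    Config("8 CPUs + 2 internal GPUs", 1000.0, 90.0),  # "over 1,000" in the text
    Config("8 CPUs + 2 internal + 8 external GPUs", 2800.0, 42.0),
]

baseline = configs[0]
rows = []
for c in configs:
    # Both ratios are relative to the CPU-only server.
    speedup = c.gflops / baseline.gflops
    cost_gain = baseline.dollars_per_gflop / c.dollars_per_gflop
    rows.append((c.name, speedup, cost_gain))
    print(f"{c.name:40s} {speedup:4.1f}x performance, {cost_gain:4.1f}x lower $/Gflop")
```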
This server example also shows the power of GPUs in increasing performance and the benefits of reduced
capital and operational costs.
Copyright © Multipath Corporation