Home

Why DISKS?

DiskDrive
Memory ($50 per Gbyte)
8 to 32 Gbytes per Stick
Limited Maximum Size
DiskDrive
Solid State Disk ($16 per Gbyte)
5 to 10 TBytes per PCIe x16 slot
DiskDrive
Hard Disk ($1 per Gbyte)
900 Gbytes per Drive

A key benefit of FMS is the ability to store the problem data on disk, which significantly lowers cost and extends the problem size. The most frequent question asked is how FMS can match the relatively slow transfer rates from disk with processors computing in the teraflop range. This section answers that question.

FMS may store data directly in memory, on solid state disk or on rotating disk. The choice may be made at runtime, depending on the size of the problem and storage available.

The storage used by FMS to build the matrix and vectors is usually temporary scratch storage, which is deleted after the job completes. Memory and volatile SSD meet this requirement. Non-volatile SSD and disks may also be used, with the files deleted after the job completes. Input problem data and results are usually stored on non-volatile media (disks).

FMS transfers data between disk and memory in large blocks, usually several Gbytes at a time. Large transfers provide the best possible performance including file striping, where several disks are operated in parallel to improve transfer rate.

The algorithms used by FMS are designed to write data once and reread it many times. This reduces wear on SSD and rotating disks. In addition, the read transfer rate is usually faster than the write rate.

The choice of storage media is determined by cost and capacity. The following chart shows the storage cost options for full complex matrices up to 1,000,000 equations.

Storage Cost

For small problems FMS may operate directly on data stored in memory. This storage may be in memory allocated by FMS or arrays in your application program.

As the problem size increases, a point is reached where direct storage in memory is not an option. The total amount of memory available is limited by the size of the memory stick and the number of memory sockets in the system. The memory size may also be limited by cost. A system fully populated with the latest memory sticks may cost up to an order of magnitude more than a system with minimal memory.

For larger problems FMS offers the option of storing data on block oriented devices including solid state or rotating disks. These devices provide significant cost savings over memory. More importantly they may easily be expanded to handle any problem size.

FMS transfers data between memory and disk in parallel with processing (asynchronous I/O). This allows the processors to run at full speed as if the entire data were stored directly in memory.

The key to matching slower disk transfer rates with fast processing rates is data reuse. If a piece of data is moved from disk to memory and used in 1000 calculations, then the disk transfer rate can be 1000 times slower than the rate required by the processors.

The amount of reuse required can be computed from the disk transfer rate, the compute rate and the amount of memory available.

Reuse Example

The most frequent computation performad by FMS is the matrix multiply, as shown in the following figure:

Matrix Multiply

Ci,j = N

k=1
Ai,k Bk,j
To compute a single term in the output matrix Ci,j requires the dot product of a row of the first matrix Ai and a column of the second matrix, Bj. Each dot product requires N multiplies and N additions for a total of 2N floating point operations. Since [C] is an N by N matrix, there are N2 dot products to compute. Therefore the total numbeer of floating point operations is 2N3.

For complex data 4 times as much computation is required because the real, imaginary and their two cross products must be computed. Therefore to multiply complex matrices, 8N3 floating point operations are required.

The table below summarizes the number of floating point operations required to compute a matrix multiply for both real and complex data types.

Real Complex
Flops to multiply 2N3 8N3

We want to use the above computation as part of computing a sequence of matrix multiplies, where the output matrix [C] is updated several times,

[C] = N

k=1
[A]i [B]i

While one value of [C] is being computed we want to transfer in the next [A] and [B]. This requires 2N2 words for real data and 4N2 words for complex data.

The table below summarizes the number of floating point operations required, the amount of data transferred and the number of times each data word is reused.

Real Complex
Flops to multiply 2N3 8N3
Words to transfer 2N2 4N2
Reuse per word N 2N

If the matrices are sufficiently large, they may be stored on disk, transferred to memory concurrently with processing and no processing delays will occur.

If we define the compute rate and transfer rate as

C=Compute Rate(Flops/sec.)
D=Transfer Rate(Bytes/Sec.)

then for double-precision 8-byte words the ratio of compute time to transfer time is:

Real Complex
Time to multiply (sec.) (2N3)/C (8N3)/C
Time to transfer (sec.) (16N2)/D (32N2/D
Compute/IO Ratio ND/8C ND/4C
N (Compute/IO Ratio=1) 8C/D 4C/D

If the Compute/IO ratio is treater than 1, then the process will be compute bound and no time will be spent waiting for I/O to complete. The ratio gives the amount the computing can be increased without changing the I/O system.

If the compute/IO ratio is less than 1, then the process will be I/O bound and time will be spent waiting for I/O to complete. In this case either the I/O performance should be increased or the size of the matrices being computed N should be increased, or some combination of the two.

The amount of memory required can be computed from the size of the matrices, N. Storage must be provided for 5 matrices: [C], [A] and [B] being computed and the next [A] and [B] being transferred in. The following table summarizes the memory requirements:

Real Complex
Memory Reqired(bytes) 40N2 80N2

The amount of memory required for complex data can be shown as a function of the compute rate and disk transfer rate as follows:

Memory Required

The above chart shows the minimum amount of memory required by FMS to balance processing with I/O for complex data type. Twice this amount of memory is required for real data type. FMS automatically uses additional memory to reduce the number of I/O transfers performed. Some additional memory should be allowed for the operating system and problem data.



Home
Copyright © Multipath Corporation