« Fastest Communication - 1 | Main | My Eureka Moments »
October 11, 1986
Fastest Communication - 2
Optimized Communication
With message-passing technique, I could utilize specialized communication primitives that will enable me to achieve the maximum network data transfer capacity.
I have 65,536 processors. The polyshift routine allows multidirectional array communication which is faster than shift routines powered by unidirectional array communication. Also, when the processing nodes are located outside the communication network, it may be possible to overlap the communication with the computation.
Microcoded Routines
Rewriting the computation-intensive segments of the code in assembly language is necessary to obtain the top performance. My goal is to keep much of the required operands within the cache and to perform as many arithmetical operations, as possible, at once.
If I can use specialized microcoded routines from the hardware's mathematical library, then I can increase my computational rates because I will be making minimum memory references while chaining multiplication with addition. Also, the arithmetical coprocessor(s) at each processing nodes could be pipelined.
P.S.
For example, the i860 microprocessor (to be used in the Delta message-passing computer manufactured by Intel) will have the following four pipelines: the multiplier, the adder, the external memory loader, and the graphics unit.
Either scalar or pipelined instructions can be issued. In an N-stage pipeline, the scalar mode will yield new results every N cycles and the pipelined mode will yield new results every cycle. Hence, if the unit is, say M megahertz, the pipelined multiplication-addition operation will be performed at the rate of 2M megaflops, instead of XXXX megaflops.
For the i860 microprocessor, M=33 megahertz and N=3, and the pipelined multiply-add operation will be 6 times faster that the scalar computation.
The interaction between the neighboring grid points of the explicit finite difference approximations of partial differential equations is local. Hence, the value of the pressure, or any dependent variables, at each grid point is computed as the weighted average of all dependent variables at that grid point and at its neighbors. As a result the use of explicit finite difference approximation to compute the solution at a new time level will always require that a scalar element be used to multiply/divide a variable array element followed by addition/multiplication operation. This type of operation is called a fetch-multiply-add operation. Fetching the values of the dependent variables from the neighboring grid points requires internode communication that does not have to be performed sequentially. The reason is that each node has several communication channels that can be used simultaneously. Suppose the virtual processors are given identification numbers similar to that of the grid points.
Data- Or Control-Parallel?
Since numerical approximations of partial differential equations are inherently synchronous and parallel, the programmer should first try the data-parallel approach. If it does'nt work, the programmer should try again before trying the control-parallel programming approach. I have good reasons for making the above suggestion. First, data-parallel programs have a single thread of control and are consequently easier to develop and debug. Note that due to the greater expressive power of Fortran 90, the Fortran 90 code is about three times shorter than the Fortran 77 version. This difference in size will get even larger as the number of dimensions is increased. Furthermore, the Fortran 90 version is more intuitive since its architecture looks like that of the original problem. On a serial computer, the execution time of Fortran 77 and Fortran 90 codes are the same since the computation on both versions is still carried out sequentially. However, the Fortran 90 version is preferable since it is straightforward and concise. The development and debugging times will be greatly reduced since the Fortran 90 version is easier to understand, write, and maintain.
Second, data-parallel programs uses the memory more efficiently than the control-parallel programs and therefore could allow the programmer to solve largersize problems. The reason is that the data-parallel program stores the entire micromodel within the processing nodes while the control-parallel program only stores the needed computation-intensive segment within each processing node.
The control-parallel approach could be used when the model requires the use of several conditional statements or requires that different governing equations be used in different portions of the domain, as in the case of the laminar-turbulent porous media flow explained in Chapter
Why I Prefer Power-of-2 Computations
In my discussion of the properties of the hypercubes and hypertrees, I stated that subcubes of a hypercube computer and subtrees of a hypertree computer will always have a power-of-2 processing nodes. Hence, a power-of-2 array elements would yield the high computational rates. When an array with an arbitrary number of elements are used in the CM-2, the arrays are padded up with garbage data until it contains a power-of-2 elements. The garbage results are thrown out later and the overall execution time is reduced. At first, it might seem better to use context flags to avoid the computation on the garbage data. However, this approach takes longer since it is faster to compute-then-ignore-results than to use the context flag to check-before-computing.
A penalty is paid during interprocessor communications since every array element is checked for garbage data before communicating and this checking slows down the rate of communication. Therefore, array padding yields worse results for communication-intensive problems.
The worst performance that are a padded computation could times the full processing power.
Posted by emeagwali at October 11, 1986 10:32 AM