
DO I=2,N,2
    A(I)   = A(I-1) + B(I)
    A(I+1) = A(I-1) + B(I) + B(I+1)
ENDDO

The speed increase on a workstation won’t be great (most machines run the recast loop more slowly). However, some parallel computers can trade off additional calculations for reduced dependency and chalk up a net win.

Antidependencies

It’s a different story when there is a loop-carried antidependency, as in the code below:


DO I=1,N
    A(I) = B(I) * E
    B(I) = A(I+2) * C
ENDDO

In this loop, there is an antidependency involving the array A: the second statement reads A(I+2), and the first statement redefines that same element two iterations later. That is, you must be sure that the instruction that uses A(I+2) does so before the later iteration redefines it. Clearly, this is not a problem if the loop is executed serially, but remember, we are looking for opportunities to overlap instructions. Again, it helps to pull the loop apart and look at several iterations together. We have recast the loop by making many copies of the first statement, followed by copies of the second:


A(I)   = B(I)   * E
A(I+1) = B(I+1) * E
A(I+2) = B(I+2) * E
...
B(I)   = A(I+2) * C  ← assignment makes use of the new
B(I+1) = A(I+3) * C    value of A(I+2); incorrect
B(I+2) = A(I+4) * C

The reference to A(I+2) needs to access an “old” value, rather than one of the new ones being calculated. If you perform all of the first statement followed by all of the second statement, the answers will be wrong. If you perform all of the second statement followed by all of the first statement, the answers will also be wrong. In a sense, to run the iterations in parallel, you must either save the A values to use for the second statement or store all of the B values in a temporary area until the loop completes.
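For instance, here is a minimal sketch of the first option, saving the needed “old” values of A in a scratch array before the main loop runs. The name TMP is ours, and we assume A is dimensioned to at least N+2 and TMP to at least N. Both resulting loops are then free of loop-carried dependencies:


! Capture the old values of A before any of them are overwritten.
DO I=1,N
    TMP(I) = A(I+2)
ENDDO
! Every iteration of each loop is now independent of the others.
DO I=1,N
    A(I) = B(I) * E
    B(I) = TMP(I) * C
ENDDO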

We can also directly unroll the loop and find some parallelism:


1 A(I)   = B(I)   * E
2 B(I)   = A(I+2) * C  →
3 A(I+1) = B(I+1) * E  |  Antidependency
4 B(I+1) = A(I+3) * C  |
5 A(I+2) = B(I+2) * E  ←
6 B(I+2) = A(I+4) * C

Statements 1–4 could all be executed simultaneously. Once those statements have completed execution, statements 5 and 6 (and, in a deeper unrolling, statements 7 and 8) could execute in parallel. Using this approach, there are sufficient intervening statements between the dependent statements that we can see some parallel performance improvement from a superscalar RISC processor.

Output dependencies

The third class of data dependencies, output dependencies, is of particular interest to users of parallel computers, particularly multiprocessors. Output dependencies involve getting the right values to the right variables when all calculations have been completed; if a reordering causes the wrong one of two writes to the same variable to finish last, an output dependency is violated. The loop below assigns new values to two elements of the vector A with each iteration:


DO I=1,N
    A(I)   = C(I) * 2.
    A(I+2) = D(I) + E
ENDDO

As always, we won’t have any problems if we execute the code sequentially. But if several iterations are performed together, and statements are reordered, then incorrect values can be assigned to the last elements of A. For example, in the naive vectorized equivalent below, A(I+2) takes the wrong value because the assignments occur out of order:


A(I)   = C(I)   * 2.
A(I+1) = C(I+1) * 2.
A(I+2) = C(I+2) * 2.
A(I+2) = D(I)   + E  ← Output dependency violated
A(I+3) = D(I+1) + E
A(I+4) = D(I+2) + E

Whether or not you have to worry about output dependencies depends on whether you are actually parallelizing the code. Your compiler will be conscious of the danger, and will be able to generate legal code — and possibly even fast code, if it’s clever enough. But output dependencies occasionally become a problem for programmers.
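One way to see what “legal code” must preserve is to trace which writes survive sequential execution of the loop above: every element A(I+2) assigned by the second statement is overwritten two iterations later by the first statement, except on the final two trips. A dependency-free rewrite is sketched below, assuming N is at least 2 and A is dimensioned to at least N+2:


! The first statement's writes win everywhere except the top two
! elements of A, which only the second statement ever touches.
DO I=1,N
    A(I) = C(I) * 2.
ENDDO
A(N+1) = D(N-1) + E
A(N+2) = D(N)   + E

Both pieces are now trivially parallel; this is essentially the transformation a clever compiler has to discover in order to vectorize the original loop safely.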

Dependencies within an iteration

We have looked at dependencies that cross iteration boundaries, but we haven’t looked at dependencies within the same iteration. Consider the following code fragment:


DO I = 1,N
    D    = B(I) * 17
    A(I) = D + 14
ENDDO

When we look at the loop, the variable D has a flow dependency. The second statement cannot start until the first statement has completed. At first glance this might appear to limit parallelism significantly. When we look closer and manually unroll several iterations of the loop, the situation gets worse:


D      = B(I)   * 17
A(I)   = D + 14
D      = B(I+1) * 17
A(I+1) = D + 14
D      = B(I+2) * 17
A(I+2) = D + 14

Now, the variable D has flow, output, and antidependencies. It looks like this loop has no hope of running in parallel. However, there is a simple solution to this problem at the cost of some extra memory space, using a technique called promoting a scalar to a vector. We define D as an array with N elements and rewrite the code as follows:


DO I = 1,N
    D(I) = B(I) * 17
    A(I) = D(I) + 14
ENDDO

Now the iterations are all independent and can be run in parallel. Within each iteration, the first statement must run before the second statement.
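If the loop is parallelized with compiler directives rather than restructured by hand, the same effect can be achieved without the extra array. As a sketch, assuming an OpenMP-capable compiler, the PRIVATE clause gives each thread its own copy of the scalar D, which removes the output and antidependencies between iterations:


!$OMP PARALLEL DO PRIVATE(D)
DO I = 1,N
    D    = B(I) * 17
    A(I) = D + 14
ENDDO
!$OMP END PARALLEL DO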

Reductions

The sum of an array of numbers is one example of a reduction — so called because it reduces a vector to a scalar. The following loop to determine the total of the values in an array certainly looks as though it could be run in parallel:


SUM = 0.0
DO I=1,N
    SUM = SUM + A(I)
ENDDO

However, if we perform our unrolling trick, it doesn’t look very parallel:


SUM = SUM + A(I)
SUM = SUM + A(I+1)
SUM = SUM + A(I+2)

This loop also has all three types of dependencies and looks impossible to parallelize. If we are willing to accept the potential change in rounding that comes from reordering the additions, we can add some parallelism to this loop as follows (again, we did not add the preconditioning loop):


SUM0 = 0.0
SUM1 = 0.0
SUM2 = 0.0
SUM3 = 0.0
DO I=1,N,4
    SUM0 = SUM0 + A(I)
    SUM1 = SUM1 + A(I+1)
    SUM2 = SUM2 + A(I+2)
    SUM3 = SUM3 + A(I+3)
ENDDO
SUM = SUM0 + SUM1 + SUM2 + SUM3

Again, this is not precisely the same computation, but all four partial sums can be computed independently. The partial sums are combined at the end of the loop.
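Reductions are common enough that parallelizing compilers and directive sets recognize them directly. As a sketch, assuming an OpenMP-capable compiler, the REDUCTION clause asks for exactly this partial-sum treatment, with one partial sum per thread combined at the end:


SUM = 0.0
!$OMP PARALLEL DO REDUCTION(+:SUM)
DO I=1,N
    SUM = SUM + A(I)
ENDDO
!$OMP END PARALLEL DO

As with the hand-coded version, the order of the additions changes, so the result may differ in the last bits from the sequential sum.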

Loops that look for the maximum or minimum elements in an array, or multiply all the elements of an array, are also reductions. Likewise, some of these can be reorganized into partial results, as with the sum, to expose more parallelism. Note that maximum and minimum are associative operators, so the results of the reorganized loop are identical to those of the sequential loop; a sketch appears below.
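For instance, here is a minimal sketch of a two-way partial-result maximum. The variable names are ours, and for brevity we assume N is even; an odd N would need a short cleanup step:


XMAX1 = A(1)
XMAX2 = A(2)
! The two partial maxima are independent and can be computed in parallel.
DO I=3,N,2
    XMAX1 = MAX(XMAX1, A(I))
    XMAX2 = MAX(XMAX2, A(I+1))
ENDDO
XMAX = MAX(XMAX1, XMAX2)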

Source:  OpenStax, High performance computing. OpenStax CNX. Aug 25, 2010 Download for free at http://cnx.org/content/col11136/1.5