
Some theoretical work has taken a data-parallel-like approach to designing algorithms, allowing the number of processors to increase with the size of the data. For example, consider summing the elements of an array. Sequentially, we would write this as

x = 0
for i = 1 to n
    x = x + a[i]
end
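For concreteness, here is a minimal C rendering of the same loop; the double element type and the 0-based indexing are assumptions, since the pseudocode above is not tied to any particular language.

double array_sum(const double *a, long n)
{
    double x = 0.0;
    for (long i = 0; i < n; i++)    /* "for i = 1 to n", shifted to 0-based */
        x = x + a[i];
    return x;
}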

This sequential loop runs in O(n) time. To run the computation in parallel, consider n processors, numbered 1 to n, with processor i holding element a[i]. We might then perform the computation as a data-parallel tree sum, with all nodes at the same level of the tree operating in parallel.

We use “forall” to denote parallel execution in all examples, although there are many other constructs.

step = 1
forall i = 1 to n do
    xtmp[i] = a[i]
end
while step < n do
    forall i = 1 to n-step by 2*step do
        xtmp[i] = xtmp[i] + xtmp[i+step]
    end
    step = step * 2
end
x = xtmp[1]

This takes O(log n) time, which is optimal in parallel. Many such algorithms and bounds are known, and complexity classes (analogous to P and NP) can be constructed describing “solvable” parallel problems.
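As a hedged illustration, the tree sum above might be written in C with OpenMP, using a parallel for in place of forall; the choice of OpenMP, the double element type, and the 0-based indexing are assumptions, not part of the original pseudocode.

#include <stdlib.h>

/* Sketch of the data-parallel tree sum; compile with an OpenMP-capable
 * compiler (e.g., with -fopenmp).  Each pass of the outer loop combines
 * pairs of partial sums that are `step` apart, halving the number of
 * live values. */
double tree_sum(const double *a, long n)
{
    if (n <= 0)
        return 0.0;

    double *xtmp = malloc((size_t)n * sizeof *xtmp);

    #pragma omp parallel for                 /* forall i = 1 to n */
    for (long i = 0; i < n; i++)
        xtmp[i] = a[i];

    for (long step = 1; step < n; step *= 2) {
        #pragma omp parallel for             /* forall i = 1 to n-step by 2*step */
        for (long i = 0; i < n - step; i += 2 * step)
            xtmp[i] += xtmp[i + step];
    }

    double x = xtmp[0];                      /* x = xtmp[1] in 1-based terms */
    free(xtmp);
    return x;
}

Each "#pragma omp parallel for" plays the role of one forall step. In practice a single parallel loop with a reduction(+:x) clause would be the idiomatic OpenMP way to sum an array; the explicit tree is shown here only to make the O(log n) structure visible.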

Other algorithmic work has concentrated on mapping algorithms onto a more limited set of parallel processors. For example, the above algorithm might be mapped onto p processors in the following way.

forall j = 1 to p do
    lo[j] = 1 + (j-1)*n/p
    hi[j] = j*n/p
    xtmp[j] = 0
    for i = lo[j] to hi[j] do
        xtmp[j] = xtmp[j] + a[i]
    end
    step[j] = 1
    barrier_wait()
    while step[j] < p do
        if j+step[j] <= p and j mod (step[j]*2) = 1 then
            xtmp[j] = xtmp[j] + xtmp[j+step[j]]
        end
        step[j] = step[j] * 2
        barrier_wait()
    end
end
x = xtmp[1]

Note that the barrier synchronizations are necessary to ensure that no processor j runs ahead of the others, which would cause some updates of xtmp[j] to use the wrong data values. The time for this algorithm is O(n/p + log p). This is optimal for operation count, but not necessarily for the number of synchronizations.
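Again purely as a hedged sketch, the p-processor version might look as follows in C with OpenMP, using omp_get_thread_num() for the processor number j and "#pragma omp barrier" for barrier_wait(); the OpenMP choice and the 0-based indexing are assumptions.

#include <stdlib.h>
#include <omp.h>

/* Each of the p threads sums its own block of a[] into xtmp[j], then the
 * threads cooperate in a log p step tree reduction over the partial sums.
 * The barriers keep any thread from reading a partial sum that its partner
 * has not finished writing. */
double blocked_tree_sum(const double *a, long n, int p)
{
    double *xtmp = malloc((size_t)p * sizeof *xtmp);

    #pragma omp parallel num_threads(p)
    {
        int  j  = omp_get_thread_num();      /* 0 .. p-1 */
        long lo = (long)j * n / p;           /* this thread's block: [lo, hi) */
        long hi = (long)(j + 1) * n / p;

        xtmp[j] = 0.0;
        for (long i = lo; i < hi; i++)       /* local partial sum */
            xtmp[j] += a[i];

        #pragma omp barrier                  /* all partial sums are ready */

        for (int step = 1; step < p; step *= 2) {
            if (j + step < p && j % (2 * step) == 0)
                xtmp[j] += xtmp[j + step];
            #pragma omp barrier              /* this level of the tree is done */
        }
    }

    double x = xtmp[0];
    free(xtmp);
    return x;
}

The cost matches the bound above: each thread performs about n/p additions on its block, followed by log p reduction levels, each ending in one barrier.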

Many other aspects of parallel algorithms deserve study. For example, the design of algorithms for other important problems is a vast subject, as is describing general families of parallel algorithms. Analyzing algorithms with respect to time, memory, parallel overhead, locality, and many other measures is important in practice. Practical implementation of algorithms, and measuring those implementations, is particularly useful in the real world. Describing the structure of a full parallel application provides an excellent guidepost for future developers. The Open Education Cup invites entries on all these aspects of parallel algorithms and applications, as well as other relevant topics.

Parallel software tools

Designing a parallel algorithm is complex, but turning it into a correct and efficient parallel program is even harder. We mention just a few of the difficulties in the following list.

  • All the bugs that can occur in sequential programming – such as logic errors, dangling pointers, and memory leaks – can also occur in parallel programming. Their effects may be magnified, however. As one architect of a grid computing system put it, “Now, when we talk about global memory, we really mean global.”
  • Missing or misplaced synchronization operations are possible on all MIMD architectures. Often, such errors lead to deadlock, a condition in which a set of processors wait forever for each other.
  • Improper use of synchronization can allow one processor to read data before another has initialized it. Similarly, poor synchronization can allow two processors to update the same data in the wrong order. Such situations are called "race conditions", since the processors are racing to access the data; a minimal example appears after this list.
  • Even with proper synchronization, poor scheduling can prevent a given processor from ever accessing a resource that it needs. Such a situation is called starvation.
  • Correct programs may still perform poorly because one processor has much more work than others. The time for the overall application then depends on the slowest processor. This condition is called load imbalance.
  • Processors may demand more out of a shared resource, such as a memory bank or the interconnection network, than it can provide. This forces some requests, and thus the processors that issued them, to be delayed. This situation is called contention.
  • Perhaps worst of all, the timing between MIMD processors may vary from execution to execution due to outside events. For example, on a shared machine there might be additional load on the interconnection network from other users. When this happens, any of the effects above may change from run to run of the program. Some executions may complete with no errors, while others produce incorrect results or never complete. Such “Heisenbugs” (named for Werner Heisenberg, the discoverer of the uncertainty principle in quantum mechanics) are notoriously difficult to find and fix.
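As promised in the race-condition item above, here is a minimal, hypothetical C/OpenMP program that exhibits both a race condition and Heisenbug-like behavior: two threads increment a shared counter with no synchronization, so updates can be lost and the printed value can change from run to run.

#include <stdio.h>

int main(void)
{
    long counter = 0;

    #pragma omp parallel num_threads(2)
    {
        for (long i = 0; i < 1000000; i++)
            counter++;       /* unsynchronized read-modify-write on shared data */
    }

    /* With proper synchronization (for example, "#pragma omp atomic" on the
     * update) the answer would always be 2000000; as written, it usually
     * comes out smaller and differs between executions. */
    printf("counter = %ld\n", counter);
    return 0;
}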
Source: OpenStax, 2008-'09 open education cup: high performance computing. OpenStax CNX, Oct 28, 2008. Download for free at http://cnx.org/content/col10594/1.3