
These two example loops show how these iteration scheduling approaches operate when executing with four threads. In the vector-add loop, static scheduling would distribute iterations 1–2500 to Thread 0, 2501–5000 to Thread 1, 5001–7500 to Thread 2, and 7501–10000 to Thread 3. The mapping of iterations to threads for the static scheduling option is shown in [link].

Figure: Iteration assignment for static scheduling. Time runs along the horizontal axis and thread number along the vertical axis; each of the four threads executes one contiguous block of 2500 iterations.

Since the loop body (a single statement) is short and has a consistent execution time, static scheduling should assign roughly the same amount of overall work (and, if you assume a dedicated CPU for each thread, the same amount of time) to each thread per loop execution.

An advantage of static scheduling appears when the entire loop is executed repeatedly. If the same iterations are assigned to the same threads, and those threads happen to be running on the same processors, the cache may still contain the values of A, B, and C from the previous loop execution; a sketch of this repeated-execution case appears after the pseudo-code below. The operating system and runtime library go to some lengths to try to make this happen. This is another reason not to run more threads than there are available processors; the extra threads cause unnecessary context switching. The runtime pseudo-code for static scheduling in the first loop might look as follows:


C VECTOR ADD - Static Scheduled
      ISTART = (THREAD_NUMBER * 2500) + 1
      IEND = ISTART + 2499
      DO ILOCAL = ISTART,IEND
        A(ILOCAL) = B(ILOCAL) + C(ILOCAL)
      ENDDO
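To make the cache-affinity point above concrete, here is a minimal sketch, not from the original text, of the vector add nested inside a hypothetical outer time-step loop (ITIME and NSTEPS are illustrative names). With SCHEDULE(STATIC) and unchanged loop bounds, each thread receives the same block of iterations on every pass, so if the runtime keeps threads on the same processors, the portions of A, B, and C a thread touched last time may still be in that processor's cache:

C REPEATED VECTOR ADD - illustrative sketch; ITIME and NSTEPS
C are assumed names, not part of the original example
      DO ITIME = 1, NSTEPS
C$OMP PARALLEL DO PRIVATE(IPROB) SHARED(A,B,C) SCHEDULE(STATIC)
        DO IPROB = 1, 10000
          A(IPROB) = B(IPROB) + C(IPROB)
        ENDDO
C$OMP END PARALLEL DO
C       On each pass, Thread 0 re-executes iterations 1-2500,
C       Thread 1 re-executes 2501-5000, and so on.
      ENDDO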

The static approach of giving a fixed number of iterations to each thread is not always a good strategy. If it were used for the second loop example, the long and varying iteration times would result in poor load balancing. A better approach is to have each processor simply grab the next value of IPROB at the top of the loop.

That approach is called dynamic scheduling, and it can adapt to widely varying iteration times. [link] shows the mapping of iterations to processors using dynamic scheduling: as soon as a processor finishes one iteration, it takes the next available iteration in order.

Figure: Iteration assignment in dynamic scheduling. Time runs along the horizontal axis and thread number along the vertical axis; each thread picks up whichever iteration is next available when it finishes its current one, so the assignment is interleaved and irregular rather than contiguous.

If a loop is executed repeatedly, the assignment of iterations to threads may vary due to subtle timing issues that affect threads. The pseudo-code for the dynamic scheduled loop at runtime is as follows:


C PARTICLE TRACKING - Dynamic Scheduled
      IPROB = 0
      WHILE (IPROB < 10000)
        BEGIN_CRITICAL_SECTION
          IPROB = IPROB + 1
          ILOCAL = IPROB
        END_CRITICAL_SECTION
        RANVAL = RAND(ILOCAL)
        CALL ITERATE_ENERGY(RANVAL)
      ENDWHILE

ILOCAL is used so that each thread knows which iteration it is currently processing; the shared IPROB value may be altered by the next thread entering the critical section.

While dynamic iteration scheduling works well for this particular loop, there can be a significant negative performance impact if the programmer uses the wrong approach for a loop. For example, if the dynamic approach were used for the vector-add loop, the time spent in the critical section determining which iteration to process could exceed the time needed to actually process the iteration. Furthermore, any cache affinity of the data would be effectively lost because of the essentially random assignment of iterations to processors.

In between these two approaches is a wide variety of techniques that operate on a chunk of iterations at a time. In some techniques the chunk size is fixed; in others it varies during the execution of the loop (a sketch of one varying-chunk policy follows the fixed-chunk example below). In this approach, a chunk of iterations is grabbed each time the critical section is executed. This reduces the scheduling overhead but can have trouble producing a balanced execution time for each processor. The runtime is modified as follows to perform the particle tracking loop example using a chunk size of 100:


      IPROB = 1
      CHUNKSIZE = 100
      WHILE (IPROB <= 10000)
        BEGIN_CRITICAL_SECTION
          ISTART = IPROB
          IPROB = IPROB + CHUNKSIZE
        END_CRITICAL_SECTION
        DO ILOCAL = ISTART,ISTART+CHUNKSIZE-1
          RANVAL = RAND(ILOCAL)
          CALL ITERATE_ENERGY(RANVAL)
        ENDDO
      ENDWHILE
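One common varying-chunk policy, often called guided scheduling, hands out large chunks while much work remains and progressively smaller ones near the end, keeping overhead low without sacrificing the final load balance. The following is a sketch in the same runtime pseudo-code style; the halving policy and the NTHREADS name are illustrative assumptions, not taken from the original:

      IPROB = 1
      WHILE (IPROB <= 10000)
        BEGIN_CRITICAL_SECTION
          ISTART = IPROB
C         Grab a share of the remaining iterations, so chunks
C         shrink as the loop winds down (always at least one)
          CHUNK = MAX((10000 - IPROB + 1) / (2 * NTHREADS), 1)
          IPROB = IPROB + CHUNK
        END_CRITICAL_SECTION
        DO ILOCAL = ISTART,ISTART+CHUNK-1
          RANVAL = RAND(ILOCAL)
          CALL ITERATE_ENERGY(RANVAL)
        ENDDO
      ENDWHILE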

The choice of chunk size is a compromise between scheduling overhead and imbalance at loop termination. Typically the programmer must get involved, through directives, to control the chunk size.
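In OpenMP, for instance, the chunk size is given as an optional second argument to the SCHEDULE clause. A minimal sketch for the particle-tracking loop, using the same chunk size of 100 as the runtime example above:

C PARTICLE TRACKING - dynamic scheduling, chunk size of 100
C$OMP PARALLEL DO PRIVATE(IPROB,RANVAL) SCHEDULE(DYNAMIC,100)
      DO IPROB=1,10000
        RANVAL = RAND(IPROB)
        CALL ITERATE_ENERGY(RANVAL)
      ENDDO
C$OMP END PARALLEL DO

Writing SCHEDULE(GUIDED,100) instead requests the shrinking-chunk behavior sketched earlier, with 100 acting as the minimum chunk size.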

Part of the challenge of iteration distribution is to balance the cost (or existence) of the critical section against the amount of work done per invocation of the critical section. In the ideal world, the critical section would be free, and all scheduling would be done dynamically. Parallel/vector supercomputers with hardware assistance for load balancing can nearly achieve the ideal using dynamic approaches with relatively small chunk size.

Because the choice of loop iteration approach is so important, the compiler relies on directives from the programmer to specify which approach to use. The following example shows how we can request the proper iteration scheduling for our loops:


C VECTOR ADD
C$OMP PARALLEL DO PRIVATE(IPROB) SHARED(A,B,C) SCHEDULE(STATIC)
      DO IPROB=1,10000
        A(IPROB) = B(IPROB) + C(IPROB)
      ENDDO
C$OMP END PARALLEL DO

C PARTICLE TRACKING
C$OMP PARALLEL DO PRIVATE(IPROB,RANVAL) SCHEDULE(DYNAMIC)
      DO IPROB=1,10000
        RANVAL = RAND(IPROB)
        CALL ITERATE_ENERGY(RANVAL)
      ENDDO
C$OMP END PARALLEL DO





Source: OpenStax, High performance computing. OpenStax CNX. Aug 25, 2010. Download for free at http://cnx.org/content/col11136/1.5