
Parallel Musings Part II – Foundation Theory

June 30, 2008

2 – Limits on parallel execution

We learn programming by being introduced to the simplest notion of an algorithm: an ordered set of actions that, when executed one after the other, solves a problem. The catch is that some of us latch onto that sequential-order idea and never let it go. However, not all the actions in an algorithm have to be executed in the given order to achieve the same result.

Some operations are not related to each other (for example, reading and writing different variables), which makes it possible to execute them in any relative order, or even at the same time. Every program (and its underlying problem) has some inherent parallelism. But how far can we go in exploiting it?

2.1 – Amdahl’s law

Gene Amdahl [1] formulated a now classic result about how much performance one can gain by exploiting all the parallelism present in a given algorithm. We start by dividing all the instructions of a program into two groups: P, the possibly-parallel ones, and S, the strictly sequential part. Even with an unlimited number of processors, there is a limit on the speedup (how much faster the parallel version runs compared to the sequential one) we can achieve. This limit is expressed by the following formula, with N being the number of available cores:

speedup = (S + P)/(S + P/N)

Notice that even with an infinite number of cores, the maximum speedup of a program that is half sequential will be at most 2. So, how would the same program behave on a recent dual- or quad-core machine? Assuming the programmer adds all the multithreading spice to it (more on this later), the maximum speedup on a quad-core would be (remember that in this example S = P):

speedup = (S + P)/(S + P/4) = (S + S)/(S + S/4) = 2*S/(5*S/4) = 8/5 = 1.6

See that even on a quad-core machine, as expected, the speedup stays below 2. Also remember that we are not accounting for any overhead or other issues involved in parallelizing the P instructions. This limit is sometimes depressing, but it shows that parallelism is not easy, even if you do know what to do.
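
To make the numbers concrete, here is a minimal C sketch that just evaluates Amdahl's formula for a few core counts (the 50% sequential fraction is only an illustration, not a measurement of any real program):

    #include <stdio.h>

    /* Minimal sketch of Amdahl's law: speedup = (S + P)/(S + P/N).
     * Fractions are normalized so that S + P = 1. */
    int main(void)
    {
        double s = 0.5;             /* sequential fraction (illustrative) */
        double p = 1.0 - s;         /* parallelizable fraction */
        int cores[] = { 1, 2, 4, 8, 1024 };

        for (int i = 0; i < 5; i++) {
            int n = cores[i];
            double speedup = (s + p) / (s + p / n);
            printf("N = %4d  ->  speedup = %.3f\n", n, speedup);
        }
        return 0;
    }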

2.2 – Gustafson’s law

But the previous result does not say it is impossible to do more work with more cores. Amdahl's law only accounts for the speedup of a specific program implementation. We can still write a completely different program, or solve different problems. This is what John Gustafson argued in his 1988 paper [2], and it was also explained by Sutter [3], who proposes approaches such as:

  • Use more data in the parallel parts, solving a bigger problem (Gustafson’s approach; see the sketch after this list);
  • Do different work that happens to be parallel, such as solving problems that were not feasible before (e.g. real-time ray tracing).
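
Gustafson’s observation can be written as a “scaled speedup”: if s is the serial fraction of the time measured on the parallel machine, the scaled speedup is s + (1 - s)*N, which keeps growing with N instead of saturating. The minimal C sketch below contrasts the two views (the 50% serial fraction is again only illustrative, and in practice the two laws measure s on different runs):

    #include <stdio.h>

    /* Fixed-size (Amdahl) versus scaled (Gustafson) speedup for the same
     * illustrative serial fraction s. */
    int main(void)
    {
        double s = 0.5;
        int cores[] = { 2, 4, 8, 16 };

        for (int i = 0; i < 4; i++) {
            int n = cores[i];
            double amdahl    = 1.0 / (s + (1.0 - s) / n);  /* problem size fixed   */
            double gustafson = s + (1.0 - s) * n;          /* problem grows with n */
            printf("N = %2d  Amdahl = %5.2f  Gustafson = %5.2f\n",
                   n, amdahl, gustafson);
        }
        return 0;
    }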

What is being said is that the available power of the extra cores can (and should) be used. The key point is that this power will not come for free, and not all programs can be refactored to use it. In the next section we will look at how the hardware has been engineered to implicitly exploit (or expose) parallelism.

3 – Practical issues

The recent trend towards multicore systems is not based on new parallel models. It is instead a convergence of decades of research in the field, during which several different approaches were tried. From a programmer’s point of view, two concerns matter: data and functions, so one can classify the parallelism expressed in programs based on these two aspects [4]:

  • Functional parallelism – expressed by the divide and conquer approach of most programming paradigms. It is usually small, sparse and irregular;
  • Data parallelism – comes from data structures that allow for parallel manipulation of their elements. It is usually regular and massive (a sketch contrasting both forms follows this list).
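
As a rough illustration of both forms, here is a minimal C sketch (assuming an OpenMP-capable compiler, e.g. gcc -fopenmp; OpenMP is used here only as a convenient notation, not as the only possible tool). The first function is a data-parallel loop over a large array; the second is a task-parallel divide and conquer recursion:

    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N], c[N];

    /* Data parallelism: the same operation applied independently to every
     * element of a large, regular data structure. */
    static void vector_add(void)
    {
    #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    /* Functional (task) parallelism: the two halves of a divide and conquer
     * recursion run as independent tasks. */
    static long fib(int n)
    {
        long x, y;
        if (n < 2)
            return n;
    #pragma omp task shared(x)
        x = fib(n - 1);
    #pragma omp task shared(y)
        y = fib(n - 2);
    #pragma omp taskwait
        return x + y;
    }

    int main(void)
    {
        long f;

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
        vector_add();

    #pragma omp parallel
        {
    #pragma omp single
            f = fib(20);
        }
        printf("c[10] = %.0f, fib(20) = %ld\n", c[10], f);
        return 0;
    }

Note that without the OpenMP flag the pragmas are simply ignored and the program still runs sequentially, which is a nice property of this style of annotation.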

However, not all available parallelism can be exploited in practice, so one must also account for the parallelism actually utilized at runtime. Part of the available functional parallelism can be exploited implicitly by techniques that try to squeeze the most out of each piece of (expressed) sequential code. Most of these techniques can be thought of as ILP (instruction level parallelism). ILP covers fine-grained, often machine-level, instructions that do not have to execute in a mandatory sequential order. The good news is that (good) compilers and also the hardware are capable of handling most of these cases successfully. The bad news is that this form of parallelism does not allow for really noticeable speedups.
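
As a tiny illustration, the C fragment below has four statements that touch disjoint variables, so a compiler or an out-of-order core is free to overlap or reorder them without changing the result; this is exactly the kind of parallelism that gets extracted for us, behind our backs:

    /* Four independent computations: no statement reads a value written by
     * another, so they may execute in any order, or at the same time. */
    void ilp_example(int out[4], int a, int b, int c, int d)
    {
        int x = a + b;
        int y = c * d;
        int z = a - c;
        int w = b ^ d;
        out[0] = x; out[1] = y; out[2] = z; out[3] = w;
    }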

The bread and butter really comes from coarse-grained functional (program or thread level) and data parallelism. Since these forms of parallelism are explicit (the programmer needs to express them directly), one will have to learn about memory models, platforms and libraries to describe data structures and algorithms in a parallel fashion. This is where our journey into parallel architectures really begins. We start out by classifying types of machines based on how they expose data and instructions to the programmer, using the (somewhat) well accepted taxonomy given by Flynn in 1966 (taken from a parallel programming tutorial [5]):

  • SISD – single instruction, single data: the basic model for sequential machines;
  • SIMD – single instruction, multiple data. Processors usually execute the same instructions in lock-step. This is the model for vector machines and processor arrays. Until recently, graphics processors (GPUs) could be classified in this category (a small SIMD sketch follows this list);
  • MISD – multiple instruction, single data. This model is of little use and few machines were ever built around it. A possible application would be to apply different cryptanalysis algorithms to the same message (data);
  • MIMD – multiple instruction, multiple data. This model accounts for almost all parallel machines nowadays, such as SMPs (symmetric multiprocessors), multicores (a kind of SMP), massively parallel machines (most clusters), GRIDs, and modern GPUs.
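
To make the SISD/SIMD distinction concrete, here is a minimal C sketch (assuming an x86 processor with SSE and a compiler that ships xmmintrin.h): the first loop adds one pair of floats per instruction, the second adds four pairs with a single instruction:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* SISD style: one addition at a time. */
    void add_sisd(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* SIMD style: _mm_add_ps adds four floats with a single instruction. */
    void add_simd(float *dst, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
        for (; i < n; i++)   /* leftover elements, handled one by one */
            dst[i] = a[i] + b[i];
    }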

Again, from the programmer’s point of view, there is another layer on top of this one that has to be taken into consideration: the memory model. Given that we have more than one processor and some amount of available memory, the memory model describes how much of that memory is directly addressable from each processor. Shared memory architectures allow all processors to access all memory addresses directly. This access can be uniform (UMA – uniform memory access), where latency and bandwidth are the same for every processor; SMP and multicore machines fall in this category. Non-uniform memory access (NUMA) still provides direct addressing of the whole memory, but latency and bandwidth vary depending on data locality, such as a CPU accessing its own RAM versus the video RAM on the GPU. Figure 1 shows an illustration of this pattern.

Figure 1: Shared memory (taken from [5])
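
A minimal shared-memory sketch using POSIX threads (compile with -pthread): both threads read and write the very same address, which is the whole point of the model, and also why the access has to be synchronized:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                                  /* shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            counter++;               /* same memory address, seen by every thread */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* always 200000 */
        return 0;
    }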

Distributed memory models cover all those architectures where there is no direct addressing of all available memory. Most clusters and GRIDs fall into this category. Figure 2 illustrates this model.

Figure 2: Distributed memory (taken from [5])
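
And a minimal distributed-memory sketch (assuming an MPI implementation such as Open MPI or MPICH, run with something like mpirun -np 2): each process addresses only its own memory, so data moves exclusively through explicit messages:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                  /* exists only in process 0's memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }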

Some real-world scenarios, however, do not fit strictly into only one of these models. Most clusters now use SMP or multicore machines, ending up as hybrid architectures. Also, the IBM Cell processor (used in the Sony PlayStation 3) has one main processor that addresses all available memory in a NUMA pattern, while each synergistic processing element accesses only its own local memory (a distributed model). Figure 3 shows this practical scenario.

Figure 3: Hybrid memory (taken from [5])

It is also important to mention that while some software platforms (as we will see in the next part) hide these architectural details from the programmer (for instance, by allowing global addressing on a distributed architecture, or message passing on shared memory), knowing the internals of the hardware platform is still mandatory if one wants to squeeze the most performance out of it.

In the next part we will discuss programming models and languages used to exploit exposed parallelism. Some subjective thoughts will also be offered on team and individual approaches to learning parallel programming.

References

[1] Gene Amdahl. “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities” (AFIPS Conference Proceedings, Vol. 30, AFIPS Press, April 1967).

[2] John Gustafson. “Reevaluating Amdahl’s Law” (Communications of the ACM, 31(5), 1988).

[3] Herb Sutter. “Break Amdahl’s law! – here is a law you should break early and often” (Dr. Dobb’s Journal, 2008).

[4] Eugene Vinod Rebello. “Parallel architecture lecture notes” (IC-UFF, 2008).

[5] Blaise Barney. “Introduction to parallel computing” (HPC – Lawrence Livermore National Laboratory, 2007).
