
Parallel Musings Part III – Languages and Tools

July 1, 2008

4 – Programming Models

Given that current architecture trends have already converged to MIMD machines with shared (SMP and multicore) or hybrid (GRIDs and clusters) memory models, how does this affect the way programmers do their work? Until the beginning of this century, all these concepts and concerns were of little interest to the mainstream enterprise programmer (code monkey). This lack of interest is mostly due to the fact that, if the software was slow, new technology could be counted on to arrive within a couple of years and make it run smoothly.

There is a common misconception about the well-known Moore’s law, often read as a promise of ever-faster software. What it actually envisions is that the transistor count of a chip doubles roughly every 18 months. This “rule” still holds, but the new transistors are, more and more, being spent on the growing complexity of the processors themselves. In addition, higher clock rates increase power consumption and, consequently, the heat generated. These issues have forced the mainstream CPU industry to expose coarse-grained parallelism (the current multicore systems). Until recently, mainstream CPUs only exploited instruction-level parallelism (ILP) in their superscalar designs. In this part we explain the programming models available today to deal with this coarse-grained parallelism.

We will consider three parallel programming models, namely:

  • Threads
  • Message passing
  • Data-parallel

These models are abstractions over the CPU and memory architectures discussed so far and are not tied to any particular one of them. However, it is important to keep the basic assumptions of the underlying architecture in mind in order to get the most out of each model.

4.1 – Threads

Threads are (almost) independent execution paths inside a program. It is common to relate them to subroutines inside a given algorithm: given two subroutines that do not depend on one another (they operate over different data), we can safely run them in parallel. Threads share the same address space, so one has to be extra careful when assuming that they are independent. This is also one of the strengths of the model, since the shared view of memory makes communication between threads easier (e.g., through global variables).
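As a minimal sketch using POSIX Threads, here are two hypothetical subroutines that touch disjoint data and can therefore run in parallel; the results are exchanged through the shared address space (the computations themselves are placeholders):

```c
#include <pthread.h>
#include <stdio.h>

/* Two hypothetical subroutines that operate over different data,
 * so they can safely run in parallel. */
static int sum_result, product_result;

static void *compute_sum(void *arg)
{
    sum_result = 0;
    for (int i = 1; i <= 100; i++)
        sum_result += i;
    return NULL;
}

static void *compute_product(void *arg)
{
    product_result = 1;
    for (int i = 1; i <= 10; i++)
        product_result *= i;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    /* Each subroutine becomes an independent execution path. */
    pthread_create(&t1, NULL, compute_sum, NULL);
    pthread_create(&t2, NULL, compute_product, NULL);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Thanks to the shared address space, main() can read both results
     * directly from the global variables. */
    printf("sum = %d, product = %d\n", sum_result, product_result);
    return 0;
}
```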

Even when two threads operate (partially) over the same data, it is possible to benefit from parallel execution in the parts that do not access this shared state. Race conditions, situations in which two or more threads operate over the same data (communication) without coordination, are known to lead to unpredictable results. To avoid them, several hardware and software constructs (TSL instructions, semaphores, monitors, locks) have been developed over the last decades. Even so, the fact that these constructs are not as easy to grasp as sequential algorithms (along with the lack of a standard model) has kept adoption restricted to fairly small groups (several software development teams make good use of threads, but not the mainstream code monkeys).
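To make the race condition concrete, here is a small sketch in which two threads increment a shared counter; the pthread mutex plays the role of the lock constructs mentioned above (the counter and iteration count are hypothetical):

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared state */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        /* Without the lock, the read-modify-write below would race with
         * the other thread and the final count would be unpredictable. */
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* always 2000000 with the lock */
    return 0;
}
```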

4.2 – Message passing

The message passing model provides a single, well-accepted construct for communication between distributed units of work. These units (processes) can be spread over a cluster, a GRID or even a single node. To communicate, processes use two primitives: send and receive. Each send issued by one process must be matched by a receive in another in order to complete.

To the programmer, message passing is exposed as a set of library functions for the most usual programming languages (such as Fortran, C or Java). In recent years, the most common of these libraries has become MPI (Message Passing Interface). MPI (and, to some extent, message passing in general) is well accepted by the high-performance computing (HPC) community, but remains relatively unknown outside these research groups.
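A minimal sketch of the send/receive pairing using MPI in C (a two-process run passing a single integer; the tag and payload value are arbitrary):

```c
#include <mpi.h>
#include <stdio.h>

/* Run with: mpirun -np 2 ./a.out
 * Process 0 sends one integer to process 1; the send only completes
 * because process 1 posts a matching receive. */
int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                    /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```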

4.3 – Data parallel

The data-parallel model targets domains where the data (information) comes as regular sets of considerable size. Tasks operate (depending on the hardware architecture) concurrently on the same data set, but over different partitions. Notice that on shared memory architectures the whole data set is accessible to all tasks, while on distributed memory the partitions can be spread over the nodes (which shows that the programming model suits different architectures).
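As a minimal shared-memory sketch (using OpenMP, which is not discussed in this post but keeps the example short), every iteration below touches a different element, so the runtime can hand each thread its own partition of the array:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* hypothetical data set size */

int main(void)
{
    static double data[N];

    /* The whole array is visible to every thread (shared memory), but the
     * runtime splits the iteration space so that each thread works on its
     * own partition of the data set. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = i * 0.5;

    printf("processed %d elements using up to %d threads\n",
           N, omp_get_max_threads());
    return 0;
}
```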

In practice, not all problems (domains) map well to this model, but those that do are very important: weather simulation, financial analysis and computer graphics, among others. It is also interesting to notice that in recent years the computer games market has led to the development of specialized SIMD (and later MIMD) hardware (GPUs) whose programming model is data parallel. This hardware is now also suitable for general-purpose computing.

It is interesting to notice that these models can be combined in a hybrid fashion. For instance, one can use distributed MPI processes, each with multiple threads driving data-parallel GPUs, to solve one single (and huge) problem. However, all these performance wonders still seem distant from the mainstream programmer. Let's now take a look at how we can solve problems with parallel thinking.

5 – Parallel problem solving

Ideally, we would create sequential algorithms and tools would identify opportunities for parallelism and reorganize the code accordingly, making it benefit from the new architectures. This is the approach of hardware that provides ILP (instruction-level parallelism) and of some (good) compilers. However, there are very important caveats:

  • Depending on the case, performance can even degrade;
  • It is limited to a small subset of the code (fine grained – local instruction reordering and/or loop unrolling, as in the sketch after this list);
  • It is much less flexible than manual parallelization.
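To give an idea of how narrow this fine-grained scope is, here is a hypothetical element-wise loop and a manually unrolled version of the kind a compiler can produce on its own; the transformation never reaches beyond the single loop:

```c
/* Hypothetical example: element-wise vector addition, the kind of loop a
 * compiler can optimize by itself through fine-grained transformations. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* The same loop unrolled by a factor of 4 (assuming n is a multiple of 4).
 * The four independent additions per iteration give a superscalar core more
 * instructions to schedule at once, but nothing outside this loop benefits. */
void vec_add_unrolled(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}
```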

There is still no substitute for a good understanding of the problem domain, especially for problems that can benefit from coarse-grained parallelism, such as data-parallel ones or independent chunks of code/data communicating through message passing. The first step is always to examine the problem and the algorithm currently used to solve it, paying special attention to possible data and functional decompositions.

Certain problems allow data decomposition, where the information can be split into discrete “data chunks” that are independent and suitable for parallel processing in SIMD fashion. Figure 4 shows a schematic representation of this kind of decomposition.

Domain decomposition

Figure 4: Domain (data) decomposition (taken from [5])
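A minimal sketch of the index arithmetic behind such a decomposition, assuming a hypothetical array of N elements split among P workers (the chunks are processed sequentially here only to keep the example self-contained; in a real program each one would go to its own thread, process or GPU block):

```c
#include <stdio.h>

#define N 1000000   /* total number of elements (hypothetical)     */
#define P 4         /* number of workers/partitions (hypothetical) */

/* Each worker receives its own contiguous chunk [start, end) of the data. */
static void process_chunk(double *data, int start, int end)
{
    for (int i = start; i < end; i++)
        data[i] = data[i] * 2.0;   /* placeholder computation */
}

int main(void)
{
    static double data[N];

    int chunk = N / P;
    for (int w = 0; w < P; w++) {
        int start = w * chunk;
        int end   = (w == P - 1) ? N : start + chunk;  /* last worker takes the remainder */
        process_chunk(data, start, end);
    }

    printf("processed %d elements in %d chunks\n", N, P);
    return 0;
}
```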

Functional decomposition accounts for partitioning the tasks that need to be accomplished to solve the problem. The gains come from subtasks that are independent (operate over different data) and from subtasks that depend on data streams produced by other subtasks, both relying on MIMD architectures to achieve parallelism.

The latter case is related to the classic producer-consumer synchronization problem, and it is parallelizable in two ways: (a) as a pipeline, where, while the first task is operating over the second element of the stream, the second task is already working over the first element; and (b) by replication, since if each task operates over independent chunks of the stream, the stages themselves can be replicated (SIMD – several copies of the same task over multiple chunks of data), creating a hybrid approach. Figure 5 shows a pipeline example (not hybrid).

Pipelined tasks over data-stream

Figure 5: Pipelined tasks over data-stream
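A minimal sketch of such a two-stage pipeline using POSIX Threads, with a hypothetical one-slot buffer between producer and consumer (a real pipeline would use a bounded queue so the stages overlap more):

```c
#include <pthread.h>
#include <stdio.h>

#define STREAM_LEN 10   /* hypothetical stream length */

/* One-slot buffer connecting the two pipeline stages. */
static int slot;
static int slot_full = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* Stage 1: produces the data stream (here, just the numbers 0..STREAM_LEN-1). */
static void *producer(void *arg)
{
    for (int i = 0; i < STREAM_LEN; i++) {
        pthread_mutex_lock(&lock);
        while (slot_full)                    /* wait until the consumer drained the slot */
            pthread_cond_wait(&cond, &lock);
        slot = i;
        slot_full = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Stage 2: consumes each element while the producer prepares the next one. */
static void *consumer(void *arg)
{
    for (int i = 0; i < STREAM_LEN; i++) {
        pthread_mutex_lock(&lock);
        while (!slot_full)                   /* wait until the producer filled the slot */
            pthread_cond_wait(&cond, &lock);
        printf("consumed element %d\n", slot);
        slot_full = 0;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}
```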

These decomposition approaches are conceived at a high level of abstraction, but their implementation still relies on the programming models and related libraries, such as threads (POSIX Threads, Java Threads), message passing (PVM, MPI) or data-parallel models (High Performance Fortran, GPGPU techniques).

One last note on this topic of programming models concerns a recent trend in GPGPU programming. The graphics hardware company NVidia has recently released a library, CUDA (http://cuda.nvidia.com), to ease the task of developing parallel programs that run on its hardware (AMD/ATI has also released a similar library). The interesting thing about CUDA is that, although the hardware architecture and memory model are better suited to a data-parallel approach, NVidia chose to base the library constructs and terminology on the concept of threads. The reason behind this may be that the data-parallel model is even more obscure (to the mainstream programmer) than threads.

6 – Final thoughts

It has already been six years since the last “big” leap in raw clock speed in the mainstream CPU market, and most software developers are still not aware of it. It is also interesting to notice how the average customer has not registered that his brand new PC is not faster than the old one (sometimes it is, simply because the old one was infested with malware – and as the new one will eventually be as well, its multiple cores will take care of that load).

But the market still pushes for more complex computing, especially in the enterprise and scientific research fields. The new generation of programmers has a (now even more important) choice: to think parallel or not to think parallel. Several enterprise software categories are already benefiting from multicore processors, such as databases, web servers and virtualized operating systems. But these were production-grade multithreaded software before this recent trend (mostly because of isolation requirements and SMP architectures).

Those who develop for the web or run virtualized data centers may say that there is no need for them to learn new paradigms, but maybe they are just missing the point. Multicore systems enable them to do the same work as before with less power consumption per server, but that is just more of the same. Interesting questions that come to my mind:

  • What can I do now that I could not do before?
  • Who else (competition) is thinking about this?
  • Am I risking losing customers to these (parallel) people?

As a last remark, I'll leave some other interesting questions (and further reading) for the reader. These are related to languages and tools that ease (with some overhead, but who said garbage collection is/was rubbish?) the burden of parallel programming:

  • Will functional languages with immutable objects (see *Lisp and Erlang for evidence) replace the current procedural and object-oriented ones? (Operations over immutable objects are inherently parallelizable.)
  • What about specialized tools (with their accompanying restrictive design patterns) such as Google's MapReduce framework (or its open-source incarnation, Apache Hadoop)?

I hope you enjoyed the (kind of long) read and that it managed to spark at least a little curiosity about the subject.
