Special to TG Daily - Multicore architecture has created a small frenzy among programming language designers and software developers to hone their concurrent and parallel programming skills. But is this push one-way? Is it just the hardware driving software and language design in the multi-core era? Rajesh Karmani continues his article series on TG Daily, which focuses on a dramatic shift in software development techniques to help developers exploit the horsepower of multi-core processors.
We know that applications have driven chip design in the past (ASICs being the extreme example). But performance has been the key factor in hardware design, and it has mostly dictated what software should look like, from von Neumann machines (sequential model) to caches (locality-awareness) to out-of-order execution (weaker memory consistency models). In the multi-core era, with programmability such a big challenge, chip designers seem to be more accommodating towards language designers.
Previously published in this series:
An inside view: The $20 million Intel/Microsoft multicore programming initiative
Concurrent Programming: A solution for the multi-core programming era?
I previously discussed how concurrent programming has traditionally been viewed and done. As pointed out, in a shared memory model with many threads trying to access shared state concurrently, managing correctness and consistency becomes a difficult task. It is not hard to argue that such a model will not scale as the number of cores, and hence the number of threads, increases. No wonder it is sometimes referred to as "wild concurrency", and programming language researchers and practitioners are working hard to tame it. I discussed one such proposal called Transactional Memory (TM). It has generated enough interest to convince Sun to integrate TM support in its upcoming Rock processor.
In this article, I'll discuss an alternative programming model, the message-passing model of concurrency, and present my thoughts on what impact it could have on processor architecture. It has been widely used in academia and research labs for a few years. These are the domains that had both the demand (scientific computation, physical simulations and numerical computation are CPU intensive) and the resources (money, people, grids, clusters) for parallel programming. Not to mention that the main driver behind parallelism is the huge amount of data these applications process. On the other hand, it is only through the impact of multi-core chips that parallel programming is beginning to push itself into the mainstream.
Message Passing Interface (MPI)
One of the prime reasons for its wide adoption is the standardization of the message-passing model in the early 1990s in the shape of MPI (Message Passing Interface), a specification subsequently implemented by several different groups. A list can be found here. Coincidentally, two of the pioneering scientists behind MPI, Dr William Gropp and Dr Marc Snir, are faculty members at UIUC. Dr Marc Snir is also the co-director of UPCRC at the University of Illinois.
The basic goal of MPI is to obtain high performance, scalability and portability in these domains. Although its specification defines a large number of functions, there are six basic calls. Point-to-point communication includes both synchronous and asynchronous calls. Apart from point-to-point communication, MPI also defines so-called collective communication patterns. Among others, these include broadcast, scatter, gather, reduce, scan, all-reduce and all-gather. MPI started out assuming no shared memory, but a later version incorporated Distributed Shared Memory architectures. A good introduction to MPI is available here. There are plenty of other references available on Google.
MPI has been quite successful in achieving its objectives and hence has been widely used to write parallel programs. Although based on a message-passing model, it is a library and not a full-blown programming language. Therein lies one of its strengths: programming language independence and compatibility with legacy languages.
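To make the collective patterns named above concrete, here is a minimal sketch in plain Python. It is a toy model, not the MPI API: each list index stands for one process (a "rank"), and the helper names (broadcast, scatter, gather, reduce_to_root, all_reduce, scan) are hypothetical functions written for this illustration. A real MPI program would run one process per rank and call the library instead.

```python
# Toy model of MPI-style collectives. Index i of each list stands for rank i.
# All helper names here are illustrative assumptions, not MPI library calls.
from functools import reduce as fold
from itertools import accumulate
import operator


def broadcast(value, nranks):
    """The root's value ends up on every rank."""
    return [value for _ in range(nranks)]


def scatter(chunks):
    """The root holds one chunk per rank; rank i receives chunks[i]."""
    return list(chunks)


def gather(per_rank):
    """The root collects one value from every rank, in rank order."""
    return list(per_rank)


def reduce_to_root(per_rank, op=operator.add):
    """The root receives the combination of all ranks' values."""
    return fold(op, per_rank)


def all_reduce(per_rank, op=operator.add):
    """Like reduce, but every rank receives the combined result."""
    return broadcast(reduce_to_root(per_rank, op), len(per_rank))


def scan(per_rank, op=operator.add):
    """Rank i receives the combination of the values on ranks 0..i."""
    return list(accumulate(per_rank, op))


# A parallel sum over 4 ranks: scatter the data, sum locally, reduce.
data = list(range(1, 9))
chunks = [data[0:2], data[2:4], data[4:6], data[6:8]]
local = scatter(chunks)
partial = [sum(chunk) for chunk in local]   # one partial sum per rank
total = reduce_to_root(partial)             # combined on the root
```

The scatter, local work, reduce sequence at the bottom is the typical shape of a data-parallel computation written against these patterns.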
Problems with MPI
Although I don't claim first-hand experience with MPI, it involves a fair amount of hand-tuning, including partitioning of code and data, and placement and scheduling across multi-processor architectures. If the goal is to obtain the last iota of performance and the stakes are high, this makes good sense. In fact, projects involving MPI have computer scientists working with physicists, astronomers and other domain scientists to deliver the performance. MPI has been deemed very low-level, to the point of being called the "assembly language" of parallel programming. With such a high entry barrier, MPI needs to raise its abstraction level with elegant constructs to define the different communication and coordination patterns for concurrent and parallel programming. Professor Kale from UIUC has been working on an adaptive implementation of MPI and a dynamic run-time to support placement and scheduling on multiprocessor machines. This is a big step forward from the low-level mechanisms in MPI.
MPI is also prone to deadlocks due to synchronous communication, but because it has so far been in the hands of expert programmers (scientists), the problem has been masked. One can imagine how error-prone and tricky it could get in mainstream programming.
But MPI is not the end of message-passing models. A more abstract model based on asynchronous message-passing is the Actor model of programming. Although originally proposed around 30 years ago, it has been receiving a lot more attention lately due to the problem of multicore programming. I briefly discussed the Actor model in a previous article.
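The deadlock risk is easy to reproduce in miniature. The sketch below is plain Python with threads, not MPI: SyncChannel and exchange are hypothetical names invented for this illustration, and the channel models a synchronous send that returns only once the other side has received. Timeouts stand in for "blocked forever" so that the example terminates.

```python
import queue
import threading


class SyncChannel:
    """Rendezvous channel (illustrative): send() completes only after a recv()."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, msg, timeout):
        self._slot.put(msg, timeout=timeout)
        try:
            self._ack.get(timeout=timeout)  # wait until the receiver took it
        except queue.Empty:
            raise TimeoutError("send blocked: nobody is receiving")

    def recv(self, timeout):
        msg = self._slot.get(timeout=timeout)
        self._ack.put(None)  # release the blocked sender
        return msg


def exchange(a_sends_first, b_sends_first, timeout=0.5):
    """Two workers swap names; returns what each received, or 'stuck'."""
    a_to_b, b_to_a = SyncChannel(), SyncChannel()
    results = {}  # only used to report outcomes back to the caller

    def worker(name, out_ch, in_ch, send_first):
        try:
            if send_first:
                out_ch.send(name, timeout)
                got = in_ch.recv(timeout)
            else:
                got = in_ch.recv(timeout)
                out_ch.send(name, timeout)
            results[name] = got
        except (TimeoutError, queue.Empty):
            results[name] = "stuck"  # would be a deadlock without timeouts

    threads = [
        threading.Thread(target=worker, args=("A", a_to_b, b_to_a, a_sends_first)),
        threading.Thread(target=worker, args=("B", b_to_a, a_to_b, b_sends_first)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling exchange(True, True) makes both workers send first, so each waits for a receive that never comes. Calling exchange(True, False) pairs every send with a receive and the swap completes. The fix is an ordering convention that the programmer has to get right by hand, which is exactly the kind of burden that is tolerable for experts and risky for everyone else.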
Message-passing model revisited
The message-passing model (with no shared state) has some nice properties for concurrent programming. The only source of non-determinism is the arrival order of messages. This can be resolved locally at each site, which allows local reasoning about the correctness of programs. Compare this to the multi-threaded model, where shared state can be accessed from any point in any thread, requiring global reasoning about the correctness and consistency of programs. Moreover, the message-passing model is more amenable to visualization of the program's flow, as local state is abstracted away. Messages represent the explicit data flow in the application, and an easily comprehensible picture of the program emerges. Visualization tools go some way towards solving one of the problems that I see with any large-scale programming task: the short attention span of the programmer in space (code files) and in time.
So what does the message-passing model mean for chip architecture? With their low inter-core latency and high bandwidth, multi-core chips are better suited to message-passing than the grids and clusters that have been the traditional havens for this model. Tilera's TILE64 chip has a 5-layered mesh interconnect and a development environment supporting inter-core message-passing. But with its small caches, the message-passing model has to employ shared memory (logically, the message-passing model can be mapped onto a physical shared memory architecture and vice versa). Off-chip shared memory provides access that is an order of magnitude slower than on-chip core-cache or core-core access. The bottom line is that if the message-passing model enables writing correct parallel programs, there is an incentive for chip makers to provide optimizing facilities in hardware.
Having talked about the two prominent models, I should add that there is a belief, backed by a few good reasons, that different programming models and a set of languages will co-exist for some time to come. These reasons include the investment in research and learning, domain-specific requirements and so on. There have been efforts to catalog patterns of problems and solutions in parallel programming, just like the design patterns for object-oriented programming. One such effort is PPP. Basically, the goal is to evaluate proposed programming models in the light of these patterns. There are some exciting times to look forward to.
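As a rough illustration of local reasoning, here is a minimal actor-style sketch in plain Python. It is an assumption-laden toy, not any particular actor library: CounterActor, its message kinds ("add", "get", "stop") and run_demo are all names invented for this example. The actor's state is touched only by its own thread, and other threads can influence it only by sending messages to its mailbox.

```python
import queue
import threading


class CounterActor:
    """Illustrative actor: private state, changed only by handling messages."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._total = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, msg):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(msg)

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            kind, payload = self._mailbox.get()
            if kind == "add":
                self._total += payload
            elif kind == "get":
                payload.put(self._total)  # payload is a reply queue
            elif kind == "stop":
                return

    def join(self):
        self._thread.join()


def run_demo(senders=4, per_sender=1000):
    """Several threads send 'add' messages; returns the actor's final total."""
    actor = CounterActor()

    def producer():
        for _ in range(per_sender):
            actor.send(("add", 1))

    threads = [threading.Thread(target=producer) for _ in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    reply = queue.Queue()
    actor.send(("get", reply))  # queued behind every earlier 'add'
    total = reply.get()
    actor.send(("stop", None))
    actor.join()
    return total
```

The interleaving of "add" messages from different senders varies from run to run, but the actor handles one message at a time, so no lock around the counter is needed and the handler can be checked in isolation. The equivalent shared-counter version with threads would need every access site to follow the same locking discipline.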