This article gives an overview of how to write effective parallel code for multi-core processors, aimed at beginners who want to learn parallel programming. It also covers the challenges of developing software for multi-core processors and gives an overview of optimization techniques, focusing on threading issues and performance tuning, which should be useful to programmers who want to optimize their programs for the latest multi-core technology.
To fully exploit the potential of multi-core processors (that is, to scale performance while optimizing power consumption), you must understand the inherent parallelism in your applications. The best way to extract the full potential of a multi-core processor is through threading.
The keys to effective parallelism are:
- Identify the concurrent work.
- Divide the work evenly.
- Create private copies of commonly used resources.
- Synchronize access to unique shared resources.
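The four points above can be sketched with a parallel sum. This is a minimal illustration, not a tuned implementation: the function name `parallel_sum` and its parameters are hypothetical, and it assumes a C++11 compiler with `std::thread` support.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_threads` worker threads.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    // Private copies: each worker writes only to its own slot of `partial`.
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    // Divide the work evenly: one contiguous chunk per thread (rounded up).
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        // Concurrent work: the chunks are independent of each other.
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;  // accumulate in a thread-local variable
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;
        });
    }
    // Synchronize: after join() every partial result is safe to read.
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(values, 4)` returns the same total as a serial loop. With GCC or Clang on Linux, a build line such as `g++ -std=c++11 -pthread` is typically enough.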
There are three classes of parallel technologies for multi-core processors:
1. Threaded libraries:- Threaded libraries such as POSIX threads and Windows API threads give very explicit control over threads. Use these threading technologies when you require fine-grained management of threads.
2. Compiler support:- This class includes OpenMP and automatic parallelization, which enable an application to take advantage of machines that share the same memory space. The advantage of OpenMP over threaded libraries is that it simplifies parallel application development by hiding many of the details of thread management and thread communication behind a simplified programming interface.
3. Message passing libraries:- Message passing libraries such as the Message Passing Interface (MPI) enable one application to take advantage of several machines that do not necessarily share the same memory space.
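To give a feel for how compiler support hides thread management, here is a small sketch using an OpenMP directive. The function `dot` is a hypothetical example, and the sketch assumes a compiler with OpenMP enabled (for example `-fopenmp` with GCC or Clang); without that option the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: dot product of two equally sized vectors.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    const long n = static_cast<long>(a.size());
    // OpenMP creates the threads, splits the iterations among them, and
    // combines the per-thread partial sums (the reduction clause).
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Compared with an explicit threaded-library version, there is no thread creation, joining, or result merging in the source; the single directive expresses all of it.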
Taking advantage of parallelism :-
First, you need to find the parallelism available in your application. To do so, use a performance analysis tool such as gprof, Valgrind (with Cachegrind, Callgrind, or Massif), or Intel VTune (formerly... Intel Thread Profiler).
With one of these tools you can find out where your program spends most of its time. Once the most time-consuming functions are identified, drill down into the source code to determine whether threading can be implemented effectively.
Some resource-intensive functions may not lend themselves to parallel execution. If you are faced with a hot spot that cannot be threaded, the Performance Analyzer's call graph technology is the next step. The call graph graphically depicts the call tree through an application. Even when your hot spot is not amenable to threading, this technology may be able to identify a function further up the call tree that can be threaded. Threading a function further up the call tree improves performance by allowing multiple threads to call the hot function simultaneously.
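The idea of threading further up the call tree can be sketched as follows. Both functions are hypothetical: `hot_function` stands in for a hot spot whose loop cannot be split because each step depends on the previous one, and `process_all` is an assumed caller that invokes it once per independent input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical hot spot: every step needs the result of the previous step,
// so the loop inside this function cannot itself be spread across threads.
std::uint64_t hot_function(std::uint64_t seed, int steps) {
    std::uint64_t x = seed;
    for (int i = 0; i < steps; ++i) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    return x;
}

// Caller further up the call tree: the calls for different seeds are
// independent, so several threads can run hot_function at the same time.
std::vector<std::uint64_t> process_all(const std::vector<std::uint64_t>& seeds,
                                       int steps, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> results(seeds.size());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Thread t handles items t, t + num_threads, t + 2 * num_threads, ...
            for (std::size_t i = t; i < seeds.size(); i += num_threads) {
                results[i] = hot_function(seeds[i], steps);
            }
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

The hot function is unchanged; only its caller was threaded, which is why the results match a serial run exactly.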
Automatic Parallelization :-
Automatic parallelization analyzes the loops in a program and generates threaded code for them. The compiler does this work, and it will only parallelize loops that it can determine are safe to parallelize. The following tips may improve the likelihood of successful parallelization.
- Avoid placing function calls inside loop bodies. Function calls may have
effects on the loop that cannot be determined at compile time and may
prevent parallelization.
- Use the optimization reporting option. The parallelization optimization report (-par-report on Linux) provides a summary of the compiler's analysis of every loop and, where a loop cannot be parallelized, the reason why not. Even if the compiler cannot parallelize a loop, you can use the information in the report to identify regions for manual threading.
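As a rough illustration of what "safe to parallelize" means, compare the two hypothetical loops below. Whether a particular compiler actually parallelizes the first one depends on the compiler and its options; the sketch only shows the structural difference the analysis looks for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Good candidate: iteration i touches only element i, the loop body
// contains no function calls, and the trip count is known on entry.
void scale(std::vector<float>& out, const std::vector<float>& in, float k) {
    assert(out.size() == in.size());
    float* o = out.data();
    const float* p = in.data();
    const std::size_t n = in.size();
    for (std::size_t i = 0; i < n; ++i) {
        o[i] = k * p[i];
    }
}

// Not a candidate as written: iteration i reads the value produced by
// iteration i - 1 (a loop-carried dependence), so the iterations cannot
// simply be handed to different threads.
void prefix_sum(std::vector<float>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

A loop like the second one needs a different algorithm before it can be threaded, which is the kind of finding a parallelization report helps surface.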
Threading Issues (Correctness issues):-
Once threading has been added to an application, the following common threading issues may arise:
- Data Race
- Synchronization
- Thread Stall
- Deadlock
- False Sharing
1. Data Race :- A data race occurs when two or more threads access the same data at the same time without synchronization and at least one of them writes to it. This leads to inconsistent results in the running program. For example, in a read/write data race, one thread writes to a variable at the same time another thread reads it. The reading thread gets a different result depending on whether the write has already occurred.
The way to correct a data race is through synchronization or locking.
- Synchronization :- This is a useful technique, but take care to avoid unnecessary synchronization, because it slows the application down. Since only one thread is allowed in a critical section at a time, any other thread that needs to enter that section is forced to wait. Processor resources then sit idle, which hurts performance.
- Lock :- Locking is another technique for avoiding data races. A thread locks a resource while it is using it, which denies other threads access to that resource. Two common threading errors can occur when using locks: thread stall and deadlock.
- (i) Thread Stall:- This happens when a thread that has locked a resource continues through the program without releasing it. When a second thread tries to access that resource, it is forced to wait indefinitely, causing a stall. Ensure that threads release their locks before continuing through the program.
- (ii) Deadlock:- This happens when two threads each hold a lock and each waits for a resource that the other has locked, so neither can proceed. In general, avoid complex locking hierarchies if possible, and ensure that every thread acquires locks in the same order.
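The locking advice above can be sketched in C++ as follows. The `Account` type and the function names are hypothetical, chosen only for illustration. A scoped guard releases the lock automatically, which avoids the thread stall case, and locking the two mutexes in a fixed order (here, by address) is one way to give every thread the same acquisition order.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical account type used only for illustration.
struct Account {
    long balance = 0;
    std::mutex m;
};

// Data race / thread stall: the guard serializes access to `balance` and
// releases the mutex automatically when it goes out of scope.
void deposit(Account& a, long amount) {
    assert(amount >= 0);
    std::lock_guard<std::mutex> guard(a.m);
    a.balance += amount;
}

// Deadlock: transfer(x, y) and transfer(y, x) running concurrently could
// each hold one mutex and wait for the other. Always locking the mutex of
// the lower-addressed account first gives every thread the same order.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;  // never lock the same mutex twice
    Account* first = &from;
    Account* second = &to;
    if (std::less<Account*>()(second, first)) std::swap(first, second);
    std::lock_guard<std::mutex> g1(first->m);
    std::lock_guard<std::mutex> g2(second->m);
    from.balance -= amount;
    to.balance += amount;
}

// Small driver: transfers in opposite directions run concurrently.
long run_transfers(int rounds) {
    Account x, y;
    deposit(x, 1000);
    deposit(y, 1000);
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(x, y, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(y, x, 1); });
    t1.join();
    t2.join();
    return x.balance + y.balance;  // the total is preserved
}
```

The standard library's `std::lock` is an alternative that acquires several mutexes without deadlocking, if ordering by address does not suit the design.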
2. False Sharing:- This is not necessarily an error in the program, but it is likely to affect performance. False sharing occurs when two threads manipulate different data that lie on the same cache line. When one thread changes data on that line, the copy of the line held in the other processor's cache is invalidated. The second thread then has to wait while the line is reloaded. If this happens repeatedly, for example inside a loop, it will severely affect performance.
One way to detect false sharing is to sample on L2 cache misses using
the Intel VTune Performance Analyzer sampling technology. If this event
occurs frequently in a threaded program, it is likely that false sharing
is at fault.
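A common remedy for false sharing is to place each thread's data on its own cache line. The sketch below assumes 64-byte cache lines, which is typical of current x86 processors but should be checked for the target machine; `PaddedCounter`, `kThreads`, and `count_in_parallel` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;
constexpr unsigned kThreads = 4;

// alignas pads each counter out to a full cache line, so two counters
// never share a line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one cache line");

long count_in_parallel(long iterations) {
    assert(iterations >= 0);
    PaddedCounter counters[kThreads];  // one private counter per thread
    std::thread workers[kThreads];
    for (unsigned t = 0; t < kThreads; ++t) {
        workers[t] = std::thread([&counters, t, iterations] {
            // Each thread updates only its own, separately aligned counter.
            for (long i = 0; i < iterations; ++i) counters[t].value++;
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

With a plain `long counters[kThreads]` array, all four counters would likely sit on one cache line and every increment could invalidate the line in the other cores' caches. The padding trades a little memory for that independence.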
Performance Tuning :-
Once correctness issues are solved, performance tuning can begin. The Intel Thread Profiler lets you visually inspect the performance of your threaded applications and answer questions such as:
- Is the work evenly distributed between threads?
- How much of the program is running in parallel?
- How does performance increase as the number of processors employed increases?
- What is the impact of synchronization between threads on execution time?
The answers to these questions can help you optimize your application further. For example, if you determine that the workload is not balanced evenly between threads, you can change the code and test the application iteratively until you have confirmed that the work is balanced.
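One possible code change for an unbalanced workload is to hand out work dynamically instead of splitting it into fixed chunks. The sketch below is only one approach under stated assumptions: `work` is a hypothetical task whose cost grows with its argument, and `run_dynamic` is an assumed driver in which each thread claims the next unprocessed task through an atomic counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical task whose cost grows with n, so a fixed split of the task
// list can leave some threads with far more work than others.
long work(int n) {
    long s = 0;
    for (int i = 1; i <= n; ++i) s += i;
    return s;
}

std::vector<long> run_dynamic(const std::vector<int>& tasks, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long> results(tasks.size());
    std::atomic<std::size_t> next{0};  // index of the next unclaimed task
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // A thread that finishes early simply claims more tasks.
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= tasks.size()) break;
                results[i] = work(tasks[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Whether this helps depends on the task sizes; for many tiny tasks the contention on the shared counter can outweigh the gain, so the profiler measurements should be repeated after the change.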