Sunday, July 14, 2013

Software Optimization

Commonly, software optimization is done at the end of the development process, with whatever time remains. Waiting until the end to start optimizing an application makes it much more difficult to achieve large performance improvements.
To make sure that your application is complete and runs with the expected performance, specify the performance targets in the design documents so that programmers know the goal before they start programming.
 
The Software Optimization process:- It is an iterative process. The first step is to identify the hotspots, the areas of the application that consume the majority of the time. Each hotspot is then investigated to determine its cause: slow memory accesses, inefficient algorithms, high loop counts, branch mispredictions, slow instructions and so on. Once you know the cause of the hotspot, a solution can be designed and implemented.
Since not all changes result in performance improvements, a benchmark is used to verify that performance actually improved as a result of the implemented changes.
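The verification step can be sketched as a small timing harness. This is a minimal sketch, not a full benchmarking framework: the helper time_best_of and the two functions sum_baseline and sum_candidate are hypothetical names introduced only for illustration, and wall-clock timing with std::chrono::steady_clock is an assumption about how you measure.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: runs `work` `repeats` times and returns the fastest
// wall-clock time in seconds. Taking the minimum reduces noise caused by
// other processes running on the machine.
template <typename Work>
double time_best_of(int repeats, Work work) {
    assert(repeats > 0);
    double best = -1.0;
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double seconds = std::chrono::duration<double>(stop - start).count();
        if (best < 0.0 || seconds < best) best = seconds;
    }
    return best;
}

// Hypothetical "before" version of a hotspot.
long long sum_baseline(const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) total += values[i];
    return total;
}

// Hypothetical "after" version; it must return the same result as the baseline.
long long sum_candidate(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}
```

A typical use is to time both versions on the same input, for example time_best_of(5, [&] { sink += sum_baseline(data); }) against the same call with sum_candidate, and to keep the change only if the results match and the candidate is measurably faster.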
 
Hotspots:-
Hotspots are the areas of an application that have intense activity. Intense activity usually refers to time, but the definition can include anything, such as mispredicted branches or cache misses. When measuring performance, time is always the priority, but sometimes the number of cache lines touched also matters, because reducing it reduces cache misses.
 
Performance issues:-

Saturday, April 13, 2013

Optimization Techniques for Multi-Core Processors

This article gives an overview of how to write effective parallel code for multi-core processors, which should help a beginner who wants to learn parallel programming. It also discusses the challenges of developing software for multi-core processors and gives an overview of optimization techniques, focusing on threading issues and performance tuning, which should be useful for a programmer who wants to optimize programs for the latest multi-core technology.
To fully exploit the potential of multi-core processors (i.e. to scale the performance while optimizing power consumption), you must understand the inherent parallelism in your applications. The best way to extract the full potential of a multi-core processor is through threading.

The keys to effective parallelism are :
  • Identify the concurrent work.
  • Divide the work evenly.
  • Create private copies of commonly used resources.
  • Synchronize access to unique shared resources.
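The four points above can be sketched with a parallel sum. This is only an illustration under stated assumptions: the function name parallel_sum and the use of C++ std::thread are choices made for this example, not something the article prescribes. The work is divided into equal chunks, each thread accumulates into a private local variable, and the only synchronization is joining the threads before the partial results are combined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical example: sums `data` using `num_threads` worker threads.
long long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);  // one result slot per thread
    std::vector<std::thread> workers;
    // Divide the work evenly: every thread gets at most `chunk` elements.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;  // private copy, no sharing inside the loop
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;   // single write to this thread's own slot
        });
    }
    // Synchronize: wait for every worker before reading the partial results.
    for (std::size_t t = 0; t < workers.size(); ++t) workers[t].join();
    long long total = 0;
    for (std::size_t t = 0; t < partial.size(); ++t) total += partial[t];
    return total;
}
```

On GCC or Clang a program using std::thread usually has to be built with the -pthread option.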

There are three classes of parallel technologies for multi-core processors:-
1. Threaded libraries:-  Threaded libraries such as POSIX threads and Windows API threads enable very explicit control of threads. Use these threading technologies when you require fine-grained management of threads.
2. Compiler support:-  This includes OpenMP and automatic parallelization, which enable an application to take advantage of machines that share the same memory space. The advantage of OpenMP over threaded libraries is that it simplifies parallel application development by hiding many of the details of thread management and thread communication behind a simplified programming interface.
3. Message passing libraries:-  Message passing libraries such as the Message Passing Interface (MPI) enable one application to take advantage of several machines that do not necessarily share the same memory space.
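To illustrate how compiler support hides thread management, here is a minimal OpenMP sketch. The function name dot_product is hypothetical. The single pragma asks the compiler to split the loop across threads and to combine the per-thread sums; no thread is created or joined by hand. With GCC or Clang, OpenMP is enabled with the -fopenmp option; without it the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: dot product of two equally sized vectors.
double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
    // reduction(+:sum) gives every thread a private copy of `sum` and adds
    // the copies together when the loop ends, so no explicit lock is needed.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The equivalent code with a threaded library would need explicit thread creation, work division and joining, which is the detail OpenMP hides.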


Taking advantage of parallelism :-
First of all, you need to know the parallelism available in your application. To do so, use one of the performance analysis tools, such as gprof, Valgrind (with Cachegrind, Callgrind, Massif) or Intel VTune (formally... Intel Thread Profiler).
Using one of these tools, you can find out where your program spends most of its time. Once the most time-consuming functions are identified, drill down to the source code to determine whether threading can be effectively implemented. Some resource-intensive functions may not lend themselves to parallel execution. If you find yourself faced with a hot spot that cannot be threaded, then the Performance Analyzer's call graph technology is the next step. The call graph graphically depicts the call tree through an application. Even when your hot spot is not amenable to threading, this technology may be able to identify a function further up the call tree that can be threaded. Threading a function further up the call tree can improve performance by allowing multiple threads to call the hot function simultaneously.


Automatic Parallelization :-
Automatic parallelization analyzes the loops and creates threaded code for them. This is done by the compiler, which will only parallelize loops that it can determine are safe to parallelize. The following tips may improve the likelihood of successful parallelization.
  • Avoid placing function calls inside loop bodies. Function calls may have effects on the loop that cannot be determined at compile time and may prevent parallelization.
  • Use the optimization reporting option. The parallelization optimization report (-par-report on Linux) provides a summary of the compiler’s analysis of every loop and, where a loop cannot be parallelized, the reason why not. Even if the compiler cannot parallelize a loop, you can use the information in the report to identify regions for manual threading.
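The difference between a loop that is a good candidate for automatic parallelization and one that is not can be sketched as follows. Both function names are hypothetical, and whether a particular compiler actually parallelizes the first loop depends on the compiler and its options; the sketch only shows the property the compiler has to prove.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Candidate for auto parallelization: every iteration touches only its own
// element and the body contains plain arithmetic, no opaque function calls.
void scale_and_shift(std::vector<double>& values, double factor, double offset) {
    double* data = values.data();
    const std::size_t n = values.size();
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = data[i] * factor + offset;
    }
}

// Not safe to parallelize as written: iteration i reads the value that
// iteration i - 1 has just written (a loop-carried dependence).
void running_total(std::vector<double>& values) {
    for (std::size_t i = 1; i < values.size(); ++i) {
        values[i] += values[i - 1];
    }
}
```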

Threading Issues (Correctness issues):-
Once threading has been added to an application, the following common threading issues may arise:
  • Data Race
  • Synchronization
  • Thread Stall
  • Deadlock
  • False Sharing
1. Data Race :-  A data race occurs when two or more threads access the same data at the same time and at least one of them writes to it. This leads to inconsistent results in the running program. For example, in a read/write data race, one thread writes to a variable at the same time another thread reads it. The reading thread gets a different result depending on whether the write has already occurred.
The way to correct a data race is through synchronization or locks.
  • Synchronization :-  It is a useful technique, but take care to limit unnecessary synchronization, as it slows down the application. Since only one thread is allowed to access a critical section at a time, any other threads needing to access that section are forced to wait. This means precious resources are sitting idle, negatively impacting performance.
  • Lock :-  It is another technique to avoid data races. A thread locks a specific resource while it is using that resource, which denies access to other threads. Two common threading errors can occur when using locks. The first is thread stall and the second is deadlock.
  • (i). Thread Stall:-  This happens when a thread that has locked a resource continues program execution without releasing it. When a second thread tries to access that resource, it is forced to wait indefinitely, causing a stall. You should ensure that threads release their locks before continuing through the program.
  • (ii) Deadlock:-  This happens when two threads each hold a lock on a resource that the other is trying to access, so neither can proceed. In general, you should avoid complex locking hierarchies, if possible, and ensure that locks are always acquired in the same order.
2. False Sharing:-  This is not necessarily an error in the program, but something that is likely to affect performance. False sharing occurs when two threads manipulate different data that lie on the same cache line. When one thread changes data on that cache line, the copy of the line held in the other thread's cache is invalidated. The second thread then has to wait while the line is reloaded. If this happens repeatedly, for example inside a loop, it will severely affect performance. One way to detect false sharing is to sample on L2 cache misses using the Intel VTune Performance Analyzer sampling technology. If this event occurs frequently in a threaded program, false sharing is a likely cause.
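The correctness issues above can be sketched in one small C++ example. Everything here is illustrative: Account, transfer, PaddedCounter and the two driver functions are hypothetical names, and the 64-byte cache line size is an assumption that holds on many but not all processors. The transfer function protects shared data with locks (no data race), takes both locks through std::lock so that opposite transfers cannot deadlock, and releases them automatically when it returns (no thread stall). PaddedCounter places each counter on its own cache line to avoid false sharing.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {
    std::mutex guard;  // protects `balance`
    long balance = 0;
};

// Moves `amount` between two distinct accounts.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    // std::lock acquires both mutexes with a deadlock-avoidance algorithm,
    // so transfer(a, b) and transfer(b, a) can run at the same time.
    std::lock(from.guard, to.guard);
    // adopt_lock: the guards take over the already held locks and release
    // them when the function returns, so no lock is left held by accident.
    std::lock_guard<std::mutex> hold_from(from.guard, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.guard, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions; the total must be preserved.
long total_after_opposing_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread forward([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread backward([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    forward.join();
    backward.join();
    return a.balance + b.balance;
}

// Assumed cache line size of 64 bytes: each counter occupies its own line.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Each thread updates only its own counter, and the counters do not share a
// cache line, so the threads do not invalidate each other's cached data.
long count_without_false_sharing(long increments) {
    PaddedCounter counters[2];
    std::thread first([&] { for (long i = 0; i < increments; ++i) ++counters[0].value; });
    std::thread second([&] { for (long i = 0; i < increments; ++i) ++counters[1].value; });
    first.join();
    second.join();
    return counters[0].value + counters[1].value;
}
```

Removing the alignas(64) keeps the program correct but lets both counters share one cache line, which is the situation described under False Sharing.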


Performance Tuning :-  Once correctness issues are solved, performance tuning can begin.  The Intel Thread Profiler lets you visually inspect the performance of your threaded applications and answer questions such as:
  • Is the work evenly distributed between threads?
  • How much of the program is running in parallel?
  • How does performance increase as the number of processors employed increases?
  • What is the impact of synchronization between threads on execution time? 
The answers to these questions can help you optimize your application further. For example, if you determine that the workload is not balanced evenly between threads, you can implement code changes and iteratively test the application until you have confirmed a balance.
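One possible code change for an unbalanced workload is to hand out work dynamically instead of in fixed chunks. The sketch below is an assumption about how such a change might look, not a method taken from any particular tool: process_dynamically is a hypothetical function in which every thread claims the next unprocessed item from a shared atomic index, so a thread that finishes cheap items early picks up more work instead of sitting idle. Squaring each item stands in for the real work.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical example: processes `items` in place and returns how many
// items each thread handled, which can be inspected to check the balance.
std::vector<std::size_t> process_dynamically(std::vector<long>& items,
                                             std::size_t num_threads) {
    assert(num_threads > 0);
    std::atomic<std::size_t> next(0);                  // next unclaimed index
    std::vector<std::size_t> handled(num_threads, 0);  // per-thread item count
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&items, &next, &handled, t] {
            std::size_t mine = 0;
            for (;;) {
                // fetch_add hands every index to exactly one thread.
                const std::size_t i = next.fetch_add(1);
                if (i >= items.size()) break;
                items[i] = items[i] * items[i];  // stand-in for real work
                ++mine;
            }
            handled[t] = mine;
        });
    }
    for (std::size_t t = 0; t < workers.size(); ++t) workers[t].join();
    return handled;
}
```

The returned counts show how the items were shared out on a given run; they will differ from run to run because the distribution depends on thread timing.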