Software Application Optimization

The purpose of this lab is to familiarize you with the field of application optimization, using dedicated tools for spotting performance bottlenecks.

We will focus on several open source and commercial tools, as follows:

  • valgrind
  • perf
  • Intel Parallel Studio / Intel VTune Amplifier

1. Valgrind

Valgrind is a tool used for memory debugging, memory leak detection and profiling. It is also a generic framework for creating dynamic analysis tools, such as memory checkers [1].

Valgrind is in essence a virtual machine using just-in-time compilation techniques, including dynamic recompilation. It is important to keep in mind that nothing from the original program ever gets run directly on the host processor. Instead, Valgrind translates the input program into a simpler, processor-neutral form called Intermediate Representation (IR). After this transformation, a tool [2] is called to perform whatever transformation of the IR it needs, and the resulting IR is then translated back into machine code and run on the host processor.

The tools available in Valgrind are:

  • memcheck. Used to detect memory-management problems; it is aimed at C and C++ programs. All memory reads and writes are checked, and all calls to malloc/new/free/delete are intercepted. It can therefore detect memory leaks, accesses to invalid memory, use of uninitialized values, overflows, etc. Memcheck runs programs about 10-30x slower than normal;
  • cachegrind. Used to profile the CPU cache. It performs detailed simulation of the I1, D1 and L2 caches in order to pinpoint the sources of cache misses. It identifies the number of cache misses, memory references and instructions executed for each line of source code. Cachegrind runs programs about 20-100x slower than normal;
  • callgrind. It is an extension to cachegrind and provides all the information that the latter offers, plus extra information regarding call graphs. In order to view the results, a visualization tool called KCachegrind [3] can be used;
  • massif. A heap profiler: it takes regular snapshots of a program's heap and produces a graph showing heap usage over time, including information about which parts of the program are responsible for the most memory allocations. Massif runs programs about 20x slower than normal;
  • helgrind and drd. These tools are thread debuggers which find data races in multithreaded programs. They look for memory locations which are accessed by more than one (POSIX) pthread, but for which no consistently used (pthread_mutex_) lock can be found (a minimal data race sketch follows this list);
  • other 3rd party tools can be found here [4].
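As a rough illustration of the kind of problem helgrind and drd report, here is a minimal data race sketch (a hypothetical example, not part of this lab's sources); the build line assumes g++ with -pthread:

// race.cpp -- hypothetical example of a data race that helgrind/drd can detect.
// Build (assumption): g++ -g -O0 -pthread race.cpp -o race
#include <pthread.h>
#include <cstdio>

static long counter = 0;                      // shared, but not protected by any lock

static void *worker(void *)
{
    for (int i = 0; i < 100000; ++i)
        ++counter;                            // unsynchronized read-modify-write: data race
    return nullptr;
}

int main()
{
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, worker, nullptr);
    pthread_create(&t2, nullptr, worker, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    printf("counter = %ld\n", counter);       // the result varies between runs
    return 0;
}

Running it under valgrind --tool=helgrind ./race (or --tool=drd) should flag the conflicting accesses to counter; protecting the increment with a pthread_mutex_ lock makes the reports go away.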

TASK 1: (2p) Install valgrind on your computers and run the memcheck tool on a simple application written by you in which you intentionally add memory leaks.
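
A minimal sketch of such a leaky program (a hypothetical example; any equivalent C/C++ program works) could look like this:

// leak.cpp -- hypothetical example with intentional memory leaks for memcheck.
// Build (assumption): g++ -g -O0 leak.cpp -o leak
#include <cstdlib>
#include <cstring>

int main()
{
    char *lost = (char *) malloc(64);         // never freed: reported as "definitely lost"
    strcpy(lost, "this block is leaked");

    int *also_lost = new int[100];            // leaked as well
    also_lost[0] = 42;
    also_lost = nullptr;                      // the only pointer to the block is overwritten

    char *ok = (char *) malloc(16);           // this block is freed correctly
    free(ok);
    return 0;
}

Running valgrind --tool=memcheck --leak-check=full ./leak should report the two lost blocks in the leak summary; compiling with -g lets memcheck point at the allocating source lines.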

2. Perf

Perf is a performance analysis tool, available in the Linux kernel since version 2.6.31 [5]. The userspace control application is accessed from the command line and provides a number of subcommands. Unlike Valgrind, perf is capable of statistical profiling of both the entire system (kernel and userspace code) and of individual processes (per PID). It supports hardware performance counters, tracepoints, software performance counters (e.g. hrtimer), and dynamic probes (for example, kprobes or uprobes).

Perf is used with several subcommands:

  • stat: measure total event count for a single program or for the whole system for a specified time period;
  • top: top-like dynamic view of hottest functions;
  • record: measure and save sampling data for a single program;
  • report: analyze file generated by perf record;
  • annotate: annotate sources or assembly;
  • sched: tracing/measuring of scheduler actions and latencies;
  • list: list available events.

TASK 2: (1p) Make sure that perf is installed on your computers.
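
For a quick sanity check that perf works, you can profile any CPU-bound program, such as the hypothetical sketch below:

// spin.cpp -- hypothetical CPU-bound program used only to verify that perf works.
// Build (assumption): g++ -g -O2 spin.cpp -o spin
#include <cstdio>

int main()
{
    volatile double acc = 0.0;                // volatile keeps the loop from being optimized away
    for (long i = 1; i < 200000000L; ++i)
        acc += 1.0 / (double) i;
    printf("acc = %f\n", acc);
    return 0;
}

perf stat ./spin should print counts for events such as cycles and instructions, and perf record ./spin followed by perf report should show main as the hottest symbol. Depending on the distribution, you may need a linux-tools package matching your kernel and/or sufficient permissions (see /proc/sys/kernel/perf_event_paranoid).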

3. Intel Parallel Studio

Intel Parallel Studio [6] is a software development product that facilitates native code development on Windows and Linux in C++/C and Fortran for parallel computing. Parallel programming enables software programs to take advantage of multi-core processors from Intel and other processor vendors.
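
As a rough illustration of the kind of code these tools target, here is a minimal OpenMP sketch (a hypothetical example; it builds with any OpenMP-capable compiler, e.g. g++ -fopenmp or the Intel C++ Compiler with its OpenMP flag):

// sum_openmp.cpp -- hypothetical OpenMP example: a parallel reduction over an array.
#include <omp.h>
#include <vector>
#include <cstdio>

int main()
{
    std::vector<double> v(10 * 1000 * 1000, 1.0);
    double sum = 0.0;

    // The loop iterations are split across the available cores;
    // the reduction clause combines the per-thread partial sums.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long) v.size(); ++i)
        sum += v[i];

    printf("sum = %f (up to %d OpenMP threads)\n", sum, omp_get_max_threads());
    return 0;
}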

Parallel Studio is composed of several component parts, each of which is a collection of capabilities:

  • Intel Parallel Composer
    • Intel C++ Compiler with Cilk Plus and OpenMP;
    • Intel Fortran Compiler with OpenMP;
    • IDE plug-in integration with Visual Studio and Eclipse;
    • Debugging via Visual Studio Debugger extensions, GNU Debugger extensions;
    • Intel Integrated Performance Primitives (IPP), Intel Math Kernel Library (MKL) and Threading Building Blocks (TBB) libraries;
    • Intel Data Analytics Acceleration Library (DAAL).
  • Intel Parallel Advisor helps programmers decide where to parallelize their code, and whether the resulting performance gain will be worth the effort.
  • Intel VTune Amplifier is a performance profiler that analyzes hotspots, concurrency and locks-and-waits.
  • Intel Parallel Inspector improves reliability by identifying memory errors and threading errors.

TASK 3: (1p) Install Intel Parallel Studio on your computers. Locate the place where it is installed and find the directory where each of the above components is located.

4. A practical approach to an optimization problem

In this section, we will focus on analyzing a software application. We will analyze both a serial and a parallel implementation. The application is called “tachyon” and you can find the source code attached to this lab.

Before compilation, you must install the X11 dev tools and create a set of symlinks. For Ubuntu 15.10 64 bit, we must do the following:

  • install dependencies
     sudo apt-get install libx11-dev 
  • create the symlinks:
    •  sudo mkdir /usr/lib64 
    •  sudo ln -s /usr/lib/x86_64-linux-gnu/libX11.so /usr/lib64/libX11.so 
    •  sudo ln -s /usr/lib/x86_64-linux-gnu/libXext.so /usr/lib64/libXext.so 

To compile it, extract the archive to the local disk and run make. You can test the build by running the following in the same directory:

./tachyon_find_hotspots dat/balls.dat 

You should see a window like the one below:

TASK 4: (1p) Download the archive, compile the application and test it.

Finding hotspots

Using Intel VTune Amplifier

  1. Open Intel VTune Amplifier. From the local terminal connect to fep.grid.pub.ro using ssh
ssh -X fep.grid.pub.ro -l user_cs_curs_pub_ro
$ qlogin -q ibm-dp.q

then type

module load utilities/intel_parallel_studio_xe_2016

to load the Intel module. This script sets the PATH environment variable that specifies the locations of the product's graphical user interface utility and command line utility. After this, open VTune with the command:

amplxe-gui
  2. Create a performance baseline for the application

  • Create a new project. Using New > Project, specify a project name. This will create a project directory under $HOME/intel/amplxe/projects and will open the Choose Target and Analysis Type window with the Analysis Target tab active. From the left pane, select the local target system and, from the right pane, select the Application to Launch target type from the drop-down menu. For the Application field, browse to the directory where you compiled the application and choose tachyon_find_hotspots. For the Application parameters field, browse in the same way and choose the file balls.dat from the dat directory. After filling in those two fields, click on the Choose Analysis button on the right to switch to the analysis type configuration.
  • Run an analysis. From the analysis tree on the left, select Algorithm Analysis → Basic Hotspots. The right pane is updated with the default options for the Basic Hotspots analysis. Click the Start button on the right command bar. VTune Amplifier launches the executable that takes the input and renders an image displaying the execution time before exiting. VTune Amplifier finalizes the collected data and opens the results.
  3. Interpret Results

  • Basic Hotspot Metrics. Start analysis with the Summary window. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means. Note that CPU Time for the sample application is equal to 13.850 seconds. It is the sum of CPU time for all application threads. Total Thread Count is 1, so the sample application is single-threaded. The tachyon_find_hotspots application ran mostly on one logical CPU. If you hover over the highest bar, you see that it spent 14.320 seconds using one core only, which is classified by the VTune Amplifier as a Poor utilization for a multicore system. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.
    • The Top Hotspots section provides data on the most time-consuming functions (hotspot functions) sorted by CPU time spent on their execution.
    • For the sample application, the initialize_2D_buffer function, which took 8.031 seconds to execute, shows up at the top of the list as the hottest function.
    • The Others entry at the bottom shows the sum of CPU time for all functions not listed in the table.
    • The CPU Usage Histogram represents the Elapsed time and usage level for the available logical processors.

  • Analyze the Most Time-consuming Functions. Click the Bottom-up tab to explore the Bottom-up pane. Analyze the CPU Time column values. This column is marked with a yellow star as the Data of Interest column, which means that the VTune Amplifier uses this type of data for some calculations (for example, filtering, stack contribution, and others). Functions that took the most CPU time to execute are listed on top. The initialize_2D_buffer function took the maximum time to execute, 8.031 seconds, and had the longest poor CPU utilization (red bars). This means that the processor cores were underutilized most of the time spent executing this function. To get detailed CPU usage information per function, use the Expand button in the Bottom-up pane to expand the CPU Time column.
  4. Analyze Code. The hottest function in the application is initialize_2D_buffer. Use the Source/Assembly buttons to toggle the Source/Assembly panes on/off. Assembler instructions are grouped by basic blocks, and the instructions for the selected hotspot function are highlighted. To get help on an assembler instruction, right-click the instruction and select Instruction Reference. When you identify a hotspot in the serial code, you can change the code to tune the algorithm and speed up that hotspot. Another option is to parallelize the sample code by adding threads to the application so that it performs well on multi-core processors. By default, when you double-click the hotspot in the Bottom-up pane, VTune Amplifier opens the source file positioned at the most time-consuming code line of this function. For the initialize_2D_buffer function, this is the line that initializes a memory array using non-sequential memory locations.
  5. Tune Algorithm. Modify initialize_2D_buffer so that it initializes the memory array using sequential memory locations.

  6. Compare with previous results. Select the result in the Project Navigator, right-click and choose Compare Results from the context menu. Specify the Basic Hotspots analysis results you want to compare and click the Compare Results button. In the Summary window, you should see that the Elapsed time improved by 7.842 seconds for the whole application execution. The Top Hotspots section should show the gain in performance for the most critical functions. Switch to the Bottom-up window to compare the two results and see the differences per metric side by side.

:!: If you encounter a compare error related to a lock on the database, close the views/tabs for the analysis.

TASK 5: (3p) Modify initialize_2D_buffer() so that it initializes the memory array using sequential memory locations. Perform a Basic Hotspots analysis and compare with the previous results.
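
The pattern to look for is sketched below (a hypothetical illustration of the access-order issue, not the actual tachyon source): traversing a 2D buffer column by column touches memory with a stride of one full row, while traversing it row by row follows the order in which the buffer is laid out in memory.

// access_order.cpp -- hypothetical sketch of the access-pattern change behind TASK 5.
#include <cstdlib>

// 'buf' is a width*height buffer stored row by row (row-major).

// Non-sequential (strided) initialization: consecutive iterations are
// 'width' elements apart in memory, so most writes miss the cache.
void init_strided(unsigned int *buf, int width, int height)
{
    for (int x = 0; x < width; ++x)
        for (int y = 0; y < height; ++y)
            buf[y * width + x] = 0;
}

// Sequential initialization: consecutive iterations touch consecutive
// addresses, so each cache line is fully used before it is evicted.
void init_sequential(unsigned int *buf, int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            buf[y * width + x] = 0;
}

int main()
{
    const int width = 4096, height = 4096;
    unsigned int *buf = (unsigned int *) malloc((size_t) width * height * sizeof(unsigned int));
    init_strided(buf, width, height);         // slow: strided accesses
    init_sequential(buf, width, height);      // fast: sequential accesses
    free(buf);
    return 0;
}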

Using Valgrind

1. Make sure you have Valgrind and KCachegrind installed on the system and that the application is in its initial state, without any modifications:

sudo apt-get update
sudo apt-get install valgrind kcachegrind

2. We will use the tool callgrind to get information from the running application. Run the following command line:

valgrind --tool=callgrind --collect-jumps=yes --dump-instr=yes --collect-systime=yes -- ./tachyon_find_hotspots dat/balls.dat

3. Open the profile in KCachegrind and click on the Callee Map tab. Also, make sure that the % Relative, Cycle detection and Relative to parent buttons are selected. You should see something like this:
From this image, we can see that valgrind measured that about 78% of the total time was spent in the initialize_2D_buffer function. Double-click the square containing the function name, then select the “Source code” tab to see the problematic code.

TASK 6: (1p) Modify the source code from find_hotspots.cpp, rebuild the application and measure again with valgrind. Compare the output of KCachegrind with that of the unoptimized version.

Using perf

1. Make sure you have perf installed on the system and the application in its initial state, without any modifications.
2. Run the following command line:

perf record -a -g -- ./tachyon_find_hotspots dat/balls.dat

For other perf parameters, see the perf documentation (e.g. man perf-record).

3. Run the following command line to view the collected results:

perf report

You should see a screen like the following:
From this image you can see that perf displays in red the symbol of the function that takes the most CPU time. In our case it is _Z20initialize_2D_bufferPjS_, which corresponds in the C++ source code to the same function identified by VTune and Valgrind.

Hint: To find out the demangled name, use the c++filt command:

 c++filt _Z20initialize_2D_bufferPjS_

TASK 7: (1p) Modify the source code from find_hotspots.cpp, rebuild the application and measure again with perf. Compare the output of perf report with that of the unoptimized version.

Extra: Analyzing locks and waits

In this part, we will use the sample application called “tachyon_analyze_locks” and guide you through the basic steps required to analyze source code for locks and waits when implementing a multithreaded application.

1. Open Intel VTune Amplifier.
Open VTune with the command

amplxe-gui 

2. Create a new project.
Using New > Project, specify a project name. This will create a project directory under $HOME/intel/amplxe/projects and will open the Choose Target and Analysis Type window with the Analysis Target tab active. From the left pane, select the local target system and, from the right pane, select the Application to Launch target type from the drop-down menu. For the Application field, browse to the directory where you compiled the application and choose tachyon_analyze_locks. For the Application parameters field, browse in the same way and choose the file balls.dat from the dat directory. After filling in those two fields, click on the Choose Analysis button on the right to switch to the analysis type configuration.

3. Run an analysis.
From the analysis tree on the left, select Algorithm Analysis > Locks and Waits. The right pane is updated with the default options for the Locks and Waits analysis. Click the Start button on the right command bar. VTune Amplifier launches the executable that takes the input and renders an image displaying the execution time before exiting. VTune Amplifier finalizes the collected data and opens the results in the Locks and Waits viewpoint.

4. Interpret result data
a) Analyze the Basic Locks and Waits Metrics.

The Result Summary section provides data on the overall application performance per the following metrics:

  1. Elapsed Time is the total time the application ran, including data allocation and calculations;
  2. Wait Time occurs when software threads are waiting due to APIs that block or cause synchronization. Wait Time is calculated per thread, so the total Wait time may exceed the application Elapsed time. Expand the Wait Time metric to view a distribution per processor utilization level. In the sample application, most of the Wait time is characterized by ineffective processor usage;
  3. Wait Count is the overall number of times the system wait API was called for the analyzed application;
  4. Spin Time is the time a thread is active in a synchronization construct; the current value exceeds the threshold, so it is classified as a performance issue and highlighted in pink;
  5. CPU Time is the sum of CPU time for all threads;
  6. Total Thread Count is the number of threads in the application;
  7. Paused Time is the amount of Elapsed time during which the analysis was paused via GUI, CLI commands, or user API.

For the tachyon_analyze_locks application, the Wait time is high. To identify the cause, you need to understand how this Wait time was distributed per synchronization objects. The Top Waiting Objects section provides the list of synchronization objects with the highest Wait Time and Wait Count, sorted by the Wait Time metric.

The Thread Concurrency Histogram represents the Elapsed time and concurrency level for the specified number of running threads. Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range.

Note the Target Concurrency value. By default, this number is equal to the number of physical cores. Consider this number as your optimization goal.

The Average metric is calculated as CPU time / Elapsed time. Use this number as a baseline for your performance measurements. The closer this number is to the number of cores, the better. For the sample code, the chart shows that tachyon_analyze_locks is a multithreaded application running at most 10 threads simultaneously on a machine with 12 cores, but it is not using the available cores effectively. The Average Concurrency on the chart is about 0.8, while your target should be to bring it as close to 12 as possible (for a system with 12 cores). Hover over the second bar to understand how long the application ran serially. The tooltip shows that the application ran one thread for almost 7.8 seconds, which is classified as Poor concurrency.

The CPU Usage Histogram represents the Elapsed time and usage level for the logical CPUs. Ideally, the highest bar of your chart should be within the Ok or Ideal utilization range.

The tachyon_analyze_locks application was either idle or ran mostly on one logical CPU. If you hover over the second bar, you see that it spent 5.651 seconds using one core only, which is classified by the VTune Amplifier as a Poor utilization. To understand what prevented the application from using all available logical CPUs effectively, explore the Bottom-up pane.

b) Identify locks
Click the Bottom-up tab to open the Bottom-up pane. For the analyzed sample code, you see that the first synchronization object caused the longest Wait time. The red bar in the Wait Time column indicates that, for most of the time spent waiting on this object, the processor cores were underutilized. It is a Mutex that shows a lot of serial time and is causing a wait. Click the arrow sign next to the object name to expand the node and see the draw_task wait function that contains this mutex and its call stack. Double-click this wait function to see the source code.

5. Analyze the source code
For the sample code, the VTune Amplifier highlights the line entering the rgb_mutex mutex in the draw_task function. The draw_task function was waiting for almost 86 seconds while this code line was executing and most of the time the processor was underutilized. During this time, the critical section was contended 511 times.

The rgb_mutex is the place where the application serializes. Each thread has to wait for the mutex to be available before it can proceed, and only one thread can hold the mutex at a time. We need to optimize the code to make it more concurrent.

6. Solve the problem and test it
Open the source file src/linux/analyze_locks/analyze_locks.cpp. The rgb_mutex was introduced to protect the calculation from multithreaded access. A brief analysis shows that the code is thread safe and the mutex is not really needed. To solve the issue, comment out the lines that use the mutex, save the file, and rebuild the application.
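
The pattern described above can be sketched as follows (a hypothetical illustration, not the actual analyze_locks.cpp code; it builds with g++ -std=c++11 -pthread): each thread writes only its own pixels, so a mutex around that work only serializes the threads.

// locks_sketch.cpp -- hypothetical sketch of the locking issue behind the bonus task.
#include <mutex>
#include <thread>
#include <vector>

static std::mutex rgb_mutex_sketch;           // stands in for the real rgb_mutex

void draw_task_sketch(unsigned int *pixels, int first, int last)
{
    for (int i = first; i < last; ++i) {
        // Before: the whole per-pixel computation sat inside the lock, so only
        // one thread at a time made progress (the serialization VTune reports).
        // std::lock_guard<std::mutex> guard(rgb_mutex_sketch);

        // The computation only writes pixels[i], which no other thread touches,
        // so it is already thread safe and the lock can stay commented out.
        pixels[i] = 0xff00ffu;                // placeholder for the real shading work
    }
}

int main()
{
    std::vector<unsigned int> pixels(1 << 20);
    std::thread t1(draw_task_sketch, pixels.data(), 0, (int) pixels.size() / 2);
    std::thread t2(draw_task_sketch, pixels.data(), (int) pixels.size() / 2, (int) pixels.size());
    t1.join();
    t2.join();
    return 0;
}

After commenting out the lock in the real code, rebuild and re-run the Locks and Waits analysis; the Mutex object should drop out of the Top Waiting Objects list.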

Run tachyon_analyze_locks as follows:

./tachyon_analyze_locks dat/balls.dat

The system runs the tachyon_analyze_locks application. Note that the execution time is reduced from 11.632 seconds to 0.800 seconds.

BONUS TASK: (2p) Modify the source code from analyze_locks.cpp, rebuild the application and measure again with VTune. Discuss the changes made with the teaching assistant.

Reference

Resources
