Profiling PaRSEC Applications with PAPI
While running your PaRSEC-based application, you may be interested in measuring hardware performance information to identify potential improvements to your code or to gather some benchmarks to assess your program's performance. Within PaRSEC, there is a PINS module that utilizes the Performance Application Programming Interface (PAPI) to do just that. PAPI is a great tool for getting hardware counter information during a program's execution and incurs minimal overhead.
In order to use this feature you must install PAPI (available from https://icl.utk.edu/papi/), and you need to set the following CMake parameters:
* PARSEC_PROF_TRACE=ON
* PINS_ENABLE=ON
* PAPI_INCLUDE_DIR=[path/to/papi/include]
** Note: The brackets are not included in the actual input. A sample path would look like: /opt/include/
* PAPI_LIBRARY=[/path/to/papi/lib/libpapi.a]
** Note: The brackets are not included in the actual input. A sample path would look like: /opt/lib/libpapi.a or /opt/lib/libpapi.so, depending on whether you want to link PAPI statically or dynamically.
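As an illustration, a CMake configuration that enables these options might look like the following (the source path and the PAPI installation prefix are placeholders; adjust them to your system):
#!bash
cmake /path/to/parsec/source \
    -DPARSEC_PROF_TRACE=ON \
    -DPINS_ENABLE=ON \
    -DPAPI_INCLUDE_DIR=/opt/include \
    -DPAPI_LIBRARY=/opt/lib/libpapi.so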
Once these parameters have been set, you must specify which PAPI events you would like to monitor and how you would like them to be monitored. To do this, create a .parsec directory in your home directory like so:
#!bash
cd ~
mkdir .parsec
Next, inside this directory, create a file named mca-params.conf with the editor of your choice:
#!bash
cd ~/.parsec
emacs mca-params.conf
In this file, set the MCA parameters that enable the PINS PAPI module and list the events you want to monitor, for example:
#!bash
profile_filename = demo
mca_pins = papi
pins_papi = 1
pins_papi_event = S*:C*:F1:PAPI_L1_ICM,S0:C0:F.01s:PAPI_L1_DCM
Each entry in the pins_papi_event list is specified in the following format, with multiple entries separated by commas:
#!bash
S[socket number]:C[core number]:F[frequency]:[PAPI event name],
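To make the sample configuration above concrete, here is how its two entries decompose under this format (the comments are annotations for this page, not part of the file):
#!bash
# S*:C*:F1:PAPI_L1_ICM     -> every socket (S*) and every core (C*),
#                             sampled with a frequency of 1 task (F1)
# S0:C0:F.01s:PAPI_L1_DCM  -> socket 0, core 0 only,
#                             sampled every .01 seconds (F.01s)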
To figure out what PAPI events are supported on your system along with their names and descriptions, you can navigate to your PAPI installation's bin directory and run the papi_native_avail utility. You can get a less detailed listing with the papi_avail utility.
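For example, assuming PAPI was installed under /opt as in the sample paths above, you could list the available events like this:
#!bash
cd /opt/bin
./papi_avail            # short listing of the PAPI preset events
./papi_native_avail     # detailed listing of the native events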
Now that we've set up our config file, we can run our application to see how it works. For this demo, we'll use the testing_dpotrf (Cholesky factorization) test, located in your build directory under dplasma/testing. After navigating to this directory, we use the following command:
#!bash
./testing_dpotrf -t 100 -N 2000
This command will run a Cholesky factorization on all available cores of the current compute node, with a tile size of 100 and a random 2000x2000 matrix. The output will look something like this:
#!bash
[dancer.icl.utk.edu:30386] Using PAPI version 5040000
[dancer.icl.utk.edu:30386] Could not find a proposed time unit equivalent for nanosecond. Fall back to nanosecond
[dancer.icl.utk.edu:30386] No units found. Assuming task-based frequency: 1
[dancer.icl.utk.edu:30386] Valid PAPI event PAPI_L1_ICM on socket -1 (-1 for all), core -1 (-1 for all) with frequency 1 tasks
[dancer.icl.utk.edu:30386] Valid PAPI event PAPI_L1_DCM on socket 0 (-1 for all), core 0 (-1 for all) with frequency 10000000.000000 nanosecond
Found 1 components, activated 1
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_DCM{int64_t}; core 0 socket 0 frequency 10000000.000000 ns enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[****] TIME(s) 0.06466 : dpotrf PxQ= 1 1 NB= 100 N= 2000 : 41.271038 gflops
Now that you have completed your run, there will be a file in your current directory with a name that looks like demo-0.prof-[random]. You can write a python module to extract the information from this file by following the instructions here. There is also a section below that describes a simple python module you can use to get some simple plots of the PAPI counter values over time.
PaRSEC exposes a series of internal counters as PAPI events using the Software Defined Events (SDE) architecture. To access these counters, the final program needs to be linked with a recent version of libpapi that features the SDE interface. These PAPI events appear with the prefix sde:::PARSEC::. If papi_native_avail is linked with libparsec, it will show all available events exposed by PaRSEC. Alternatively, an application that links both PaRSEC and PAPI (with SDE enabled) can list these events by enumerating the events of the sde component through the standard PAPI event-enumeration interface.
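For instance, assuming your papi_native_avail utility was built against an SDE-enabled PAPI and can load libparsec at run time, you could filter its output for the PaRSEC counters:
#!bash
cd /opt/bin
./papi_native_avail | grep "sde:::PARSEC"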
To access these events inside a PINS tracing, one just needs to list the event in the pins_papi_event MCA parameter. For example:
$shell > cat ~/.parsec/mca-params.conf
profile_filename = /tmp/parsec-pins-profile
pins_papi_event = sde:::PARSEC::SCHEDULER::PENDING_TASKS
mca_pins = papi
pins_papi = 1
$shell >
This will trace the number of pending tasks (tasks that are ready to execute but are waiting in one of the queues defined by the scheduler for this node) at the entry and termination of each task's execution.
In some cases you may want to perform your execution on fewer cores than the maximum available on a node. To do this, simply use the -c option to specify the number of cores. For example, to run the same test as above on 4 cores, you would use the following command line:
#!bash
./testing_dpotrf -c 4 -t 100 -N 2000
When you specify 'n' cores, the first 'n' cores on the current compute node (as ordered by hwloc) will be used. In some instances you may want to choose the specific cores used for the run, in which case you will need to define a virtual process map file. This file specifies a series of //virtual processes//, each of which can contain multiple cores, and these cores will be used for the execution instead of the first 'n' cores. To create one of these files, define each virtual process on its own line, giving the number of cores in the virtual process followed by the core ids (as reported by hwloc) of the cores you want to use, separated by commas. The format looks like this:
#!bash
:[number_of_cores]:[first_core_id],[second_core_id],[etc...]
:[number_of_cores]:[first_core_id],[second_core_id],[etc...]
[etc...]
For our example with 4 cores, let's assume we have a compute node with two sockets, each with an 8-core processor, and that the cores are assigned ids 0 to 7 on the first processor and 8 to 15 on the second. Let's say that we want to use the two lowest-id cores on each processor for our sample run, which would be cores 0, 1, 8, and 9. If we group all of these cores into one virtual process, our virtual process file, called //pmap// for this example, would look like this:
#!bash
:4:0,1,8,9
If we instead want the cores from each processor to be in their own virtual process, we could change our //pmap// file to look like this:
#!bash
:2:0,1
:2:8,9
Now that we have specified our virtual process map file, we can change our command line and run our test on the 4 cores we selected. Assuming our //pmap// file is in the same directory as our test executable, our command line would look like this:
#!bash
./testing_dpotrf -c 4 -t 100 -N 2000 -- -V file:pmap
This python module can be found in [path_to_PaRSEC_home_directory]/tools/profiling/python/examples. It is a simple module that takes as input a binary profile file, multiple binary profile files from the same run (in the case of a distributed run), or a converted HDF5 file. It outputs a combined graph of all of the PAPI counter values over time, plus a separate graph for each counter broken down by the socket, core, and frequency at which that counter was counting. All of these graphs are .png images made with the matplotlib python library.
Usage of this module is extremely simple, with only two optional command-line flags in addition to the list of file names: -t and -u. The -t option specifies the time units used by PaRSEC on your system, which can be nanoseconds, microseconds, or cycles; specify these with 'ns', 'us', or 'c', respectively. The -u option specifies the time units you wish to use for the output plots, which can be nanoseconds, microseconds, milliseconds, or seconds; specify these with 'ns', 'us', 'ms', or 's', respectively.
So, if you want to use this module for the example above, you could rename your binary file from demo-0.prof-[random] to something easier to type, like demo.prof, though this isn't required. To make your life easier, you can also copy the binary file to the same directory as the python module. Let's assume that the system we are working on uses nanoseconds as its time units (as seen in the sample output above) and that we want our plots to use seconds. In this case, we would use the following command line:
#!bash
python plot_scatter_papi.py -t ns -u s demo.prof
This would produce output like the following on the command line:
#!bash
Found units: 'ns'
Graphs will display time in seconds.
Conversion Factor: 1e-09
Importing matplotlib took 0.34679889679 seconds.
Converting binary trace to the ParSEC Trace Tables format...
Converting ['demo.prof']
Event TASK_MEMORY conv <size{int64_t}> length 8
Event Device delegate conv <> length 0
Event PINS_PAPI_S0_C1_F1t_0 conv <PAPI_L1_ICM{int64_t};> length 8
Event PINS_PAPI_S1_C8_F1t_0 conv <PAPI_L1_ICM{int64_t};> length 8
Event PINS_PAPI_S1_C9_F1t_0 conv <PAPI_L1_ICM{int64_t};> length 8
Event PINS_PAPI_S0_C0_F1t_0 conv <PAPI_L1_ICM{int64_t};> length 8
Event PINS_PAPI_S0_C0_F10000000.000000ns_0 conv <PAPI_L1_DCM{int64_t};> length 8
Event MPI_ACTIVATE conv <src{int32_t};dst{int32_t};tid{int64_t};hid{int32_t};fid{int8_t};pad{char[3]}> length 24
Event MPI_DATA_CTL conv <src{int32_t};dst{int32_t};tid{int64_t};hid{int32_t};fid{int8_t};pad{char[3]}> length 24
Event MPI_DATA_PLD_SND conv <src{int32_t};dst{int32_t};tid{int64_t};hid{int32_t};fid{int8_t};pad{char[3]}> length 24
Event MPI_DATA_PLD_RCV conv <src{int32_t};dst{int32_t};tid{int64_t};hid{int32_t};fid{int8_t};pad{char[3]}> length 24
Event potrf_dgemm conv <ddesc_unique_key{uint64_t};ddesc_data_id{uint32_t};ddessc_padding{uint32_t};k{int32_t};m{int32_t};n{int32_t}> length 28
Event potrf_dsyrk conv <ddesc_unique_key{uint64_t};ddesc_data_id{uint32_t};ddessc_padding{uint32_t};k{int32_t};n{int32_t}> length 24
Event potrf_dtrsm conv <ddesc_unique_key{uint64_t};ddesc_data_id{uint32_t};ddessc_padding{uint32_t};k{int32_t};n{int32_t}> length 24
Event potrf_dpotrf conv <ddesc_unique_key{uint64_t};ddesc_data_id{uint32_t};ddessc_padding{uint32_t};k{int32_t}> length 20
....<<<<
Parsing the PBT files took 0.122670888901 seconds, which is 0.0306677222252 seconds per thread.
Then we concatenate the event DataFrames....
events DataFrame concatenation time: 0.00572299957275
Constructed additional structures in 0.00748085975647 seconds.
Generate trace to HDF5 format in 1.19245505333 seconds.
Loading the HDFed trace...
The load took 2.32339406013 seconds.
Populating user-defined lists...
Populating the lists took 0.00188398361206 seconds.
Colors Needed: 5
Plotting all PAPI counters together...
Saving plot as all_papi_counters.png...
Plotting and saving took 0.246990919113 seconds.
Plotting data for: PAPI_L1_DCM...
Saving plot as PAPI_L1_DCM.png...
Plotting and saving took 0.307183027267 seconds.
Plotting data for: PAPI_L1_ICM...
Saving plot as PAPI_L1_ICM.png...
Plotting and saving took 0.438541173935 seconds.
Total Time: 3.93275809288 seconds
The python module will also produce three image files for the plots: all_papi_counters.png, PAPI_L1_DCM.png, and PAPI_L1_ICM.png. These plots can be seen below.
all_papi_counters.png
PAPI_L1_DCM.png
PAPI_L1_ICM.png
You may notice that the PAPI_L1_DCM counter is not sampled exactly every .01 seconds as we specified, and that the PAPI_L1_ICM counters for some of the cores have a break period in them even though they are supposed to be sampled after every task. These are artefacts of the task scheduling and the length of the execution: this is an extremely short test, so scheduling effects become much more pronounced. You may also notice that the gap for core 0 is much larger than for the other cores, because the control flow is handled by that core. This also explains why the PAPI_L1_DCM values do not appear every .01 seconds: there aren't always tasks scheduled on that core.