Comparison of parallel matrix multiplication methods using OpenMP, focusing on cache efficiency, runtime, and performance analysis with Intel VTune.

yigitbektasgursoy/openmp-matrix-optimization


Matrix Multiplication Optimization Project

A compact yet powerful demonstration of matrix multiplication optimizations using cache blocking, memory alignment, loop unrolling, and multi-threading (OpenMP).

Highlights

  • Naive vs. Optimized
    Compare a simple triple-nested loop (matmul_naive.c) against optimized approaches (cache-blocked, aligned, unrolled).

  • Multi-threading
    All methods support OpenMP for parallel execution and improved CPU utilization.

  • Analysis
    Profiling with Intel VTune plus custom scripts yields metrics on:

    • Execution Time
    • Speedup
    • L1/LLC Cache Miss Rates
    • CPU Utilization
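
To make the baseline concrete, here is a hedged sketch of what a kernel like matmul_naive.c typically looks like with the OpenMP parallelization the highlights describe. The function name, signature, and row-major layout are assumptions for illustration; the repository's actual code may differ:

```c
#include <stddef.h>

/* Hypothetical sketch of a naive kernel in the spirit of matmul_naive.c:
 * C = A * B for n x n row-major matrices. The outer loop is split across
 * OpenMP threads; each thread computes a disjoint band of rows of C. */
void matmul_naive(const double *A, const double *B, double *C, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            /* Innermost loop strides down a column of B, which is the
             * cache-unfriendly access pattern the optimized variants fix. */
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

The column-wise walk over B is what drives the high L1/LLC miss rates measured for the naive version.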

Directory Overview

  • src/

    • Core implementations (matmul_naive.c, matmul_blocked.c, etc.)
    • test_matmul.c for validation and performance checks
  • logs/

    • Recorded performance data (cache miss rates, CPU usage)
  • graphs/

    • Plots illustrating key performance metrics (shown below)
  • scripts/

    • Automation and visualization scripts (e.g., cache_analysis_draw.py, compare_threading.py)
  • report/

    • Methodology, results, and conclusions in a concise PDF/Markdown document
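
For orientation, a hedged sketch of the cache-blocking idea behind a file like matmul_blocked.c: the three loops are tiled so each BLOCK x BLOCK working set of A, B, and C stays cache-resident. The tile size, loop order, and names are assumptions; the repository's version additionally applies memory alignment and loop unrolling, which are omitted here:

```c
#include <stddef.h>

#define BLOCK 64  /* assumed tile edge; the real code may tune this per cache level */

/* Hypothetical cache-blocked kernel: iterate over tiles, then accumulate
 * inside each tile with A's element hoisted into a register. Parallelizing
 * the ii loop keeps each thread's writes to C on disjoint rows (no races). */
void matmul_blocked(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;

    #pragma omp parallel for
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t imax = ii + BLOCK < n ? ii + BLOCK : n;
                size_t kmax = kk + BLOCK < n ? kk + BLOCK : n;
                size_t jmax = jj + BLOCK < n ? jj + BLOCK : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double a = A[i * n + k];
                        /* Unit-stride sweep over a row of B and C: this is
                         * the access pattern that cuts the miss rates. */
                        for (size_t j = jj; j < jmax; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Swapping the j and k loops inside the tile (the ikj order above) is what turns B's column walk into a unit-stride row walk.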

Detailed Graphs

All of the generated PNG plots live in the graphs/ folder, separated by metric and matrix size. Each metric below is plotted for matrix sizes 1024, 2048, and 4096:

1) CPU Utilization

2) Execution Time

3) L1-dcache Miss Percentage

4) LLC-load Miss Percentage

5) Speedup

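The speedup plotted above is the usual ratio of serial to parallel wall-clock time; dividing again by the thread count gives parallel efficiency. A tiny helper (hypothetical, not taken from the repository) makes the definitions concrete:

```c
/* speedup    = T_serial / T_parallel
 * efficiency = speedup / thread count (1.0 means perfect scaling) */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

double efficiency(double t_serial, double t_parallel, int threads)
{
    return speedup(t_serial, t_parallel) / threads;
}
```

For example, a serial run of 8 s that drops to 2 s on 8 threads gives a speedup of 4x at 50% efficiency.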

Conclusion

By integrating cache blocking, memory alignment, loop unrolling, and multi-threading, we significantly reduce cache misses and boost CPU utilization. Check out the logs for raw data, graphs for visual insights, and the report folder for a comprehensive discussion of these results.
