Fast, Device-Agnostic, AbstractFFTs-compatible DCT library for Julia.
AcceleratedDCTs.jl provides highly optimized Discrete Cosine Transform implementations for 1D, 2D and 3D data:
- DCT-II (Standard "DCT") and DCT-III (Inverse DCT)
- DCT-I and IDCT-I (symmetric boundary conditions)
It leverages KernelAbstractions.jl to run efficiently on both CPUs (multithreaded) and GPUs (CUDA, AMD, etc.), and implements the AbstractFFTs.jl interface for easy integration.
- ⚡ **High Performance**: optimized algorithms (Makhoul's method) that outperform standard separable approaches.
- 🚀 **Device Agnostic**: runs on CPU (threads) and GPU (`CuArray`, `ROCArray` via KernelAbstractions).
- 🔥 **VkDCT Backend**: pre-compiled VkFFT-based CUDA library (`VkDCT_jll`) for DCT-I, offering ~15x speedup on GPU. Zero setup: just `using CUDA, AcceleratedDCTs`.
- 🧩 **AbstractFFTs Compatible**: zero-allocation `mul!`, `ldiv!`, and precomputed `Plan` support.
- 📦 **3D Optimized**: specialized 3D kernels that avoid redundant transposes.
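Makhoul's method, mentioned above, computes a length-N DCT-II from a single length-N FFT of a reordered input rather than an FFT of a doubled signal. A minimal pure-Julia sketch of that identity, using the unnormalized DCT-II convention and a naive O(N²) DFT as a stand-in for the package's real FFT backend (the function names here are illustrative, not part of the API):

```julia
# Direct (definition) DCT-II, unnormalized: C[k] = Σₙ x[n] cos(π(2n+1)k / 2N).
dct2_direct(x) = [sum(x[n+1] * cos(pi * (2n + 1) * k / (2length(x)))
                      for n in 0:length(x)-1)
                  for k in 0:length(x)-1]

# Naive O(N²) DFT — placeholder for the cuFFT/FFTW call in the real package.
dft(v) = [sum(v[n+1] * cis(-2pi * n * k / length(v)) for n in 0:length(v)-1)
          for k in 0:length(v)-1]

function dct2_makhoul(x)
    N = length(x)
    # Makhoul reordering: even-indexed samples forward, odd-indexed reversed.
    v = vcat(x[1:2:end], reverse(x[2:2:end]))
    V = dft(v)
    # A twiddle factor plus the real part recovers the DCT-II from one DFT.
    [real(cis(-pi * k / (2N)) * V[k+1]) for k in 0:N-1]
end

x = rand(16)
maximum(abs.(dct2_makhoul(x) .- dct2_direct(x)))  # ≈ 0 (floating-point error)
```

The production kernels apply the same pre/post-processing on-device via KernelAbstractions, with the DFT replaced by a batched real FFT.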
```julia
using Pkg
Pkg.add("AcceleratedDCTs")
```

```julia
using AcceleratedDCTs: plan_dct, mul!
using CUDA

# 1. Create data
N = 128
x_gpu = CUDA.rand(Float64, N, N, N)  # any real element type, e.g. Float32

# 2. Create an optimized plan (recommended)
p = plan_dct(x_gpu)

# 3. Execute
y = p * x_gpu        # standard execution
mul!(y, p, x_gpu)    # zero-allocation (in-place output)

# 4. Inverse
x_rec = p \ y
# or
inv_p = inv(p)
mul!(x_rec, inv_p, y)
```

For convenience (slower due to plan-creation overhead):
```julia
using AcceleratedDCTs: dct, idct

y = dct(x_gpu)
x_rec = idct(y)
```

```julia
using AcceleratedDCTs: dct1, idct1, plan_dct1

# One-shot
y = dct1(x_gpu)
x_rec = idct1(y)

# Plan-based (recommended for repeated use)
p = plan_dct1(x_gpu)
y = p * x_gpu
x_rec = p \ y
```

Measurement of 3D DCT performance on varying grid sizes, using `mul!` (where supported) to exclude allocation overhead.
Lower is better.
| Grid Size | cuFFT (Baseline) | Opt 3D DCT | Batched DCT (Old) |
|---|---|---|---|
| | 0.080 ms | 0.113 ms | 1.041 ms |
| | 0.076 ms | 0.131 ms | 0.946 ms |
| | 0.116 ms | 0.246 ms | 1.165 ms |
| | 0.833 ms | 1.423 ms | 3.302 ms |
| | 5.945 ms | 10.417 ms | 26.019 ms |
Note: `Opt 3D DCT` maintains excellent performance across all sizes, being only ~1.75x slower than raw `cuFFT` (due to necessary pre/post-processing). In contrast, the naive `Batched DCT` is ~3.9x slower than FFT. For $N = 256$, `Opt 3D DCT` is >2.2x faster than the batched implementation.
Measurement of 3D DCT-I performance, comparing Opt DCT-I against a raw cuFFT `rfft` of the corresponding size.
| Grid Size | cuFFT rfft (Baseline) | Opt DCT-I | Overhead |
|---|---|---|---|
| | 0.079 ms | 0.108 ms | ~1.36x |
| | 0.245 ms | 0.313 ms | ~1.27x |
| | 1.204 ms | 1.323 ms | ~1.10x |
| | 23.289 ms | 23.951 ms | ~1.03x |
| | 88.519 ms | 92.446 ms | ~1.04x |
Note: Our optimized DCT-I implementation adds minimal overhead (<5% at large sizes) over the raw FFT, demonstrating extremely efficient kernel implementation.
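The comparison against `rfft` is natural because a length-N DCT-I is exactly the (real) DFT of the even-symmetric extension of the signal, of length 2(N−1). A hedged pure-Julia sketch of that relationship, using the unnormalized DCT-I convention, a naive O(N²) DFT in place of cuFFT, and illustrative function names (assumes N ≥ 3):

```julia
# Direct (definition) DCT-I, unnormalized:
# C[k] = x[0] + (-1)^k x[N-1] + 2 Σ_{n=1}^{N-2} x[n] cos(πnk / (N-1)).
function dct1_direct(x)
    N = length(x)
    [x[1] + (-1)^k * x[N] +
     2sum(x[n+1] * cos(pi * n * k / (N - 1)) for n in 1:N-2)
     for k in 0:N-1]
end

# Naive O(N²) DFT — placeholder for the cuFFT rfft used on GPU.
dft(v) = [sum(v[n+1] * cis(-2pi * n * k / length(v)) for n in 0:length(v)-1)
          for k in 0:length(v)-1]

function dct1_via_fft(x)
    N = length(x)
    y = vcat(x, reverse(x[2:N-1]))  # even-symmetric extension, length 2(N-1)
    real.(dft(y))[1:N]              # spectrum is real; first N bins = DCT-I
end
```

Because the extension is symmetric, the imaginary parts vanish and only the first N frequency bins are distinct, which is why the DCT-I cost tracks the `rfft` baseline so closely.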
For maximum performance on NVIDIA GPUs (providing 7x-15x speedup over the device-agnostic backend), AcceleratedDCTs.jl integrates VkDCT_jll, a pre-compiled VkFFT-based CUDA library. No manual compilation is required.
When CUDA.jl is loaded, the VkDCTExt extension automatically activates and accelerates plan_dct1 for CuArray:
```julia
using AcceleratedDCTs
using CUDA

# Automatically uses the VkDCT backend on GPU
p = plan_dct1(CuArray(rand(128, 128, 128)))
```

Note: `VkDCT_jll` is installed automatically as a dependency. On systems without CUDA, it has no effect.
When FFTW.jl is loaded, the FFTWExt extension activates and replaces the generic plan_dct1 / plan_idct1 for CPU Array inputs with FFTW's native REDFT00 (real-even DFT), which computes DCT-I directly in a single optimized call:
```julia
using AcceleratedDCTs
using FFTW  # ← loads the FFTWExt extension

x = rand(64, 64, 64)
p = plan_dct1(x)  # uses FFTW REDFT00 (fast)
y = p * x
```

Important: Without `using FFTW`, `plan_dct1` on a CPU `Array` falls back to the generic separable implementation (pre-process → complex FFT → post-process), which is significantly slower. For best CPU DCT-I performance, always load FFTW:

```julia
using FFTW  # required for optimal CPU DCT-I
using AcceleratedDCTs
```

Note that the separable fallback itself still requires some FFT backend (e.g. FFTW) to be loaded for its internal `plan_fft!` calls.
AcceleratedDCTs.jl uses Julia's package extensions to keep heavy dependencies optional:
| Extension | Trigger | Provides |
|---|---|---|
| `FFTWExt` | `using FFTW` | Optimized CPU DCT-I via REDFT00 |
| `VkDCTExt` | `using CUDA` | GPU DCT-I via VkFFT (7–15x faster) |
The core package depends only on AbstractFFTs and KernelAbstractions, keeping it lightweight and compatible with alternative FFT backends.
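For reference, Julia's package-extension mechanism is declared in a package's `Project.toml`. A sketch of the layout implied by the table above (illustrative, not the package's actual file; the UUIDs are the ones registered for FFTW.jl and CUDA.jl):

```toml
# Weak dependencies: only loaded if the user loads them explicitly.
[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
FFTW = "7a1cc6ca-52ef-59f5-83cd-3a7055c09341"

# Extension modules activated when the corresponding weakdep is loaded.
[extensions]
FFTWExt = "FFTW"
VkDCTExt = "CUDA"
```

This is what lets the core package stay lightweight: neither FFTW nor CUDA is pulled in until the user's own `using` statement triggers the extension.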
Comprehensive documentation is available at https://liuyxpp.github.io/AcceleratedDCTs.jl/dev/.
The documentation includes:
- Quick Start & Tutorial: Usage examples and the plan-based API.
- Theory & Algorithms: Mathematical background of Makhoul's algorithm.
- Implementation Details: Insights into `KernelAbstractions` and buffer management.
- Benchmarks: In-depth performance analysis on CPU and GPU.
- API Reference: Detailed function documentation.
Most of the source code and documentation in this project was generated by Claude Opus 4.5 (thinking) and Gemini 3.0 Pro (High) in Google Antigravity. The LLMs were guided by a human over many rounds toward a pre-designed goal, and the AI-generated content was carefully reviewed by a human. Correctness is verified against FFTW and via round-trip transforms; see the test folder for details.