Skip to content

liuyxpp/AcceleratedDCTs.jl

Repository files navigation

AcceleratedDCTs.jl

Development documentation Test workflow status Docs workflow Status BestieTemplate

Fast, Device-Agnostic, AbstractFFTs-compatible DCT library for Julia.

AcceleratedDCTs.jl provides highly optimized Discrete Cosine Transform implementations for 1D, 2D and 3D data:

  • DCT-II (Standard "DCT") and DCT-III (Inverse DCT)
  • DCT-I and IDCT-I (symmetric boundary conditions)

It leverages KernelAbstractions.jl to run efficiently on both CPUs (multithreaded) and GPUs (CUDA, AMD, etc.), and implements the AbstractFFTs.jl interface for easy integration.

Key Features

  • ⚑ High Performance: optimized algorithms (Makhoul's method) that outperform standard separable approaches.
  • πŸš€ Device Agnostic: Runs on CPU (Threads) and GPU (CuArray, ROCArray via KernelAbstractions).
  • πŸ”₯ VkDCT Backend: Pre-compiled VkFFT-based CUDA library (VkDCT_jll) for DCT-I offering ~15x speedup on GPU. Zero-setup: just using CUDA, AcceleratedDCTs.
  • 🧩 AbstractFFTs Compatible: Zero-allocation mul!, ldiv!, and precomputed Plan support.
  • πŸ“¦ 3D Optimized: Specialized 3D kernels that avoid redundant transposes.

Installation

using Pkg
Pkg.add("AcceleratedDCTs")

Quick Start

Basic Usage

using AcceleratedDCTs: plan_dct, mul!
using CUDA

# 1. Create Data
N = 128
x_gpu = CUDA.rand(Float64, N, N, N)  # can be any Real, e.g. Float32

# 2. Create Optimized Plan (Recommended)
p = plan_dct(x_gpu)

# 3. Execute
y = p * x_gpu           # Standard execution
mul!(y, p, x_gpu)       # Zero-allocation (in-place output)

# 4. Inverse
x_rec = p \ y
# or
inv_p = inv(p)
mul!(x_rec, inv_p, y)

One-shot Functions

For convenience (slower due to plan creation overhead):

using AcceleratedDCTs: dct, idct

y = dct(x_gpu)
x_rec = idct(y)

DCT-I (Symmetric Boundary)

using AcceleratedDCTs: dct1, idct1, plan_dct1

# One-shot
y = dct1(x_gpu)
x_rec = idct1(y)

# Plan-based (recommended for repeated use)
p = plan_dct1(x_gpu)
y = p * x_gpu
x_rec = p \ y

Benchmarks

Measurement of 3D DCT performance on varying grid sizes ($N^3$). Results collected using in-place mul! (where supported) to exclude allocation overhead. Lower is better.

DCT-II (dct) Performance (GPU, NVIDIA RTX 2080 Ti)

Grid Size ($N^3$) cuFFT (Baseline) Opt 3D DCT Batched DCT (Old)
$16^3$ 0.080 ms 0.113 ms 1.041 ms
$32^3$ 0.076 ms 0.131 ms 0.946 ms
$64^3$ 0.116 ms 0.246 ms 1.165 ms
$128^3$ 0.833 ms 1.423 ms 3.302 ms
$256^3$ 5.945 ms 10.417 ms 26.019 ms

Note: Opt 3D DCT maintains excellent performance across all sizes, being only ~1.75x slower than raw cuFFT (due to necessary pre/post-processing). In contrast, the naive Batched DCT is ~3.9x slower than FFT. For $N=256$, Opt 3D DCT is >2.2x faster than the batched implementation.

DCT-I (dct1) Performance (GPU, NVIDIA RTX 2080 Ti)

Measurement of 3D DCT-I performance. Compares Opt DCT-I against raw cuFFT rfft of size $(2M-2)^3$ to measure overhead (since CUDA has no native DCT-I).

Grid Size ($M^3$) cuFFT rfft (Baseline) Opt DCT-I Overhead
$16^3$ 0.079 ms 0.108 ms ~1.36x
$32^3$ 0.245 ms 0.313 ms ~1.27x
$64^3$ 1.204 ms 1.323 ms ~1.10x
$128^3$ 23.289 ms 23.951 ms ~1.03x
$256^3$ 88.519 ms 92.446 ms ~1.04x

Note: Our optimized DCT-I implementation adds minimal overhead (<5% at large sizes) over the raw FFT, demonstrating extremely efficient kernel implementation.

VkDCT Extension (High Performance GPU DCT-I)

For maximum performance on NVIDIA GPUs (providing 7x-15x speedup over the device-agnostic backend), AcceleratedDCTs.jl integrates VkDCT_jll, a pre-compiled VkFFT-based CUDA library. No manual compilation is required.

When CUDA.jl is loaded, the VkDCTExt extension automatically activates and accelerates plan_dct1 for CuArray:

using AcceleratedDCTs
using CUDA

# Automatically uses VkDCT backend on GPU
p = plan_dct1(CuArray(rand(128, 128, 128)))

Note: VkDCT_jll is installed automatically as a dependency. On systems without CUDA, it has no effect.

FFTW Extension (Optimized CPU DCT-I)

When FFTW.jl is loaded, the FFTWExt extension activates and replaces the generic plan_dct1 / plan_idct1 for CPU Array inputs with FFTW's native REDFT00 (real-even DFT), which computes DCT-I directly in a single optimized call:

using AcceleratedDCTs
using FFTW   # ← loads the FFTWExt extension

x = rand(64, 64, 64)
p = plan_dct1(x)   # Uses FFTW REDFT00 (fast)
y = p * x

Important

Without using FFTW, plan_dct1 on CPU Array falls back to the generic separable implementation (pre-process β†’ complex FFT β†’ post-process), which is significantly slower. For best CPU DCT-I performance, always load FFTW:

using FFTW              # Required for optimal CPU DCT-I
using AcceleratedDCTs

Note that the separable fallback itself still requires some FFT backend (e.g. FFTW) to be loaded for its internal plan_fft! calls.

Extension Architecture

AcceleratedDCTs.jl uses Julia's package extensions to keep heavy dependencies optional:

Extension Trigger Provides
FFTWExt using FFTW Optimized CPU DCT-I via REDFT00
VkDCTExt using CUDA GPU DCT-I via VkFFT (7–15x faster)

The core package depends only on AbstractFFTs and KernelAbstractions, keeping it lightweight and compatible with alternative FFT backends.

Documentation

Comprehensive documentation is available at https://liuyxpp.github.io/AcceleratedDCTs.jl/dev/.

The documentation includes:

AI Usage Disclaimer

Most of source codes and docs in this project are generated by Claude Opus 4.5 (thinking) and Gemini 3.0 Pro (High) in Google Antigravity. The LLM are guided by human with many rounds to achieve a pre-designed goal. And the AI generated contents are carefully examined by human. The correctness are verified with FFTW and the roundtrip transform. See test folder for verification details.

About

Device agnostic implementation of 1D/2D/3D DCTs.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors