
VideoNSA: Native Sparse Attention for Video Understanding

Figure: VideoNSA overview.

VideoNSA is a learnable, hardware-aware sparse-attention framework for efficient video understanding, processing up to 128K vision-text tokens using only 3.6% of the full attention budget.

News

  • [2025-10] We released our training data on Hugging Face.
  • [2025-10] Code and model released; paper released on arXiv.
  • [2025-10] Project website launched.

Installation

For Training

Currently, we build on existing open-source implementations to support training, namely ms-swift and flash-linear-attention (installed below).

We also recommend exploring Scalable-Flash-Native-Sparse-Attention, which provides a highly optimized and scalable implementation of native sparse attention.

# Clone the repository
git clone https://github.com/Espere-1119-Song/VideoNSA.git
cd VideoNSA

# Install ms-swift
cd ms-swift
pip install -e .

# Install flash-linear-attention
pip uninstall fla-core flash-linear-attention -y
pip install -U git+https://github.com/fla-org/flash-linear-attention

cd ..
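
If the editable installs succeeded, a quick import check should pass without errors. This check is an optional addition of ours, assuming ms-swift imports as swift and flash-linear-attention as fla:

# Optional sanity check: confirm both training dependencies import cleanly
python -c "import swift, fla; print('ok')"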

For Evaluation

# Install UV package manager
pip install uv

# Install lmms-eval
uv pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

# Install additional dependencies
pip install flash-attn --no-build-isolation
pip install qwen_vl_utils
pip install accelerate

# Login to Hugging Face
huggingface-cli login
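
Optionally, you can confirm that the evaluation dependencies import before launching a run. This check is our addition, not part of the upstream instructions:

# Optional sanity check for the evaluation stack
python -c "import lmms_eval, flash_attn, qwen_vl_utils, accelerate; print('ok')"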

For Baseline Comparisons (Optional)

# Install MInference for testing other sparse attention baselines
pip install minference

Usage

Training

cd ms-swift

# Run training script
bash ../scrips/train.sh

Note: Before training, modify the following in scrips/train.sh (a sketch of these fields follows the list):

  • $YOUR_DATASET: Path to your training dataset
  • --output_dir: Output directory for checkpoints
  • --logging_dir: Directory for logs
  • GPU settings (CUDA_VISIBLE_DEVICES, NPROC_PER_NODE, WORLD_SIZE)
  • Hyperparameters (learning rate, batch size, etc.)
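
For orientation, the fields above typically appear in an ms-swift launch like the sketch below. This is an illustration with placeholder paths and values, not the actual contents of scrips/train.sh; the directory names and hyperparameter settings are ours:

# Illustrative sketch only -- placeholder paths and values, adapt to your setup
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift sft \
    --dataset $YOUR_DATASET \
    --output_dir ./checkpoints/videonsa \
    --logging_dir ./logs/videonsa \
    --learning_rate 1e-5 \
    --per_device_train_batch_size 1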

Evaluation

cd lmms-eval

# Run evaluation script
bash ../scrips/eval.sh

Note: Before evaluation, set the following environment variables in scrips/eval.sh (see the sketch after the lists below):

  • $MAX_PIXELS: Maximum pixels for video processing
  • $FPS: Frames per second for video sampling
  • $MAX_NUM_FRAMES: Maximum number of frames
  • $TASK_NAME: Benchmark task name (e.g., mvbench, videomme)

You can also modify:

  • --num_processes: Number of GPUs to use
  • --batch_size: Batch size per device
  • --output_path: Directory for evaluation results
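
For orientation, a minimal lmms-eval launch using the variables above might look like the following. This is a sketch with placeholder values, and the model name videonsa is an assumption on our part rather than the registered model name; refer to scrips/eval.sh for the real invocation:

# Illustrative sketch only -- all values and the model name are placeholders
export MAX_PIXELS=151200
export FPS=1
export MAX_NUM_FRAMES=128
export TASK_NAME=videomme

accelerate launch --num_processes=8 -m lmms_eval \
    --model videonsa \
    --tasks $TASK_NAME \
    --batch_size 1 \
    --output_path ./eval_results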

Baseline Comparisons

To evaluate sparse attention baselines used in the paper:

cd lmms-eval

# Make sure minference is installed
pip install minference

# Run baseline evaluation
bash ../scrips/baselines.sh

Note: Modify the baseline script to select different sparse attention methods and configure their parameters.
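
As a purely hypothetical illustration of selecting a method, you might pass the choice in through an environment variable; the variable name ATTN_TYPE below is ours, and scrips/baselines.sh may expose its options differently:

# Hypothetical usage -- ATTN_TYPE is a placeholder; check scrips/baselines.sh
# for the actual way methods and their parameters are selected
ATTN_TYPE=minference bash ../scrips/baselines.sh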

TODO

  • Training dataset release

Citation

@misc{song2025videonsanativesparseattention,
      title={VideoNSA: Native Sparse Attention Scales Video Understanding},
      author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
      year={2025},
      eprint={2510.02295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02295},
}
