This repository maintains the DeViBench (Degraded Video Understanding Benchmark) from the HotNets paper "Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI".
The benchmark is continuously growing, and we are considering converting more existing streaming video understanding benchmarks into the DeViBench style.
The dataset file datasets.csv contains the following columns:
| Column | Description | 
|---|---|
| sample_folder | Corresponding video ID. This dataset uses the same video files as the StreamingBench Real-Time Visual Understanding dataset; see the download link below. | 
| start_time | Start time of the video segment. Questions refer to a 5-second segment starting from this timestamp. | 
| question | The question content. | 
| options | Available options for the question. | 
| standard_answer | Standard answer to the question. | 
| task_type | Task type of the question. | 
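As a quick illustration, here is a minimal sketch of reading the dataset with pandas, assuming `datasets.csv` sits at the repository root (pandas is not a requirement of the benchmark):

```python
import pandas as pd

# Load the benchmark questions; columns follow the table above.
df = pd.read_csv("datasets.csv")

# Inspect the first sample.
row = df.iloc[0]
print(row["sample_folder"], row["start_time"], row["task_type"])
print(row["question"])
print(row["options"])
print("Answer:", row["standard_answer"])
```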
The prompt format for asking questions:
You are a multiple-choice question answering assistant.
Your task is to output only the letter of the correct option, without any explanation or extra text.
Question:
[Question]
Options:
[Options]
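A minimal sketch of filling this template from one row of `datasets.csv`; the helper name `build_prompt` is ours, for illustration only:

```python
PROMPT_TEMPLATE = (
    "You are a multiple-choice question answering assistant.\n"
    "Your task is to output only the letter of the correct option, "
    "without any explanation or extra text.\n"
    "Question:\n"
    "{question}\n"
    "Options:\n"
    "{options}\n"
)

def build_prompt(question: str, options: str) -> str:
    """Fill the prompt template with the question and options fields of one sample."""
    return PROMPT_TEMPLATE.format(question=question, options=options)
```

The model's reply is then compared against the single option letter in `standard_answer`.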
Video files can be downloaded from:
https://huggingface.co/datasets/mjuicem/StreamingBench/tree/main
Please match the video files with the corresponding sample_folder in datasets.csv.
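Since each question refers to a 5-second segment starting at `start_time`, the relevant clip can be cut from the downloaded video, for example with ffmpeg (this tooling choice and the paths are ours, not part of the benchmark):

```python
import subprocess

def cut_segment(video_path: str, start_time: float, out_path: str,
                duration: float = 5.0) -> None:
    """Cut the 5-second segment that a question refers to."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start_time),   # seek to the question's start_time
            "-i", video_path,         # full video matched via sample_folder
            "-t", str(duration),      # keep 5 seconds
            out_path,
        ],
        check=True,
    )
```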
An example (200 kbps vs. 2000 kbps) is shown in `example.mp4`. If the video fails to load or shows color distortion, please try the Chrome browser.
Question: How does the road gradient (GRAD) change as the cyclist progresses?
Options:
- (A) Remains constant at -1
- (B) Gradually becomes more negative
- (C) Fluctuates between -1 and -2
- (D) Gradually becomes less negative
Standard Answer: B
Answer from 200 kbps: D
Answer from 2000 kbps: B
Please refer to our HotNets paper for details on context-aware video streaming. Our method allocates more bits to chat-important regions (e.g., purple circles) and fewer bits to chat-irrelevant regions (e.g., yellow circles), thus improving MLLM accuracy.
To achieve fine-grained QP control, we adopt the H.265 codec implemented by Kvazaar to encode both our method and the baseline. Except for the QP values, our method and the baseline use the same encoding parameters.
The specific Kvazaar command line is as follows:

`kvazaar -i {input.yuv} --input-res={resolution} --gop 0 --period 0 --input-fps {fps} --qp {qp} [--roi roi.txt] -o {output.mp4}`

The fps matches the original video. The resolution gradually decreases as the bitrate decreases: for example, we use 1920×1080 at 800 kbps, 1600×900 at 600 kbps, 1280×720 at 400 kbps, and 1024×576 at 200 kbps. `--roi` is optional and is only used in our method. Both our method and the baseline reach the target bitrates by adjusting `--qp`.
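A rough sketch of wrapping this command in Python, using the resolution-per-bitrate values above; finding the `--qp` value that hits a given target bitrate (e.g., by a small per-video search) is not shown. Passing a `--roi` file applies only to our method:

```python
import subprocess

# Resolutions used at each target bitrate (from the text above).
RES_BY_BITRATE_KBPS = {
    800: "1920x1080",
    600: "1600x900",
    400: "1280x720",
    200: "1024x576",
}

def encode(input_yuv: str, bitrate_kbps: int, fps: float, qp: int,
           output_path: str, roi_file: str | None = None) -> None:
    """Encode one video with Kvazaar using the parameters from the command above."""
    cmd = [
        "kvazaar",
        "-i", input_yuv,
        f"--input-res={RES_BY_BITRATE_KBPS[bitrate_kbps]}",
        "--gop", "0",
        "--period", "0",
        "--input-fps", str(fps),
        "--qp", str(qp),
    ]
    if roi_file is not None:
        cmd += ["--roi", roi_file]  # region-of-interest file (roi.txt); our method only
    cmd += ["-o", output_path]
    subprocess.run(cmd, check=True)
```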
@article{wu2025chat,
  title={Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI},
  author={Wu, Jiangkai and Ren, Zhiyuan and Liu, Liming and Zhang, Xinggong},
  journal={arXiv preprint arXiv:2507.10510},
  year={2025}
}
The data is model-generated and may contain minor errors. We recommend using it primarily for exploratory analysis and demonstration purposes.




