Conversation

@Ashp116 (Contributor) commented Jul 30, 2025

Description

This PR introduces a new Video API that streamlines video processing and rendering workflows. It addresses issues #1923 and #1929 by enabling more flexible backend support and improving audio-video synchronization.

With this update, the video processing function supports multiple backends, including PyAV and OpenCV. PyAV is currently the only backend that supports audio rendering, so videos processed with it keep their original audio track.

The PyAV rendering backend requires the optional dependency PyAV.
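
For reference, usage of the updated processing function might look roughly like the sketch below. The backend parameter name is an assumption drawn from this description, not a confirmed signature.

    import cv2
    import supervision as sv

    def callback(frame, index):
        # hypothetical per-frame processing step
        return cv2.GaussianBlur(frame, (11, 11), 0)

    # "backend" is assumed here; per this PR, only the PyAV backend
    # carries the audio stream into the rendered output
    sv.process_video(
        source_path="source.mp4",
        target_path="target.mp4",
        callback=callback,
        backend="pyav",
    )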

Tags:
Fixes #1923
Fixes #1929

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this change been tested? Please provide a test case or example of how you tested the change.

Please refer to #1923 and #1929

Any specific deployment considerations

Ensure that PyAV is installed in the environment to test the PyAV backend.
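
For example, PyAV is published on PyPI as the av package, so running pip install av in the test environment is enough to exercise that backend.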

Docs

  • Docs updated? What were the changes

@Ashp116 Ashp116 requested a review from SkalskiP as a code owner July 30, 2025 19:29
@Ashp116 Ashp116 changed the title from "ADD: Added audio stream for process_video" to "BUG: Added audio stream for process_video" on Jul 30, 2025
@SkalskiP (Collaborator)

Hi @Ashp116 👋🏻 Another great idea! Video processing is probably the oldest part of supervision, written over two years ago, and I’ve been wanting to update its API for a while. Would you be open to not only adding audio support but also helping me with the update?

@Ashp116 (Contributor, Author) commented Jul 31, 2025

Hi @SkalskiP, yeah, I'd like to help update the API. I was thinking of changing how videos are written in process_video: right now, the original compression is lost when annotations are added and the file is written to target_path.
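
To make the compression point concrete, below is a rough PyAV-based sketch of carrying the source stream's encoder settings over to the target while re-encoding annotated frames. It is illustrative only and not the code in this PR; the paths and the annotation step are placeholders.

    import av

    with av.open("source.mp4") as src, av.open("target.mp4", mode="w") as dst:
        in_stream = src.streams.video[0]

        # mirror the source encoder settings so the re-encoded file stays comparable
        out_stream = dst.add_stream(in_stream.codec_context.name, rate=in_stream.average_rate)
        out_stream.width = in_stream.codec_context.width
        out_stream.height = in_stream.codec_context.height
        out_stream.pix_fmt = in_stream.codec_context.pix_fmt
        if in_stream.codec_context.bit_rate:  # may be 0 / unknown for some sources
            out_stream.bit_rate = in_stream.codec_context.bit_rate

        for frame in src.decode(in_stream):
            image = frame.to_ndarray(format="bgr24")
            # ... annotate image here ...
            new_frame = av.VideoFrame.from_ndarray(image, format="bgr24")
            for packet in out_stream.encode(new_frame):
                dst.mux(packet)

        for packet in out_stream.encode():  # flush the encoder
            dst.mux(packet)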

@SkalskiP (Collaborator) commented Aug 1, 2025

Hi @Ashp116 I'm really glad you want to help me! Let's goooo! 🔥 🔥 🔥

I want the functionalities currently found in supervision.utils.video to be reorganized around a new Video class. Importantly, all features previously available in the old API must still be supported in the new one. Ideally, the new API should be more consistent and expressive.

  • get video info (works for files, RTSP, webcams)

    import supervision as sv
     
    # static video
    sv.Video("source.mp4").info
    
    # video stream
    sv.Video("rtsp://...").info
    
    # webcam
    sv.Video(0).info
  • simple frame iteration (object is iterable)

    import supervision as sv
    
    video = sv.Video("source.mp4")
    for frame in video:
        ...
  • advanced frame iteration (stride, sub-clip, on-the-fly resize)

    import supervision as sv
    
    for frame in sv.Video("source.mp4").frames(stride=5, start=100, end=500, resolution_wh=(1280, 720)):
        ...
  • process the video

    import cv2
    import supervision as sv
    
    def blur(frame, i):
        return cv2.GaussianBlur(frame, (11, 11), 0)
    
    sv.Video("source.mp4").save(
        "blurred.mp4",
        callback=blur,
        show_progress=True
    )
  • overwrite target video parameters

    import supervision as sv
    
    sv.Video("source.mp4").save(
        "timelapse.mp4",
        fps=60,
        callback=lambda f, i: f,
        show_progress=True
    )
  • complete manual control with explicit VideoInfo

    from supervision import Video, VideoInfo
    
    source = Video("source.mp4")
    target_info = VideoInfo(width=800, height=800, fps=24)
    
    with src.sink("square.mp4", info=target_info) as sink:
        for f in src.frames():
            f = cv2.resize(f, target_info.resolution_wh)
            sink.write(f)
  • multi-backend support (decode/encode)

    import supervision as sv
    
    video = sv.Video("source.mkv", backend="pyav")
    
    video = sv.Video("source.mkv", backend="opencv")

    suggested minimal protocol (see the illustrative OpenCV sketch just below)

    from typing import Any, Protocol

    import numpy as np

    # VideoInfo is the video metadata dataclass referenced by this proposal

    class Backend(Protocol):
        def open(self, path: str) -> Any: ...
        def info(self, handle: Any) -> VideoInfo: ...
    
        def read(self, handle: Any) -> tuple[bool, np.ndarray]: ...
        def grab(self, handle: Any) -> bool: ...
        def seek(self, handle: Any, frame_idx: int) -> None: ...
    
        def writer(self, path: str, info: VideoInfo, codec: str) -> Writer: ...
    
    class Writer(Protocol):
        def write(self, frame: np.ndarray) -> None: ...
        def close(self) -> None: ...
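
For illustration, here is one way an OpenCV-based backend could satisfy this protocol. This is a rough sketch rather than code from the PR, and it assumes the proposed VideoInfo keeps the fields of the existing supervision.VideoInfo dataclass (width, height, fps, total_frames).

    from typing import Any

    import cv2
    import numpy as np

    from supervision import VideoInfo

    class OpenCVBackend:
        def open(self, path: str) -> Any:
            return cv2.VideoCapture(path)

        def info(self, handle: Any) -> VideoInfo:
            return VideoInfo(
                width=int(handle.get(cv2.CAP_PROP_FRAME_WIDTH)),
                height=int(handle.get(cv2.CAP_PROP_FRAME_HEIGHT)),
                fps=int(handle.get(cv2.CAP_PROP_FPS)),
                total_frames=int(handle.get(cv2.CAP_PROP_FRAME_COUNT)),
            )

        def read(self, handle: Any) -> tuple[bool, np.ndarray]:
            return handle.read()

        def grab(self, handle: Any) -> bool:
            return handle.grab()

        def seek(self, handle: Any, frame_idx: int) -> None:
            handle.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)

        def writer(self, path: str, info: VideoInfo, codec: str = "mp4v") -> "OpenCVWriter":
            return OpenCVWriter(path, info, codec)

    class OpenCVWriter:
        def __init__(self, path: str, info: VideoInfo, codec: str) -> None:
            fourcc = cv2.VideoWriter_fourcc(*codec)
            self._writer = cv2.VideoWriter(path, fourcc, info.fps, (info.width, info.height))

        def write(self, frame: np.ndarray) -> None:
            self._writer.write(frame)

        def close(self) -> None:
            self._writer.release()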

@Ashp116 (Contributor, Author) commented Aug 2, 2025

Hi @SkalskiP,

I’ve addressed most of the features you mentioned, but I have some thoughts on a few aspects of the implementation:

  • .save Functionality
    How would you handle .save for a video feed coming from a webcam or an RTSP stream? At the moment, I've implemented it so that only video files can be saved.

  • Writer and Backend Classes
    This is just my personal opinion, but should these classes be moved to separate scripts/modules? If we add more writers and backends in the future, keeping everything inside the main video script might become cluttered.

  • “Complete manual control with explicit VideoInfo” Functionality

    from supervision import Video, VideoInfo
    
    source = Video("source.mp4")
    target_info = VideoInfo(width=800, height=800, fps=24)
    
    with src.sink("square.mp4", info=target_info) as sink:
        for f in src.frames():
            f = cv2.resize(f, target_info.resolution_wh)
            sink.write(f)

    I’m not fully clear on what this feature is intended to do. In this snippet, the Video instance source is created but never used afterward. Is src supposed to be source? Also, is the goal to create sinks for each backend? Could you please clarify the purpose and expected usage here?

@Ashp116 Ashp116 changed the title from "BUG: Added audio stream for process_video" to "FEATURE: Versatile Video class" on Aug 2, 2025
@Ashp116 (Contributor, Author) commented Sep 1, 2025

Hi @SkalskiP,

It’s been a while! I’ve added better audio support. Previously, I manually manipulated audio packets, along with their DTS and PTS values, to synchronize them with the video. Now I’m using the atempo filter on the audio stream, which keeps the audio in sync with the video much more cleanly.
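
For anyone following along, the general shape of an atempo filter graph in PyAV is sketched below. This is a minimal illustration of the technique, not the code in this PR: the 1.25 tempo factor is an arbitrary example, and the exact exception classes used to drain the sink vary slightly between PyAV versions.

    import av

    container = av.open("source.mp4")  # placeholder path
    audio_stream = container.streams.audio[0]

    # build a small filter graph: abuffer -> atempo -> abuffersink
    graph = av.filter.Graph()
    src = graph.add_abuffer(template=audio_stream)
    atempo = graph.add("atempo", "1.25")  # speed the audio up by 25% as an example
    sink = graph.add("abuffersink")
    src.link_to(atempo)
    atempo.link_to(sink)
    graph.configure()

    for frame in container.decode(audio_stream):
        graph.push(frame)
        while True:
            try:
                out_frame = graph.pull()
            except (av.error.BlockingIOError, EOFError):  # sink drained for now
                break
            # out_frame is time-stretched audio, ready to be re-encoded/muxed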

I’ve included my Colab notebook showcasing the new .show() function. Next, I’ll be working on the documentation and unit tests for audio.

I’d love to hear your thoughts and get your feedback on my current implementation.

Thank you!

@ryashry left a comment


Improve high quality video web

@SkalskiP (Collaborator)

@Ashp116 I know it’s been a while, but would you like to help me wrap up this PR?

@Ashp116 (Contributor, Author) commented Oct 14, 2025

Hello @SkalskiP,

Yeah, I’d still like to help you wrap up this PR as soon as possible. Please do take a look at the changes I made based on the last code review requests.

Development

Successfully merging this pull request may close these issues.

  • Reimplement video utils
  • BUG: Audio stream not captured in process_video
