    Repositories list

    • A curated list of recent diffusion models for video generation, editing, and various other applications.
      2594.4k10 Updated May 17, 2025
    • livecc

      Public
      LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025)
      Python
      2819550 Updated May 16, 2025
    • DoraCycle

      Public
      [CVPR 2025] DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
      12220 Updated May 13, 2025
    • A curated list of recent robot learning papers incorporating diffusion models for robotics tasks.
      516100 Updated May 1, 2025
    • Exo2Ego-V

      Public
      Python
      Apache License 2.0
      04220 Updated Apr 28, 2025
    • Show-o

      Public
      [ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
      Python
      Apache License 2.0
      601.4k432 Updated Apr 28, 2025
    • omg

      Public
      Open Multimodal Gathering workshop @ NUS
      JavaScript
      0000 Updated Apr 28, 2025
    • Code Implementation of "PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data"
      Python
      MIT License
      2638681 Updated Apr 23, 2025
    • VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
      Python
      Apache License 2.0
      47463270 Updated Apr 23, 2025
    • FAR

      Public
      Code for: "Long-Context Autoregressive Video Modeling with Next-Frame Prediction"
      Python
      MIT License
      820300 Updated Apr 23, 2025
    • ROICtrl

      Public
      Code for [CVPR 2025] ROICtrl: Boosting Instance Control for Visual Generation
      Python
      010820 Updated Apr 16, 2025
    • Out-of-the-box (OOTB) GUI Agent for Windows and macOS
      Python
      Apache License 2.0
      1561.6k306 Updated Apr 15, 2025
    • Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
      Python
      66710 Updated Apr 11, 2025
    • 📖 A curated list of resources dedicated to hallucination of multimodal large language models (MLLM).
      2568610 Updated Apr 9, 2025
    • 📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models.
      2654610 Updated Apr 9, 2025
    • Repository of GUI Action Narrator
      JavaScript
      01000 Updated Apr 8, 2025
    • VideoGUI

      Public
      [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos
      JavaScript
      23500 Updated Apr 7, 2025
    • SMS

      Public
      Balanced Image Stylization with Style Matching Score
      12900 Updated Apr 2, 2025
    • Official code of "LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer"
      Python
      MIT License
      34940 Updated Apr 1, 2025
    • Official code of "MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation"
      Python
      MIT License
      918130 Updated Apr 1, 2025
    • FQGAN

      Public
      FQGAN: Factorized Visual Tokenization and Generation
      Python
      Other
      25000 Updated Mar 29, 2025
    • MovieAgent: Automated Movie Generation via Multi-Agent CoT Planning
      Python
      2319470 Updated Mar 26, 2025
    • SAM-I2V

      Public
      Apache License 2.0
      0210 Updated Mar 22, 2025
    • LOVA3

      Public
      (NeurIPS 2024) Official PyTorch implementation of LOVA3
      Python
      28400 Updated Mar 21, 2025
    • Python
      66710 Updated Mar 20, 2025
    • [CVPR 2025] A Hierarchical Movie Level Dataset for Long Video Generation
      Python
      25700 Updated Mar 16, 2025
    • ShowUI

      Public
      [CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
      Python
      Apache License 2.0
      831.2k80 Updated Mar 13, 2025
    • TPDiff

      Public
      TPDiff: Temporal Pyramid Video Diffusion Model
      21910 Updated Mar 13, 2025
    • VLog

      Public
      [CVPR 2025] Video Narration as Vocabulary & Video as Long Document
      Python
      2856780 Updated Mar 13, 2025
    • MovieSeq

      Public
      [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences
      Jupyter Notebook
      13910 Updated Mar 11, 2025