MUSYN is a system for real-time musical co-creation that generates visual art from music. The project proposes a modular workflow spanning audio capture and preprocessing, transformation of music into textual descriptions, and image synthesis. The architecture is deliberately flexible, so that specific artificial intelligence models can be swapped into each stage as needs and technology evolve. By establishing a direct, reactive connection between sound and image, MUSYN goes beyond the traditional concept of synesthesia, providing a platform for exploring and materializing a new form of interactive visual art in which music becomes the creative engine for visual expression.
Installation:
Run ./start.sh, which:
- Installs all Python dependencies (see setup.py).
- Downloads the required music captioning and image generation models.
Usage:
- Run the app:
  python app.py
- Open the web interface: the app launches a Gradio interface in your browser (a minimal sketch of this wiring follows the list).
- Choose your mode:
  - Live Audio: use your microphone to generate images from live music.
  - File Audio: upload an audio file for processing.
- Interact:
  - View generated captions and images in real time.
  - Adjust image width/height and use example prompts.
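
For orientation, here is a minimal sketch of how such a Gradio front end could be wired. The function music_to_image, its body, and the slider ranges are illustrative placeholders, not the actual code in app.py.

```python
import gradio as gr

def music_to_image(audio_path, width, height):
    # Placeholder: the real app would run the captioning and
    # image-generation pipelines and return (caption, PIL image).
    caption = f"caption for {audio_path} at {width}x{height}"
    return caption, None

demo = gr.Interface(
    fn=music_to_image,
    inputs=[
        # Gradio 4.x: `sources` enables both live microphone and file upload
        gr.Audio(sources=["microphone", "upload"], type="filepath", label="Music input"),
        gr.Slider(256, 1024, value=512, step=64, label="Image width"),
        gr.Slider(256, 1024, value=512, step=64, label="Image height"),
    ],
    outputs=[gr.Textbox(label="Generated caption"), gr.Image(label="Generated image")],
)

if __name__ == "__main__":
    demo.launch()
```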
How It Works:
- Audio Preprocessing: Audio is captured (live or from a file) and preprocessed using utilities in utils/audio_utils.py (a preprocessing sketch follows this list).
- Music Captioning: The audio is passed to a music captioning model (model/music2txt.py), which uses a BART-based architecture to generate descriptive text.
- Image Generation: The caption is fed into a Stable Diffusion XL Turbo pipeline (model/txt2img.py) to generate an image (a generic diffusers sketch also follows below).
- Web Interface: The Gradio app (app.py) ties these stages together in a user-friendly interface for real-time interaction.
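
As a rough illustration of the first two stages, the sketch below loads and resamples audio with librosa and frames it for a captioning model. The Music2Txt call at the end is a hypothetical stand-in for the actual interface in model/music2txt.py, and the 16 kHz / 10-second window follows the LP-MusicCaps convention rather than anything confirmed in this repository.

```python
import librosa
import numpy as np

TARGET_SR = 16_000   # assumption: LP-MusicCaps-style captioners expect 16 kHz mono
CHUNK_SECONDS = 10   # assumption: caption one fixed-length window at a time

def preprocess(path: str) -> np.ndarray:
    """Load audio, downmix to mono, resample, and trim or zero-pad
    to a fixed-length window suitable for the captioning model."""
    waveform, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    target_len = TARGET_SR * CHUNK_SECONDS
    if len(waveform) >= target_len:
        return waveform[:target_len]
    return np.pad(waveform, (0, target_len - len(waveform)))

# Hypothetical hand-off to the captioning stage:
# caption = Music2Txt().caption(preprocess("song.wav"))
```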
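The image-generation stage can be approximated with the public diffusers API for SDXL Turbo; this is a generic sketch of that pipeline, not the exact configuration in model/txt2img.py.

```python
import torch
from diffusers import AutoPipelineForText2Image

# SDXL Turbo is distilled to work with very few denoising steps,
# which is what makes near-real-time generation feasible.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe.to("cuda")

caption = "energetic electronic track with driving synths"  # example caption
image = pipe(
    prompt=caption,
    num_inference_steps=1,  # the Turbo variant is designed for a single step
    guidance_scale=0.0,     # guidance is disabled for SDXL Turbo
    width=512,
    height=512,
).images[0]
image.save("output.png")
```

Running at a single inference step with guidance disabled trades some fidelity for latency, which matches the project's real-time goal.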
Project Structure:
musyn/
├── app.py # Main Gradio web application
├── config.py # UI and model configuration
├── setup.py # Python package setup and dependencies
├── start.sh # Installation and model download script
├── utils/
│ └── audio_utils.py # Audio loading and preprocessing utilities
├── model/
│ ├── bart.py # BART-based captioning model definition
│ ├── modules.py # Audio encoder and feature extraction modules
│ ├── music2txt.py # Music-to-text (captioning) pipeline
│   ├── txt2img.py # Text-to-image (Stable Diffusion XL Turbo) pipeline
│   └── models/ # Downloaded model weights (auto-created)
├── LICENSE
└── README.md
References:
- Exploring Real-Time Music-to-Image Systems for Creative Inspiration in Music Creation
- LP-MusicCaps: LLM-Based Pseudo Music Captioning
- ArtSpew
- SDXLTurbo
- Ultimate guide to optimizing Stable Diffusion XL
- StreamDiffusion
Roadmap:
- Generate images in real time.
- Add an option for 5-second audio input.
- Improve real-time image generation.
- Publish a demo on a Hugging Face Space.
License:
GNU GPLv3