This part of the repository contains all the code to run and train the transformer-based end-to-end vision pipeline. The latest trained model is available on Hugging Face, which also provides a simple browser-based demo.
For more information on the usage of the scripts, please refer to the README.
To generate training data for the model, run the script `generate_dataset_full.sh` with a parameter N. This creates a dataset of N samples, saved in the `data`, `data_augmented`, `data_hw`, and `data_hw_augmented` folders. See the README for more information on the usage of the script.
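For example, to generate a dataset of 1000 samples (the sample count here is arbitrary, chosen only for illustration):

```bash
./generate_dataset_full.sh 1000
```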
This ROS node (the `vision_node`) is responsible for processing images from a camera source and recognizing the notes they contain using a pre-trained model. It converts the image data into a textual LilyPond representation and publishes it as ROS messages.
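The sketch below shows one way such a node can be structured, assuming ROS 1 (`rospy`). The topic names `camera/image_raw` and `recognized_notes` as well as the `recognize()` helper are hypothetical placeholders, not the repository's actual interface:

```python
# Minimal sketch of a vision node: subscribe to camera images, run the
# recognition model, publish the LilyPond text. Topic names are assumptions.
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image
from std_msgs.msg import String


class VisionNode:
    def __init__(self):
        self.bridge = CvBridge()
        # Publishes the textual LilyPond representation of recognized notes.
        self.pub = rospy.Publisher("recognized_notes", String, queue_size=10)
        # Subscribes to the camera image stream.
        rospy.Subscriber("camera/image_raw", Image, self.on_image)

    def on_image(self, msg):
        # Convert the ROS image message to an OpenCV array for the model.
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        lilypond = self.recognize(frame)
        self.pub.publish(String(data=lilypond))

    def recognize(self, frame):
        # Placeholder for inference with the pre-trained transformer model;
        # it would return a LilyPond note string such as "c'4 d'4 e'2".
        raise NotImplementedError


if __name__ == "__main__":
    rospy.init_node("vision_node")
    VisionNode()
    rospy.spin()
```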
This ROS node receives the recognized notes from the `vision_node` and generates visual representations of the musical notation. It uses LilyPond to typeset musical staff notation and publishes the resulting images as ROS messages for visualization.
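A minimal sketch of this node, assuming ROS 1 (`rospy`) and a `lilypond` binary on the PATH; the topic names are hypothetical placeholders, and the message payload is assumed to be note entry that can be wrapped in a braced music expression:

```python
# Minimal sketch of a notation node: receive LilyPond text, typeset it to a
# PNG with the lilypond CLI, publish the image. Topic names are assumptions.
import os
import subprocess
import tempfile

import cv2
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image
from std_msgs.msg import String


class NotationNode:
    def __init__(self):
        self.bridge = CvBridge()
        self.pub = rospy.Publisher("notation_image", Image, queue_size=1)
        rospy.Subscriber("recognized_notes", String, self.on_notes)

    def on_notes(self, msg):
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "score.ly")
            with open(src, "w") as f:
                # Wrap the received notes in a music expression.
                f.write("{ %s }\n" % msg.data)
            # Typeset the LilyPond source to score.png.
            subprocess.run(
                ["lilypond", "--png", "-dresolution=150",
                 "-o", os.path.join(tmp, "score"), src],
                check=True,
            )
            image = cv2.imread(os.path.join(tmp, "score.png"))
            self.pub.publish(self.bridge.cv2_to_imgmsg(image, encoding="bgr8"))


if __name__ == "__main__":
    rospy.init_node("notation_node")
    NotationNode()
    rospy.spin()
```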