When sound hits an object, it causes small vibrations on the object’s surface. Here we show how, using only high-speed video of the object, we can extract those minute vibrations and partially recover the sound that produced them, allowing us to turn everyday objects—a glass of water, a potted plant, a box of tissues, or a bag of chips—into visual microphones.
The original project was done by a team at MIT CSAIL. They captured high-speed video of a bag of chips vibrating in response to an audio clip of the song "Mary Had a Little Lamb". The video decomposition was done using a technique called Riesz pyramids. In our project we use the same videos provided on the MIT CSAIL website, but we use the 2D Dual-Tree Complex Wavelet Transform (DTℂWT) instead.
The videos can be downloaded from here
You can follow this link from YouTube, which has a concise explanation of how to set up Python.
- Clone the repo
git clone https://github.com/joeljose/Visual-Mic.git
- Navigate to "Visual-Mic" repo.
- Install all the Python modules listed in requirements.txt (you should be in the "Visual-Mic" repository when you execute this command):
pip install -r requirements.txt
- The video to be processed should be in the "Visual-Mic" repo and named "testvid.avi".
- Now you can run visualmic.py.
We use phase variations in the 2D DTℂWT representation of the video to compute local motion. The 2D DTℂWT breaks each frame of the video $V(x, y, t)$ into complex-valued sub-bands corresponding to different scales and orientations. Each scale $s$ and orientation $\theta$ is a complex image, which we can express in terms of amplitude $A$ and phase $\phi$ as $A(s,\theta,x,y,t)\,e^{i\phi(s,\theta,x,y,t)}$.
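As a minimal sketch of this decomposition step (assuming the open-source `dtcwt` Python package and a grayscale frame loaded as a NumPy array; the helper name here is illustrative, not necessarily what visualmic.py uses):

```python
import numpy as np
import dtcwt

_transform = dtcwt.Transform2d()

def decompose(frame, nlevels=4):
    """Decompose one grayscale video frame into complex DTCWT sub-bands.

    Returns a list with one entry per scale s; each entry is a complex
    array of shape (H_s, W_s, 6), one slice per orientation theta.
    Amplitude is A = np.abs(band) and phase is phi = np.angle(band).
    """
    pyramid = _transform.forward(frame.astype(np.float64), nlevels=nlevels)
    return list(pyramid.highpasses)
```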
Now, to compute the phase variation, we take the local phase $\phi$ at each frame and subtract from it the local phase of a reference frame $t_0$ (usually the first frame):
$$\phi_v(s,\theta,x,y,t) = \phi(s,\theta,x,y,t) - \phi(s,\theta,x,y,t_0)$$
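In code, this reference-frame subtraction could look roughly like the sketch below; multiplying by the conjugate of the reference sub-band and taking the angle gives the same difference, wrapped to $(-\pi, \pi]$:

```python
def phase_variation(band_t, band_t0):
    """Local phase variation phi_v for one scale/orientation sub-band.

    band_t and band_t0 are the complex sub-bands of the current frame
    and of the reference frame t0. The conjugate product keeps the
    phase difference wrapped to (-pi, pi].
    """
    return np.angle(band_t * np.conj(band_t0))
```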
We then compute a spatially weighted average of the local motion signals, for each scale $s$ and orientation $\theta$ in the 2D DTℂWT decomposition of the video, to produce a single motion signal $\Phi(s, \theta, t)$. Local motion signals in regions without much texture have ambiguous local phase information, so the motion signals in those regions are noisy. We therefore weight each local signal by the square of the amplitude, since amplitude gives a measure of texture strength:
$$\Phi(s,\theta,t) = \sum_{x,y} A(s,\theta,x,y,t)^2\,\phi_v(s,\theta,x,y,t)$$
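A minimal sketch of this weighted sum for a single sub-band, building on the hypothetical phase_variation helper above:

```python
def weighted_motion(band_t, band_t0):
    """Amplitude-squared weighted sum of local phase variations,
    giving one value Phi(s, theta, t) for this sub-band."""
    amplitude_sq = np.abs(band_t) ** 2
    return np.sum(amplitude_sq * phase_variation(band_t, band_t0))
```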
Our final global motion signal is obtained by summing $\Phi(s,\theta,t)$ over the different scales and orientations:
$$\hat{s}(t) = \sum_{s,\theta} \Phi(s,\theta,t)$$
We finally center this signal and scale it to the range $[-1, 1]$.
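Putting the pieces together, the whole recovery step could look roughly like this (reusing the decompose and weighted_motion sketches above; frames is assumed to be a sequence of grayscale frames from the high-speed video):

```python
def recover_sound(frames, nlevels=4):
    """Turn a sequence of video frames into a sound signal in [-1, 1]."""
    ref_bands = decompose(frames[0], nlevels)        # reference frame t0
    samples = []
    for frame in frames:
        bands = decompose(frame, nlevels)
        total = 0.0
        for band_t, band_t0 in zip(bands, ref_bands):    # scales s
            for k in range(band_t.shape[2]):             # orientations theta
                total += weighted_motion(band_t[:, :, k], band_t0[:, :, k])
        samples.append(total)
    signal = np.asarray(samples)
    signal = signal - signal.mean()                  # center around zero
    return signal / np.max(np.abs(signal))           # scale into [-1, 1]
```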
To denoise the output audio file we get from visualmic.py, we apply image-based morphological filtering to the audio spectrogram, and then reconstruct audio from the processed spectrogram. Denoising involves a lot of steps, so I've made it into a separate project. Here is the link to the project: https://github.com/joeljose/audio_denoising
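As a rough illustration of the spectrogram-morphology idea (only a sketch with SciPy defaults; the actual steps used are in the audio_denoising repo linked above):

```python
import numpy as np
from scipy import signal as sps
from scipy import ndimage

def denoise(audio, fs):
    """Suppress speckle noise in the magnitude spectrogram, then invert."""
    f, t, Z = sps.stft(audio, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    # Treat the magnitude spectrogram as an image: grey opening removes
    # small isolated blobs of noise energy while keeping harmonic ridges.
    opened = ndimage.grey_opening(mag, size=(3, 3))
    # Recombine with the original phase and transform back to a waveform.
    _, clean = sps.istft(opened * np.exp(1j * phase), fs=fs, nperseg=512)
    return clean
```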