I think this is caused by us filling in zeros for the audio samples when we miss them. We should instead fill the gap with a smooth interpolation down to 0.
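
Roughly what I have in mind, as a minimal sketch rather than a drop-in patch: ramp from the last sample we did receive down to 0 over a short window, then hold at 0 for the rest of the gap. The function name `fade_to_zero`, the `fade_len` parameter and its default of 64 samples, and the raised-cosine curve are all assumptions for illustration, not anything that exists in our code today.

```python
import math


def fade_to_zero(last_sample, gap_len, fade_len=64):
    """Return gap_len fill samples that ramp from last_sample down to 0.

    Hypothetical helper: fade_len (in samples) is an assumed tuning knob.
    """
    # Never fade for longer than the gap itself.
    n = min(fade_len, gap_len)
    out = []
    for i in range(gap_len):
        if i < n:
            # Raised-cosine gain that decreases with i and reaches 0 at i = n - 1.
            gain = 0.5 * (1.0 + math.cos(math.pi * (i + 1) / n))
            out.append(last_sample * gain)
        else:
            # Past the fade window the fill is plain silence.
            out.append(0.0)
    return out


# Example: a 10-sample gap after a last good sample of 0.8, faded over 4 samples.
fill = fade_to_zero(0.8, 10, fade_len=4)
```

The same ramp could be mirrored to fade back in when samples resume, but that part is left out here.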