For our NeurIPS 2021 paper, we first detected all candidate objects appearing in the visual scene, then manually filtered out the silent ones, keeping only the sounding ones. Hence, all objects annotated in the .json file are sounding objects.
Many thanks, Triantafyllos, for pointing this out!