When I applied VisionZip to qwen2.5-vl-7b and compared it against the original implementation, I found that although the number of visual tokens decreased, inference was no faster and GPU memory usage did not drop. In fact, GPU memory usage actually increased beyond the uncompressed baseline. The main cause is the following code.
# For the last vision block, the modified forward also returns the attention
# logits (and keys) so that the visual tokens can be ranked for pruning.
if layer_num == len_blocks - 1:
    hidden_states, logits, attn_key = blk(
        hidden_states,
        cu_seqlens=cu_seqlens_now,
        position_embeddings=position_embeddings,
        return_logits=True,
    )
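One way to see why memory grows: asking the block to return full attention logits forces it to materialize a tensor that is quadratic in the (pre-pruning) visual token count, which memory-efficient attention kernels normally never store. A minimal back-of-the-envelope sketch (the head count, token count, and fp16 element size here are illustrative assumptions, not values read from the qwen2.5-vl-7b config):

```python
def attn_logits_bytes(num_tokens: int, num_heads: int = 16, dtype_bytes: int = 2) -> int:
    # An explicit attention-logits tensor has shape [num_heads, num_tokens, num_tokens],
    # so its size is quadratic in the number of visual tokens BEFORE pruning.
    return num_heads * num_tokens * num_tokens * dtype_bytes

# e.g. a high-resolution image producing ~5000 patch tokens (assumed figure):
extra = attn_logits_bytes(5000)
print(f"{extra / 2**20:.0f} MiB")  # prints "763 MiB"
```

This extra allocation happens before any tokens are dropped, which is consistent with peak memory rising even though the later language-model stage sees fewer visual tokens.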