captioning + dataset preparation + inference + improvements #34
base: main
Conversation
@a-r-r-o-w this is ready for reviews.
awesome, testing now!
Good to take note of by the way: https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/docs/Report-v1.3.0.md#captioning
I get the following error when launching. Any idea why?
stacktrace
It's a permission denied error, what can I do to sort your permissions? :P
I can't lift it either because I don't have
I'm looking through the vLLM docs to see if they have an environment variable I could configure to use a different cache dir, but if not then I will ask in infra. Thanks!
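For reference, a minimal sketch of redirecting the weight cache, assuming the script builds a vLLM `LLM` directly; the paths and model id below are placeholders, not what this PR actually uses. `HF_HOME` is the standard Hugging Face cache variable and `download_dir` is vLLM's engine argument for where weights are downloaded.

```python
import os

# Redirect the Hugging Face hub cache before vllm / huggingface_hub are imported.
# The path is a placeholder for any directory the job user can write to.
os.environ["HF_HOME"] = "/scratch/hf_cache"

from vllm import LLM

# `download_dir` overrides where vLLM stores downloaded weights; the model id
# is only an example of a captioning model, not necessarily the PR's default.
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    download_dir="/scratch/vllm_weights",
)
```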
Seems like adding a
I will try a few models to see what works best as the default. I personally preferred the outputs of MiniCPM a lot, but will also give Qwen 7B a try. Currently, getting descriptions like:
Does not seem to be respecting the 120-token limit set when launching 🤔
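Not sure how the 120-token limit is wired up in the script, but if it is meant to map onto vLLM sampling it would normally be passed per request via `SamplingParams`; a minimal sketch with a placeholder prompt and model id:

```python
from vllm import LLM, SamplingParams

# max_tokens caps the number of generated tokens per caption; it has to be
# passed to every generate() call, otherwise vLLM falls back to its defaults.
sampling_params = SamplingParams(temperature=0.2, max_tokens=120)

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")  # placeholder model id
outputs = llm.generate(["Describe the video in one detailed paragraph."], sampling_params)
print(outputs[0].outputs[0].text)
```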
Take note of the following from #34 (comment)
That is likely an issue for
Configuration-wise, we can experiment, but I expected code-related comments as the first set of comments.
Known gotchas:
- Adjust `recaption.py` as needed to suit your needs.
- `limit_mm_per_prompt` needs adjustment based on the model being selected; see the sketch below.
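A sketch of the `limit_mm_per_prompt` gotcha, assuming video frames are passed to the model as individual images; the model id and frame count are placeholders:

```python
from vllm import LLM

# limit_mm_per_prompt bounds the number of multimodal items per request.
# If 8 frames per video are passed as images, the budget must allow 8.
llm = LLM(
    model="openbmb/MiniCPM-V-2_6",   # example captioning model, needs trust_remote_code
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 8},
)
```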
After I ran `launch.sh`, I got: