Description
When evaluation runs during train or val, I hit the following bug:
12 Feb 08:32 INFO Text Item num: 96282
12 Feb 08:32 INFO Inference item_data with item_batch_size = 80 len(item_loader) = 1204
0%|
...
12 Feb 08:32 INFO not in self.env
0%| | 0/1204 [00:00<?, ?it/s]thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-core-1.12.1/src/registry.rs:168:10:
The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
[the same panic is printed three more times, once per failed worker]
Process Process-14:
Process Process-12:
Process Process-8:
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/HLLM/code/REC/data/dataset/batchset.py", line 68, in __getitem__
    ids, _ = process_item(item_token_i)
  File "/home/HLLM/code/REC/data/dataset/batchset.py", line 58, in process_item
    ids = self.tokenizer.encode(text_str)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2654, in encode
    encoded_inputs = self.encode_plus(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3062, in encode_plus
    return self._encode_plus(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 583, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 511, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
[the other three workers fail with the identical traceback]
/home/HLLM/code/REC/data/dataset/collate_fn.py:41: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = elem.storage()._new_shared(numel)
[the warning above is printed ten times in total]
0%| | 0/1204 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/HLLM/code/run.py", line 141, in <module>
[rank0]: run_loop(local_rank=local_rank, config_file=config_file, extra_args=extra_args)
[rank0]: File "/home/HLLM/code/run.py", line 108, in run_loop
[rank0]: test_result = trainer.evaluate(test_loader, load_best_model=False, show_progress=config['show_progress'], init_model=True)
[rank0]: File "/home/HLLM/code/REC/trainer/trainer.py", line 480, in evaluate
[rank0]: self.compute_item_feature(self.config, eval_data.dataset.dataload)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/trainer/trainer.py", line 429, in compute_item_feature
[rank0]: items = self.model(items, mode='compute_item')
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 141, in forward
[rank0]: output = self._forward_module(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/hllm.py", line 186, in forward
[rank0]: return self.compute_item(interaction)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/hllm.py", line 233, in compute_item
[rank0]: pos_embedding = self.forward_item_emb(pos_input_ids, pos_position_ids, pos_cu_input_lens, self.item_emb_token_n, self.item_emb_tokens, self.item_llm)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/hllm.py", line 160, in forward_item_emb
[rank0]: model_out = llm(inputs_embeds=inputs_embeds.unsqueeze(0), cu_input_lens=cu_input_lens, position_ids=position_ids.unsqueeze(0))
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/modeling_llama.py", line 1217, in forward
[rank0]: outputs = self.model(
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/modeling_llama.py", line 1089, in forward
[rank0]: layer_outputs = decoder_layer(
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/modeling_llama.py", line 768, in forward
[rank0]: hidden_states = self.input_layernorm(hidden_states)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/modeling_llama.py", line 278, in forward
[rank0]: variance = hidden_states.pow(2).mean(-1, keepdim=True)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
[rank0]: _error_if_any_worker_fails()
[rank0]: RuntimeError: DataLoader worker (pid 1233) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
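For reference, the panic seems to come from the Rust backend of the HuggingFace fast tokenizer: each forked DataLoader worker tries to build rayon's global thread pool and fails with EAGAIN ("Resource temporarily unavailable"), which usually points at a thread/process limit in the container. A minimal sketch of the usual workaround, assuming the panic is triggered by tokenizer parallelism inside the forked workers (TOKENIZERS_PARALLELISM is the documented switch in huggingface/tokenizers; placing it at the top of the entry script is my assumption):

# Sketch: disable the fast tokenizer's internal rayon parallelism before any
# DataLoader workers are forked. Set this at the very top of the entry script
# (e.g. run.py), before the tokenizer is first used.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # read by the tokenizers Rust backend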
Train command:
cd code && python3 main.py \
    --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml \
    --loss nce \
    --epochs 5 \
    --dataset Pixel200K \
    --train_batch_size 8 \
    --MAX_TEXT_LENGTH 256 \
    --MAX_ITEM_LIST_LENGTH 10 \
    --checkpoint_dir saved_path \
    --optim_args.learning_rate 1e-4 \
    --item_pretrain_dir TinyLlama-1.1B-Chat-v1.0 \
    --user_pretrain_dir TinyLlama-1.1B-Chat-v1.0 \
    --text_path "../information" \
    --text_keys '["title","tag","description"]'
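As the final RuntimeError suggests, rerunning with num_workers=0 should give a cleaner trace. A quick way to check whether the tokenizer itself is the culprit is to exercise the same encode path in the main process, with no worker fork. A rough sketch, assuming the dataset's process_item boils down to tokenizer.encode and using the local model directory from the command above; the text string is a stand-in:

# Sketch: reproduce the failing call in the main process (equivalent to
# num_workers=0), mirroring batchset.py's process_item.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama-1.1B-Chat-v1.0")
ids = tokenizer.encode("example item title")  # same call that panics in the workers
print(ids[:10])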