Description
When evaluation runs during train or val, I hit the following bug:
12 Feb 08:32 INFO Text Item num: 96282
12 Feb 08:32 INFO Inference item_data with item_batch_size = 80 len(item_loader) = 1204
0%|
...
12 Feb 08:32 INFO not in self.env
0%| | 0/1204 [00:00<?, ?it/s]thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-core-1.12.1/src/registry.rs:168:10:
The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
[the same panic is printed three more times, once per failed worker]
Process Process-14:
Process Process-12:
Process Process-8:
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/HLLM/code/REC/data/dataset/batchset.py", line 68, in __getitem__
    ids, _ = process_item(item_token_i)
  File "/home/HLLM/code/REC/data/dataset/batchset.py", line 58, in process_item
    ids = self.tokenizer.encode(text_str)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2654, in encode
    encoded_inputs = self.encode_plus(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3062, in encode_plus
    return self._encode_plus(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 583, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 511, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
[the other three workers fail with the identical traceback]
/home/HLLM/code/REC/data/dataset/collate_fn.py:41: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = elem.storage()._new_shared(numel)
[the warning above is printed ten times in total]
0%| | 0/1204 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/HLLM/code/run.py", line 141, in <module>
[rank0]: run_loop(local_rank=local_rank, config_file=config_file, extra_args=extra_args)
[rank0]: File "/home/HLLM/code/run.py", line 108, in run_loop
[rank0]: test_result = trainer.evaluate(test_loader, load_best_model=False, show_progress=config['show_progress'], init_model=True)
[rank0]: File "/home/HLLM/code/REC/trainer/trainer.py", line 480, in evaluate
[rank0]: self.compute_item_feature(self.config, eval_data.dataset.dataload)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/trainer/trainer.py", line 429, in compute_item_feature
[rank0]: items = self.model(items, mode='compute_item')
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 141, in forward
[rank0]: output = self._forward_module(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/hllm.py", line 186, in forward
[rank0]: return self.compute_item(interaction)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/hllm.py", line 233, in compute_item
[rank0]: pos_embedding = self.forward_item_emb(pos_input_ids, pos_position_ids, pos_cu_input_lens, self.item_emb_token_n, self.item_emb_tokens, self.item_llm)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/hllm.py", line 160, in forward_item_emb
[rank0]: model_out = llm(inputs_embeds=inputs_embeds.unsqueeze(0), cu_input_lens=cu_input_lens, position_ids=position_ids.unsqueeze(0))
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/modeling_llama.py", line 1217, in forward
[rank0]: outputs = self.model(
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/modeling_llama.py", line 1089, in forward
[rank0]: layer_outputs = decoder_layer(
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/modeling_llama.py", line 768, in forward
[rank0]: hidden_states = self.input_layernorm(hidden_states)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/HLLM/code/REC/model/HLLM/modeling_llama.py", line 278, in forward
[rank0]: variance = hidden_states.pow(2).mean(-1, keepdim=True)
[rank0]: File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
[rank0]: _error_if_any_worker_fails()
[rank0]: RuntimeError: DataLoader worker (pid 1233) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
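For reference, the panic seems to come from the Rust backend of the HuggingFace fast tokenizer: each forked DataLoader worker tries to build rayon's global thread pool and fails with EAGAIN ("Resource temporarily unavailable"), which usually points at a thread/process limit in the container. A minimal sketch of the usual workaround, assuming the panic is triggered by tokenizer parallelism inside the forked workers (TOKENIZERS_PARALLELISM is the documented switch in huggingface/tokenizers; placing it at the top of the entry script is my assumption):

# Sketch: disable the fast tokenizer's internal rayon parallelism before any
# DataLoader workers are forked. Set this at the very top of the entry script
# (e.g. run.py), before the tokenizer is first used.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # read by the tokenizers Rust backend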
Train command:
cd code && python3 main.py \
    --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml \
    --loss nce \
    --epochs 5 \
    --dataset Pixel200K \
    --train_batch_size 8 \
    --MAX_TEXT_LENGTH 256 \
    --MAX_ITEM_LIST_LENGTH 10 \
    --checkpoint_dir saved_path \
    --optim_args.learning_rate 1e-4 \
    --item_pretrain_dir TinyLlama-1.1B-Chat-v1.0 \
    --user_pretrain_dir TinyLlama-1.1B-Chat-v1.0 \
    --text_path "../information" \
    --text_keys '["title","tag","description"]'
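As the final RuntimeError suggests, rerunning with num_workers=0 should give a cleaner trace. A quick way to check whether the tokenizer itself is the culprit is to exercise the same encode path in the main process, with no worker fork. A rough sketch, assuming the dataset's process_item boils down to tokenizer.encode and using the local model directory from the command above; the text string is a stand-in:

# Sketch: reproduce the failing call in the main process (equivalent to
# num_workers=0), mirroring batchset.py's process_item.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama-1.1B-Chat-v1.0")
ids = tokenizer.encode("example item title")  # same call that panics in the workers
print(ids[:10])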