python deduplicate_dataset.py #12

Open
simplew2011 opened this issue Feb 22, 2024 · 1 comment

Comments

@simplew2011

https://github.com/huggingface/cosmopedia/blob/main/deduplication/deduplicate_dataset.py

2024-02-22 14:17:57.759 | INFO     | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh3"
2024-02-22 14:17:57.759 | INFO     | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh2"
2024-02-22 14:17:57.759 | INFO     | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh1"
2024-02-22 14:17:57.763 | INFO     | datatrove.executor.slurm:launch_job:249 - Launching Slurm job mh1 (120 tasks) with launch script "/home/wzp/code/LLMData/open_source/datatrove/data/minhash_logs/signatures/launch_script.slurm"
Traceback (most recent call last):
  File "/home/wzp/code/LLMData/open_source/datatrove/demo.py", line 110, in <module>
    stage4.run()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 169, in run
    self.launch_job()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 217, in launch_job
    self.depends.launch_job()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 217, in launch_job
    self.depends.launch_job()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 217, in launch_job
    self.depends.launch_job()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 262, in launch_job
    self.job_id = launch_slurm_job(launch_file_contents, *args)
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 349, in launch_slurm_job
    return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
  File "/home/wzp/anaconda3/envs/3.10/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/wzp/anaconda3/envs/3.10/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/wzp/anaconda3/envs/3.10/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/wzp/anaconda3/envs/3.10/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
@loubnabnl
Collaborator

Can you provide more details about your setup? For example, did you run it on a Slurm cluster?
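
A quick way to answer that from the machine where the script was launched: the traceback stops inside `subprocess` at the `sbatch` call, and the final error line is missing from the paste, so the cause is not confirmed. Assuming the problem is that `sbatch` cannot be executed there, a minimal check could look like the sketch below. The helper name `find_slurm_submit_command` is hypothetical and not part of datatrove.

```python
import shutil


def find_slurm_submit_command(command="sbatch"):
    """Return the full path of `command` if it is on PATH, otherwise None.

    Hypothetical helper for illustration only; it is not a datatrove function.
    """
    return shutil.which(command)


if __name__ == "__main__":
    path = find_slurm_submit_command()
    if path is None:
        # Without sbatch on PATH, subprocess cannot start the submit command.
        print("sbatch was not found on PATH; this machine may not be a Slurm login node.")
    else:
        print(f"sbatch found at {path}")
```

If this reports that `sbatch` is missing, sharing that result together with the last line of the traceback would help narrow things down.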
