Training PaddleDetection PP-YOLOE_plus-S on Ascend 910B fails #2005

Open
1737686924 opened this issue Sep 18, 2024 · 1 comment
@1737686924

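Checklist:

  1. Searched past related issues for an answer
  2. Read through the FAQ and Q&A summary
  3. Confirmed that the bug is still unfixed in the latest version
  4. Read through the PaddleX API documentation

Problem description

PaddleX can validate a dataset to make sure its format meets PaddleX's requirements. During validation it can also analyze the dataset and report basic statistics about it.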

python main.py -c paddlex/configs/object_detection/PP-YOLOE_plus-S.yaml \
    -o Global.mode=check_dataset \
    -o Global.dataset_dir=./dataset/det_coco_examples

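The dataset check succeeded.

Reproduction

  1. Have you successfully run the tutorials we provide?

  2. Did you modify the code on top of the tutorial? If so, please provide the code you ran.
    python main.py -c paddlex/configs/object_detection/PP-YOLOE_plus-S.yaml \
        -o Global.mode=train \
        -o Global.dataset_dir=./dataset/det_coco_examples \
        -o Global.output=ppyolo_plus_s_output \
        -o Global.device="npu:0,1,2,3"

  3. Which dataset are you using?

  4. Please provide the error message and the related log.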
    ======================= Modified FLAGS detected =======================
    FLAGS(name='FLAGS_use_stride_kernel', current_value=False, default_value=True)
    =======================================================================
    I0918 23:46:05.051712 973133 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52457
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    W0918 23:46:24.526521 973133 dygraph_functions.cc:83150] got different data type, run type promotion automatically, this may cause data type been changed.


C++ Traceback (most recent call last):

0 egr::Backward(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool)
1 egr::RunBackward(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool, bool, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, bool, std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&)
2 Conv2dGradNodeFinal::operator()(paddle::small_vector<std::vector<paddle::Tensor, std::allocator<paddle::Tensor> >, 15u>&, bool, bool)
3 paddle::experimental::conv2d_grad(paddle::Tensor const&, paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::string const&, std::vector<int, std::allocator<int> > const&, int, std::string const&, paddle::Tensor*, paddle::Tensor*)
4 void custom_kernel::Conv2DGradKernel<float, phi::CustomContext>(phi::CustomContext const&, phi::DenseTensor const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::string const&, std::vector<int, std::allocator<int> > const&, int, std::string const&, phi::DenseTensor*, phi::DenseTensor*)
5 aclnnConvolutionBackward
6 InitL2Phase2Context(char const*, aclOpExecutor*)
7 GetOpExecCacheFromExecutor(aclOpExecutor*)


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: *** Aborted at 1726674458 (unix time) try "date -d @1726674458" if you are using GNU date ***]
LAUNCH INFO 2024-09-18 23:47:48,695 Exit code -11
[SignalInfo: *** SIGSEGV (@0xed94d) received by PID 973133 (TID 0xffffa00bae90) from PID 973133 ***]

Traceback (most recent call last):
  File "/work/workspace/PaddleX/paddlex/utils/result_saver.py", line 30, in wrap
    result = func(self, *args, **kwargs)
  File "/work/workspace/PaddleX/paddlex/engine.py", line 42, in run
    trainer.train()
  File "/work/workspace/PaddleX/paddlex/modules/base/trainer/trainer.py", line 61, in train
    train_result = self.pdx_model.train(**self.get_train_kwargs())
  File "/work/workspace/PaddleX/paddlex/repo_apis/PaddleDetection_api/object_det/model.py", line 109, in train
    return self.runner.train(
  File "/work/workspace/PaddleX/paddlex/repo_apis/PaddleDetection_api/object_det/runner.py", line 54, in train
    return self.run_cmd(
  File "/work/workspace/PaddleX/paddlex/repo_apis/base/runner.py", line 359, in run_cmd
    raise CalledProcessError(
paddlex.utils.errors.others.CalledProcessError: Command ['/usr/bin/python', '-m', 'paddle.distributed.launch', '--devices', '0,1,2,3', '--log_dir', '/work/workspace/PaddleX/ppyolo_plus_s_output/distributed_train_logs', 'tools/train.py', '--eval', '--config', '/root/.paddlex/tmp99soy5_c/detmodel_PP-YOLOE_plus-S.yml', '--use_vdl', 'True', '--vdl_log_dir', '/work/workspace/PaddleX/ppyolo_plus_s_output'] returned non-zero exit status 245.
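
A note for anyone triaging this log: the "Exit code -11" from the launcher and the "exit status 245" in the `CalledProcessError` most likely describe the same crash. The sketch below is only an illustration in standard-library Python, not part of PaddleX. It assumes the launcher passed -11 on as its own exit code, which the operating system truncates to one unsigned byte.

```python
import signal

# Values copied from the log above.
launch_exit_code = -11   # "LAUNCH INFO ... Exit code -11"
reported_status = 245    # "returned non-zero exit status 245"

# By the subprocess convention, a return code of -N means the child
# process was terminated by signal N.
sig = signal.Signals(-launch_exit_code)

# An exit code is reported to the parent as one unsigned byte,
# so -11 wraps around to 256 - 11.
wrapped = launch_exit_code & 0xFF

print(sig.name, wrapped, wrapped == reported_status)
```

Under that assumption both numbers point at the SIGSEGV raised inside `aclnnConvolutionBackward`, so the Python traceback is a consequence of the C++ crash rather than a second, independent error.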

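Environment

  1. Please provide the versions of PaddlePaddle and PaddleX you are using
    3.0-beta

  2. Please provide your operating system information, e.g. Linux/Windows/MacOS

  3. Which Python version are you using?

  4. Which CUDA/cuDNN version are you using?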

@a31413510
Contributor

Are the image and the paddle package you are using the ones provided in the documentation?
https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/INSTALL_OTHER_DEVICES.md
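
One way to gather the details asked for here and in the environment checklist is a short standard-library script such as the sketch below. The helper name `collect_env` is hypothetical, and the distribution names `paddlepaddle`, `paddlex`, and `paddle-custom-npu` are assumptions about how the packages were installed. Adjust them to match the actual environment.

```python
import platform
import sys
from importlib import metadata


def collect_env(packages=("paddlepaddle", "paddlex", "paddle-custom-npu")):
    """Return Python, OS, and package version details as a dict.

    The default package names are assumptions, not confirmed by the issue.
    """
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is absent or installed under another name.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Pasting the output into the thread would fill in the operating system and Python version fields left empty above. It does not identify the container image, which still has to be reported separately.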
