Movinet fails in GPU mode #1122

Open
bugsyb opened this issue Apr 7, 2024 · 2 comments
Labels
bug (Something isn't working), priority: normal

Comments


bugsyb commented Apr 7, 2024

Which version of recognize are you using?

6.11

Enabled Modes

Object recognition, Face recognition, Video recognition, Music recognition

TensorFlow mode

GPU mode

Downstream App

Memories App

Which Nextcloud version do you have installed?

28.0.4.1

Which Operating system do you have installed?

Ubuntu 20.04.4

Which database are you running Nextcloud on?

Postgres

Which Docker container are you using to run Nextcloud? (if applicable)

28.0.4.1

How much RAM does your server have?

32 GB

What processor Architecture does your CPU have?

x86_64

Describe the Bug

After upgrading to NC 28.0.4.1 & Recognize 6.1.1, it started to report the error below.
Judging by nvidia-smi, it does launch a process which stays there, but it doesn't seem to put any load on the GPU.

Classifier process output: Error: Session fail to run with error: 2 root error(s) found.
  (0) NOT_FOUND: could not find registered platform with id: 0x7fd379c7fae4
\t [[{{node movinet_classifier/movinet/stem/stem/conv3d/StatefulPartitionedCall}}]]
\t [[StatefulPartitionedCall/_1555]]
  (1) NOT_FOUND: could not find registered platform with id: 0x7fd379c7fae4
\t [[{{node movinet_classifier/movinet/stem/stem/conv3d/StatefulPartitionedCall}}]]
0 successful operations.
0 derived errors ignored.
    at NodeJSKernelBackend.runSavedModel (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-node-gpu/dist/nodejs_kernel_backend.js:461:43)
    at TFSavedModel.predict (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-node-gpu/dist/saved_model.js:341:43)
    at MovinetModel.predict (/var/www/html/custom_apps/recognize/src/movinet/MovinetModel.js:46:21)
    at /var/www/html/custom_apps/recognize/src/movinet/MovinetModel.js:95:24
    at /var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-core/dist/tf-core.node.js:4559:22
    at Engine.scopedRun (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-core/dist/tf-core.node.js:4569:23)
    at Engine.tidy (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-core/dist/tf-core.node.js:4558:21)
    at Object.tidy (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-core/dist/tf-core.node.js:8291:19)
    at MovinetModel.inference (/var/www/html/custom_apps/recognize/src/movinet/MovinetModel.js:92:21)
    at runMicrotasks (<anonymous>)

At the same time, everything that's needed seems to be in place (by the way, this error wasn't raised prior to the upgrades):

./bin/node src/test_gputensorflow.js 
2024-04-07 21:37:06.584377: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-07 21:37:06.593729: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:06.636405: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:06.636674: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:07.109813: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:07.110064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:07.110196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:07.110354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3393 MB memory:  -> device: 0, name: Quadro M1200, pci bus id: 0000:01:00.0, compute capability: 5.0
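
So the binding itself can see the GPU (Quadro M1200, compute capability 5.0). In case it helps with triage, a backend sanity check in the same spirit would look roughly like this (a sketch; the actual contents of src/test_gputensorflow.js may differ):

const tf = require('@tensorflow/tfjs-node-gpu')

async function main () {
    // Requiring tfjs-node-gpu registers the native 'tensorflow' backend;
    // running any op forces device initialization, which is when the
    // "Created device ... GPU:0" line above gets printed.
    const c = tf.matMul(tf.randomNormal([256, 256]), tf.randomNormal([256, 256]))
    await c.data()
    console.log('active backend:', tf.getBackend())
}

main().catch(console.error)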

And models seem to be there too:

du -shx ./models/*
50M	./models/efficientnet_lite4
794M	./models/efficientnetv2
22M	./models/landmarks_africa
41M	./models/landmarks_asia
41M	./models/landmarks_europe
41M	./models/landmarks_north_america
31M	./models/landmarks_oceania
41M	./models/landmarks_south_america
47M	./models/movinet-a3
31M	./models/musicnn

I'm unsure where to look for the issue.
It happens after ffmpeg finishes its job.

Thanks!

Expected Behavior

It would just proceed to classify.

To Reproduce

Unsure.

Debug log

No response

bugsyb added the bug (Something isn't working) label Apr 7, 2024

github-actions bot commented Apr 7, 2024

Hello 👋

Thank you for taking the time to open this issue with recognize. I know it's frustrating when software
causes problems. You have made the right choice to come here and open an issue to make sure your problem gets looked at
and if possible solved.
I try to answer all issues and if possible fix all bugs here, but it sometimes takes a while until I get to it.
Until then, please be patient.
Note also that GitHub is a place where people meet to make software better together. Nobody here is under any obligation
to help you, solve your problems or deliver on any expectations or demands you may have, but if enough people come together we can
collaborate to make this software better. For everyone.
Thus, if you can, you could also look at other issues to see whether you can help other people with your knowledge
and experience. If you have coding experience it would also be awesome if you could step up to dive into the code and
try to fix the odd bug yourself. Everyone will be thankful for extra helping hands!
One last word: If you feel, at any point, like you need to vent, this is not the place for it; you can go to the forum,
to twitter or somewhere else. But this is a technical issue tracker, so please make sure to
focus on the tech and keep your opinions to yourself. (Also see our Code of Conduct. Really.)

I look forward to working with you on this issue
Cheers 💙


bugsyb commented Apr 8, 2024

Could be related to:
#1060

In my case it is CUDA 11:

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0

marcelklehr changed the title from "Movinet fails" to "Movinet fails in GPU mode" May 8, 2024