Sometimes assigning a new temperature shows me "Unknown Error" #23
Hmm, yeah, it doesn't look like it's a problem with the script itself, but I can't be sure. You can try setting the DISPLAY variable via the script directly. If that doesn't work, let me see the output of the script when you run the above command with the extra log option. Let me know how that goes.
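A minimal sketch of the kind of invocation being suggested here, assuming the script's `-d` (display) and `-l` (log) options that appear later in this thread; the display number and script path are placeholders:

```bash
# Run the fan-curve script with an explicit display and logging enabled.
# ":0" and the path to temp.sh are assumptions; adjust to your setup.
DISPLAY=:0 ./temp.sh -d ":0" -l
```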
My patch from the first post didn't work :( Log without setting the display variable:
Log with the display variable set:
Yeah, that's definitely not an issue with my script. What kind of configuration are you working with? I notice that you're connecting remotely; if you're using some sort of headless setup, I'm not sure whether nvidia-settings supports such a configuration, although I see you have GDK messages in the log, so I can't say for certain. I guess the relevant part is the log line that says: ...

There are ways to force a display to be detected (even when there isn't a physical display attached), but that's outside my field of expertise, aside from just connecting an old display or getting a fake display adapter. The strange part of this is that it works sometimes; I guess that's only when you have a remote connection to the client active. Perhaps it has something to do with whatever power-saving settings you have? When the script hits the error at line 116, saying ...
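Forcing a display to exist on a headless box usually means telling the NVIDIA X driver to start without a connected monitor. A minimal sketch, assuming an Ubuntu-style setup and that fan control (Coolbits) is wanted too; the exact flags, Coolbits value, and display manager are assumptions:

```bash
# Generate an xorg.conf that lets X start with no monitor attached
# and enables the Coolbits option needed for manual fan control.
sudo nvidia-xconfig --allow-empty-initial-configuration --enable-all-gpus --cool-bits=4

# Restart the display manager so the new configuration takes effect
# (gdm is assumed here; use whichever display manager you actually run).
sudo systemctl restart gdm
```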
Yes, I use this server without a display, and unfortunately there is no way to connect anything to it. With logging and with the display specified in the command it has clearly become better, but problems still arise sometimes. Maybe you know of some 100%-reliable solution so that such errors are avoided and fan management doesn't stop? The thing is, I use your project on a server for training neural networks, and I know a lot of people who do the same but just set the fan speed to maximum, although that is a dubious approach. I think they could also run into this problem if they use your solution...

PS: This problem occurs even when I am disconnected from the remote server, and I'm not sure the problem is with the power-saving settings; the rest of the processes are working properly.
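For reference, "just setting the fans to maximum" is typically a one-off nvidia-settings call rather than a curve. A hedged sketch; the GPU and fan indices are assumptions and multi-GPU machines need one pair of attributes per fan:

```bash
# Take manual control of the GPU fan and pin it at 100%.
# [gpu:0] and [fan:0] are assumptions for a single-GPU system.
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=100"
```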
I found something that may help, so please try this and get back to me with the results. If it works I'll add it to my script so it happens automatically. Try it together with your original first patch: ...

If not, have a look at this: ...

Otherwise, I may need to know more about your setup to help, if there even is a solution. It does seem that it is possible to run nvidia-settings without a connected display, though; it just depends on what programs you're using.
The line now looks like: ...

My setup is: i3-9100F (no integrated graphics) + ASUS Prime Z390-A + NVIDIA GTX 1080 Ti, running Ubuntu 18.04 with NVIDIA driver version 430.5. I run nvidia-docker with the container ...
I have not found a perfect solution, maybe you have something? I am now launching this wrapper:

```bash
#!/bin/bash
# Kill any already-running copy of the fan script before starting a new one.
PID=$(ps aux | grep nfancurve/temp.sh | grep -v grep | awk '{print $2}')
if [ ! -z "${PID}" ]
then
    kill -9 ${PID}
    sleep 5
fi
# Start the script with an explicit display and X authority file.
DISPLAY=:0 XAUTHORITY=/run/user/120/gdm/Xauthority sh /home/fullusr/nfancurve/temp.sh -d ":0" -l
```

But it seems to me that this is a bad solution.
The fact that this works for you is quite interesting. I could make a "careful" mode, where the script makes sure everything is still set to what it was at the beginning of the program whenever it wants to change the fan speed, if you like. That may work, but it would require some back-and-forth testing between us. :)

Edit: In fact, doing some reading about the nvidia-docker program, there may be something to alter. I'm reading through the documentation now.

Edit 2: Have you tried running the script from the docker command? I don't know if this is a problem you're aware of, but: https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#why-is-nvidia-smi-inside-the-container-not-listing-the-running-processes
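A rough sketch of what such a "careful" mode could look like: re-checking that manual fan control is still enabled before every speed change, and re-enabling it if the driver dropped it. This is only an illustration of the idea, not the script's actual code; the attribute names follow common nvidia-settings usage and [gpu:0]/[fan:0] are assumptions:

```bash
#!/bin/bash
# Hypothetical "careful" fan update: verify the control state before each change.
set_fan_speed() {
    local speed="$1"
    # Query the current fan-control state (1 = manual control).
    local state
    state=$(nvidia-settings -q "[gpu:0]/GPUFanControlState" -t 2>/dev/null)
    if [ "${state}" != "1" ]; then
        # The driver (or a GPU reset) dropped manual control; re-enable it.
        nvidia-settings -a "[gpu:0]/GPUFanControlState=1" >/dev/null
    fi
    # Apply the new target speed; warn instead of aborting on a one-off failure.
    nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=${speed}" >/dev/null || \
        echo "warning: failed to set fan speed to ${speed}%" >&2
}
```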
That would be great and interesting!
I did not know about that; it all looks strange...
I tested:

```bash
docker run --gpus all -v /home/fullusr/nfancurve/:/nf nvidia/cuda:10.0-cudnn7-devel sh /nf/temp.sh -d ":0" -l
```

and got:

```
Configuration file: /nf/config
/nf/temp.sh: 75: /nf/temp.sh: nvidia-settings: not found
No Fans detected
```

Adding ... But running ...
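The `nvidia-settings: not found` error is simply because the nvidia/cuda image does not ship the nvidia-settings binary, and even with it installed the container would still need access to the host's X server. A hedged, untested sketch of what that might involve; the package name, socket paths, and display are assumptions:

```bash
# Give the container the host X socket and display, and install nvidia-settings inside it.
docker run --gpus all \
    -e DISPLAY=:0 \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    -v /home/fullusr/nfancurve/:/nf \
    nvidia/cuda:10.0-cudnn7-devel \
    bash -c "apt-get update && apt-get install -y nvidia-settings && sh /nf/temp.sh -d ':0' -l"
```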
Hmm, I was reading through some of the NVIDIA documentation; could the errors when you run my script come from when whatever training you're doing finishes and the trainer resets the GPU, or something like that? Reading through our conversation a few times over, it seems like after you get an error, most of the time the script continues on and works fine until the next error. Is that correct? If so, I will work on a patch that tries to prevent the script from failing when you experience a problem with the GPU, as you've described.

At the beginning you mention that you know people who just set the fan speed to maximum when doing any training. Do you know how they do that? Is it a similar method to what my script uses (like ...)? This information should help when I make the patch.

Hope you're not too affected by the virus going around, though! :)
The task in crontab does not work properly, unfortunately. When I try to start the task manually, I see: ...

To be honest, I don't know for sure, but I do reset the TensorFlow session periodically; I try to avoid frequent video memory leaks that way (although the leaks themselves cannot be fixed).

That is how it used to be, and how it happened now as well (I wrote about it at the very beginning).

Yes, here are the instructions from them: ...

I'm okay, thanks! I hope everything is calm with you too...
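Cron jobs run without the X session's environment, which is a common reason this kind of task "does not work properly" from crontab. A hedged sketch of a crontab entry that supplies DISPLAY and XAUTHORITY explicitly (paths copied from the wrapper script above; adjust to your system):

```bash
# Example entry for `crontab -e`: start the fan script at boot with the
# X environment variables that cron does not provide on its own.
@reboot sleep 30 && DISPLAY=:0 XAUTHORITY=/run/user/120/gdm/Xauthority /bin/sh /home/fullusr/nfancurve/temp.sh -d ":0" -l
```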
What do you mean by manually starting the task? Just in the shell? I see that they use ... I'll see if I can do some stuff to the script.
Yes :)
I run it from ...

PS: Regarding crashes of the graphical shell (xorg/gdm): I think I should test automatically relaunching this service when it goes down (for example, using ...).
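One hedged way to auto-relaunch the display manager if it dies is a systemd drop-in that tells the unit to restart on failure. This assumes gdm is managed by systemd, as on Ubuntu 18.04; the unit may be named gdm or gdm3 depending on the distribution:

```bash
# Create a drop-in override for the display manager unit.
sudo systemctl edit gdm
# In the editor that opens, add:
#   [Service]
#   Restart=on-failure
#   RestartSec=5
# Then reload systemd so the override takes effect.
sudo systemctl daemon-reload
```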
Sorry I've been quite busy; I assume you've tried running it without root? I'll let you know when I get the patch to try out :)
I run ...

and get: ...

But after one or several hours (completely at random) I get an error: ...

I don't understand whether this is a bug or whether my configuration is wrong.

Now I'm trying a simple patch (maybe it will be useful for you or anyone else):
I changed: https://github.com/nan0s7/nfancurve/blob/v019.2/temp.sh#L84-L86
to:
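The actual replacement the reporter used is not shown above. Purely as an illustration, a patch in this spirit, tolerating a failed nvidia-settings assignment instead of treating one "Unknown Error" as fatal, might look something like this; the variable name and retry counts are assumptions, not the project's real code:

```bash
# Hypothetical tolerant assignment: retry a failed nvidia-settings call a few
# times instead of letting a single "Unknown Error" stop the script.
for attempt in 1 2 3; do
    if nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=${new_speed}" >/dev/null 2>&1; then
        break
    fi
    echo "nvidia-settings failed (attempt ${attempt}), retrying..." >&2
    sleep 2
done
```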