Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unraid server lockup issue #111

Open
cwills75 opened this issue Aug 29, 2024 · 25 comments
Open

Unraid server lockup issue #111

cwills75 opened this issue Aug 29, 2024 · 25 comments

Comments

@cwills75
Copy link

Been having entire server lockups forcing me to hard-reset the Unraid server since installing this docker app. I'm running Unraid 6.12.6 along with Emby, FileFlows, Resilio-Sync, Unraid-Cloudflared-Tunnel and ZeroTier dockers.

It's happened three times now while working on files for Ownfoil. Ownfoil watch folder is a folder on main array. Lockups happen seemingly randomly, but in every instance it's when either copying a file to the watch folder or deleting a file within it WHILE the Ownfoil docker is running.

My only workaround is to stop the Ownfoil docker, move/delete whatever files I need to in the watch folder, then re-start Ownfoil and have it do a scan. I'm not sure if it's locking up while a file is being transferred/deleted AND Ownfoil is doing a regular scan or what's happening.

@a1ex4
Copy link
Owner

a1ex4 commented Aug 29, 2024

It is true that the identification process takes a bit of memory and lacks optimizations, how much RAM does your server have? It may very well be running out of memory.

@cwills75
Copy link
Author

It is true that the identification process takes a bit of memory and lacks optimizations, how much RAM does your server have? It may very well be running out of memory.

System has 32GB of RAM, using 54% with all dockers running.

@cwills75
Copy link
Author

cwills75 commented Sep 3, 2024

Would it be possible to maybe add an option in settings to not auto-scan (ever) until you can sort out the issue?

@a1ex4
Copy link
Owner

a1ex4 commented Sep 4, 2024

This is definitely an Unraid issue then, I have Ownfoil running 24/7 on a server with 4GB of RAM and never ran into a server lockup issue... I don't know the first thing about Unraid, or how it handles container orchestration, but as a preventive measure you could look into how to restrict the limit of memory the container can reach.

Do you have the last logs of the container before it becomes unresponsive so I can understand at what point it breaks?

@mzaki83
Copy link

mzaki83 commented Sep 4, 2024

Not on unraid, Debian here. Will try to get you logs.

@cwills75
Copy link
Author

cwills75 commented Sep 5, 2024

Going to try adding:

--memory=2G --cpus=1

to the docker and see if that makes a difference. If it crashes I'll see if I can pull a log, I don't see anything in the log file now.

@cwills75
Copy link
Author

cwills75 commented Sep 5, 2024

I think that may have fixed it, I've been trying to break it this morning and with the RAM/CPU limitations in place, everything has been fine. Thanks for the suggestion @a1ex4.

@mzaki83
Copy link

mzaki83 commented Sep 5, 2024

OK, will try limiting memory/cpu in my docker-compose and report back.

@mzaki83
Copy link

mzaki83 commented Sep 5, 2024

OK, will try limiting memory/cpu in my docker-compose and report back.

I may have spoken too soon. I am actually not sure how to do this in compose xD

@cwills75
Copy link
Author

cwills75 commented Sep 5, 2024

OK, will try limiting memory/cpu in my docker-compose and report back.

I may have spoken too soon. I am actually not sure how to do this in compose xD

I haven't used compose much in the past, but have used composerize with some success:

https://www.composerize.com/

On Unraid it's Edit Docker->Advanced View->Extrar Parameters field.

@mzaki83
Copy link

mzaki83 commented Sep 5, 2024

I've added

mem_limit: 2G

and

cpus: 1.0

to my docker-compose.yml.

I manually trigger a library refresh in ownfoil while watching resources in real time using "docker stats" and I see memory usage of the own foil container quickly reach the 2GB maximum and then plateau. I am still scanning my library (which is reasonably large) but have not seen any performance impact on the rest of my machine.

This stabilizes the situation for me and others running containerized where it's easy to limit access to system resources. Not sure if it's a suitable fix or how those running bare metal will mitigate this issue.

Thanks again for this software @a1ex4 !!

@jaybird2203
Copy link

jaybird2203 commented Sep 22, 2024

Summary - Added all details mentioned in post below, nothing stands out as cause. Will continue to monitor. One thing to note was I had to rebuild the dB after the lock up as the entries were not populating.

Running this for the first time with fresh pull on unraid 6.12.10 and had limited to 4G of ram (server this runs on has 128G) also just had my own server fully lock up. No ssh access etc. Server was running stable for over 120 days prior.

Sequence of events when the lock up happened are approximately as follows. Will update the thread with the docker details shortly.

  1. Ownfoil running with keys imported, had 6-7 titles showing on the web gui as a small sample size with the desired folder within the array
  2. Started to copy remainder of library (approx 50 titles plus updates) to the same library as the 6-7 titles were shown.
  3. Titles began to populate on web gui
  4. Began the "mover" process as files are moved to a cache drive first before being written to the array.
  5. Approximately 30 mins later the server fully locked up.

If logs are not overwritten I will also upload those as well - added snip of last entries.

Logs
image

Docker compose detail below.

docker run
-d
--name='ownfoil'
--net='bridge'
-e TZ="America/New_York"
-e HOST_OS="Unraid"
-e HOST_HOSTNAME="Tower"
-e HOST_CONTAINERNAME="ownfoil"
-e 'PUID'='99'
-e 'PGID'='100'
-l net.unraid.docker.managed=dockerman
-l net.unraid.docker.webui='http://[IP]:[PORT:8465]'
-l net.unraid.docker.icon='https://raw.githubusercontent.com/type0dev/Unraid-Template/main/ownfoil/icon.png'
-p '8465:8465/tcp'
-v '/mnt/user/Games/Switch/Games/':'/games':'rw'
-v '/mnt/cache/Dockers/Ownfoil':'/app/config':'rw'
--memory=4G 'a1ex4/ownfoil'
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
0796bcfc83b91489f8fe9e5a414131231db1f2e53c20b2208ab9b71c2e95ad73

The command finished successfully!

@orangewhipster
Copy link

I'm having issues with Ownfoil just simply stopping on Unraid.

It works completely fine, then I wont be able to bring up the page for it at all, look at Unraid and the server has "stopped". Re-start the server and its all OK again, until it randomly stops again.

Thoughts?

@jaybird2203
Copy link

jaybird2203 commented Oct 4, 2024

@orangewhipster Any chance the mover was running for the games folder location when this occurred? I've noticed the app hits a memory limit while I was running through the web looking for a few updates and copying files to the games folder.

@a1ex4 Any idea where we can try and pin point this memory scenario? For myself when the memory limit is hit Unraid shuts down the docker itself instead of crashing the full server now at least. This is still with 4G of ram defined. If it helps, the nsp size that it was last showing as attempting to identify was well over the 4G memory limit that was set. Edit: nsp size had no impact. Apologies for pinging direct, would like to get as much valuable info as possible to isolate.

Last entry in my logs show a sequence of events in the ~3 min leadup to the OOM event and the event happened ~1 min after the last entry below:

[2024-10-04 13:09:30.441] INFO (_internal) 192.168.2.64 - "GET /api/titles HTTP/1.1" 200 -
[2024-10-04 13:18:50.352] INFO (db) File '/games/XXXXXXXXXXXXXXXXXXX/XXXXXXX.xci' has been deleted.
[2024-10-04 13:18:50.431] DEBUG (db) No files were deleted. All files are present on disk.
[2024-10-04 13:18:50.432] INFO (library) Generating library ...
[2024-10-04 13:18:54.226] ERROR (titles) Title ID not found in titledb: XXXXXXXXXXXXXXXX
[2024-10-04 13:19:01.238] ERROR (titles) Title ID not found in titledb: XXXXXXXXXXXXXXXX
[2024-10-04 13:19:01.708] INFO (library) Generating library done.
[2024-10-04 13:21:10.136] INFO (library) Scanning library path /games ...
[2024-10-04 13:21:10.475] INFO (library) Identifiying file (1/2): /XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.nsp
[2024-10-04 13:21:10.825] INFO (library) Identifiying file (1/2): /XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.nsp OK Title ID: XXXXXXXXXXXXXXXX App ID : XXXXXXXXXXXXXXXX Title Type: UPDATE Version: XXXXXX
[2024-10-04 13:21:10.837] INFO (library) Identifiying file (2/2): /XXXXXXXXXXXXXXXX/XXXXXXXXXXXXXXXX.nsp
[2024-10-04 13:21:59.176] INFO (app) Scan already in progress

Had another crash doing similar steps - viewing the webpage, transferring files and appears to hang while identifying a file. File below is different than file above. Similar time after the last entry in logs before OOM event happened ~1 min.

[2024-10-04 14:14:53.555] INFO (_internal) 192.168.2.64 - "GET /api/titles HTTP/1.1" 200 -
[2024-10-04 14:15:14.967] INFO (library) Scanning library path /games ...
[2024-10-04 14:15:15.196] INFO (library) Identifiying file (1/1): /XXXXXXXXXXXX/XXXXXXXXXXXXXX.nsp
[2024-10-04 14:15:20.733] INFO (app) Scan already in progress

Noticed something odd after restarting the docker again, appears there were additional log entries updated based on timestamp. Bolded lines below for quick reference. And according to the Unraid logs the timestamp of 14:16:21.467 in the logs below is just before the OOM event at 14:16:22 when my docker crashes.

[2024-10-04 14:15:14.967] INFO (library) Scanning library path /games ...
[2024-10-04 14:15:15.196] INFO (library) Identifiying file (1/1): /XXXXXXXXXXXXXXXXXXXXX/XXXXXXXXXXXXXX.nsp
[2024-10-04 14:15:20.733] INFO (app) Scan already in progress
[2024-10-04 14:15:22.263] INFO (_internal) 192.168.2.64 - "GET / HTTP/1.1" 200 -
[2024-10-04 14:15:22.326] INFO (_internal) 192.168.2.64 - "GET /static/style.css HTTP/1.1" 304 -
[2024-10-04 14:15:22.401] INFO (_internal) 192.168.2.64 - "GET /api/titles HTTP/1.1" 200 -
[2024-10-04 14:15:34.651] DEBUG (db) No files were deleted. All files are present on disk.
[2024-10-04 14:15:34.657] INFO (library) Generating library ...
[2024-10-04 14:15:57.410] ERROR (titles) Title ID not found in titledb: XXXXXXXXXXXXXXXX
[2024-10-04 14:16:20.142] ERROR (titles) Title ID not found in titledb: XXXXXXXXXXXXXXXX
[2024-10-04 14:16:21.467] INFO (library) Generating library done.

users:x:100:games,ownfoil
ownfoil:x:99:100:Linux User,,,:/home/ownfoil:/sbin/nologin
Starting ownfoil

@orangewhipster
Copy link

Yes, its possible. I left it running for the past day without any interaction with it and it hasn't crashed yet. So, must be something to do with scanning the library...cause that's when I noticed it happens.

@brendongl
Copy link

Im having this issue, i limited the ram usage to 2GB, but yes the webgui locks up once it hits the limit during a scan.
Did anyone have a fix for this? I need to scan and it will be a large library which is updated regularly.

@a1ex4
Copy link
Owner

a1ex4 commented Dec 9, 2024

From all of the reports shared here it would seem that the issue is happening when the directories are being monitored for file change events. I use the gorakhargosh/watchdog library and I don't see an issue related to Unraid on here.

Perhaps it would be good to execute a simple Python script just like Ownfoil does when setting up the file watcher, in an Unraid environment and see if the memory gets used as with Ownfoil:

from file_watcher import Watcher
import threading

def on_library_change(events):
    print(events)

watcher = Watcher([], on_library_change)
watcher_thread = threading.Thread(target=watcher.run)
watcher_thread.daemon = True
watcher_thread.start()

watcher.add_directory('/games')

@a1ex4
Copy link
Owner

a1ex4 commented Dec 16, 2024

Could you guys try the new develop image? I changed the behavior for monitoring library change, maybe it could help in your situation. Thanks!

@jaybird2203
Copy link

jaybird2203 commented Dec 17, 2024

pulled the dev branch, just applied direct over latest and began to copy some files over while watching logs. I copied over a few games and a few updates. First two files it detected while copying over and it appears to have hung while identifying the second file, and webpage is fully unavailable/not opening.

Didn't have the webpage open while I was transferring either, just watching the logs. Renamed the directory/file in the log below.

Not sure if it makes any difference for debugging but in addition to basic alphabet characters the naming of the directory has spaces, hyphens and decimals/periods. File name has hyphens and underscores. Will restart container in a moment and compare logs.

[2024-12-16 22:00:43.245] INFO (_internal) 192.168.2.64 - "GET /api/titles HTTP/1.1" 200 -
[2024-12-16 22:01:46.368] INFO (library) Scanning library path /games ...
[2024-12-16 22:01:46.809] INFO (library) Identifiying file (1/2): /(subdirectory)/file name.nsp
[2024-12-16 22:01:47.025] INFO (library) Identifiying file (1/2): /(subdirectory)/file name.nsp OK Title ID: 0000000000000 App ID : 11111111111111 Title Type: UPDATE Version: BBBBBBB
[2024-12-16 22:01:47.038] INFO (library) Identifiying file (2/2): //(subdirectory2)/file name2.nsp/name2.nsp
[2024-12-16 22:02:13.335] INFO (app) Scan already in progress
[2024-12-16 22:02:24.003] DEBUG (db) No files were deleted. All files are present on disk.
[2024-12-16 22:02:24.004] INFO (library) Generating library ...
[2024-12-16 22:02:32.472] INFO (app) Scan already in progress
[2024-12-16 22:02:49.360] DEBUG (db) No files were deleted. All files are present on disk.
[2024-12-16 22:02:49.372] INFO (library) Generating library ...
[2024-12-16 22:03:02.318] INFO (app) Scan already in progress
[2024-12-16 22:03:19.677] INFO (library) Generating library done.

@jaybird2203
Copy link

Restarted the container and web came back up. Had to manually scan the directory and it detected the original games/updates above and the rest that were copied.

Moving the files from the cache to the array (unraid automated process) and logs point to most of the files being deleted and added again. Manually scanning the library at this time allowed the missing files to be detected.

Web didn't hang at this moment either.

@brendongl
Copy link

brendongl commented Dec 17, 2024 via email

@a1ex4
Copy link
Owner

a1ex4 commented Dec 19, 2024

Okay so the new implementation did not correctly handle files being copied, I pushed a fix on the develop branch that improves the automatic library scan a lot, by tracking files being copied and handling the modified files in batches.

I'm curious if the situation improves for the memory lockup issue.

@jaybird2203
Copy link

jaybird2203 commented Dec 25, 2024

Ran a few more copies and moves. No crash/hangs but missing entries.

Logs state the files imported when they were finished copying in but not listed in the library/web. Proceeded to trigger the unraid move and get the log entry below.

INFO (db) File '/games/Folder/game.nsp' removed from database.
[2024-12-25 12:40:38.131] DEBUG (db) No files were deleted. All files are present on disk.

Manually rescanned the library via settings page and logs show the same files that were moved then "removed" were then identified again after the mover finished.

In the above the titles are not updated on the web library interface. Will restart the container followed by a fresh database pull on the dev branch.

Edit - after restarting the container the recently added, moved (removed), rescanned files now show in the web. Not sure if the two are connected but figured I'd mention here.

@Masamune3210
Copy link

Masamune3210 commented Dec 27, 2024

Make sure that you are pointing to the user folder and not directly to the cache or one of the disks. FUSE shouldn't be letting anything see changes relating to which disk the file is on, that's the whole idea behind the mover mechanism.

basically, go through and make sure all the mount points other than appdata are /mnt/user instead of /mnt/cache or /mnt/disk#

@jaybird2203
Copy link

Thanks for the direction to ensure the correct location is used although the /games directory was always set for /mnt/user location, not the user0 or cache etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants