Really high CPU load over time #1356
Comments
We are experiencing this too. After some time the CasparCG 2.3 LTS process gets stuck at 99% CPU and then fails, even after STOPping all layers and then playing only one. |
I have seen this too. |
Does anyone have reliable repro steps? |
@scriptorian is able to reproduce this and is having a look into the cause |
Seems like it can be reproduced by issuing multiple LOAD and PLAY commands over time. |
Reproducible after multiple PLAY and LOADBG commands over time, as @hummelstrand mentioned, even on a single layer. I will prepare a command log to reproduce it. |
As mentioned, I have managed to reproduce this with a test script that repeatedly LOADs a clip onto a channel/layer (using the ffmpeg producer). No PLAY is required to provoke the fault. For testing I have made the script loop every 200ms, which makes the problem apparent in a reasonable amount of time. The first symptom is the process working set increasing linearly; then, after a few minutes, the CPU load starts increasing too.

I have analysed the application using various tools and confirmed that it is working well and not leaking any threads or objects on the heap (with the exception of one rare bug that I have addressed, which is not relevant to this problem). That is great news, but frustrating in terms of finding the problem. I recently tried running Windows Performance Analyzer and finally found a clue. Comparing CPU usage early and late in a run showed that an increasing amount of time was being spent in the TBB library and in cleaning up thread local storage. With some very simple (and not production ready!) hacking I removed the TBB thread parallel optimisations in the ffmpeg producer, and the memory and CPU growth problem disappeared.

I don't believe there is anything wrong with the CasparCG code that uses this library, so my next step will be to get an updated version of the TBB library and try again with that. The release notes mention some bugfixes that may be relevant. Intel have now wrapped it into their new oneAPI product, and installing that failed for me just now. If anyone here has experience of this library (@ronag?) I'd be grateful for any pointers on how you cooked it / downloaded it last time. |
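For anyone who wants to try a similar repro, the loop described above might look roughly like the sketch below. This is not the script used in this thread. It assumes the server accepts AMCP commands over TCP on the default port 5250, and the host, channel/layer `1-10` and clip name `AMB` are placeholders to swap for your own setup.

```python
import socket
import time

AMCP_PORT = 5250  # assumed: CasparCG's default AMCP port; check your server config


def load_command(channel: int, layer: int, clip: str) -> bytes:
    """Build one AMCP LOAD command; AMCP commands are terminated with CRLF."""
    return f'LOAD {channel}-{layer} "{clip}"\r\n'.encode("utf-8")


def hammer_load(conn, clip, channel=1, layer=10, interval=0.2,
                count=None, sleep=time.sleep):
    """Send LOAD repeatedly on one channel/layer, pausing `interval` seconds.

    `conn` is any object with sendall() and recv(), e.g. a connected socket.
    No PLAY is issued. With count=None the loop runs until interrupted.
    Returns the number of commands sent.
    """
    sent = 0
    while count is None or sent < count:
        conn.sendall(load_command(channel, layer, clip))
        conn.recv(4096)  # read the reply so the receive buffer does not fill up
        sent += 1
        sleep(interval)
    return sent


# Example use (placeholder host and clip name), stop with Ctrl+C:
#
#   with socket.create_connection(("127.0.0.1", AMCP_PORT)) as conn:
#       hammer_load(conn, "AMB")
```

While it runs, watch the server process's working set and CPU in Task Manager or a similar tool. The 200ms default interval matches the loop period mentioned above.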
Try skipping the custom tbb stuff and use the regular ffmpeg thread pool? |
Thanks @ronag. If you are referring to the override of AVFilterGraph::execute that currently uses TBB as the custom multithreading implementation, then yes, I have turned this off. The real difference with this problem, though, is in the tbb::parallel_invoke and tbb::parallel_for_each calls in av_producer and av_util. Removing these stops the problem; removing just one of them halves the rate of growth! |
For now just remove the tbb stuff. We can follow up with another PR with an updated tbb version later. |
I don't know how to update tbb at the moment since intel wrapped it into oneAPI. |
on windows you can also try https://docs.microsoft.com/en-us/cpp/parallel/concrt/how-to-write-a-parallel-for-loop?view=msvc-160 |
Do we know if this problem occurs on Linux? |
Thanks for the suggestions. I've got hold of the latest tbb now and I think the best approach is to push through with trying that. If the problem has gone away then there are no code changes (any tbb interface changes notwithstanding) and linux should continue to work - hopefully without any problems. Any other approach would require a fair amount of code changes with potentially surprising impacts on performance and that seems like something to avoid if possible. |
Sorry, is it something we can fix via some TBB tweaking in Windows, or not? |
I have now downloaded and built with the latest TBB library from the Intel oneAPI product. There were some API changes but dealing with these was straightforward and should be safe. |
Awesome, will it be included in some future builds of CasparCG? Or can you please provide your build for long time testing? |
We are just discussing how to progress with testing this change and whether to make a beta version. Does anyone here have any thoughts? I'll update this thread when we have a plan! |
Please beta test and report any issues here! |
Is this something to worry about on Linux? (Running NRK version). |
It's not clear whether the TBB bug also exists in the Linux version. The TBB release notes include some mentions of fixing relevant bugs in the Windows version so there is reasonable hope that this problem won't affect Linux. |
The latest NRK version of CasparCG Server is v2.1, so it is not affected by this bug which seems to have been introduced in v2.2. |
OK I get it. Thank you @scriptorian and @hummelstrand |
Just FYI: It seems there is no problem with increasing CPU load on 2.3.2 beta on Windows 10 (yellow lines). There is just a slight memory usage increase over time but from my experience, it will eventually drop. Green lines belong to a custom 2.3.0 build running on Debian. Both servers use LOADBG/AUTO to play mixed (Linux) and XDCAM HD (Windows) playlists. |
I can confirm that this build fixes the CPU usage leak on Windows (both Intel and AMD machines, currently running 5 days 24/7). Unfortunately I have experienced a memory leak on the GPU when HTML template GPU acceleration is enabled. I will start a new thread for that. |
I have also encountered this. |
Never mind, already done, thanks! |
This is off-topic, but the beta version v2.3.2-lts-beta also has audio issues on systems that use the 1001-based standard. |
Expected behaviour
Be able to play clips, both long and short without having to worry about the CPU load.
Current behaviour
When playing shorter clips using v2.3.0 LTS (and also v2.2.0), the CPU load climbs to 90-92% over time and stays there. I have attached some screenshots to show what it looks like. For longer clips we do not see this behaviour.
Shorter clips = around 20 seconds
Longer clips = hours
I think it has to do with the number of commands sent and that it's not related to the actual file length, but it's just a theory.
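A rough back-of-the-envelope check of that theory: assuming clips play back to back and each clip costs one load command, the number of loads per day depends only on clip length. The 2-hour figure below is a hypothetical stand-in for "hours".

```python
SECONDS_PER_DAY = 24 * 60 * 60


def loads_per_day(clip_seconds: float) -> float:
    """Loads issued in 24 hours, assuming one load per clip, played back to back."""
    return SECONDS_PER_DAY / clip_seconds


short = loads_per_day(20)          # 20-second clips
long_ = loads_per_day(2 * 3600)    # hypothetical 2-hour clips

print(short, long_, short / long_)  # prints: 4320.0 12.0 360.0
```

Under these assumptions a playlist of 20-second clips issues several hundred times more loads per day than one of multi-hour clips, so a cost that grows with each command would show up far sooner with short clips.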
Environment
Screenshots