Client: premature check for max_concurrent can starve resources #1677
After wider deployment in a range of computers, I find I need to withdraw this suggestion as an over-simplistic solution - although the issue remains open.

Line 130 causes starvation by restricting the list of runnable jobs when max_concurrent is present: jobs required to occupy a different resource (second GPU type, for example) from the same project aren't added to the list. But disabling line 130 (alone) also causes starvation of another type: when max_concurrent is present, tasks that are needed for the same resource type (multi-core CPU) from a different project are restricted.

We need to populate the runnable list with sufficient jobs to satisfy all resources from all projects, even when restrictions (such as max_concurrent and exclude_gpu) are in operation. |
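The population strategy described above can be sketched in simplified form. This is illustrative C++ only, not BOINC's actual scheduler code; `Job`, `populate_runnable`, and `per_slot` are hypothetical names. The idea is to keep candidates for every (project, resource type) pair when building the runnable list, and leave max_concurrent enforcement to the later pruning pass.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a client job: which project it belongs to,
// and which resource type it needs ("CPU", "GPU0", "GPU1", ...).
struct Job {
    std::string project;
    std::string resource;
};

// Keep up to per_slot candidates per (project, resource) pair, deliberately
// ignoring max_concurrent at this stage, so that no resource type from any
// project can be starved of candidates by an early cutoff.
std::vector<Job> populate_runnable(const std::vector<Job>& jobs, int per_slot) {
    std::map<std::pair<std::string, std::string>, int> count;
    std::vector<Job> runnable;
    for (const Job& j : jobs) {
        int& c = count[{j.project, j.resource}];
        if (c < per_slot) {
            c++;
            runnable.push_back(j);
        }
    }
    return runnable;
}
```

With this shape, a later pruning pass (analogous to the one at line 1198) can apply max_concurrent and exclude_gpu against a list that still contains candidates for every resource from every project.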
I've been asked to investigate another example of this, documented in 'All CPU tasks not running. Now all are: - "Waiting to run"'.

Scenario: app_config.xml supports setting max_concurrent at the app level and at the project level, but not at the app_version level. The user has chosen to set <project_max_concurrent> to 16 to allow the 4 GPU and 12 CPU tasks to run (but no more).

Observed behaviour:
Desired behaviour:
But we didn't. We added 16 of the things. It looks as if stop_scan_coproc (lines 118-124 of client/cpu_sched.cpp) is designed to prevent this, but it isn't invoked until lines 839/856 (in 'add_coproc_jobs'). Is there any pressing reason why we don't invoke it in 'can_schedule'? |
The issue is why we didn't add CPU jobs after the GPU ones. In this (and all client scheduling issues) please have the user create a scenario in the client emulator: |
I considered that, but as yet the web interface to the simulator doesn't accept app_config.xml input. We confirmed in discussion that the user had
in operation, and that the scheduler had added 16 tasks - all for GPU - to the runnable list: removing
If I get him to submit the core files, can you add app_config.xml to the simulation manually for testing? |
What are the "core" files needed? Are they the four files mentioned in the "scenario" page? I assume that I would have to go back to the configuration that prevents CPU work from running. How long does the host have to run to stabilize client_state.xml for it to be considered valid for the scenario? |
If you follow the link in David's post, and click on the big green 'create a scenario' button, you are asked to find and upload: client_state.xml Notice no app_config.xml, which renders the simulation less than perfect (no app_info.xml either, but don't worry about that - the contents are copied into client_state.xml). But yes - it probably makes sense to re-create the 'All CPU tasks not running. Now all are: - "Waiting to run"' configuration, to provide David with as many clues as possible. |
I cannot create the scenario because the client simulation page complains that it has not received my client_state.xml file. I have tried several times and am positive I am selecting the correct file. I even tried the client_state_bkup.xml file to see if it liked that. No luck. I have reconfigured the host to cause the problem, with the <project_max_concurrent>16</project_max_concurrent> statement in my app_config.xml file, which stops all CPU tasks from running. So while the simulator seemed like a great idea, it is not useful to me right now. Any ideas as to what to do next? |
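For context, a minimal app_config.xml of the kind described in this thread might look like the following. Only the `<project_max_concurrent>` value of 16 comes from the discussion; the app name and per-app cap are assumptions added for illustration, not Keith's actual file:

```xml
<!-- Hypothetical app_config.xml: cap the project at 16 concurrent tasks,
     intended to allow 4 GPU + 12 CPU tasks to run (but no more).
     The app name and per-app max_concurrent are illustrative assumptions. -->
<app_config>
    <project_max_concurrent>16</project_max_concurrent>
    <app>
        <name>setiathome_v8</name>
        <max_concurrent>12</max_concurrent>
    </app>
</app_config>
```

Note that, as the scenario above says, app_config.xml supports max_concurrent at the app and project levels only, not at the app_version level.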
Maybe you can grab the file from my Dropbox account. |
Tried, and got the same result as Keith:
That's after the file upload had counted to 100% at normal speed for my internet connection. File looks like a correctly-terminated client_state, and comprises 2,054 KB. @davidpanderson - is there a file size limit on simulator input files? Edit - I looked in page source, but this is a generic error message that could cover any number of upload failure cases.
|
OK, I have uploaded all four client simulator files to my Dropbox account. The link is: I too wondered if there is a file size limit on the simulator. My client_state.xml file is probably bigger compared to most others as I have 500 Seti tasks alone in the file along with the tasks from my other projects. |
I could abort my other project tasks to try and reduce the size if you would think that would help. |
Politer simply to set NNT for a few hours to run down the cache. You and David are in the same time zone, so might be easier for you to work out a plan directly between you - I'll be off to bed within a couple of hours. |
It will take several days to over a week to run down the caches for my other projects after setting NNT based on their respective deadlines. |
I'll think about how to do that. It would be a nice addition to the emulator. |
Were you able to run my files on the emulator? Or did you run into the same troubles as Richard and myself, in that the emulator won't accept my client_state.xml file? |
I've edited down client_state.xml to remove (by my count) 328 SETI workunits and results. The resulting 1,625 KB file uploaded cleanly - yay! I've run the resulting scenario 160, ID 0 - link to results. The resulting timeline shows the SETI tasks running, as we would expect with no <max_concurrent> input. @davidpanderson, can you take it from there? |
Hi Richard, thanks for figuring out why the simulator wouldn't take my client_state file. I assume we discovered there is a file size limit for the file. Probably needs to allow larger file sizes in the future. I looked through the output files and don't really understand what I am looking at. I assume the tallies that said N deadlines met meant all tasks ran as expected. The timeline shows that even the cpu tasks ran? But we can't so far duplicate my actual running conditions with a max_concurrent or project_max_concurrent statement because the simulator won't allow this? Have I grasped the situation correctly? |
Went back through the history of this bug issue and read through it again. I just want to re-comment that the host ran as expected with the <project_max_concurrent> in play in my app_conf.xml and my cpu tasks ran just as they always had. It wasn't until I introduced the <gpu_exclude> statements into cc_config.xml that things went sideways. So there is an interplay between those two elements that is causing the issue. |
On a hunch that sysadmins usually set limits in round numbers, I redid the result removal rather more forensically, taking out one at a time.
File size:
2,102,396 failed
2,100,345 failed
2,098,314 failed
2,096,255 succeeded
So I think we can say the limit is 2 MB - perhaps that could be noted on the file selection page? I also see that we now have an input field for app_config.xml, so we have scenario 161 to examine - but the bug still doesn't appear in the timeline. |
So I need to set NNT again on my other projects to whittle down the client_state file to less than 2,096,255 bytes?
Then I can again upload my simulator files and also include app_config? I see that it has another field below it to specify which project the app_config belongs to.
Or am I mistaken, and you already ran simulation #161 with my app_config and it didn't show anything in the timeline?
|
Yes, scenario #161 had the app_config uploaded with the SETI url. Pending announcement, we can't be sure that all the backend tools have been hooked up (or even written) yet. |
Thanks Richard. I will keep monitoring the issue page for updates. |
I added a feature to the emulator that lets you upload the app_config.xml for a project |
I uploaded the app_config.xml for SETI (using the master url from client_state) with scenario 161, something over 24 hours ago. On the initial run, it failed to reproduce the problem - have there been further backend updates since then, and if so, will they require a fresh upload? |
It will be a while since my client_state.xml is way too big to be uploaded. Unless you fixed that flaw with the emulator. Being nice and not aborting tasks for other projects and just relying on NNT for them, it will be at least a couple of weeks for my client_state to get below the default 2MB file size limit Richard discovered. I don't feel confident in editing my client_state like Richard did. |
OK, sorry. Forgot the country of origin of the person I was typing with. LOL. I have always called the # symbol a pound-sign; I guess the proper term now is hashtag. I don't do any social apps, so never made the switch in terminology. |
I found and fixed the problem. Thanks to Keith and Richard for setting it up in the emulator. The output of the emulator is a bit cryptic. The "timeline" is the most useful. It shows what jobs are running (CPU jobs on the left, GPU jobs on the right) as time progresses. Interspersed with that are descriptions of the scheduler RPCs the client makes (all simulated, of course). Once a scenario is uploaded, I can run the emulator (which is based on the actual client code) under a debugger, where I can see exactly what it's doing. This makes it fairly easy to diagnose problems of this sort. |
Still confused. Is this the original "I fixed the problem" or another one after Richard reported the original fix still had issues and failed in Scenario 165? |
I added the ability to specify 2 app_configs. |
 
Thanks - I've created #166. That shows the same behaviour as I described in #1677 (comment), and it's visible in the timeline. |
David has done a lot more work in #2918, and it seems to be working right for me now. @KeithMyers - would you be able to build for Linux and test at 9f8f52b, or would you need a hand? |
I assume you mean git the dpa_ncurrent repository?
|
Forgot to add: yes, I would need a hand. I asked your referral at Einstein for help, but never received a reply to my friend request or to my request for assistance in building for Linux.
|
That referral was from Gary Roberts, whose query to me led ultimately to #2904 - I gave him a couple of immediate lines of code which met his need, and he succeeded in compiling them. But he's not a full developer, and he may feel, like Keith, that pulling an un-merged branch from here and building a pre-alpha client for testing is beyond his skill-set.

I assume we're no closer yet to an 'AppVeyor artifact' style of downloadable binary for Linux? Pending that, would any passing reader here be able to help Keith out?

I feel quite strongly (as I said in the working party last year) that these complex and subtle client changes should be tested in live running, not just approved 'off the page' as valid and stylistic code. Otherwise, we enter the "Too soon to test, too late to change" gotcha at the client release beta testing phase. |
All of this is already available on bintray. Please check here:
https://bintray.com/boinc/boinc-ci/pull-requests/PR2918_2018-12-31_9f8f52b7#files
|
Ah, thank you - Keith, over to you! But @AenBleidd - two queries.
1. In general, how would I find the bintray location for an arbitrary commit from an arbitrary PR? I can follow links from the AppVeyor check details and get the Windows artifact from there, but I don't have the equivalent routing map for Linux.
2. In this particular case, any of the bintray PR2918 files (Windows or Linux) triggers an anti-virus alert from Microsoft Security Essentials, and fails to complete the download. The alternative downloads from https://ci.appveyor.com/project/BOINC/boinc/builds/21168922/artifacts were passed clean by MSE.
|
You can find a build for any pull request on bintray under the BOINC organization. You can use the PR number to find the appropriate build. As for the antivirus, I know nothing about this and can't check it right now because I'm away from my PC.
|
OK, no action now - It's New Year's Eve and we should all stop work. But preserving the evidence for later inspection:
[image: bintray trojan warning]
https://user-images.githubusercontent.com/14886436/50567553-43509a00-0d3e-11e9-92ba-beecd73be856.png
https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?name=Trojan%3aWin32%2fSpursint.F!cl&threatid=2147717281&enterprise=0 |
Win32 Trojan in Linux binaries, really? Definitely false positive.
In any case, I'll check it further.
Happy New Year, guys!
|
I installed clamav, and the client, manager and apps files scanned clean. Made a BOINC Beta directory and unpacked the client and manager. Checked permissions and dependencies. Started up the client and then tried starting up the manager. No manager. Then went back to the dependency check and found a missing dependency I never had before and didn't recognize. I have always run the TBar BOINC packages.
libpng12.so.0 => not found
Tried a sudo apt install and found nothing. Tried Synaptic and it finds nothing related to any libpng at all.
keith@Numbskull:~$ sudo apt install libpng12-0
[sudo] password for keith:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package libpng12-0 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'libpng12-0' has no installation candidate
Has this always been the case with the modern Linux client and manager releases, or is this something that has just popped up with the beta Linux release?
|
I found an older, obsoleted repository that allowed me to download the deb file for the libpng12 version. My current Ubuntu 18.04 uses the libpng16.so.16 library. Had to install libcurl4 too. Got it to run, but my app_config seemed not to be honoured at all. I put back the <project_max_concurrent>16</project_max_concurrent> statement in my app_config that causes the starvation, but I found all 24 threads engaged when I ran the beta client and manager. I re-read the config files to be sure they were picked up; they were in fact read. So this beta still does not work for me. I will wait for the official commit to master before I attempt to compile the client and manager on my host. |
Actually, I wasn't downloading Linux binaries. I was downloading a byte sequence which presented itself - by name only, contents unknown - as a 7-zip compressed archive. Compressed archives are platform independent, although they may contain self-extracting executable code for any given platform. My suspicion is that it is the packaging, rather than the contents, that triggered the alert. I got the same warning on an archive supposed to contain Windows code. I'll do a detailed comparison of the AppVeyor and Bintray downloads - both package and contents - later and report back. |
Well, that was strange. I used win-manager_PR2918_2018-12-31_9f8f52b7.7z as my test file, because - at the time I did it* - the same file was available from both AppVeyor and bintray, and the AppVeyor download did NOT trigger a security alert: the bintray download DID show a virus warning from MSE.

However, both the downloaded archive and the extracted binaries were bytewise identical, and I couldn't find any difference in the file attributes. I also have a second, commercial, AV package running on the same machine, and that found no problems even on a manual scan. So I'm happy to agree that this was a false alarm.

But I do remember that one of the big freeware/shareware download sites - was it CNET? - got a very bad reputation for silently including its own 'add-ons' in other people's download packages: that's a complete no-no in my book, for any cloud storage or distribution site. I do hope nothing like that is being attempted here.

[*] I wrote this about 90 minutes ago, and then found that I'd previewed it but not posted. So I recreated it: in confirming the file name, I got a security warning on the AppVeyor download from MSE. My download manager - Chrome browser - is showing 'Failed - Insufficient permissions': I need to go and look up what that means. But commit the post first... |
That is something that is bothering me too, but I didn't have time to work on this. GitHub does have some features we can use to automatically link to artifacts, but we need to wait on Travis to migrate the GitHub integration for the BOINC repository to support those new features. There is currently a beta test underway, which I declined because I didn't want to break the CI integration over the holidays.

At the moment the best way to grab the artifacts is to go to: https://bintray.com/beta/#/boinc/boinc-ci/pull-requests?tab=overview and search the versions for the newest one. The naming convention is: pull-request number, date of build, SHA1 of the commit in the pull request branch that was used to build the artifacts. After clicking on a version you need to go to the Files tab in order to see the artifacts for that version.

The Windows archives are the same as in AppVeyor, since they are only packed once before being uploaded twice. I don't know why the bintray download is flagged by MSE, since the archive itself is identical. It might be that MSE uses additional attributes to check for potentially malicious content (i.e. flags all downloads from bintray.com because there was one real incident). |
More or less fixed in #2918. If there are still problems with max_concurrent open a new issue. |
cpu_sched.cpp builds a list of 'runnable' jobs, from which infeasible jobs can be removed at a later stage (device exclusion present, RAM usage limit exceeded, etc.)
The test on max_concurrent_exceeded(rp) at line 130 restricts the length of the runnable list, and can prevent jobs which may be required for assignment to idle resources from being present in the initial list.
Surplus jobs which might violate max_concurrent are pruned from the provisional run list at a later stage - line 1198.
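The difference between enforcing the cap while building the runnable list (the line 130 check) and enforcing it only at the later pruning stage (line 1198) can be shown with a simplified model. This is hypothetical C++ for illustration, not the client's actual code; the function names, the 16-task cap, and the 4-GPU/12-CPU slot counts are taken from the scenario discussed in this thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal stand-in for a result/task: only the resource type matters here.
struct Job {
    std::string resource;  // "GPU" or "CPU"
};

// Early enforcement: the project-wide cap is applied while *building* the
// runnable list, mirroring the premature max_concurrent_exceeded() test.
// If GPU jobs are enumerated first, they fill the cap and CPU jobs never
// become candidates at all.
std::vector<Job> build_early(const std::vector<Job>& jobs, int cap) {
    std::vector<Job> runnable;
    for (const Job& j : jobs) {
        if ((int)runnable.size() >= cap) break;  // cap hit: remaining jobs excluded
        runnable.push_back(j);
    }
    return runnable;
}

// Late enforcement: build the full list, then apply the cap only when
// assigning jobs to actual device slots (4 GPU + 12 CPU here).
std::vector<Job> schedule_late(const std::vector<Job>& jobs, int cap,
                               int gpu_slots, int cpu_slots) {
    std::vector<Job> running;
    int gpu = 0, cpu = 0;
    for (const Job& j : jobs) {
        if ((int)running.size() >= cap) break;  // project_max_concurrent respected
        if (j.resource == "GPU" && gpu < gpu_slots) { gpu++; running.push_back(j); }
        if (j.resource == "CPU" && cpu < cpu_slots) { cpu++; running.push_back(j); }
    }
    return running;
}
```

Under the early scheme, a queue of 20 GPU jobs followed by 12 CPU jobs with a cap of 16 yields a runnable list of 16 GPU jobs and no CPU jobs, matching the "added 16 tasks - all for GPU" observation above; the late scheme fills 4 GPU slots and 12 CPU slots within the same cap.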
previous references:
commit 8c44b2f
issue #1615