
Client: premature check for max_concurrent can starve resources #1677

RichardHaselgrove opened this issue Oct 14, 2016 · 60 comments

@RichardHaselgrove
Contributor

cpu_sched.cpp builds a list of 'runnable' jobs, from which infeasible jobs can be removed at a later stage (device exclusion present, RAM usage limit exceeded, etc.)

The test on max_concurrent_exceeded(rp) at line 130 restricts the length of the runnable list, and can prevent jobs which may be required for assignment to idle resources from being present in the initial list.

Surplus jobs which might violate max_concurrent are pruned from the provisional run list at a later stage - line 1198.

previous references:
commit 8c44b2f
issue #1615

@RichardHaselgrove
Contributor Author

After wider deployment in a range of computers, I find I need to withdraw this suggestion as an over-simplistic solution - although the issue remains open.

Line 130 causes starvation by restricting the list of runnable jobs when max_concurrent is present: jobs required to occupy a different resource (second GPU type, for example) from the same project aren't added to the list.

But disabling line 130 (alone) also causes starvation of another type: when max_concurrent is present, tasks that are needed for the same resource type (multi-core CPU) from a different project are restricted.

We need to populate the runnable list with sufficient jobs to satisfy all resources from all projects, even when restrictions (such as max_concurrent and exclude_gpu) are in operation.
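That requirement can be sketched as a simple invariant check (illustrative types and names only; this is not the real client code):

```cpp
// Toy invariant for the run list: it should contain enough jobs that
// every instance of every resource on the host could be kept busy.
// Names and types are illustrative, not the actual BOINC structures.
#include <map>
#include <string>
#include <vector>

struct Job {
    std::string project;    // e.g. "A", "B"
    std::string resource;   // e.g. "nvidia", "intel", "cpu"
};

// True if, for each resource, the list holds at least as many jobs as
// there are instances - i.e. no instance is forced to sit idle merely
// because the list was cut short.
bool all_resources_covered(const std::vector<Job>& list,
                           const std::map<std::string, size_t>& instances) {
    std::map<std::string, size_t> listed;
    for (const Job& j : list) ++listed[j.resource];
    for (const auto& [res, n] : instances) {
        if (listed[res] < n) return false;
    }
    return true;
}
```

Restrictions such as max_concurrent or exclude_gpu should only trim the surplus after this invariant holds, not prevent the list from ever satisfying it.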

@RichardHaselgrove
Contributor Author

I've been asked to investigate another example of this, documented in

All CPU tasks not running. Now all are: - "Waiting to run"

Scenario:
Host has 4 GPUs and a 12-core, 24-thread CPU.
Preferred project is SETI@Home.
The preferred application comes in two versions: a GPU version, which additionally requires one full CPU thread, and a single-threaded CPU version.
The user wishes to restrict the CPU version to 12 instances, to match the physical core count, and the GPU version to 4 instances (1 per GPU, supported by the remaining CPU threads).

app_config.xml supports setting max_concurrent at the app level and at the project level, but not at the app_version level. User has chosen to set <project_max_concurrent> to 16 to allow the 4 GPU and 12 CPU tasks to run (but no more).
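For reference, the two levels at which app_config.xml does support such a limit look roughly like this (app name and counts illustrative; a per-version limit is the missing piece):

```xml
<app_config>
  <!-- per-app limit: at most 12 running tasks of this application -->
  <app>
    <name>setiathome_v8</name>
    <max_concurrent>12</max_concurrent>
  </app>
  <!-- project-wide limit: at most 16 running tasks from this project -->
  <project_max_concurrent>16</project_max_concurrent>
</app_config>
```

Because the GPU and CPU builds are versions of the same app, neither level can express "12 CPU plus 4 GPU" directly, hence the user's choice of a project-wide 16.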

Observed behaviour:
cpu_sched_debug (log posted in thread) shows that 16 tasks from the project are added to the preliminary job list (as required by max_concurrent), but that every one of them is a GPU job. When final scheduling takes place, only four tasks are scheduled (to match the GPU count), the remainder are discarded, and the CPUs are left idle.

Desired behaviour:
As documented in the 'CPU scheduling logic' statement at the head of client/cpu_sched.cpp,

// - create an ordered "run list" (make_run_list()).
// The ordering is roughly as follows:
// - GPU jobs first, then CPU jobs
// - for a given resource, jobs in deadline danger first
// - jobs from projects with lower recent est. credit first
// In principle, the run list could include all runnable jobs.
// For efficiency, we stop adding:
// - GPU jobs: when all GPU instances used

But we didn't. We added 16 of the things.

It looks as if stop_scan_coproc (lines 118-124 of client/cpu_sched.cpp) is designed to prevent this, but it isn't invoked until lines 839/856 (in 'add_coproc_jobs'). Is there any pressing reason why we don't invoke it in 'can_schedule'?
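As a toy model of what the log shows (hypothetical helper names; the real work happens in make_run_list() and the later enforcement pass): with the documented GPU-first ordering and a cap applied while scanning, a cap of 16 fills the list with 16 GPU jobs, of which only 4 can run; applying the cap only after every resource has been considered lets the 12 CPU jobs through.

```cpp
// Toy reconstruction of the reported schedule: 4 GPUs, 12 usable CPU
// threads, <project_max_concurrent>16</project_max_concurrent>, and a
// run list sorted GPU jobs first. Helper names are hypothetical.
#include <map>
#include <string>
#include <vector>

struct Job { std::string resource; };

// Premature cap: stop scanning as soon as the cap is reached,
// regardless of which resources the listed jobs need.
std::vector<Job> early_cap(const std::vector<Job>& runnable, size_t cap) {
    std::vector<Job> list;
    for (const Job& j : runnable) {
        if (list.size() >= cap) break;
        list.push_back(j);
    }
    return list;
}

// How many listed jobs actually run, given instance counts per resource
// (instances is copied so it can be decremented locally).
size_t schedulable(const std::vector<Job>& list,
                   std::map<std::string, size_t> instances) {
    size_t running = 0;
    for (const Job& j : list) {
        auto it = instances.find(j.resource);
        if (it != instances.end() && it->second > 0) {
            --it->second;
            ++running;
        }
    }
    return running;
}
```

This matches the log: 16 GPU tasks in the preliminary list, 4 scheduled, CPUs idle.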

@davidpanderson
Contributor

The issue is why we didn't add CPU jobs after the GPU ones.

In this (and all client scheduling issues) please have the user create a scenario in the client emulator:
https://boinc.berkeley.edu/sim_web.php

@RichardHaselgrove
Contributor Author

I considered that, but as yet the web interface to the simulator doesn't accept app_config.xml input.

We confirmed in discussion that the user had

<app_config>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>cuda90</plan_class>
    <avg_ncpus>1</avg_ncpus>
    <ngpus>1</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>

  <app_version>
    <app_name>astropulse_v7</app_name>
    <plan_class>opencl_nvidia_100</plan_class>
    <avg_ncpus>1</avg_ncpus>
    <ngpus>1</ngpus>
  </app_version>
  <project_max_concurrent>16</project_max_concurrent>
</app_config>

in operation, and that the scheduler had added 16 tasks - all for GPU - to the runnable list: removing <project_max_concurrent> allowed further tasks, including CPU tasks, to be considered and ultimately scheduled. SETI message 1969147

If I get him to submit the core files, can you add app_config.xml to the simulation manually for testing?

@KeithMyers

What are the "core" files needed? Are they the four files mentioned in the "scenario" page?

I assume that I would have to go back to the configuration that prevents CPU work from running. How long does the host have to run to stabilize client_state.xml for it to be considered valid for the scenario?

@RichardHaselgrove
Contributor Author

RichardHaselgrove commented Dec 14, 2018

If you follow the link in David's post, and click on the big green 'create a scenario' button, you are asked to find and upload:

client_state.xml
global_prefs.xml
global_prefs_override.xml
cc_config.xml

Notice no app_config.xml, which renders the simulation less than perfect (no app_info.xml either, but don't worry about that - the contents are copied into client_state.xml). But yes - it probably makes sense to re-create the 'All CPU tasks not running. Now all are: - "Waiting to run"' configuration, to provide David with as many clues as possible.

@KeithMyers

I cannot create the scenario because the client simulation page complains that it has not received my client_state.xml file. I have tried several times and am positive I am selecting the correct file. I even tried the client_state_bkup.xml file to see if it liked that. No luck.

I have reconfigured the host to reproduce the problem, with the <project_max_concurrent>16</project_max_concurrent> statement in my app_config.xml file preventing all CPU tasks from running. So while the simulator seemed like a great idea, it is not useful to me right now. Any ideas as to what to do next?

@KeithMyers

Maybe you can grab the file from my Dropbox account.
https://www.dropbox.com/s/ire3kc4gevq8oyu/client_state.xml?dl=0

@RichardHaselgrove
Contributor Author

RichardHaselgrove commented Dec 14, 2018

Tried, and got the same result as Keith:

Unable to handle request
You must specify a client_state.xml file.

That's after the file upload had counted to 100% at normal speed for my internet connection. File looks like a correctly-terminated client_state, and comprises 2,054 KB. @davidpanderson - is there a file size limit on simulator input files?

Edit - I looked in page source, but this is a generic error message that could cover any number of upload failure cases.

    if (!is_uploaded_file($csname)) {
        error_page("You must specify a client_state.xml file.");
    }

@KeithMyers

KeithMyers commented Dec 14, 2018

OK, I have uploaded all four client simulator files to my Dropbox account. The link is:
https://www.dropbox.com/sh/yqtg7o00jrdi3tk/AAA6BzG-eLPZZixgNatHE2Cba?dl=0

I too wondered if there is a file size limit on the simulator. My client_state.xml file is probably bigger than most, as I have 500 Seti tasks alone in the file along with the tasks from my other projects.

@KeithMyers

I could abort my other project tasks to try and reduce the size if you would think that would help.

@RichardHaselgrove
Contributor Author

Politer simply to set NNT for a few hours to run down the cache. You and David are in the same time zone, so might be easier for you to work out a plan directly between you - I'll be off to bed within a couple of hours.

@KeithMyers

It will take several days to over a week to run down the caches for my other projects after setting NNT based on their respective deadlines.

@davidpanderson
Contributor

I'll think about how to do that. It would be a nice addition to the emulator.

@KeithMyers

Were you able to run my files on the emulator? Or did you run into the same trouble as Richard and me, where the emulator won't accept my client_state.xml file?

@RichardHaselgrove
Contributor Author

I've edited down client_state.xml to remove (by my count) 328 SETI workunits and results. The resulting 1,625 KB file uploaded cleanly - yay!

I've run the resulting scenario 160, ID 0 - link to results.

The resulting timeline shows the SETI tasks running, as we would expect with no <max_concurrent> input. @davidpanderson, can you take it from there?

@KeithMyers

Hi Richard, thanks for figuring out why the simulator wouldn't take my client_state file. I assume we discovered there is a file size limit. It probably needs to allow larger files in the future.

I looked through the output files and don't really understand what I am looking at. I assume the tallies that said N deadlines met meant all tasks ran as expected. The timeline shows that even the CPU tasks ran?

But so far we can't duplicate my actual running conditions with a max_concurrent or project_max_concurrent statement, because the simulator won't accept one?

Have I grasped the situation correctly?

@KeithMyers

Went back through the history of this bug issue and read through it again. I just want to re-state that the host ran as expected with <project_max_concurrent> in play in my app_config.xml, and my CPU tasks ran just as they always had. It wasn't until I introduced the <gpu_exclude> statements into cc_config.xml that things went sideways. So there is an interplay between those two elements that is causing the issue.

@RichardHaselgrove
Contributor Author

On a hunch that sysadmins usually set limits in round numbers, I redid the result removal rather more forensically, taking out one at a time.

File size:
2,102,396	failed
2,100,345	failed
2,098,314	failed
2,096,255	succeeded

So I think we can say the limit is 2 MB - perhaps that could be noted on the file selection page?
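For what it's worth, the cut-off measured above sits almost exactly at 2 MiB (2,097,152 bytes), which matches PHP's stock upload limit; if that is the cause, the server-side fix would be a one-line change (a sketch, assuming a standard php.ini):

```ini
; php.ini - the stock upload cap is 2M, which rejects large client_state.xml files
upload_max_filesize = 8M
; post_max_size must be at least as large as upload_max_filesize
post_max_size = 16M
```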

I also see that we now have an input field for app_config.xml, so we have scenario 161 to examine - but the bug still doesn't appear in the timeline.

@KeithMyers

KeithMyers commented Dec 16, 2018 via email

@RichardHaselgrove
Contributor Author

Yes, scenario #161 had the app_config uploaded with the SETI url. Pending announcement, we can't be sure that all the backend tools have been hooked up (or even written) yet.

@KeithMyers

Thanks Richard. I will keep monitoring the issue page for updates.

@davidpanderson
Contributor

I added a feature to the emulator that lets you upload the app_config.xml for a project
(you identify the project by its master URL).
Please give this a try and let me know if it reproduces the observed behavior.

@RichardHaselgrove
Contributor Author

I uploaded the app_config.xml for SETI (using the master url from client_state) with scenario 161, something over 24 hours ago. On the initial run, it failed to reproduce the problem - have there been further backend updates since then, and if so, will they require a fresh upload?

@KeithMyers

It will be a while since my client_state.xml is way too big to be uploaded. Unless you fixed that flaw with the emulator. Being nice and not aborting tasks for other projects and just relying on NNT for them, it will be at least a couple of weeks for my client_state to get below the default 2MB file size limit Richard discovered. I don't feel confident in editing my client_state like Richard did.

@KeithMyers

OK, sorry. Forgot the person's country of origin I was typing with. LOL. I have always called the # symbol a pound-sign. I've never referred to it as I guess the proper term now is hashtag. I don't do any social apps so never made the switch in terminology.

@davidpanderson
Contributor

I found and fixed the problem. Thanks to Keith and Richard for setting it up in the emulator.

The output of the emulator is a bit cryptic. The "timeline" is the most useful. It shows what jobs are running (CPU jobs on the left, GPU jobs on the right) as time progresses. Interspersed with that are descriptions of the scheduler RPCs the client makes (all simulated, of course).

Once a scenario is uploaded, I can run the emulator (which is based on the actual client code) under a debugger, where I can see exactly what it's doing. This makes it fairly easy to diagnose problems of this sort.

@KeithMyers

Still confused. Is this the original "I fixed the problem" or another one after Richard reported the original fix still had issues and failed in Scenario 165?

@davidpanderson
Contributor

I added the ability to specify 2 app_configs.

@RichardHaselgrove
Contributor Author

Thanks - I've created #166. That shows the same behaviour as I described in #1677 (comment), and it's visible in the timeline.

@RichardHaselgrove
Contributor Author

David has done a lot more work in #2918, and it seems to be working right for me now. @KeithMyers - would you be able to build for Linux and test at 9f8f52b, or would you need a hand?

@KeithMyers

KeithMyers commented Dec 31, 2018 via email

@KeithMyers

KeithMyers commented Dec 31, 2018 via email

@RichardHaselgrove
Contributor Author

That referral was from Gary Roberts, whose query to me led ultimately to #2904 - I gave him a couple of immediate lines of code which met his need, and he succeeded in compiling them. But he's not a full developer, and he may feel, like Keith, that pulling an un-merged branch from here and building a pre-alpha client for testing is beyond his skill-set.

I assume we're no closer yet to an 'AppVeyor artifact' style of downloadable binary for Linux? Pending that, would any passing reader here be able to help Keith out?

I feel quite strongly (as I said in the working party last year) that these complex and subtle client changes should be tested in live running, not just approved 'off the page' as valid and stylistic code. Otherwise, we enter the "Too soon to test, too late to change" gotcha at the client release beta testing phase.

@AenBleidd
Member

AenBleidd commented Dec 31, 2018 via email

@RichardHaselgrove
Contributor Author

Ah, thank you - Keith, over to you!

But @AenBleidd - two queries.

  1. in general, how would I find the bintray location for an arbitrary commit from an arbitrary PR? I can follow links from the AppVeyor check details and get the Windows artifact from there, but I don't have the equivalent routing map for Linux.

  2. In this particular case, any of the bintray PR2918 files (Windows or Linux) triggers an anti-virus alert from Microsoft Security Essentials, and fails to complete the download. The alternative downloads from https://ci.appveyor.com/project/BOINC/boinc/builds/21168922/artifacts were passed clean by MSE.

@AenBleidd
Member

AenBleidd commented Dec 31, 2018 via email

@RichardHaselgrove
Contributor Author

OK, no action now - It's New Year's Eve and we should all stop work. But preserving the evidence for later inspection:

bintray trojan warning

https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?name=Trojan%3aWin32%2fSpursint.F!cl&threatid=2147717281&enterprise=0

@AenBleidd
Member

AenBleidd commented Dec 31, 2018 via email

@KeithMyers

KeithMyers commented Jan 1, 2019 via email

@KeithMyers

KeithMyers commented Jan 1, 2019

I found an older obsoleted repository that allowed me to download the deb file for the libpng12 version. My current Ubuntu 18.04 uses the libpng16.so.16 library. Had to install libcurl4 too.

Got it to run, but it seems my app_config was not read at all. I put back the <project_max_concurrent>16</project_max_concurrent> statement in my app_config that causes the starvation, but I found all 24 threads engaged when I ran the beta client and manager. I re-read the config files to be sure they were picked up; they were in fact read. So this beta still does not work for me.

I will wait for the official commit to master before I attempt to compile the client and manager on my host.

@RichardHaselgrove
Contributor Author

Win32 Trojan in Linux binaries, really? Definitely false positive.

Actually, I wasn't downloading Linux binaries. I was downloading a byte sequence which presented itself - by name only, contents unknown - as a 7-zip compressed archive. Compressed archives are platform independent, although they may contain self-extracting executable code for any given platform.

My suspicion is that it is the packaging, rather than the contents, that triggered the alert. I got the same warning on an archive supposed to contain Windows code. I'll do a detailed comparison of the AppVeyor and Bintray downloads - both package and contents - later and report back.

@RichardHaselgrove
Contributor Author

Well, that was strange. I used win-manager_PR2918_2018-12-31_9f8f52b7.7z as my test file, because - at the time I did it* - the same file was available from both AppVeyor and bintray, and the AppVeyor download did NOT trigger a security alert: the bintray download DID show a virus warning from MSE.

However, both the downloaded archive, and the extracted binaries, were bytewise identical, and I couldn't find any difference in the file attributes. I also have a second, commercial, AV package running on the same machine, and that found no problems even on a manual scan.

So I'm happy to agree that this was a false alarm. But I do remember that one of the big freeware/shareware download sites - was it CNET? - got a very bad reputation for silently including its own 'add-ons' to other people's download packages: that's a complete no-no in my book, for any cloud storage or distribution site. I do hope nothing like that is being attempted here.

[*] I wrote this about 90 minutes ago, and then found that I'd previewed it but not posted. So I recreated it: in confirming the file name, I got a security warning on the AppVeyor download from MSE. My download manager - Chrome browser - is showing 'Failed - Insufficient permissions': I need to go and look up what that means. But commit the post first...

@ChristianBeer
Member

in general, how would I find the bintray location for an arbitrary commit from an arbitrary PR? I can follow links from the AppVeyor check details and get the Windows artifact from there, but I don't have the equivalent routing map for Linux.

That is something that is bothering me too but I didn't have time to work on this. Github does have some features we can use to automatically link to artifacts but we need to wait on Travis to migrate the Github integration for the BOINC repository to support those new features. There is currently a Beta test underway which I declined because I didn't want to break the CI integration over the holidays.

At the moment the best way to grab the artifacts is to go to: https://bintray.com/beta/#/boinc/boinc-ci/pull-requests?tab=overview and search the versions for the newest one. The naming convention is: pull request number, date of build, SHA1 of the commit in the pull request branch that was used to build the artifacts. After clicking on a version you need to go to the Files tab in order to see the artifacts for this version. The Windows archives are the same as on AppVeyor, since they are only packed once before being uploaded twice.

I don't know why the bintray download is flagged by MSE since the archive itself is identical. It might be that MSE uses additional attributes to check for potentially malicious content (i.e. flags all downloads from bintray.com because there was one real incident).

@JuhaSointusalo
Contributor

More or less fixed in #2918. If there are still problems with max_concurrent open a new issue.
