Tests for observation loss and resource leakage #183

Open
waltermi opened this issue Oct 23, 2017 · 16 comments

Comments

@waltermi

In the ECN plugin, the "ecn.negotiation" information is missing from some measurements when a large number of workers are running. Running fewer workers reduces the number of measurements with missing conditions. The problem also occurred with some self-made chains.

@irl
Member

irl commented Oct 25, 2017

This makes me think that flows are being dropped within PATHspider, or that packets are being dropped and never captured. Unless you can find a place in PATHspider where this is occurring when it shouldn't, this is not a bug. Just use a sensible number of workers.

@irl irl closed this as completed Oct 25, 2017
@britram britram changed the title Missing conditions with too many workers running Tests for observation loss and resource leakage Oct 27, 2017
@britram britram added this to the Release 2.0 _Argyroneta aquatica_ milestone Oct 27, 2017
@britram
Contributor

britram commented Oct 27, 2017

This framing may not have been the best way to ask this question, but it does seem that results are load-dependent, and that the load dependence scales with input size, which suggests resource leakage. We don't have a good baseline or calibration for giving people who are trying to use PATHspider guidance on what "a sensible number of workers" is, and we should have that if we're going to close issues based on it.

So I think what we need here is some idea of how many records go missing under which network, CPU, and memory conditions with what number of workers, which probably means profiling observer loss under a set of possible conditions on relatively constrained DO nodes.

@britram britram reopened this Oct 27, 2017
@britram
Contributor

britram commented Oct 27, 2017

(I may have cycles for this in December, but not before.)

@mirjak
Contributor

mirjak commented Oct 27, 2017

Agreed, we should probably add some more logging/tracking capabilities here as well...

@irl
Member

irl commented Oct 27, 2017

Logging/tracking is probably #155. @mirjak if you have specific requests for logging, please add them there. Otherwise we can just go nuts adding logging (if we have a decent logging setup, logging should be near-zero cost).

@irl
Member

irl commented Oct 27, 2017

Let's say the criterion for closing this is running benchmarks and documenting sensible worker counts for different setups? (Also making sure our defaults are not way off for most users.)
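A minimal sketch of how such a benchmark sweep could be automated, assuming the pspdr measure invocation used later in this thread and one output record per input target; the worker counts, file paths, and the line-count loss heuristic are illustrative assumptions, not anything PATHspider ships with:

#!/usr/bin/env python3
# Hypothetical worker-count sweep: run the ECN plugin repeatedly with
# different -w values and report how many input targets produced no
# output record at all. Paths and worker counts are placeholders.
import subprocess

TARGETS = "/tmp/targets"          # one target per line (assumed format)
WORKER_COUNTS = [10, 20, 30, 50, 100]

with open(TARGETS) as f:
    n_targets = sum(1 for line in f if line.strip())

for workers in WORKER_COUNTS:
    results = "/tmp/results-%d" % workers
    with open(TARGETS) as infile, open(results, "w") as outfile:
        subprocess.run(["pspdr", "measure", "-w", str(workers), "-i", "eth0",
                        "ecn", "--connect", "tcp"],
                       stdin=infile, stdout=outfile, check=True)
    with open(results) as f:
        n_results = sum(1 for line in f if line.strip())
    print("workers=%d: %d of %d targets missing"
          % (workers, n_targets - n_results, n_targets))

Counting missing records is only a rough proxy for the missing "ecn.negotiation" conditions reported above, but it should be enough to compare worker counts on the same box.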

@britram
Contributor

britram commented Oct 27, 2017

SGTM. I'd suggest running these on a smallish DO box (~2 GB RAM) since this is one of the places we'd like it to run. We should also explicitly check short runs (100-1000 targets) against long runs (1000000 targets), since the latter will exhibit any resource leaks (@waltermi has reported this behavior, but it's unclear whether it's just in the traceroute branch at this point).
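For the resource-leak side, a sketch of the kind of monitoring that could accompany a long run; psutil, the 30-second interval, and matching the process by "pspdr" in its command line are all assumptions, not anything PATHspider provides:

#!/usr/bin/env python3
# Hypothetical leak check: sample the resident set size of a running pspdr
# process (plus its children) while a run is in progress. RSS that keeps
# growing over a 1M-target run, compared with a short run, would point at
# a resource leak. Requires the third-party psutil package.
import sys
import time
import psutil

def find_pspdr():
    for proc in psutil.process_iter(["cmdline"]):
        if any("pspdr" in part for part in (proc.info["cmdline"] or [])):
            return proc
    return None

proc = find_pspdr()
if proc is None:
    sys.exit("no running pspdr process found")

while proc.is_running():
    try:
        rss = proc.memory_info().rss
        for child in proc.children(recursive=True):
            rss += child.memory_info().rss
    except psutil.NoSuchProcess:
        break
    print("%s rss=%.1f MiB" % (time.strftime("%H:%M:%S"), rss / 2**20), flush=True)
    time.sleep(30)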

@britram
Contributor

britram commented Oct 27, 2017

n.b. if this is targeting 2.0 and assigned to me, it will cause 2.0 to drop late.

@irl irl modified the milestones: Release 2.0 _Argyroneta aquatica_, Release 2.1 _Latrodectus hasselti_ Oct 27, 2017
@mirjak
Contributor

mirjak commented Oct 30, 2017

I didn't check what's logged so far, but knowing that there was a test but no observation would be really helpful here.

@irl
Member

irl commented Oct 30, 2017

@mirjak: Does #184 look good for that?

@mirjak
Contributor

mirjak commented Oct 30, 2017

Yes; however, I'm not sure if not_observed should be a condition or just something to log... thanks!

@britram
Contributor

britram commented Oct 30, 2017 via email

@britram
Contributor

britram commented Oct 30, 2017 via email

@irl
Member

irl commented Oct 31, 2017

The not_observed condition is now included in the ECN plugin, which should make some automated benchmarking possible, as we just need to analyse the output for unobserved flows.
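A rough sketch of that analysis, assuming newline-delimited JSON results with a "conditions" list per record; the exact condition name used below is an assumption and should be checked against what #184 actually landed:

#!/usr/bin/env python3
# Rough sketch: count result records that report no observation.
# Assumes one JSON object per line with a "conditions" list; the condition
# name below is an assumption to be checked against #184.
import json
import sys

NOT_OBSERVED = "pathspider.not_observed"  # assumed condition name

total = unobserved = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    total += 1
    if NOT_OBSERVED in record.get("conditions", []):
        unobserved += 1

print("%d/%d records unobserved" % (unobserved, total))

This could then be run over a results file, for example: python3 count_unobserved.py < /tmp/results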

@irl
Member

irl commented Jan 10, 2018

pspdr measure -w X -i eth0 ecn --connect tcp < /tmp/targetsp > /tmp/results

Rough numbers:

For 2320 jobs on a DO 2 GB instance, 20 workers gives 20 losses; with more than 30 workers, all but around 300 are lost.

It looks like DO has changed their networking, as you now get a NATed IPv4 address, which may be impacting performance. This could also be Meltdown/Spectre microcode updates hurting the CPU, as I've seen with other cloud providers.

With the same task on my desktop (2x Intel E5640 @ 2.67 GHz, 24 GB RAM, DSL connection), 2321 jobs with 100 workers only gives me 23 losses.
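As loss rates, that works out to roughly 20/2320 ≈ 0.9% at 20 workers and (2320 − ~300)/2320 ≈ 87% above 30 workers on the DO instance, against 23/2321 ≈ 1% at 100 workers on the desktop.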

@irl
Member

irl commented Aug 26, 2018

This is a duplicate of #198

@irl irl unassigned britram Dec 7, 2018