Tests for observation loss and resource leakage #183
This makes me think that flows are being dropped within PATHspider, or packets are being dropped and never captured. Unless you can find a place in PATHspider where this is occurring when it shouldn't, this is not a bug. Just use a sensible number of workers.
This framing may not have been the best way to ask this question, but it does seem that results are load-dependent, and that the load dependency scales with input size, which suggests resource leakage. We don't have a good baseline or calibration for giving people who are trying to use PATHspider guidance on what "a sensible number of workers" is, and we should have that if we're going to close issues based on it. So I think what we need here is some idea of how many records go missing under which network, CPU, and memory conditions with what number of workers, which probably means profiling observer loss for a set of possible conditions on relatively constrained DO nodes.
(I may have cycles for this in December, but not before.)
Agree, we should probably add some more logging/tracking capabilities here as well...
Let's say the criterion for closing this is running benchmarks and documenting sensible worker counts for different setups? (Also making sure our defaults are not way off for most users.)
SGTM. I'd suggest running these on a smallish DO box (~2 GB RAM) since this is one of the places we'd like it to run. We should also explicitly check short runs (100-1000 targets) against long runs (1000000 targets), since the latter will exhibit any resource leaks (@waltermi has reported this behavior, but it's unclear whether it's just in the traceroute branch at this point).
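As a rough sketch of what that benchmark matrix could look like, the loop below runs the ECN plugin over a short and a long target list at several worker counts. The `pathspider` invocation, its flags, and the target-list filenames are placeholders rather than the actual CLI of any particular release; adapt them to the installed version.

```python
#!/usr/bin/env python3
"""Benchmark matrix sketch: short vs. long runs at several worker counts."""

import subprocess

# Hypothetical pre-generated target lists (filenames are placeholders).
TARGET_LISTS = {
    "short-1000": "targets-1000.csv",
    "long-1000000": "targets-1000000.csv",
}
WORKER_COUNTS = [10, 20, 30, 50, 100]

for label, targets in TARGET_LISTS.items():
    for workers in WORKER_COUNTS:
        outfile = f"results-{label}-{workers}w.ndjson"
        # Placeholder invocation: substitute the real PATHspider command
        # line and flags for your installed version here.
        cmd = ["pathspider", "-w", str(workers), "ecn", targets, outfile]
        subprocess.run(cmd, check=True)
        print(f"{label}, {workers} workers -> {outfile}")
```

Each output file could then be scanned for unobserved flows, e.g. with a counter like the one sketched further down the thread.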
n.b. if this is targeting 2.0 and assigned to me, it will cause 2.0 to drop late. |
I didn't check what's logged so far, but knowing that there was a test but no observation would be really helpful here.
Yes; however, I'm not sure if `not_observed` should be a condition or just something to log... thanks!
I'm pretty sure there's a `not_observed` condition, at least for ECN.
Ah, I see I'm wrong.
In any case `not_observed` should *absolutely* be a condition -- metadata calibrating the measurement platform itself needs to be inline or it'll be ignored by later analysis.
The `not_observed` condition is now included in the ECN plugin, which should make some automated benchmarking possible, as we just need to analyse the output for unobserved flows.
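A minimal analysis sketch along those lines, assuming the output is newline-delimited JSON where each record carries a `conditions` list and unobserved flows are tagged with a condition containing `not_observed` (the exact condition name should be checked against the plugin):

```python
#!/usr/bin/env python3
"""Count unobserved flows in PATHspider NDJSON output (sketch)."""

import json
import sys

total = 0
unobserved = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    total += 1
    # Assumption: unobserved flows carry a condition containing 'not_observed'.
    if any("not_observed" in cond for cond in record.get("conditions", [])):
        unobserved += 1

print(f"{unobserved}/{total} flows unobserved")
```

Run as e.g. `python3 count_unobserved.py < results.ndjson` and compare the ratio across worker counts.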
Rough numbers: for 2320 jobs on a DO 2 GB instance, 20 workers gives 20 losses, while more than 30 workers causes all but around 300 to be lost. It looks like DO has changed their networking, as you now get a NATed IPv4 address, which may be impacting performance. This could also be Meltdown/Spectre microcode updates killing CPU performance, as I've seen with other cloud providers. With the same task on my desktop (2x Intel E5640 @ 2.67 GHz, 24 GB RAM, DSL connection), 2321 jobs with 100 workers only gives me 23 losses.
This is a duplicate of #198.
In the ECN plugin, the "ecn.negotiation" information is missing from some measurements when many workers are running. Running fewer workers reduces the number of measurements with missing conditions. The problem also occurred with some custom-built chains.
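To quantify this symptom, a small script can break down which condition groups each result carries; a shortfall in the `ecn.negotiation` group relative to the total then shows how many results lost their negotiation information. This sketch assumes newline-delimited JSON output with a `conditions` list of dotted condition names; both are assumptions about the output format rather than something stated in this issue.

```python
#!/usr/bin/env python3
"""Sketch: per-condition-group breakdown of PATHspider NDJSON output."""

import collections
import json
import sys

total = 0
group_counts = collections.Counter()
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    total += 1
    # Count each condition group (e.g. 'ecn.negotiation') once per result.
    groups = {".".join(cond.split(".")[:2]) for cond in record.get("conditions", [])}
    group_counts.update(groups)

print(f"total results: {total}")
for group, count in group_counts.most_common():
    print(f"{group}: {count} ({total - count} missing)")
```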