-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix intermittent IT test failures for HarvestingClientsIT #10449
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
if (allowHarvestingMissingCVV) { | ||
assertEquals(DATASETS_IN_CONTROL_SET, responseJsonPath.getInt("data.lastDatasetsHarvested")); | ||
} else { | ||
assertTrue(responseJsonPath.getInt("data.lastDatasetsHarvested") < DATASETS_IN_CONTROL_SET); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so the issue seemed to be that it (sometimes?) expected this to be 2 less than the control set, rather than one? @stevenwinship just to understand do we know why it would sometimes want 1 less and sometimes 2 less? (this change, of course, does fix it for either case, just curious what exactly is going on).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure but it looks like a timing issue. It may be harvesting asynchronously. It may be failing before finishing loading the last good one. Locally I never see the issue so it's hard to test.
@landreev Do you know if files are harvested asynchronously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Harvesting is definitely asynchronous. But if I remember correctly, the code in that test was specifically checking/waiting for the harvest to be marked as completed before verifying the number of records harvested, so you would expect that to take the asynchronous nature of it out of the picture.
I would prefer to figure out why that failure to harvest one of the records is happening. The test failing intermittently is not such a big deal, so there's no rush to fix it. OTOH, replacing the check with just a "less than" leaves it open to ignoring more serious issues, IMO?
(and I'd like to know if it is an actual failure - as opposed to the last record still being processed at the time of the check; I don't think the latter is a possibility, because, again, I don't think the results of a harvesting run are written in the database before the run is completed - but I really need to take a closer look at the code before I can make educated guesses.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @landreev. I'd like to really understand what's happening during the failure. It's difficult because I can't get it to fail locally. It only fails in the pipeline, and not every time :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do run into timing issues that only happen during Jenkins runs. Investigating such is time-consuming. Basically, you wait for it to happen again, then go on Jenkins and look for anything meaningful in the full console output from the test, AND in the server.log from the instance that was executing the server side of the exchange. The latter is saved in the workspace directory, but only for the most recent Jenkins build, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also, if it is a timing issue, there is a chance that it would manifest itself locally - if you run the entire suite of tests, just like Jenkins does, rather than just executing the select tests. But, either way, things like this are always time-consuming.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, in my case it was not the same intermittent error we've been seeing w/ Jenkins. I just had that 7th dataset already harvested (via the "single dataset set" from demo), so it was skipped. So, nothing intermittent or random about it.
Still, this may be what is happening w/ Jenkins too. Since we are harvesting the same 7 datasets in both tests, it is conceivable that when they are deleted between tests, one is still there by the time the next harvesting run tries to import it again.
In other words, it's not the asynchronous harvesting, it's the deleteClient()
also being executed asynchronously that may be the problem.
If this is the case, simply adding a few seconds of sleep to the cleanup()
method would fix it.
Also, if this failure to import is indeed on account of the dataset still being in the database, the error message in server.log to look for would be this:
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.api.imports.ImportException (The dataset with the global id doi:xxxx/yy/zzzzz already exists, in the dataverse ...., skipping.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been running the test locally with Dataverse in Docker. Testing most of the IT tests and this test is not failing even with the assert set back to expecting 7. I haven't seen only 6 harvested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be impossible to reproduce outside of Jenkins, for all we know.
This is an example of a Jenkins run where this test failed and no other test run has happened since; i.e., the server.log from the failed run is still there: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/view/change-requests/job/PR-10241/4/execution/node/3/ws/target/server.log
There is no "ImportException ... dataset already exists" message there, like on my instance. But, on a second thought, that would not be the case, since your tests are harvesting into the same collection (meaning, if the dataset still exists, Dataverse attempts to delete, then re-import it). There is indeed a ton of OptimisticLock exceptions right around the time of the second harvesting run that appear to be about a failure to delete. You can tell by the messages about the field missing from CVV that it is the second test run that is failing. No obvious smoking guns in the console output.
Don't have much time to look into this any deeper. (I don't think this issue is worth the time I've already spent looking into it, tbh). But, what I mentioned earlier - adding a few sec. of sleep to the cleanup method - may be something worth trying.
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
Added back the assert for exact harvest count and added an extra second to the wait. |
When/if Jenkins comes back to life, if the tests complete w/out failing 3 times in a row with this extra second added, I'll vote for merging it quickly; if it keeps happening, we'll revisit it then. |
I started a build by hand just now. |
@stevenwinship Could you please sync the branch up with develop, appears to be the reason the last Jenkins run bombed. |
This comment has been minimized.
This comment has been minimized.
one successful run earlier today (build 14). build 15: a really weird failure - the other harvesting test, |
Don has fixed the payara distribution issue. But it's still Jenkins roulette, with the harvesting tests failing half of the times; somehow it's been made worse so far. So no, we are not merging this PR just yet. If I have time I'll look into it some more. |
ok, one mystery solved - I just left a comment in #10464, the PR I merged yesterday; the new errors are the result of the extra check added, which is performed before the async. indexing gets a chance to finish. |
also, just saw that you have fixed it already. |
This comment has been minimized.
This comment has been minimized.
OK, you have fixed the cleanup part - but the search check in line 278 may still bomb for the same async. reason - ? |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
globalIdList.add(jsonPath.getString("data.items["+idx+"].global_id")); | ||
} | ||
// verify count after collecting global ids | ||
assertEquals(expectedNumberOfSetsHarvested, jsonPath.getInt("data.total_count")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenwinship
The new intermittent error, the one introduced in #10464 the other day, is still here - the last Jenkins build has just failed again on this line with expected: <8> but was: <2>
. Just a matter of calling the search engine while the new datasets may still be getting indexed.
Not your fault, strictly speaking, but since we are already working on this test, may as well. Let me just add a second of sleep here and merge it; I'm sure that's all it needs.
|
||
int i = 0; | ||
int maxWait = 20; | ||
String query = "dsPersistentId:" + globalIdList.stream().map(s -> "\""+s+"\"").collect(Collectors.joining(",")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this necessary, btw - what was the problem with "metadataSource:" + nickName
?
Doesn't matter really, as long as it works. But I can't help but feel like there's too much going on in this test class, vs. its useful value. (Strictly speaking, there is no need to harvest the first 7 datasets twice in a row; that second set could just contain the 1 dataset with the CVV, maybe?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still mildly curious why it was necessary to search by individual dataset ids here.
Anyway, I just want to see it pass a few times in a row and merge it without further ado. |
📦 Pushed preview images as
🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name. |
There were more annoying unrelated Jenkins failures. But the 2 harvesting tests have objectively passed three times in a row. Merging. |
Description: Intermittent failures happen during IT tests pointing to HarvestingClientsIT
Closes: #10438
How to test this: Observe other PRs and see if any fail on this test