Fix intermittent IT test failures for HarvestingClientsIT #10449

stevenwinship · 2024-03-29T15:00:57Z

Description: Intermittent failures happen during IT tests pointing to HarvestingClientsIT

How to test this: Observe other PRs and see if any fail on this test

scolapasta · 2024-04-01T20:27:11Z

src/test/java/edu/harvard/iq/dataverse/api/HarvestingClientsIT.java

+                if (allowHarvestingMissingCVV) {
+                    assertEquals(DATASETS_IN_CONTROL_SET, responseJsonPath.getInt("data.lastDatasetsHarvested"));
+                } else {
+                    assertTrue(responseJsonPath.getInt("data.lastDatasetsHarvested") < DATASETS_IN_CONTROL_SET);


OK, so the issue seemed to be that it (sometimes?) expected this to be 2 less than the control set, rather than one? @stevenwinship just to understand do we know why it would sometimes want 1 less and sometimes 2 less? (this change, of course, does fix it for either case, just curious what exactly is going on).

I'm not 100% sure but it looks like a timing issue. It may be harvesting asynchronously. It may be failing before finishing loading the last good one. Locally I never see the issue so it's hard to test.

@landreev Do you know if files are harvested asynchronously?

Harvesting is definitely asynchronous. But if I remember correctly, the code in that test was specifically checking/waiting for the harvest to be marked as completed before verifying the number of records harvested, so you would expect that to take the asynchronous nature of it out of the picture.

I would prefer to figure out why that failure to harvest one of the records is happening. The test failing intermittently is not such a big deal, so there's no rush to fix it. OTOH, replacing the check with just a "less than" leaves it open to ignoring more serious issues, IMO?

(and I'd like to know if it is an actual failure - as opposed to the last record still being processed at the time of the check; I don't think the latter is a possibility, because, again, I don't think the results of a harvesting run are written in the database before the run is completed - but I really need to take a closer look at the code before I can make educated guesses.)

Thanks @landreev. I'd like to really understand what's happening during the failure. It's difficult because I can't get it to fail locally. It only fails in the pipeline, and not every time :(

We do run into timing issues that only happen during Jenkins runs. Investigating such is time-consuming. Basically, you wait for it to happen again, then go on Jenkins and look for anything meaningful in the full console output from the test, AND in the server.log from the instance that was executing the server side of the exchange. The latter is saved in the workspace directory, but only for the most recent Jenkins build, I believe.

(also, if it is a timing issue, there is a chance that it would manifest itself locally - if you run the entire suite of tests, just like Jenkins does, rather than just executing the select tests. But, either way, things like this are always time-consuming.)

I just had that same failure (6 out of 7 records) on my own dev. instance. This was outside the restassured test, I just recreated the client and ran the harvest in the UI. I'll take a look at the errors.

Unfortunately, in my case it was not the same intermittent error we've been seeing w/ Jenkins. I just had that 7th dataset already harvested (via the "single dataset set" from demo), so it was skipped. So, nothing intermittent or random about it.
Still, this may be what is happening w/ Jenkins too. Since we are harvesting the same 7 datasets in both tests, it is conceivable that when they are deleted between tests, one is still there by the time the next harvesting run tries to import it again.

In other words, it's not the asynchronous harvesting, it's the deleteClient() also being executed asynchronously that may be the problem.
If this is the case, simply adding a few seconds of sleep to the cleanup() method would fix it.

Also, if this failure to import is indeed on account of the dataset still being in the database, the error message in server.log to look for would be this:
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.api.imports.ImportException (The dataset with the global id doi:xxxx/yy/zzzzz already exists, in the dataverse ...., skipping.)

I've been running the test locally with Dataverse in Docker. Testing most of the IT tests and this test is not failing even with the assert set back to expecting 7. I haven't seen only 6 harvested

It may be impossible to reproduce outside of Jenkins, for all we know.

This is an example of a Jenkins run where this test failed and no other test run has happened since; i.e., the server.log from the failed run is still there: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/view/change-requests/job/PR-10241/4/execution/node/3/ws/target/server.log

There is no "ImportException ... dataset already exists" message there, like on my instance. But, on a second thought, that would not be the case, since your tests are harvesting into the same collection (meaning, if the dataset still exists, Dataverse attempts to delete, then re-import it). There is indeed a ton of OptimisticLock exceptions right around the time of the second harvesting run that appear to be about a failure to delete. You can tell by the messages about the field missing from CVV that it is the second test run that is failing. No obvious smoking guns in the console output.

Don't have much time to look into this any deeper. (I don't think this issue is worth the time I've already spent looking into it, tbh). But, what I mentioned earlier - adding a few sec. of sleep to the cleanup method - may be something worth trying.

stevenwinship · 2024-04-09T18:16:55Z

Added back the assert for exact harvest count and added an extra second to the wait.

landreev · 2024-04-09T22:31:31Z

When/if Jenkins comes back to life, if the tests complete w/out failing 3 times in a row with this extra second added, I'll vote for merging it quickly; if it keeps happening, we'll revisit it then.

landreev · 2024-04-10T14:35:56Z

I started a build by hand just now.

landreev · 2024-04-10T16:04:31Z

@stevenwinship Could you please sync the branch up with develop, appears to be the reason the last Jenkins run bombed.

landreev · 2024-04-10T23:27:40Z

one successful run earlier today (build 14). build 15: a really weird failure - the other harvesting test, testHarvestingClientRun_AllowHarvestingMissingCVV_True, 4 datasets harvested instead of 8. (??)

landreev · 2024-04-11T14:50:46Z

Don has fixed the payara distribution issue. But it's still Jenkins roulette, with the harvesting tests failing half of the times; somehow it's been made worse so far. So no, we are not merging this PR just yet. If I have time I'll look into it some more.

landreev · 2024-04-11T16:12:55Z

ok, one mystery solved - I just left a comment in #10464, the PR I merged yesterday; the new errors are the result of the extra check added, which is performed before the async. indexing gets a chance to finish.

landreev · 2024-04-11T16:14:06Z

also, just saw that you have fixed it already.

landreev · 2024-04-11T16:23:44Z

also, just saw that you have fixed it already.

OK, you have fixed the cleanup part - but the search check in line 278 may still bomb for the same async. reason - ?
... I'll shut up and let you handle this.

landreev · 2024-04-12T19:21:20Z

src/test/java/edu/harvard/iq/dataverse/api/HarvestingClientsIT.java

+            globalIdList.add(jsonPath.getString("data.items["+idx+"].global_id"));
+        }
+        // verify count after collecting global ids
+        assertEquals(expectedNumberOfSetsHarvested, jsonPath.getInt("data.total_count"));


@stevenwinship
The new intermittent error, the one introduced in #10464 the other day, is still here - the last Jenkins build has just failed again on this line with expected: <8> but was: <2>. Just a matter of calling the search engine while the new datasets may still be getting indexed.

Not your fault, strictly speaking, but since we are already working on this test, may as well. Let me just add a second of sleep here and merge it; I'm sure that's all it needs.

landreev · 2024-04-12T19:35:12Z

src/test/java/edu/harvard/iq/dataverse/api/HarvestingClientsIT.java

+
+            int i = 0;
+            int maxWait = 20;
+            String query = "dsPersistentId:" + globalIdList.stream().map(s -> "\""+s+"\"").collect(Collectors.joining(","));


Why was this necessary, btw - what was the problem with "metadataSource:" + nickName?

Doesn't matter really, as long as it works. But I can't help but feel like there's too much going on in this test class, vs. its useful value. (Strictly speaking, there is no need to harvest the first 7 datasets twice in a row; that second set could just contain the 1 dataset with the CVV, maybe?)

I'm still mildly curious why it was necessary to search by individual dataset ids here.

landreev · 2024-04-12T19:40:22Z

Anyway, I just want to see it pass a few times in a row and merge it without further ado.

github-actions · 2024-04-12T19:44:08Z

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10438-fix-intermittent-harvesting-tests

ghcr.io/gdcc/configbaker:10438-fix-intermittent-harvesting-tests

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

landreev · 2024-04-12T22:22:39Z

There were more annoying unrelated Jenkins failures. But the 2 harvesting tests have objectively passed three times in a row. Merging.

add order to test

4fe9053

stevenwinship self-assigned this Mar 29, 2024

stevenwinship linked an issue Mar 29, 2024 that may be closed by this pull request

Fix intermittent failures in Harvesting IT tests #10438

Closed

stevenwinship added Size: 3 A percentage of a sprint. 2.1 hours. Size: 0.5 A percentage of a sprint. 0.35 hours and removed Size: 0.5 A percentage of a sprint. 0.35 hours labels Mar 29, 2024

This comment has been minimized.

Sign in to view

add order to test

30356fd

This comment has been minimized.

Sign in to view

fix test

642b659

This comment has been minimized.

Sign in to view

stevenwinship added 2 commits March 29, 2024 17:47

fix test

2ed51c2

fix test

38d565f

This comment has been minimized.

Sign in to view

stevenwinship removed their assignment Apr 1, 2024

scolapasta reviewed Apr 1, 2024

View reviewed changes

scolapasta self-assigned this Apr 1, 2024

stevenwinship added 2 commits April 9, 2024 12:30

fixing test

627bd58

fixing test

2669bcc

This comment has been minimized.

Sign in to view

Merge branch 'develop' into 10438-fix-intermittent-harvesting-tests

df9ce9d

This comment has been minimized.

Sign in to view

stevenwinship self-assigned this Apr 11, 2024

stevenwinship unassigned scolapasta Apr 11, 2024

adding better cleanup between tests

e22501d

This comment has been minimized.

Sign in to view

refine cleanup

7db469d

This comment has been minimized.

Sign in to view

refine cleanup

f4f57d9

This comment has been minimized.

Sign in to view

stevenwinship removed their assignment Apr 12, 2024

stevenwinship requested review from landreev and scolapasta April 12, 2024 13:52

landreev self-assigned this Apr 12, 2024

landreev reviewed Apr 12, 2024

View reviewed changes

An extra sec. for indexing

03aa898

landreev reviewed Apr 12, 2024

View reviewed changes

landreev approved these changes Apr 12, 2024

View reviewed changes

landreev merged commit 0b404c8 into develop Apr 12, 2024
18 of 19 checks passed

pdurbin added this to the 6.3 milestone Apr 16, 2024

landreev removed their assignment May 2, 2024

Fix intermittent IT test failures for HarvestingClientsIT #10449

Fix intermittent IT test failures for HarvestingClientsIT #10449

Conversation

stevenwinship commented Mar 29, 2024 • edited Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

landreev Apr 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

This comment has been minimized.

This comment has been minimized.

stevenwinship commented Apr 9, 2024

landreev commented Apr 9, 2024

landreev commented Apr 10, 2024

landreev commented Apr 10, 2024

This comment has been minimized.

landreev commented Apr 10, 2024

landreev commented Apr 11, 2024

landreev commented Apr 11, 2024

landreev commented Apr 11, 2024

This comment has been minimized.

landreev commented Apr 11, 2024

This comment has been minimized.

This comment has been minimized.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

landreev commented Apr 12, 2024

github-actions bot commented Apr 12, 2024

landreev commented Apr 12, 2024

stevenwinship commented Mar 29, 2024 •

edited

Loading

landreev Apr 9, 2024 •

edited

Loading