fix: optimise GitLab commit history scraper #1374

Closed
wants to merge 13 commits into from

Conversation

cmeessen
Contributor

@cmeessen cmeessen commented Jan 24, 2025

Fixes #777

Changes proposed in this pull request:

  • optimise the GitLab scrapers so that they do not scrape the entire commit history every time they run
  • the scraper uses commit_history_scraped_at as the starting point when checking for new commits
  • the commit history could become incorrect if
    • commits are made between the end of scraping and writing the timestamp to the database
    • the entire commit history was rewritten
  • added delete buttons to reset scraped metadata (commit history, programming languages, repository statistics)
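
The incremental approach described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the Commit record and the IncrementalCommitFilter class are hypothetical names, and the real scraper reads commits from the GitLab API rather than from an in-memory list.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical commit representation, used only for this illustration.
record Commit(String sha, Instant committedAt) {}

final class IncrementalCommitFilter {
    private IncrementalCommitFilter() {}

    // Returns only the commits strictly newer than the last scrape timestamp.
    // A null timestamp means the repository was never scraped, so all commits are kept.
    static List<Commit> newCommitsSince(List<Commit> commits, Instant commitHistoryScrapedAt) {
        if (commitHistoryScrapedAt == null) {
            return List.copyOf(commits);
        }
        return commits.stream()
            .filter(c -> c.committedAt().isAfter(commitHistoryScrapedAt))
            .toList();
    }
}
```

With a filter like this, a second run directly after the first finds no new commits and can stop early. It does not address the two failure modes listed above: a commit landing between the end of scraping and the timestamp write would still be skipped, and a rewritten history would still require resetting the stored data.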

How to test:

  • docker compose down --volumes && docker compose build --parallel && docker compose up
  • log in and create a new software entry that uses a GitLab repository URL, e.g.
    https://codebase.helmholtz.cloud/research-software-directory/RSD-as-a-service
    
  • publish the software entry
  • start the commit scrapers
    docker compose exec scrapers /opt/java/openjdk/bin/java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.git.MainCommits
    
  • the first run will take a while because the full commit history is scraped
  • rerun the commit scrapers; this time the scraping should stop early because there are no commits newer than commit_history_scraped_at

PR Checklist:

  • Increase version numbers in docker-compose.yml
  • Link to a GitHub issue
  • Update documentation
  • Tests

@cmeessen cmeessen force-pushed the 777-optimise-gitlab-scraper branch 2 times, most recently from 1649d23 to 8c84e4c Compare January 24, 2025 15:01
sonarqubecloud bot commented Feb 4, 2025


@paulastock paulastock force-pushed the 777-optimise-gitlab-scraper branch 5 times, most recently from dc11bb5 to eb732f4 Compare March 12, 2025 08:31
@cmeessen cmeessen marked this pull request as ready for review March 12, 2025 08:48
@cmeessen cmeessen force-pushed the 777-optimise-gitlab-scraper branch from ff1b5aa to 7ad0ec3 Compare March 12, 2025 09:23
@cmeessen
Contributor Author

I rebased the branch onto the current main branch.

Collaborator

@ewan-escience ewan-escience left a comment

Thanks for the pull request! I have some comments:

The delete button for the commit history doesn't clear the commit_history_scraped_at field, leading to an incomplete commit history on the next harvest.

Furthermore, I have some comments on the code.

I didn't check the frontend code.


import java.util.UUID;

public record BasicRepositoryDataWithHistory(UUID software, String url, String commitHistoryScrapedAt, CommitsPerWeek commitsPerWeek) {
Collaborator

I would prefer a specialised type for commitHistoryScrapedAt, probably ZonedDateTime. I see this in other places as well, like GitlabScraper. Or do you need strings to make it work with existing code? If so, this can also be changed later.
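
A typed timestamp along the lines suggested here might look like the sketch below. It is an illustration under assumptions: RepositoryScrapeState and its factory method are hypothetical, the commitsPerWeek component from the diff is left out for brevity, and the stored value is assumed to be an ISO-8601 string with an offset.

```java
import java.time.ZonedDateTime;
import java.util.UUID;

// Simplified, hypothetical stand-in for the record in the diff.
record RepositoryScrapeState(UUID software, String url, ZonedDateTime commitHistoryScrapedAt) {

    // Parses an ISO-8601 timestamp with offset (assumed input format).
    // Null or blank input is mapped to null, meaning "never scraped".
    static RepositoryScrapeState of(UUID software, String url, String scrapedAtIso) {
        ZonedDateTime scrapedAt = (scrapedAtIso == null || scrapedAtIso.isBlank())
            ? null
            : ZonedDateTime.parse(scrapedAtIso);
        return new RepositoryScrapeState(software, url, scrapedAt);
    }
}
```

Parsing once at the boundary means a malformed value fails early with a DateTimeParseException instead of travelling through the scraper as an unchecked string.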

return new TreeMap<>(data);
}

public void setData(SortedMap<Instant, Long> data) {
Collaborator

I'm against this method, as the data provided in the parameter might not be consistent with what this class does.

If you really need this (do you?), you should iterate over all entries in the data map and add them with the respective addCommits method.
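
The alternative to a raw setter could be sketched like this. WeeklyCommitCounts is a hypothetical stand-in for CommitsPerWeek, and the bucketing rule (weeks starting Monday 00:00 UTC) is an assumption made for the example; the real class may define weeks differently.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for CommitsPerWeek.
final class WeeklyCommitCounts {
    private final SortedMap<Instant, Long> data = new TreeMap<>();

    // Adds a count to the bucket of the week containing the timestamp.
    // Assumption: weeks start on Monday at 00:00 UTC.
    void addCommits(Instant timestamp, long count) {
        Instant weekStart = ZonedDateTime.ofInstant(timestamp, ZoneOffset.UTC)
            .with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY))
            .truncatedTo(ChronoUnit.DAYS)
            .toInstant();
        data.merge(weekStart, count, Long::sum);
    }

    // Merges previously stored counts by routing every entry through addCommits,
    // so keys that are not aligned to a week start are still bucketed correctly.
    void addAll(Map<Instant, Long> stored) {
        stored.forEach(this::addCommits);
    }

    // Defensive copy, mirroring the `return new TreeMap<>(data)` in the diff.
    SortedMap<Instant, Long> getData() {
        return new TreeMap<>(data);
    }
}
```

Because every entry passes through addCommits, the class keeps control over its own invariant, which a setter that replaces the map wholesale cannot guarantee.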

static final Gson gson = new GsonBuilder()
.enableComplexMapKeySerialization()
.registerTypeAdapter(Instant.class, (JsonSerializer<Instant>) (src, typeOfSrc, context) -> new JsonPrimitive(src.getEpochSecond()))
.create();

public SortedMap<Instant, Long> getData() {
Collaborator

Do you really need this other than for testing? If not, it should be package-private (i.e. no visibility modifier), but I'm open to arguments for making this public. :)

@paulastock paulastock force-pushed the 777-optimise-gitlab-scraper branch 2 times, most recently from 77f6105 to 8faee73 Compare March 13, 2025 10:51
Contributor

@dmijatovic dmijatovic left a comment

@paulastock Nice work for the first PR!

I have some suggestions concerning the frontend code. I would appreciate it if you could join one of our standups to discuss them. I have my suggestions in code and could push an additional commit with them if you like.

@dmijatovic dmijatovic self-requested a review March 14, 2025 11:06
Contributor

@dmijatovic dmijatovic left a comment

Please double-check my frontend suggestions.

@cmeessen cmeessen force-pushed the 777-optimise-gitlab-scraper branch 2 times, most recently from 7963756 to 1df7834 Compare March 17, 2025 11:59
@cmeessen cmeessen force-pushed the 777-optimise-gitlab-scraper branch from 1df7834 to 876f633 Compare March 17, 2025 12:00
@cmeessen cmeessen mentioned this pull request Mar 17, 2025
4 tasks
@cmeessen
Contributor Author

Closing this one due to issues with the history. Moving to #1432.

@cmeessen cmeessen closed this Mar 17, 2025
Development

Successfully merging this pull request may close these issues.

Optimise GitLab scraper
4 participants