This repository analyzes the extent to which preprints hosted on the Open Science Framework (OSF) link to external versions of record (VOR). The goal is to understand how many OSF preprints point to external versions that live somewhere other than OSF, including copies on other preprint servers or versions eventually published in academic journals.
View the full analysis report here
The report is automatically rendered and published via GitHub Actions whenever changes are pushed to the main branch.
Using the OpenAlex API, this project:
- Extracts preprint DOIs from the OSF database
- Queries OpenAlex for location and version information
- Identifies external versions of record (non-OSF locations)
- Analyzes open access availability of external versions
From a sample of 10,000 OSF preprints:
- 99.4% successfully looked up in OpenAlex
- 3.7% have at least one external location (non-OSF)
- 47.5% of those with external locations have open access versions
- 1.7% of all preprints have an external open access version
.
├── openalex.r # Main analysis script
├── osf_preprint_vor_analysis.qmd # Quarto report with full methodology and results
├── data/
│ ├── doi_lookup_summary.csv # Summary statistics per preprint
│ └── doi_lookup_results.csv # Detailed external location data
├── logs/
│ └── doi_lookup_status.csv # API request success/failure log
└── .github/
└── workflows/
└── render-quarto.yml # GitHub Actions workflow for automatic rendering
dplyr- Data manipulationhttr2- HTTP requestscosr- OSF database access (internal)jsonlite- JSON parsingreadr- CSV file I/Otidyr- Data tidyingpurrr- Functional programming
- Access to OSF database (via
cosrpackage and local files) - Internet connection for OpenAlex API queries
# Source the main script
source("openalex.r")
# Run with default sample size (10,000)
openalex_doi_search(sample_size = 10000)
# Run on full dataset (this may take considerable time)
openalex_doi_search(sample_size = NULL)# In R
quarto::quarto_render("osf_preprint_vor_analysis.qmd")Or from the command line:
quarto render osf_preprint_vor_analysis.qmdThe repository is configured with GitHub Actions to automatically:
- Render the Quarto document whenever changes are pushed to
main - Publish the HTML output to GitHub Pages
To enable this for your fork:
- Go to your repository's Settings → Pages
- Under "Source", select GitHub Actions
- Push changes to the
mainbranch to trigger the workflow
The analysis focuses on OSF preprints that are:
- Public and not deleted
- In "accepted" machine state
- Published
- Without an existing
article_doifield (to find potential external VORs)
- Extract preprint DOIs from OSF database
- Make parallel API requests to OpenAlex (throttled to 100/min)
- Save JSON responses and log success/failure
- Parse location data from successful responses
- Filter out locations pointing back to OSF
- Generate summary and detailed CSV files
doi_lookup_summary.csv: One row per preprint with counts of total and external locations, plus OA statusdoi_lookup_results.csv: One row per external location with detailed metadata (version, access status, URL)doi_lookup_status.csv: Log of all API requests (success/failure, status codes)
With adequate permissions, these files can be accessed on Google Drive for further exploration. Please contact for access.