Update of UCL discovery telescope
keegansmith21 authored and jdddog committed Jul 28, 2023
1 parent 48ee6e4 commit 25e8b16
Showing 18 changed files with 830 additions and 676 deletions.
56 changes: 48 additions & 8 deletions docs/oaebu_workflows/telescopes/ucl_discovery.md
@@ -2,36 +2,76 @@

UCL Discovery is UCL's open access repository, showcasing and providing access to the full texts of UCL research publications.

The metadata for all eprints is obtained from the publicly available CSV file (https://discovery.ucl.ac.uk/cgi/search/advanced).
Additionally, for each eprint, the total downloads and downloads per country are gathered from the publicly available stats
(https://discovery.ucl.ac.uk/cgi/stats/report).
## [The Google Sheet](https://docs.google.com/spreadsheets/d/1YqU8m3xY4QvjmUhx215VtWr-HZ7NEAOHjYMEMEkN89A/edit#gid=614610019)
UCL's titles are referenced via their identifier - the eprint ID. Their metadata maps the eprint ID to an ISBN13, but not consistently. For this reason, we forgo the use of their metadata and instead employ a semi-manual process to reliably map the two identifiers.
The telescope references a Google Sheet that lists all of the titles available in the UCL Discovery repository under the following headings:

| Heading | Description |
| ------------------- | -------------------------------- |
| ISBN13 | The title's ISBN13 |
| date | The date of publication |
| title_list_title | The title of the publication |
| discovery_eprint_id | The eprint ID of the publication |

Some notes:
- These headings are hardcoded into the telescope. Any change to the headings in the sheet will break the telescope unless the telescope is updated first.
- Entries without a publication date or with a publication date in the future (where the current time is determined by the Airflow scheduler) will be ignored.
- Entries missing either an ISBN13 or eprint ID will be ignored.

For the aforementioned reasons, it is important that **the Google Sheet remains up to date**. Otherwise, the usage for a title may be missed and require a rerun.
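
The filtering rules in the notes above can be sketched as follows. This is a minimal illustration and not the telescope's actual code: the function name `usable_rows`, the `YYYY-MM-DD` date format and the sample rows are all assumptions. Only the heading names come from the table above.

```python
from datetime import date, datetime


def usable_rows(rows, run_date):
    """Return the sheet rows that would not be ignored under the notes above.

    `rows` is a list of dicts keyed by the sheet headings. `run_date` stands in
    for the current time as determined by the scheduler.
    """
    kept = []
    for row in rows:
        isbn = (row.get("ISBN13") or "").strip()
        eprint_id = (row.get("discovery_eprint_id") or "").strip()
        raw_date = (row.get("date") or "").strip()
        if not isbn or not eprint_id or not raw_date:
            continue  # missing ISBN13, eprint ID or publication date
        # Assumption: the sheet stores dates as YYYY-MM-DD.
        published = datetime.strptime(raw_date, "%Y-%m-%d").date()
        if published > run_date:
            continue  # publication date is in the future
        kept.append(row)
    return kept


# Hypothetical sample rows, for illustration only.
rows = [
    {"ISBN13": "9780000000001", "date": "2020-01-15", "title_list_title": "A", "discovery_eprint_id": "100"},
    {"ISBN13": "", "date": "2020-01-15", "title_list_title": "B", "discovery_eprint_id": "101"},
    {"ISBN13": "9780000000003", "date": "", "title_list_title": "C", "discovery_eprint_id": "102"},
    {"ISBN13": "9780000000004", "date": "2999-01-01", "title_list_title": "D", "discovery_eprint_id": "103"},
]
print([r["discovery_eprint_id"] for r in usable_rows(rows, date(2023, 7, 1))])
```

Only the first sample row survives: the others lack an ISBN13, lack a date, or are dated in the future. How a publication date equal to the run date is treated is not specified here, so the sketch keeps it.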

### Access
Access to the sheet can be granted using the sheet UI (*Share* at the top right of the page). The telescope will access the sheet via a service account, which will need to be given read access (*Viewer*) by supplying the account's email address.

## Usage API
UCL Discovery provides free and open access to their usage REST API. Unfortunately, no documentation on its use and design appears to be available. We utilise two endpoints:
- **Countries URI** = https://discovery.ucl.ac.uk/cgi/stats/get?from=[YYYYMMDD]&to=[YYYYMMDD]&irs2report=eprint&set_name=eprint&set_value=[EPRINT_ID]&datatype=countries&top=countries&view=Table&limit=all&export=JSON
- **Totals URI** = https://discovery.ucl.ac.uk/cgi/stats/get?from=[YYYYMMDD]&to=[YYYYMMDD]&irs2report=eprint&set_name=eprint&set_value=[EPRINT_ID]&datatype=downloads&graph_type=column&view=Google%3A%3AGraph&date_resolution=month&title=Download+activity+-+last+12+months&export=JSON

Where *from*, *to* and *set_value* are appropriately set.
The countries URI returns statistics pertaining to the number of downloads of the provided eprint ID broken down by country.
The totals URI returns statistics pertaining to the number of downloads of the provided eprint ID aggregated over all regions.
It should be noted that the *totals* data is not necessarily a simple aggregation of the *countries* data. This is because country data is omitted for downloads that are not attributed to a region. It is therefore not uncommon to have a total download count (derived from the totals URI) that is greater than the sum of all downloads from all listed countries (from the countries URI).
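
As a rough sketch, the two URIs above can be assembled with the standard library. The query parameters are copied from the URIs listed above. The helper names, the example eprint ID `12345` and the date range are hypothetical, and `unattributed_downloads` simply illustrates the gap between the totals and countries figures described above.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://discovery.ucl.ac.uk/cgi/stats/get"


def _common_params(eprint_id, start, end):
    # Parameters shared by both endpoints; dates are formatted as YYYYMMDD.
    return {
        "from": start.strftime("%Y%m%d"),
        "to": end.strftime("%Y%m%d"),
        "irs2report": "eprint",
        "set_name": "eprint",
        "set_value": eprint_id,
    }


def countries_uri(eprint_id, start, end):
    params = _common_params(eprint_id, start, end)
    params.update(
        {"datatype": "countries", "top": "countries", "view": "Table", "limit": "all", "export": "JSON"}
    )
    return f"{BASE_URL}?{urlencode(params)}"


def totals_uri(eprint_id, start, end):
    params = _common_params(eprint_id, start, end)
    params.update(
        {
            "datatype": "downloads",
            "graph_type": "column",
            "view": "Google::Graph",  # urlencode escapes this to Google%3A%3AGraph
            "date_resolution": "month",
            "title": "Download activity - last 12 months",
            "export": "JSON",
        }
    )
    return f"{BASE_URL}?{urlencode(params)}"


def unattributed_downloads(total, country_counts):
    # Downloads counted in the total but not attributed to any listed country.
    return total - sum(country_counts)


# Hypothetical eprint ID and date range.
print(countries_uri("12345", date(2023, 6, 1), date(2023, 6, 30)))
print(totals_uri("12345", date(2023, 6, 1), date(2023, 6, 30)))
```

Building the query from a dictionary keeps the fixed parameters in one place and leaves the escaping of values such as `Google::Graph` to `urlencode`.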

## Telescope Workflow
The telescope's workflow can be broken down as such:

### Download
Acquires the eprint IDs and publication dates from [[The Google Sheet]]. For each ID whose publication date is before the current scheduled run date, downloads the country and totals data. The downloaded data is then uploaded to the GCS download bucket.

### Transform
Acquires the eprint IDs, ISBN13s and titles from [[The Google Sheet]]. For each ID, loads the downloaded data (both countries and totals) into a single data structure and includes the title (whether it is empty or not does not matter - the title exists for completeness only). Adds an additional field to each row - the *release_date*, which is determined by the scheduled runtime. The transformed structure is then uploaded to the GCS transform bucket.
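
A minimal sketch of the merge described in this step is shown below. It assumes the downloaded data has already been parsed into a per-country mapping and a single total, because the shape of the raw API response is not documented here. The function name `transform_row` and the sample values are hypothetical, and the output keys are illustrative, loosely following `ucl_discovery.json` in this commit.

```python
from datetime import date


def transform_row(sheet_row, country_counts, total_downloads, release_date):
    """Combine one sheet row with its country and totals data into a single record."""
    return {
        "ISBN": sheet_row["ISBN13"],
        "eprint_id": sheet_row["discovery_eprint_id"],
        # The title is carried along for completeness only and may be empty.
        "title": sheet_row.get("title_list_title", ""),
        "total_downloads": total_downloads,
        "country": [
            {"value": code, "count": count}
            for code, count in sorted(country_counts.items())
        ],
        # Set from the scheduled run time, not from the usage data itself.
        "release_date": release_date.isoformat(),
    }


# Hypothetical inputs, for illustration only.
sheet_row = {
    "ISBN13": "9780000000001",
    "date": "2020-01-15",
    "title_list_title": "",
    "discovery_eprint_id": "100",
}
record = transform_row(sheet_row, {"GB": 7, "AU": 3}, 12, date(2023, 7, 1))
print(record)
```

In this example the total (12) exceeds the sum of the country counts (10), which is the situation the Usage API section warns about.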

### BQ Load
Load the table into BigQuery and partition on the *release_date*.


# Run Summary

The corresponding table in BigQuery is `ucl.ucl_discoveryYYYYMMDD`.

```eval_rst
+------------------------------+---------+
| Summary | |
+==============================+=========+
| Average runtime | 10 min |
| Average runtime | 2 min |
+------------------------------+---------+
| Average download size | 1.5 MB |
+------------------------------+---------+
| Harvest Type |CSV + API|
| Harvest Type | API |
+------------------------------+---------+
| Harvest Frequency | Monthly |
+------------------------------+---------+
| Runs on remote worker | False |
+------------------------------+---------+
| Catchup missed runs | True |
+------------------------------+---------+
| Table Write Disposition | Truncate|
| Table Write Disposition | Append |
+------------------------------+---------+
| Update Frequency | Daily |
+------------------------------+---------+
| Credentials Required | No |
+------------------------------+---------+
| Uses Telescope Template | Snapshot|
+------------------------------+---------+
| Each shard includes all data | No |
+------------------------------+---------+
```
24 changes: 22 additions & 2 deletions oaebu_workflows/database/schema/book_product.json
@@ -424,6 +424,26 @@
"name": "irus_fulcrum_metadata",
"type": "RECORD",
"description": "Metadata derived from IRUS Fulcrum"
},
{
"fields": [
{
"mode": "NULLABLE",
"name": "ISBN13",
"type": "STRING",
"description": "ISBN of the book"
},
{
"mode": "NULLABLE",
"name": "eprint_id",
"type": "STRING",
"description": "The UCL Discovery eprint ID"
}
],
"mode": "NULLABLE",
"name": "ucl_discovery_metadata",
"type": "RECORD",
"description": "Metadata derived from UCL Discovery"
}
],
"mode": "NULLABLE",
@@ -1047,13 +1067,13 @@
},
{
"mode": "NULLABLE",
"name": "download_count",
"name": "country_downloads",
"type": "INTEGER",
"description": "Number of downloads for the given country"
}
],
"mode": "REPEATED",
"name": "downloads_per_country",
"name": "country",
"type": "RECORD",
"description": "Number of downloads per country"
}
4 changes: 2 additions & 2 deletions oaebu_workflows/database/schema/irus_fulcrum.json
@@ -66,13 +66,13 @@
"type": "RECORD",
"fields": [
{
"description": "The country name of the client registered by oapen irus uk.",
"description": "The country name of the client registered by IRUS.",
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
},
{
"description": "The country code of the client registered by oapen irus uk.",
"description": "The country code of the client registered by IRUS.",
"mode": "NULLABLE",
"name": "code",
"type": "STRING"
212 changes: 54 additions & 158 deletions oaebu_workflows/database/schema/ucl_discovery.json
@@ -1,195 +1,91 @@
[
{
"description": "Eprint id.",
"description": "ISBN13 of the book.",
"mode": "REQUIRED",
"name": "eprintid",
"name": "ISBN",
"type": "STRING"
},
{
"description": "Title of the book.",
"description": "eprint ID of the book.",
"mode": "REQUIRED",
"name": "book_title",
"type": "STRING"
},
{
"description": "Family name of the creators",
"mode": "REPEATED",
"name": "creators_name_family",
"type": "STRING"
},
{
"description": "Given name of the creators",
"mode": "REPEATED",
"name": "creators_name_given",
"type": "STRING"
},
{
"description": "Info on whether the book is published",
"mode": "NULLABLE",
"name": "ispublished",
"type": "STRING"
},
{
"description": "Subjects",
"mode": "REPEATED",
"name": "subjects",
"type": "STRING"
},
{
"description": "Divisions",
"mode": "REPEATED",
"name": "divisions",
"type": "STRING"
},
{
"description": "Keywords",
"mode": "REPEATED",
"name": "keywords",
"type": "STRING"
},
{
"description": "Abstract",
"mode": "NULLABLE",
"name": "abstract",
"type": "STRING"
},
{
"description": "Date",
"mode": "NULLABLE",
"name": "date",
"type": "STRING"
},
{
"description": "Publisher",
"mode": "NULLABLE",
"name": "publisher",
"type": "STRING"
},
{
"description": "Official URL",
"mode": "NULLABLE",
"name": "official_url",
"type": "STRING"
},
{
"description": "OA status",
"mode": "NULLABLE",
"name": "oa_status",
"type": "STRING"
},
{
"description": "Language",
"mode": "NULLABLE",
"name": "language",
"type": "STRING"
},
{
"description": "DOI",
"mode": "NULLABLE",
"name": "doi",
"name": "eprint_id",
"type": "STRING"
},
{
"description": "ISBN",
"mode": "NULLABLE",
"name": "isbn",
"type": "STRING"
},
{
"description": "ISBN10",
"description": "Title of the book.",
"mode": "NULLABLE",
"name": "isbn10",
"name": "title",
"type": "STRING"
},
{
"description": "Language elements",
"description": "Timescale of the statistics as reported by the origin.",
"mode": "NULLABLE",
"name": "language_elements",
"type": "STRING"
"name": "timescale",
"type": "RECORD",
"fields": [
{
"description": "Format of the 'to' and 'from' fields",
"mode": "NULLABLE",
"name": "format",
"type": "STRING"
},
{
"description": "Beginning of date range for the statistics",
"mode": "NULLABLE",
"name": "from",
"type": "STRING"
},
{
"description": "End of date range for the statistics",
"mode": "NULLABLE",
"name": "to",
"type": "STRING"
}
]
},
{
"description": "Series",
"mode": "NULLABLE",
"name": "series",
"type": "STRING"
},
{
"description": "Page range",
"description": "Origin of the statistics",
"mode": "NULLABLE",
"name": "pagerange",
"type": "STRING"
"name": "origin",
"type": "RECORD",
"fields": [
{
"description": "The URL of the origin",
"mode": "NULLABLE",
"name": "url",
"type": "STRING"
},
{
"description": "The name of the origin",
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
}
]
},
{
"description": "Pages",
"description": "The aggregated statistics for the reported period",
"mode": "NULLABLE",
"name": "pages",
"type": "INTEGER"
},
{
"description": "Family name of the editors",
"mode": "REPEATED",
"name": "editors_name_family",
"type": "STRING"
},
{
"description": "Given name of the editors",
"mode": "REPEATED",
"name": "editors_name_given",
"type": "STRING"
},
{
"description": "Family name of the lyricists",
"mode": "REPEATED",
"name": "lyricists_name_family",
"type": "STRING"
},
{
"description": "Given name of the lyricists",
"mode": "REPEATED",
"name": "lyricists_name_given",
"type": "STRING"
},
{
"description": "Begin date.",
"mode": "REQUIRED",
"name": "begin_date",
"type": "DATE"
},
{
"description": "End date.",
"mode": "REQUIRED",
"name": "end_date",
"type": "DATE"
},
{
"description": "Number of downloads",
"mode": "REQUIRED",
"name": "total_downloads",
"type": "INTEGER"
},
{
"description": "Number of downloads per country",
"description": "The aggregated statistics for each reported country",
"mode": "REPEATED",
"name": "downloads_per_country",
"name": "country",
"type": "RECORD",
"fields": [
{
"description": "Number of downloads for the given country",
"description": "The two letter country code.",
"mode": "NULLABLE",
"name": "download_count",
"type": "INTEGER"
},
{
"description": "Country name",
"mode": "NULLABLE",
"name": "country_name",
"name": "value",
"type": "STRING"
},
{
"description": "Country code",
"description": "The total number of item downloads for the reported period from this country.",
"mode": "NULLABLE",
"name": "country_code",
"type": "STRING"
"name": "count",
"type": "INTEGER"
}
]
},