Columbia University Academic Commons Weeding Project

First created: 2022-11-10.
Version 2 completed: 2022-12-23.
Last edit: 2023-04-18 (README.md)

This project weeded legacy duplicate materials from Columbia University Libraries' Academic Commons in Fall 2022.

Duplicate resources were identified on Academic Commons (AC), the digital scholarship repository of Columbia University. Each duplicate is flagged and pointed to the item that will remain in the repository before it is removed from users' view. Usage stats of the duplicates are merged at the child asset level, and the DOIs of the removed resources are redirected to the items that remain available.

A poster of this project was presented at Code4Lib on 16 March 2023 at Princeton University (download the poster @ OSF). It was also presented as a short session at the Southern Miss Institutional Repository Conference (SMIRC) on 28 April 2023 at the University of Southern Mississippi (download the PDF @ Aquila).

Dupe merge stat flowchart.pdf

Duplicate items on AC carry usage stats, which we wanted to keep and merge into the items that stay published on the platform. This flowchart accompanies version 2 of the script and shows how to merge the stats at the child asset level before unpublishing the duplicate assets together with their parent items.
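As a rough illustration of what merging stats at the child asset level could look like, here is a minimal sketch. It is not taken from the flowchart or from Hyacinth: the function `merge_stats`, the identifiers, and the counts are all hypothetical, and the real merge is carried out on Hyacinth rather than in the notebooks.

```python
from collections import Counter

def merge_stats(stats, mapping):
    """Add each duplicate asset's usage count to the asset that is kept.

    stats:   {asset_id: usage_count}               (hypothetical shape)
    mapping: {duplicate_asset_id: kept_asset_id}   (hypothetical shape)
    """
    merged = Counter(stats)
    for dupe_id, keep_id in mapping.items():
        # Move the duplicate's count onto the kept asset, then drop the duplicate.
        merged[keep_id] += merged.pop(dupe_id, 0)
    return dict(merged)

# Made-up example data: asset "ac:4" duplicates asset "ac:3".
stats = {"ac:3": 120, "ac:4": 30, "ac:7": 5}
mapping = {"ac:4": "ac:3"}
merged = merge_stats(stats, mapping)
```

In this sketch the duplicate's entry disappears from the result, which mirrors the idea that the duplicate is unpublished once its stats have been carried over.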

all_dupe_and_related.ipynb

Starting from the list of duplicate parent items, this script looks up their children (assets) in the repository. The resulting list is exported as 2 CSV files. On Hyacinth, the backend digital object management platform, the exported list is used to merge the stats of the duplicates before they are unpublished. On DataCite, the DOIs of the duplicates are redirected to the appropriate resources.

Outputs:

1 CSV file for Hyacinth
1 CSV file for DataCite

Main Processes:

Import (a) the complete data exported from AC, and (b) the list of current duplicates identified in AC.
Select items from (b) that are marked as duplicates ('Yes dupe').
Look up the duplicates' child assets in the bulk AC data.
Output 2 CSV files, one for Hyacinth and one for DataCite.
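
The steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the notebook's actual code: every column name (`pid`, `parent_pid`, `doi`, `keep_pid`, `status`), the sample rows, the DOIs, and the layout of the two output files are assumptions made for the example.

```python
import csv
import io

# Hypothetical stand-ins for (a) the complete AC export and (b) the duplicate list.
AC_EXPORT = """pid,parent_pid,doi,title
ac:1,,10.1234/a1,Paper A (kept)
ac:2,,10.1234/a2,Paper A (duplicate)
ac:3,ac:1,10.1234/a3,paper_a.pdf
ac:4,ac:2,10.1234/a4,paper_a_copy.pdf
"""
DUPE_LIST = """pid,keep_pid,status
ac:2,ac:1,Yes dupe
ac:9,ac:8,Not dupe
"""

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def select_duplicates(dupe_rows):
    """Keep only rows marked 'Yes dupe'; map duplicate parent -> kept parent."""
    return {r["pid"]: r["keep_pid"] for r in dupe_rows if r["status"] == "Yes dupe"}

def find_children(ac_rows, parent_pids):
    """Return the child assets whose parent is one of the given items."""
    return [r for r in ac_rows if r["parent_pid"] in parent_pids]

def to_csv(rows, fieldnames):
    """Serialize rows to CSV text, ignoring columns not listed in fieldnames."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

ac_rows = read_rows(AC_EXPORT)
dupes = select_duplicates(read_rows(DUPE_LIST))
dupe_children = find_children(ac_rows, set(dupes))

# File 1 (Hyacinth): the duplicate parents together with their child assets.
hyacinth_rows = [r for r in ac_rows if r["pid"] in dupes] + dupe_children
hyacinth_csv = to_csv(hyacinth_rows, ["pid", "parent_pid", "title"])

# File 2 (DataCite): each duplicate's DOI and the DOI it should redirect to.
doi_by_pid = {r["pid"]: r["doi"] for r in ac_rows}
datacite_rows = [{"duplicate_doi": doi_by_pid[d], "redirect_to_doi": doi_by_pid[k]}
                 for d, k in dupes.items()]
datacite_csv = to_csv(datacite_rows, ["duplicate_doi", "redirect_to_doi"])
```

The sketch keeps everything in memory so it can be run as is; with real exports, the same functions would read from and write to files opened with `open(..., newline="")`.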

all_dupe_and_related_v2.ipynb

Version 2 adds a new part, in [21], that maps each duplicate child asset to the equivalent asset that is being kept. Hyacinth uses this mapping to merge usage stats before unpublishing the duplicates. The mapping skips any duplicates that were originally unpublished, as well as metadata XML.

Outputs:

1 additional CSV file for Hyacinth

Main Processes:

Collect the child assets of the parent items that stay published.
Continue working with the duplicate child assets found earlier, skipping those that had never been published on Academic Commons and the duplicate items' metadata XML, which had been published unintentionally.
Map the two sets of assets to each other.
Output a CSV file for Hyacinth.
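
The child-level mapping can be sketched as follows. This is an illustration under stated assumptions, not the notebook's logic: the `Asset` fields, the rule of pairing assets by identical file name under corresponding parents, and the `.xml` check for metadata files are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    pid: str          # asset identifier (hypothetical field)
    parent_pid: str   # identifier of the parent item
    filename: str     # file name used here to pair assets (an assumption)
    published: bool   # whether the asset was published on Academic Commons

def is_metadata_xml(asset):
    """Treat any .xml file as metadata XML (a simplifying assumption)."""
    return asset.filename.lower().endswith(".xml")

def map_assets(dupe_assets, keep_assets, parent_map):
    """Pair each eligible duplicate asset with its kept counterpart.

    parent_map maps a duplicate parent item to the parent item that is kept.
    Returns a list of (duplicate_asset_pid, kept_asset_pid) tuples.
    """
    keep_index = {(a.parent_pid, a.filename): a.pid for a in keep_assets}
    pairs = []
    for asset in dupe_assets:
        # Skip assets that were never published and metadata XML files.
        if not asset.published or is_metadata_xml(asset):
            continue
        keep_parent = parent_map.get(asset.parent_pid)
        keep_pid = keep_index.get((keep_parent, asset.filename))
        if keep_pid is not None:
            pairs.append((asset.pid, keep_pid))
    return pairs

# Made-up example: parent "ac:2" duplicates parent "ac:1".
keep_assets = [Asset("ac:3", "ac:1", "paper.pdf", True)]
dupe_assets = [
    Asset("ac:4", "ac:2", "paper.pdf", True),      # mapped to ac:3
    Asset("ac:5", "ac:2", "mods.xml", True),       # skipped: metadata XML
    Asset("ac:6", "ac:2", "paper.pdf", False),     # skipped: never published
]
pairs = map_assets(dupe_assets, keep_assets, {"ac:2": "ac:1"})
```

Duplicate assets with no counterpart under the kept parent are left out of the result in this sketch; in practice such cases would probably need manual review.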
