Measuring Difficulty: Conceptual Foreknowledge Requirements and Coherence in Scientific Literature

Done as part of fulfilling the course requirements for CPSC503 (UBC).

Abstract

In the last two decades, the reading burden on researchers has increased significantly. However, assistive tools and underlying methods for navigating the vast array of published material have not kept up with this growth. We present both a global LDA-based coherence scoring method and a supervised concept extraction framework. Together, these offer a promising solution for assessing readability and conceptual foreknowledge requirements, which is applicable in supporting tools for researchers, most notably recommender systems. We evaluate our approach on large data sets of scientific literature and contrast our coherence scoring method with traditional "shallow" readability scoring. In addition to our results, we provide a design for an open-source tool to record data on users as they browse papers.

Data Sources

Getting Started

Create a virtual environment using Python 3.6 or higher

python -m venv venv
source venv/bin/activate
pip install -U setuptools pip

Install dependencies from the requirements.txt file

pip install -r requirements.txt

Run the scripts. Most scripts print a help menu when passed the -h flag, for example:

python scripts/compute_readability_scores.py -h
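For readers unfamiliar with "shallow" readability scoring, the following is a minimal sketch of one classic surface-level formula, Flesch Reading Ease. It is an illustration only: this README does not state which formulas compute_readability_scores.py implements, and the function names and the vowel-group syllable heuristic below are assumptions made for the example.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing 'e' as silent (e.g. "score"), except in "-le" endings.
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Scores of this kind depend only on sentence and word length, which is why they are described as shallow: they say nothing about the concepts a reader must already know.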

Analysis

The analysis pipeline is split into 2 stages: metadata collection and stats computation.

Metadata collection

Prior to running the metadata collection, a MySQL instance loaded with the PubMed knowledge graph database dump should be up and running so that the pipeline can pull from it.

gunzip < pubmed19.sql.gz | mysql -h <HOSTNAME> pubmed_kg -p

The metadata can be generated with the following

source venv/bin/activate
snakemake -s workflows/metadata.snakefile --jobs 1

Text Conversion and Statistics

The PMC XML files should be uncompressed into a folder with the following structure: */*.xml. The top-level directory can then be symlinked under the data directory of this repository.

cd data
ln -s /path/to/folder/above/xml/folders pmc_articles
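Before launching the pipeline, it can help to confirm that the symlinked folder really follows the */*.xml layout. The sketch below groups article files by their parent (batch) folder; the function name articles_by_batch is hypothetical and not part of this repository.

```python
from collections import defaultdict
from pathlib import Path


def articles_by_batch(root, pattern="*/*.xml"):
    """Group article files under `root` by the name of their batch folder."""
    batches = defaultdict(list)
    # sorted() gives a stable order so results are reproducible across runs.
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            batches[path.parent.name].append(path.name)
    return dict(batches)
```

Calling articles_by_batch("data/pmc_articles") from the repository root should return one entry per batch folder; an empty result suggests the symlink or the folder depth is wrong.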

The text conversion and downstream analysis can then be run with the second pipeline file

source venv/bin/activate
snakemake -s workflows/text_stats.snakefile --jobs 10

This creates a text file for each NXML file, along with completion stamp files, logs, and readability scores, at the following paths

File	Path
complete stamp	data/pmc_articles/{batch_id}/NXML_TXT.COMPLETE
log file (readability scores)	data/pmc_articles/{batch_id}.readability_scores.snakemake.txt
readability scores csv	data/pmc_articles/{batch_id}.readability_scores.csv
text file conversion	data/pmc_articles/{batch_id}/{article_id}.nxml.txt
log file (text conversion)	data/pmc_articles/{batch_id}.nxml_to_txt.snakemake.txt
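The stamp files listed above make it straightforward to see which batches have finished text conversion. The following is a minimal sketch that relies only on the NXML_TXT.COMPLETE path pattern from the table; the function name batch_status is hypothetical.

```python
from pathlib import Path


def batch_status(root):
    """Map each batch folder under `root` to True if its stamp file exists."""
    return {
        batch.name: (batch / "NXML_TXT.COMPLETE").is_file()
        for batch in sorted(Path(root).iterdir())
        if batch.is_dir()
    }
```

For example, batch_status("data/pmc_articles") returns a dictionary keyed by batch folder name, where a False value marks a batch whose conversion has not completed.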

Following text conversion, LDA coherence scores are also computed, and files with the scores are generated per batch.

Labelling Scientific Concepts

The final step requires significant setup and must be run on a GPU cluster to be feasible. Therefore, we do not include that code here. This step creates the annotation files using the model referred to in Brack, 2020.

This step takes the data/pmc_articles/{batch_id}/{article_id}.nxml.txt files as input and creates the data/pmc_articles/{batch_id}/{article_id}.nxml.txt.ann files.
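Because this step is expensive, it may be useful to list which converted text files still lack an annotation file before submitting work to the cluster. The sketch below assumes only the two path patterns given above; the function name missing_annotations is hypothetical.

```python
from pathlib import Path


def missing_annotations(root):
    """Return .nxml.txt files under `root` that have no matching .ann file."""
    pending = []
    for txt in sorted(Path(root).glob("*/*.nxml.txt")):
        # The annotation file is the text file name with ".ann" appended.
        ann = txt.with_name(txt.name + ".ann")
        if not ann.exists():
            pending.append(txt)
    return pending
```

Running missing_annotations("data/pmc_articles") returns the remaining inputs as Path objects, so an empty list means every converted article has been annotated.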

There are two workflow files associated with this processing

workflows/scibert-concept-extraction.snakefile
workflows/scibert-post-processing.snakefile

The first extracts concepts from individual articles, whereas the second post-processes, simplifies, and merges the individual annotation files.
