GitHub - NCBI-Hackathons/Females-Are-Insisting-on-Reproducibility

Contributors

Kerry Goetz (kerry.goetz@nih.gov, kgoetz2@masonlive.gmu.edu)

Dina Mikdadi

Divya Palaniswamy (divyaswamy87@gmail.com)

Yajing Song

Background

Biomedical research data is notoriously difficult to make FAIR (Findable, Accessible, Interoperable and Re-usable). To understand the data set from a clinical research study, one often needs to understand the data dictionary and the associated code list as well. Someone interested in the data must therefore spend a lot of time working out the relationships between many files of several types. In addition, the case report forms (CRFs) are often stored as PDF files, and the codebooks are sometimes separate files. In some cases, the CRF files also contain handwritten annotations. The goal of this work is to create an automated method for merging data from each source into a single containerized object that includes enough metadata for an outsider to make sense of the study. This is a step toward the larger goal of making biomedical research data FAIR.

All about the data

We are working with survey data from the Patient Reported Outcomes with LASIK (PROWL). This was a joint project of the DOD, FDA, and NIH. More information can be found at the FDA PROWL Website and the NEI PROWL Website. We looked at three data categories: demographics, pre-operation survey responses, and post-operation surveys.

We started with an annotated case report form (CRF) and a separate code book in PDF format, together with the corresponding data dictionary and survey response data in CSV format.

Data Information:

Cohort = 1100 people
Demographic = 8 Data Elements (DE)
Pre-OP = 142 DE
Post-OP = 108 DE
6 data files but we used one for proof of concept (see note about data access)
42 PDFs (21 pairs of annotated CRF and code book)

Data is available on a controlled-access basis. Users must sign a user agreement before their account is approved. The barrier to access is low: the agreement mainly serves to track usage and to make sure there is no third-party sharing. Request access at NEI BRICS.

Goals

Target:

Pull data from CRFs and PDFs into a structured data format
Use the structured data to perform keyword extraction using Natural Language Processing (NLP)
Extract schema from the data sets in CSV files
Capture inventories of the files in the directories
  * Bagit and Pandas as tools
  * How many variables and missing values are in each file
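
The inventory target above (how many variables and missing values are in each file) can be sketched roughly as follows. This is a minimal illustration, not the project's notebook code: it uses only the Python standard library so it runs without the controlled-access PROWL files, and the sample columns, the `summarize_csv` helper, and the set of tokens treated as missing are all assumptions made for the example.

```python
import csv
import io

# Tokens counted as "missing"; this set is an assumption, not the PROWL convention.
MISSING = {"", "NA", "NaN", "null"}


def summarize_csv(handle):
    """Return column names, row count, and missing-value count per column."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            value = row.get(name)
            # Short rows yield None; blank or placeholder cells count as missing.
            if value is None or value.strip() in MISSING:
                missing[name] += 1
    return {"columns": columns, "n_rows": n_rows, "missing": missing}


# Made-up sample standing in for a survey export; column names are hypothetical.
sample = io.StringIO(
    "subject_id,age,q1_satisfaction\n"
    "001,34,4\n"
    "002,,5\n"
    "003,41,NA\n"
)
summary = summarize_csv(sample)
print(summary)
```

To inventory a directory, the same helper could be called on each CSV file opened with `open(path, newline="")`, collecting one summary per file.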

Stretch:

Associate the variables with the descriptions
Validate the variables against the data dictionary and report any variable that has no match
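
The two stretch goals could look something like the sketch below: each column in the data is looked up in the data dictionary, matched columns get their description, and unmatched ones are reported. The variable names, descriptions, and the `validate_variables` helper are hypothetical, and the case-insensitive matching is an assumption about how the names might differ between files.

```python
def validate_variables(data_columns, dictionary):
    """Pair each data column with its description; collect columns with no match."""
    # Normalize dictionary keys once so matching ignores case and stray spaces
    # (an assumption; the real files may need a different matching rule).
    lookup = {name.strip().lower(): desc for name, desc in dictionary.items()}
    described = {}
    unmatched = []
    for column in data_columns:
        key = column.strip().lower()
        if key in lookup:
            described[column] = lookup[key]
        else:
            unmatched.append(column)
    return described, unmatched


# Hypothetical dictionary entries and data headers, for illustration only.
dictionary = {
    "AGE": "Age at enrollment (years)",
    "Q1_SATISFACTION": "Overall satisfaction with vision",
}
columns = ["age", "q1_satisfaction", "q9_unknown"]

described, unmatched = validate_variables(columns, dictionary)
print(described)
print("No dictionary entry for:", unmatched)
```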

Workflow

Presentations

Project Concept

Mid-way Lightning Talk

Final Presentation

Dependencies:

R - R version 3.5.1 (2018-07-02), platform x86_64-apple-darwin13.4.0 (64-bit), running under macOS 10.14.2
  * Attached base packages: stats, graphics, grDevices, utils, datasets, methods, base
  * Other attached packages: bindrcpp_0.2.2, dplyr_0.7.6, ggplot2_3.0.0
Python - NumPy, Matplotlib, nltk, PyPDF2
Jupyter Notebook

Input Data:

Example from CRF

Example from Data Dictionary

Example from Data

Shoe #1 - Sneakers - CRF

Read Case Report Form (CRF) PDFs into a Python data frame for text mining
Put the text into a data frame to do cool stuff
Map headers in the data (variable names) to the questions they represent
Automated keyword extraction from multiple Case Report Form (CRF) PDF files using Natural Language Processing (NLP)
Other fun stuff like text analysis and MeSH terms
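
As a rough illustration of the keyword-extraction step, the sketch below counts the most frequent non-stopword terms per form. It is not the Sneakers notebook's code: it assumes the text has already been pulled out of each PDF, it swaps in a plain regex tokenizer and a tiny hand-written stopword list in place of nltk, and the file names and question text are made up.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a fuller list such as
# nltk's English stopwords would likely be used in practice.
STOPWORDS = {
    "the", "a", "an", "of", "in", "to", "and", "or", "your", "you",
    "how", "do", "did", "have", "with", "is", "are", "at", "for", "on",
}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms as (term, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(top_n)


# Hypothetical text standing in for what was extracted from two CRF PDFs.
crf_text = {
    "form_a.pdf": "How satisfied are you with your vision? "
                  "Do you have glare at night? Rate your night vision.",
    "form_b.pdf": "Do you have dry eyes? How often are your eyes dry?",
}

# One row per (file, keyword, count), a shape that loads directly into a data frame.
rows = [
    (name, term, count)
    for name, text in crf_text.items()
    for term, count in extract_keywords(text, top_n=2)
]
for row in rows:
    print(row)
```

Keeping the output as flat rows makes it straightforward to hand the result to a data frame for the later text-analysis steps.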

Sneakers notebook

Shoe #2 - Variable Cleanup

Work Shoes

