MAG-WoS papers for open science/reproducibility topics #5

Open
everyxs opened this issue Mar 11, 2019 · 2 comments
Comments

everyxs commented Mar 11, 2019

We intend to collect a set of papers covering the general topics of open science and reproducibility. We first want to replicate our earlier studies, which were based on our own MAG data crawled on 02-22-2018. We also want to run a similar analysis on the latest MAG data. Finally, we would like to map the papers in MAG to those in WoS, especially for journal names and disciplinary mapping.
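
As a rough illustration of the MAG-to-WoS mapping step, here is a minimal sketch that pairs records by DOI and falls back to a normalized title plus publication year. The record layout (`id`, `title`, `year`, `doi`, `journal`) is an assumption made for this example, not the actual MAG or WoS schema, and the matching rule is only one plausible choice.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def match_records(mag_records, wos_records):
    """Pair MAG and WoS records: DOI first, then normalized title + year.

    Both inputs are lists of dicts with hypothetical keys
    'id', 'title', 'year', 'doi' and (for WoS) 'journal'.
    Returns a list of (mag_id, wos_id, wos_journal) tuples.
    """
    by_doi = {}
    by_title = {}
    for rec in wos_records:
        if rec.get("doi"):
            by_doi[rec["doi"].lower()] = rec
        by_title[(normalize_title(rec["title"]), rec.get("year"))] = rec

    matches = []
    for rec in mag_records:
        doi = (rec.get("doi") or "").lower()
        hit = by_doi.get(doi) if doi else None
        if hit is None:
            key = (normalize_title(rec["title"]), rec.get("year"))
            hit = by_title.get(key)
        if hit is not None:
            matches.append((rec["id"], hit["id"], hit.get("journal")))
    return matches
```

Exact matching like this will miss records whose titles differ by more than case, accents, or punctuation, so a fuzzy comparison might be needed for the remainder.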

XiaoranYan commented Mar 13, 2019

The pipeline is designed to exercise many working components of CADRE-RAC, from U-SQL data ingestion to Spark data comparison and integration and nested, dockerized Gephi visualizations, all within a containerized Jupyter Lab environment. It will use both the WoS and MAG datasets.

The repo in development can be found at
https://github.com/everyxs/openScience

XiaoranYan commented Mar 13, 2019

Finished:

  • Established an Azure Blob connection for U-SQL result ingestion
  • Optimized the data cleaning and transformation pipeline using R
  • Ran Gephi scripts as a JAR package from Jupyter Lab
  • Submitted U-SQL scripts from Jupyter Lab
  • Established a remote Jupyter kernel for the Karst Spark cluster using Toree; this is only possible outside of the container, with SSH tunnels and browsers
  • Used binder and repo2docker to build reproducible container images
  • Prepared a demo based on this project, with wiki instructions and binder image deployments
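
For the Gephi item above, a minimal sketch of launching a JAR from a Jupyter Lab notebook cell could look like the following. The JAR name and its positional arguments (input graph, output file) are hypothetical; the actual Gephi script package in the repo may expect different arguments.

```python
import subprocess


def build_gephi_command(jar_path, input_graph, output_file, java="java"):
    """Build the argument list for running a packaged Gephi script.

    The two positional arguments after the JAR are an assumption
    made for this example, not a documented interface.
    """
    return [java, "-jar", str(jar_path), str(input_graph), str(output_file)]


def run_gephi_jar(jar_path, input_graph, output_file):
    """Run the JAR and return the completed process.

    Raises subprocess.CalledProcessError if the JVM exits with a
    non-zero status, so failures surface in the notebook cell.
    """
    cmd = build_gephi_command(jar_path, input_graph, output_file)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the command as a list rather than a shell string avoids quoting problems when file paths contain spaces.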

To do list:

  • Deploy BinderHub on AWS Kubernetes cluster
  • Test U-SQL on the AWS cluster while adding new overlapping output tables
  • Explore ways to dockerize a customizable Toree image for local cluster integration on CADRE
  • Explore ways to pre-mount HDFS, S3, WASB and other Hadoop file system interfaces in a customized Docker image
  • Explore possible ways to dockerize Gephi for interactive visualizations
