
GSoC 2019 Project Ideas

Andrew Zhang edited this page Jan 19, 2019 · 2 revisions

Join the chat at https://gitter.im/weecology/retriever

Please ask questions here. Tag @zhangcandrew, @henrykironde, @ethanwhite

Preferred names: Andrew, Henry, Ethan. Preferred greeting: Hi/Hello/Dear/Thanks/Thank you [First name].

The code of conduct should be your first read.

Extract Scripts into Separate System

Please ask questions here. Tag @ethanwhite, @henrykironde, @zhangcandrew.

Rationale

The Data Retriever is a package manager for data. It automatically finds, downloads, and pre-processes publicly available datasets, and stores them in a ready-to-analyse state. Currently the core software ships with its JSON script metadata. We want to move this metadata into a separate project to help with organization, maintenance, and testing.

Approach

This project aims to scale up the number of datasets usable with the retriever and to standardize the maintenance of these scripts.
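As a rough illustration of what maintaining and testing scripts in a separate location could involve, here is a minimal sketch that loads every JSON script from a directory and reports files that fail a basic check. The function name `load_scripts`, the directory layout, and the `REQUIRED_KEYS` list are assumptions made for this example; they are not the retriever's actual API or schema.

```python
import json
from pathlib import Path
from typing import Dict, List

# Hypothetical minimum set of keys; the real script schema may differ.
REQUIRED_KEYS = ("name", "version", "resources")


def load_scripts(script_dir: Path) -> Dict[str, dict]:
    """Load all *.json scripts in script_dir, indexed by their "name" field.

    Raises ValueError listing every file that is not valid JSON or that
    lacks one of the assumed required keys.
    """
    scripts: Dict[str, dict] = {}
    problems: List[str] = []
    for path in sorted(script_dir.glob("*.json")):
        try:
            script = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc})")
            continue
        if not isinstance(script, dict):
            problems.append(f"{path.name}: top-level value is not an object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in script]
        if missing:
            problems.append(f"{path.name}: missing keys {missing}")
            continue
        scripts[script["name"]] = script
    if problems:
        raise ValueError("\n".join(problems))
    return scripts
```

A check like this could run in the separate project's continuous integration so that a malformed script is caught before the core software ever loads it.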

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Knowledge of Web Requests

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrykironde
  • @ethanwhite
  • @zhangcandrew

Retriever Provenance

Please ask questions here. Tag @ethanwhite, @henrykironde, @zhangcandrew.

Rationale

As script files change and updates are pushed to the retriever, reproducibility becomes an issue. Research scientists need to be able to reproduce the same output consistently.

Approach

The goal of the project is to consistently reproduce the exact output obtained from a given combination of retriever version and script version as of a previous tag. We can achieve this by using Docker to capture specific retriever versions and by caching previous versions of our JSON scripts.
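One possible shape for the script-caching half of this idea is sketched below: each script snapshot is written under its name and version together with a content digest, so a past version can be looked up later. The function names, the cache layout, and the `name` and `version` fields are assumptions made for illustration, not part of the retriever.

```python
import hashlib
import json
from pathlib import Path


def cache_script(script: dict, cache_dir: Path) -> Path:
    """Store a snapshot at <cache_dir>/<name>/<version>-<digest>.json."""
    # Canonical serialization: the same content always gives the same digest.
    payload = json.dumps(script, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    target = cache_dir / script["name"] / f"{script['version']}-{digest}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_bytes(payload)
    return target


def load_cached_script(name: str, version: str, cache_dir: Path) -> dict:
    """Return the cached snapshot for the given script name and version."""
    matches = sorted((cache_dir / name).glob(f"{version}-*.json"))
    if not matches:
        raise FileNotFoundError(f"no cached script for {name} {version}")
    if len(matches) > 1:
        # Same version label, different content: the script changed
        # without a version bump, so the lookup is ambiguous.
        raise ValueError(f"multiple snapshots cached for {name} {version}")
    return json.loads(matches[0].read_text(encoding="utf-8"))
```

Including the digest in the file name means that a script edited without a version bump produces a second file rather than silently overwriting the first, which is exactly the situation a provenance feature would want to surface.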

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Knowledge of Docker
  • Principles of Object Oriented Programming
  • Familiarity with Git Provenance

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrykironde
  • @ethanwhite

Retriever Out of Memory Functionality

Please ask questions here. Tag @ethanwhite, @henrykironde, @zhangcandrew.

Rationale

As Retriever functionality grows, we want to be sure not to neglect efficiency. In particular, for large datasets, we want to be able to process the data without having to download all of it before working with it.
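To make the idea concrete, the following is a minimal sketch of processing a remote CSV in fixed-size batches as it arrives, using only the Python standard library. The helper names `iter_remote_lines` and `iter_row_batches` are hypothetical, and this is one possible technique rather than a description of how the retriever works or of the approach the mentors have in mind.

```python
import csv
import io
import urllib.request
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_remote_lines(url: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded text lines from a URL while the response is still streaming."""
    with urllib.request.urlopen(url) as response:
        # newline="" leaves line endings untouched, as the csv module expects.
        yield from io.TextIOWrapper(response, encoding=encoding, newline="")


def iter_row_batches(
    lines: Iterable[str], batch_size: int = 1000
) -> Iterator[List[Dict[str, str]]]:
    """Group CSV rows into lists of at most batch_size dicts.

    Only one batch is held in memory at a time, so the full dataset never
    has to be downloaded or loaded before processing starts.
    """
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch
```

A caller could combine the two as `for batch in iter_row_batches(iter_remote_lines(url)):` and insert each batch into a database before the next one is read.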

Approach

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Knowledge of Web Requests

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrykironde
  • @ethanwhite
  • @zhangcandrew