GSoC 2020 Project Ideas

Ethan White edited this page Mar 30, 2020 · 5 revisions

Please ask questions here. Tag @apoorvaeternity, @ethanwhite, @henrykironde

Preferred names (Apoorva, Henry, Ethan), Preferred_greeting (Hi|Hello|Dear|Thanks|Thank you [First_name])

Join the chat at https://gitter.im/weecology/retriever

The code of conduct should be your first read.

Data Retriever: Add support for more raw data formats

Rationale

The Data Retriever is a package manager for data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. The Data Retriever handles tabular and spatial data forms. It also handles compressed versions of these data forms, i.e., zip, gz, and tar files.

Approach

The goal of the project is to enable the Data Retriever to ingest additional forms of raw data. The project will add support for the following raw data formats: XML, JSON, NetCDF, HDF, Excel, SQLite, and GeoJSON.
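
As an illustration of what ingesting one of these formats could involve, here is a minimal sketch that turns a JSON array of (possibly nested) records into a flat CSV table using only the Python standard library. The function names `flatten` and `json_to_csv` are hypothetical and are not part of the Data Retriever API; a real implementation would need to fit the existing engine and recipe design.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def json_to_csv(json_text):
    """Convert a JSON array of objects into CSV text (hypothetical helper)."""
    records = [flatten(r) for r in json.loads(json_text)]
    # Collect column names in first-seen order so the header is stable.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # missing keys are filled with restval
    return buffer.getvalue()
```

Records that lack a column are written with an empty cell, so sources with irregular keys still produce a rectangular table.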

Some sources for these raw data forms.

Degree of difficulty and needed skills

  • Difficult
  • Knowledge of Python
  • Knowledge of Object Oriented Programming

Useful skills

  • Knowledge of Git, continuous development and deployment tools
  • Knowledge of R and Julia Programming

Involved developer communities

The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel. Join the chat at https://gitter.im/weecology/retriever

Mentors

  • @apoorvaeternity
  • @henrysenyondo
  • @ethanwhite

Data Retriever: Improve environment setup and installation on all platforms for all Data Retriever ecosystem services

Rationale

The main Data Retriever package is a Python package with both a command line interface (CLI) and a Python interface. The platform is coupled with the Retriever-recipes repository, which stores the data packages. Additionally, the platform can be used from Julia and R through wrapper packages. The Julia package, Retriever.jl, and the R package, Rdataretriever, are both hosted on GitHub. To be maximally useful, installation and use of the retriever should be easy from all three languages (Python, R, and Julia) and operating systems (OS X, Windows, and Linux).

Approach

The goal of the project is to boost the usability of the Data Retriever ecosystem by making installation easy. Users should be able to install any of the packages with minimal steps or guidelines, in a way that is intuitive for users of R, Julia, or Python.

This project will involve automating as much of the installation process for the Python package as possible within the R and Julia wrappers, so that it is as close to a normal R or Julia package install as possible. This will involve the use of the reticulate package in R (as well as renv) and the PyCall package in Julia. These packages both support the conda package management system for installing Python packages. The goal is either to have the Python package installed automatically as part of the R/Julia package installations or to include functions in those packages that perform the installation (e.g., rdataretriever::install_core_retriever()).
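
The "install only if missing" behaviour described above could be sketched in Python roughly as follows. This is an illustration of the logic only: in practice the R and Julia wrappers would drive it through reticulate or PyCall, and the helper names `is_installed`, `install_command`, and `ensure_installed` are hypothetical. The sketch also assumes that the importable module name matches the installable package name and that the package is available from the conda-forge channel.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the module can be found in the current Python environment."""
    return importlib.util.find_spec(module_name) is not None


def install_command(package, use_conda=False):
    """Build (but do not run) the command that would install the package."""
    if use_conda:
        # Assumption: the package is published on the conda-forge channel.
        return ["conda", "install", "--yes", "-c", "conda-forge", package]
    # sys.executable targets the interpreter that is running this code.
    return [sys.executable, "-m", "pip", "install", package]


def ensure_installed(package, use_conda=False, runner=None):
    """Install the package only if missing; return True if an install was attempted."""
    if is_installed(package):
        return False
    if runner is None:
        import subprocess
        runner = subprocess.check_call
    runner(install_command(package, use_conda))
    return True
```

Passing a custom `runner` makes it possible to test the decision logic in a CI pipeline without actually installing anything.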

One of the challenges with this task is ensuring that it works consistently across operating systems and development environments. For example, we have encountered situations where things that work smoothly in reticulate on Linux don't work the same way on Windows, and we have seen cases where things work differently in RStudio than when running R directly. See https://github.com/ropensci/rdataretriever/issues/199 for an example of some of the challenges.

Developing good documentation to help guide users through any non-automated steps will also be important.

This project will involve the use of modern DevOps technologies, such as Continuous Integration or Continuous Deployment pipelines, for testing these solutions.

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of continuous development and deployment tools
  • Knowledge of Python programming

Useful skills

  • Knowledge of Git, continuous development and deployment tools
  • Knowledge of R and Julia Programming
  • Working knowledge of Python and R package managers including conda

Involved developer communities

The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel. Join the chat at https://gitter.im/weecology/retriever

Mentors

  • @apoorvaeternity
  • @henrysenyondo
  • @ethanwhite