Add JOSS paper. #43

Open
wants to merge 7 commits into
base: develop
28 changes: 28 additions & 0 deletions .github/workflows/draft-pdf.yml
@@ -0,0 +1,28 @@
name: Draft PDF
on:
  push:
    paths:
      - paper/**
      - .github/workflows/draft-pdf.yml

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper/paper.md
      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper/paper.pdf
2 changes: 2 additions & 0 deletions paper/.gitignore
@@ -0,0 +1,2 @@
/jats/
/paper.pdf
98 changes: 98 additions & 0 deletions paper/paper.bib
@@ -0,0 +1,98 @@
@misc{diaz-vico+ramos-carreno_2023_scikitdatasets,
title = {{{scikit-datasets}}: {{Scikit-learn-compatible}} Datasets},
author = {{D{\'i}az-Vico}, David and {Ramos-Carre{\~n}o}, Carlos},
year = {2023},
month = aug,
doi = {10.5281/zenodo.6383047},
url = {https://github.com/daviddiazvico/scikit-datasets},
copyright = {MIT}
}

@misc{fajardo_2024_pyreadr,
title = {Pyreadr},
author = {Fajardo, Otto},
year = {2024},
month = jul,
publisher = {Zenodo},
doi = {10.5281/zenodo.7110169},
url = {https://github.com/ofajardo/pyreadr}
}

@misc{gautier_2024_rpy2,
title = {Rpy2: {{R}} in {{Python}}},
author = {Gautier, Laurent},
year = {2024},
publisher = {GitHub},
url = {https://github.com/rpy2/rpy2}
}

@article{harris+_2020_numpy,
title = {Array programming with {NumPy}},
author = {Charles R. Harris and K. Jarrod Millman and St{\'{e}}fan J.
van der Walt and Ralf Gommers and Pauli Virtanen and David
Cournapeau and Eric Wieser and Julian Taylor and Sebastian
Berg and Nathaniel J. Smith and Robert Kern and Matti Picus
and Stephan Hoyer and Marten H. van Kerkwijk and Matthew
Brett and Allan Haldane and Jaime Fern{\'{a}}ndez del
R{\'{i}}o and Mark Wiebe and Pearu Peterson and Pierre
G{\'{e}}rard-Marchant and Kevin Sheppard and Tyler Reddy and
Warren Weckesser and Hameer Abbasi and Christoph Gohlke and
Travis E. Oliphant},
year = {2020},
month = sep,
journal = {Nature},
volume = {585},
number = {7825},
pages = {357--362},
doi = {10.1038/s41586-020-2649-2},
}

@inproceedings{mckinney_2010_pandas,
author = {{W}es {M}c{K}inney},
title = {{D}ata {S}tructures for {S}tatistical {C}omputing in {P}ython},
booktitle = {{P}roceedings of the 9th {P}ython in {S}cience {C}onference},
pages = {56 - 61},
year = {2010},
editor = {{S}t\'efan van der {W}alt and {J}arrod {M}illman},
doi = {10.25080/Majora-92bf1922-00a},
}

@software{pandasdevelopmentteam_2024_pandasdev,
title = {{{pandas-dev/pandas}}: {{pandas}}},
author = {{The Pandas Development Team}},
year = {2024},
month = apr,
publisher = {Zenodo},
doi = {10.5281/zenodo.3509134},
url = {https://doi.org/10.5281/zenodo.3509134},
version = {latest}
}

@article{ramos-carreno+_2024_scikitfda,
title = {Scikit-Fda: {{A Python Package}} for {{Functional Data Analysis}}},
shorttitle = {Scikit-Fda},
author = {{Ramos-Carre{\~n}o}, Carlos and Torrecilla, Jos{\'e} Luis and {Carbajo-Berrocal}, Miguel and Marcos, Pablo and Su{\'a}rez, Alberto},
year = {2024},
month = may,
journal = {Journal of Statistical Software},
volume = {109},
pages = {1--37},
issn = {1548-7660},
doi = {10.18637/jss.v109.i02},
abstract = {The library scikit-fda is a Python package for functional data analysis (FDA). It provides a comprehensive set of tools for representation, preprocessing, and exploratory analysis of functional data. The library is built upon and integrated in Python's scientific ecosystem. In particular, it conforms to the scikit-learn application programming interface so as to take advantage of the functionality for machine learning provided by this package: Pipelines, model selection, and hyperparameter tuning, among others. The scikit-fda package has been released as free and open-source software under a 3-clause BSD license and is open to contributions from the FDA community. The library's extensive documentation includes step-by-step tutorials and detailed examples of use.},
copyright = {Copyright (c) 2024 Carlos Ramos-Carre{\~n}o, Jos{\'e} Luis Torrecilla, Miguel Carbajo-Berrocal, Pablo Marcos, Alberto Su{\'a}rez},
langid = {english}
}

@article{rahman+_2024_hmschpc,
title = {Accelerating joint species distribution modelling with {Hmsc-HPC} by {GPU} porting},
author = {Rahman, Anis Ur and Tikhonov, Gleb and Oksanen, Jari and Rossi, Tuomas and Ovaskainen, Otso},
year = {2024},
month = sep,
journal = {PLOS Computational Biology},
volume = {20},
number = {9},
pages = {e1011914},
doi = {10.1371/journal.pcbi.1011914},
abstract = {Joint species distribution modelling (JSDM) is a widely used statistical method that analyzes combined patterns of all species in a community, linking empirical data to ecological theory and enhancing community-wide prediction tasks. However, fitting JSDMs to large datasets is often computationally demanding and time-consuming. Recent studies have introduced new statistical and machine learning techniques to provide more scalable fitting algorithms, but extending these to complex JSDM structures that account for spatial dependencies or multi-level sampling designs remains challenging. In this study, we aim to enhance JSDM scalability by leveraging high-performance computing (HPC) resources for an existing fitting method. Our work focuses on the Hmsc R-package, a widely used JSDM framework that supports the integration of various dataset types into a single comprehensive model. We developed a GPU-compatible implementation of its model-fitting algorithm using Python and the TensorFlow library. Despite these changes, our enhanced framework retains the original user interface of the Hmsc R-package. We evaluated the performance of the proposed implementation across various model configurations and dataset sizes. Our results show a significant increase in model fitting speed for most models compared to the baseline Hmsc R-package. For the largest datasets, we achieved speed-ups of over 1000 times, demonstrating the substantial potential of GPU porting for previously CPU-bound JSDM software. This advancement opens promising opportunities for better utilizing the rapidly accumulating new biodiversity data resources for inference and prediction.},
}
137 changes: 137 additions & 0 deletions paper/paper.md
@@ -0,0 +1,137 @@
---
title: 'rdata: A Python library for R datasets'
tags:
- Python
- R
- datasets
- rda
- rds
authors:
- name: Carlos Ramos-Carreño
orcid: 0000-0003-2566-7058
affiliation: 1
- name: Tuomas Rossi
orcid: 0000-0002-8713-4559
affiliation: 2
affiliations:
- name: Universidad Autónoma de Madrid, Spain
index: 1
- name: CSC – IT Center for Science Ltd., Finland
index: 2
date: 4 September 2024
bibliography: paper.bib

---

# Summary

Research work usually requires the analysis and processing of data from different sources.
Traditionally in statistical computing, the R language has been widely used for this task, and a large number of datasets have been compiled in the Rda and Rds formats, native to this programming language.
As these formats internally contain the representation of R objects, they cannot be used directly from Python, another widely used language for data analysis and processing.
The library `rdata` makes it possible to load these datasets and convert them to Python objects, without the need to export them to intermediate formats that may not preserve all the original information.
This library has minimal dependencies, ensuring that it can be used in contexts where an R installation is not available.
The capability to write data in Rda and Rds formats is also under development.
Thus, the library `rdata` facilitates data interchange, enabling the use of the same datasets in both languages (e.g. for reproducibility, comparisons of results against methods in both languages, or migration of processing pipelines to Python).

# Statement of need

The datasets from the CRAN repository are stored in the R-specific RData format.
A few Python packages were already able to parse this file format, although all of them had limitations.

The package `rpy2` [@gautier_2024_rpy2] can be used to interact with R from Python.
This includes the ability to load data in the RData format, and to convert these data to equivalent Python objects.
Although this is arguably the best package to achieve interaction between both languages, it has several disadvantages if one wants to use it just to load RData datasets.
First, the package requires an R installation, as it relies on launching an R interpreter and communicating with it.
Secondly, launching R just to load data is inefficient, both in time and memory.
Finally, this package inherits the GPL license from the R language, which is not compatible with most Python packages, typically released under more permissive licenses.

The package `pyreadr` [@fajardo_2024_pyreadr] also provides functionality to read and write some R datasets.
It relies on the C library `librdata` to parse the RData format.
This adds a dependency on C build tools, and requires the package to be compiled for every target operating system.
Moreover, this package is limited by the functionality available in `librdata`, which at the time of writing
does not include the parsing of common objects such as R lists and S4 objects.
The license can also be a problem, as it is part of the GPL family, whose copyleft requirements can make it unsuitable for inclusion in permissively licensed or proprietary software.

As existing solutions were unsuitable for our needs, the package `rdata` was developed to parse data in the RData format.
This is a small, extensible, and efficient pure Python implementation of an RData parser, able to read most datasets in the CRAN repository and convert them to equivalent Python objects, such as the built-in types of the Python standard library, NumPy arrays [@harris+_2020_numpy], or Pandas dataframes [@mckinney_2010_pandas; @pandasdevelopmentteam_2024_pandasdev].
It has a permissive license and can be extended to support additional conversions from custom R classes.

The package `rdata` has been designed as a pure Python package with minimal dependencies, so that it can be easily integrated inside other libraries and applications.
It currently powers the functionality offered in the `scikit-datasets` package [@diaz-vico+ramos-carreno_2023_scikitdatasets] for loading datasets from the CRAN repository of R packages.
This functionality is used for fetching the functional datasets provided in the `scikit-fda` library [@ramos-carreno+_2024_scikitfda], whose development was the main reason for the creation of the `rdata` package itself.

# Features

The package `rdata` is intended to be both flexible and easy to use.
In order to be flexible, the parsing of the RData format and the conversion of the parsed structures to appropriate Python objects have been split into two steps.
This allows advanced users to perform custom conversions without losing information.
Most users, however, will want to use the default conversion routine, which attempts to convert data
to a standard Python representation that preserves most of the information.

```python
import rdata

converted = rdata.read_rda("dataset.rda")
converted
```

This is equivalent to the following code, in which the two steps are performed separately.

```python
import rdata

parsed = rdata.parser.parse_file("dataset.rda")
converted = rdata.conversion.convert(parsed)
```

The function `parse_file()` of the parser module parses the RData file and returns a tree-like structure of Python objects that represents the basic R objects making up the dataset.
The function `convert()` of the conversion module transforms that representation to the final Python objects, such as lists, dictionaries or dataframes, that users can manipulate.
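
The separation between parsing and conversion can be pictured with a toy model. The sketch below does not use the actual classes of `rdata`: `Node`, `toy_convert()` and the `kind` names are hypothetical stand-ins, chosen only to show how a tree that keeps every attribute can later be turned into plain Python objects.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """Hypothetical stand-in for a parsed R object (not rdata's real class)."""

    kind: str    # e.g. "int", "str" or "list"
    value: Any   # raw payload, or child nodes when kind == "list"
    attributes: dict = field(default_factory=dict)


def toy_convert(node: Node) -> Any:
    """Second step: turn the attribute-preserving tree into Python objects."""
    if node.kind == "list":
        children = [toy_convert(child) for child in node.value]
        names = node.attributes.get("names")
        # A named R list maps naturally to a dictionary.
        return dict(zip(names, children)) if names else children
    return list(node.value)


# First step (done by a parser in practice): the tree is built by hand here.
tree = Node(
    kind="list",
    value=[Node("int", [1, 2, 3]), Node("str", ["a", "b"])],
    attributes={"names": ["counts", "labels"]},
)

converted = toy_convert(tree)
```

Because the tree still carries the `names` attribute, a different converter could just as well return a list of pairs instead of a dictionary, which is the kind of freedom the two-step design is meant to give.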

Advanced users will probably need to load datasets that contain non-standard S3 or S4 classes, translating each of them to a custom Python class.
This is easy to achieve using `rdata` by simply creating a constructor function that receives the converted object representation and its attributes, and returns a Python object of the desired type.
As an example, consider the following simple code that constructs a `Pandas` [@pandasdevelopmentteam_2024_pandasdev] `Categorical` object from the internal representation of an R `factor`.

```python
import pandas


def factor_constructor(obj, attrs):
    # R factor codes are 1-based; negative codes are mapped to None.
    values = [attrs['levels'][i - 1] if i >= 0 else None for i in obj]

    return pandas.Categorical(values, attrs['levels'], ordered=False)
```

Then, a dictionary mapping the original class names to their constructor functions can be passed as the `constructor_dict` parameter of the `read_rda()` function (or of `convert()`, if the two steps are performed separately).
In the previous example, this could be done using the following code:

```python
converted = rdata.read_rda(
    "dataset.rda",
    constructor_dict={"factor": factor_constructor},
)
```
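
The index arithmetic inside `factor_constructor()` can be checked without Pandas. The following is a minimal sketch, assuming that missing entries arrive as R's integer `NA` sentinel (the smallest 32-bit integer); how `rdata` actually hands missing values to a constructor is not stated here, and `decode_factor()` is a hypothetical helper written only for this illustration.

```python
# R encodes a missing integer (NA_integer_) as the smallest 32-bit integer.
R_NA_INTEGER = -(2**31)


def decode_factor(codes, levels):
    """Map 1-based factor codes to their labels, using None for missing."""
    return [levels[code - 1] if code > 0 else None for code in codes]


labels = decode_factor([2, 1, R_NA_INTEGER, 2], ["low", "high"])
```

The guard used here is `code > 0`, slightly stricter than the `i >= 0` of `factor_constructor()`, so that a stray code of 0 cannot silently index the last level.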

During the default conversion routine, whenever an object belonging to an S3 or S4 class is found, the appropriate constructor is called with the partially constructed object.
If no constructor is available for that class, a warning will be emitted and the constructor of the most immediate parent class available will be called.
If there are no constructors for any of the parent classes, the basic underlying Python object will be left without transformation.
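
The lookup order described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used by `rdata`: `resolve_constructor()` and `apply_constructor()` are hypothetical names, the class vector is assumed to list the most specific class first (as the R `class` attribute does), and emitting the warning even when no parent matches is a choice made for this sketch.

```python
import warnings


def resolve_constructor(class_names, constructor_dict):
    """Return the constructor of the first class that has one, else None."""
    if class_names and class_names[0] not in constructor_dict:
        warnings.warn(f"Missing constructor for R class {class_names[0]!r}.")
    for name in class_names:
        if name in constructor_dict:
            return constructor_dict[name]
    return None


def apply_constructor(obj, attrs, constructor_dict):
    """Build the final object, or return `obj` untouched if nothing matches."""
    constructor = resolve_constructor(attrs.get("class", []), constructor_dict)
    return obj if constructor is None else constructor(obj, attrs)


# Toy constructor that only tags the data, to make the dispatch visible.
constructors = {"factor": lambda obj, attrs: ("factor", list(obj))}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # "ordered" has no constructor, so its parent class "factor" is used.
    result = apply_constructor(
        [1, 2], {"class": ["ordered", "factor"]}, constructors
    )
    # No class in the chain has a constructor: the object is left as is.
    untouched = apply_constructor([1, 2], {"class": ["unknown"]}, constructors)
```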

By default, a dictionary named `DEFAULT_CLASS_MAP` is passed to `convert()` including constructors for commonly used classes, such as `data.frame`, `ordered` or the aforementioned `factor`.
Users who want different conversions for basic R objects only need to create a subclass of the `Converter` class.
Several utility functions, such as the routines `convert_char()` and `convert_list()`, are exposed by the conversion module in order for users to be able to reuse them for that purpose.

# Ongoing work

To broaden the utility of the `rdata` library to data processing pipelines with steps in both R and Python, we are currently extending the library with the capability to write compatible Python objects to RData files.
As an example, such a pipeline is present in the Hmsc-HPC code [@rahman+_2024_hmschpc], the continuous development of which has been driving the ongoing work on the writing functionality in the `rdata` library.
The writing of RData files is implemented as a two-step process similar to reading: first, the Python object is converted to the tree-like intermediate representation used in parsing, and then this intermediate representation is written to an RData file.
Currently, the writing functionality supporting common types is available in the development branch of the `rdata` library.

# Acknowledgements

This work has received funding
from the Spanish Ministry of Education and Innovation, projects PID2019-106827GB-I00 / AEI / 10.13039/501100011033 and PID2019-109387GB-I00,
from an FPU grant (Formación de Profesorado Universitario) from the Spanish Ministry of Science, Innovation and Universities (MICIU) with reference FPU18/00047,
and from the European Union's Horizon Europe research and innovation programme under grant agreement No 101057437 (BioDT project, [https://doi.org/10.3030/101057437](https://doi.org/10.3030/101057437)).
Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.

# References