Releases: projectglow/glow
v2.0.3
What's Changed
- fixing scala logistic regression test by @kermany in #705
- Update sbt-scoverage to 2.2.0 by @scala-steward-projectglow in #704
- fixed `opt_einsum` contract incompatibility issue for linear/logistic regression by @kermany in #710
- Update development version to 2.0.4 by @github-actions in #718
Full Changelog: v2.0.2...v2.0.3
v2.0.0
What's Changed
Major changes
- Support Spark 3.4 and 3.5
- Add functions for left and left semi joins with overlap criteria accelerated by Databricks' range join optimization
- Register SQL functions via the SQL extension service provider interface, so `glow.register` is no longer necessary if Glow is on the classpath when Spark is launched
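The new joins pair rows whose genomic intervals intersect. A minimal pure-Python sketch of the overlap predicate and the two join flavors mentioned above; the function names and tuple layout are illustrative, not Glow's API (a range join optimization accelerates this predicate by binning intervals instead of checking every row pair):

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end

def left_overlap_join(left, right):
    """Left join on interval overlap: every left interval appears at least once,
    paired with each overlapping right interval, or with None if none match."""
    out = []
    for a in left:
        matches = [b for b in right if overlaps(a[0], a[1], b[0], b[1])]
        if matches:
            out.extend((a, b) for b in matches)
        else:
            out.append((a, None))
    return out

def left_semi_overlap_join(left, right):
    """Left semi join: left intervals with at least one overlapping right
    interval, without duplicating the left row per match."""
    return [a for a in left
            if any(overlaps(a[0], a[1], b[0], b[1]) for b in right)]
```

With half-open intervals, touching endpoints (`a_end == b_start`) do not count as an overlap, which matches the usual convention for 0-based genomic coordinates.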
Other user facing changes
- Remove Hail integration
- Remove features that frequently cause incompatibilities between versions (`aggregate_by_index`, the CSV pipe transformer). Workarounds are provided in the documentation.
Internal changes
- Future proof for Spark 4.0 / Scala 2.13 / JDK 17
- Migrate CI and release process to GitHub Actions
Overlap join benchmarks
On a dataset with 1B left rows and 1M right rows, with varying percentages of SNPs in the left table (tested with a single 4-core executor due to quota):
- Inner range join + left join, all SNP percentages: 4h
- Glow join, 0% SNPs: 4h
- Glow join, 50% SNPs: 2h9m
- Glow join, 90% SNPs: 0h42m
Other notes
The Python source artifact is built from tag v2.0.0-conda in order to fix Glow's conda recipe.
New Contributors
- @dvcastillo made their first contribution in #505
- @dtzeng made their first contribution in #519
- @srowen made their first contribution in #524
- @a-li made their first contribution in #522
- @scala-steward-projectglow made their first contribution in #555
Full Changelog: v1.2.1...v2.0.0
v1.2.1
v1.2.1 bumps Glow to Spark v3.2.1.
This release includes Java/Scala artifacts in Maven Central and Python artifacts in PyPI. Docker containers projectglow/open-source-glow:1.2.1, projectglow/databricks-glow:1.2.1, projectglow/databricks-glow:10.4, and projectglow/databricks-hail:0.2.93 can be found in projectglow's Docker Hub. The Glow notebook continuous integration test now uses Databricks Runtime 10.4, which is on Spark 3.2.1 (workflow definition json).
Glow leverages private Catalyst APIs that changed from Spark 3.1 to Spark 3.2, so we wrote a shim to maintain backwards compatibility. Spark 2 is now end of life (EoL): Databricks, AWS EMR, and Google Dataproc depend on Hadoop 3.x, which is incompatible with Spark 2. We are therefore removing support for Spark 2, including the Spark 2 continuous integration (CI/CD) tests performed with CircleCI. Glow version 1.1.2 is the last release that supports Spark 2.
The Spark 3 CI/CD tests depend on Hail, and they have been failing because Hail does not yet support Spark 3.2; Hail is waiting on Google Dataproc and AWS EMR to upgrade from Spark 3.1. For now we expect the Spark 3 CircleCI tests to continue failing until the Hail tests can be resolved. However, we moved forward with this release because it is unclear when Dataproc or EMR will support Spark 3.2.
Thanks to Alex Barreto, Jasser Abidi, Cameron Smith, Marcus Henry, Karen Feng, Joseph Bradley, and William Brandler for their contributions to this release.
New Contributors
- @cameronraysmith made their first contribution in #483
- @JassAbidi and @jkbradley made their first contributions in #501
Full Changelog: v1.1.2...v1.2.1
Release v1.1.2
v1.1.2
Glow v1.1.2 incorporates new functionality for quarantining records with the pipe transformer.
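The idea behind quarantining is that records the piped command fails to process are set aside rather than failing the whole batch. A minimal pure-Python sketch of that pattern; the function name is illustrative, and Glow's actual pipe transformer streams records through an external command rather than a Python callable:

```python
def pipe_with_quarantine(records, transform):
    """Apply `transform` to each record. Records that raise an exception are
    collected in a quarantine list instead of aborting the run, so the
    successfully processed records are still usable."""
    processed, quarantined = [], []
    for rec in records:
        try:
            processed.append(transform(rec))
        except Exception:
            quarantined.append(rec)
    return processed, quarantined
```

For example, `pipe_with_quarantine(["1", "2", "x"], int)` returns the parsed integers along with the unparseable record `"x"` quarantined for later inspection.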
This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.
New Contributors
- @dmoore247 made their first contribution in #408
- @mah-databricks made their first contribution in #418
Full Changelog: v1.1.1...v1.1.2
Release v1.1.1
v1.1.1
Glow v1.1.1 incorporates new functionality for sample masking in GWAS, which has been documented as a quickstart guide. Nightly notebook tests are now dockerized, making it easier to integrate Glow with other bioinformatics libraries. VEP schema changes fix a bug with indel parsing.
This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.
What's Changed
- Dockerize ci tests by @williambrandler in #414
- Releasev110 by @williambrandler in #411
- adding codecov.yml by @williambrandler in #413
- remove init script from nb test by @williambrandler in #415
- Fix VEP parsing failures stemming from indels by @bboutkov in #402
- Extending sample masking functionality in gwas linear regression by @bcajes in #416
- fix bedtools path by @williambrandler in #417
- add vep example by @williambrandler in #382
- Docker containers for Glow runtime environment on Databricks by @a0x8o in #420
- remove extraneous detail from quickstart docs by @williambrandler in #428
- add data simulation doc page by @williambrandler in #427
- fix pandas lmm notebook link by @williambrandler in #430
Credits
Alex Barreto, Boris Boutkov, Brian Cajes, Karen Feng, William Brandler, dim de grave
Full Changelog: v1.1.0...v1.1.1
v1.1.0
v1.1.0 bumps the Spark version of Glow to 3.1.2
Glow also now runs automated nightly testing of notebooks in the docs, making it easier for users to contribute code or algorithms to help others make use of Glow
This release includes Java/Scala artifacts in Maven Central, and Python artifacts in PyPI and Conda Forge.
Notable changes:
- Upgrade Spark dependency from 3.0.0 to 3.1.2 #396
- Create integration test script #373
- Hail related enhancements #377
- Remove typecheck for numpy arrays #366
Credits: Brian Cajes, Karen Feng, William Brandler, dim de grave
v1.0.1
v1.0.0
We are excited to announce the release of Glow 1.0.0. This release includes major scalability and usability improvements, particularly for GloWGR whole-genome regression and genome-wide association study regression tests. These improvements create a more performant GloWGR workflow with simpler APIs.
Major features and changes include:
- #302, #309: Pandas-based linear regression. Introduced the `linear_regression` Python function, which can be used to perform GWAS linear regression tests for multiple phenotypes simultaneously. The function is optimized for performance through one-time calculation of intermediate matrices common across multiple phenotypes and genotypes. The function can also accept WGR terms as an offset parameter. This function is superior in performance to the existing SQL-based `linear_regression_gwas` function, which only works on a single phenotype.
- #316, #318, #319: Pandas-based logistic regression. Introduced the `logistic_regression` Python function with the same properties mentioned above for linear regression. This function implements a fast multi-phenotype, multi-genotype score test with fallback logic for significant variants indicated by the score test. The currently supported fallback test is the approximate Firth method presented in REGENIE.
- #323: Improved the WGR API so that the user can now provide all the input to a single class and run different functions without passing any arguments. An `estimate_loco_offsets` function was added to perform end-to-end generation of LOCO predictors using a single command. In addition, GloWGR was revised to make its behavior regarding standardization of phenotypes and genotypes, and treatment of the intercept, match the REGENIE algorithm.
- #300: Conversion from Hail MatrixTables to Glow-compatible Spark DataFrames.
- #274: Faster default VCF reader.
- #294: Streamlined GloWGR between WGR and GWAS functions.
- #282: Improved scalability of GloWGR.
- #303: Added hard calling by default to the BGEN reader.
Backwards-incompatible changes:
- #326: Changed the Glow `register` function to not modify the Spark session by default.
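The performance gain from multi-phenotype linear regression comes from computing the covariate projection once and sharing it across all phenotypes and variants. A minimal NumPy sketch of that idea; the function name, argument shapes, and return value are illustrative, not Glow's `linear_regression` API, and we assume non-constant genotypes so the denominator is nonzero:

```python
import numpy as np

def multi_phenotype_linreg(genotypes, phenotypes, covariates):
    """Effect-size estimates for many (variant, phenotype) pairs at once.
    genotypes:  (n_samples, n_variants)
    phenotypes: (n_samples, n_phenotypes)
    covariates: (n_samples, n_covariates), e.g. including an intercept column
    Returns a (n_variants, n_phenotypes) matrix of betas."""
    # One-time work shared across every phenotype and variant:
    Q, _ = np.linalg.qr(covariates)          # orthonormal basis for covariates
    G = genotypes - Q @ (Q.T @ genotypes)    # genotypes with covariates projected out
    Y = phenotypes - Q @ (Q.T @ phenotypes)  # phenotypes with covariates projected out
    gtg = np.sum(G * G, axis=0)              # per-variant g'g (assumed nonzero)
    # Single matrix product yields betas for all pairs simultaneously:
    return (G.T @ Y) / gtg[:, None]
```

Because the residualization against covariates is done once up front, adding more phenotypes costs only extra columns in one matrix multiply, rather than one full regression per phenotype per variant.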
v0.6.0
We are excited to announce the release of Glow 0.6.0. This release includes both Java/Scala and Python artifacts, which can be found in Maven Central and PyPI, respectively. Please note that the name of the Maven artifacts has changed from `glow` to `glow-spark3` and `glow-spark2`, as Glow is now released for both versions of Spark.
Notable additions/changes are:
- #245 Added GloWGR for binary traits
- #240 Input validation for GloWGR
- #242 `transform_loco` function for `RidgeRegression`, which applies the fitted model in a leave-one-chromosome-out scheme to get phenotype predictors for each chromosome
- #243 `reshape_for_gwas` convenience function to prepare the output of GloWGR for use in Glow GWAS functions
- #285 Improved performance of the `lift_over_variants` transformer
- #249 Faster conversion from Python double arrays to Java arrays
- #276 Added support for reading uncompressed or zstd-compressed BGEN files
- #254, #291 Feature to cross-release for Spark 3 and Spark 2
- #258 Fixed error in Python literal conversion
- #264 Fixed splittability state of non-compressed VCFs
- #271, #281 Minor fixes to GloWGR
- #247, #250, #252, #273, #275, #279, #287 Documentation, notebook, and blog improvements
- Other minor fixes
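The leave-one-chromosome-out (LOCO) scheme behind `transform_loco` can be sketched in a few lines of pure Python: for each chromosome, the predictor is the sum of the fitted predictions from every other chromosome. The function name and data layout here are illustrative, not Glow's implementation:

```python
def loco_predictions(per_chromosome_preds):
    """per_chromosome_preds: dict mapping chromosome name to a list of
    per-sample predictions from the model fitted on that chromosome.
    Returns, for each chromosome, the per-sample sum of predictions from
    all *other* chromosomes (total minus own contribution)."""
    total = [sum(vals) for vals in zip(*per_chromosome_preds.values())]
    return {
        chrom: [t - p for t, p in zip(total, preds)]
        for chrom, preds in per_chromosome_preds.items()
    }
```

Excluding a chromosome's own contribution from its offset avoids proximal contamination: a variant tested on chromosome 1 is never adjusted by a predictor that was itself fitted on chromosome 1.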
v0.5.0
This release features the initial release of GloWGR, a framework for distributed whole genome regression. For more information, see the blog post and user guide.
Additional features:
- #222: Accept non-string arguments in transformers
- #213: Accept numpy `ndarray`s as literal arguments to GWAS functions
- #228: Add a user guide for merging variant datasets with Glow