Releases: mcvickerlab/GenVarLoader

0.8.0

05 Feb 05:21

v0.8.0 (2025-02-05)

Feat

  • sequence annotations

v0.7.3 (2025-01-27)

Feat

  • allow subset_to() to accept boolean masks and polars Series
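
As a rough illustration of what accepting boolean masks can involve, the sketch below normalizes a selector that is either a boolean mask or a list of integer indices into a plain list of indices. The helper `to_indices` is hypothetical and not part of GenVarLoader; the real `subset_to()` signature and its handling of polars Series may differ.

```python
from typing import Sequence, Union


def to_indices(selector: Sequence[Union[bool, int]], n_items: int) -> list[int]:
    """Normalize a boolean mask or integer indices into a list of indices.

    Hypothetical helper for illustration only; not GenVarLoader's API.
    """
    items = list(selector)
    # bool is a subclass of int, so check for bool explicitly first.
    if items and all(isinstance(x, bool) for x in items):
        if len(items) != n_items:
            raise ValueError(f"mask has length {len(items)}, expected {n_items}")
        return [i for i, keep in enumerate(items) if keep]
    for i in items:
        if not -n_items <= i < n_items:
            raise IndexError(f"index {i} out of range for {n_items} items")
    # Resolve negative indices relative to the end.
    return [i % n_items for i in items]


print(to_indices([True, False, True, False], n_items=4))  # [0, 2]
print(to_indices([3, -1, 0], n_items=4))  # [3, 3, 0]
```

A mask must match the length of the axis it selects from, whereas an index list can have any length and may repeat entries.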

Fix

  • add test for subset_to
  • update tests to match internal API changes
  • bug in mark_keep_variants with spanning deletions.

v0.7.2 (2025-01-26)

Fix

  • change loop order to only open files once.
  • respect memory limits when writing bigwig data.
  • online docs notebook syntax highlighting
  • better docs.

v0.7.1 (2025-01-17)

Fix

  • bump version
  • scalar dataset indexing, region_indices order, updated docs, hotfixes

v0.7.0 (2025-01-17)

Feat

  • indexing matches input bed file. make sel() a private method pending better API design. fix: pass tests for indexing and separate indexing and subsetting logic into DatasetIndexer.
  • write input regions to disk with a column mapping each to a row in the sorted dataset regions
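
The idea of storing regions sorted while still indexing in input BED order can be sketched as follows. This is a minimal illustration, not GenVarLoader's on-disk format: the function name `sort_with_mapping` is hypothetical, and the plain tuple sort (lexicographic on contig name) is an assumption.

```python
def sort_with_mapping(regions):
    """Sort (chrom, start, end) regions and map each input row to its sorted row.

    Returns (sorted_regions, input_to_sorted) where
    sorted_regions[input_to_sorted[i]] == regions[i].
    """
    # order[k] is the input index of the region placed at sorted row k.
    order = sorted(range(len(regions)), key=lambda i: regions[i])
    sorted_regions = [regions[i] for i in order]
    # Invert the permutation: input row -> sorted row.
    input_to_sorted = [0] * len(regions)
    for sorted_row, input_row in enumerate(order):
        input_to_sorted[input_row] = sorted_row
    return sorted_regions, input_to_sorted


bed = [("chr2", 500, 600), ("chr1", 900, 1000), ("chr1", 100, 200)]
sorted_regions, input_to_sorted = sort_with_mapping(bed)
print(input_to_sorted)  # [2, 1, 0]
# Indexing through the mapping recovers the input BED order.
assert [sorted_regions[j] for j in input_to_sorted] == bed
```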

Fix

  • passing tests

v0.6.4 (2024-12-16)

Fix

  • update rust dependencies.

v0.6.3 (2024-12-16)

Fix

  • unintended torch requirements

v0.6.2 (2024-12-16)

Fix

  • update version
  • StratifiedSampler requires torch. fix: remove deprecated conda env files.

v0.6.1 (2024-11-25)

Fix

  • handle empty genotypes during gvl write. fix: PgenGenos sample_idx should be sorted when compared to current sample_idx.

v0.6.0 (2024-09-03)

Feat

  • bump version
  • geuvadis tutorial.
  • tutorial notebook, pooch dependency.

Fix

  • update available tracks after writing transformed ones to disk.

v0.5.6 (2024-08-07)

Fix

  • bump version
  • offsets can overflow int32, use int64 instead.
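
To see why ragged-array offsets can overflow int32, note that offsets are a running sum of row lengths, so the last offset equals the total number of elements. The sketch below emulates 32-bit storage with the standard library's `ctypes.c_int32`; the function `offsets_from_lengths` is a hypothetical stand-in, not the library's code.

```python
import ctypes
from itertools import accumulate


def offsets_from_lengths(lengths, as_int32=False):
    """Build ragged-array offsets (a running sum with a leading 0)."""
    offsets = [0, *accumulate(lengths)]
    if as_int32:
        # Emulate storing each offset in a signed 32-bit integer.
        return [ctypes.c_int32(o).value for o in offsets]
    return offsets


lengths = [2**30, 2**30, 2**30]  # three rows of about 1.07 billion elements each
print(offsets_from_lengths(lengths, as_int32=True))
# [0, 1073741824, -2147483648, -1073741824]
print(offsets_from_lengths(lengths))
# [0, 1073741824, 2147483648, 3221225472]
```

Once the total exceeds 2**31 - 1, the 32-bit offsets wrap to negative values, which is why a 64-bit type is needed for large datasets.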

v0.5.5 (2024-08-02)

Fix

  • make Records.vars_in_range... functions fallible by returning None instead of "empty" RecordInfo instances. This fixes downstream behavior of the Variants.read.. methods when there are no variants in the query. feat: when reading VCFs for the first time and no index is found, try to index them first before raising an error. fix: better docstrings on attributes of private API.
  • add build number to replace yanked release

v0.5.4 (2024-07-05)

Fix

  • fix breaking changes from polars 1.0

v0.5.3 (2024-07-05)

Fix

  • fix breaking changes from polars 1.0

v0.5.2 (2024-07-05)

Fix

  • typo in pyproject causing dependencies to be ignored.

v0.5.1 (2024-06-29)

Feat

  • prep for readthedocs
  • prepare for online documentation.

Fix

  • add favicon
  • documentation formatting
  • rtd config
  • readthedocs dependencies
  • readthedocs config

v0.5.0 (2024-06-13)

Feat

  • bump version
  • multiprocess reading of genotypes, both VCF and PGEN. fix: bug in reading genotypes from PGEN

v0.4.1 (2024-06-11)

Fix

  • bump version
  • got number of regions from wrong array in get_reference

v0.4.0 (2024-06-05)

Feat

  • deprecate old loader, worse performance. reorganize code.

Fix

  • better documentation in README. feat!: rename write_transformed_tracks to write_transformed_track. feat: more ergonomic indexing.

v0.3.3 (2024-06-01)

Fix

  • bump version
  • wrong max_ends from SparseGenotypes.from_dense_with_length due to data races/incorrect parallel semantics for numba
  • diffs need to be clipped and negated when computing shifts

Perf

  • pad haplotypes on-the-fly to avoid extra copying of reference subsequences

v0.3.2 (2024-04-29)

Feat

  • can convert Records back to a polars DataFrame with minimal copying via conversion of VLenAlleles to pyarrow buffers
  • make open_with_settings the standard open function. fix: recognize .bgz extension for fasta files

Fix

  • remove dynamic versioning table
  • move cli to main feat: generalize Variants to automatically identify whether vcf or pgen is passed
  • move cli to script in python source directory, maturin limitation?
  • wrong implementation of heuristic for extending genotypes.

Perf

  • faster sparsifying genotypes. feat: log level for cli. fix: clip missing lengths for appropriate end extension.

v0.3.1 (2024-04-16)

Feat

  • benchmark interval decompression on cpu with numba vs. cpu with taichi vs. gpu with taichi
  • optionally decompress intervals to tracks on gpu
  • initial support for stranded regions
  • option to cache fasta files as numpy arrays.
  • implement BigWig intervals as Rust extension.
  • finishing touches on multi-track implementation. Blocker is a cryptic issue where writing genotypes somehow prevents joblib from launching new processes.
  • stop overwriting by default, add option.
  • transforms directly on tracks. feat: intervals as array of structs for better data locality.
  • let extra tracks get added via paths
  • initial support for indels in tracks and WIP on also returning auxiliary genome wide tracks.
  • initial sparse genos -> haplotypes and sparse hap diffs.
  • wip sparse genotypes.
  • properties for getting haplotypes, references, or tracks only.
  • encourage num_workers <= 1 with GVL dataloader.
  • freeze gvl.Dataset to prevent user from accidentally introducing invalid states. feat: warn if any query contigs have either no variants or intervals associated with them.
  • warn instead of error when no reference passed and genos present.
  • disable overwriting by default, have no args be help.
  • also report number of samples.
  • add .from_table constructor for BigWigs.
  • move CLI to script, include in package.
  • use a table to specify bigwigs instead. fix: jittering.
  • add script to write datasets to disk.
  • more quality of life improvements. relax dependency version constraints.
  • with_seed method
  • quality of life methods for subsetting and converting to dataloaders.
  • torch convenience functions fix: ensure genotypes and intervals written in sorted order wrt the BED file.
  • pre-computed implementation.

Fix

  • dependency typo
  • remove taichi interval to track implementation since it did not improve performance, even on GPU
  • need to subset arrays to be reverse complemented
  • change argument order of subset_to to match the rest of the API. fix: simplify subset implementation.
  • remove python 3.10 type hints
  • dimension order on subsets.
  • make variant indices absolute on write.
  • sparse genotypes layout
  • wrong layout of genotypes and wrong max ends computation.
  • ragged array layouts for correct concatenation when writing datasets one contig at a time.
  • bug where init_intervals would not initialize all available tracks.
  • track_to_intervals had wrong n_intervals and thus, wrong offsets.
  • bug in computing max ends.
  • match serde for genome tracks.
  • bug in open state management.
  • bug when writing genotypes where the chromosome of the requested regions is not present in the VCF.
  • bug getting intersection of samples available.
  • sum wrong axis in adjust multi index.
  • make GVLDataset getitem API match torch Dataset API (i.e. use raveled index)
  • QOL improvements.
  • incorrect genotypes returned from VCF when queries have overlapping ranges.
  • wrong shape.
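
The "raveled index" mentioned in the fixes above refers to addressing a two-dimensional (region, sample) dataset with a single integer, as a torch-style `__getitem__` expects. A minimal sketch, assuming row-major order over (regions, samples); the function names are hypothetical and the library's actual layout may differ:

```python
def unravel(flat_index, n_regions, n_samples):
    """Map a flat dataset index to (region, sample), assuming row-major order."""
    n_items = n_regions * n_samples
    if not -n_items <= flat_index < n_items:
        raise IndexError(f"index {flat_index} out of range for {n_items} items")
    # The modulo resolves negative indices relative to the end.
    return divmod(flat_index % n_items, n_samples)


def ravel(region, sample, n_samples):
    """Inverse of unravel for non-negative indices."""
    return region * n_samples + sample


print(unravel(7, n_regions=3, n_samples=4))  # (1, 3)
print(ravel(1, 3, n_samples=4))  # 7
```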

Refactor

  • move construct virtual data to loader so utils import faster.
  • rename util to utils.
  • move write under dataset directory. perf?: move indexing operations into numba.
  • move cli to script outside package, faster help message.
  • break up dataset implementation into smaller files. refactor!: condense with_ methods into single with_settings() methods. feat: sel() and isel() methods for eager retrieval by sample and region.

Perf

  • when opening with settings and providing a reference, but return_sequences is false, don't load the reference into memory.

v0.3.0 (2024-03-15)

Feat

  • write ZarrTracks in smaller chunks.

Fix

  • remove wip vidx feature.
  • relax numba version constraint
  • rounding issues for setting fixed lengths on BED regions.
  • more informative vcf record progress bar.

v0.3.0rc6 (2024-03-11)

Feat

  • improve record query performance by allowing nearest_nonoverlapping index adjustment to be computed on-the-fly in the weighted activity selection algorithm and thus also benefit from early stopping.
  • more descriptive progress bar for constructing ZarrGenos from another file.
  • add progress bar for reading VCF records.
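
For readers unfamiliar with weighted activity selection (also called weighted interval scheduling), the textbook dynamic program is sketched below, with the nearest non-overlapping predecessor looked up inside the main loop rather than precomputed. This is a generic illustration under assumed half-open intervals and does not reproduce GenVarLoader's implementation or its early-stopping logic.

```python
from bisect import bisect_right


def max_weight_nonoverlapping(intervals):
    """Pick non-overlapping half-open (start, end, weight) intervals
    with maximal total weight. Returns (best_weight, chosen_intervals).
    """
    ivs = sorted(intervals, key=lambda iv: iv[1])  # sort by end
    ends = [end for _, end, _ in ivs]
    # best[k] = best total weight using only the first k intervals.
    best = [0] * (len(ivs) + 1)
    prev = [0] * len(ivs)
    for j, (start, _, weight) in enumerate(ivs):
        # Nearest non-overlapping predecessor, found on the fly:
        # the number of earlier intervals whose end is <= this start.
        prev[j] = bisect_right(ends, start, 0, j)
        best[j + 1] = max(best[j], best[prev[j]] + weight)
    # Backtrack to recover the chosen intervals.
    chosen, k = [], len(ivs)
    while k > 0:
        _, _, weight = ivs[k - 1]
        if best[prev[k - 1]] + weight > best[k - 1]:
            chosen.append(ivs[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    return best[-1], chosen[::-1]


total, picked = max_weight_nonoverlapping(
    [(0, 5, 4), (3, 8, 6), (5, 10, 3), (8, 12, 2)]
)
print(total, picked)  # 8 [(3, 8, 6), (8, 12, 2)]
```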

Fix

  • pylance update, catch possibly unbound variables.
  • instead of failing, raise warning when encountering non-SNP, non-INDEL variants and skip them.

v0.3.0rc5 (2024-03-04)

Fix

  • more descriptive pbar when writing ZarrTracks from another reader.
  • BigWigs, only keep contigs that are shared across all bigwigs.
  • better error messages and catching cases for non-SNP, non-INDEL variants.
  • avoid segfault caused when a TensorStore is forked to new processes.
  • make ZarrTracks implement Reader protocol. feat: add NumpyGenos for in-memory representation. feat: better ZarrGenos.from_recs_genos progress bar.

v0.3.0rc4 (2024-02-29)

Fix

  • naming of .ends.gvl.arrow to .gvl.ends.arrow so file suffix parsing works correctly.

v0.3.0rc3 (2024-02-29)

v0.3.0rc2 (2024-02-29)

Fix

  • remove pyd4 dependency, had unspectacular performance.

v0.3.0-rc.1 (2024-02-28)

Feat

  • add ZarrTracks for much faster performance than D4.
  • finish deprecating parallel GVL...