Releases: mcvickerlab/GenVarLoader
v0.8.0 (2025-02-05)
Feat
- sequence annotations
v0.7.3 (2025-01-27)
Feat
- allow subset_to() to accept boolean masks and polars Series
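As an illustration of what accepting boolean masks can involve, the sketch below normalizes either a boolean mask or integer indices into one index array. The helper name `to_indices` and its signature are hypothetical and are not part of GenVarLoader; a polars Series could first be converted with `Series.to_numpy()`.

```python
import numpy as np


def to_indices(selector, n_items):
    """Hypothetical helper: turn a boolean mask or integer indices into an index array."""
    arr = np.asarray(selector)
    if arr.dtype == bool:
        # A mask must have exactly one entry per item it selects from.
        if arr.shape != (n_items,):
            raise ValueError(f"mask has shape {arr.shape}, expected ({n_items},)")
        return np.flatnonzero(arr)
    # Anything else is treated as integer positions and passed through.
    return arr.astype(np.int64)


mask = [True, False, True, True]
idx = to_indices(mask, n_items=4)
print(idx.tolist())  # [0, 2, 3]
```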
Fix
- add test for subset_to
- update tests to match internal API changes
- bug in mark_keep_variants with spanning deletions.
v0.7.2 (2025-01-26)
Fix
- change loop order to only open files once.
- respect memory limits when writing bigwig data.
- online docs notebook syntax highlighting
- better docs.
v0.7.1 (2025-01-17)
Fix
- bump version
- scalar dataset indexing, region_indices order, updated docs, hotfixes
v0.7.0 (2025-01-17)
Feat
- indexing matches input bed file. make sel() a private method pending better API design. fix: pass tests for indexing and separate indexing and subsetting logic into DatasetIndexer.
- write input regions to disk with a column mapping each to a row in the sorted dataset regions
Fix
- passing tests
v0.6.4 (2024-12-16)
Fix
- update rust dependencies.
v0.6.3 (2024-12-16)
Fix
- unintended torch requirements
v0.6.2 (2024-12-16)
Fix
- update version
- StratifiedSampler requires torch. fix: remove deprecated conda env files.
v0.6.1 (2024-11-25)
Fix
- handle empty genotypes during gvl write. fix: PgenGenos sample_idx should be sorted when compared to current sample_idx.
v0.6.0 (2024-09-03)
Feat
- bump version
- geuvadis tutorial.
- tutorial notebook, pooch dependency.
Fix
- update available tracks after writing transformed ones to disk.
v0.5.6 (2024-08-07)
Fix
- bump version
- offsets can overflow int32, use int64 instead.
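The int32 overflow mentioned above can be reproduced with plain NumPy. This is a minimal sketch with made-up lengths, not the library's code: cumulative offsets computed in int32 wrap around silently once they pass 2**31 - 1, while int64 holds the same values without trouble.

```python
import numpy as np

# Three made-up segment lengths whose running total exceeds the int32 maximum.
lengths = np.array([2**30, 2**30, 2**30], dtype=np.int32)

# Accumulating in int32 wraps around without raising an error.
offsets32 = np.cumsum(lengths, dtype=np.int32)

# Accumulating in int64 keeps the true running totals.
offsets64 = np.cumsum(lengths, dtype=np.int64)

print(offsets32.tolist())  # [1073741824, -2147483648, -1073741824]
print(offsets64.tolist())  # [1073741824, 2147483648, 3221225472]
```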
v0.5.5 (2024-08-02)
Fix
- make Records.vars_in_range... functions fallible by returning None instead of "empty" RecordInfo instances. This fixes downstream behavior of the Variants.read.. methods when there are no variants in the query. feat: when reading VCFs for the first time and no index is found, try to index them first before raising an error. fix: better docstrings on attributes of private API.
- add build number to replace yanked release
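The "return None instead of an empty instance" pattern from the first fix in this release can be sketched as follows. The function below is a hypothetical stand-in, not the actual `Records` API: it looks up sorted variant positions in a half-open range and returns `None` when nothing falls inside, so callers must handle the empty case explicitly.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def vars_in_range(positions: List[int], start: int, end: int) -> Optional[Tuple[int, int]]:
    """Hypothetical lookup: index slice (lo, hi) of sorted positions in [start, end), or None."""
    lo = bisect_left(positions, start)
    hi = bisect_left(positions, end)
    if lo == hi:
        # No variants in the query: signal that with None rather than an "empty" object.
        return None
    return lo, hi


positions = [100, 250, 400, 900]
hit = vars_in_range(positions, 200, 500)
miss = vars_in_range(positions, 500, 800)
print(hit)   # (1, 3)
print(miss)  # None
```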
v0.5.4 (2024-07-05)
Fix
- fix breaking changes from polars 1.0
v0.5.3 (2024-07-05)
Fix
- fix breaking changes from polars 1.0
v0.5.2 (2024-07-05)
Fix
- typo in pyproject causing dependencies to be ignored.
v0.5.1 (2024-06-29)
Feat
- prep for readthedocs
- prepare for online documentation.
Fix
- add favicon
- documentation formatting
- rtd config
- readthedocs dependencies
- readthedocs config
v0.5.0 (2024-06-13)
Feat
- bump version
- multiprocess reading of genotypes, both VCF and PGEN. fix: bug in reading genotypes from PGEN
v0.4.1 (2024-06-11)
Fix
- bump version
- got number of regions from wrong array in get_reference
v0.4.0 (2024-06-05)
Feat
- deprecate old loader, worse performance. reorganize code.
Fix
- better documentation in README. feat!: rename write_transformed_tracks to write_transformed_track. feat: more ergonomic indexing.
v0.3.3 (2024-06-01)
Fix
- bump version
- wrong max_ends from SparseGenotypes.from_dense_with_length due to data races/incorrect parallel semantics for numba
- diffs need to be clipped and negated when computing shifts
Perf
- pad haplotypes on-the-fly to avoid extra copying of reference subsequences
v0.3.2 (2024-04-29)
Feat
- can convert Records back to a polars DataFrame with minimal copying via conversion of VLenAlleles to pyarrow buffers
- make open_with_settings the standard open function. fix: recognize .bgz extension for fasta files
Fix
- remove dynamic versioning table
- move cli to main feat: generalize Variants to automatically identify whether vcf or pgen is passed
- move cli to script in python source directory, maturin limitation?
- wrong implementation of heuristic for extending genotypes.
Perf
- faster sparsifying genotypes. feat: log level for cli. fix: clip missing lengths for appropriate end extension.
v0.3.1 (2024-04-16)
Feat
- benchmark interval decompression on cpu with numba vs. cpu with taichi vs. gpu with taichi
- optionally decompress intervals to tracks on gpu
- initial support for stranded regions
- option to cache fasta files as numpy arrays.
- implement BigWig intervals as Rust extension.
- finishing touches on multi-track implementation. Blocker is a cryptic issue where writing genotypes somehow prevents joblib from launching new processes.
- stop overwriting by default, add option.
- transforms directly on tracks. feat: intervals as array of structs for better data locality.
- let extra tracks get added via paths
- initial support for indels in tracks and WIP on also returning auxiliary genome wide tracks.
- initial sparse genos -> haplotypes and sparse hap diffs.
- wip sparse genotypes.
- properties for getting haplotypes, references, or tracks only.
- encourage num_workers <= 1 with GVL dataloader.
- freeze gvl.Dataset to prevent users from accidentally introducing invalid states. feat: warn if any query contigs have no variants or no intervals associated with them.
- warn instead of error when no reference passed and genos present.
- disable overwriting by default, have no args be help.
- also report number of samples.
- add .from_table constructor for BigWigs.
- move CLI to script, include in package.
- use a table to specify bigwigs instead. fix: jittering.
- add script to write datasets to disk.
- more quality of life improvements. relax dependency version constraints.
- with_seed method
- quality of life methods for subsetting and converting to dataloaders.
- torch convenience functions fix: ensure genotypes and intervals written in sorted order wrt the BED file.
- pre-computed implementation.
Fix
- dependency typo
- remove taichi interval to track implementation since it did not improve performance, even on GPU
- need to subset arrays to be reverse complemented
- change argument order of subset_to to match the rest of the API. fix: simplify subset implementation.
- remove python 3.10 type hints
- dimension order on subsets.
- make variant indices absolute on write.
- sparse genotypes layout
- wrong layout of genotypes and wrong max ends computation.
- ragged array layouts for correct concatenation when writing datasets one contig at a time.
- bug where init_intervals would not initialize all available tracks.
- track_to_intervals had wrong n_intervals and thus, wrong offsets.
- bug in computing max ends.
- match serde for genome tracks.
- bug in open state management.
- bug when writing genotypes where the chromosome of the requested regions is not present in the VCF.
- bug getting intersection of samples available.
- sum wrong axis in adjust multi index.
- make GVLDataset getitem API match torch Dataset API (i.e. use raveled index)
- QOL improvements.
- incorrect genotypes returned from VCF when queries have overlapping ranges.
- wrong shape.
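The "raveled index" fix in this list refers to addressing a multi-dimensional dataset with a single flat integer, as a torch-style `Dataset` expects from `__getitem__`. Below is a minimal sketch of the idea. The class `FlatDataset` is hypothetical, and the row-major (region, sample) ordering is an assumption made for illustration, not something the changelog confirms.

```python
class FlatDataset:
    """Hypothetical 2D (region, sample) dataset exposed through one flat integer index."""

    def __init__(self, n_regions: int, n_samples: int):
        self.n_regions = n_regions
        self.n_samples = n_samples

    def __len__(self) -> int:
        return self.n_regions * self.n_samples

    def __getitem__(self, idx: int):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Unravel the flat index, assuming row-major order (samples vary fastest).
        region, sample = divmod(idx, self.n_samples)
        return region, sample


ds = FlatDataset(n_regions=3, n_samples=4)
print(len(ds))  # 12
print(ds[5])    # (1, 1)
print(ds[11])   # (2, 3)
```

For array-valued indices, `numpy.unravel_index` performs the same conversion.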
Refactor
- move construct virtual data to loader so utils import faster.
- rename util to utils.
- move write under dataset directory. perf?: move indexing operations into numba.
- move cli to script outside package, faster help message.
- break up dataset implementation into smaller files. refactor!: condense with_ methods into single with_settings() methods. feat: sel() and isel() methods for eager retrieval by sample and region.
Perf
- when opening with settings and providing a reference while return_sequences is false, don't load the reference into memory.
v0.3.0 (2024-03-15)
Feat
- write ZarrTracks in smaller chunks.
Fix
- remove wip vidx feature.
- relax numba version constraint
- rounding issues for setting fixed lengths on BED regions.
- more informative vcf record progress bar.
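One way rounding can go wrong when resizing BED regions to a fixed length is sketched below. This is only an illustration of the general hazard and is not a description of the actual bug. Rounding the start and end independently can produce a region that is off by one, whereas deriving the end from an integer start always gives exactly the requested length. Both function names are hypothetical.

```python
def fix_length_naive(start: int, end: int, length: int):
    """Rounds both ends independently; the result may not have the requested length."""
    center = (start + end) / 2
    return round(center - length / 2), round(center + length / 2)


def fix_length(start: int, end: int, length: int):
    """Integer-only version: compute the start once and derive the end from it."""
    center = (start + end) // 2
    new_start = center - length // 2
    return new_start, new_start + length


naive = fix_length_naive(100, 200, 51)
exact = fix_length(100, 200, 51)
print(naive, naive[1] - naive[0])  # (124, 176) 52
print(exact, exact[1] - exact[0])  # (125, 176) 51
```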
v0.3.0rc6 (2024-03-11)
Feat
- improve record query performance by allowing nearest_nonoverlapping index adjustment to be computed on-the-fly in the weighted activity selection algorithm and thus also benefit from early stopping.
- more descriptive progress bar for constructing ZarrGenos from another file.
- add progress bar for reading VCF records.
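For context on the weighted activity selection algorithm named in the first feature above, here is the textbook dynamic program for choosing a maximum-weight set of non-overlapping intervals. It is a generic sketch, not GenVarLoader's implementation: the function name and the sample intervals are made up, and the early stopping mentioned in the changelog is not shown. The nearest non-overlapping predecessor is looked up inside the loop rather than precomputed.

```python
from bisect import bisect_right
from typing import List, Tuple


def max_weight_nonoverlapping(intervals: List[Tuple[int, int, float]]) -> float:
    """Best total weight of non-overlapping half-open intervals (start, end, weight)."""
    ivs = sorted(intervals, key=lambda iv: iv[1])  # sort by end coordinate
    ends = [iv[1] for iv in ivs]
    best = [0.0] * (len(ivs) + 1)  # best[k]: optimum using only the first k intervals
    for j, (start, _end, weight) in enumerate(ivs):
        # Count the earlier intervals that end at or before this start, found on the fly.
        p = bisect_right(ends, start, 0, j)
        # Either skip interval j, or take it together with the best of its predecessors.
        best[j + 1] = max(best[j], weight + best[p])
    return best[-1]


example = [(0, 3, 2.0), (2, 5, 4.0), (4, 7, 4.0), (6, 9, 7.0)]
result = max_weight_nonoverlapping(example)
print(result)  # 11.0, from taking (2, 5) and (6, 9)
```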
Fix
- pylance update, catch possibly unbound variables.
- instead of failing, raise warning when encountering non-SNP, non-INDEL variants and skip them.
v0.3.0rc5 (2024-03-04)
Fix
- more descriptive pbar when writing ZarrTracks from another reader.
- BigWigs, only keep contigs that are shared across all bigwigs.
- better error messages and catching cases for non-SNP, non-INDEL variants.
- avoid segfault caused when a TensorStore is forked to new processes.
- make ZarrTracks implement Reader protocol. feat: add NumpyGenos for in-memory representation. feat: better ZarrGenos.from_recs_genos progress bar.
v0.3.0rc4 (2024-02-29)
Fix
- naming of .ends.gvl.arrow to .gvl.ends.arrow so file suffix parsing works correctly.
v0.3.0rc3 (2024-02-29)
v0.3.0rc2 (2024-02-29)
Fix
- remove pyd4 dependency, had unspectacular performance.
v0.3.0-rc.1 (2024-02-28)
Feat
- add ZarrTracks for much faster performance than D4.
- finish deprecating parallel GVL...