Version 1.15.0 (14th April 2023)
--------------
Version number bumped to reflect the official status of CRAM 3.1.
Updates:
* Formally accept CRAM 3.1 as an official standard. Warning removed.
For best compatibility CRAM 3.0 is still the default CRAM, but use
"-V3.1" to specify the version.
* Updated to latest htscodecs. This has a significant speed
improvement in encoding with fqzcomp (enabled in "-X small" profile).
Tested on a NovaSeq dataset, encoding from BAM to CRAM was 27% faster.
Decoding a CRAM with fqzcomp is also around 6% faster.
Version 1.14.15 (6th December 2022)
---------------
This is a bug fix release.
Updates:
* Switched to using GitHub actions for CI.
* Updated htscodecs submodule to the latest version (1.3.0).
Also fixed function names used within it.
* Minor code tidyups to remove compiler warnings.
Bug fixes:
* MacOS testing fix to cope with sed failing on long lines.
* Fixed bam_aux_skip B array handling of signed values.
* Improved detection of unsorted mode, which was causing a drastic
slow-down when encoding unsorted data.
* Fixed a CRAM reference multi-threading bug when fetching from a
fasta file.
* Fixed bam_aux_f decoding of floats.
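The two aux fixes above both concern the binary layout of BAM auxiliary
fields. As a rough illustration only, the sketch below follows the layout
given in the SAM/BAM specification rather than io_lib's own bam_aux_skip or
bam_aux_f code. The names aux_field_size and decode_float are hypothetical.
It shows why a skip routine must treat signed and unsigned B array subtypes
alike when sizing elements, and how an 'f' value is read.

```python
import struct

# Fixed-width BAM aux value types and their sizes in bytes (SAM/BAM spec).
TYPE_SIZE = {"A": 1, "c": 1, "C": 1, "s": 2, "S": 2, "i": 4, "I": 4, "f": 4}

def aux_field_size(buf, pos=0):
    """Return the total size in bytes of the aux field starting at buf[pos]."""
    typ = chr(buf[pos + 2])          # 2-byte tag, then a 1-byte type code
    if typ in TYPE_SIZE:
        return 3 + TYPE_SIZE[typ]
    if typ in "ZH":                  # NUL-terminated string
        return buf.index(0, pos + 3) - pos + 1
    if typ == "B":                   # subtype byte, 32-bit count, then elements
        sub = chr(buf[pos + 3])
        (count,) = struct.unpack_from("<I", buf, pos + 4)
        return 3 + 1 + 4 + count * TYPE_SIZE[sub]
    raise ValueError("unknown aux type %r" % typ)

def decode_float(buf, pos=0):
    """Decode an 'f' aux field (little-endian 32-bit float)."""
    assert chr(buf[pos + 2]) == "f"
    return struct.unpack_from("<f", buf, pos + 3)[0]
```

A caller would step through a record's aux area by repeatedly adding
aux_field_size() to the current offset.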
Version 1.14.14 (17th March 2021)
---------------
This is primarily a bug fix release.
Updates:
* Bumped htscodecs submodule to 1.0. This is mainly security
hardening. This now means for the first time Io_lib and HTSlib
share the same code for the CRAM codecs.
* Cram_filter now copes with CRAM 3.1 and 4.0 files.
* Added Power support (ppc64le) to CI. (Author: Arumugam)
* Added int64_t as a HashTable key type.
* Improved configure script handling of lzma and bzip2, which are now
on by default.
* Improved support for hurd_i386 by defining PATH_MAX (with thanks to
Michael Crusoe).
Bug fixes:
* The CRAM_IO_CUSTOM_BUFFERING code is now enabled correctly. This is
required for biobambam2 / libmaus2 integration. (Thanks to German
Tischler-Hohle)
* Fixed a recent bug in the cram_open_by_callbacks function used by
Biobambam. (#39. Thanks to German Tischler-Hohle)
* Fixed cram_codec_decoder2encoder handling of CRAM 4 encodings.
* Fixed an uninitialised memory access added during 1.14.13 (harmless
as it was then immediately replaced again, but it triggered valgrind
warnings).
* Fixed configure --disable-custom-buffering
* Typo fixes, courtesy of Debian lintian.
Version 1.14.13 (3rd July 2020)
---------------
This release has a mixture of on-going CRAM 4 work (not compatible
with previous CRAM 4) and some more general quality of life
improvements for all CRAM versions including speed-ups and better
multi-threading.
Note both CRAM 3.1 and 4.0 are still to be considered unofficial
CRAM extensions.
Updates:
* Scramble can now filter-in or filter-out aux tags during
transcoding. This is done using -d and -D options. For example:
scramble -D OQ,BI,BD in.bam out.cram
removes the GATK added OQ, BI and BD aux tags.
Requested by @jhaezebrouck in issue #24.
* The Scramble -X <profile> options are now implemented using a
CRAM_OPT_PROFILE option. This simplifies the scramble code and
makes it easier to call from a library. This also fixes a number of
bugs in the order of argument parsing.
* Improved CRAM writing speeds.
The bam_copy function now copies only the bytes in use rather than
all allocated bytes; the used size can sometimes be substantially
smaller. As this was done in the main thread it may have a
significant benefit when multi-threading.
* Added libdeflate support into CRAM too (in addition to the existing
support in BAM). This isn't a huge change to CRAM speeds except at
high levels (-8 and -9), which are now slower but give a better
compression ratio. A modest 2-3% speed gain is visible at low and
mid levels, and at -1/-2 to -4 the compression ratio is also
improved.
* CRAM 3.1 compression level -1 is now 25% faster, but 4% larger.
This is achieved by a different choice of compression codecs, most
notably disabling the name tokeniser for level 1. Use level 2 for
something comparable to the old behaviour.
* Added an io_lib/version.h to make it easier to detect the version
being compiled against using IOLIB_VERSION macros.
Requested by German Tischler in issue #25.
* Refactored the cram encoding interface used by biobambam.
Implemented by German Tischler in PR#27.
* CRAM 4 now uses E_CONST instead of a uni-value version of
E_HUFFMAN. Also added offset field to VARINT_SIGNED and
VARINT_UNSIGNED which helps for data series that have values from -1
to MAXINT.
* CRAM 4 container structure has changed so that all values are
variable sized integers instead of fixed size.
* Further improvements with CRAM 4's use of signed values.
- Ref_seq_id in container and slice headers is now signed.
- RI (ref ID) data series and NS (mate ref ID) are also now signed
as -1 is a valid value.
- Embedded ref id is now 0 for unused instead of -1.
* Reversed the use of CRAM 4 delta encoding for the B array. It only
helps at the moment for ONT signal data, so it needs more work to
make it auto-detect when delta makes sense. (Enabling it globally
for CRAM4 B aux tags was accidental.)
* Htscodecs submodule has gained support for big-endian platforms.
Other big-endian improvements to parts of CRAM4 too.
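The aux tag filtering described in this release can be pictured with a
small sketch operating on SAM text records. This is not Scramble's
implementation. The function name filter_aux and its keep parameter are
hypothetical, and treating keep=True as the "filter-in" case is an
assumption.

```python
def filter_aux(sam_line, tags, keep=False):
    """Drop (or, with keep=True, retain only) the named aux tags of a SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    core, aux = fields[:11], fields[11:]   # the first 11 SAM columns are mandatory
    wanted = set(tags)
    # Each aux field is TAG:TYPE:VALUE, so the first two characters are the tag.
    kept = [a for a in aux if (a[:2] in wanted) == keep]
    return "\t".join(core + kept)
```

Calling filter_aux(line, ["OQ", "BI", "BD"]) mirrors the effect of the
"-D OQ,BI,BD" example on a single record.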
Bug fixes:
* Fixed CRAM MD tag generation when using the "b" feature code
(NB: unused by known CRAM encoders).
Also see https://github.com/samtools/htslib/pull/1086 for more details.
* Fixed CRAM quality string when using "q" feature code (unused by
encoders?) and in lossy-quality mode (maybe utilised in old
Cramtools).
Also see https://github.com/samtools/htslib/pull/1094 for more details.
* Fixed some minor memory leaks.
* "Scramble -X archive -1" enabled lzma, which should only have
arrived at level 7 and above. (It compared integer 7 vs ASCII '1'.)
* Removed minor compilation warning in printf debugging.
* Fixed a 7 year old bug in scram_pileup which couldn't cope with
soft-clips being followed by hard-clips.
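The "-X archive -1" fix above is an instance of a common mistake: comparing
a digit character's code with an integer. The sketch below is purely
illustrative (the function names are hypothetical and the original code is
C, not Python), using the level 7 threshold mentioned in the entry.

```python
def lzma_enabled_buggy(level_char):
    # Compares the character code (ord("1") == 49) with the integer 7,
    # so every digit "0".."9" passes the test.
    return ord(level_char) >= 7

def lzma_enabled_fixed(level_char):
    # Convert the option character to its numeric value first.
    return int(level_char) >= 7
```

With the buggy form, level "1" enables lzma; with the fixed form only
levels "7" and above do.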
Version 1.14.12 (30th January 2020)
---------------
This primarily has updates for CRAM 3.1 / 4.0. Note these are
*incompatible* with the files produced by 1.14.11. (That warning was
for a reason, and there is still potential for more to change.)
Updates:
* Added CRAM compression profiles: fast, normal (default), small and
archive. Specify using scramble -X profile-name.
These change the codecs used as well as the granularity. For
example "-X fast" alters compression level, disables some slower
codecs and uses only 1000 sequences per slice. Here fast implies
fast random access as well as fast(er) cpu time.
* NM and MD tags are now checked during encode to validate that they
match the decode algorithm. If not they are automatically stored
verbatim.
* CRAM can now auto-disable the multi-ref mode if it realises we're no
longer flip-flopping between many small references. This can
improve compression in some situations as it also reenables the
AP_delta flag.
* INCOMPATIBILITY: The CRAM fqzcomp quality codec has been updated for
the experimental cram versions (3.1 & 4.0). This cannot read older
fqzcomp files (and vice versa).
Quality data sizes are typically 1-3% smaller, but in some extreme
cases with data showing strong strand or positional bias it can be
up to 30% smaller.
CRAMv4 at compression level -7 and above has the maximal form
fqzcomp encoding.
* INCOMPATIBILITY: Renumbered the CRAM 3.1/4.0 codecs to sequentially
follow the 3.0 ones.
* EXPERIMENTAL: Scramble -E can embed a consensus instead of reference
and delta against that. It is not recommended that you use this yet
though until the implications are sorted out. (Likely this will need
to be a CRAMv4 only feature.) Specifically some CRAM
implementations will choke if the md5sum for the embedded "reference"
does not match the md5sum for the reference listed in the @SQ
headers.
* Lots more minor updates to CRAM 3.1 / 4.0 compression codecs.
These have now also been moved to the new htscodecs submodule.
See the logs in that git repository for full details of codec
changes.
* CRAM 4.0 format improvements:
- New variable sized integer encoding.
- New "QO" quality orientation header field to optionally permit
compression of quality strings in their as-sequenced orientation
instead of as-aligned.
- Read names can now be deduped for read-pairs, just as we do for
RNEXT, PNEXT and TLEN.
- CF has a new flag EXPLICIT_TLEN which permits encoding of TLEN
only, but not RNEXT/PNEXT. Useful for preserving off-by-one TLEN
sizes. (Usually insignificant, but on some "wrong" data sets it's
up to 5% space saving.)
- MD, NM and RG can be stored in the TD map as placeholders.
They're auto-computed still, but we now know if they existed and
if so where in the tag list.
- Improved 64-bit position support.
- Added data transforms for RLE, bit-PACKing and mapping, and DELTA.
These are analogous to the rANS4x16 codec, but may be used in
conjunction with other codecs. (Currently sparsely utilised by
the encoder.)
- Native support for signed data types, instead of assuming
0xffffffff is -1 (for example). Used for AP, TS and RG.
* Improved build instructions: fixes github #19
* Tidied up EOF writing code to be more CRAM version agnostic.
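For readers unfamiliar with the RLE and DELTA data transforms listed under
the CRAM 4.0 format improvements, the sketch below shows their generic
textbook forms. It makes no attempt to reproduce the CRAM 4 byte layout,
and all function names are hypothetical.

```python
def delta_encode(values):
    """Replace each value by its difference from the previous one."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Undo delta_encode with a running sum."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

def rle_encode(values):
    """Collapse runs into [value, run_length] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

def rle_decode(pairs):
    """Expand [value, run_length] pairs back into the original list."""
    return [v for v, n in pairs for _ in range(n)]
```

Both transforms only pay off when the output is then passed to an entropy
codec, which is why they are described as usable in conjunction with other
codecs.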
Bug fixes:
* Fixed bashism in test harness. Also ensured the randomly generated
test data is consistent across all systems.
* Fixed NM calculation for N and P cigar ops.
* Fixed embedded references when used in multi-slice mode.
* Debian bug#912451: fail on non-intel systems that don't define
ALLOW_UAC.
* Debian bug#915450: big endian fixes for BAM CRC checking.
* Debian bug#915459: unaligned memory access crash on some systems
when doing multi-threaded CRAM decode.
* Debian bug#915460: resolves undefined behaviour in hash table on
32-bit systems.
* Fixed compilation error on x32 architecture.
* Fixed LDFLAGS typo causing --with-zlib to overrule the user's
definition of LDFLAGS.
* Fixed memory leaks in the test harness.
* Fixed cram_filter when used in conjunction with "scramble -n" (no
names).
* Fixed some rare thread race conditions in CRAM encoding.
* Fixed an optimisation buglet in gcc 5.0 to 5.4. Fixes github #17
* Various compiler warnings silenced (some of which were minor bug
fixes too).
* Fixed program name in help message from scram_test and
srf_extract_hash.
* Fixed type overflow problems with itf8 macros. Fixes github #22.
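To make the NM fix for N and P cigar ops concrete, here is a minimal sketch
of an NM calculation, assuming the usual definition (mismatches plus
inserted and deleted bases, excluding clipping). It is not io_lib's code.
The name compute_nm is hypothetical, and base comparison is simplified to a
case-insensitive character match.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def compute_nm(cigar, seq, ref, ref_pos=0):
    """Count mismatches plus inserted and deleted bases (0-based ref_pos)."""
    nm, spos, rpos = 0, 0, ref_pos
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":                      # aligned bases: count mismatches
            nm += sum(1 for i in range(n)
                      if seq[spos + i].upper() != ref[rpos + i].upper())
            spos += n
            rpos += n
        elif op == "I":                      # inserted bases all count
            nm += n
            spos += n
        elif op == "D":                      # deleted bases all count
            nm += n
            rpos += n
        elif op == "N":                      # skipped reference: no edits
            rpos += n
        elif op == "S":                      # soft clip consumes the read only
            spos += n
        # H and P consume neither sequence and add nothing to NM
    return nm
```

The key point is that N advances the reference position without adding to
the count, and P changes nothing at all.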
Version 1.14.11 (16th October 2018)
---------------
Updates:
* CRAM: http(s) queries now honour redirects.
The User-Agent header is also set, which is necessary in some
proxies.
Bug fixes:
* CRAM: fix to major range query bug introduced in 1.14.10.
* CRAM: more bug fixing on range queries when multi-threading (EOF
detection).
* The test harness now works correctly in bourne shell, without
using bashisms.
Version 1.14.10 (26th September 2018)
---------------
Updates:
* BAM: Libdeflate support (https://github.com/ebiggers/libdeflate).
This library is significantly faster than zlib, so it is a good
alternative to the Cloudflare and/or Intel libraries.
Configure using --with-libdeflate=/dir/to/deflate/install
* CRAM *EXPERIMENTAL*: Added custom quality and identifier codecs.
Also added the ability to use libbsc as a general purpose codec.
These are NOT OFFICIAL and so not enabled by default (version 3.0).
However as a technology demonstration only, they are available with
scramble -V3.1 or -V4.0 for evaluation and to promote discussion on
future CRAM formats. Do not use these on production data.
Implementations of the codecs and CRAM version 4.0 layout are liable
to change without prior warning.
* CRAM: name sorted files now automatically switch to non-ref mode.
Bug fixes:
* CRAM: Considerable fixes to multi-threading.
- Using more than 1 slice per container with threading now works.
- Removal of race conditions when using CRAM_OPT_REQUIRED_FIELDS.
- Combinations of ref and no-ref mode in adjacent containers.
- Other misc. threading bugs.
* Corrected end-of-range check in some scenarios.
* CRAM: bug fix to index creation when a slice contains exactly one
alignment.
* SAM: fixed parsing of illegal sequence characters (eg "Z").
These are now treated as "N" and not "=".
* BAM/SAM: protect against out of bound CIGAR operations.
* CRAM: hardening of rANS codec against malicious input.
Also fixed a very rare frequency renormalisation case.
* CRAM: fix with range queries used in conjunction with turning off
sequence retrieval (via CRAM_OPT_REQUIRED_FIELDS).
* Improved test harness for Windows and some header file problems.
* Fixed bgzip on big endian systems. (Debian bugs 876839, 876840)
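The SAM parsing fix for illegal sequence characters relates to BAM's 4-bit
base encoding, where code 0 is "=" and code 15 is "N". The sketch below
uses the code table from the SAM/BAM specification. The function name
pack_seq is hypothetical and this is not the io_lib parser.

```python
# Index in this string is the 4-bit code for each base (SAM/BAM spec).
NIBBLE = {base: code for code, base in enumerate("=ACMGRSVTWYHKDBN")}

def pack_seq(seq):
    """Pack a SAM sequence into BAM's 4-bit-per-base form."""
    # Unknown characters map to N (15); defaulting to 0 would yield "=".
    codes = [NIBBLE.get(b.upper(), NIBBLE["N"]) for b in seq]
    if len(codes) % 2:
        codes.append(0)                    # pad the final low nibble
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
```

A lookup that defaults to zero for unknown input silently turns "Z" into
"=", which is the behaviour the fix removes.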
Version 1.14.9 (9th February 2017)
--------------
Updates:
* BAM: Added CRC checking. Bizarrely this was absent here and in most
other BAM implementations too. Pure BAM decode of an uncompressed
BAM is around 9% slower and compressed BAM to compressed BAM is
almost identical. The most significant hit is reading uncompressed
BAM (and doing nothing else) which is 120% slower as CRC dominates.
Options are available to disable the CRC checking in case this is an
issue (scramble -!).
* CRAM: Now supports bgzipped fasta references.
* CRAM/SAM: Headers are now kept in the same basic type order while
transcoding. (Eg all @PG before all @SQ, or vice versa, depending on
input ordering.)
* CRAM: Compression level 1 is now faster but larger. (The old -1 and
-2 were too similar.)
* CRAM: Improved compression efficiency in some files, when switching
from sorted to unsorted data.
* CRAM: Speedups and improvements to memory handling under GNU
malloc. See the scram_init() function.
* CRAM: Sped up the rANS codecs on x86_64 platforms (assembly code).
* CRAM: Improved multi-threading performance during decode.
* CRAM: Block CRC checks are now only done when the block is used,
speeding up multi-threading and tools that do not decode all blocks
(eg flagstat).
* Scramble -g and -G options to generate and reuse bgzip indices when
reading and writing BAM files.
* Scramble -q option to omit updating the @PG header records.
* Experimental cram_filter tool has been added, to rapidly produce
cram subsets.
* Migrated code base to git. Use github for primary repository.
Dropped ChangeLog file (recommend git clone and "git log
--abbrev-commit --pretty=medium --stat" for an svn similar log
style).
* BAM: minor improvements to gcc SIMD auto-vectorisation.
* Minor improvements to dstring memory usage (potentially reducing
memory usage when loading very large BAM headers).
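As background to the BAM CRC checking added in this release: each BGZF
block ends with a CRC32 of the uncompressed data followed by its length.
The sketch below builds and verifies one such block using the layout from
the SAM/BAM specification. It assumes the fixed 18-byte header with a
single "BC" extra subfield, and the function names are hypothetical; it is
not io_lib's implementation.

```python
import struct
import zlib

def make_bgzf_block(data):
    """Build one BGZF block: 18-byte gzip header, raw deflate data, CRC32, ISIZE."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)      # raw deflate stream
    cdata = comp.compress(data) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1                     # total block size minus one
    header = struct.pack("<4BI2BH2B2H",
                         31, 139, 8, 4,                 # gzip magic, deflate, FEXTRA
                         0, 0, 255,                     # mtime, xfl, unknown OS
                         6, 66, 67, 2, bsize)           # XLEN, 'B', 'C', SLEN, BSIZE
    return header + cdata + struct.pack("<II", zlib.crc32(data), len(data))

def block_crc_ok(block):
    """Inflate the payload and compare it with the stored CRC32 and ISIZE."""
    crc, isize = struct.unpack("<II", block[-8:])
    data = zlib.decompress(block[18:-8], -15)           # raw inflate, no built-in check
    return zlib.crc32(data) == crc and len(data) == isize
```

Because raw inflate performs no integrity check of its own, skipping the
comparison in block_crc_ok is what an implementation without CRC checking
amounts to.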
Bug fixes:
* BAM: Fixed the bin value calculation for placed but unmapped reads.
* CRAM: Fixed file descriptor leak in refs_load_fai().
* CRAM: Fixed a crash in MD5 calculation for sequences beyond the
reference end.
* CRAM: Bug fixes when encoding malformed @SQ records.
* CRAM: Fixed a rare renormalisation bug in rANS codec.
* Fixed tests so make -j worked.
* Removed ancient, broken and unused popen() code.
Version 1.14.8 (22nd April 2016)
--------------
* SAM: Small speed up to record parsing.
* CRAM: Scramble now has -p and -P options to control whether to
force the BAM auxiliary sizes (8 vs 16 vs 32-bit integer quantities)
rather than reducing to smallest size required, and whether to
preserve the order of auxiliary tags including RG, NM and MD.
This latter option requires storing these values verbatim instead of
regenerating them on-the-fly, but note this only preserves tag order
with Scramble / Htslib. Htsjdk will still produce these fields out
of order.
* CRAM no longer stores data in the CORE block, permitting greater
flexibility in choosing which fields to decode. (This change is
also mirrored in htslib and htsjdk.)
* CRAM: ref.fai files in a different order to @SQ headers should now
work correctly.
* CRAM required-fields parameters no longer forces quality decoding
when asking for sequence.
* CRAM: More robustness / safety checks during decoding; itf8 bounds
checks, running out of memory, bounds checks in BETA codec, and
more.
* CRAM auto-generated read names are consistent regardless of range
queries. They also now match those produced by htslib.
* CRAM: the rANS codec should now be slightly faster at decoding.
* CRAM: there is a newer (faster than vanilla Zlib) crc32
implementation. If you are linking against CloudFlare's optimised
Zlib you should configure with --disable-own-crc to utilise their
assembly PCLMUL CRC implementation.
* CRAM bug fix: removed potential (but unobserved) possibility of
8-bit quantities stored as a 16-bit value in BAM being converted
incorrectly within CRAM.
* CRAM bug fix: fixed field widths for cram_dump and cram_size.
* SAM bug fix: no more complaining about "unknown" sort order.
* A few compiler warnings in cram_dump / cram_size have gone away.
Many small CRAM code tweaks to aid comparisons to htslib. It should
also be easier to build under Microsoft Visual Studio (although no
project file is provided).
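The aux size option in this release contrasts preserving the original BAM
integer width with reducing to the smallest size required. A minimal sketch
of the "smallest size" choice follows, using the BAM aux integer type codes
from the SAM/BAM specification. The function name is hypothetical, and the
preference for the signed code when both fit is an assumption, not
necessarily what Scramble does.

```python
def smallest_int_type(value):
    """Pick the narrowest BAM aux integer type code able to hold value."""
    ranges = [("c", -2**7, 2**7 - 1), ("C", 0, 2**8 - 1),
              ("s", -2**15, 2**15 - 1), ("S", 0, 2**16 - 1),
              ("i", -2**31, 2**31 - 1), ("I", 0, 2**32 - 1)]
    for code, lo, hi in ranges:
        if lo <= value <= hi:
            return code
    raise ValueError("value does not fit in 32 bits")
```

An 8-bit quantity that arrived in a 16-bit field would come back as "c" or
"C" here, which is exactly the size change that forcing the original widths
avoids.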
Version 1.14.7 (18th February 2016)
--------------
* Some speed ups to BAM encoding, particularly when using uncompressed
BAM.
* Scramble now has a lossy read-name method (scramble -n) when
outputting to CRAM.
* Tidied up the formatting of cram_size.
* Cram_dump now prints the TD map.
* CRAM bug fix: Scramble -N was sometimes failing when multi-threaded.
* CRAM bug fix: The code once again builds if CRAM_IO_CUSTOM_BUFFERING
is disabled.
* CRAM bug fix: avoid undefined behaviour in some uses of
CRAM_OPT_REQUIRED_FIELDS (not readily noticeable from command line
tools).
* CRAM bug fix: in very rare cases TLEN could change during decode
(albeit fixing it) for read-pairs that spanned references.
* CRAM bug fix: CIGAR sequences with more than 2^27 operations are now
supported.
* CRAM bug fix: fixed an assertion failure triggered with some
repeated templates.
Version 1.14.6 (6th November 2015)
--------------
* CRAM bug fix, reversing a bug introduced during 1.14.5. Output from
cramtools could trigger a crash during decoding with scramble. This
happened where a HUFFMAN codec was specified with zero symbols.
E.g. "DL => HUFFMAN {0, 0}" from cram_dump for a slice where there
are no D operators in the cigar strings for this slice.
Version 1.14.5 (5th November 2015)
--------------
* Scramble now has a way to control the maximum number of bases per
slice (default 5Mb), forcing a new slice if this limit is hit before
hitting the existing max sequences per slice limit. This improves
performance on very long read data. (PacBio, ONT, etc)
* Improvements for MacOS X building.
* Removed erroneous debugging output from Scramble.
* Fixed cram_dump so it works on longer sequences, eg PacBio data.
* Fixed cram_size (but not yet cram_dump summary output) to handle
multi-byte content_ids when reporting which block type is for which
data series.
* Fixed a bug with multi-slice containers, broken since r3946
(1.14.1).
* Bug fixes to the libmaus2/biobambam2 interface code. Part of this
change includes simplifying how auxiliary tag content_ids are
assigned for CRAM.
* This io_lib release should now work again when being used by the
current (albeit 2013) release of Staden Package.
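The bases-per-slice limit introduced in this release works alongside the
existing sequences-per-slice limit: whichever is reached first ends the
slice. A minimal sketch of that rule follows. The function and parameter
names are hypothetical, the 10000 sequence default is an assumption, and
whether the real encoder closes a slice just before or just after the limit
is crossed is not stated in the entry.

```python
def split_into_slices(read_lengths, max_seqs=10000, max_bases=5_000_000):
    """Group reads into slices, closing a slice once either limit is reached."""
    slices, current, bases = [], [], 0
    for n in read_lengths:
        current.append(n)
        bases += n
        if len(current) >= max_seqs or bases >= max_bases:
            slices.append(current)
            current, bases = [], 0
    if current:                      # flush the final partial slice
        slices.append(current)
    return slices
```

With short reads the sequence count triggers first; with very long reads
the base count does, which keeps slices to a manageable size.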
Version 1.14.4 (5th October 2015)
--------------
CRAM changes:
* Fixed a CRAM encoding bug with compression level 6 and above where
the resulting CRAM file could not be decoded. This has been in
existence since 1.13.8 (and possibly 1.13.6 under rarer conditions).
* New scramble option -H to avoid printing header in SAM output.
Version 1.14.3 (29th September 2015)
--------------
CRAM changes:
* Disabled the experimental slice checksum headers (SD and BD *slice*
tags) as their behaviour is still undefined and not in the CRAM
specification. These were left in in error.
* Fixed scram_merge to honour the -R option to specify a region.
(NB: This isn't really a properly supported tool, but a test of the
library code.)
* Fixed a bug in decoding memory that caused lzma level 9 compressed
files to be unable to be decompressed.
* Minor updates to scramble usage text.
Version 1.14.2 (16th September 2015)
--------------
CRAM changes:
* Bug fix to SAM header parsing so that it now permits nul characters.
(This is a long standing bug.)
* Bug fix to auxiliary tag compression; we failed to correctly cache
the best codec to use, resulting in slower (but valid) compression
times.
Version 1.14.1 (10th September 2015)
--------------
CRAM changes:
* Small improvements to compression ratios. CRAM auxiliary fields are
now always written to their own blocks. Also now experiment with
level 1 zlib compression in addition to the required level, as
on some data series this is the best solution.
* Removed support for writing CRAM version 1.0. This format was never
truly spec compliant anyway due to errors with the first
specification.
* rANS O1 memory allocation is now via malloc rather than the stack,
permitting building on MacOS X again.
* Fixed crash in multi-threaded decoding if not decoding positional
data (via cram_required_fields() function).
* Fixed a bug with non-reference encoded CRAMs and indexing.
Version 1.14.0 (10th July 2015)
--------------
* The default CRAM format type is now version 3.0. You can still
generate version 2.1 files using scramble -V2.1
* Lots of BAM/CRAM code hardening against I/O errors or corrupted data
files. Some have been from visual inspection while many have come
from automated "american fuzzy lop" fuzz testing. See the ChangeLog
for the full list.
* Improvements to compilation; we now compile with -Wall and this
should produce no warnings. Let us know if you get them. There is
a configure --disable-warnings option to switch off -Wall.
* CRAM: added mmap support for references. This can reduce the total
memory footprint if many instances of scramble are running and also
reduces I/O when we are using small regions of cached md5
references.
* CRAM: improved compatibility with Java Cramtools CRC32 checking
(Cram version 3.0). We've also now done full integration checking
between the two implementations to ensure best compatibility with
version 3.0.
* CRAM: better support for Biobambam/libmaus; provides an in-memory
buffered alternative to the file descriptor to allow better
multi-threading performance with libmaus.
* CRAM: we now correctly spot sequence "*" when generating a CRAM file
so we can correctly export it again (cram version 3.0). Similarly
quality "*" is better handled too when being passed on to Cramtools.
* CRAM: bug fixed NM:i tag so it no longer counts hard clips. Also
fixed NM/MD for sequence "*" and cases where one was previously
present but not the other.
* CRAM: bug fixed handling unmapped reads with sequence "*".
* CRAM: bug fix to index querying when a read starts precisely on a
boundary of a cram slice.
* CRAM: fixed the container number of blocks field to be computed
correctly for multi-slice containers. (Oddly this didn't actually
matter for Scramble or Cramtools.)
Version 1.13.10 (3rd Mar 2015)
---------------
* Reduced memory usage for coordinate sorted CRAM files with many
references per slice.
* More error protection for mismatching .fai/@SQ headers.
* Improved handling of alignments off the end of references.
Version 1.13.9 (29th Jan 2015)
--------------
* Improved CRAM stats array usage. Previously it could create
sub-optimal HUFFMAN trees in rare situations. Harmless, but larger
output than necessary.
* The "configure --enable-custom-buffering" (or --disable-) mode, on
by default, adds an additional scram_open interface to allow low
level I/O operations to be externally defined. This is used within
Biobambam to replace stdio with custom code supporting an iRODS
backend. See scram_open_cram_via_callbacks().
* CRAM should now be 100% lossless, barring a few specific broken
inputs (eg CIGAR strings on unaligned data). If it detects flags,
pnext or tlen fields that would differ if decoded using the
read-pairing algorithm built in to scramble's cram decoder then it
stores the read verbatim to avoid deduplicating these fields. It
also has better support for the Supplementary flag.
* Improved support for the Supplementary flags when auto-generating
SAM flags.
* Fixed an issue where new gcc with -O3 could crash in processing SAM
due to SIMD vectorisation and unaligned memory accesses.
* Cram_index now works via a pipe, by specifying "-" as the input filename.
Version 1.13.8 (12th Jan 2015)
--------------
* The REF_PATH and RAWDATA variable expansion can now handle URLs
without an explicit URL= component. It also understands the format
of URLs and doesn't require a double colon (::) to escape a single
colon any more.
* Removed a few compiler warnings.
* CRAM: Improved test harness to test Scramble -e and -x. Fixed an
issue related to -x not setting RI data series in some cases.
* CRAM: BS:Z and BI:Z now go to separate external blocks, for
(usually) improved compression. Also added special blocks for
IonTorrent ZM:B and FZ:B tags.
* CRAM: Reading CRAM indices is now much faster. Also fixed a bug
when doing multiple range queries that fell exactly on CRAM
container boundaries.
* CRAM: Bug fixes and efficiency improvements to the logic to work out
which data-series to decode.
* CRAM: Better handling of MD and NM tags when using non-reference
encoding. Also sped up the MD string generation.
* CRAM: Improved support for dealing with both primary and secondary
alignments.
* CRAM: Better support for name-sorted data. It worked before, but we
had too many re-loads of the reference sequence. Similarly removed
pointless reference sequence loads when encoding with scramble -x.
* CRAM: Various minor memory leaks removed.
* CRAM: Multi-threading updates.
Fixed some uninitialised memory accesses causing crashes on
SPARC/Solaris. Also fixed issues when using range-requests while
multi-threading. Less mutex locking when using name-sorted data.
* CRAM: Removed spurious warnings about lack of EOF block when reading
older format CRAM files.
* CRAM: Bug fix to the (undocumented) SAM 'd' aux type. Only used here
because samtools supports it.
* CRAM: Bug fix when attempting to decode 0 bytes from an external
block.
* CRAM: More support for version 3.0. The 'b' and 'q' CRAM feature
codes have been implemented along with better support for the old
BYTE_ARRAY_LEN encoder. Support for compressed SAM headers.
Unified RANS0/RANS1 codecs to RANS (switches order itself).
EXPERIMENTAL: enable using "scramble -V3.0".
This has now been cross-validated against Java cram_tools 3.0
format. (Note this is incompatible with the v3.0 files produced
from earlier Scramble.)
* BAM: Added support for CIGAR strings above 65535 elements long. See
http://sourceforge.net/p/samtools/mailman/message/30672431/
* BAM: Removed buffer overrun in records with no auxiliary data and a
record length of 1024 or a higher power of 2.
* BAM: Handle newline/carriage-return format files.
* BAM: Bug fixed the bin/index calculation; Now [beg,end) instead of
[beg,end].
* BAM: Fixed the SAM parser to handle integers between -2billion and
+4billion. (Incoming change to the SAM spec.)
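The bin/index fix above hinges on treating the interval as half-open. The
sketch below is a Python transcription of the reg2bin calculation given in
the SAM specification, not io_lib's own source. Note the decrement of end,
which is what makes the interval [beg,end) rather than [beg,end].

```python
def reg2bin(beg, end):
    """UCSC-style bin for the 0-based half-open interval [beg, end)."""
    end -= 1                       # last base actually covered
    # Try the smallest (16kb) bins first, then each 8x larger level.
    for shift, offset in ((14, 4681), (17, 585), (20, 73), (23, 9), (26, 1)):
        if beg >> shift == end >> shift:
            return offset + (beg >> shift)
    return 0                       # spans everything: the root bin
```

For example, a read covering exactly the first 16384 bases fits the first
16kb bin (4681); without the decrement it would be pushed up to the
enclosing 128kb bin (585).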
Version 1.13.7 (30th May 2014)
--------------
* CRAM: Bug fixed the required fields detection code. (It was crashing
when running scram_flagstat on Cramtools output.)
* CRAM: Bug fix to cram_dump output on files using E_BYTE_ARRAY_*
codecs.
* BAM/CRAM: Modified the thread-pool to try and minimise the number of
threads used when the program hits an I/O bottleneck. This avoids
CPU auto-frequency-scaling causing slowdowns.
Version 1.13.6 (19th May 2014)
--------------
* CRAM: Major overhaul of how data series are assigned to CORE vs
EXTERNAL blocks. The net effect is that CRAM file should become
slightly smaller and also faster for decoding when the decoder only
needs specific SAM columns. This has dramatically sped up flagstats.
* CRAM: Selection of compression algorithm for external blocks is now
more advanced. We allow specifying multiple compression methods (eg
gzip, bzip2, lzma, rANS) and the tool will learn which methods work
best for which blocks and adapt. (This matters only for v3.0 CRAM
specification, so is experimental.)
* Cram_dump should now do a better job of auto-detecting binary vs
printable text, allowing printing of arbitrary blocks in a
friendlier fashion. It also now tracks which block is for which
data-series and displays these in the summary output.
* Cram_index has been refactored, with the code moving out of the
program and into the io_lib library. It has also been bug-fixed to
cope with multiple references packed into a single cram container.
* CRAM: bug fixes
- The EOF writing code now uses the correct bit stream for value -1.
- Changing the version with scramble -V wasn't having any effect.
- The BETA codec now correctly honours the beta offset value for
zero length codes. (Previously unused)
- EOF now returns the correct value when a CRAM file is closed
before attempting to decode the first sequence. (Ie header only.)
* BAM: bug fixes
- Adding @SQ lines to a BAM file with no textual representation of
the SAM header now works.
- Fixed an issue in multi-threaded decoding, causing rare deadlocks.
* CRAM: experimental changes
- Further tweaks to the rANS codec used for the version 3 CRAM.
(Ongoing work, to be used for experimentation only.)
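The adaptive choice of compression method for external blocks in this
release could be sketched as below. This changelog does not describe
the actual heuristic, so the struct, the function names and the
method numbering are all hypothetical: the idea is simply to compress
a few trial blocks with every permitted method, add up the output
sizes, and then prefer the method with the smallest total.

```c
#include <stddef.h>

#define NMETHODS 3 /* hypothetical numbering: 0=gzip, 1=bzip2, 2=rANS */

/* Running totals of compressed bytes per method, for one block type. */
typedef struct {
    size_t total[NMETHODS];
} method_stats;

void stats_init(method_stats *s)
{
    int m;
    for (m = 0; m < NMETHODS; m++)
        s->total[m] = 0;
}

/* Record one trial: sizes[m] is the output size of method m on a block. */
void stats_record(method_stats *s, const size_t sizes[NMETHODS])
{
    int m;
    for (m = 0; m < NMETHODS; m++)
        s->total[m] += sizes[m];
}

/* Return the method with the smallest total; ties go to the lowest index. */
int stats_best(const method_stats *s)
{
    int m, best = 0;
    for (m = 1; m < NMETHODS; m++)
        if (s->total[m] < s->total[best])
            best = m;
    return best;
}
```

One such structure per data series would let each block type settle
on a different method. Repeating the trials from time to time would
let the choice follow changes in the data.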
Version 1.13.5 (28th February 2014)
--------------
* CRAM: Fixed two bugs involving reference sequences:
- When loading a fasta file containing un-folded sequence (all on
one line) the input data wasn't uppercased. This could lead to
invalid slice MD5 sums if the fasta file contained lowercase
sequence.
- In some situations the MD5 header in the @SQ line would be
computed on a blank sequence, leading to header errors.
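As background to the first fix above: the SAM specification defines
the reference MD5 over the sequence with all characters outside the
printable range 33 to 126 removed and the remainder uppercased. A
minimal sketch of that normalisation step follows. The function name
is hypothetical and this is not io_lib's own code; the MD5 digest
itself would be computed afterwards on the returned length.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Normalise a reference sequence in place before checksumming:
 * drop characters outside '!'..'~' (newlines, spaces) and uppercase
 * everything else. Returns the new length; no NUL is appended.
 */
size_t normalise_ref(char *seq, size_t len)
{
    size_t i, j = 0;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)seq[i];
        if (c >= 33 && c <= 126)
            seq[j++] = (char)toupper(c);
    }
    return j;
}
```

Folded and un-folded copies of the same fasta entry, in either case,
then produce identical bytes and therefore identical MD5 sums.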
Version 1.13.4 (17th February 2014)
--------------
* BAM: Fixed some buffer overruns in BAM decoding.
* CRAM: Added support for CRAM EOF blocks (new in CRAM v2.1
specification). Also improved BAM EOF block checking. v2.1 is now
the default output version.
* CRAM: Fixed an error causing multi-threading to take longer for the
first additional thread.
* CRAM: Improved memory-caching of reference sequences when fetching
via MD5 path. Also reduced memory used when loading via REF_PATH.
* CRAM: Experimental / alpha quality code for CRAM version 3.0,
including new codecs (rANS & arithmetic coders) and CRCs.
* CRAM: Small improvements to the bzip2 (-j) mode. It now periodically
tests bzip2 vs gzip and uses whichever is best rather than forcing
all compression to go via bzip2.
* CRAM: Fixed crash when handling CRAM files with non-sequential
reference IDs.
* Scramble: Now supports 8-way quality binning as an output option.
* Index_tar: Fixed debian bug #729276 - buffer overflow.
* Improved Windows building via cross-compilation.
Version 1.13.3 (25th October 2013)
--------------
* Improved robustness of CRAM support.
* Fixed important bugs in CRAM multi-threading support.
* CTF has been removed from io_lib source tree. Use 1.13.2 or older if
you still need this.
* CRAM now supports the new SAM 0x800 supplementary flag.
* Minor optimisations to CRAM compression levels (-1, -3 etc) so they
are more distinct.
* Fixed bug with curl timeouts when fetching large traces and/or
reference sequences.
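The SAM 0x800 flag mentioned above marks a supplementary alignment
line (for example, one part of a chimeric alignment). Tools that
count reads rather than alignment lines usually skip such records
along with secondary (0x100) ones. The macro and function names
below are illustrative and are not taken from io_lib's headers.

```c
/* Flag bits as defined in the SAM specification. */
#define FLAG_SECONDARY     0x100
#define FLAG_SUPPLEMENTARY 0x800

/* Non-zero if the record is the primary line for its read. */
int is_primary_line(unsigned int flag)
{
    return (flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY)) == 0;
}
```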
Version 1.13.2 (25th June 2013)
--------------
* Added multi-threading support for sam/bam/cram I/O.
* New scram_flagstat command, mainly to act as a test harness for
reading speed.
* Bug fixes and improvements to reference sequence handling, in
particular when dealing with unsorted data.
* Sped up SAM decoding by about 70%. Also improved robustness of
header parsing.
* Improved automatic file type detection (scramble).
* The CRAM header block is now padded out with lots of nul characters
to permit inline editing, although we have no tool or API to do this
currently.
Version 1.13.1 (3rd May 2013)
--------------
* CRAM now has support for storing unsorted data and for using
non-reference based encoding (although this isn't very efficient).
* CRAM can now use the EBI MD5 server: http://www.ebi.ac.uk/ena/cram/md5/%s
The library will use a colon separated REF_PATH environ (see
TRACE_PATH for analogous examples) to find references, and if set
will write a local MD5 cache to REF_CACHE environ.
* Added a rudimentary scram_pileup command, mainly as a test of the
library.
* CRAM now supports bzip2 encoding, specified using scramble -j.
* Various speed increases.
* Improved BAM support for non-intel hardware.
* Bug fixes to CRAM mate flags in various scenarios.
* Fixes to generation of NM and MD strings.
* Can now cope with BAM files containing no text headers; only binary
@SQ records.
* Can also cope with SAM/BAM files containing no @SQ records at all
(entirely unmapped).
* More rigorous error checking in BAM/SAM/CRAM code.
* The code is more compatible with linking into samtools (a pilot
project is ongoing).
* CRAM encoding is more robust to broken CIGAR strings.
* Various bug fixes to cram indexing.
* [API] Various function renaming, to allow this and samtools
libraries to be linked into the same applications and also for