fusta GRCh38.fa -o hg38
cat hg38/fasta/chr{X,Y}.fa > ~/sex_chrs.fa
cat hg38/get/chr17:18108706-18179802 > MYO15A.fa
rm hg38/seqs/chr{3,5}.seq
fusermount -u hg38
FUSTA is a FUSE-based virtual filesystem mirroring a (multi)FASTA file as a hierarchy of individual virtual files, simplifying efficient data extraction and bulk/automated processing of FASTA files.
The virtual files exposed by FUSTA behave like standard flat text files and are therefore compatible with all existing programs. When handling large multiFASTA files, FUSTA relies on the file caching capabilities of the OS to keep accesses fast.
If you use FUSTA, please cite FUSTA: leveraging FUSE for manipulation of multiFASTA files at scale, https://doi.org/10.1093/bioadv/vbac091
FUSTA is distributed under the CeCILL-C (LGPLv3 compatible) license. Please see the LICENSE file for details.
sudo apt install cargo fuse3 libfuse3-dev pkg-config
cargo install --git https://github.com/delehef/fusta
You can now find fusta in $HOME/.cargo/bin/; you should add this path to your $PATH for easier use.
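As a minimal sketch, the lines below add cargo's default installation directory, `$HOME/.cargo/bin` (note the leading dot), to `PATH` only if it is not already there. They assume a POSIX-compatible shell; the variable name `cargo_bin` is purely illustrative.

```shell
# Directory where `cargo install` places binaries by default.
cargo_bin="$HOME/.cargo/bin"

# Prepend it to PATH unless it is already listed.
case ":$PATH:" in
    *":$cargo_bin:"*) ;;                          # already present, nothing to do
    *) PATH="$cargo_bin:$PATH"; export PATH ;;
esac
```

To make the change permanent, the same lines can be placed in the startup file of your shell (for instance `~/.profile`).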
sudo yum install rust cargo fuse3 fuse3-devel
cargo install --git https://github.com/delehef/fusta
You can now find fusta in $HOME/.cargo/bin/; you should add this path to your $PATH for easier use.
sudo apt install curl fuse3 libfuse3-dev
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # Debian cargo is outdated
Then restart your shell to update the PATH environment variable.
Finally, install FUSTA:
cargo install --git https://github.com/delehef/fusta
You can now find fusta in $HOME/.cargo/bin/; you should add this path to your $PATH for easier use.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo yum install fuse3 fuse3-devel
Then restart your shell to update the PATH environment variable.
Finally, install FUSTA:
cargo install --git https://github.com/delehef/fusta
You can now find fusta in $HOME/.cargo/bin/; you should add this path to your $PATH for easier use.
On macOS, you will need to install the build tools if you do not have them yet: xcode-select --install
You must then download and install FUSE for macOS in order to be able to use FUSTA.
Finally, to build FUSTA, you need to install the Rust compiler. You can then build FUSTA by running cargo, the Rust build tool:
cargo install --git https://github.com/delehef/fusta
sudo pkg install rust pkgconf fusefs-libs # Install build dependencies
sudo sysctl vfs.usermount=1 # enable FUSE mounting without requiring administrator permissions
sudo kldload fuse # load the FUSE kernel module
Finally, install FUSTA:
cargo install --git https://github.com/delehef/fusta
You can now find fusta in $HOME/.cargo/bin/; you should add this path to your $PATH for easier use.
You should install FUSE (and its development package, if your distribution ships one separately) from your package manager. Note that a reboot might be necessary for the kernel module to be loaded.
To build FUSTA, you need to install the Rust compiler. You can then build FUSTA by running cargo, the Rust build tool:
cargo install --git https://github.com/delehef/fusta
You can now find fusta in $HOME/.cargo/bin/; you should add this path to your $PATH for easier use.
These commands run fusta in the background and mount the FASTA file file.fa in an automatically created fusta folder, exposing all the sequences contained in file.fa there. The call to tree displays the virtual hierarchy, then fusermount is called to cleanly unmount the file.
fusta file.fa
tree -h fusta/
fusermount -u fusta
Once started, fusta exposes the content of a FASTA file as if it were a set of independent files, so that any piece of software can use it. The exposed files and folders are detailed below.
For instance, here is the virtual hierarchy created by fusta after mounting a FASTA file containing the A. thaliana genome:
fusta
├── append
├── fasta
│   ├── 1.fa
│   ├── 2.fa
│   ├── 3.fa
│   ├── 4.fa
│   ├── 5.fa
│   ├── Mt.fa
│   └── Pt.fa
├── get
├── infos.csv
├── infos.txt
├── labels.txt
└── seqs
    ├── 1.seq
    ├── 2.seq
    ├── 3.seq
    ├── 4.seq
    ├── 5.seq
    ├── Mt.seq
    └── Pt.seq
FUSTA supports all FASTA files using UNIX-style line endings, including but not restricted to DNA files, protein files, gapped files and mixed-case files, independently of their inner formatting (line wrapping, line length, etc.).
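Since UNIX-style line endings are required, it can be worth checking a file before mounting it. The sketch below detects carriage returns and writes a converted copy; the function names `has_crlf` and `to_unix` and the sample file are illustrative only and are not part of FUSTA.

```shell
# Succeeds if the file contains at least one carriage return (DOS-style line ending).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of $1 to $2 with every carriage return removed.
to_unix() {
    tr -d '\r' < "$1" > "$2"
}

# Demonstration on a small throwaway file.
dos=$(mktemp)
unix=$(mktemp)
printf '>seq1\r\nACGT\r\n' > "$dos"

if has_crlf "$dos"; then
    to_unix "$dos" "$unix"
fi
```

The converted copy, not the original, is then the file to pass to fusta.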
This read-only CSV file lists all the fragments present in the mounted FASTA file. For each of them, it gives the standard id and additional information fields, plus a third field containing the length of the sequence.
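Because the file is plain CSV, standard text tools can summarize it. The sketch below sums the sequence lengths, assuming that the separator is the default comma and that the length is the last field of each row; any row whose last field is not a number (such as a possible header row) is skipped. The function name `total_length` and the sample data are made up for the demonstration.

```shell
# Sum the last column (sequence length) of an infos.csv-like file.
total_length() {
    awk -F, '$NF ~ /^[0-9]+$/ { total += $NF } END { print total + 0 }' "$1"
}

# Throwaway sample mimicking the three columns described above.
sample=$(mktemp)
printf 'id,description,length\n1,chromosome 1,30427671\nMt,mitochondrion,366924\n' > "$sample"

total_length "$sample"
```

On a mounted file, the same function would be called as `total_length fusta/infos.csv`.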
This read-only text file provides the same information, but in a more human-readable format.
This read-only file contains a list of all the sequence headers present in the mounted FASTA file.
This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read-only FASTA files.
This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read/write files containing only the sequences - without the FASTA headers, but with any newline preserved. These files can be read, copied, removed, edited, etc. as normal files, and any alteration will be reflected on the original FASTA file when fusta is closed.
This folder should be used to add new sequences to the mounted FASTA file. Any valid FASTA file copied or moved to this directory will be appended to the original FASTA file. Note that the process is completely transparent: the folder will remain empty even though the operation succeeded.
This folder is used for range access to the sequences in the mounted FASTA file. Although it is empty, any read access to a (non-existing) file following the pattern SEQID:START-END will return the corresponding range (1-indexed, fully-closed) in the specified sequence. Note that the access skips headers and newlines, so that the START-END coordinates map to actual loci in the corresponding sequence and not to bytes in the mounted FASTA file.
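To make the coordinate convention concrete, the sketch below emulates it on a plain FASTA file with awk. It is not how FUSTA works internally: it only illustrates what "1-indexed, fully-closed, ignoring headers and newlines" means. The function name `fasta_range` and the sample records are illustrative.

```shell
# Print bases START..END (1-indexed, both ends included) of sequence ID in a FASTA file.
fasta_range() {
    # $1: FASTA file, $2: sequence id, $3: start, $4: end
    awk -v id="$2" -v s="$3" -v e="$4" '
        /^>/ { split(substr($0, 2), h, /[ \t]/); keep = (h[1] == id); next }
        keep { seq = seq $0 }                      # join wrapped lines, dropping newlines
        END  { print substr(seq, s, e - s + 1) }
    ' "$1"
}

# Two small records wrapped at 4 columns, for demonstration only.
demo=$(mktemp)
printf '>chrA first record\nACGT\nTTGA\nCC\n>chrB\nGGGG\n' > "$demo"

fasta_range "$demo" chrA 3 6   # prints GTTT: the range spans the first line break
```

With a mounted file, the equivalent request would be `cat fusta/get/chrA:3-6`. The emulation loads the whole sequence in memory, so it is only suitable for small files.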
All the following examples assume that a FASTA file has been mounted (e.g. fusta -D genome.fa), and is unmounted after manipulation (e.g. fusermount -u fusta).
cat fusta/infos.txt
cat fusta/fasta/chr{X,Y}.fa > ~/sex_chrs.fa
cat fusta/get/chr12:12000000-12002000
rm fusta/seqs/chr{3,5}.seq
cp more_sequences.fa fusta/append
sed 's/[a-z]/\U&/g' fusta/seqs/chr21.seq | sponge fusta/seqs/chr21.seq
nano fusta/seqs/chrMT.seq
cd fusta/seqs; for i in *; do mv "${i}" "chr${i}"; done
blastn -db mydb.db -query fusta/fasta/seq25.fa
asgart fusta/fasta/chrX.fa fusta/fasta/chrY.fa --out result.json
FUSTA only works with uncompressed (multi)FASTA files. If you wish to use FUSTA on compressed (multi)FASTA files, we recommend using FASTAFS as an intermediary to expose a compressed (multi)FASTA file to FUSTA without having to fully uncompress it.
USAGE:
    fusta [OPTIONS] <FASTA>

ARGS:
    <FASTA>    A (multi)FASTA file containing the sequences to mount

OPTIONS:
    -C, --max-cache <max-cache>      Set the maximum amount of memory to use to cache writes (MB) [default: 500]
        --cache <cache>              Use either mmap, fseek(2) or memory-backed cache to extract sequences from FASTA files. WARNING: memory caching use as much RAM as the size of the FASTA file should be available. [default: mmap] [possible values: file, mmap, memory]
    -D, --no-daemon                  Do not daemonize
    -h, --help                       Print help information
    -o, --mountpoint <mountpoint>    Specifies the directory to use as mountpoint; it will be created if it does not exist
    -S, --sep <csv-separator>        Set the separator to use in CSV files [default: ,]
    -v                               Sets the level of verbosity
    -V, --version                    Print version information
    -W, --allow-overwrite            allow FUSTA to overwrite existing sequences, when (i) appending new sequences conflicting with an existing ID, (ii) renaming sequences
The cache option is key in adapting FUSTA to your use. For files of non-trivial size, a correct choice is the difference between a memory overflow and a smooth run:
- file: in this mode, FUSTA stores all the fragments as offsets in their file and accesses them through fseek(2). Performance will probably be the worst of the three modes, but memory consumption is kept to a minimum.
- mmap: this mode is very similar to the previous one, except that accesses go through mmap(2) reads, leveraging the caching facilities of the OS; this is the default mode.
- memory: in this mode, all fragments are copied directly to memory. Performance will be at its best, but enough memory must be available to store the entirety of the processed files.
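The trade-off above can be turned into a small decision helper. The sketch below is a hypothetical heuristic, not something FUSTA provides: it suggests `memory` when the FASTA file fits within a RAM budget that you choose, and falls back to the default `mmap` otherwise. The function name `suggest_cache`, the file name `genome.fa` and the 4 GiB budget are all assumptions made for the example.

```shell
# Hypothetical helper: suggest a --cache value from a file size and a RAM budget, both in bytes.
suggest_cache() {
    # $1: size of the FASTA file, $2: memory you are willing to dedicate to it
    if [ "$1" -le "$2" ]; then
        echo memory
    else
        echo mmap
    fi
}

genome=genome.fa                       # hypothetical file name
budget=$((4 * 1024 * 1024 * 1024))     # 4 GiB, an arbitrary example budget

if [ -f "$genome" ]; then
    size=$(( $(wc -c < "$genome") ))   # arithmetic expansion strips any padding from wc
    echo "fusta --cache=$(suggest_cache "$size" "$budget") $genome"
fi
```

The helper only prints a suggested command line. If `mmap` itself causes memory pressure, `--cache=file` remains the most conservative choice.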
The FASTA file may exceed the limit enforced by the kernel's memory overcommit guard. You may change the overcommit setting with sysctl -w vm.overcommit_memory=1, or use --cache=file for lower performance but less virtual memory pressure.
Your FASTA file may contain too many fragments with respect to the number of memory mappings a single program is allowed to create. You may increase max_map_count with sysctl -w vm.max_map_count=200000, or use --cache=file for lower performance but less virtual memory pressure.
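A quick way to judge whether this limit is likely to be the problem is to compare the number of records in the file with the current limit. The sketch below assumes a Linux system, where the limit can be read from `/proc/sys/vm/max_map_count`. The exact relation between fragments and mappings inside FUSTA is not documented here, so treat the comparison as a rough indication only. The function name `count_records` and the file name `genome.fa` are illustrative.

```shell
# Count the FASTA records (header lines starting with '>') in a file.
count_records() {
    grep -c '^>' "$1"
}

limit_file=/proc/sys/vm/max_map_count   # Linux-specific location of the limit
fasta=genome.fa                          # hypothetical file name

if [ -r "$limit_file" ] && [ -f "$fasta" ]; then
    records=$(count_records "$fasta")
    limit=$(cat "$limit_file")
    if [ "$records" -gt "$limit" ]; then
        echo "$records records exceed vm.max_map_count=$limit"
    fi
fi
```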
Open an issue stating your problem!
If you have any questions or encounter a problem, do not hesitate to open an issue.
FUSTA is standing on the shoulders of, among others, fuser, clap, memmap2 and daemonize.
- Fix missing newline in some cases
- Use 1-based, fully-closed genomic coordinates
- Accept more characters as FASTA sequences:
\n - _ . + =
- Fix truncating
- Improved error handling
- Better notifications
- Add a flag to allow overwrite of existing sequences as a side-effect
- Only ASCII alphanumerical content can be written to sequence files
- Refuse to open FASTA files with IDs containing characters invalid in a filename
- Update dependencies
- The default mount point is now fusta-{filename}
- Fix mountpoint not being created
- Improve notification system
- Daemonize by default
- Update memmap to memmap2
- Add an index on ino for better performances at the cost of a bit of memory
- Daemonize after parsing FASTA files, so that (i) errors appear immediately and (ii) performances are better when launching multiple instances in parallel.
- Can now cache all fragments in memory: increased RAM consumption, but starkly reduced random access time
- Bugfixes
- FUSTA is now based on fuser instead of fuse-rs
- Various optimizations let FUSTA handle >40GB FASTA files in 6GB of RAM, with much better performance
- Added an optional notification system behind the notifications feature gate
- Use MMAP by default. While it may lead to unpleasant load when performing heavy operations on very large files, this should be a rather uncommon case.
- FUSTA can now directly extract ranges from a sequence
- Initial release