diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index bc5344f..78bed5b 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-10-25T15:33:35","documenter_version":"1.7.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-10-25T15:34:24","documenter_version":"1.7.0"}} \ No newline at end of file diff --git a/dev/alignments/index.html b/dev/alignments/index.html index 47c599c..2bcdcea 100644 --- a/dev/alignments/index.html +++ b/dev/alignments/index.html @@ -20,9 +20,9 @@ 1 3 10 11 16 11 4 15 7 17 5 5 5 3 10 10 18 3 1 14 3 5 15 14 12 4 2 7 14 5 2 17 3 7 21 8 3 10 1 3 21 19 7 16 13 5 7 21 15 5 17 3 5 2 16 3 1 - 9 3 12 11 14 10 4 18 7 18 15 5 2 3 5 2 19 3 2
julia> size(A) # length and number of sequences(53, 100)

When reading from FASTA, the choice of the alphabet is made by reading the first five sequences, and comparing the observed characters with the list of default alphabets (see The Alphabet type). If they fit one of the defaults, it will be used. Otherwise, an alphabet will be created ad hoc:


julia> A = read_fasta("../../example/strange_characters.fasta"); # warning produced because no default alphabet was found┌ Warning: Could not find a default alphabet for characters ['!', '-', '/', '0', '9', '@', 'A', 'C', 'G', 'T'] - Using Alphabet{Char,Int64}: ['!', '-', '/', '0', '9', '@', 'A', 'C', 'G', 'T'] -@ BioSequenceMappings ~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/src/IO.jl:62
julia> A.alphabet |> symbols |> prod"!-/09@ACGT"

Writing to a FASTA file is just as easy:

julia> write("new_fasta_file.fasta", A) # or...
julia> open("new_fasta_file.fasta", "w") do io + 9 3 12 11 14 10 4 18 7 18 15 5 2 3 5 2 19 3 2
julia> size(A) # length and number of sequences(53, 100)

When reading from FASTA, the choice of the alphabet is made by reading the first five sequences, and comparing the observed characters with the list of default alphabets (see The Alphabet type). If they fit one of the defaults, it will be used. Otherwise, an alphabet will be created ad hoc:


julia> A = read_fasta("../../example/strange_characters.fasta"); # warning produced because no default alphabet was found┌ Warning: Could not find a default alphabet for characters ['!', '-', '/', '0', '9', '@', 'A', 'C', 'G', 'T'] +│ Using Alphabet{Char,Int64}: ['!', '-', '/', '0', '9', '@', 'A', 'C', 'G', 'T'] +└ @ BioSequenceMappings ~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/src/IO.jl:62
julia> A.alphabet |> symbols |> prod"!-/09@ACGT"

Writing to a FASTA file is just as easy:

julia> write("new_fasta_file.fasta", A) # or...
julia> open("new_fasta_file.fasta", "w") do io write(io, A) end

Accessing & iterating

Sequences can be accessed by indexing. Indexing using a range will return a view in the underlying data matrix.

julia> A[1] # the first sequence of the alignment53-element view(::Matrix{Int64}, :, 1) with eltype Int64:
   1
@@ -84,16 +84,16 @@
  4  3  5  3  3  5  5  3  5  1
  2  3  2  5  3  2  4  3  2  4
julia> B[1][1] = 55
julia> A[1][1] # A remains unchanged5

With subsample_random, it is also possible to create a random subalignment by picking sequences from the original one. For now, this is only possible without replacement, i.e. the same sequence cannot be picked twice. To just pick one sequence at random without creating a new alignment object, just call rand.

julia> subsample_random(A, 3) # new alignment using three random sequences from AAlignment of M=3 sequences of length L=10 - Shown as `MxL` matrix
 3×10 adjoint(::Matrix{Int64}) with eltype Int64:
- 4  3  5  3  3  5  5  3  5  1
+ 2  3  2  5  3  2  4  3  2  4
  2  4  4  1  4  2  3  2  2  3
- 2  3  2  5  3  2  4  3  2  4
julia> subsample_random(A, 12) # sampling without replacement: this will error since size(A, 1) < 12ERROR: AssertionError: Cannot take 12 different sequences from alignment of size 5
julia> rand(A) # one random sequence from A (returns a view)10-element view(::Matrix{Int64}, :, 4) with eltype Int64: - 3 - 3 + 3 3 5 2 2 1 5 3 1 2
julia> subsample_random(A, 12) # sampling without replacement: this will error since size(A, 1) < 12ERROR: AssertionError: Cannot take 12 different sequences from alignment of size 5
julia> rand(A) # one random sequence from A (returns a view)10-element view(::Matrix{Int64}, :, 1) with eltype Int64: 5 + 3 + 4 2 - 2 + 4 1 5 - 3 1 - 2

OneHotAlignment

TBA

+ 5 + 5

OneHotAlignment

TBA

diff --git a/dev/alphabets/index.html b/dev/alphabets/index.html index 91892ed..80fc161 100644 --- a/dev/alphabets/index.html +++ b/dev/alphabets/index.html @@ -75,4 +75,4 @@ 1 julia> nt_alphabet(Y) # now this works -"CC-A-" +"CC-A-" diff --git a/dev/index.html b/dev/index.html index ba9419d..5e3e718 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,13 +1,12 @@ -Quickstart · BioSequenceMappings.jl

BioSequenceMappings

The aim of the package is to facilitate the task of converting biological sequences (nucleotides, amino acids) to integers or to onehot representation. It also provides simple function to manipulate alignments or to compute simple statistics.

Note: I do not frequently use the onehot format, so for now it is not documented / not well tested. I'll develop it more when I get the time.

Installation

From the Julia REPL:

julia> using Pkg
julia> Pkg.add("BioSequenceMappings") Updating registry at `~/.julia/registries/General.toml` - Resolving package versions... - Installed BioSequenceMappings ─ v0.1.3 - Updating `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/docs/Project.toml` - [a84bc454] ~ BioSequenceMappings v0.1.3 `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl` ⇒ v0.1.3 - Updating `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/docs/Manifest.toml` - [a84bc454] ~ BioSequenceMappings v0.1.3 `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl` ⇒ v0.1.3 +Quickstart · BioSequenceMappings.jl

BioSequenceMappings

The aim of the package is to facilitate the task of converting biological sequences (nucleotides, amino acids) to integers or to onehot representation. It also provides simple function to manipulate alignments or to compute simple statistics.

Note: I do not frequently use the onehot format, so for now it is not documented / not well tested. I'll develop it more when I get the time.

Installation

From the Julia REPL:

julia> using Pkg
julia> Pkg.add("BioSequenceMappings") Resolving package versions... + Installed BioSequenceMappings ─ v0.1.3 + Updating `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/docs/Project.toml` + [a84bc454] ~ BioSequenceMappings v0.1.4 `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl` ⇒ v0.1.3 + Updating `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/docs/Manifest.toml` + [a84bc454] ~ BioSequenceMappings v0.1.4 `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl` ⇒ v0.1.3 Precompiling project... - 2524.1 ms ✓ BioSequenceMappings + 2588.7 ms ✓ BioSequenceMappings 1 dependency successfully precompiled in 3 seconds. 81 already precompiled. 1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version
julia> using BioSequenceMappings

Usage

Filter sequences by Hamming distance

Load an alignment, find all sequences with a Hamming distance smaller than 66% to the first one, create a new alignment object from them and save it to a file.

julia> using BioSequenceMappings
julia> A = read_fasta("../../example/PF00014.fasta");
julia> size(A) # 100 sequences of length 53(53, 100)
julia> s0 = A[1];
julia> s0' # transpose for legibility1×53 adjoint(view(::Matrix{Int64}, :, 1)) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 1
julia> indices = findall(s -> hamming(s, s0; normalize=true) < 0.66, A)5-element Vector{Int64}: @@ -63,4 +62,4 @@ 3 10 11 16 11 4 15 7 17 3 18 5 5 5 3 10 10 18 3 3 5 15 14 12 4 2 7 14 3 18 5 2 17 3 7 21 8 3 3 21 19 7 16 13 5 7 21 3 18 15 5 17 3 5 2 16 3 - 3 12 11 14 10 4 18 7 18 3 17 15 5 2 3 5 2 19 3
+ 3 12 11 14 10 4 18 7 18 3 17 15 5 2 3 5 2 19 3
diff --git a/dev/objects.inv b/dev/objects.inv index a3dfa81..9b97680 100644 Binary files a/dev/objects.inv and b/dev/objects.inv differ diff --git a/dev/reference/index.html b/dev/reference/index.html index 92381a2..68331f4 100644 --- a/dev/reference/index.html +++ b/dev/reference/index.html @@ -1,16 +1,16 @@ -Reference · BioSequenceMappings.jl

Documentation for BioSequenceMappings.

BioSequenceMappings.AlignmentType
mutable struct Alignment{A,T} where {A, T<:Integer}
    data::Matrix{T}
+Reference · BioSequenceMappings.jl

Documentation for BioSequenceMappings.

BioSequenceMappings.AlignmentType
mutable struct Alignment{A,T} where {A, T<:Integer}
    data::Matrix{T}
     alphabet::Union{Nothing, Alphabet{A,T}}
     weights::Vector{Float64} = ones(size(dat,1))/size(dat,1) # phylogenetic weights of sequences
-    names::Vector{String} = fill("", size(dat, 1))

Biological sequences as vectors of type T<:Integer. data stores sequences in columns: size(dat) returns a tuple (L, M) with L the length and M the number of sequences. When displayed, shows data as an MxL matrix to match with traditional alignments.

alphabet{A,T} represents the mapping between integers in data and biological symbols of type A (nucleotides, amino acids...). If nothing, the alignment cannot be mapped to biological sequences.

weights represent phylogenetic weights, and are initialized to 1/M. They must sum to 1. names are the label of sequences, and are expected to be in the same order as the columns of data. They do not have to be unique, and can be ignored

Important: When built from a matrix, assumes that the sequences are stored in columns.

Methods

  • getindex(X::Alignment, i) returns a matrix/vector X.data[:, i].
  • for s in X::Alignment iterates over sequences.
  • eachsequence(X::Alignment) returns an iterator over sequences (Vector{Int}).
  • eachsequence_weighted(X::Alignment) returns an iterator over sequences and weights as tuples
  • subaln(X::Alignment, idx) constructs the subaln defined by index idx.
source
BioSequenceMappings.AlignmentMethod
Alignment(data::AbstractMatrix{T}; alphabet = :auto, kwargs...)

Keyword argument alphabet can be :auto, :none/nothing, or an input to the constructor Alphabet. Other keyword arguments are passed to the default constructor of Alignment.

source
BioSequenceMappings.AlignmentMethod
Alignment(data::AbstractMatrix, alphabet; kwargs...)

data is a matrix of integers, with sequences stored in columns. alphabet can be either

  • an Alphabet
  • nothing: no conversion from integers to biological symbols.
  • something to build an alphabet from (e.g. a symbol like :aa, a string, ...). The constructor Alphabet will be called like so: Alphabet(alphabet).

If the types of alphabet and data mismatch, data is converted.

data can also have the following shape:

  • vector of integer vectors, e.g. [[1,2], [3,4]]: each element is considered as a sequence
  • vector of integers: single sequence alignment
source
BioSequenceMappings.AlphabetType
struct Alphabet{A,I}
+    names::Vector{String} = fill("", size(dat, 1))

Biological sequences as vectors of type T<:Integer. data stores sequences in columns: size(dat) returns a tuple (L, M) with L the length and M the number of sequences. When displayed, shows data as an MxL matrix to match with traditional alignments.

alphabet{A,T} represents the mapping between integers in data and biological symbols of type A (nucleotides, amino acids...). If nothing, the alignment cannot be mapped to biological sequences.

weights represent phylogenetic weights, and are initialized to 1/M. They must sum to 1. names are the label of sequences, and are expected to be in the same order as the columns of data. They do not have to be unique, and can be ignored

Important: When built from a matrix, assumes that the sequences are stored in columns.

Methods

  • getindex(X::Alignment, i) returns a matrix/vector X.data[:, i].
  • for s in X::Alignment iterates over sequences.
  • eachsequence(X::Alignment) returns an iterator over sequences (Vector{Int}).
  • eachsequence_weighted(X::Alignment) returns an iterator over sequences and weights as tuples
  • subaln(X::Alignment, idx) constructs the subaln defined by index idx.
source
BioSequenceMappings.AlignmentMethod
Alignment(data::AbstractMatrix{T}; alphabet = :auto, kwargs...)

Keyword argument alphabet can be :auto, :none/nothing, or an input to the constructor Alphabet. Other keyword arguments are passed to the default constructor of Alignment.

source
BioSequenceMappings.AlignmentMethod
Alignment(data::AbstractMatrix, alphabet; kwargs...)

data is a matrix of integers, with sequences stored in columns. alphabet can be either

  • an Alphabet
  • nothing: no conversion from integers to biological symbols.
  • something to build an alphabet from (e.g. a symbol like :aa, a string, ...). The constructor Alphabet will be called like so: Alphabet(alphabet).

If the types of alphabet and data mismatch, data is converted.

data can also have the following shape:

  • vector of integer vectors, e.g. [[1,2], [3,4]]: each element is considered as a sequence
  • vector of integers: single sequence alignment
source
BioSequenceMappings.AlphabetType
struct Alphabet{A,I}
     characters::Vector{A}
     char_to_index::Dict{A, I}
     index_to_char::Dict{I, A}
     default_char = nothing
     default_index
-end

Structure allowing the mapping from biological symbols of type A to integers of type I. The typical use case would be Alphabet{Char, Int}. Alphabet can be constructed

  • from a Vector of symbols and an optional type I, e.g. Alphabet(['A','C','G','T'], UInt8)::Alphabet{Char, UInt8}
  • from a String and an optional type, e.g. Alphabet("ACGT")
  • from a mapping Dict{A, I} where I<:Integer: Alphabet(Dict('A'=>1, 'C'=>2))
  • from a Symbol, using default alphabets, e.g. Alphabet(:nt)
  • from an integer, using default alphabets (see ?default_alphabets).
source
BioSequenceMappings.compute_weightsFunction
compute_weights(X::AbstractAlignment, θ = 0.2; normalize = true)

Compute phylogenetic correction weights for sequences of X. The weight sequence S is 1/N, where N is the number of sequences in X at hamming distance less than H from S (including S itself). The threshold H is floor(θ⋅L) where L is the sequence length.

The return value is a tuple (weights, Meff), where Meff is the sum of weights (pre-normalization). If normalize, weights are normalized to sum to one. .

source
BioSequenceMappings.eachsequenceMethod
eachsequence(X::AbstractAlignment[, indices]; skip)

Return an iterator over the sequences in X. If indices is specified, consider only sequences at the corresponding indices. Use the integer argument skip to return only one sequence every skip (~ 1:skip:end).

source
BioSequenceMappings.find_sequenceMethod
find_sequence(label::AbstractString, aln::AbstractAlignment)

Find sequence with name label in aln, and return (index, sequence). Scales as the number of sequences. Return the first sequence that matches the label.

!!! Return a view of the sequence.

source
BioSequenceMappings.hammingMethod
hamming(x, y; normalize=true, positions=nothing)

Hamming distance between Vectors x and y. Only sites in vector positions will be considered.

source
BioSequenceMappings.match_sequencesMethod
match_sequences(pattern, aln::AbstractAlignment)

Find sequences whose name matches label in aln, and return (indices, sequences). Sequences are returned as columns.

!!! Return a view of the sequences.

source
BioSequenceMappings.pairwise_correlationsFunction
pairwise_correlations(X, w=X.weights; as_mat=false)

Compute connected correlations: the difference between the pairwise frequencies and the product of the single site frequencies. See ?pairwise_frequencies for the shape of the output.

source
BioSequenceMappings.pairwise_frequenciesFunction
pairwise_frequencies(X::AbstractAlignment, w=X.weights; as_mat=false)

Return a q x q x L x L tensor. The (a, b, i, j) element is the fraction of sequences for which we see a at position i and b at position j.

If as_mat=true, will return a qL x qL matrix, with q x q blocks representing correlations between two specific columns.

source
BioSequenceMappings.pairwise_hammingMethod
pairwise_hamming(X, Y; step=1, step_left, step_right, as_vec=true, kwargs...)
-pairwise_hamming(X; kwargs...)

Return all hamming distances between sequences of X and Y. In the second form, consider pairs of sequences in X.

Only consider sequences every step. step_left and step_right can be used to skip sequence either in X or in Y. This is useful for large alignment, as the number of computations grows with the product of the size of the alignments

By default, the return value is a vector organized like [H(1,2), H(1,3), ..., H(M-1, M)] with H standing for hamming distance and M for the number of sequences. If a matrix is prefered, use as_vec=false

Extra keyword arguments are passed to hamming.

source
BioSequenceMappings.read_fastaMethod
read_fasta(fastafile::AbstractString; alphabet = :auto, kwargs...)
+end

Structure allowing the mapping from biological symbols of type A to integers of type I. The typical use case would be Alphabet{Char, Int}. Alphabet can be constructed

  • from a Vector of symbols and an optional type I, e.g. Alphabet(['A','C','G','T'], UInt8)::Alphabet{Char, UInt8}
  • from a String and an optional type, e.g. Alphabet("ACGT")
  • from a mapping Dict{A, I} where I<:Integer: Alphabet(Dict('A'=>1, 'C'=>2))
  • from a Symbol, using default alphabets, e.g. Alphabet(:nt)
  • from an integer, using default alphabets (see ?default_alphabets).
source
BioSequenceMappings.compute_weightsFunction
compute_weights(X::AbstractAlignment, θ = 0.2; normalize = true)

Compute phylogenetic correction weights for sequences of X. The weight sequence S is 1/N, where N is the number of sequences in X at hamming distance less than H from S (including S itself). The threshold H is floor(θ⋅L) where L is the sequence length.

The return value is a tuple (weights, Meff), where Meff is the sum of weights (pre-normalization). If normalize, weights are normalized to sum to one. .

source
BioSequenceMappings.eachsequenceMethod
eachsequence(X::AbstractAlignment[, indices]; skip)

Return an iterator over the sequences in X. If indices is specified, consider only sequences at the corresponding indices. Use the integer argument skip to return only one sequence every skip (~ 1:skip:end).

source
BioSequenceMappings.find_sequenceMethod
find_sequence(label::AbstractString, aln::AbstractAlignment)

Find sequence with name label in aln, and return (index, sequence). Scales as the number of sequences. Return the first sequence that matches the label.

!!! Return a view of the sequence.

source
BioSequenceMappings.hammingMethod
hamming(x, y; normalize=true, positions=nothing)

Hamming distance between Vectors x and y. Only sites in vector positions will be considered.

source
BioSequenceMappings.match_sequencesMethod
match_sequences(pattern, aln::AbstractAlignment)

Find sequences whose name matches label in aln, and return (indices, sequences). Sequences are returned as columns.

!!! Return a view of the sequences.

source
BioSequenceMappings.pairwise_correlationsFunction
pairwise_correlations(X, w=X.weights; as_mat=false)

Compute connected correlations: the difference between the pairwise frequencies and the product of the single site frequencies. See ?pairwise_frequencies for the shape of the output.

source
BioSequenceMappings.pairwise_frequenciesFunction
pairwise_frequencies(X::AbstractAlignment, w=X.weights; as_mat=false)

Return a q x q x L x L tensor. The (a, b, i, j) element is the fraction of sequences for which we see a at position i and b at position j.

If as_mat=true, will return a qL x qL matrix, with q x q blocks representing correlations between two specific columns.

source
BioSequenceMappings.pairwise_hammingMethod
pairwise_hamming(X, Y; step=1, step_left, step_right, as_vec=true, kwargs...)
+pairwise_hamming(X; kwargs...)

Return all hamming distances between sequences of X and Y. In the second form, consider pairs of sequences in X.

Only consider sequences every step. step_left and step_right can be used to skip sequence either in X or in Y. This is useful for large alignment, as the number of computations grows with the product of the size of the alignments

By default, the return value is a vector organized like [H(1,2), H(1,3), ..., H(M-1, M)] with H standing for hamming distance and M for the number of sequences. If a matrix is prefered, use as_vec=false

Extra keyword arguments are passed to hamming.

source
BioSequenceMappings.read_fastaMethod
read_fasta(fastafile::AbstractString; alphabet = :auto, kwargs...)
 read_fasta(
     fastafile::AbstractString, alphabet;
     weights = false, theta = 0.2, verbose = false,
-)
source
BioSequenceMappings.site_specific_frequenciesFunction
site_specific_frequencies(X::AbstractAlignment[, weights=X.weights]; as_vec=false)

Return the site specific frequencies of X. If as_vec, the result is a vector of length Lxq. Otherwise, it is a matrix of q rows and L columns (default).

source
BioSequenceMappings.subsample_randomMethod
subsample_random(X::AbstractAlignment, m::Int)

Return an Alignment with m sequences taking randomly from X. Sampling is done without replacement, meaning the m sequences are all at different positions in X.

source
BioSequenceMappings.translateMethod
translate(x, original_alphabet::Alphabet, new_alphabet::Alphabet)

Return the translation in new_alphabet of an integer or a vector of integers x that is expressed in original_alphabet.

source
+)
source
BioSequenceMappings.site_specific_frequenciesFunction
site_specific_frequencies(X::AbstractAlignment[, weights=X.weights]; as_vec=false)

Return the site specific frequencies of X. If as_vec, the result is a vector of length Lxq. Otherwise, it is a matrix of q rows and L columns (default).

source
BioSequenceMappings.subsample_randomMethod
subsample_random(X::AbstractAlignment, m::Int)

Return an Alignment with m sequences taking randomly from X. Sampling is done without replacement, meaning the m sequences are all at different positions in X.

source
BioSequenceMappings.translateMethod
translate(x, original_alphabet::Alphabet, new_alphabet::Alphabet)

Return the translation in new_alphabet of an integer or a vector of integers x that is expressed in original_alphabet.

source
diff --git a/dev/utilities/index.html b/dev/utilities/index.html index 7d50e49..0134183 100644 --- a/dev/utilities/index.html +++ b/dev/utilities/index.html @@ -10,4 +10,4 @@ 0.2 0.2 0.2 - 0.2 + 0.2