ashleys-qc

Automated Selection of High quality Libraries for the Extensive analYsis of Strandseq data (ASHLEYS)

ASHLEYS is developed on Linux using Python 3.7. For a full working example of how to use ASHLEYS, please see the processing pipeline. Note that the preprocessing steps in that pipeline, e.g. short-read alignment and read duplicate marking, are always required to prepare suitable input files for ASHLEYS; the pipeline code itself, however, is only an example implementation and not part of ASHLEYS.

Setup

Clone the repository via

git clone https://github.com/friendsofstrandseq/ashleys-qc.git ashleys-qc
cd ashleys-qc

Then create and activate the conda environment:

conda env create -f environment/ashleys_env.yml
conda activate ashleys

For final setup, run

python setup.py install

You should now be able to list all available modules with

./bin/ashleys.py --help

Feature Generation

Compute features for one or more BAM files at one or more window sizes. For a detailed explanation of the computed features, please refer to the feature documentation.

Example usage that generates, for all .bam files in the specified directory, the features required by the pretrained models:

./bin/ashleys.py -j 23 features -f [folder_with_bamfiles] -w 5000000 2000000 1000000 \
 800000 600000 400000 200000 -o [feature_table.tsv]
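
When feature generation is scripted, e.g. from a workflow, the call above can be assembled programmatically. The following is a minimal sketch and not part of ASHLEYS: the helper name `build_features_command` and its parameters are hypothetical, and only the options shown in the example (`-j`, `features`, `-f`, `-w`, `-o`) are used.

```python
from pathlib import Path

# Window sizes (in bp) copied from the example command above.
PRETRAINED_WINDOWS = (5_000_000, 2_000_000, 1_000_000,
                      800_000, 600_000, 400_000, 200_000)


def build_features_command(bam_folder, output_table, jobs=1,
                           windows=PRETRAINED_WINDOWS,
                           executable="./bin/ashleys.py"):
    """Return the argument list for an `ashleys.py features` call.

    Hypothetical helper for illustration; it only mirrors the example command.
    """
    if not windows:
        raise ValueError("at least one window size is required")
    cmd = [str(executable), "-j", str(jobs), "features",
           "-f", str(bam_folder), "-w"]
    cmd.extend(str(int(w)) for w in windows)
    cmd.extend(["-o", str(output_table)])
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace them with your own BAM folder and output file.
    command = build_features_command(Path("bam_files"),
                                     Path("feature_table.tsv"), jobs=23)
    print(" ".join(command))
```

To actually run the command, the returned list can be passed to `subprocess.run(command, check=True)` from the repository root.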

Model Training

Train a new classification model based on an annotation file specifying class 1 cells. The model is a support vector classifier whose hyperparameters are selected by grid search.
Example usage:

./bin/ashleys.py train -p [feature_table.tsv] -a [annotation.txt] -o [output.tsv]
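
To illustrate the technique named above, the sketch below runs a grid search over support vector classifier hyperparameters with scikit-learn. It is not the ASHLEYS training code: the synthetic feature matrix, the parameter grid, and the cross-validation setting are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a feature table: 60 cells with 4 features (assumption).
rng = np.random.default_rng(0)
class0 = rng.normal(loc=0.0, scale=1.0, size=(30, 4))
class1 = rng.normal(loc=3.0, scale=1.0, size=(30, 4))
features = np.vstack([class0, class1])
labels = np.array([0] * 30 + [1] * 30)  # 1 = cells listed in the annotation

# Hypothetical hyperparameter grid; the grid used by ASHLEYS may differ.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1], "kernel": ["rbf"]}

# probability=True enables predict_proba, i.e. class probabilities per cell.
search = GridSearchCV(SVC(probability=True, random_state=0), param_grid, cv=5)
search.fit(features, labels)

print("best parameters:", search.best_params_)
probabilities = search.predict_proba(features)  # shape: (n_cells, 2)
```

After fitting, `search.best_estimator_` holds the classifier refit on all data with the best parameter combination found by cross-validation.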

Prediction

Predict the class probabilities for new cells using either the pretrained models or customized models.
The default model, trained with support vector classification, should identify low-quality cells in new data with high confidence. For detailed information about the generated files, please refer to the output interpretation.

Example usage for prediction with the default pretrained model:

./bin/ashleys.py predict -p [feature_table.tsv] -o [output_folder] -m models/svc_default.pkl

The pretrained models require scikit-learn v0.23.2, as they were generated with this version. Customized models can also be trained and used with a newer version of scikit-learn.
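
A quick way to confirm the scikit-learn requirement before loading a pretrained model is to compare the installed version against the expected one. This is a minimal sketch; the helper names are hypothetical, and ASHLEYS itself may or may not perform such a check.

```python
EXPECTED_SKLEARN = "0.23.2"  # version stated above for the pretrained models


def installed_version():
    """Return the installed scikit-learn version string, or None if missing."""
    try:
        import sklearn
    except ImportError:
        return None
    return sklearn.__version__


def matches_expected(installed, expected=EXPECTED_SKLEARN):
    """True only if the installed version equals the expected one exactly."""
    return installed is not None and installed.strip() == expected


if __name__ == "__main__":
    found = installed_version()
    if matches_expected(found):
        print(f"scikit-learn {found}: matches the pretrained models")
    else:
        print(f"scikit-learn {found} found, but {EXPECTED_SKLEARN} is expected")
```

The comparison is deliberately exact, since the text above names a single version rather than a range.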

Plotting

Plot the distribution of prediction probabilities.
Example usage:

./bin/ashleys.py plot -p [output_folder]/prediction.tsv -o [output_plot]

Test Data

Example prediction on the test data that directly compares the predicted classes to the true annotation:

./bin/ashleys.py predict -p data/test_features.tsv -o prediction.tsv \
-m models/svc_default.pkl -a data/test_annotation.txt
