To use covSampler to analyze your own data, you’ll need to prepare two files:
- A FASTA file with viral genomic sequences.
- A corresponding TSV file with metadata describing each sequence.
Prepare your nucleotide sequences in a FASTA format file named sequences.fasta.
You can see a formatted example sequence file here.
Prepare your metadata in a TSV format file named metadata.tsv.
A metadata file must include the following fields:
| Fields | Description | Format |
|---|---|---|
| strain | Sequence name | The strain values in the metadata file must match the sequence names in the FASTA file |
| date | Collection date | YYYY-MM-DD (ambiguous dates are not accepted) |
| region_exposure | Continent | Africa / Asia / Europe / North America / Oceania / South America |
| country_exposure | Country | Country |
| division_exposure | Administrative division | Division |
| pango_lineage* | Viral lineage under the Pango nomenclature | See the latest Pango lineage list |
* Currently, the covSampler workflow does not include Pango lineage assignment. You can assign Pango lineages using pangolin or nextclade.
You can see a formatted example metadata file here.
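Before running the workflow, it can help to check the two files against the requirements listed above. The following is a minimal sketch, not part of covSampler: the function names `fasta_names` and `check_inputs` are hypothetical, and the script assumes that the full header text after `>` in the FASTA file is the sequence name and that the metadata file has a header row.

```python
import csv
import re
from datetime import datetime

# Required columns and allowed continents, taken from the table above.
REQUIRED_FIELDS = [
    "strain", "date", "region_exposure", "country_exposure",
    "division_exposure", "pango_lineage",
]
REGIONS = {"Africa", "Asia", "Europe", "North America", "Oceania", "South America"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def fasta_names(fasta_path):
    """Return the set of sequence names (assumed: all text after '>')."""
    names = set()
    with open(fasta_path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                names.add(line[1:].strip())
    return names


def is_complete_date(value):
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not DATE_RE.match(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_inputs(fasta_path, metadata_path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    names = fasta_names(fasta_path)
    strains = set()
    with open(metadata_path, encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
        if missing:
            return ["missing columns: " + ", ".join(missing)]
        for row in reader:
            strain = row["strain"] or ""
            strains.add(strain)
            if not is_complete_date(row["date"] or ""):
                problems.append(f"{strain}: date is not a complete YYYY-MM-DD date")
            if row["region_exposure"] not in REGIONS:
                problems.append(f"{strain}: unexpected region_exposure value")
    # Strain names must match in both directions.
    for strain in sorted(strains - names):
        problems.append(f"{strain}: in metadata but not in FASTA")
    for name in sorted(names - strains):
        problems.append(f"{name}: in FASTA but not in metadata")
    return problems
```

Calling `check_inputs("sequences.fasta", "metadata.tsv")` returns a list of messages that you can print and review. The sketch does not validate the country, division, or lineage values.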
All data are in the data/ directory. The raw data and intermediate data of each project will be stored in its corresponding directory.
For a new project (here named tutorial_project):
- Create your project data folder in data/.
- Create a rawdata/ folder in data/tutorial_project.
- Move your sequence data and metadata into the data/tutorial_project/rawdata/ folder.
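The steps above can also be scripted. The following is a minimal sketch, assuming it is run from the directory that contains data/. The helper name `set_up_project` and its parameters are hypothetical and not part of covSampler.

```python
import shutil
from pathlib import Path


def set_up_project(project, sequences, metadata, data_dir="data"):
    """Create <data_dir>/<project>/rawdata/ and move both input files into it."""
    rawdata = Path(data_dir) / project / "rawdata"
    # parents=True also creates the project folder; exist_ok avoids an error on re-runs.
    rawdata.mkdir(parents=True, exist_ok=True)
    # The files are stored under the names used in this tutorial.
    shutil.move(str(sequences), str(rawdata / "sequences.fasta"))
    shutil.move(str(metadata), str(rawdata / "metadata.tsv"))
    return rawdata
```

For example, `set_up_project("tutorial_project", "sequences.fasta", "metadata.tsv")` moves the two files from the current directory into data/tutorial_project/rawdata/.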
Now, the data/ directory structure should look like this:
data
├── README.md
├── example_project
│   └── rawdata
│       ├── metadata.tsv
│       └── sequences.fasta
└── tutorial_project
    └── rawdata
        ├── metadata.tsv
        └── sequences.fasta