GitHub - nicolo-tellini/S.cerevisiaeData: 1,674 S.cerevisiae genomics data

NEWS:

Easier access to the collection and the manuscripts

CONTENT

What's inside:

- pipeline
- scerstrains.csv # info on S.cer. strains
- sceradditionalstrains.txt # sequencing not available in ENA but downloadable elsewhere
- scerlowcoveragestrains.txt # scer strains without DP filtering
- publicdata.txt # link to the data
- QRCode.png # access to the data
- collectionIncluded.txt # collection references
- short illumina reads data of the hybrids ScerxSpar OS162 and OS2389

COLLECTIONS

First author	Year	Population	Paper
Dunn coming soon	2012	-	link
Skelly	2013	Beer	link
Marsit (GENOWINE proj)	2015	Wine	link
Marsit (EVOLYA proj) coming soon	2015	Wine	link
Almeida	2015	Mediterranean/North America/Japan	link
Strope (100-genome proj)	2015	Multi	link
Barbosa	2016	Brazilian	link
coming soon	201?	Clinical	link
Gonçalves	2016	Wine/Beer	link
coming soon	2016	Wine	link
Gallone	2016	Beer	link
Almeida coming soon	2017	-	link
Barbosa	2018	Cachaça	link
Peter and De Chiara (1011 proj)	2018	Multi	link
Legras	2018	Wine/Fermented food environments	link
Duan	2018	Chinese	link
Ramazzotti	2019	Wine/Insect	link
Pontes	2019	Alpechin	link
Gallone	2019	Beer	link
Fay coming soon	2019	Beer	link
coming soon	202?	NZ	link
coming soon	202?	African Fermented Food	link
coming soon	202?	Wild	link
coming soon	202?	???	link

SOFTWARE

bwa v. 0.7.17-r1198-dirty
samtools v. 1.14
bcftools v. 1.15.1

gVCF/BCF FEATURES

The fastqs have been aligned against S288C_reference_genome_R64-3-1_20210421.fa (Scer.genome.fa in pipeline/rep).

The files are provided in text format (.gvcf).

All the genomic positions are included (as long as at least 1 strain has been genotyped at that position).

Chromosome names use a lowercase chr prefix and keep Roman numerals, e.g. chrIII (chrMT is the only exception).

Strain names are replaced by the ENA archive Run Accession.

The HOWTO below shows how to rename the strains in the header.

Example

The strain MTZ13.12 is named SRR7851920 in the gVCF. Renaming SRR7851920 restores the name MTZ13.12.

Using the Run Accession simplifies the filtering phase: it prevents selecting the wrong strains when names overlap, are similar, or contain multiple symbols.

The gVCFs/BCFs were filtered as follows:

MQ >= 5
QUAL >= 20
DP >= 10

NOTE: some of the S.cerevisiae isolates were made available via a custom website; these strains are listed in sceradditionalstrains.txt and their genomic data are stored in the sceradditionalstrains files.

NOTE: some of the S.cerevisiae isolates have low coverage, so DP filtering was not applied to them; these strains are listed in scerlowcoveragestrains.txt and their genomic data are stored in the scerlowcoveragestrains files.

DATA ACCESS

🔗 Public link

HOWTO

extract per-sample/s data

bcftools view -S thisFILEcontainsONEstrainPERline.txt gvcf.gz -Oz -o myfavoritesamples.gvcf.gz
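
bcftools view -S stops with an error when a requested sample is absent from the file (unless --force-samples is given), so it can help to check the list first. A minimal sketch, assuming the sample names in the gVCF were saved beforehand with bcftools query -l; the function name missing_samples and the file names wanted.txt and samplesInGvcf.txt are hypothetical.

```shell
# Print every requested sample that is absent from the gVCF sample list.
# $1: requested samples, one per line; $2: samples in the gVCF, one per line.
missing_samples() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  grep -Fxv -f "$2" "$1"
}

# Hypothetical usage:
#   bcftools query -l gvcf.gz > samplesInGvcf.txt
#   missing_samples wanted.txt samplesInGvcf.txt
```

The function prints nothing, and returns a non-zero status, when every requested sample is present.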

extract per-region/s data

bcftools view -R thisFILEcontainsCHRstartENDtabSEPARATEDcoordinates.bed gvcf.gz -Oz -o myfavoriteregions.gvcf.gz
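
A sketch of how the region file might be built. When the file has a .bed suffix, bcftools reads the coordinates as 0-based and half-open (start included, end excluded), and -R needs an indexed input file. The function name add_region, the example coordinates and the file name regions.bed are hypothetical.

```shell
# Append one region to a BED file, starting from 1-based inclusive coordinates.
# $1: chromosome, $2: first base (1-based), $3: last base (1-based), $4: BED file
add_region() {
  # BED start is 0-based, so subtract 1; BED end is exclusive,
  # so it equals the 1-based last base.
  printf '%s\t%d\t%d\n' "$1" "$(( $2 - 1 ))" "$3" >> "$4"
}

# Hypothetical usage:
#   add_region chrIII 1001 2000 regions.bed
#   bcftools index gvcf.gz        # -R requires an index
#   bcftools view -R regions.bed gvcf.gz -Oz -o myfavoriteregions.gvcf.gz
```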

extract only variant positions (SNPs)

bcftools view -e 'ALT="."' gvcf.gz -Oz -o vcf.gz

replace ENA archive Run Accession codes with the original strain names

Before proceeding: the order of the ENA archive Run Accessions in scerstrains.csv must match the output of

bcftools query -l gvcf.gz

Important

The file we provide is already ordered, but if you subsetted the gVCF by samples you need to subset scerstrains.csv as well and make sure the order is maintained.

If/when the order is the same you can move to the next step.
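
The order check described above can be automated. A minimal sketch, assuming scerstrains.csv is tab-separated with one header line; which column holds the Run Accession is not stated here, so it is passed as an argument. The function name same_order and the file name samplesInGvcf.txt are hypothetical.

```shell
# Succeed only if column $3 of the tab-separated table $2 (header skipped)
# lists the same names, in the same order, as the sample list $1.
same_order() {
  tail -n +2 "$2" | cut -f"$3" | diff -q "$1" - > /dev/null
}

# Hypothetical usage (column 1 is ASSUMED to hold the Run Accession):
#   bcftools query -l gvcf.gz > samplesInGvcf.txt
#   same_order samplesInGvcf.txt scerstrains.csv 1 && echo "order matches"
```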

cut -f2 scerstrains.csv | grep -v strainName > fromENAtoStrainName.txt

bcftools reheader --samples fromENAtoStrainName.txt -o gvcf.renamedstrains.gz gvcf.gz
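
bcftools reheader also accepts a --samples file made of "old_name new_name" lines that lists only the samples to rename, which removes the dependency on row order. A sketch under two assumptions: column 1 of scerstrains.csv holds the Run Accession (only column 2, the strain name, is confirmed by the cut command above), and strain names contain no whitespace. The names make_rename_pairs, samplesInGvcf.txt and renamePairs.txt are hypothetical.

```shell
# Write "old_name new_name" pairs for the samples listed in $1 (one per line),
# looking up the new name in the tab-separated table $2
# (header skipped; column 1 old name, column 2 new name).
make_rename_pairs() {
  awk -F'\t' '
    NR == FNR { want[$1] = 1; next }          # first file: samples to rename
    FNR > 1 && ($1 in want) { print $1, $2 }  # second file: emit matching pairs
  ' "$1" "$2"
}

# Hypothetical usage:
#   bcftools query -l gvcf.gz > samplesInGvcf.txt
#   make_rename_pairs samplesInGvcf.txt scerstrains.csv > renamePairs.txt
#   bcftools reheader --samples renamePairs.txt -o gvcf.renamedstrains.gz gvcf.gz
```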

renaming the strains in any other .txt file from downstream analyses.

Make a backup copy of the .txt file before running sed (sed is as powerful as it is dangerous).

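# Column 1 of DATAonSCER.csv (header "vcfname") is the name used in the VCF; column 2 is the strain name.
for j in $(cut -f1 DATAonSCER.csv | grep -v vcfname)
do
 k=$(grep -w "$j" DATAonSCER.csv | cut -f2)
 # \< and \> are GNU sed word boundaries, so only whole names are replaced
 sed -i "s+\<${j}\>+${k}+g" myresults.txt
done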
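
A non-destructive variant of the loop above, as a sketch: it builds one sed script from the mapping table and writes the renamed text to standard output, so the original file is never modified. It assumes GNU sed, a tab-separated table with one header line, and names that do not contain +, & or backslashes. The function name rename_copy and the output file name are hypothetical.

```shell
# Rename whole-word occurrences of column 1 names with column 2 names.
# $1: tab-separated mapping table (header skipped); $2: file to rename.
rename_copy() {
  script=$(mktemp)
  # One substitution command per table row, e.g. s+\<SRR7851920\>+MTZ13.12+g
  awk -F'\t' 'NR > 1 { printf "s+\\<%s\\>+%s+g\n", $1, $2 }' "$1" > "$script"
  sed -f "$script" "$2"
  rm -f "$script"
}

# Hypothetical usage:
#   rename_copy DATAonSCER.csv myresults.txt > myresults.renamed.txt
```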

ADDITIONAL DATA

Additional data are stored in a network attached storage (NAS) and shared through a personal link protected by password; both will be provided by email.

The password is personal and unique.

The access to the data is restricted to a few devices for security reasons.

The link remains valid only until the download is completed.

Additional data:

single-strain gVCF (filtered as described below) [available]

Filters

MQ >= 5

QUAL >= 20

DP >= 10
single-strain BAM files [not available but can be requested]
Any other intermediate file [not available but can be requested]

CONTACTS

Short-term contact:
To: nicolo.tellini.2@gmail.com
Subject: DATAEXT-yourname-DD/MM/YYYY

Long-term contact:
To: matnamo@gmail.com
Subject: DATAEXT-yourname-DD/MM/YYYY

VERSION

Version	Date	N. isolates
1.0	09/12/2022	1,674

CITATION

Please cite this paper when using data for the 1,674 strains in your publications.

Ancient and recent origins of shared polymorphisms in yeast
Nicolò Tellini, Matteo De Chiara, Simone Mozzachiodi, Lorenzo Tattini, Chiara Vischioni, Elena S. Naumova, Jonas Warringer, Anders Bergström & Gianni Liti
Nature Ecology and Evolution, 2024, https://doi.org/10.1038/s41559-024-02352-5

@article{tellini2024ancient,
  title={Ancient and recent origins of shared polymorphisms in yeast},
  author={Tellini, Nicol{\`o} and De Chiara, Matteo and Mozzachiodi, Simone and Tattini, Lorenzo and Vischioni, Chiara and Naumova, Elena S and Warringer, Jonas and Bergstr{\"o}m, Anders and Liti, Gianni},
  journal={Nature Ecology \& Evolution},
  pages={1--16},
  year={2024},
  publisher={Nature Publishing Group UK London}
}
