Home
Welcome to the AccessingGenbank wiki!
The online databases for biological sequence data are begging to be mined, but doing so manually is infeasible. Even downloading the files through the web interfaces is a hopeless task. I therefore "mined" Stack Exchange for a solution.
I was looking for a Python solution that would automatically:
- download a list of GenBank files
- mine the GenBank files for a certain field
- report a list of unique entries in that field
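The first and last of these steps can be sketched roughly as follows. This is a minimal sketch, not the script this wiki describes: it assumes NCBI's E-utilities `efetch` endpoint for the download, and the names `efetch_url`, `fetch_genbank` and `unique_values` are hypothetical. The network call is only defined here, not executed, and the field extraction in the middle step is left to whichever parser you use.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the NCBI E-utilities efetch endpoint (not named in the text above).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions):
    """Build a URL requesting GenBank flat files for a list of accession IDs."""
    params = {
        "db": "nuccore",             # nucleotide database
        "id": ",".join(accessions),  # several IDs in one request
        "rettype": "gb",             # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_genbank(accessions):
    """Download the records as one block of text (requires network access)."""
    with urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


def unique_values(values):
    """Drop duplicate entries while keeping the order they were first seen in."""
    return list(dict.fromkeys(values))
```

For instance, `unique_values(["sludge", "soil", "sludge"])` returns `["sludge", "soil"]`. For long accession lists NCBI asks clients to throttle their requests, so a real script would fetch in batches and pause between them.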
Why would you do this? For example, to find the habitats in which a certain group of microorganisms occurs.
GenBank files use a "flat" file format (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) that holds information such as the sequence data, accession IDs, authors and publication. For example:
LOCUS       AB290396                1145 bp    DNA     linear   ENV 04-JUL-2007
DEFINITION  Uncultured bacterium gene for 16S ribosomal RNA, partial sequence,
            clone: MBF16_34.
ACCESSION   AB290396
VERSION     AB290396.1  GI:139004549
KEYWORDS    ENV.
SOURCE      uncultured bacterium
  ORGANISM  uncultured bacterium
            Bacteria; environmental samples.
REFERENCE   1
  AUTHORS   Hatamoto,M., Imachi,H., Yashiro,Y., Ohashi,A. and Harada,H.
  TITLE     Diversity of Anaerobic Microorganisms Involved in Long-Chain Fatty
            Acid Degradation in Methanogenic Sludges as Revealed by RNA-Based
            Stable Isotope Probing
  JOURNAL   Appl. Environ. Microbiol. 73 (13), 4119-4127 (2007)
   PUBMED   17483279
REFERENCE   2  (bases 1 to 1145)
  AUTHORS   Hatamoto,M. and Imachi,H.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-JAN-2007) Hiroyuki Imachi, Japan Agency for
            Marine-Earth Science & Technology (JAMSTEC), Extremobiosphere
            Research Center; 2-15 Natsushima-cho, Yokosuka, Kanagawa 237-0061,
            Japan (E-mail:imachi@jamstec.go.jp, Tel:81-46-867-9709,
            Fax:81-46-867-9715)
FEATURES             Location/Qualifiers
     source          1..1145
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /isolation_source="Mesophilic anaerobic sludge treating
                     palm oil mill effluent"
                     /db_xref="taxon:77133"
                     /clone="MBF16_34"
                     /environmental_sample
     rRNA            <1..>1145
                     /product="16S ribosomal RNA"
ORIGIN
        1 agtcgagaat cttccccaat gggcgaaagc ctgagggagc gacgccgcgt gagggatgaa
       61 ggccctttgg gttgtaaacc tctgttaggg ggaaagaaaa gcagtggaag caatatgtcc
      121 attgcctgac gttaccccca gagaaagctc cggccaactc cgtgccagca gccgcggtaa
      181 tacgggggga gcaagcgttg tccggaatca ttgggcgtaa agggcgtgta ggcggcttgg
      241 caagtcgaat gtgaaatccc acggctcaac cgtggaactg cgttcgaaac tgccttgctt
      301 gagtgcggga gaggtgtgcg gaattcctgg tgtagcggtg gaatgcgtag atatcaggaa
      361 gaacaccggt ggcgaaggcg gcacactggc ccagcactga cgctgaggcg cgaaagcgtg
      421 gggagcgaac gggattagat accccggtag tccacgctgt aaactttggg cactaggtat
      481 tggaggtctc aaccccttca gtgccgtagc taacgcgtta agtgccccgc ctggggagta
      541 cggtcgcaag gctgaaactc aaaggaattg acgggggccc gcacaagcgg tggagcatgt
      601 ggtttaattc gatgcaacgc gaagaacctt accggggttt gacatgggag cctcgccgca
      661 aggcgaggtc agccctatga aagtagggtg tgtccacaca ggtgctgcat ggctgtcgtc
      721 agctcgtgtc gtgagatgtt gggttaagtc ccgcaacgag cgcaaccctc gccgatagtt
      781 accaacgggt catgccgggg actctatcgg gactgccggt gataaaccgg aggaaggtgg
      841 ggatgatgtc aagtcatcat ggcccttaca tcccgggcta cacacgtgct acaatggtcg
      901 gtacagcggg ttgcaatacc gcgaggtgga gcaaatcctc aaagccggcc tcagtacgga
      961 ttggagtctg caactcgact ctatgaagcc ggaatcgcta gtaatcgcgg atcagaatgc
     1021 cgcggtgaat acgttcccgg gccttgtaca caccgcccgt caagccatgg gaatcgccag
     1081 cactcgaagt cgctggccta accgcaaggg gggaggcgcc gaaagtgaag ccgatgactg
     1141 gggct
//
Luckily, some nice people have written packages that let you search a record by "field name"; for example, you can extract the "isolation_source" qualifier.
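To make the idea of a field-name lookup concrete, here is a minimal sketch. The text above does not say which package was used (Biopython is one that parses GenBank records), so this example avoids any third-party API and imitates the lookup with a regular expression from the standard library. The function name `qualifier_values` is hypothetical, and `RECORD` is just an excerpt of the FEATURES block from record AB290396 shown above.

```python
import re

# Excerpt of the FEATURES block from record AB290396.
RECORD = '''\
FEATURES             Location/Qualifiers
     source          1..1145
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /isolation_source="Mesophilic anaerobic sludge treating
                     palm oil mill effluent"
                     /db_xref="taxon:77133"
'''


def qualifier_values(text, name):
    """Return every value of the /name="..." qualifier found in text."""
    pattern = re.compile(r'/' + re.escape(name) + r'="([^"]*)"')
    # Long values wrap onto continuation lines; join them with single spaces.
    return [" ".join(value.split()) for value in pattern.findall(text)]


print(qualifier_values(RECORD, "isolation_source"))
# prints ['Mesophilic anaerobic sludge treating palm oil mill effluent']
```

A regular expression like this is fragile (it would mis-handle a value containing an escaped quote, for instance), which is exactly why a dedicated parser is the better choice for real mining.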