AIMI Dataset Index |
Stanford AIMI shares annotated data to foster transparent and reproducible collaborative research to advance AI in medicine. The datasets are available to the public to view and use without charge for non-commercial research purposes. |
American Gut project |
The American Gut project is the largest crowdsourced citizen science project to date. Fecal, oral, skin, and other body site samples collected from thousands of participants represent the largest human microbiome cohort in existence. Detailed health, lifestyle, and diet data associated with each sample are enabling us to deeply examine associations between the human microbiome and factors such as diet (from vegan to near carnivore and everything in between), season, amount of sleep, and disease states such as IBD, diabetes, or autism spectrum disorder, as well as many other factors not listed here. The American Gut project also encompasses the British Gut and Australian Gut projects, widening the cohort beyond North America. |
Cancer Genome Atlas Program |
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. |
Catalogue of Life |
Catalogue of Life (COL) is a collaboration bringing together the effort and contributions of taxonomists and informaticians from around the world. COL aims to address the needs of researchers, policy-makers, environmental managers and the wider public for a consistent and up-to-date listing of all the world’s known species. COL also supports those who need to manage their own taxonomic information and species lists. |
Catalogue Of Somatic Mutations In Cancer (COSMIC) |
COSMIC, the Catalogue Of Somatic Mutations In Cancer, is the world's largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer. |
CZ CELLxGENE |
Explore data to understand the functionality of human tissues at the cellular level. CZ CELLxGENE is an open-source software platform that helps scientists answer questions about the function of cells within our bodies in seconds instead of executing experiments that could take years. It’s the largest interoperable corpus of single-cell data. |
COCONUT |
COCONUT and Natural Products Online is an open-source open-data portal for natural products cheminformatics. As a free database, COCONUT can be searched in multiple ways, such as molecule names, molecular structures, and structural properties. COCONUT also provides molecular properties and descriptors for each natural product. Moreover, all the data in COCONUT are available for download and can be queried programmatically via an application programming interface (API). |
Consensus CDS (CCDS) project |
The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein-coding regions that are consistently annotated and of high quality. The long-term goal is to support convergence towards a standard set of gene annotations. |
Database of Genotypes and Phenotypes |
The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans. |
Drug Gene Interaction database |
DGIdb provides user-friendly browsing, searching, and filtering of information on drug-gene interactions and the druggable genome, mined from over thirty trusted sources. All data can be downloaded freely or accessed via our API. DGIdb is an open-source project. |
EMBL’s European Bioinformatics Institute (EMBL-EBI) |
EMBL’s European Bioinformatics Institute maintains the world’s most comprehensive range of freely available and up-to-date molecular data resources. |
Ensembl |
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species. |
Genomics of Drug Sensitivity in Cancer |
The Genomics of Drug Sensitivity in Cancer Project was built towards the goal of identifying cancer biomarkers that can be used to identify genetically defined subsets of patients most likely to respond to cancer therapies. As part of this effort, we are screening >1000 genetically characterised human cancer cell lines with a wide range of anti-cancer therapeutics. These compounds include cytotoxic chemotherapeutics as well as targeted therapeutics from commercial sources, academic collaborators, and from the biotech and pharmaceutical industries. |
InterPro |
InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several collaborating databases (referred to as member databases) that collectively make up the InterPro consortium. A key value of InterPro is that it combines protein signatures from these member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. |
LINCS L1000 |
The LINCS L1000 project has collected gene expression profiles for thousands of perturbagens at a variety of time points, doses, and cell lines. |
Molecular Modeling Database (MMDB) |
Macromolecular structures show the three-dimensional shape of proteins and other biomolecules and provide a wealth of information on biological function, on the mechanisms linked to that function, and on the evolutionary history of and relationships between macromolecules. Most structure data are obtained from experimental methods such as X-ray crystallography and NMR spectroscopy. The Molecular Modeling DataBase (MMDB) is a database of experimentally determined three-dimensional biomolecular structures, and is also referred to as the Entrez Structure database. It is a subset of three-dimensional structures obtained from the RCSB Protein Data Bank (PDB), excluding theoretical models. |
National Center for Biotechnology Information (NCBI) |
The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information. |
NYUMets: Brain |
NYUMets: Brain draws from the Center for Advanced Radiosurgery and constitutes a unique, real-world window into the complexities of metastatic cancer. NYUMets: Brain consists of data from 1,005 patients, 8,003 multimodal brain MRI studies, tabular clinical data from routine follow-up, and a complete record of prescribed medications—making it one of the largest datasets in existence of cranial imaging, and the largest dataset of metastatic cancer. In addition, more than 2,300 images have been carefully annotated by physicians with segmentations of metastatic tumors, making NYUMets: Brain a valuable source of segmented medical imaging. |
Pathguide |
Pathguide contains information about 702 biological pathway-related and molecular interaction-related resources. Databases that are free, and those that support the BioPAX, CellML, PSI-MI or SBML standards, are marked as such. |
PROSITE |
PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. |
proGenomes |
proGenomes provides consistently annotated bacterial and archaeal genomes containing billions of genes from over 40,000 species. Strict quality controls are applied to the included genomes to enable accurate analyses. The genomes can be interactively explored and downloaded, and subsets can be customized, e.g. by taxonomic clade, as representatives of each species, or as habitat-specific subsets. For each specI species cluster, precomputed pangenomes are available. |
Protein Data Bank |
The Protein Data Bank (PDB) was established as the first open access digital data resource in all of biology and medicine. It is today a leading global resource for experimental data central to scientific discovery. |
PubChem |
PubChem is an open chemistry database at the National Institutes of Health (NIH). PubChem mostly contains small molecules, but also larger molecules such as nucleotides, carbohydrates, lipids, peptides, and chemically-modified macromolecules. We collect information on chemical structures, identifiers, chemical and physical properties, biological activities, patents, health, safety, toxicity data, and many others. |
Reference Sequence (RefSeq) Database |
The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses. |
SCOP |
The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between proteins whose three-dimensional structure is known and deposited in the Protein Data Bank. |
The human metabolic reconstruction |
The human metabolism resource has been developed by the systems biology community over the past decade. It describes metabolic reactions and pathways known to occur in at least one cell type in the human body. This version was created by expanding the previous version of the human metabolic reconstruction, Recon 2, by adding new metabolites, transport reactions, and catalyzing reactions guided by publicly available metabolomics data. |
UniProt |
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). Across the three institutes more than 100 people are involved through different tasks such as database curation, software development and support. |
Unknome Database |
The human genome encodes ~20,000 proteins, many still uncharacterised. Scientific and social factors have resulted in a focus on well-studied proteins, leading to a concern that poorly understood genes are unjustifiably neglected. To address this, we have developed an "Unknome database" that ranks proteins based on how little is known about them. The database is intended to aid the selection of poorly characterised proteins from humans or model organisms so that they can be targeted for investigation. |
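
Several of the resources listed above mention programmatic access alongside their web interfaces. As a rough illustration of what such access can look like, the sketch below queries PubChem by compound name. It assumes PubChem's PUG REST service and its URL layout (`/compound/name/<name>/property/<properties>/JSON`), neither of which is described in this index. The function names `build_property_url`, `parse_property_table` and `fetch_properties` are hypothetical helpers introduced only for this example. Check the current PubChem documentation before relying on any of it.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of PubChem's PUG REST service (not stated in this index).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties):
    """Build a PUG REST URL that requests properties for a compound name."""
    # Percent-encode the name so spaces and slashes are safe in the URL path.
    encoded_name = quote(name, safe="")
    property_list = ",".join(properties)
    return (
        f"{PUG_REST_BASE}/compound/name/{encoded_name}"
        f"/property/{property_list}/JSON"
    )


def parse_property_table(payload):
    """Return the list of property records from a decoded JSON response."""
    # Assumed response shape: {"PropertyTable": {"Properties": [...]}}
    return payload["PropertyTable"]["Properties"]


def fetch_properties(name, properties, timeout=30):
    """Fetch and parse properties for a compound (needs network access)."""
    url = build_property_url(name, properties)
    with urlopen(url, timeout=timeout) as response:
        return parse_property_table(json.load(response))


if __name__ == "__main__":
    # Example call; the result depends on the live service.
    for record in fetch_properties("aspirin", ["MolecularFormula", "MolecularWeight"]):
        print(record)
```

URL construction and response parsing are kept separate from the network call so they can be checked offline. The other resources use different endpoints and response formats, so this code would need to be adapted for each of them.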