---
bibliography: Waldron_PublicData/Waldron_PublicData.bib
---
# 103: Public data resources and Bioconductor
## Instructor names and contact information
* Levi Waldron <levi.waldron at sph.cuny.edu> (City University of New York, New York, NY, USA)
* Benjamin Haibe-Kains <benjamin.haibe.kains at utoronto.ca> (Princess Margaret Cancer Center, Toronto, Canada)
* Sean Davis <sdavis2 at mail.nih.gov> (Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA)
## Syllabus
### Workshop Description
The goal of this workshop is to introduce Bioconductor packages for finding,
accessing, and using large-scale public data resources including the
Gene Expression Omnibus [GEO](https://www.ncbi.nlm.nih.gov/geo), Sequence
Read Archive [SRA](https://www.ncbi.nlm.nih.gov/sra), the Genomic Data
Commons [GDC](https://portal.gdc.cancer.gov/), and Bioconductor-hosted
curated data resources for metagenomics, pharmacogenomics [PharmacoDB](http://pharmacodb.ca/), and The Cancer
Genome Atlas.
### Pre-requisites
* Basic knowledge of R syntax
* Familiarity with the ExpressionSet and SummarizedExperiment classes
* Basic familiarity with 'omics technologies such as microarrays and next-generation sequencing (NGS)
Interested students can prepare by reviewing vignettes of the packages listed in "R/Bioconductor packages used" to gain background on aspects of interest to them.
Some more general background on these resources is published in @Kannan2016-yv.
### Workshop Participation
Each component will include runnable examples of typical usage that students are encouraged to run during demonstration of that component.
### R/Bioconductor packages used
* `r BiocStyle::Biocpkg("GEOquery")`: Access to the NCBI Gene Expression Omnibus (GEO), a public repository of gene expression (primarily microarray) data.
* `r BiocStyle::Biocpkg("GenomicDataCommons")`: Access to the NIH / NCI Genomic Data Commons RESTful service.
* `r BiocStyle::Githubpkg("seandavi/SRAdbV2")`: A compilation of metadata from the NCBI Sequence Read Archive, the largest public repository of sequencing data from the next generation of sequencing platforms, and tools for querying those metadata
* `r BiocStyle::Biocpkg("curatedTCGAData")`: Curated data from The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
* `r BiocStyle::Biocpkg("curatedMetagenomicData")`: Curated metagenomic data of the human microbiome
* `r BiocStyle::Biocpkg("HMP16SData")`: Curated metagenomic data of the human microbiome
* `r BiocStyle::Biocpkg("PharmacoGx")`: Curated large-scale preclinical pharmacogenomic data and basic analysis tools
### Time outline
This is a 1h45m workshop.
| Activity | Time |
|-------------------------------------|---------|
| Overview | 10m |
| GEOquery | 15m |
| GenomicDataCommons | 20m |
| Sequence Read Archive | 20m |
| curatedTCGAData | 10m |
| curatedMetagenomicData and HMP16SData | 15m |
| PharmacoGx | 20m |
### Workshop goals and objectives
Bioconductor provides access to significant amounts of publicly available
experimental data. This workshop introduces students to Bioconductor
interfaces to the NCBI's Gene Expression Omnibus, Genomic Data Commons,
Sequence Read Archive and PharmacoDB. It additionally introduces curated resources
providing The Cancer Genome Atlas, the Human Microbiome Project and other
microbiome studies, and major pharmacogenomic studies, as native Bioconductor
objects ready for analysis and comparison to in-house datasets.
### Learning goals
* search NCBI resources for publicly available 'omics data
* quickly use data from the TCGA and the Human Microbiome Project
### Learning objectives
* find and download processed microarray and RNA-seq datasets from the Gene Expression Omnibus
* find and download 'omics data from the Genomic Data Commons and Sequence Read Archive
* download and manipulate data from The Cancer Genome Atlas and Human Microbiome Project
* download and explore pharmacogenomics data
## Overview
Before proceeding, ensure that the following packages are installed.
```{r eval=FALSE}
required_pkgs = c(
"TCGAbiolinks",
"GEOquery",
"GenomicDataCommons",
"limma",
"curatedTCGAData",
"recount",
"curatedMetagenomicData",
"phyloseq",
"HMP16SData",
"caTools",
"piano",
"isa",
"VennDiagram",
"downloader",
"gdata",
"AnnotationDbi",
"hgu133a.db",
"PharmacoGx")
BiocManager::install(required_pkgs)
```
## GEOquery
The NCBI Gene Expression Omnibus (GEO) serves as a public repository for a wide range of high-throughput experimental data. These data include single and dual channel microarray-based experiments measuring mRNA, genomic DNA, and protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE), mass spectrometry proteomic data, and high-throughput sequencing data.
The `r BiocStyle::Biocpkg("GEOquery")` package [@Sean2007-cv] forms a bridge between this public repository and the analysis capabilities
in Bioconductor.
### Overview of GEO
At the most basic level of organization of GEO, there are four basic entity types. The first three (Sample, Platform, and Series) are supplied by users; the fourth, the dataset, is compiled and curated by GEO staff from the user-submitted data. See [the GEO home page](https://www.ncbi.nlm.nih.gov/geo/) for more information.
#### Platforms
A Platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE tags, peptides). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx). A Platform may reference many Samples that have been submitted by multiple submitters.
#### Samples
A Sample record describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned a unique and stable GEO accession number (GSMxxx). A Sample entity must reference only one Platform and may be included in multiple Series.
#### Series
A Series record defines a set of related Samples considered to be part of a group, how the Samples are related, and if and how they are ordered. A Series provides a focal point and description of the experiment as a whole. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each Series record is assigned a unique and stable GEO accession number (GSExxx). Series records are available in a couple of formats, which are handled by GEOquery independently. The smaller and newer GSEMatrix files are quite fast to parse; a simple flag is used by GEOquery to choose to use GSEMatrix files (see below).
#### Datasets
GEO DataSets (GDSxxx) are curated sets of GEO Sample data. There are hundreds of GEO datasets available, but GEO discontinued creating
GDS records several years ago. We mention them here for completeness only.
### Getting Started using GEOquery
Getting data from GEO is really quite easy. There is only one command that is needed, `getGEO`. This one function interprets its input to determine how to get the data from GEO and then parse the data into useful R data structures.
```{r loadLibrary20, message=FALSE}
library(GEOquery)
```
With the library loaded, we are free to access any GEO accession.
### Use case: MDS plot of cancer data
The data we are going to access are from [this paper](https://doi.org/10.1158/1055-9965.EPI-17-0461).
> Background: The tumor microenvironment is an important factor in cancer immunotherapy response. To further understand how a tumor affects the local immune system, we analyzed immune gene expression differences between matching normal and tumor tissue. Methods: We analyzed public and new gene expression data from solid cancers and isolated immune cell populations. We also determined the correlation between CD8, FoxP3 IHC, and our gene signatures. Results: We observed that regulatory T cells (Tregs) were one of the main drivers of immune gene expression differences between normal and tumor tissue. A tumor-specific CD8 signature was slightly lower in tumor tissue compared with normal of most (12 of 16) cancers, whereas a Treg signature was higher in tumor tissue of all cancers except liver. Clustering by Treg signature found two groups in colorectal cancer datasets. The high Treg cluster had more samples that were consensus molecular subtype 1/4, right-sided, and microsatellite-instable, compared with the low Treg cluster. Finally, we found that the correlation between signature and IHC was low in our small dataset, but samples in the high Treg cluster had significantly more CD8+ and FoxP3+ cells compared with the low Treg cluster. Conclusions: Treg gene expression is highly indicative of the overall tumor immune environment. Impact: In comparison with the consensus molecular subtype and microsatellite status, the Treg signature identifies more colorectal tumors with high immune activation that may benefit from cancer immunotherapy.
In this little exercise, we will:
1. Access public omics data using the GEOquery package
2. Convert the public omics data to a `SummarizedExperiment` object.
3. Perform a simple unsupervised analysis to visualize these public data.
Use the [GEOquery] package to fetch data about [GSE103512].
```{r geoquery10, echo=TRUE, cache=TRUE, message=FALSE}
gse = getGEO("GSE103512")[[1]]
```
Note that `getGEO`, when used to retrieve *GSE* records, returns a list. The members of the list each represent
one *GEO Platform*, since each *GSE* record can contain multiple related datasets (e.g., gene expression and DNA methylation).
In this case, the list is of length one, but it is still necessary to grab the first element.
The first step--a detail--is to convert from the older Bioconductor data structure (GEOquery was written in 2007), the `ExpressionSet`, to the newer `SummarizedExperiment`. One line suffices.
```{r convertToSE, message=FALSE}
library(SummarizedExperiment)
se = as(gse, "SummarizedExperiment")
```
Examine two variables of interest, cancer type and tumor/normal status.
```{r geoquery20,echo=TRUE,cache=TRUE}
with(colData(se),table(`cancer.type.ch1`,`normal.ch1`))
```
Filter gene expression by variance to find most informative genes.
```{r sds,cache=TRUE,echo=TRUE}
sds = apply(assay(se, 'exprs'),1,sd)
dat = assay(se, 'exprs')[order(sds,decreasing = TRUE)[1:500],]
```
Perform [multidimensional scaling] and prepare for plotting. We will be using ggplot2, so
we need to make a data.frame before plotting.
```{r mds,echo=TRUE,cache=TRUE}
mdsvals = cmdscale(dist(t(dat)))
mdsvals = as.data.frame(mdsvals)
mdsvals$Type=factor(colData(se)[,'cancer.type.ch1'])
mdsvals$Normal = factor(colData(se)[,'normal.ch1'])
head(mdsvals)
```
And do the plot.
```{r mdsplot,echo=TRUE,fig.align='center'}
library(ggplot2)
ggplot(mdsvals, aes(x=V1,y=V2,shape=Normal,color=Type)) +
geom_point( alpha=0.6) + theme(text=element_text(size = 18))
```
[R]: https://cran.r-project.org/
[GEOquery]: https://bioconductor.org/packages/GEOquery
[GSE103512]: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103512
[multidimensional scaling]: https://en.wikipedia.org/wiki/Multidimensional_scaling
### Accessing Raw Data from GEO
NCBI GEO accepts (but has not always required) raw data such as .CEL files, .CDF files, images, etc. It is also not uncommon for some RNA-seq or other sequencing datasets to supply *only* raw data (with accompanying sample information, of course). Sometimes, it is useful to get quick access to such data. A single function, `getGEOSuppFiles`, takes a GEO accession as an argument and downloads all the raw data associated with that accession. By default, the function creates a directory in the current working directory to store the raw data for the chosen GEO accession.
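As a minimal, not-evaluated sketch, fetching the supplementary files for the accession used above would look something like this (the exact files returned depend on what the submitter deposited):
```{r rawDataExample, eval=FALSE}
# Download all supplementary (raw) files associated with GSE103512 into
# a new subdirectory of the current working directory.
supp = getGEOSuppFiles("GSE103512")
supp  # a data.frame describing the downloaded files
```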
## GenomicDataCommons
From the [Genomic Data Commons (GDC) website](https://gdc.nci.nih.gov/about-gdc):
> The National Cancer Institute's (NCI's) Genomic Data Commons (GDC) is
a data sharing platform that promotes precision medicine in
oncology. It is not just a database or a tool; it is an expandable
knowledge network supporting the import and standardization of genomic
and clinical data from cancer research programs.
> The GDC contains NCI-generated data from some of the largest and most
comprehensive cancer genomic datasets, including The Cancer Genome
Atlas (TCGA) and Therapeutically Applicable Research to Generate
Effective Therapies (TARGET). For the first time, these datasets have
been harmonized using a common set of bioinformatics pipelines, so
that the data can be directly compared.
> As a growing knowledge system for cancer, the GDC also enables
researchers to submit data, and harmonizes these data for import into
the GDC. As more researchers add clinical and genomic data to the GDC,
it will become an even more powerful tool for making discoveries about
the molecular basis of cancer that may lead to better care for
patients.
The
[data model for the GDC is complex](https://gdc.cancer.gov/developers/gdc-data-model/gdc-data-model-components),
but it is worth a quick overview, and a graphical representation is included here.

The
GDC API exposes these nodes and edges in a somewhat simplified set
of
[RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer) endpoints.
### Quickstart
This quickstart section is just meant to show basic
functionality. More details of functionality are included further on
in this vignette and in function-specific help.
To report bugs or problems, either
[submit a new issue](https://github.com/Bioconductor/GenomicDataCommons/issues)
or submit a `bug.report(package='GenomicDataCommons')` from within R (which will
redirect you to the new issue on GitHub).
#### Installation
Installation of the `r BiocStyle::Biocpkg("GenomicDataCommons")` package is
identical to installation of other Bioconductor packages.
```{r install, eval=FALSE}
install.packages('BiocManager')
BiocManager::install('GenomicDataCommons')
```
After installation, load the library in order to use it.
```{r libraries, message=FALSE}
library(GenomicDataCommons)
```
#### Check connectivity and status
The `r BiocStyle::Biocpkg("GenomicDataCommons")` package relies on having network
connectivity. In addition, the NCI GDC API must also be operational
and not under maintenance. Checking `status` can be used to check this
connectivity and functionality.
```{r statusQS}
GenomicDataCommons::status()
```
#### Find data
The following code builds a `manifest` that can be used to guide the
download of raw data. Here, filtering finds gene expression files
quantified as raw counts using `HTSeq` from ovarian cancer patients.
```{r manifest}
ge_manifest = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' &
            type == 'gene_expression' &
            analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()
```
#### Download data
After the `r nrow(ge_manifest)` gene expression files
specified in the query above have been identified, they can be downloaded. Using multiple processes to do the download very
significantly speeds up the transfer in many cases. On a standard 1Gb
connection, the following completes in about 30 seconds. The first time the
data are downloaded, R will ask to create a cache directory (see `?gdc_cache`
for details of setting and interacting with the cache). Resulting
downloaded files will be stored in the cache directory. Future access to
the same files will be directly from the cache, alleviating multiple downloads.
```{r downloadQS}
fnames = lapply(ge_manifest$id[1:20],gdcdata)
```
If the download had included controlled-access data, the download above would
have needed to include a `token`. Details are available in
[the authentication section below](#authentication).
#### Metadata queries
The `r BiocStyle::Biocpkg("GenomicDataCommons")` can access the significant clinical, demographic, biospecimen, and annotation information contained in the NCI GDC.
```{r metadataQS}
expands = c("diagnoses","annotations",
"demographic","exposures")
projResults = projects() %>%
results(size=10)
str(projResults,list.len=5)
names(projResults)
# or interactively: listviewer::jsonedit(projResults)
```
### Basic design
This package design is meant to have some similarities to the "hadleyverse"
approach of dplyr. Roughly, the functionality for finding and accessing files
and metadata can be divided into:
1. Simple query constructors based on GDC API endpoints.
2. A set of verbs that, when applied, adjust filtering, field selection, and
faceting (fields for aggregation) and result in a new query object (an
endomorphism)
3. A set of verbs that take a query and return results from the GDC
In addition, there are auxiliary functions for asking the GDC API for
information about available and default fields, slicing BAM files, and
downloading actual data files. Here is an overview of functionality[^10]; a
minimal sketch combining several of these verbs follows the list.
- Creating a query
- `projects()`
- `cases()`
- `files()`
- `annotations()`
- Manipulating a query
- `filter()`
- `facet()`
- `select()`
- Introspection on the GDC API fields
- `mapping()`
- `available_fields()`
- `default_fields()`
- `grep_fields()`
- `field_picker()`
- `available_values()`
- `available_expand()`
- Executing an API call to retrieve query results
- `results()`
- `count()`
- `response()`
- Raw data file downloads
- `gdcdata()`
- `transfer()`
- `gdc_client()`
- Summarizing and aggregating field values (faceting)
- `aggregations()`
- Authentication
- `gdc_token()`
- BAM file slicing
- `slicing()`
[^10]: See individual function and methods documentation for specific details.
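As promised above, here is a minimal sketch (not evaluated) of how the verbs compose: build a query, narrow it with `filter()`, and ask the GDC for a `count()` of matching records. The filter field used here is one we return to later in this section.
```{r verbsSketch, eval=FALSE}
# Compose a files query, filter to one project, and count matching records.
files() %>%
    filter(~ cases.project.project_id == 'TCGA-OV') %>%
    count()
```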
### Usage
There are two main classes of operations when working with the NCI GDC.
1. [Querying metadata and finding data files](#querying-metadata) (e.g., finding
all gene expression quantifications data files for all colon cancer patients).
2. [Transferring raw or processed data](#datafile-access-and-download) from the
GDC to another computer (e.g., downloading raw or processed data)
Both classes of operation are reviewed in detail in the following sections.
## Querying metadata
Vast amounts of metadata about cases (patients, basically), files, projects, and
so-called annotations are available via the NCI GDC API. Typically, one will
want to query metadata to either focus in on a set of files for download or
transfer *or* to perform so-called aggregations (pivot-tables, facets, similar
to the R `table()` functionality).
Querying metadata starts with [creating a "blank" query](#creating-a-query). One
will often then want to [`filter`](#filtering) the query to limit results prior
to [retrieving results](#retrieving-results). The GenomicDataCommons package has
[helper functions for listing fields](#fields-and-values) that are available for
filtering.
In addition to fetching results, the GDC API allows
[faceting, or aggregating](#facets-and-aggregation), which is useful for compiling
reports, generating dashboards, or building user interfaces to GDC data (see GDC
web query interface for a non-R-based example).
#### Creating a query
The `r BiocStyle::Biocpkg("GenomicDataCommons")` package accesses the *same*
API as the *GDC* website. Therefore, a useful approach, particularly for
beginning users, is to examine the filters available on the
[GDC repository pages](https://portal.gdc.cancer.gov/repository)
to find appropriate filtering criteria. From there, converting those checkboxes
to a GenomicDataCommons `query()` is relatively straightforward. Note that only
a small subset of the `available_fields()` are available by default on the website.

A query of the GDC starts its life in R. Queries follow the four metadata
endpoints available at the GDC. In particular, there are four convenience
functions that each create `GDCQuery` objects (actually, specific subclasses of
`GDCQuery`):
- `projects()`
- `cases()`
- `files()`
- `annotations()`
```{r projectquery}
pquery = projects()
```
The `pquery` object is now an object of (S3) class `GDCQuery` (and
`gdc_projects` and `list`). The object contains the following elements:
- fields: This is a character vector of the fields that will be returned when we
[retrieve data](#retrieving-results). If no fields are specified to, for
example, the `projects()` function, the default fields from the GDC are used
(see `default_fields()`)
- filters: This will contain results after calling the
[`filter()` method](#filtering) and will be used to filter results on
[retrieval](#retrieving-results).
- facets: A character vector of field names that will be used for
[aggregating data](#facets-and-aggregation) in a call to `aggregations()`.
- archive: One of either "default" or ["legacy"](https://gdc-portal.nci.nih.gov/legacy-archive/).
- token: A character(1) token from the GDC. See
[the authentication section](#authentication) for details, but note that, in
general, the token is not necessary for metadata query and retrieval, only for
actual data download.
Looking at the actual object (get used to using `str()`!), note that the query
contains no results.
```{r pquery}
str(pquery)
```
#### Retrieving results
[[ GDC pagination documentation ]](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#size-and-from)
[[ GDC sorting documentation ]](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#sort)
With a query object available, the next step is to retrieve results from the
GDC using the GenomicDataCommons package. The most basic type of results we can get
is a simple `count()` of records available that satisfy the filter criteria.
Note that we have not set any filters, so a `count()` here will represent all
the project records publicly available at the GDC in the "default" archive.
```{r pquerycount}
pcount = count(pquery)
# or
pcount = pquery %>% count()
pcount
```
The `results()` method will fetch actual results.
```{r pqueryresults}
presults = pquery %>% results()
```
These results are
returned from the GDC in [JSON](http://www.json.org/) format and
converted into a (potentially nested) list in R. The `str()` method is useful
for taking a quick glimpse of the data.
```{r presultsstr}
str(presults)
```
By default, only 10 records are returned. We can use the `size` and `from`
arguments to `results()` to either page through results or to change the number
of results. Finally, there is a convenience method, `results_all()`, that will
simply fetch all the available results given a query. Note that `results_all()`
may take a long time and return HUGE result sets if not used carefully. Using a
combination of `count()` and `results()` to get a sense of the expected data
size is probably warranted before calling `results_all()`.
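As a small sketch of paging (not evaluated here; the exact offset semantics of `from` are described in the GDC pagination documentation linked above):
```{r pagingSketch, eval=FALSE}
# Page through project records five at a time.
# `size` sets how many records to return; `from` sets the starting record
# (see the GDC pagination documentation for offset details).
page1 = pquery %>% results(size = 5)
page2 = pquery %>% results(size = 5, from = 5)
length(ids(page1))
length(ids(page2))
```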
```{r presultsall}
length(ids(presults))
presults = pquery %>% results_all()
length(ids(presults))
# includes all records
length(ids(presults)) == count(pquery)
```
Extracting subsets of
results or manipulating the results into a more conventional R data
structure is not easily generalizable. However,
the
[purrr](https://github.com/hadley/purrr),
[rlist](https://renkun.me/rlist/),
and [data.tree](https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html) packages
are all potentially of interest for manipulating complex, nested list
structures. For viewing the results in an interactive viewer, consider the
[listviewer](https://github.com/timelyportfolio/listviewer) package.
#### Fields and Values
[[ GDC `fields` documentation ]](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#fields)
Central to querying and retrieving data from the GDC is the ability to specify
which fields to return, filtering by fields and values, and faceting or
aggregating. The GenomicDataCommons package includes two simple functions,
`available_fields()` and `default_fields()`. Each can operate on a character(1)
endpoint name ("cases", "files", "annotations", or "projects") or a `GDCQuery`
object.
```{r defaultfields}
default_fields('files')
# The number of fields available for files endpoint
length(available_fields('files'))
# The first few fields available for files endpoint
head(available_fields('files'))
```
The fields to be returned by a query can be specified following a similar
paradigm to that of the dplyr package. The `select()` function is a verb that
resets the fields slot of a `GDCQuery`; note that this is not quite analogous to
the dplyr `select()` verb, which limits the columns to a subset of those already present. We
*completely replace* the fields when using `select()` on a `GDCQuery`.
```{r selectexample}
# Default fields here
qcases = cases()
qcases$fields
# set up query to use ALL available fields
# Note that checking of fields is done by select()
qcases = cases() %>% GenomicDataCommons::select(available_fields('cases'))
head(qcases$fields)
```
Finding fields of interest is such a common operation that the
GenomicDataCommons includes the `grep_fields()` function and the
`field_picker()` widget. See the appropriate help pages for details.
#### Facets and aggregation
[[ GDC `facet` documentation ]](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#facets)
The GDC API offers a feature known as aggregation or faceting. By
specifying one or more fields (of appropriate type), the GDC can
return to us a count of the number of records matching each potential
value. This is similar to the R `table` method. Multiple fields can be
returned at once, but the GDC API does not have a cross-tabulation
feature; all aggregations are only on one field at a time. Results of
`aggregations()` calls come back as a list of data.frames (actually,
tibbles).
```{r aggexample}
# total number of files of a specific type
res = files() %>% facet(c('type','data_type')) %>% aggregations()
res$type
```
Using `aggregations()` is also an easy way to learn the contents of individual
fields and forms the basis for faceted search pages.
#### Filtering
[[ GDC `filtering` documentation ]](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#filters-specifying-the-query)
The GenomicDataCommons package uses a form of non-standard evaluation to specify
R-like queries that are then translated into an R list. That R list is, upon
calling a method that fetches results from the GDC API, translated into the
appropriate JSON string. The R expression uses the formula interface as
suggested by Hadley Wickham in his [vignette on non-standard evaluation](https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html)
> It’s best to use a formula because a formula captures both the expression to
evaluate and the environment where the evaluation occurs. This is important if
the expression is a mixture of variables in a data frame and objects in the
local environment [for example].
For the user, these details will not be too important except to note that a
filter expression must begin with a "~".
```{r allfilesunfiltered}
qfiles = files()
qfiles %>% count() # all files
```
To limit the file type, we can refer back to the
[section on faceting](#facets-and-aggregation) to see the possible values for
the file field "type". For example, to filter file results to only
"gene_expression" files, we simply specify a filter.
```{r onlyGeneExpression}
qfiles = files() %>% filter(~ type == 'gene_expression')
# here is what the filter looks like after translation
str(get_filter(qfiles))
```
What if we want to create a filter based on the project ('TCGA-OVCA', for
example)? Well, we have a couple of possible ways to discover available fields.
The first is based on base R functionality and some intuition.
```{r filtAvailFields}
grep('pro',available_fields('files'),value=TRUE)
```
Interestingly, the project information is "nested" inside the case. We don't
need to know that detail other than to know that we now have a few potential
guesses for where our information might be in the files records. We need to
know where because we need to construct the appropriate filter.
```{r filtProgramID}
files() %>% facet('cases.project.project_id') %>% aggregations()
```
We note that `cases.project.project_id` looks like it is a good fit. We also
note that `TCGA-OV` is the correct project_id, not `TCGA-OVCA`. Note that
*unlike with dplyr and friends, the `filter()` method here **replaces** the
filter and does not build on any previous filters*.
```{r filtfinal}
qfiles = files() %>%
filter( ~ cases.project.project_id == 'TCGA-OV' & type == 'gene_expression')
str(get_filter(qfiles))
qfiles %>% count()
```
Asking for a `count()` of results given these new filter criteria gives `r qfiles %>% count()` results. Generating a manifest for bulk downloads is as
simple as asking for the manifest from the current query.
```{r filtAndManifest}
manifest_df = qfiles %>% manifest()
head(manifest_df)
```
Note that we might still not be quite there. Looking at filenames, there are
suspiciously named files that might include "FPKM", "FPKM-UQ", or "counts".
Another round of `grep` and `available_fields()`, this time looking for "type",
turned up the field "analysis.workflow_type", which has the appropriate filter criteria.
```{r filterForHTSeqCounts}
qfiles = files() %>% filter( ~ cases.project.project_id == 'TCGA-OV' &
                            type == 'gene_expression' &
                            analysis.workflow_type == 'HTSeq - Counts')
manifest_df = qfiles %>% manifest()
nrow(manifest_df)
```
The GDC Data Transfer Tool can be used (from R, `transfer()` or from the
command-line) to orchestrate high-performance, restartable transfers of all the
files in the manifest. See [the bulk downloads section](#bulk-downloads) for
details.
### Authentication
[[ GDC authentication documentation ]](https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#facets)
The GDC offers both "controlled-access" and "open" data. As of this
writing, only data stored as files can be "controlled-access"; that is,
all metadata accessible via the GDC is "open", while files may be either
"open" or "controlled-access". Controlled-access data are
only available
after
[going through the process of obtaining access.](https://gdc.cancer.gov/access-data/obtaining-access-controlled-data)
After controlled-access to one or more datasets has been granted,
logging into the GDC web portal will allow you
to
[access a GDC authentication token](https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Authentication/#gdc-authentication-tokens),
which can be downloaded and then used to access available
controlled-access data via the GenomicDataCommons package.
The GenomicDataCommons uses authentication tokens only for downloading
data (see `transfer` and `gdcdata` documentation). The package
includes a helper function, `gdc_token`, that looks for the token to
be stored in one of three ways (resolved in this order):
1. As a string stored in the environment variable, `GDC_TOKEN`
2. As a file, stored in the file named by the environment variable,
`GDC_TOKEN_FILE`
3. In a file in the user home directory, called `.gdc_token`
As a concrete example:
```{r authenNoRun, eval=FALSE}
token = gdc_token()
transfer(...,token=token)
# or
transfer(...,token=gdc_token())
```
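For instance, a small sketch (not run) of the second option above, using a hypothetical path to a token file downloaded from the GDC portal:
```{r tokenFileSketch, eval=FALSE}
# Point gdc_token() at a downloaded token file (the path is hypothetical;
# substitute the location of your own token).
Sys.setenv(GDC_TOKEN_FILE = "/path/to/gdc-user-token.txt")
token = gdc_token()
```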
### Datafile access and download
The `gdcdata` function takes a character vector of one or more file
ids. A simple way of producing such a vector is to produce a
`manifest` data frame and then pass in the first column, which will
contain file ids.
```{r singlefileDL}
fnames = gdcdata(manifest_df$id[1:2],progress=FALSE)
```
Note that for controlled-access data, a
GDC [authentication token](#authentication) is required. Using the
`BiocParallel` package may be useful for downloading in parallel,
particularly for large numbers of smallish files.
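As a sketch (not run) of one way to do this with `BiocParallel`, assuming a multicore backend is appropriate for your machine:
```{r parallelDLSketch, eval=FALSE}
library(BiocParallel)
# Register a parallel backend, then download one file per worker call.
register(MulticoreParam(workers = 4))
fnames = bplapply(manifest_df$id[1:10], gdcdata)
```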
The bulk download functionality is only efficient (as of v1.2.0 of the
GDC Data Transfer Tool) for relatively large files, so use this
approach only when transferring BAM files or larger VCF files, for
example. Otherwise, consider using the approach shown above, perhaps
in parallel.
```{r bulkDL}
fnames = gdcdata(manifest_df$id[3:10], access_method = 'client')
```
## Sequence Read Archive
The SRAdbV2 package is currently available from [GitHub] and is under
active development. Either the `r BiocStyle::CRANpkg("devtools")` package or the
`r BiocStyle::CRANpkg("BiocManager")` package can be used for easy installation.
```{r eval=FALSE}
install.packages('BiocManager')
BiocManager::install('seandavi/SRAdbV2')
```
[GitHub]: https://github.com/seandavi/SRAdbV2
### Usage
#### Loading the library
```{r cache=FALSE}
library(SRAdbV2)
```
#### The Omicidx
The entrypoint for using the SRAdbV2 system is the `Omicidx`, an [R6 class].
To start, create a new instance of the class.
[R6 class]: https://cran.r-project.org/web/packages/R6/vignettes/Introduction.html
```{r cache=FALSE}
oidx = Omicidx$new()
```
Typing `oidx$` and then `TAB` will give possible completions. Note the "search"
completion.
#### Queries
Once an instance of `Omicidx` is created (here we will call the instance `oidx`),
search capabilities are available via `oidx$search()`. The one interesting
parameter is the `q` parameter. This parameter takes a string formatted as a Lucene query
string. See below for [Query syntax].
```{r cache=FALSE}
query=paste(
    paste0('sample_taxon_id:', 10116),
    'AND experiment_library_strategy:"rna seq"',
    'AND experiment_library_source:transcriptomic',
    'AND experiment_platform:illumina')
z = oidx$search(q=query,entity='full',size=100L)
```
The `entity` parameter is one of the SRA entity types available via the API.
The `size` parameter is the number of records that will be returned in each "chunk".
#### Fetching results
Because result sets can be large, we have a special method that allows us to
"scroll" through the results or to simply get them *en bloc*. The first step
for result retrieval, then, is to get a `Scroller`.
```{r cache=FALSE}
s = z$scroll()
s
```
Methods such as `s$count` allow introspection into the available number of
results, in this case, `r s$count` records.
The `Scroller` provides two different approaches to accessing the resulting data.
##### Collating entire result sets
The first approach to getting results of a query back into R is the
most convenient, but for large result sets, the entire dataset is
loaded into memory and may take significant time if network
connections are slow.
```{r cache=FALSE}
# for VERY large result sets, this may take
# quite a bit of time and/or memory. An
# alternative is to use s$chunk() to retrieve
# one batch of records at a time and process
# incrementally.
res = s$collate(limit = 1000)
head(res)
```
Note that the scroller now reports that it has fetched (`s$fetched`) `r s$fetched` records.
To reuse a `Scroller`, we must reset it first.
```{r cache=FALSE}
s$reset()
s
```
##### Yielding chunks
The second approach is to iterate through results using the `yield`
method. This approach allows the user to perform processing on chunks
of data as they arrive in R.
```{r cache=FALSE}
j = 0
## fetch only 500 records, but
## `yield` will return NULL
## after ALL records have been fetched
while(s$fetched < 500) {
    res = s$yield()
    # do something interesting with `res` here if you like
    j = j + 1
    message(sprintf('total of %d fetched records, loop iteration # %d', s$fetched, j))
}
```
The `Scroller` also has a `has_next()` method that will report `TRUE` if the
result set has not been fully fetched. Using the `reset()` method will move the
cursor back to the beginning of the result set.
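Putting `reset()`, `has_next()`, and `yield()` together, a sketch (not evaluated) of iterating until the result set is exhausted:
```{r scrollerSketch, eval=FALSE}
s$reset()
while (s$has_next()) {
    res = s$yield()
    # process each chunk of records here
}
s$fetched
```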
### Query syntax
#### Terms
A query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases. A Single Term is a single word such as "test" or "hello". A Phrase is a group of words surrounded by double quotes such as "hello dolly". Multiple terms can be combined together with Boolean operators to form a more complex query (see below).
#### Fields
Queries support fielded data. When performing a search you can either specify a field or use the default field. The field names and the default field are implementation specific. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for. As an example, let's assume a Lucene index contains two fields, title and abstract. If you want to find the document entitled "The Right Way" which contains the text "don't go this way" in the abstract, you can enter:
```
title:"The Right Way" AND abstract:go
```
Note: The field is only valid for the term that it directly precedes, so the query
```
title:Do it right
```
will only find "Do" in the title field. It will find "it" and "right" in any other fields.
#### Wildcard Searches
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries). To perform a single character wildcard search use the "?" symbol. To perform a multiple character wildcard search use the "*" symbol. The single character wildcard search looks for terms that match that with the single character replaced. For example, to search for "text" or "test" you can use the search:
```
te?t
```
Multiple character wildcard searches look for 0 or more characters. For example, to search for test, tests or tester, you can use the search:
```
test*
```
You can also use the wildcard searches in the middle of a term.
```
te*t
```
Note: You cannot use a * or ? symbol as the first character of a search.
#### Fuzzy Searches
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:
```
roam~
```
This search will find terms like foam and roams.
Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example:
```
roam~0.8
```
The default that is used if the parameter is not given is 0.5.
#### Proximity Searches
Lucene supports finding words that are within a specific distance of one another. To do a proximity search, use the tilde, "~", symbol at the end of a Phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:
```
"jakarta apache"~10
```
#### Range Searches
Range Queries allow one to match documents whose field(s) values are between the lower and upper bound specified by the Range Query. Range Queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically.
```
mod_date:[20020101 TO 20030101]
```
This will find documents whose mod_date fields have values between 20020101 and 20030101, inclusive. Note that Range Queries are not reserved for date fields. You could also use range queries with non-date fields:
```
title:{Aida TO Carmen}
```
This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen. Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
#### Boolean Operators
Boolean operators allow terms to be combined through logic operators. Lucene supports AND, "+", OR, NOT and "-" as Boolean operators (note: Boolean operators must be ALL CAPS).
##### OR
The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exist in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.
To search for documents that contain either "jakarta apache" or just "jakarta" use the query:
```
"jakarta apache" jakarta
```
or
```
"jakarta apache" OR jakarta
```
##### AND
The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol && can be used in place of the word AND. To search for documents that contain "jakarta apache" and "Apache Lucene" use the query:
```
"jakarta apache" AND "Apache Lucene"
```
##### +
The "+" or required operator requires that the term after the "+" symbol exist somewhere in a the field of a single document. To search for documents that must contain "jakarta" and may contain "lucene" use the query: