# Batch Processing on the Cloud {#sec-batch}
Now we're prepared for the big one: batch processing on the DNAnexus platform. All of the shell and DNAnexus skills we've learned will be leveraged in this chapter.
:::{.callout-note}
## Prep for Exercises
Make sure you are logged into the platform using `dx login` and that your course project is selected with `dx select`.
In your shell (either on your machine or in binder), make sure you're in the `bash_bioinfo_scripts/batch-processing/` folder:
```
cd batch-processing/
```
:::
## Learning Objectives
1. **Utilize** `dx find data` to find data files on the platform to batch process.
1. **Iterate** over files using Bash scripting and `xargs` on the platform to batch process them within a DNAnexus project.
1. **Leverage** dxFUSE to simplify your bash scripts.
1. **Utilize** `dx generate_batch_inputs`/`dx run --batch-tsv` to batch process files.
1. **Utilize** Python to batch process multiple files per worker.
## Two Ways of Batching
:::{#fig-batch1}
```{mermaid}
graph LR;
A[List files <br/> using `dx find data`] --> F{"|"}
F --> E[`xargs` sh -c]
E --> B[`dx run` <br/> on file1];
E --> C[`dx run` <br/> on file2];
E --> D[`dx run` <br/> on file3];
```
Batch method 1. We list files and then pipe them into `xargs`, which generates an individual `dx run` statement for each file.
:::
:::{#fig-batch2}
```{mermaid}
graph LR;
A[Submit array <br/> of files <br/> in `dx run`] --> B[Loop over array <br/> of files <br/> in worker];
```
Batch method 2. We first get our files onto the worker through a single dx run command, and then use `xargs` on the worker to cycle through them.
:::
We actually have two methods of batching jobs using Swiss Army Knife:
1. Use `xargs` on our home system to run `dx run` statements for each file (@fig-batch1).
1. Submit an array of files as an input to Swiss Army Knife, then process each file using the `icmd` input (@fig-batch2).
Both of these methods can potentially be useful.
## Finding files using `dx find data` {#sec-dx-find}
`dx find data` is an extremely helpful command on the DNAnexus platform. Given metadata and folder paths, it will return a list of files that meet your criteria.
`dx find data` lets you search on the following types of metadata:
- tags `--tag`
- properties `--property`
- name `--name`
- type `--type`
It can output in a number of different formats, including:
- `--brief` - return only the file-ids
- `--json` - return file information in JSON format
- `--verbose` - this is the default setting
- `--delimited` - return as a delimited text file
Of all of these, `--brief` and `--json` are the most useful for automation. `--delimited` is also helpful, but there is also a utility called `dx generate_batch_inputs` that will let us specify multiple inputs to process line by line.
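For example, here is a small sketch of pulling just the file IDs out of the JSON output. This assumes you have `jq` installed and that each result object carries an `id` field:

```{bash}
#| eval: false
# Extract just the file IDs from the JSON output of dx find data
# (assumes jq is installed; .id holds each file's ID)
dx find data --name "*.bam" --json | jq -r '.[].id'
```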
## Helpful `dx find data` examples
As we're starting off in our batch processing journey, I wanted to provide some helpful recipes for selecting files.
### Find all *.bam files in a project
You can use wildcard characters with the `--name` flag. Here, we're looking for anything that matches the pattern `*.bam`.
```{bash}
#| eval: false
#| filename: batch-processing/dx-find-data-name.sh
dx find data --name "*.bam" --brief
```
### Searching within a folder
You can add the `--path` flag to search in a specific folder.
```{bash}
#| eval: false
#| filename: batch-processing/dx-find-path.sh
dx find data --name "*.bam" --path "data/"
```
### Find all files with a field id
Take advantage of metadata associated with files when you can. If you are on UKB RAP, one of the most helpful properties to search is `field_id`.
Note: be careful with this one, especially if you are working on UK Biobank RAP. You don't want to return 500,000 file ids. I would concentrate on the field ids that are aggregated on the population level, such as the pVCF files.
```{bash}
#| eval: false
#| filename: batch-processing/dx-find-data-field.sh
dx find data --property field_id="23148" --brief
```
### Find all files that are of class `file`
There are a number of different object classes on the platform, such as `file` or `applet`.
Here we search for all objects in your project that are of the `file` class.
```{bash}
#| eval: false
#| filename: batch-processing/dx-find-data-class.sh
dx find data --class file --brief
```
### In General: Think about leveraging metadata
In general, think about leveraging metadata that is attached to your files.
For example, for the UKB Research Analysis Platform, data files in the `Bulk/` folder in your project have multiple properties: `field_id` (the data field as specified by UK Biobank) and `eid`.
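As a sketch, you can combine a property search with a folder path; the `Bulk/` folder name here is illustrative of the UKB RAP layout:

```{bash}
#| eval: false
# Combine a property search with a folder path
# ("Bulk/" is illustrative of the UKB RAP project layout)
dx find data --property field_id="23148" --path "Bulk/" --brief
```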
## Using `xargs` to Batch Multiple Files {#sec-xargs2}
Ok, now we have a list of files from `dx find data` that meet our criteria. How can we process them one by one?
Remember our discussion of `xargs` (@sec-xargs)? This is where `xargs` shines: when you provide it a list of files.
Remember, a really useful pattern for `xargs` is using it for variable expansion and starting a subshell to process individual files.
```{bash}
#| eval: false
#| filename: batch-processing/dx-find-xargs.sh
dx find data --name "*.bam" --brief | \
xargs -I % sh -c 'dx run app-swiss-army-knife -y -iin="%" \
-icmd="samtools view -c \${in_name} > \${in_prefix}-counts.txt" \
--tag samjob --destination results/'
```
The key piece of code, where the variable expansion happens, is this:
```{bash}
#| eval: false
sh -c 'dx run app-swiss-army-knife -iin="%" \
-icmd="samtools view -c \${in_name} > \${in_prefix}-counts.txt" \
--tag samjob --destination results/'
```
We're using `sh -c` to run a script as a *subshell* to execute the `dx run` statement.
Note that we're specifying the helper variables here with a `\`:
`\${in_name}`
This escaping (`\$`) of the dollar sign prevents the variable expansion from happening in the top-level shell - the helper variable names need to be passed into the subshell, which in turn passes them on to the worker. Figuring this out took time and made my brain hurt.
This escaping is only necessary because we're using `xargs` and passing our `-icmd` input into the worker. For the most part, you won't need to escape the `$`. This is also a reason to write shell scripts that run on the worker.
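If you want to see the mechanics in isolation, here is a tiny demo you can run locally:

```{bash}
#| eval: false
# The subshell expands an unescaped variable...
sh -c 'echo "no escape: ${USER}"'   # prints your username
# ...but an escaped one survives as the literal string ${USER},
# ready to be handed to yet another shell (the worker, in our case)
sh -c 'echo "escaped: \${USER}"'    # prints: escaped: ${USER}
```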
When we run the `xargs` pipeline above, we get the following screen output for each spawned job:
```
Using input JSON:
{
"cmd": "samtools view -c $in_name > $in_prefix-counts.txt",
"in": [
{
"$dnanexus_link": {
"project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
"id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
}
}
]
}
Calling app-GFxJgVj9Q0qQFykQ8X27768Y with output destination
project-GGyyqvj0yp6B82ZZ9y23Zf6q:/results
Job ID: job-GJ2xVZ80yp62X5Z51qp191Y8
[more job info]
```
If we do a `dx find jobs`, we'll see our jobs listed. Hopefully they are running:
```
dx find jobs --tag samjob
* Swiss Army Knife (swiss-army-knife:main) (running) job-GJ2xVf00yp62kx9Z8VK10vpQ
tladeras 2022-10-11 13:57:59 (runtime 0:01:49)
* Swiss Army Knife (swiss-army-knife:main) (running) job-GJ2xVb80yp6KjQpxFJJBzv5k
tladeras 2022-10-11 13:57:57 (runtime 0:00:52)
* Swiss Army Knife (swiss-army-knife:main) (runnable) job-GJ2xVZj0yp6FFFXG11j6YJ9V
tladeras 2022-10-11 13:57:55 (runtime 0:01:15)
* Swiss Army Knife (swiss-army-knife:main) (runnable) job-GJ2xVZ80yp62X5Z51qp191Y8
tladeras 2022-10-11 13:57:53 (runtime 0:00:56)
```
### When batching, tag your jobs
It is critical that you tag your jobs in your `dx run` code with the `--tag` argument.
Why? You will, at some point, start up a bunch of batch jobs that have some settings or parameters set wrong. That's when you need the tag.
```{bash}
#| eval: false
dx find jobs --tag "samjob"
```
### Using tags to `dx terminate` jobs {#sec-terminate}
`dx terminate <jobid>` will terminate a running job with that job id. It doesn't take a tag as input.
But again, `xargs` to the rescue. We can find our job ids with the tag `samjob` using `dx find jobs` and then pipe the `--brief` output into `xargs` to terminate each job id.
```{bash}
#| eval: false
dx find jobs --tag samjob --brief | xargs -I% sh -c "dx terminate %"
```
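One habit worth adopting: do a dry run first by putting `echo` in front of the command, so you can inspect what `xargs` will actually execute before anything gets terminated:

```{bash}
#| eval: false
# Dry run: print the dx terminate commands instead of running them
dx find jobs --tag samjob --brief | xargs -I% echo dx terminate %
```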
## Submitting Multiple Files to a Single Worker {#sec-mult-worker}
We talked about another method to batch process files on a worker (@fig-batch2). We can submit an array of files to a worker, and then process them one at a time on the worker.
The key is that we're running `xargs` on the worker, not on our own machine, to process each file.
```{bash}
#| eval: false
#| filename: batch-processing/batch-on-worker.sh
cmd_to_run="ls *.vcf.gz | xargs -I% sh -c 'bcftools stats % > \$(basename %).stats.txt'"
dx run swiss-army-knife \
-iin="data/chr1.vcf.gz" \
-iin="data/chr2.vcf.gz" \
-iin="data/chr3.vcf.gz" \
-icmd="${cmd_to_run}"
```
In the variable `$cmd_to_run`, we're putting a command that we'll run on the worker. That command is:
```{bash}
#| eval: false
ls *.vcf.gz | xargs -I% sh -c 'bcftools stats % > \$(basename %).stats.txt'
```
We submitted an array of files in our `dx run` statement, so they are transferred into our working directory on the worker. That means we can list the files using `ls *.vcf.gz` and pipe that list into `xargs`.
Note that we lose the ability to use the helper variables when we process a list of files on the worker. So here we have to use `\$(basename %)`: the `$( )` performs command substitution in the subshell, and we escape the `$` so that the substitution happens on the worker rather than on our machine.
Again, this is possible, but it may be easier to have a separate script that contains our commands, transfer that as an input to Swiss Army Knife, and run that script by specifying `bash myscript.sh` in our command.
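Here is a minimal sketch of that approach. The `scripts/run_stats.sh` name and location are hypothetical; the script would contain the `ls`/`xargs`/`bcftools` pipeline from above:

```{bash}
#| eval: false
# run_stats.sh is a hypothetical script in our project containing:
#   ls *.vcf.gz | xargs -I% sh -c 'bcftools stats % > $(basename %).stats.txt'
# Swiss Army Knife downloads all -iin files into the working directory,
# so the script lands next to the VCFs and can simply be run with bash
dx run app-swiss-army-knife \
-iin="data/chr1.vcf.gz" \
-iin="data/chr2.vcf.gz" \
-iin="data/chr3.vcf.gz" \
-iin="scripts/run_stats.sh" \
-icmd="bash run_stats.sh"
```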
## Batching multiple inputs: `dx generate_batch_inputs`
What if you have multiple inputs that you need to batch over? This is where [`dx generate_batch_inputs`](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs) comes in.
For each input of an app, we can specify a pattern to match using regular expressions.
```{bash}
#| eval: false
dx generate_batch_inputs \
--path "data/"\
-iin="(.*)\.bam$"
```
Here we're specifying a single input `in`, and we've supplied a wildcard search. It's going to look in `data/` for this particular pattern (we're looking for bam files).
If we do this, we'll get the following response:
```
Found 4 valid batch IDs matching desired pattern.
Created batch file dx_batch.0000.tsv
```
So, a single `.tsv` file was generated by `dx generate_batch_inputs` on our machine.
If we had many more input files, say 3000, it would generate 3 `.tsv` files, each listing up to 1000 files, one per line. We can run the jobs described in a `.tsv` file with:
```{bash}
#| eval: false
dx run swiss-army-knife --batch-tsv dx_batch.0000.tsv \
-icmd='samtools stats ${in_name} > ${in_prefix}.stats.txt' \
--destination "/Results/" \
--detach --allow-ssh \
--tag bigjob
```
This will generate 4 jobs from the `dx_batch.0000.tsv` file, one per row, to process the individual files. Each `.tsv` file will generate up to 1000 jobs.
### Drawbacks to `dx generate_batch_inputs`/`dx run --batch-tsv`
The largest drawback to using `dx generate_batch_inputs` is that each column must correspond to an individual input name - you can't submit an array of files to a job this way.
### For More Information
The Batch Jobs documentation page has some good code examples for `dx generate_batch_inputs` here: <https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs/>
## Programmatically Submitting Arrays of Files for a Job
You can also use Python to build `dx run` statements, which is especially helpful when you want to submit arrays of 100+ files to a worker.
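As a minimal sketch of the idea (not the guide's exact method), the standard library's `subprocess` can shell out to the same `dx` commands we've been using; the chunk size and tag here are arbitrary choices:

```{python}
#| eval: false
# Sketch: gather file IDs with dx find data, chunk them, and launch
# one Swiss Army Knife job per chunk of 100 files.
import subprocess

CHUNK_SIZE = 100  # arbitrary; tune to your app and instance type

# Same query we used with xargs: one qualified file ID per line
result = subprocess.run(
    ["dx", "find", "data", "--name", "*.vcf.gz", "--brief"],
    capture_output=True, text=True, check=True,
)
file_ids = result.stdout.split()

# Build and submit one dx run statement per chunk
for i in range(0, len(file_ids), CHUNK_SIZE):
    cmd = ["dx", "run", "app-swiss-army-knife", "-y", "--tag", "pyjob"]
    for file_id in file_ids[i:i + CHUNK_SIZE]:
        cmd.append(f"-iin={file_id}")
    # No local shell is involved here, so $( ) needs no escaping:
    # the string is passed verbatim and expands on the worker
    cmd.append(
        "-icmd=ls *.vcf.gz | xargs -I% sh -c "
        "'bcftools stats % > $(basename %).stats.txt'"
    )
    subprocess.run(cmd, check=True)
```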
See <https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/guide-to-analyzing-large-sample-sets> for more info.
## What you learned in this chapter
This was a big chapter, building on everything you've learned in the previous chapters.
We put together the output of `dx find data --brief` (@sec-dx-find) with a pipe (`|`) and used `xargs` (@sec-xargs2) to spawn a job for each file.
Another way to process files is to submit an array of them to a single worker and process them there (@sec-mult-worker).
We also learned of alternative approaches using `dx generate_batch_inputs`/`dx run --batch-tsv` and using Python to build the `dx run` statements.