# Data Package Design for Special Cases
Members of the working group developing these documents: S. Beaulieu, R. Brown, J. Downing, S. Elmendorf, H. Garritt, G. Gastil-Buhl, C. Gries, J. Hollingsworth, H.-Y. Hsieh, L. Kui, M. Martin, G. Maurer, A. Nguyen, J. Porter, A. Sapp, M. Servilla, T. Whiteaker
In these documents we consider special cases for archiving research data based on their data type, format, or acquisition method, and recommend practices that ensure optimal re-usability of the data. Most recommendations here are aimed at improving documentation of data acquisition and processing to avoid misinterpretation. This includes the recommendation to publish raw data and/or processing code along with the data products. Others are aimed at usability in terms of data size/volume, or at connecting related data. Some recommendations involve including a metadata document formatted according to a new and emerging standard (e.g., CodeMeta) or a data inventory table. Data inventory tables can cross the line between metadata and data and are intended to improve discoverability and navigation of archived data.
The intended audience for these best practice recommendations is the ecological research information manager (IM) community, and they are applicable to anyone operating in the context of an ecological research program. We assume that the target data repository is designed to handle ecological data, and that a given archive package will include metadata encoded in a community standard. This document references elements of the [EML metadata standard](https://eml.ecoinformatics.org/), but many aspects would similarly apply to other metadata standards and these documents should be considered in the larger context of applicable metadata standard best practices. We refer to the Environmental Data Initiative ([EDI](https://portal.edirepository.org)) as an example data repository, though the same practices could be applied to other similar repositories.
Throughout the chapters we use the term _data package_ to refer to a published unit of data and metadata together, which is the convention at the EDI repository. Other data repositories may use equivalent terms for a data package, such as _dataset_. A data package may contain one or more _entities_, such as csv tables, spatial data, processing or modeling code, and other documents (pdf, jpg, zip). A basic discussion of data package design can be found in [EDI's data package design documentation](https://edirepository.org/resources/designing-a-data-package) and in the [LTER Best Practices for Dataset Metadata in Ecological Metadata Language (EML)](https://environmentaldatainitiative.files.wordpress.com/2017/11/emlbestpractices-v3.pdf).
Generally, we recommend archiving entities using standard file formats that are likely to be machine readable in the future. Exceptions may exist where the community standard for processing particular data types relies on specialized file formats (binary, closed specification, etc.) or proprietary software. In these cases, it may be appropriate to archive the specialized file types and/or a copy that has been parsed into a format (e.g., ASCII) that does not require proprietary software.
**Table of contents**
* [Processing code](./code.html)
* [Modeling datasets](./model-based-datasets.html)
* [Images and documents](./images-and-documents-as-data.html)
* [Spatial data](./spatial-data.html)
* [Data gathered with small, moving platforms](./data-gathered-with-small-moving-platforms.html)
* [Provenance and data in other repositories](./data-in-other-repositories.html)
* [Very large datasets](./large-data-sets.html)
## Code
Contributors: An T. Nguyen, Tim Whiteaker
### Introduction
This document describes best practices for archiving software, code, or scripts, such as a simulation model, data visualization package, or data manipulation scripts. The intention of these recommendations is to make research based on modeling or software more transparent rather than to achieve exact reproducibility; that is, to provide sufficient documentation so that a knowledgeable person can understand the algorithms, programming decisions, and their ramifications for the results, rather than run the model and obtain the same results.
Examples of candidate archives for code include [CoMSES Net](https://www.comses.net/), which focuses on sharing models related to social and ecological sciences, and [Zenodo](https://zenodo.org/), a popular DOI-minting, all-purpose repository that can conveniently archive a specific version of [code in a GitHub repository](https://guides.github.com/activities/citable-code/). Alternatively, code may be archived in the EDI repository, either by itself or as part of a data package. The best practices in this document cover both archiving code in EDI and referencing code archived elsewhere.
While metadata for software may be described in detail using the EML **software** tree, the [CodeMeta](https://codemeta.github.io/index.html) project is specifically designed for software metadata. Therefore, one of the key recommendations in this document is to include a CodeMeta file when archiving software or code in EDI.
### Recommendations for data packages
#### Considerations for archiving software or code
* If it is a model and/or a model-based dataset, please see the [best practices for archiving model-based datasets](model-based-datasets.html).
* How likely is it that the code will be well maintained into the future? For example, code packages submitted to established code repositories may stay there only while they comply with all testing requirements and may be removed if not well maintained (e.g., the R package repository CRAN). If that commitment to code maintenance is unlikely, such a package should be archived in a repository without maintenance requirements.
* Should the code be archived as a separate package or with the data?
    * If the code is used to generate several independent datasets, it should be archived as a separate package.
    * A wish by the software authors to place the code under a different license from that of the associated data, or to obtain a DOI for the code alone, may also be a reason to separate code and data packages.
    * If packaging code separately, it may be archived in EDI or another repository. If archiving code outside of EDI, see the section on linking code and data below for how to reference that code from related data packages in EDI.
    * In most other cases, it is recommended to archive code and data together for context.
* Large community software packages are usually maintained and available elsewhere. However, they may undergo significant updates, and it may make sense to archive the code of a certain version with the data for transparency. Consider whether prior versions of a software package are available wherever that software is distributed.
* When choosing a repository for the code, consider the ease of the archiving process and how well the code can be described. For example, Zenodo offers an easy pathway to archive code that is currently in GitHub, though its metadata requirements are very light. Following the best practices described herein, you would create a CodeMeta file if you were going to archive with EDI. This is more rigorous than Zenodo, but your code is then better described, and in a machine-readable way.
#### Documenting software/code
When describing the code with EML, include the code as an otherEntity in a data package. Although a well-documented, human-readable text format of the code is preferred, a zip archive may be used when there are multiple scripts and/or the directory structure is important. For the formatName and entityType elements in EML, we recommend using format names from the [DataONE format list](https://cn.dataone.org/cn/v2/formats) when possible. Some format names are included in the examples below. Always check the list for the most up-to-date version of these names.
Example 1: EML otherEntity snippet for a script file.
```xml
<otherEntity>
  <entityName>R script to process CTD data</entityName>
  <entityDescription>Annotated RMarkdown script to process, calibrate, and flag raw CTD data.</entityDescription>
  <physical>
    <objectName>BLE_LTER_CTD_QAQC.Rmd</objectName>
    <size unit="byte">9674</size>
    <authentication method="MD5">8547b7a63fcf6c1f0913a5bd7549d9d1</authentication>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>R Markdown file</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>script</entityType>
</otherEntity>
```
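The byte size and MD5 checksum in the physical element need not be typed by hand. A minimal R sketch, using the script file name from the example above:

```r
# Compute values for the EML <physical> element of a code entity
path <- "BLE_LTER_CTD_QAQC.Rmd"
file.size(path)              # value for <size unit="byte">
unname(tools::md5sum(path))  # value for <authentication method="MD5">
```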
##### Software License
It is important to include a use license to make it clear how others can use your work. We recommend the [Creative Commons "no copyright reserved" (CC0)](https://creativecommons.org/share-your-work/public-domain/cc0/) license, which places the software in the public domain and makes it easiest for end users to adapt and use your work. If a more restrictive license is required, we recommend the [Apache License, Version 2.0](https://directory.fsf.org/wiki/License:Apache-2.0) license, a permissive license that allows others to reuse, modify, and redistribute your software.
If a mix of data and code needs to be archived, and they each fall under different licenses, then separating them into different packages is advisable to eliminate ambiguity on which license applies to which portion of a data package. When a license other than a public domain dedication is used, then in addition to specifying the license in the metadata (see the "intellectualRights" element in EML), consider including a copy of the license at the beginning of the code files themselves so that the license is readily apparent to end users who peruse the code.
##### CodeMeta
Include a CodeMeta JSON file for all code that is archived in EDI. The CodeMeta file should be named "codemeta.json" and listed as an EML otherEntity. The formatName should be "JavaScript Object Notation (JSON) file", the entityType should be "metadata", and the entityDescription should indicate that this is a CodeMeta file for a given software or script in the data package.
For unnamed projects, e.g., one-off scripts for data processing, analysis, and/or visualization, a CodeMeta file might appear to be overkill; however, CodeMeta files are simple to generate, and we recommend the bare minimum shown below. If there are multiple scripts, each in its own otherEntity tag, we recommend aggregating information about them into one codemeta.json.
Example 2: Minimum recommended codemeta.json example for unnamed projects.
```json
{
  "@context": [
    "https://doi.org/10.5063/schema/codemeta-2.0",
    "http://schema.org"
  ],
  "@type": "SoftwareSourceCode",
  "description": "RMarkdown script to calibrate and flag raw CTD data.",
  "author": {
    "@type": "Person",
    "givenName": "Christina",
    "familyName": "Bonsell",
    "email": "cbonsell@utexas.edu",
    "@id": "https://orcid.org/0000-0002-8564-0618"
  },
  "keywords": ["calibration", "CTD", "RMarkdown"],
  "license": "https://unlicense.org/",
  "dateCreated": "2013-10-19",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "version": "3.6.2",
    "url": "https://r-project.org"
  }
}
```
Example 3: Sample otherEntity metadata for Example 2's codemeta.json.
```xml
<otherEntity>
  <entityName>CodeMeta file for BLE_LTER_CTD_QAQC.Rmd</entityName>
  <entityDescription>CodeMeta file for annotated RMarkdown script to process, calibrate, and flag raw CTD data.</entityDescription>
  <physical>
    <objectName>codemeta.json</objectName>
    <size unit="byte">702</size>
    <authentication method="MD5">8547b7a63abc6c1f0913a5bd7549d9d1</authentication>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>JavaScript Object Notation (JSON) file</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>metadata</entityType>
</otherEntity>
```
For named projects, also include the software name, and the version if applicable. The example below shows some additional metadata you can include. See also the more complete [codemetar example](https://docs.ropensci.org/codemetar/articles/codemeta-intro.html) and the available [CodeMeta terms](https://codemeta.github.io/terms/).
Example 4: A more complete CodeMeta example for named projects, taken from the CodeMeta project GitHub repository and edited for brevity.
```json
{
  "@context": [
    "https://doi.org/10.5063/schema/codemeta-2.0",
    "http://schema.org"
  ],
  "@type": "SoftwareSourceCode",
  "name": "codemetar: Generate 'CodeMeta' Metadata for R Packages",
  "description": "A JSON-LD format for software metadata",
  "author": [
    {
      "@type": "Person",
      "givenName": "Carl",
      "familyName": "Boettiger",
      "email": "cboettig@gmail.com",
      "@id": "https://orcid.org/0000-0002-1642-628X"
    },
    {
      "@type": "Person",
      "givenName": "Maëlle",
      "familyName": "Salmon",
      "@id": "https://orcid.org/0000-0002-2815-0399"
    }
  ],
  "codeRepository": "https://github.com/ropensci/codemetar",
  "dateCreated": "2013-10-19",
  "license": "https://spdx.org/licenses/GPL-3.0",
  "version": "0.1.8",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "version": "3.5.3",
    "url": "https://r-project.org"
  },
  "softwareRequirements": [
    {
      "@type": "SoftwareApplication",
      "identifier": "R",
      "name": "R",
      "version": ">= 3.0.0"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "git2r",
      "name": "git2r",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      }
    }
  ],
  "keywords": ["metadata", "codemeta", "ropensci"]
}
```
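For R packages, the codemetar package referenced above can generate a codemeta.json like this one automatically; a minimal sketch, assuming the working directory is the package root:

```r
# Generate codemeta.json from an R package's DESCRIPTION file
# install.packages("codemetar")
library(codemetar)
write_codemeta()  # writes codemeta.json to the package root
```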
##### Metadata to enable reproducibility
When archiving software, we strongly recommend including a user guide with installation and usage instructions if these would not already be apparent to the typical user. Take into account that the user might not have access to certain inputs that the software/scripts require. When feasible, include at least some example data, and configure the script so that it is ready to run with the example data.
Aside from the software/code itself and its dependencies, other pieces of information may be important should a user wish to reproduce results, such as the operating system and version, and the system locale. Include this information in the data package's methods/methodStep/description. For certain tools there are easy ways to generate this information, e.g., a call to sessionInfo() in the R console. If the system outputs this information in a standard plain-text format, the file might be included as an otherEntity.
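As one illustration in R, the session details can be captured to a plain text file in a single step:

```r
# Write the R version, operating system, and attached package
# versions to a text file that can be archived as an otherEntity
writeLines(capture.output(sessionInfo()), "session_info.txt")
```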
#### Linking code and data
There are a few solutions for providing explicit machine-readable linkages between different entities/packages (the distinction between code/data doesn’t matter too much here). For most cases we recommend the simplest approach, which is to use the methods/methodStep/description element of EML. More advanced users may wish to utilize the other solutions described herein.
##### Descriptive approach
In the dataset methods/methodStep/description element, include verbal descriptions such as "results.csv was derived from raw_data.csv using script.R" and repeat for all entities. If code and data reside in different packages, be sure to specify that.
##### The EML dataSource element
Nested under methods/methodStep, dataSource elements describe other data packages that serve as sources for the current package. A dataSource looks like a mini EML tree describing the source data. For example, [ecocomDP packages](http://portal.edirepository.org/nis/simpleSearch?defType=edismax&q=ecocomDP&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false) list the original packages under dataSource. Note that dataSource does not describe relationships between entities in the same package, and as far as we know there is no explicit way in EML to do so.
##### ProvONE
[ProvONE](http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html) is a model developed by DataONE affiliates for provenance or denoting relationships between data entities. Each package on DataONE is described by a science metadata document (e.g., EML, ISO, FGDC) and a resource map document following ProvONE. The resource map powers a nice display of data relationships (see [this package on the Arctic Data Center](https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2556Q)). This handles both relationships between entities in the same package and entities residing in different packages. However, note that EDI currently does not utilize this model.
#### External software
Large community-backed tools or proprietary software such as ArcGIS Pro or Microsoft Excel do not need to be archived. However, if they have had any impact on the final data (e.g., ArcGIS Pro was used to modify spatial rasters), the EML methods section should describe the routines performed. Within the data package, indicate linkage to external software as follows.
* Briefly describe the software/code and its relationship to the data in EML’s methods/methodStep/description element.
* Names of all software used. Include both the common acronym and the full spelling.
* The URL(s) to all models/software used. Stable, persistent URLs pointing to exact version(s) are preferable, rather than generic links such as a project homepage. If the archived model has a DOI, then include a full citation to the model in the methods/methodStep/description text. The exception to this is when referencing tools such as Excel that have achieved global household name status.
* Broadly, the system setup used, if relevant.
* Information on exact versions of all code used (including dependencies). This is important; e.g., ArcGIS Pro 2.4.1 is very different from ArcGIS for Desktop 10.7.1. Different systems have methods to easily generate this information, e.g., a call to sessionInfo() in the R console.
* If applicable, consider archiving the "runfile", i.e., the script(s) that set parameters and/or call functions imported from external software, as its own data entity within the data package.
Example 5: EML method description referring to external software.
```xml
<methods>
  <methodStep>
    <description>
      <para>
        The seagrass coverage raster was created in ArcGIS Pro (version
        2.4.3, by Esri) using the IDW geoprocessing tool on
        sampling_points.csv with a power of 2 and the nearest
        12 points.
      </para>
      <para>
        The raster was then refined using the seagrass-refiner package
        with the auto-refine option checked (Smith, 2017).
      </para>
      <para>
        Smith, J. (2017). seagrass-refiner: a package that does the cool
        seagrass stuff, Version 1.2, Zenodo. https://doi.org/this-is/a-fake-doi,
        2017.
      </para>
    </description>
  </methodStep>
</methods>
```
### Resources
[CodeMeta](https://codemeta.github.io/) website
[CodeMeta generator](https://codemeta.github.io/codemeta-generator/) for creating CodeMeta
[CodeMeta crosswalks](https://codemeta.github.io/crosswalk/) for a number of popular software
[CodeMeta terms](https://codemeta.github.io/terms/) you can use for describing software
A description of some [software licenses](https://opensource.org/licenses)
[Best practices document to archiving model-based datasets](model-based-datasets.html)
[ProvONE documentation](http://jenkins-1.dataone.org/jenkins/view/Documentation%20Projects/job/ProvONE-Documentation-trunk/ws/provenance/ProvONE/v1/provone.html)
[W3C PROV-O documentation](https://www.w3.org/TR/prov-o/)
[Licensing software as part of an EDI data package](https://docs.google.com/document/d/1JeznivTDubi0ZX_lsO50eCUl-8zxSiz_xq5SsBRwbuw/edit)
## Model-Based Datasets
Contributors: An T. Nguyen, Tim Whiteaker, Corinna Gries
### Introduction
This document includes recommendations for archiving data packages composed of model-based datasets. These datasets may include the model code itself, input data, model parameter settings, and output data.
The range of cases for model-based datasets includes small one-off model code specific to one research question, through various code packages which are maintained in community repositories as long as they meet requirements (e.g., CRAN for R packages), to large community models maintained by groups of programmers and users.
The intention of these recommendations is to make research based on modeling more transparent rather than to achieve exact reproducibility; that is, to provide sufficient documentation so that a knowledgeable person can understand the algorithms, programming decisions, and their ramifications for the results, rather than run the model and obtain the same results.
It is not always easy to determine who among project personnel (IMs, scientists, programmers) is responsible for the different components of a model-based dataset. This is best decided on a case-by-case basis. A common division is that the code authors annotate the code while the IM handles the archiving and linkage to the data product(s), though this division may not hold for large community models.
### Recommendations for data packages
![Figure 1: Flowchart for considering archival paths for various model components.](../images/model_datasets.png "model data management flowchart")
#### Referencing models in EML
For data packages related to a model, whether the model is archived within the same data package or not, indicate linkage to the model in EML following the [best practices for archiving code](code.html) (see the section on linking code and data).
Example 1: EML snippet relating data to models via the method description:
```xml
<methodStep>
  <description>
    <para>This methodStep contains data provenance information as specified in the LTER EML Best Practices. Each dataSource element here lists entity-specific information and links to source data used in the creation of this derivative data package.</para>
  </description>
  <dataSource>
    <title>Source dataset title</title>
    <creator>
      <individualName>
        <givenName>first name</givenName>
        <surName>last name</surName>
      </individualName>
      <organizationName>organization name</organizationName>
      <electronicMailAddress>email@some.edu</electronicMailAddress>
    </creator>
    <distribution>
      <online>
        <onlineDescription>This is a link to an external online data resource (describe resource and repository location).</onlineDescription>
        <url function="information">https://pasta.lternet.edu/package/metadata/eml/knb-lter-ntl/80/2</url>
      </online>
    </distribution>
    <contact>
      <positionName>Information Manager</positionName>
      <organizationName>organization name</organizationName>
      <electronicMailAddress>infomgr@some.edu</electronicMailAddress>
    </contact>
  </dataSource>
</methodStep>
```
##### Model code
The model used to produce certain data needs to be well documented and linked from the resulting data product(s). However, it is not always easy to decide where and how to archive the code, and whether or not to archive it in conjunction with the data product(s). The sections below outline three common code archiving options.
Note that these scenarios (model code archived with data, or standalone in EDI, or elsewhere) are not mutually exclusive. Any project that involves code might make use of both established and custom software hosted on many different platforms, and might use some or all archiving options.
To decide between archiving options, consider the questions listed in [best practices for publishing code](code.html).
##### Model code and data in the same package
The goal of this practice is to ensure transparency of the data, and it applies to one-off models developed for the associated data, or occasionally to larger code bases for the reasons outlined in [best practices for archiving code](code.html). Include the code as a dataset/otherEntity. Additionally, it is recommended to include a CodeMeta file, which can also be handled and documented in EML as dataset/otherEntity. CodeMeta is a metadata standard for software and code compatible with schema.org. Refer to [best practices for archiving code](code.html) for how to document the code and create CodeMeta.
##### Model code as standalone package
If the model has been used to generate several datasets, i.e., is more widely applicable, it can be archived as its own package in EDI and assigned a DOI. Include the code as a dataset/otherEntity. Additionally, it is recommended to include a CodeMeta JSON-LD file, which can also be handled and documented in EML as dataset/otherEntity. CodeMeta is a metadata standard for software and code compatible with schema.org. Refer to [best practices for archiving code](code.html) for how to document the code and create CodeMeta.
##### Model code archived/maintained elsewhere
This might include complex community models/software maintained by many people, published and actively maintained R/Python packages, etc., or simply code archived in another repository such as [CoMSES Net](https://www.comses.net/). It may sometimes be advisable to archive a copy of the model code with the data, even if it appears to be maintained elsewhere. See recommendations above for referencing models in EML.
#### Model input and output data
These are considered data entities, which should be handled according to EML best practices for the corresponding data types. However, if the resulting datasets are very large, consider whether input/output from all individual model runs needs to be archived. Are there specific model run results that are more useful for non-modelers? For example, results from model runs leading to a journal publication.
Very large model inputs/outputs may need to be archived offline. Refer to [best practices for offline data](large-data-sets.html).
If the model requires a specific folder structure, you can zip model input files within the package to preserve that folder structure. A disadvantage of this approach is that you cannot elegantly describe each file with EML.
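For example, in R, the zip() utility can bundle a directory while preserving its internal structure; a minimal sketch, with a hypothetical directory name:

```r
# Zip a model input directory, preserving its folder structure;
# utils::zip() calls the system zip tool with recursive flags by default
zip(zipfile = "model_inputs.zip", files = "model_inputs")
```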
The [EarthCube Research Coordination Network, "What About Model Data?" group](https://modeldatarcn.github.io/#project-description) is working on a rubric to help you determine how much model output data to save, based on assorted criteria on reproducibility/value of the data. Learn more about that group and their rubric on their [Model Data RCN website](https://modeldatarcn.github.io/).
Researchers in the Department of Energy's Environmental Systems Science community are also working on assessing model archiving needs. In [this preprint](https://eartharxiv.org/repository/view/260/), Simmonds et al. 2020 discuss feedback from communications with modelers and propose preliminary solutions. With regard to input/output data, their feedback indicates two opposite opinions: some feel the whole gamut of raw to aggregated outputs needs to be archived, while others advocate for archiving only high-level outputs corresponding to publication figures. They also found that spin-up simulations were not considered a high priority for archiving. See their section 2.3, 'What is worth archiving and for how long does it remain useful?'
#### Model parameters
Include model parameters whenever applicable. If code/input/output from multiple model runs are archived, make sure to archive all corresponding sets of parameters, and be explicit in linking the different components together.
Consider archiving model parameter files as their own data object(s) in both their native format and as a text (non-binary) version. If the 'runfile' will be archived, consider including the parameters within that file with appropriate annotations.
### Example data packages in EDI
<table>
<tr>
<td><strong>Dataset Title</strong>
</td>
<td><strong>Description</strong>
</td>
<td><strong>EDI Package ID</strong>
</td>
</tr>
<tr>
<td><em>North Temperate Lakes LTER General Lake Model Parameter Set for Lake Mendota, Summer 2016 Calibration</em>
</td>
<td>Parameters for specific GLM runs. GLM is a large community model, not managed and archived in EDI
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-ntl.348.2">knb-lter-ntl.348.2</a>
</td>
</tr>
<tr>
<td><em>SBC LTER: Regional Oceanic Modeling System (ROMS) Setup Files, Code, and Lagrangian Model Setup Files</em>
</td>
<td>All the necessary code, grid, forcing, initial, and boundary condition files for running the UCLA version of the Regional Oceanic Modeling System (ROMS) for the Santa Barbara Channel
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=126">knb-lter-sbc.126.1</a>
</td>
</tr>
<tr>
<td><em>Lake thermal structure drives inter-annual variability in summer anoxia dynamics in a eutrophic lake over 37 years</em>
</td>
<td>Dataset to run a 37-year simulation (1979-2015) of the Lake Mendota lake ecosystem using the vertical 1D GLM-AED2 model.
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-ntl&identifier=396">knb-lter-ntl.396.1</a>
</td>
</tr>
</table>
### Resources
Janssen, Marco A., Lilian Na'ia Alessa, Michael Barton, Sean Bergin and Allen Lee (2008). ‘Towards a Community Framework for Agent-Based Modelling’. Journal of Artificial Societies and Social Simulation 11(2)6 [http://jasss.soc.surrey.ac.uk/11/2/6.html](http://jasss.soc.surrey.ac.uk/11/2/6.html).
Simmonds, Maegen, William J. Riley, Shreyas Cholia, and Charuleka Varadharajan (2020). 'Addressing Model Data Archiving Needs for the Department of Energy’s Environmental Systems Science Community'. EarthArXiv (preprint). [https://doi.org/10.31223/osf.io/acdk4](https://doi.org/10.31223/osf.io/acdk4).
See section 2.3, 'What is worth archiving and for how long does it remain useful?', discussed above, plus section 2.4, 'Model data archiving protocol', where the authors argue for a more standardized reporting format for model data, e.g., top-level metadata and directory structure at a minimum. Section 4.1, 'Developing Model Data Archiving Guidelines', proposes an organization scheme for model data.
## Images and Documents as Data
Contributors: Renee F. Brown (lead), Stace Beaulieu, Sarah Elmendorf, Gastil Gastil-Buhl, Corinna Gries, Li Kui, Mary Martin, Greg Maurer, John Porter, Tim Whiteaker
### Introduction
This chapter describes best practices for archiving images and other documents as data. The [Information Artifact Ontology (IAO)](http://purl.obolibrary.org/obo/IAO_0000310) defines a document as '_a collection of information content entities intended to be understood together as a whole_.' Common examples include still images, audio and/or video multimedia files, field notebooks, written interview notes or transcribed oral accounts, historical document collections, and 'paper' (non-digitized) maps. For images that are already handled by specialized repositories (e.g., phenocam images, specimen images), refer to [Data in Other Repositories](data-in-other-repositories.html); for additional information on how to handle images from uncrewed (underwater or aerial) vehicles, refer to [Data Gathered with Small Moving Platforms](data-gathered-with-small-moving-platforms.html); and for geospatial imagery, refer to [Spatial Data](spatial-data.html).
### Recommendations for data packages
#### Reasons to archive documents as data
* **Enhance the credibility of associated datasets.** Many document types (field notes, still images, etc.) often provide additional metadata that cannot easily be encapsulated in the associated dataset(s) or were not considered important at the time of transcription. As such, these documents may provide opportunities to rectify transcription errors, retrospectively provide explanations of unusual data, and/or include additional observational or measured data, such as opportunistic measurements or calibration parameters.
* **Provide opportunities for new analyses.** New analytical methods may be employed on archived documents (especially still images) or documents that were never archived previously because the cost-to-benefit ratio was considered too high (e.g., pilot projects).
* **Improve ease of access.** In distributed projects, access to original and/or 'hard-copy' documents may be limited to a particular institution or subset of people. By digitally archiving these documents in a data repository, the data become more findable, accessible, interoperable, and reusable (FAIR).
#### Considerations for data package structure
* **Balance file size and number of files.** A data package may contain document files individually or bundled as a compressed archive (e.g., zip). The decision of how to bundle documents into compressed archives and then into data packages should be guided by the overall goal of making data usable for the intended purpose of the documents. In most cases, this would involve finding specific documents by, for example, the date or location of the acquisition, or some other aspect of interest. In addition, the effort of documenting documents (each individually vs. in groups) has to be taken into account. Also see [Large Data Sets](large-data-sets.html).
* **Document grouping.** Data packages, or compressed archives within data packages, may be grouped spatially (e.g., by location) and/or temporally (e.g., by date, season, or year). For example, data outputs from a stationary camera may be archived in annual data packages, each containing monthly compressed archives if the number of images is large. While moving camera outputs may also be archived annually, these data packages may instead include compressed archives containing all still images for a single location.
* **Document naming.** To maximize searchability, document names should be unique and meaningful for a data reuser. It is recommended that individual documents be named according to their content, and compressed archives include date, location, and other relevant information in the filename.
* **Data inventory table.** An inventory table describing the structure and organization of the included document entities or groups of documents (see Table 1) is recommended, especially for larger collections of documents within a data package. The inventory table serves as an additional source of metadata and may also be used to link specific documents to additional information.
* **Archival frequency.** Because of the large volume of documents that would otherwise be handled repeatedly with each update, strive to archive a fully processed group of documents only when no more updates are expected (e.g., after a field season, or annually).
* **Linking to related data packages.** In the case where the documents are useful to understanding another data package and vice versa (e.g., met station visitation logs and met station time series data), it is recommended to link the complementary data package in the methods section of both datasets. Alternatively, include the document(s) or compressed archive(s) in the existing dataset as otherEntity, as described in the next section.
#### Documenting data packages
##### Ecological Metadata Language
All data packages require good discovery-level metadata in Ecological Metadata Language (EML), which should be assembled using standard documented best practices. Documents (including compressed archives) should be included as otherEntity in the data package (e.g., see Example 1). Refer to the most recent version of the EML Best Practices ([currently v3](https://environmentaldatainitiative.files.wordpress.com/2017/11/emlbestpractices-v3.pdf)) for guidance regarding the formatName and entityType EML elements. If a format for your document type is not covered, it is recommended to use the appropriate [MIME type](https://github.com/DataONEorg/object-formats), if available.
Example 1: EML otherEntity snippet for a pdf file
```xml
<otherEntity>
  <entityName>site date</entityName>
  <entityDescription>Field notes at site and date.</entityDescription>
  <physical>
    <objectName>site_date.pdf</objectName>
    <size unit="byte">9674</size>
    <authentication method="MD5">8547b7a63fcf6c1f0913a5bd7549d9d1</authentication>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>Portable Document Format</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>application/pdf</entityType>
</otherEntity>
```
The EML metadata should also include appropriate keywords describing the general purpose of the document or compressed archive (e.g., ice phenology, community composition, stream hydrology, etc.). For still images, for example, it is recommended to include the keyword 'image' with the semantic annotation from the Information Artifact Ontology (IAO):
**Term IRI:** [http://purl.obolibrary.org/obo/IAO_0000101](http://purl.obolibrary.org/obo/IAO_0000101)
**Definition:** An image is an affine projection to a two dimensional surface, of measurements of some quality of an entity or entities repeated at regular intervals across a spatial range, where the measurements are represented as color and luminosity on the projected surface.
Note that IAO includes at least one subcategory for image (e.g., [photograph](http://www.ontobee.org/ontology/IAO?iri=http://purl.obolibrary.org/obo/IAO_0000185)). It is recommended that the most specific applicable concept be used.
##### Data Inventory Table
We recommend that an additional level of metadata be provided through a data inventory table that effectively serves as a document catalog (see Table 1). The detail provided in this table should be guided by the same principles as stated above -- to enable optimal usability of the documents. For example, still images from a stationary camera require latitude and longitude only in the EML file, not for each individual image. However, images from a moving camera may need that information for every image, or at least for every location (e.g., site, quadrat, transect). Additionally, Exif metadata from photographic images may be programmatically extracted to supplement the inventory table (refer to the _Tips and Tricks_ section of [Data Gathered with Small Moving Platforms](data-gathered-with-small-moving-platforms.html)).
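As one way to extract Exif metadata programmatically, the R sketch below uses the exifr package (a wrapper around the ExifTool utility); the directory name and tag selection are hypothetical:

```r
# Build the seed of a data inventory table from image Exif metadata
library(exifr)

files <- list.files("images", pattern = "\\.jpg$", full.names = TRUE)
exif <- read_exif(files, tags = c("FileName", "DateTimeOriginal",
                                  "GPSLatitude", "GPSLongitude"))
write.csv(exif, "image_inventory.csv", row.names = FALSE)
```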
The data inventory table should be structured such that each column represents a particular attribute, described in EML as a [dataTable](https://eml.ecoinformatics.org/schema/eml-dataTable_xsd.html#eml-dataTable.xsd) entity, and each row represents an individual document or a compressed archive of a group of documents. At minimum, the table should include an attribute for the document/archive filename, as well as any other essential attributes that vary across documents/archives. Additional attributes may include information on the date and/or time; for this information to be useful, be consistent and use a controlled vocabulary for these fields so that a user can effectively search on them.
Table 1: Data inventory table structure.
<table>
<tr>
<td><strong>Column</strong>
</td>
<td><strong>Attribute Description</strong>
</td>
</tr>
<tr>
<td>Filename
</td>
<td>Filename of each document or compressed archive, including file extension (e.g., 'site_date.jpg'). For compressed archives, include the relative path of the document, with respect to the uncompressed directory structure (e.g., '2018/SITE3/quadrat4.jpg').
</td>
</tr>
<tr>
<td>Link/URL/URI
</td>
<td>Link to download a document if it is available on a different system (also see <a href="data-in-other-repositories.html">Data in Other Repositories</a>). Persistent identifiers are recommended, if available.
</td>
</tr>
<tr>
<td>Creator(s)
</td>
<td>Name(s) of the creator(s) of the original document (e.g., photographer, field technician, interviewer). Multiple creators should be entered into a single cell using the pipe delimiter.
</td>
</tr>
<tr>
<td>Datetime
</td>
<td>Date (and time) associated with the document, in <a href="https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations">ISO 8601 format</a> (e.g., 2007-04-05T12:30-02:00).
</td>
</tr>
<tr>
<td>Project specific datetime attributes
</td>
<td>One or more appropriately labeled columns containing project specific date and time information for easier search and retrieval of documents (e.g., year, season, campaign).
</td>
</tr>
<tr>
<td>Location
</td>
<td>One or more location columns as appropriate, such as latitude and longitude in decimal degrees, site name, transect name, altitude, depth, habitat, etc.
</td>
</tr>
<tr>
<td>Document specific attributes
</td>
<td>One or more columns as appropriate to the document type, such as weather conditions, organism name, instrument type, etc.
</td>
</tr>
</table>
### Example data packages in EDI
Each of the Environmental Data Initiative (EDI) data packages listed below includes images or other documents as data. Some of these packages contain data inventory tables (as dataTable entities) described in the EML metadata.
Table 2: Data packages in EDI providing examples of best practices from this document.
<table>
<tr>
<td><strong>Dataset Title</strong>
</td>
<td><strong>Description</strong>
</td>
<td><strong>EDI Package ID</strong>
</td>
</tr>
<tr>
<td><em>Annual ground-based photographs taken at 15 net primary production (NPP) study sites at Jornada Basin LTER, 1996-ongoing</em>
</td>
<td>Compressed archives of images grouped by year. Includes data inventory file.
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-jrn&identifier=210011005">knb-lter-jrn.210011005.105</a>
</td>
</tr>
<tr>
<td><em>McMurdo Dry Valleys LTER: Landscape Albedo in Taylor Valley, Antarctica from 2015 to 2019</em>
</td>
<td>Compressed archives of aerial images, grouped by flight date, and associated reflectance data.
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-mcm.2016.2">knb-lter-mcm.2016.2</a>
</td>
</tr>
<tr>
<td><em>MCR LTER: Coral Reef: Computer Vision: Multi-annotator Comparison of Coral Photo Quadrat Analysis</em>
</td>
<td>5090 coral reef survey images, and 251,988 random-point annotations by coral ecology experts.
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-mcr&identifier=5013">knb-lter-mcr.5013.3</a>
</td>
</tr>
<tr>
<td><em>Abundance and biovolume of taxonomically-resolved phytoplankton and microzooplankton imaged continuously underway with an Imaging FlowCytobot along the NES-LTER Transect in winter 2018</em>
</td>
<td>144,281 images from a plankton imaging system with annotations and extracted size data.
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-nes.9.1">knb-lter-nes.9.1</a>
</td>
</tr>
<tr>
<td><em>Calling activity of Birds in the White Mountain National Forest: Audio Recordings (2016 and 2018)</em>
</td>
<td>Compressed archive containing 410 audio files in wav format. Includes data inventory table.
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-hbr&identifier=268&revision=1">knb-lter-hbr.268.1</a>
</td>
</tr>
</table>
### Resources
#### Considerations for digitizing documents
Following are some general considerations and recommendations for digitizing paper or other 'hard-copy' documents for archival. This is not meant to be an exhaustive list. For further and more detailed information, please refer to the U.S. National Archives and Records Administration (NARA)’s _[Technical Guidelines for Digitizing Archival Materials for Electronic Access](https://www.archives.gov/files/preservation/technical/guidelines.pdf)_.
* **Effort.** The decision to digitize documents, as well as the digitization method, involves trade-offs in the accessibility and ease of using particular hardware and/or software technologies, the quality of the digitization, and the overall effort spent. Digitization efforts may be significant, for example, when dealing with a large number of documents requiring meaningful file names, text recognition, and/or high resolution for improved accessibility.
* **Equipment.** Instruments for digitizing hard-copy documents range from high resolution scanners (less accessible, less user-friendly, more expensive, better quality) to smartphone cameras (ubiquitous, easy-to-use, lower quality). For example, taking a smartphone image in the field may be utilized for quick and easy digitization of field notes.
* **Document resolution and file size.** This is an important consideration that should be guided by the content and purpose of the document. Detailed paper maps should probably be scanned at high resolution and large file size, while field sheets may not need as much detail.
* **Optical Character Recognition (OCR):** When digitizing documents that include text, we recommend using scanning or other software with OCR capabilities (e.g., Adobe, ABBYY, Tesseract; see the sketch at the end of this section) to convert the text into machine-readable characters so that the documents are searchable and thus more usable. OCR does not work well for handwritten text, older fonts, or documents with busy backgrounds (speckled, dirty, faded, etc.).
* **Sensitive Information and Human Subjects:** Regardless of the digitization method, one should be mindful of sensitive information that shouldn’t be archived or otherwise redacted (e.g., photographs of human subjects, field notebooks containing personal messages, gate combinations, and/or telephone numbers). In all cases in which human subjects are involved, Institutional Review Board (IRB) restrictions must be heeded. A signed IRB consent form for the associated research project represents a contract between researcher and human subject. It is important to note that IRB restrictions can differ among research studies within the same project. For further information, see the [EDI Data Initiative Data Policy](https://edirepository.org/about/edi-policy#sensitive-data).
While transcription is a digitization method that can be performed on certain types of documents (e.g., audio/video recordings, field notebooks) and can enhance search capabilities, transcript generation requires substantially more effort than other digitization methods, and is prone to error. Moreover, in the case where the original documents contain drawings, transcripts may be incomplete or otherwise inaccurate. _Thus, we recommend digitizing documents by other means, using the considerations described above._
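For the OCR step mentioned above, a minimal R sketch using the tesseract package (file names are hypothetical):

```r
# Convert a scanned document image to machine-readable text with
# the tesseract package, an R binding to the Tesseract OCR engine
library(tesseract)
text <- ocr("field_sheet_scan.png")
writeLines(text, "field_sheet_scan.txt")
```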
## Spatial Data
Contributors: Tim Whiteaker, John Porter, Mary Martin
### Introduction
This chapter contains recommendations on data package structure and metadata for spatial datasets. Over the timeline of the Long Term Ecological Research (LTER) Network's use of the Ecological Metadata Language (EML), both spatial data formats and data curation options have evolved. In this document, we focus on best practices that can be widely adopted, with the goal of enhancing data discoverability and usability, and with the understanding that there are multiple solutions to creating these data packages.
### Recommendations for data packages
#### Considerations for archiving spatial data
##### Data formats
To maximize reuse, avoid proprietary formats. The formats listed below can be read or imported by most mainstream GIS programs or with code using libraries such as GDAL.
Strongly recommended formats:
* **GeoTIFF** - An open format for storing spatial raster data and metadata in a TIFF file.
* **GeoPackage** - A standard format from the Open Geospatial Consortium (OGC) for storing vector and raster data in a SQLite database file.
Some other formats to consider are listed below.
* **KML/KMZ** - Keyhole Markup Language (KML) file and its zipped version for storing vector data. This format was popularized by Google Earth and is now an OGC standard. KML is best visualized in Google software and may not render as well in other GIS software.
* **GeoJSON** - A format for storing vector data as text in JavaScript Object Notation (JSON). GeoJSON data are limited to the WGS84 coordinate system.
* **netCDF/HDF5** - Binary formats originally designed for storing multidimensional arrays of spatial data typically organized onto a grid, but which can now accommodate vector data following the NetCDF Climate and Forecast Conventions (version 1.8 or higher).
A couple of Esri formats are worth mentioning and are listed below.
* **File geodatabase** - One of Esri's formats for storing vector and raster information. Several feature classes and rasters can be stored in this folder-and-file-based structure. GDAL's OpenFileGDB driver enables non-Esri software to view at least the data layers in a file geodatabase, but more advanced file geodatabase components such as topology rules or geometric networks may not be available outside of Esri software, and field types may not be imported correctly either. Export to GeoPackage instead (see the conversion sketch below), unless a geodatabase is the only format that supports the advanced representation of your GIS data. Just know that you limit potential reuse of your data if you use this format.
* **Shapefile** - A legacy format for vector data which is widely supported. Be aware of [shapefile limitations](https://en.wikipedia.org/wiki/Shapefile#Limitations) when considering this format. A shapefile consists of several individual files; include them as a single zip file in the data package. If the package has more than one shapefile, create a separate zip file for each shapefile.
Although other open formats exist, their implementation in popular GIS software may be less common. If a proprietary format must be used to capture the full meaning of the data, consider also including a version of the data in an open format such as a simple data table along with metadata explaining its limitations in that format, or instructions on how to utilize the proprietary format. For example, an Esri layer package could be used when including recommended symbols for drawing vector features in a GIS is desired, in which case one could note that the vector data can be extracted by treating the layer package as a zip file.
Formats that are composed of more than one file, such as shapefiles, should be zipped. Include one dataset per zip file. For example, if you have 10 shapefiles, you would create 10 zip files.
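To move data from a legacy or proprietary format into a recommended open format, GDAL-based tools can perform the conversion. A minimal R sketch with the sf package, using hypothetical file names:

```r
# Convert a shapefile to the recommended GeoPackage format
library(sf)
sites <- st_read("sites.shp")
st_write(sites, "sites.gpkg", layer = "sites")
```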
#### Documenting spatial data packages
##### Document as spatial[Raster, Vector] vs. otherEntity in EML
There is a noticeable divergence in EDI spatial data packages, specifically in the use of otherEntity vs. spatial[Raster,Vector]. Here we discuss pros and cons of why one might choose to document spatial data with one type of EML entity over another. Either method is acceptable, and we recommend using spatial[Raster,Vector] when feasible. The documentation that follows provides best practices that will maximize discoverability and usability of spatial data, regardless of the entity type used.
##### [otherEntity](https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_otherEntity)
* Pros
    * EML preparation is simpler than with the spatial EML types
    * Allows aggregated data structures (e.g., file geodatabases)
* Cons
    * Spatial data stored as otherEntity might be harder to discover because it may be difficult to determine whether data in an otherEntity is spatial data or some other type when searching or browsing
    * There is currently no controlled keywording to identify spatial data files that are included as otherEntity in EML
    * Tabular attributes of geometric entities may not be described in detail
    * Units (latitude/longitude vs. meters vs. feet) and projections may not be identified
##### spatial[[Raster](https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_spatialRaster),[Vector](https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_spatialVector)]
* Pros
    * EML more fully describes vector attributes
    * There is a well-documented path from Esri metadata to EML
    * An EML metadata search (on EDI or elsewhere) clearly identifies these as spatial datasets through the use of spatialRaster or spatialVector entities
    * LTER has built applications based on spatial[Raster,Vector] entities
* Cons
    * Data may not originate in ArcGIS, requiring a custom workflow to generate spatial entity EML
    * Spatial[Raster,Vector] can't describe multi-layer aggregates of GIS data (e.g., geodatabases containing multiple feature classes)
##### Keywords
Clearly identifying a dataset as spatial in nature is important to discoverability. This can be achieved by the use of keywords in the EML keyword elements, as well as in the title, abstract, and methods where appropriate. Frequently searched keywords include GIS, geographic information system, and spatial data, plus more specific format names like shapefile, GeoTIFF, etc. Consider including these as appropriate.
Do include the keywords **spatial vector** and **spatial raster** as appropriate for your data. These keywords should be used especially if the data are archived as otherEntity.
You may also include keywords that describe broad spatial data layers, e.g., digital elevation model, elevation, boundary, land use, land cover, census, parcel, imagery, as well as keywords that describe the specifics associated with a broad spatial data layer, e.g., land cover types such as water and vegetation types, land use types such as urban and forest, and so on.
##### GIS software compatible metadata
GIS platforms will not ingest EML metadata. If your GIS software creates its own software-specific metadata file, that file may be included as an otherEntity. Be sure to populate this metadata, for example with descriptions and units for attributes in vector data or raster attribute tables. Metadata in the standard ISO 19115 or CSDGM format is more useful still, because it can be read by other GIS software.
##### Attribute and coordinate system detail for otherEntity
While the GIS software compatible metadata included in the package typically describes attributes and coordinate systems of the data, such descriptions should also be included in the EML metadata to help users determine fit for use prior to data download. The EML spatialVector and spatialRaster types include elements for this purpose. EML otherEntity can also include attribute descriptions; however, inclusion of attributes in this more generic element may not be as common, and the element does not formally support a description of coordinate systems.
When using [otherEntity](https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_otherEntity) instead of spatialVector or spatialRaster, include coordinate system details in the otherEntity/entityDescription element. If not including a description of attributes in the otherEntity/attributeList element, at least include a summary description of attributes in otherEntity/entityDescription. If the spatial dataset and its associated metadata files are the only items in the data package, then you can include these descriptions in higher level EML elements such as the dataset abstract in addition to or in place of descriptions at the entity level.
##### Standardized content for formats and entity types
In EML physical/dataFormat/externallyDefinedFormat, include a **formatName** indicating the spatial data file format. We recommend using format names from the [DataONE format list](https://cn.dataone.org/cn/v2/formats) when possible. Some spatial items from that list are shown below. Always check the list for the most up-to-date version of these names.
* Esri Shapefile (zipped)
* Google Earth Keyhole Markup Language (KML)
* Google Earth Keyhole Markup Language (KML) Compressed archive
* Network Common Data Format, version 4
* Hierarchical Data Format version 5 (HDF5)
* GeoTIFF
* GeoPackage Encoding Standard (OGC) Format Family
* Esri File Geodatabase (zipped)
* GeoJSON, version RFC 7946
If your format is not included in the DataONE list, consider submitting an issue to that GitHub repository's issue tracker so that the format can be added.
EML **formatVersion**, a sibling of formatName, can be used to indicate the format version, as in the example EML snippet below.
```xml
<externallyDefinedFormat>
<formatName>Network Common Data Format, version 4</formatName>
<formatVersion>netCDF-4 classic</formatVersion>
</externallyDefinedFormat>
```
For otherEntity, when populating the **entityType** element, use **spatial raster** or **spatial vector** as appropriate.
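Putting these recommendations together, a minimal otherEntity sketch for a zipped shapefile might look like the following (the entity name, description text, attributes, and coordinate system shown are illustrative):

```xml
<otherEntity>
  <entityName>stream_sites.zip</entityName>
  <entityDescription>Zipped shapefile of stream sampling site points.
    Coordinate system: geographic latitude/longitude in decimal degrees,
    WGS 84 datum. Attributes include site code, site name, and
    elevation in meters.</entityDescription>
  <physical>
    <objectName>stream_sites.zip</objectName>
    <dataFormat>
      <externallyDefinedFormat>
        <formatName>Esri Shapefile (zipped)</formatName>
      </externallyDefinedFormat>
    </dataFormat>
  </physical>
  <entityType>spatial vector</entityType>
</otherEntity>
```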
## Data Gathered with Small Moving Platforms
Contributors: Sarah Elmendorf (lead), Tim Whiteaker, Lindsay Barbieri, Jane Wyngaard, Greg Maurer, Hap Garritt, Adam Sapp, Corinna Gries, Stace Beaulieu
### Introduction
Modern advances in technology have increasingly allowed the collection of ecological data using small, often uncrewed, moving platforms. Systems variously known as small Uncrewed Aircraft Systems (sUAS), Uncrewed Surface Vehicles (USV), Autonomous or Uncrewed Underwater Vehicles (AUV or UUV), or, more generally, "drones" now frequently serve as sensor-carrying platforms. Moving platforms may also include gliders or animals with sensors affixed. Depending on the sensor(s) installed on the moving platform, data collected may include environmental measurements (temperature, concentration of chemicals), imagery (digital photos, multi- or hyperspectral sensors), or other remote-sensing acquisitions (ranging data, ground-penetrating radar). Example research applications include studies of vegetation cover and phenology, snowpack cover and depth, ground surface temperature, terrain elevation, bathymetry, species distribution or abundance, and many others.
Raw drone data can be voluminous and challenging to archive, but after processing, derived drone datasets typically resemble the more conventional spatial datasets that are regularly used in ecological research. In this document we focus on best practices for archiving raw and derived drone data, with particular attention to metadata and processing code that is specific to drone datasets. Note that this chapter does not specifically address data collected by large moving platforms like airplanes and satellites, or by human and animal platforms.
### Recommendations for data packages
General considerations for archiving data from moving sensor platforms:
* **Repository**: We are currently unaware of many specialized repositories for these data; therefore, EDI is used as the representative data repository for many cases presented here. Repositories other than EDI may have specific metadata formatting requirements, but the general recommendations with regard to content should still apply. For LiDAR-based UAV data, consider contributing to OpenTopography ([https://opentopography.org/](https://opentopography.org/)); for AUV data, the U.S. Marine Geoscience Data System (MGDS) ([http://www.marine-geo.org/index.php](http://www.marine-geo.org/index.php)), which serves the IEDA "MGDL" node in DataONE, is a good option. Glider data may be contributed to the U.S. IOOS Glider DAC ([https://gliders.ioos.us/data/](https://gliders.ioos.us/data/)) and archived at the National Centers for Environmental Information (NCEI), thus fulfilling the NSF OCE Data Policy. If a decision is made to archive an LTER drone dataset in an external (i.e., non-EDI) repository but links to EDI data packages are desired, recommendations in the [Data in Other Repositories](data-in-other-repositories.html) chapter may apply.
* **Size of data set**: The file size of raw data from drone imagery can be substantial. If large volumes of raw data (>100 GB in total) are to be archived on EDI, please coordinate with EDI and follow the best practices for [Large Data Sets](large-data-sets.html). Even if raw data are in a proprietary binary format and specific software is required for processing, publishing them may be important to allow reprocessing when software improves.
* **Designing a data package**: In many applications of moving sensor data, raw images/measurements must be processed to arrive at data products that can be analyzed to answer research questions. To enable a fully reproducible analysis pipeline, we recommend archiving three components: the raw data, any key derived data products (e.g., orthomosaic images, DEMs, DSMs, NDVI, landcover, snow depth, or surface temperature maps), and the processing code. These three components may be archived in separate data packages or together, and each should follow accepted best practices for its data type. To archive raw image collections, for example, see the considerations on grouping images into compressed archives (.zip, .tar) and creating an inventory file, as described in the [Images and Documents as Data](images-and-documents-as-data.html) chapter. For derived geospatial files, such as DEMs, refer to the [Spatial Data](spatial-data.html) chapter. Custom processing code should be archived with the data following recommendations in the [Code in EDI](code.html) chapter. If a standalone program is used to process data, reference the program in the methods metadata with adequate details to ensure reproducibility (name, version, date, configuration, etc.).
#### Metadata for moving platform data packages at EDI
The data package should include metadata elements that, at a minimum, (a) identify it as being collected by a moving platform, (b) deliver basic information about the data collection platform, instrument payload (camera, sensors), and procedure (flight information or similar), and (c) deliver necessary information about post-processing of the raw camera or sensor data, if any. Accordingly, these recommendations vary based on whether the data package contains raw or derived data.
High level metadata pertaining to the entire data package are easily provided in the EML file (e.g. a geographic bounding box). Data packages from drones or other moving platforms commonly include numerous point measurements, images, or other granular data entities, either separately or inside a compressed archive file. Detailed metadata pertaining to these data entities may be included as additional files in the data package. Inventory tables, usually a simple CSV file, are one such additional metadata format. For example, an inventory table could be used to list individual data files in the data package (e.g., images from one drone flight) and provide metadata (e.g. point location) about each. In addition to inventory tables, files that enable or supplement common processing pipelines, such as flight or mission logs, may be included. A flight/mission log may be provided in a proprietary binary format, but because software for parsing these formats may become obsolete, we recommend archiving the log in the format most useful for contemporary analysis software, and extracting and appending the information to the inventory file where appropriate. Exif (Exchangeable image file format) metadata in images may also be programmatically extracted to supplement the inventory file.
Clearly, there are many possible ways to combine raw data, derived data, and metadata files into a moving platform data package. No matter the combination, the critical metadata categories and the recommended contents below should be considered and included where possible. The decision on whether to provide the metadata in EML or at a more granular level, such as an inventory table, will depend on the given dataset.
* **Methods:** unique identifier for a given flight or mission; summary information from a flight log; weather conditions; accuracy of sensor and geographic location information; data processing method; ground sample distance; image overlap; flight height; whether UAS followed terrain elevation vs fixed-height flight; location of UAS launch (since some image metadata are derived from this); general description of software used and for what purpose; sensor calibration date and procedure; general description of payload type, such as multispectral camera and spectral bands.
* **Instrumentation:** make and model of platform, sensor, and camera, including manufacturer and specific model names and numbers. Include make and model of any interchangeable lenses in cameras. Specifics like spectral bands, temperature range, sensor accuracy, etc.
* **Software:** (see also the [Code in EDI](code.html) chapter) list the software used. Especially when code is proprietary or archived elsewhere, provide the name, version, and configuration of any software used, as well as any corrections applied (e.g., correction for sensor angle or heat/air flow). Ideally, a .pdf report generated by the processing software can be archived as an otherEntity together with the imagery itself to convey much of the necessary information. Also include any data used as ground truth or calibration points for post-processing (e.g., spectral calibration images, biomass samples, wind speed) and their dates of collection.
* **People with specified role:** drone operator, image processor
* **Geographic Information:** (see also the [Spatial Data chapter](spatial-data.html)) a general bounding box should be included in EML, while the individual locations of images or point measurements should be handled in the inventory table or directly in the included data files (a minimal EML bounding-box sketch follows this list). Also include the coordinate reference system (e.g., WGS 84) used for images and (if different) ground control points, projection type if needed, altitude of image/measurement acquisition, spatio-temporal coordinates, and pitch, roll, and yaw from the flight log or image data points. Note that there are special considerations for underwater vehicles, especially with regard to metadata explaining how geographic positions were obtained. With autonomous underwater vehicles (AUVs), there can be error sources in the topside GPS localization, the underwater acoustic positioning system (e.g., Long Baseline, Ultrashort Baseline), and any sensors used for dead reckoning (e.g., accelerometer, Doppler Velocity Log). At a minimum, it is useful to know which sensors were used to produce the localization data and whether the navigation tracklines were post-processed with benchmarks.
* **Temporal Information:** may also be provided at the EML level or as timestamps for individual data/image points, either in inventory tables or in the data files themselves. Time of day critically affects usability for image-based datasets, so ensure that the time of day is clear from the metadata available prior to download, either in the EML temporal coverage or via the methods.
* **Keywords:** Use of appropriate keywords aids in data discovery. Keywords that identify datasets as drone-related are therefore recommended (e.g. drone, UAV, UAS). Keywords describing the type of data collected are also recommended (e.g. image collection, aerial imagery, thermal imagery, NDVI, digital elevation map). For drone mapping data products, keyword recommendations from the [Spatial Data chapter](spatial-data.html) are largely applicable.
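For the package-level bounding box mentioned in the Geographic Information item above, a minimal EML coverage sketch might look like the following (the description and coordinates are illustrative):

```xml
<coverage>
  <geographicCoverage>
    <geographicDescription>UAV survey area, southern New Mexico, USA</geographicDescription>
    <boundingCoordinates>
      <westBoundingCoordinate>-106.80</westBoundingCoordinate>
      <eastBoundingCoordinate>-106.70</eastBoundingCoordinate>
      <northBoundingCoordinate>32.62</northBoundingCoordinate>
      <southBoundingCoordinate>32.55</southBoundingCoordinate>
    </boundingCoordinates>
  </geographicCoverage>
</coverage>
```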
#### Examples and additional metadata guidance
Several EDI data packages for data from moving platforms are presented as examples in Table 1. Many more detailed, "drone-specific" metadata terms and values can be included in data packages for drones and other moving platforms. For completeness, we have developed a comprehensive list of recommended and optional metadata terms based on the work of Wyngaard et al. (2019) and Thomer et al. (2020), with mappings to select relevant ontologies, viewable [here](https://docs.google.com/spreadsheets/d/1PQ0SUEQLgQXdz2PUNDy2jGry3o9veAedcz8l-5ubpwU). For each metadata element, we assessed its utility in terms of data discovery, evaluation of fitness for use, and actual data reuse. The minimum recommended subsets of metadata included in the section above were derived from this table.
Table 1. Example packages at EDI and other repositories
<table>
<tr>
<td><strong>Title</strong>
</td>
<td><strong>Description</strong>
</td>
<td><strong>EDI packageID</strong>
</td>
</tr>
<tr>
<td><em>Orthophoto and elevation models from UAV overflights at the G-IBPE study site at Jornada Basin LTER in 2019</em>
</td>
<td>Approximately 599 RGB images and data derived from uncrewed aerial vehicle (UAV) overflights of the G-IBPE study site at the Jornada Basin LTER in southern New Mexico, USA.
</td>
<td><a href="https://portal-s.edirepository.org/nis/mapbrowse?scope=knb-lter-jrn&identifier=210543001">knb-lter-jrn.210543001</a>
</td>
</tr>
<tr>
<td><em>Aerial imagery from unmanned aerial systems (UAS) flights and ground control points: Plum Island Estuary and Parker River NWR (PRNWR), February 27th, 2018.</em>
</td>
   <td>USGS aerial imagery from UAS flights at the Parker River National Wildlife Refuge, Massachusetts, USA. Includes ground control, multispectral, and true color child items, each of which has data entities that include ground control points or a file catalog of images.
</td>
<td><a href="https://www.sciencebase.gov/catalog/item/5c0fe16de4b0c53ecb2d1bc3">ScienceBase</a>
</td>
</tr>
<tr>
<td><em>Spatial variability in water chemistry of four Wisconsin aquatic ecosystems - High speed limnology Environmental Science and Technology datasets</em>
</td>
   <td>Water chemistry sensors embedded in a high-speed water intake system to document spatial variability.
</td>
<td><a href="https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-ntl.337.4">knb-lter-ntl.337.4</a>
</td>
</tr>
<tr>
<td><em>Thermal infrared, multispectral, and photogrammetric data collected by drone for hydrogeologic analysis of the East River beaver-impacted corridor near Crested Butte, Colorado</em>
</td>
   <td>Infrared, multispectral, and visual image data, and derivative products (orthomosaic and digital surface model), collected along a beaver-impacted section of the East River from August 12-17, 2017 and July 28-August 2, 2018.
</td>
<td><a href="https://www.sciencebase.gov/catalog/item/5ccc4cc9e4b09b8c0b78c97a">ScienceBase</a>
</td>
</tr>
</table>
### Resources
#### Tips and Tricks
For making an image catalog (.csv) from a directory of images, consider using ExifTool ([https://exiftool.org/](https://exiftool.org/)). For example, the command `exiftool.exe -csv -r mydirectory > image_catalog.csv` will extract all of the Exif tags from the files stored under mydirectory into a comma-delimited table and write it to the file image_catalog.csv.
#### Semantic Annotation
Semantic annotation of drone imagery is a rapidly developing field. Ontologies that provide relevant terms include: [dronetology](http://www.dronetology.net/); [sensorML](http://www.sensorml.com/ontologies.html); [FGDC content standard for digital geospatial metadata](https://www.fgdc.gov/metadata/csdgm/) (not officially an ontology but a structured metadata format with defined terms); [Semantic Sensor Network ontology](https://www.w3.org/TR/vocab-ssn/) (SSN, including the SOSA core); [Semantic Web for Earth and Environment Technology ontology](https://bioportal.bioontology.org/ontologies/SWEET) (SWEET); and [Environment Ontology](http://www.obofoundry.org/ontology/envo.html).
#### References
Thomer, Andrea K., Swanz, Sarah, Barbieri, Lindsay, Wyngaard, Jane. (2020). A minimum information framework for the FAIR collection of earth and environmental science data with drones. DOI: 10.5281/zenodo.4017647
Wyngaard, J.; Barbieri, L.; Thomer, A.; Adams, J.; Sullivan, D.; Crosby, C.; Parr, C.; Klump, J.; Raj Shrestha, S.; Bell, T. Emergent Challenges for Science sUAS Data Management: Fairness through Community Engagement and Best Practices Development. _Remote Sens._ **2019**, _11_, 1797.
## Data in Other Repositories
Contributors: Greg Maurer (lead), Stace Beaulieu, Renée Brown, Sarah Elmendorf, Hap Garritt, Gastil Gastil-Buhl, Corinna Gries, Li Kui, An Nguyen, John Porter, Margaret O'Brien, Tim Whiteaker
### Introduction
A wide variety of data repositories are available for publishing biological, environmental, and Earth observation data, and the choice of where to publish a particular dataset is determined by many competing factors. For example, a funding agency or journal may require a certain repository (e.g., NSF BCO-DMO, NSF ADC, USDA ADC, DOE ESS-DIVE); the research subject or data type may be best served by a specialized repository (e.g., AmeriFlux, GenBank); or datasets may be submitted to a general purpose repository with minimal metadata requirements to simplify and speed data publishing (e.g., DRYAD, Figshare, Zenodo). For these and other reasons, related datasets are sometimes published in disparate data repositories, the same data need to be discoverable in more than one repository, or multiple datasets from one or more repositories may be used to create a new, derived dataset. In such cases, it can be advantageous to establish links between datasets in different repositories such that provenance, supplementation, duplication, or other relationships are explicit. Clearly, this subject goes well beyond the single repository, and better standards and approaches for linking resources and documenting data provenance are being developed elsewhere (e.g., [DataONE](https://www.dataone.org/network/), [ProvONE](https://purl.dataone.org/provone-v1-dev), [WholeTale](https://wholetale.org/)). Here we concentrate on specific cases in the context of large and multidisciplinary projects, such as LTER sites, that wish to enhance data discovery and preserve data relationships across multiple repositories.
### Recommendations for data packages
#### Considerations for creating linked data
In practice, links to data in other repositories can be achieved using metadata only, by including a data inventory file, or, although not recommended, by duplicating the data in the new repository record. **Generally, duplicating data in multiple repositories is not recommended because it creates two problems.** First, it is a burden to maintain multiple copies of a dataset and avoid divergence between them. Second, it can create confusion for data re-users who may download or cite the same data multiple times. Care must be taken to clearly identify such duplications for data users when they are created. Whenever linked datasets are created, it is strongly recommended that both repositories are aligned with FAIR data principles, [outlined here](https://doi.org/10.1038/sdata.2016.18), so that users have unfettered access to all data and metadata.
In addition to these considerations, there are a number of reasons to create a new repository record that is linked to data in other repositories. Each of these reasons, which are outlined below, has pros and cons that will need to be weighed from the different perspectives of the data user, data provider and research project management requirements.
* **Requirements dictate multiple repositories:** Large research projects or sites are frequently funded by different agencies and programs. Data collection may be supported by several such funding streams and, hence, fall in the purview of more than one requirement to archive data in a particular repository. In some cases, data repositories already accommodate such requirements by linking or replicating data appropriately. Examples of this are LTER data in EDI, NSF BCO-DMO and NSF ADC.
* **Adding important metadata**: If data were originally submitted to a general purpose repository with minimal metadata requirements (e.g., DRYAD, figshare) additional metadata (e.g., EML) may be needed for discoverability, reusability, and integration. By creating a new repository record that identifies and is linked to the original published dataset, richer and more useful metadata can be added to the new record and utilized.
* **Use of specialist repositories for related data:** There are sometimes advantages to publishing particular data types in specialized repositories. Specialized data repositories (e.g., GenBank, AmeriFlux) usually enforce strict data formatting and quality standards, and they provide enhanced search, discovery, and reuse of particular types of data across projects in a way that is not possible with a generalized metadata format (EML) and repository (EDI). However, these data may not be discoverable alongside other related project data taken at the same location and time. Creating links between related datasets held in specialist and generalist repositories helps preserve this context.
* **A derived data product is archived in a different repository than the source (raw) data:** A wide range of cases fall into this category, from a direct one to one relationship of, e.g., a gene sequence and its OTU identification, a metagenome analysis and its community diversity metrics, to several datasets being combined in synthesis or meta-analysis studies. In these cases, links between source data and derived data products that are published in separate repositories need to be created and clearly documented.
* **Linking to site- or project-relevant data from other research groups or agencies:** Although it may help with some aspects of data discovery, it is generally not recommended to create records in EDI for data collected and managed by entirely different research groups or agencies. **_In these cases, however, it is recommended to place a pointer to such repositories on a project website or develop other means for data users to discover relevant resources._**
#### General metadata for linked data packages in EDI
In EDI, the linked data package can be assembled using standard practices and EML metadata elements, but the included metadata and data entities must clearly lead the data user to files held in outside repositories. In addition, the package metadata should communicate the essential elements needed for data discovery (subject matter, authors, location, time frame, etc.) and a brief description of how the data may be accessed, re-used, and cited via the outside repository as needed. General guidance on the content and structure of key metadata elements in an EDI data package linked to data in other repositories is given below.
* **Abstract:** Describe the key features of the data package. If the data package contains only links to data held in other repositories, or data duplicated from another repository, clearly state that the original data are located in a different repository and direct the user to the correct data citation. Describe the target data in sufficient detail that users can determine whether these data are fit for their use, and instruct them on how to find and re-use the data.
* **Methods:** Collection/generation methods for any data entities included or linked to. If the methods are well-described in the metadata at another repository, this element can simply refer users there. If the new data package includes ancillary data or derived data, describe how those data were collected or derived.
* **Geographic description** and **coordinates:** At a minimum these elements should define a bounding box that will make the data package discoverable through EDI, DataOne, or other geographic search interfaces. Additional, more detailed coordinates may be given in the inventory file entity as described below.
* **Keywords:** Since some linked data packages include an inventory of data held at a different repository, include the keyword "data inventory" and thematic keywords that describe the data entities in the other repository.
#### Common use cases and their structure in EDI
There are several common use cases for creating a new linked data package in EDI. The new package may establish either a one-to-one link from EDI to a dataset in another repository, or a one-to-many relationship that is more complex. Three possible cases are described below in terms of what entities to publish, where to publish them, metadata elements to be created in EML, and the contents of included data entities. There are likely to be other use cases for linking EDI data packages to other repositories.
**Case 1:** One dataset needs to be discoverable in more than one data repository. The data remain the same, but the metadata in the new data package at EDI may be upgraded beyond what exists in the other repository.
* The metadata in EDI must clearly state, preferably in the abstract or another obvious location, that this data package is already published in another repository. Include the original unique identifier and instruct users to cite that original data, if appropriate.
* Include instructions on how to access and cite the original data if the original repository is lacking in such guidance.
* If data are duplicated (which is not recommended), metadata should include information on how versions in different repositories are kept synchronized. If such synchronization is not feasible, users should be warned to inspect both sources for the latest data.
* In EML, the <alternateIdentifier> field may be used to store the persistent identifier (DOI) or a link (URL) that refers to the data held in the other repository, making the link machine readable (see the sketch below). Where an external repository supplies both a URL and a DOI, use the DOI, as URLs may not be maintained through time.
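A minimal sketch of such a machine-readable link at the dataset level (the DOI shown is a placeholder):

```xml
<!-- Placeholder DOI pointing to the original dataset in the other repository -->
<alternateIdentifier system="https://doi.org">doi:10.xxxx/xxxxxx</alternateIdentifier>
```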
**Case 2:** A list of data records held in a specialized repository needs to be linked to ancillary or supporting data that are being published in EDI (for derived data see Case 3).
* This case applies when a collection of datasets, or similar scientific resources, is held in a specialized repository and closely related ancillary or supporting data and metadata need to be archived in a more generalist data repository like EDI. For example, ancillary environmental data or laboratory analyses held in EDI could be linked to collections of sequence reads held in NCBI GenBank or museum voucher specimens archived with Darwin Core metadata. See complete examples in Table 5.3.
* The new EDI data package should include a 'data inventory' (or manifest of holdings) file as a data entity. This is most likely a simple tabular data file, such as a CSV, that lists and describes the repository records held in the specialized data repository and has its column attributes described in EML as a [dataTable](https://eml.ecoinformatics.org/schema/eml-dataTable_xsd.html#eml-dataTable.xsd) entity.
* The inventory table must have a row for each outside repository record (or some meaningful grouping of records, e.g., project in NCBI) being linked to, with columns that include persistent unique identifiers of the data in the other repository, and relevant descriptors of the data. The complete content of the inventory will be dictated by the structure of the other repository and the data entities and metadata held there. Suggested columns are presented in Table 5.1.
* The inventory table may also provide additional contextual information for each individual data resource in another repository. Table 5.2 presents examples of these contextual columns. They are, however, subject dependent and may vary for different projects. For more examples, see the discussion on sequencing and genomic data later in this document.
Table 5.1: Suggested columns for identifying the external data in the data inventory table.
<table>
<tr>
<td><strong>Column</strong>
</td>
<td><strong>Description</strong>
</td>
</tr>
<tr>
<td>External unique ID
</td>
<td>Unique identifier for the data resource in the other repository. E.g. Accession number
</td>
</tr>
<tr>
<td>External access URL
</td>
<td>A unique, persistent link to the data resource in the other repository.
</td>
</tr>
<tr>
<td>Title/description
</td>
<td>Title and/or brief description of the data resource
</td>
</tr>
<tr>
<td>Filename(s)
</td>
<td>Dataset or file name at the other repository
</td>
</tr>
<tr>
<td>Format
</td>
<td>File format of above
</td>
</tr>
<tr>
<td>Repository URL
</td>
<td>URL of the repository being linked to
</td>
</tr>
</table>
Table 5.2: Examples of additional contextual columns in the data inventory table.
<table>
<tr>
<td><strong>Column</strong>
</td>
<td><strong>Description</strong>
</td>
</tr>
<tr>
<td>Latitude/Longitude
</td>
<td>Latitude and longitude in standard format for each data resource in the other repository.
</td>
</tr>
<tr>
<td>Location name
</td>
<td>Locally used name of collection site
</td>
</tr>
<tr>
<td>Treatment level
</td>
<td>Experimental treatment applied to the outside dataset
</td>
</tr>
<tr>
<td>Start/End datetime
</td>
<td>Starting/ending datetime of the data resource (NA for End if data collection is ongoing)
</td>
</tr>
<tr>
<td>Reference publication
</td>
<td>DOI of publication providing in-depth context for data
</td>
</tr>
</table>
**Case 3:** One or more datasets in other repositories are used to create derived data products that need to be archived in EDI.
* In this case the new dataset is directly or indirectly derived from the 'source' dataset(s) in other repositories. Such derived data may serve a wide range of research purposes, including use in cross-site synthesis, re-analysis, or meta-analysis studies.
* Provenance metadata should be used to describe the relationship between the source and derived datasets, which ensures reproducibility and preserves data lineage. In a new EDI data package that archives derived data, the provenance metadata should be inserted in the EML file utilizing <dataSource> elements. The <dataSource> elements should be nested within a <methodStep> element and will establish the links to any source datasets located in another repository. An example snippet of provenance EML is shown in Example 1 below.
* Other cross-repository standards for provenance metadata are still being developed and are not widely adopted, e.g., [ProvONE](http://homepages.cs.ncl.ac.uk/paolo.missier/doc/dataone-prov-3-years-later.pdf).
* The EDI portal interface provides [automatic generation of provenance metadata](https://portal.edirepository.org/nis/provenanceGenerator.jsp) EML snippets for datasets in EDI. The [EMLassemblyline](https://github.com/EDIorg/emlAssemblyLine) and [MetaEgress](https://github.com/BLE-LTER/MetaEgress) (in connection with [LTER-core-metabase](https://github.com/lter/LTER-core-metabase)) R packages for EML creation will also generate provenance metadata.
Example 1: EML snippet with a data provenance methodStep:
```xml
<methodStep>
<description>
<para>This methodStep contains data provenance information as specified in the LTER EML Best Practices. Each dataSource element here lists entity-specific information and links to source data used in the creation of this derivative data package.</para>
</description>
<dataSource>
<title>Source dataset title</title>
<creator>
<individualName>
<givenName>first name</givenName>
<surName>last name</surName>
</individualName>
<organizationName>organization name</organizationName>
<electronicMailAddress>email@some.edu</electronicMailAddress>
</creator>
<distribution>
<online>
<onlineDescription>This is a link to an external online data resource (describe resource and repository location).</onlineDescription>
<url function="information">https://pasta.lternet.edu/package/metadata/eml/knb-lter-ntl/80/2</url>
</online>
</distribution>
<contact>
<positionName>Information Manager</positionName>
<organizationName>organization name</organizationName>
<electronicMailAddress>infomgr@some.edu</electronicMailAddress>
</contact>
</dataSource>
</methodStep>
```
### Nucleotide sequence and genomic data
Nucleotide sequence data consist of the order and arrangement of DNA or RNA bases extracted from individual organisms or environmental samples. Similarly, genomic data refer to the complete genetic information (either DNA or RNA) of an organism, while metagenomic data comprise genetic material recovered directly from environmental samples. Sequencing, genomic, and metagenomic datasets can be very large and complex, and researchers in these fields benefit from particular methods of data access, analysis, and collaboration. Therefore, these data have specialized requirements for data archiving.
Archiving nucleotide sequence and genomic (or other '[omics](https://en.wikipedia.org/wiki/Omics)') data is a common use case for creating linked datasets. Data that originate from nucleotide sequencing techniques are most often stored in specialized repositories such as the National Center for Biotechnology Information (NCBI) GenBank and the European Nucleotide Archive. However, while sequences or assembled genomes constitute important raw data, ancillary and derived data products related to these raw data are frequently published in repositories specializing in ecological data. For example, data derived from sequence data, such as operational taxonomic units (OTUs) or functional assignments, and ancillary data that describe the environmental, biochemical, or experimental context of the sequencing data, are often included in scientific publications and do not always fit within the scope of a specialized sequence or genome data repository.
#### Recommendations for sequencing or genomic datasets
Linking to genomics data is an example of Case 2 described above. Summaries or inventories of data records held in a repository like NCBI GenBank are linked to their derived products or additional measurements published in a more generalist repository such as EDI.
In addition to the metadata typically included with any data package published by the site or research group, include metadata that specifically describes sequencing and genomics datasets. It is recommended to refer to the [MIxS templates](https://press3.mcs.anl.gov/gensc/mixs/) for standard terminology, especially in the keyword section:
**Keywords** that can help users discover the sequencing or genomic dataset include the following (a minimal keywordSet sketch follows this list):
1. General data type descriptions ('nucleotide sequence', 'genomics', 'metagenomics')
2. Names of target genes or subfragments ('16S rRNA', '18S rRNA', 'nif', 'amoA', 'rpo', 'ITS')
3. Names of the sequencing technique ('Sanger', 'pyrosequencing', 'ABI-solid')
4. Names of the linked repository ('SRA', 'EMBL', 'Ensembl')
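As with spatial data earlier in this chapter, these terms can be supplied in an EML keywordSet; a minimal sketch with illustrative keywords drawn from the categories above (plus the "data inventory" keyword recommended for linked packages):

```xml
<keywordSet>
  <keyword>metagenomics</keyword>
  <keyword>16S rRNA</keyword>
  <keyword>pyrosequencing</keyword>
  <keyword>SRA</keyword>
  <keyword>data inventory</keyword>
</keywordSet>
```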