@@ -10,6 +10,277 @@ Generalizers take example data and *generalizes* to a schema
Generalization is inherently a heuristic process; it should be viewed as a bootstrapping step
that *semi*-automates the creation of a new schema for you.

+ Generalizing from a single TSV
+ ------------------------------
+
+ .. code-block::
+
+     schemauto generalize-csv tests/resources/NWT_wildfires_biophysical_2016.tsv -o wildfire.yaml
+
+ The schema will have a slot for every column, e.g.:
+
+ .. code-block:: yaml
+
+     classes:
+       Observation:
+         slots:
+           - site
+           - plot
+           - plot_size
+           - date
+           - observer
+
+ Ranges will be auto-inferred, e.g.:
+
+ .. code-block:: yaml
+
+     slots:
+       site:
+         examples:
+           - value: ZF20-105
+         range: string
+       plot:
+         examples:
+           - value: '6'
+         range: integer
+       plot_size:
+         examples:
+           - value: 10X10
+         range: plot_size_enum
+       date:
+         examples:
+           - value: '2016-07-09'
+         range: datetime
+
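To make the idea of range inference concrete, here is a minimal sketch of one way a column's range could be guessed from its values. It is not the tool's actual implementation: the function name ``infer_range`` and the order in which types are tried are assumptions made for illustration only.

```python
from datetime import datetime


def infer_range(values):
    """Guess a range name for a column of raw string values (illustrative only)."""
    cleaned = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not cleaned:
        return "string"  # nothing to go on, fall back to the most general type

    def all_parse(parser):
        # True only if every non-empty value is accepted by the parser
        for v in cleaned:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    # Try the most specific type first, then progressively more general ones
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "string"


print(infer_range(["6", "12", "3"]))              # integer
print(infer_range(["2016-07-09", "2016-07-10"]))  # datetime
print(infer_range(["ZF20-105", "ZF20-106"]))      # string
```

A single unparseable value is enough to push a column back to ``string``, which is one reason the inferred ranges should be reviewed by hand.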
+ Enums will be automatically inferred:
+
+ .. code-block:: yaml
+
+     enums:
+       plot_size_enum:
+         permissible_values:
+           10X10:
+             description: 10X10
+           5x5:
+             description: 5x5
+           2.5X2.5:
+             description: 2.5X2.5
+           5X5:
+             description: 5X5
+           3x3:
+             description: 3x3
+       ecosystem_enum:
+         permissible_values:
+           Open Fen:
+             description: Open Fen
+           Treed Fen:
+             description: Treed Fen
+           Black Spruce:
+             description: Black Spruce
+           Poor Fen:
+             description: Poor Fen
+           Fen:
+             description: Fen
+           Lowland:
+             description: Lowland
+           Upland:
+             description: Upland
+           Bog:
+             description: Bog
+           Lowland Black Spruce:
+             description: Lowland Black Spruce
+
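The enum inference above can be pictured with a small sketch. This is an assumption about the kind of heuristic involved, not the tool's own code: the function ``infer_enum`` and the thresholds ``max_distinct`` and ``min_repeat_ratio`` are hypothetical names and values chosen for illustration.

```python
def infer_enum(column_name, values, max_distinct=10, min_repeat_ratio=2.0):
    """Return an enum definition if the column looks categorical, else None.

    max_distinct and min_repeat_ratio are illustrative thresholds,
    not settings taken from the tool.
    """
    non_empty = [v for v in values if v != ""]
    distinct = sorted(set(non_empty))
    if not distinct or len(distinct) > max_distinct:
        return None
    # Require values to repeat on average, so ID-like columns are not enums
    if len(non_empty) / len(distinct) < min_repeat_ratio:
        return None
    return {
        f"{column_name}_enum": {
            "permissible_values": {v: {"description": v} for v in distinct}
        }
    }


sizes = ["10X10", "5x5", "10X10", "5X5", "5x5", "10X10"]
print(infer_enum("plot_size", sizes))
print(infer_enum("site", ["ZF20-105", "ZF20-106", "ZF20-107"]))  # None
```

Note that ``5x5`` and ``5X5`` stay separate permissible values in this sketch, just as they do in the generated schema above; normalizing such variants is left to the schema author.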
+ Chaining an annotator
+ ---------------------
+
+ If you provide an ``--annotator`` option you can auto-annotate enums:
+
+ .. code-block::
+
+     schemauto generalize-csv \
+         --annotator bioportal:envo \
+         tests/resources/NWT_wildfires_biophysical_2016.tsv \
+         -o wildfire.yaml
+
+ .. code-block:: yaml
+
+     ecosystem_enum:
+       from_schema: https://w3id.org/MySchema
+       permissible_values:
+         Open Fen:
+           description: Open Fen
+           meaning: ENVO:00000232
+           exact_mappings:
+             - ENVO:00000232
+         Treed Fen:
+           description: Treed Fen
+           meaning: ENVO:00000232
+           exact_mappings:
+             - ENVO:00000232
+         Black Spruce:
+           description: Black Spruce
+         Poor Fen:
+           description: Poor Fen
+           meaning: ENVO:00000232
+           exact_mappings:
+             - ENVO:00000232
+         Fen:
+           description: Fen
+           meaning: ENVO:00000232
+         Lowland:
+           description: Lowland
+         Upland:
+           description: Upland
+           meaning: ENVO:00000182
+         Bog:
+           description: Bog
+           meaning: ENVO:01000534
+           exact_mappings:
+             - ENVO:01000535
+             - ENVO:00000044
+             - ENVO:01001209
+             - ENVO:01000527
+         Lowland Black Spruce:
+           description: Lowland Black Spruce
+
+ The annotation can also be run as a separate step.
+
+ See :ref:`annotators`.
+
+ Generalizing from multiple TSVs
+ -------------------------------
+
+ You can use the ``generalize-tsvs`` command to generalize from *multiple* TSVs, with
+ foreign key linkages auto-inferred.
+
+ For example, given a file ``envo.tsv``:
+
+ .. csv-table:: environments
+     :header: envo term id, envo term label
+
+     ENVO_01000752,area of barren land
+     ENVO_01001570,terrestrial ecoregion
+     ENVO_01001581,sea surface layer
+     ENVO_01001582,forest floor
+
+ And a file ``sample.tsv``:
+
+ .. csv-table:: samples
+     :header: BIOSAMPLE_ID,BIOSAMPLE_NAME,ENVO_BIOME_ID,ENVO_FEATURE_ID,ENVO_MATERIAL_ID
+
+     156554,"Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgControl_Nextera2",ENVO_01000174,ENVO_01000159,ENVO_00002261
+     156649,"Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera5",ENVO_01000174,ENVO_01000159,ENVO_00005781
+     156728,"Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera84",ENVO_01000174,ENVO_01000159,ENVO_00005781
+     156738,"Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWMinControl_Nextera2",ENVO_01000174,ENVO_01001275,ENVO_00002261
+
+ We can create a multi-class schema, with foreign keys inferred:
+
+ .. code-block::
+
+     schemauto generalize-tsvs --infer-foreign-keys sample.tsv envo.tsv
+
+ This will generate a schema with two classes, where the join between the sample table and the term table
+ is inferred:
+
+ .. code-block:: yaml
+
+     classes:
+       sample:
+         slots:
+           - BIOSAMPLE_ID
+           - BIOSAMPLE_NAME
+           - ENVO_BIOME_ID
+           - ENVO_FEATURE_ID
+           - ENVO_MATERIAL_ID
+       envo:
+         slots:
+           - ENVO_ID
+           - ENVO_LABEL
+
+     slots:
+       BIOSAMPLE_ID:
+         range: integer
+       BIOSAMPLE_NAME:
+         range: string
+       ENVO_BIOME_ID:
+         examples:
+           - value: ENVO_01000022
+         range: envo
+       ENVO_FEATURE_ID:
+         range: envo
+       ENVO_MATERIAL_ID:
+         range: envo
+       ENVO_ID:
+         identifier: true
+         range: string
+       ENVO_LABEL:
+         range: string
+
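One plausible way to infer such a join is to check whether every value in a column also appears in the identifier column of another table. The sketch below illustrates that idea under stated assumptions; it is not the tool's implementation. The helper names ``find_identifier`` and ``infer_foreign_keys``, the rule for picking an identifier column, and the toy data values are all hypothetical.

```python
def find_identifier(columns):
    """Return the first column whose values are all non-empty and distinct."""
    for name, values in columns.items():
        if values and all(values) and len(set(values)) == len(values):
            return name
    return None


def infer_foreign_keys(tables):
    """Map (table, column) to the table it appears to reference.

    `tables` maps a table name to {column name: list of string values}.
    A column is linked to another table when all of its values occur
    in that table's identifier column.
    """
    identifiers = {t: find_identifier(cols) for t, cols in tables.items()}
    links = {}
    for table, cols in tables.items():
        for column, values in cols.items():
            for other, id_col in identifiers.items():
                if other == table or id_col is None:
                    continue
                targets = set(tables[other][id_col])
                if values and set(values) <= targets:
                    links[(table, column)] = other
    return links


# Toy data with made-up identifiers, only to exercise the sketch
tables = {
    "sample": {
        "BIOSAMPLE_ID": ["1", "2", "3"],
        "ENVO_BIOME_ID": ["ENVO_A", "ENVO_A", "ENVO_B"],
    },
    "envo": {
        "ENVO_ID": ["ENVO_A", "ENVO_B", "ENVO_C"],
        "ENVO_LABEL": ["forest", "desert", "tundra"],
    },
}
print(infer_foreign_keys(tables))
```

With the toy data, only ``ENVO_BIOME_ID`` is linked, and its range would then be set to the ``envo`` class rather than ``string``. A subset test like this can produce false links on small tables, so inferred foreign keys deserve a manual check.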
+ Generalizing from tables on the web
+ -----------------------------------
+
+ You can use ``generalize-htmltable`` to generalize from an HTML table on a web page:
+
+ .. code-block::
+
+     schemauto generalize-htmltable https://www.nature.com/articles/s41467-022-31626-4/tables/1
+
+ This will generate:
+
+ .. code-block:: yaml
+
+     name: example
+     description: example
+     id: https://w3id.org/example
+     imports:
+       - linkml:types
+     prefixes:
+       linkml: https://w3id.org/linkml/
+       example: https://w3id.org/example
+     default_prefix: example
+     slots:
+       GWAS trait:
+         examples:
+           - value: "\xC2"
+         range: string
+       Peak GWAS SNP:
+         examples:
+           - value: rs2974298
+         range: string
+       Gene:
+         examples:
+           - value: SMIM19
+         range: string
+       NK cell cis eSNP:
+         examples:
+           - value: rs2974348
+         range: string
+       TWAS Z score:
+         examples:
+           - value: '3.809'
+         range: string
+       TWAS P value:
+         examples:
+           - value: '0.0001'
+         range: string
+     classes:
+       example:
+         slots:
+           - GWAS trait
+           - Peak GWAS SNP
+           - Gene
+           - NK cell cis eSNP
+           - TWAS Z score
+           - TWAS P value
+
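The step from an HTML table to named columns can be sketched with the standard library alone. This is only an illustration of the general idea, not how the command is implemented: the ``TableExtractor`` class is a hypothetical helper, and the inline HTML stands in for a fetched page, reusing two values from the output above.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> in the document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = """
<table>
  <tr><th>Gene</th><th>TWAS Z score</th></tr>
  <tr><td>SMIM19</td><td>3.809</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)

# Treat the first row as the header; each header cell becomes a slot name
header, *body = parser.rows
columns = {name: [row[i] for row in body] for i, name in enumerate(header)}
print(columns)  # {'Gene': ['SMIM19'], 'TWAS Z score': ['3.809']}
```

Each key of ``columns`` corresponds to a slot in the generated schema, and the collected values are what example values and ranges would be derived from.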
+ Generalizing from JSON
+ ----------------------
+
+
+
+ Packages
+ --------
+
.. currentmodule:: schema_automator.generalizers

.. autoclass:: CsvDataGeneralizer