@@ -21,45 +21,7 @@ project, see [DAGs.md](../reference/DAGs.md).
 See each provider API script's notes in their respective [handbook][ov-handbook]
 entry.

-[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
-
-## Web Crawl Data (retired)
-
-The Common Crawl Foundation provides an open repository of petabyte-scale web
-crawl data. A new dataset is published at the end of each month comprising over
-200 TiB of uncompressed data.
-
-The data is available in three file formats:
-
-- WARC (Web ARChive): the entire raw data, including HTTP response metadata,
-  WARC metadata, etc.
-- WET: extracted plaintext from each webpage.
-- WAT: extracted html metadata, e.g. HTTP headers and hyperlinks, etc.
-
-For more information about these formats, please see the [Common Crawl
-documentation][ccrawl_doc].
-
-Openverse Catalog used AWS Data Pipeline service to automatically create an
-Amazon EMR cluster of 100 c4.8xlarge instances that parsed the WAT archives to
-identify all domains that link to creativecommons.org. Due to the volume of
-data, Apache Spark was also used to streamline the processing. The output of
-this methodology was a series of parquet files that contain:
-
-- the domains and its respective content path and query string (i.e. the exact
-  webpage that links to creativecommons.org)
-- the CC referenced hyperlink (which may indicate a license),
-- HTML meta data in JSON format which indicates the number of images on each
-  webpage and other domains that they reference,
-- the location of the webpage in the WARC file so that the page contents can be
-  found.
-
-The steps above were performed in [`ExtractCCLinks.py`][ex_cc_links].
-
-This method was retired in 2021.
-
-[ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
-[ex_cc_links]:
-  https://github.com/WordPress/openverse/blob/c20262cad8944d324b49176678b16b230bc57e2e/archive/ExtractCCLinks.py
+[ov-handbook]: https://make.wordpress.org/openverse/handbook/

 ## Development setup for Airflow and API puller scripts

@@ -70,7 +32,7 @@ different environment than the PySpark portion of the project, and so have their
 own dependency requirements.

 For instructions geared specifically towards production deployments, see
-[DEPLOYMENT.md](https://github.com/WordPress/openverse/blob/main/catalog/DEPLOYMENT.md)
+[DEPLOYMENT.md](https://github.com/WordPress/openverse/blob/main/documentation/catalog/guides/deployment.md)

 [api_scripts]:
   https://github.com/WordPress/openverse/blob/main/catalog/dags/providers/provider_api_scripts
@@ -90,7 +52,7 @@ To set up the local python environment along with the pre-commit hook, run:
 ```shell
 python3 -m venv venv
 source venv/bin/activate
-just install
+just catalog/install
 ```

 The containers will be built when starting the stack up for the first time. If
@@ -105,7 +67,7 @@ just build
 To set up environment variables run:

 ```shell
-just dotenv
+just env
 ```

 This will generate a `.env` file which is used by the containers.
@@ -128,7 +90,7 @@ There is a [`docker-compose.yml`][dockercompose] provided in the
 [`catalog`][cc_airflow] directory, so from that directory, run

 ```shell
-just up
+just catalog/up
 ```

 This results, among other things, in the following running containers:
@@ -160,10 +122,10 @@ The various services can be accessed using these links:
 At this stage, you can run the tests via:

 ```shell
-just test
+just catalog/test

 # Alternatively, run all tests including longer-running ones
-just test --extended
+just catalog/test --extended
 ```

 Edits to the source files or tests can be made on your local machine, then tests
@@ -172,7 +134,7 @@ can be run in the container via the above command to see the effects.
 If you'd like, it's possible to login to the webserver container via:

 ```shell
-just shell
+just catalog/shell
 ```

 If you just need to run an airflow command, you can use the `airflow` recipe.
@@ -192,7 +154,7 @@ To begin an interactive [`pgcli` shell](https://www.pgcli.com/) on the database
 container, run:

 ```shell
-just db-shell
+just catalog/pgcli
 ```

 If you'd like to bring down the containers, run
@@ -230,37 +192,28 @@ just recreate
 ## Directory Structure

 ```text
-openverse-catalog
-├── .github/ # Templates for GitHub
-├── archive/ # Files related to the previous CommonCrawl parsing implementation
-├── docker/ # Dockerfiles and supporting files
-│ └── upstream_db/ # - Docker image for development Postgres database
-├── catalog/ # Primary code directory
-│ ├── dags/ # DAGs & DAG support code
-│ │ ├── common/ # - Shared modules used across DAGs
-│ │ ├── data_refresh/ # - DAGs & code related to the data refresh process
-│ │ ├── database/ # - DAGs related to database actions (matview refresh, cleaning, etc.)
-│ │ ├── maintenance/ # - DAGs related to airflow/infrastructure maintenance
-│ │ ├── oauth2/ # - DAGs & code for Oauth2 key management
-│ │ ├── providers/ # - DAGs & code for provider ingestion
-│ │ │ ├── provider_api_scripts/ # - API access code specific to providers
-│ │ │ ├── provider_csv_load_scripts/ # - Schema initialization SQL definitions for SQL-based providers
-│ │ │ └── *.py # - DAG definition files for providers
-│ │ └── retired/ # - DAGs & code that is no longer needed but might be a useful guide for the future
-│ └── templates/ # Templates for generating new provider code
-└── * # Documentation, configuration files, and project requirements
+
+catalog/ # Primary code directory
+├── dags/ # DAGs & DAG support code
+│ ├── common/ # - Shared modules used across DAGs
+│ ├── data_refresh/ # - DAGs & code related to the data refresh process
+│ ├── database/ # - DAGs related to database actions (matview refresh, cleaning, etc.)
+│ ├── maintenance/ # - DAGs related to airflow/infrastructure maintenance
+│ ├── oauth2/ # - DAGs & code for Oauth2 key management
+│ ├── providers/ # - DAGs & code for provider ingestion
+│ │ ├── provider_api_scripts/ # - API access code specific to providers
+│ │ ├── provider_csv_load_scripts/ # - Schema initialization SQL definitions for SQL-based providers
+│ │ │ └── *.py # - DAG definition files for providers
+│ │ └── retired/ # - DAGs & code that is no longer needed but might be a useful guide for the future
+│ ├── templates/ # Templates for generating new provider code
+└── * # Documentation, configuration files, and project requirements
 ```

 ## Publishing

 The docker image for the catalog (Airflow) is published to
 ghcr.io/WordPress/openverse-catalog.

-## Contributing
-
-Pull requests are welcome! Feel free to [join us on Slack][wp_slack] and discuss
-the project with the engineers and community members on #openverse.
-
 ## Additional Resources

 - 2022-01-12: **[cc-archive/cccatalog](https://github.com/cc-archive/cccatalog):
@@ -277,17 +230,3 @@ For additional context see:
   [Welcome to Openverse – Openverse — WordPress.org](https://make.wordpress.org/openverse/2021/05/11/hello-world/)
 - 2021-12-13:
   [Dear Users of CC Search, Welcome to Openverse - Creative Commons](https://creativecommons.org/2021/12/13/dear-users-of-cc-search-welcome-to-openverse/)
-
-## Acknowledgments
-
-Openverse, previously known as CC Search, was conceived and built at
-[Creative Commons](https://creativecommons.org). We thank them for their
-commitment to open source and openly licensed content, with particular thanks to
-previous team members @ryanmerkley, @janetpkr, @lizadaly, @sebworks, @pa-w,
-@kgodey, @annatuma, @mathemancer, @aldenstpage, @brenoferreira, and @sclachar,
-along with their
-[community of volunteers](https://opensource.creativecommons.org/community/community-team/).
-
-[wp_slack]: https://make.wordpress.org/chat/
-[cc]: https://creativecommons.org
-[cc_community]: https://opensource.creativecommons.org/community/community-team/