
Commit a144002

Revise Catalog Quickstart (#3325)
* revise catalog quickstart
* head size
* revise url links
* remove DEPLOYMENT.md from catalog
* move retired to catalog/reference
* add file structure
1 parent 57bf632 commit a144002

5 files changed: +74 -91 lines

catalog/DEPLOYMENT.md renamed to documentation/catalog/guides/deployment.md

Lines changed: 9 additions & 6 deletions
```diff
@@ -23,9 +23,10 @@ occur automatically and will need to be manually initiated.
 
 Currently the webserver, scheduler, and worker(s) are all run within a single
 docker container on the EC2 instance as defined by
-[the Airflow `Dockerfile` and related files](./docker/airflow). The
-[`docker-compose.yml`](docker-compose.yml) is used to spin up Airflow in
-production.
+[the Airflow `Dockerfile`](https://github.com/WordPress/openverse/blob/main/catalog/Dockerfile).
+The
+[`docker-compose.yml`](https://github.com/WordPress/openverse/blob/main/docker-compose.yml)
+is used to spin up Airflow in production.
 
 **Note**: Service deployments are only necessary in the following conditions:
 
@@ -61,8 +62,10 @@ This means that we can update the python code in-place and the next DAG run or
 task in a currently running DAG will use the updated code. In these cases, a new
 EC2 instance _does not_ need to be deployed.
 
-The [`dag-sync.sh`](../dag-sync.sh) script is used in production to regularly
-update the repository (and thus the DAG files) on the running EC2 instance.
+The
+[`dag-sync.sh`](https://github.com/WordPress/openverse/blob/main/dag-sync.sh)
+script is used in production to regularly update the repository (and thus the
+DAG files) on the running EC2 instance.
 
 ### Deployment workflow
 
@@ -89,4 +92,4 @@ out to the maintainers if you're interested).
 
 Any migrations to the Catalog database must either be performed by hand or as
 part of a DAG's normal operation (see:
-[iNaturalist](dags/providers/provider_api_scripts/inaturalist.py)).
+[iNaturalist](https://github.com/WordPress/openverse/blob/main/catalog/dags/providers/provider_api_scripts/inaturalist.py)).
```
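Every link change in this file follows the same pattern: a repository-relative path is replaced with an absolute `blob/main` URL, so the links still resolve after the document moves out of `catalog/`. A minimal sketch of that mapping (the helper name and function are illustrative, not part of the commit):

```python
def to_blob_url(repo_path, repo="WordPress/openverse", ref="main"):
    """Map a repository-relative file path to an absolute GitHub blob URL."""
    return f"https://github.com/{repo}/blob/{ref}/{repo_path.lstrip('/')}"

# The dag-sync.sh link rewritten above resolves to:
print(to_blob_url("dag-sync.sh"))
```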

documentation/catalog/guides/index.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -6,4 +6,5 @@
 quickstart
 deploy
 adding_a_new_provider
+deployment
 ```
````

documentation/catalog/guides/quickstart.md

Lines changed: 24 additions & 85 deletions
````diff
@@ -21,45 +21,7 @@ project, see [DAGs.md](../reference/DAGs.md).
 See each provider API script's notes in their respective [handbook][ov-handbook]
 entry.
 
-[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
-
-## Web Crawl Data (retired)
-
-The Common Crawl Foundation provides an open repository of petabyte-scale web
-crawl data. A new dataset is published at the end of each month comprising over
-200 TiB of uncompressed data.
-
-The data is available in three file formats:
-
-- WARC (Web ARChive): the entire raw data, including HTTP response metadata,
-  WARC metadata, etc.
-- WET: extracted plaintext from each webpage.
-- WAT: extracted html metadata, e.g. HTTP headers and hyperlinks, etc.
-
-For more information about these formats, please see the [Common Crawl
-documentation][ccrawl_doc].
-
-Openverse Catalog used AWS Data Pipeline service to automatically create an
-Amazon EMR cluster of 100 c4.8xlarge instances that parsed the WAT archives to
-identify all domains that link to creativecommons.org. Due to the volume of
-data, Apache Spark was also used to streamline the processing. The output of
-this methodology was a series of parquet files that contain:
-
-- the domains and its respective content path and query string (i.e. the exact
-  webpage that links to creativecommons.org)
-- the CC referenced hyperlink (which may indicate a license),
-- HTML meta data in JSON format which indicates the number of images on each
-  webpage and other domains that they reference,
-- the location of the webpage in the WARC file so that the page contents can be
-  found.
-
-The steps above were performed in [`ExtractCCLinks.py`][ex_cc_links].
-
-This method was retired in 2021.
-
-[ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
-[ex_cc_links]:
-  https://github.com/WordPress/openverse/blob/c20262cad8944d324b49176678b16b230bc57e2e/archive/ExtractCCLinks.py
+[ov-handbook]: https://make.wordpress.org/openverse/handbook/
 
 ## Development setup for Airflow and API puller scripts
 
@@ -70,7 +32,7 @@ different environment than the PySpark portion of the project, and so have their
 own dependency requirements.
 
 For instructions geared specifically towards production deployments, see
-[DEPLOYMENT.md](https://github.com/WordPress/openverse/blob/main/catalog/DEPLOYMENT.md)
+[DEPLOYMENT.md](https://github.com/WordPress/openverse/blob/main/documentation/catalog/guides/deployment.md)
 
 [api_scripts]:
   https://github.com/WordPress/openverse/blob/main/catalog/dags/providers/provider_api_scripts
@@ -90,7 +52,7 @@ To set up the local python environment along with the pre-commit hook, run:
 ```shell
 python3 -m venv venv
 source venv/bin/activate
-just install
+just catalog/install
 ```
 
 The containers will be built when starting the stack up for the first time. If
@@ -105,7 +67,7 @@ just build
 To set up environment variables run:
 
 ```shell
-just dotenv
+just env
 ```
 
 This will generate a `.env` file which is used by the containers.
@@ -128,7 +90,7 @@ There is a [`docker-compose.yml`][dockercompose] provided in the
 [`catalog`][cc_airflow] directory, so from that directory, run
 
 ```shell
-just up
+just catalog/up
 ```
 
 This results, among other things, in the following running containers:
@@ -160,10 +122,10 @@ The various services can be accessed using these links:
 At this stage, you can run the tests via:
 
 ```shell
-just test
+just catalog/test
 
 # Alternatively, run all tests including longer-running ones
-just test --extended
+just catalog/test --extended
 ```
 
 Edits to the source files or tests can be made on your local machine, then tests
@@ -172,7 +134,7 @@ can be run in the container via the above command to see the effects.
 If you'd like, it's possible to login to the webserver container via:
 
 ```shell
-just shell
+just catalog/shell
 ```
 
 If you just need to run an airflow command, you can use the `airflow` recipe.
@@ -192,7 +154,7 @@ To begin an interactive [`pgcli` shell](https://www.pgcli.com/) on the database
 container, run:
 
 ```shell
-just db-shell
+just catalog/pgcli
 ```
 
 If you'd like to bring down the containers, run
@@ -230,37 +192,28 @@ just recreate
 ## Directory Structure
 
 ```text
-openverse-catalog
-├── .github/                                   # Templates for GitHub
-├── archive/                                   # Files related to the previous CommonCrawl parsing implementation
-├── docker/                                    # Dockerfiles and supporting files
-│   └── upstream_db/                           # - Docker image for development Postgres database
-├── catalog/                                   # Primary code directory
-│   ├── dags/                                  # DAGs & DAG support code
-│   │   ├── common/                            # - Shared modules used across DAGs
-│   │   ├── data_refresh/                      # - DAGs & code related to the data refresh process
-│   │   ├── database/                          # - DAGs related to database actions (matview refresh, cleaning, etc.)
-│   │   ├── maintenance/                       # - DAGs related to airflow/infrastructure maintenance
-│   │   ├── oauth2/                            # - DAGs & code for Oauth2 key management
-│   │   ├── providers/                         # - DAGs & code for provider ingestion
-│   │   │   ├── provider_api_scripts/          # - API access code specific to providers
-│   │   │   ├── provider_csv_load_scripts/     # - Schema initialization SQL definitions for SQL-based providers
-│   │   │   └── *.py                           # - DAG definition files for providers
-│   │   └── retired/                           # - DAGs & code that is no longer needed but might be a useful guide for the future
-│   └── templates/                             # Templates for generating new provider code
-└── *                                          # Documentation, configuration files, and project requirements
+
+catalog/                                       # Primary code directory
+├── dags/                                      # DAGs & DAG support code
+│   ├── common/                                # - Shared modules used across DAGs
+│   ├── data_refresh/                          # - DAGs & code related to the data refresh process
+│   ├── database/                              # - DAGs related to database actions (matview refresh, cleaning, etc.)
+│   ├── maintenance/                           # - DAGs related to airflow/infrastructure maintenance
+│   ├── oauth2/                                # - DAGs & code for Oauth2 key management
+│   ├── providers/                             # - DAGs & code for provider ingestion
+│   │   ├── provider_api_scripts/              # - API access code specific to providers
+│   │   ├── provider_csv_load_scripts/         # - Schema initialization SQL definitions for SQL-based providers
+│   │   │   └── *.py                           # - DAG definition files for providers
+│   │   └── retired/                           # - DAGs & code that is no longer needed but might be a useful guide for the future
+│   ├── templates/                             # Templates for generating new provider code
+└── *                                          # Documentation, configuration files, and project requirements
 ```
 
 ## Publishing
 
 The docker image for the catalog (Airflow) is published to
 ghcr.io/WordPress/openverse-catalog.
 
-## Contributing
-
-Pull requests are welcome! Feel free to [join us on Slack][wp_slack] and discuss
-the project with the engineers and community members on #openverse.
-
 ## Additional Resources
 
 - 2022-01-12: **[cc-archive/cccatalog](https://github.com/cc-archive/cccatalog):
@@ -277,17 +230,3 @@ For additional context see:
   [Welcome to Openverse – Openverse — WordPress.org](https://make.wordpress.org/openverse/2021/05/11/hello-world/)
 - 2021-12-13:
   [Dear Users of CC Search, Welcome to Openverse - Creative Commons](https://creativecommons.org/2021/12/13/dear-users-of-cc-search-welcome-to-openverse/)
-
-## Acknowledgments
-
-Openverse, previously known as CC Search, was conceived and built at
-[Creative Commons](https://creativecommons.org). We thank them for their
-commitment to open source and openly licensed content, with particular thanks to
-previous team members @ryanmerkley, @janetpkr, @lizadaly, @sebworks, @pa-w,
-@kgodey, @annatuma, @mathemancer, @aldenstpage, @brenoferreira, and @sclachar,
-along with their
-[community of volunteers](https://opensource.creativecommons.org/community/community-team/).
-
-[wp_slack]: https://make.wordpress.org/chat/
-[cc]: https://creativecommons.org
-[cc_community]: https://opensource.creativecommons.org/community/community-team/
````
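Among the quickstart changes above, `just dotenv` becomes `just env`; the recipe still generates a `.env` file consumed by the containers. As a rough illustration of the file format involved (this parser and the sample keys are hypothetical; the real stack relies on Docker Compose's built-in `.env` handling):

```python
def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

# Hypothetical example values, not the contents of the actual generated file.
sample = """
# Local Airflow settings
AIRFLOW_PORT=9090
POSTGRES_PASSWORD="deploy"
"""
print(parse_dotenv(sample))
```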

documentation/catalog/reference/index.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -4,4 +4,5 @@
 :titlesonly:
 
 DAGs
+retired
 ```
````
documentation/catalog/reference/retired.md (new file)

Lines changed: 39 additions & 0 deletions
```diff
@@ -0,0 +1,39 @@
+# Retired
+
+## Web Crawl Data (retired)
+
+The Common Crawl Foundation provides an open repository of petabyte-scale web
+crawl data. A new dataset is published at the end of each month comprising over
+200 TiB of uncompressed data.
+
+The data is available in three file formats:
+
+- WARC (Web ARChive): the entire raw data, including HTTP response metadata,
+  WARC metadata, etc.
+- WET: extracted plaintext from each webpage.
+- WAT: extracted html metadata, e.g. HTTP headers and hyperlinks, etc.
+
+For more information about these formats, please see the [Common Crawl
+documentation][ccrawl_doc].
+
+Openverse Catalog used AWS Data Pipeline service to automatically create an
+Amazon EMR cluster of 100 c4.8xlarge instances that parsed the WAT archives to
+identify all domains that link to creativecommons.org. Due to the volume of
+data, Apache Spark was also used to streamline the processing. The output of
+this methodology was a series of parquet files that contain:
+
+- the domains and its respective content path and query string (i.e. the exact
+  webpage that links to creativecommons.org)
+- the CC referenced hyperlink (which may indicate a license),
+- HTML meta data in JSON format which indicates the number of images on each
+  webpage and other domains that they reference,
+- the location of the webpage in the WARC file so that the page contents can be
+  found.
+
+The steps above were performed in [`ExtractCCLinks.py`][ex_cc_links].
+
+This method was retired in 2021.
+
+[ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
+[ex_cc_links]:
+  https://github.com/WordPress/openverse/blob/c20262cad8944d324b49176678b16b230bc57e2e/archive/ExtractCCLinks.py
```
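The pipeline described in the relocated file can be approximated in miniature: for each crawled page, keep only the hyperlinks that point at creativecommons.org, recording the source page's domain, path, and query string. This is an illustrative reconstruction, not the actual `ExtractCCLinks.py` code, and the record layout is a simplified stand-in for the real WAT JSON metadata:

```python
from urllib.parse import urlparse

def cc_links(record):
    """From a simplified WAT-style record, keep only the links that point at
    creativecommons.org, tagged with the source page's domain, path, and query."""
    source = urlparse(record["url"])
    hits = []
    for href in record.get("links", []):
        if urlparse(href).netloc.endswith("creativecommons.org"):
            hits.append({
                "domain": source.netloc,
                "path": source.path,
                "query": source.query,
                "cc_link": href,  # may indicate a license
            })
    return hits

record = {
    "url": "https://example.com/photos?page=2",
    "links": [
        "https://creativecommons.org/licenses/by/4.0/",
        "https://example.com/about",
    ],
}
print(cc_links(record))
```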
