forked from awslabs/open-data-registry
commoncrawl.yaml
Name: Common Crawl
Description: A corpus of web crawl data composed of over 25 billion web pages.
Documentation: http://commoncrawl.org/the-data/get-started/
Contact: http://commoncrawl.org/connect/contact-us/
ManagedBy: "[Common Crawl](http://commoncrawl.org/)"
UpdateFrequency: Monthly
Tags:
  - aws-pds
  - encyclopedic
  - machine learning
  - natural language processing
  - internet
License: This data is available for anyone to use under the [Common Crawl Terms of Use](http://commoncrawl.org/terms-of-use/)
Resources:
  - Description: Crawl data (WARC and ARC format)
    ARN: arn:aws:s3:::commoncrawl
    Region: us-east-1
    Type: S3 Bucket
DataAtWork:
  Tutorials:
    - Title: Large-scale graph mining with Spark
      URL: https://towardsdatascience.com/large-scale-graph-mining-with-spark-750995050656
      AuthorName: Win Suen
      AuthorURL: https://github.com/wsuen/pygotham2018_graphmining
    - Title: Learning word vectors for 157 languages
      URL: https://arxiv.org/abs/1802.06893
      AuthorName: Facebook AI Research
      AuthorURL: https://fasttext.cc/docs/en/crawl-vectors.html
  Tools & Applications:
    - Title: Dresden Web Table Corpus (DWTC)
      URL: https://wwwdb.inf.tu-dresden.de/research-projects/dresden-web-table-corpus/
      AuthorName: Database Systems Group Dresden
      AuthorURL: https://wwwdb.inf.tu-dresden.de/
    - Title: Index to WARC Files and URLs in Columnar Format
      URL: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
      AuthorName: Sebastian Nagel
  Publications:
    - Title: Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl
      URL: https://arxiv.org/pdf/1710.01779.pdf
      AuthorName: Alexander Panchenko, et al.
    - Title: Using open data to predict market movements
      URL: https://education.emc.com/content/dam/dell-emc/documents/en-us/2017KS_Ravinder-Using_Open_Data_to_Predict_Market_Movements.pdf
      AuthorName: DELL EMC
    - Title: N-gram counts and language models from the Common Crawl
      URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1097_Paper.pdf
      AuthorName: Christian Buck, Kenneth Heafield, Bas van Ooyen
      AuthorURL: http://statmt.org/ngrams/
    - Title: Large-scale analysis of style injection by relative path overwrite
      URL: https://doi.org/10.1145/3178876.3186090
      AuthorName: Sajjad Arshad, et al.
    - Title: Web Data Commons - RDFa, microdata, and microformat data sets
      URL: http://webdatacommons.org/structureddata/
      AuthorName: Christian Bizer, Robert Meusel, Anna Primpeli
    - Title: "C4Corpus: Multilingual Web-Size Corpus with Free License"
      URL: http://www.lrec-conf.org/proceedings/lrec2016/pdf/388_Paper.pdf
      AuthorName: Ivan Habernal, Omnia Zayed, Iryna Gurevych
      AuthorURL: https://dkpro.github.io/dkpro-c4corpus/
    - Title: Of using Common Crawl to play Family Feud
      URL: https://fulmicoton.com/posts/commoncrawl/
      AuthorName: Paul Masurel