---
title: "Advanced Scraping"
author: Andy Halterman
date: "11 November 2019"
output:
  beamer_presentation:
    theme: metropolis
---
```{r echo=FALSE}
library(knitr)
```
## Law/Ethics
Scraping public websites, even against the terms of service, is probably legal. See hiQ Labs, Inc. v. LinkedIn Corp (2019). https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data
But circumventing technical restrictions (passwords, captchas) is probably illegal.
In any case, be nice!
- add a delay between pages (see the sketch below)
- only use distributed scraping against large, well-resourced sites
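A minimal sketch of the delay idea, assuming `urls` is a placeholder list of pages you already have; the 1-3 second pause is an arbitrary choice:

```
import random
import time

import requests

html_pages = []
for url in urls:                        # placeholder: list of pages to fetch
    html_pages.append(requests.get(url).text)
    time.sleep(random.uniform(1, 3))    # wait 1-3 seconds between requests
```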
## Random headers to avoid blocking
\tiny
```
import requests
from random import choice

# Pool of realistic desktop user-agent strings to rotate through
desktop_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36']

def random_headers():
    # Pick a random user agent for each request so traffic looks less uniform
    return {'User-Agent': choice(desktop_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

page = requests.get(url, headers=random_headers())
```
## Download YouTube videos
Downloading YouTube videos can be important for research reproducibility (see Rich's book).
```
https://github.com/nficano/pytube
```
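A minimal sketch with pytube, assuming it is installed and using a placeholder URL and output directory:

```
from pytube import YouTube            # pip install pytube

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"   # placeholder video

yt = YouTube(url)
# Grab the first progressive stream (audio + video in one file) and save it
yt.streams.filter(progressive=True).first().download(output_path="videos/")
```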
## Download YouTube metadata
We probably care just as much about the video metadata.
```
https://github.com/elizariley/youtube_extremism
```
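One option (not necessarily what the repo above does) is the official YouTube Data API v3, which returns title, description, view counts, etc. as JSON; the video ID and API key below are placeholders:

```
import requests

video_id = "dQw4w9WgXcQ"     # placeholder video ID
api_key = "YOUR_API_KEY"     # free key from the Google Cloud console

resp = requests.get(
    "https://www.googleapis.com/youtube/v3/videos",
    params={"part": "snippet,statistics", "id": video_id, "key": api_key})
meta = resp.json()["items"][0]
print(meta["snippet"]["title"], meta["statistics"]["viewCount"])
```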
## Downloading a whole bunch of PDFs
\small
```
import requests

def download_cable(i, year, collection, outdir):
    # e.g. collection = "2694", year = "1978" for a State Dept. cable collection
    docid = str(i)
    docname = outdir + year + "_" + docid + ".pdf"
    # The National Archives AAD endpoint takes a record ID and a collection ID
    url = "http://aad.archives.gov/aad/createpdf?rid={0}&dt={1}&dl=2"\
          .format(docid, collection)
    r = requests.get(url)
    # Only write a file if the request succeeded and returned content
    if r.ok and r.content:
        with open(docname, 'wb') as f:
            f.write(r.content)
```
## Reverse-engineering API calls
Sometimes you can directly access the raw data that goes into a page's visualization using the "Javascript Console"/network tab.
```
https://trac.syr.edu/phptools/immigration/arrest/
```
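Once you spot the request a page makes in the network tab, you can usually replay it directly with `requests`. A sketch with a hypothetical endpoint; the real URL and parameters are whatever the page actually requests:

```
import requests

# Hypothetical endpoint copied out of the browser's network tab --
# substitute whatever request the page actually makes for its tables/charts
endpoint = "https://example.com/phptools/data.json"

resp = requests.get(endpoint, params={"year": "2019"})
data = resp.json()    # the same raw records the page's visualization is built from
```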
## OCR and Tesseract
If you scrape image-only PDFs and want their text, you'll need optical character recognition (OCR) to convert the images to text.
```
https://github.com/tesseract-ocr/tesseract
```
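A minimal sketch using the `pdf2image` and `pytesseract` wrappers; the filename is a placeholder, and both Poppler and the Tesseract binary must be installed separately:

```
from pdf2image import convert_from_path   # pip install pdf2image (needs Poppler)
import pytesseract                         # pip install pytesseract (needs Tesseract)

# Render each page of an image-only PDF and run Tesseract OCR over it
pages = convert_from_path("scanned_cable.pdf")
text = "\n".join(pytesseract.image_to_string(p) for p in pages)
```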
## Big scrape
- Q: What if you're scraping 10 million stories and you don't want to start over if something breaks?
- A: Use queues, databases, and multiple workers (see the sketch below)
\pause
```
https://github.com/ahalterman/big_scrape
```
(email me for access)
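A minimal illustration of the idea (not the code in the repo above): several workers pull URLs from a shared queue and checkpoint results to SQLite, so a crash only costs the page in flight. `urls_to_scrape` is a placeholder for your own URL list.

```
import queue
import sqlite3
import threading
import time

import requests

# Checkpoint finished pages to SQLite so a crash doesn't lose completed work
db = sqlite3.connect("scrape.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
db_lock = threading.Lock()

tasks = queue.Queue()
for u in urls_to_scrape:            # placeholder: your list of URLs
    tasks.put(u)

def worker():
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return
        try:
            html = requests.get(url, timeout=10).text
            with db_lock:
                db.execute("INSERT OR IGNORE INTO pages VALUES (?, ?)", (url, html))
                db.commit()
        except requests.RequestException:
            pass                    # log and move on rather than killing the worker
        time.sleep(1)               # be nice
        tasks.task_done()

# Several workers pull from the same queue in parallel
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```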
## Newspaper3k
Scraping a bunch of articles from a bunch of sites? This library automatically finds titles, authors, text, etc.
```
https://newspaper.readthedocs.io/en/latest/
```
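A minimal sketch with a placeholder article URL:

```
from newspaper import Article      # pip install newspaper3k

article = Article("https://example.com/some-news-story")   # placeholder URL
article.download()
article.parse()
print(article.title, article.authors, article.publish_date)
text = article.text
```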
## Rendering Javascript
Some page elements are dynamically generated and require Javascript to render.
To scrape, we can use a combination of `PhantomJS` (a windowless browser) and `selenium` (a tool for automating browser actions).
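A minimal sketch with `selenium`; the URL is a placeholder, and headless Chrome or Firefox can be swapped in for PhantomJS the same way:

```
from selenium import webdriver

# PhantomJS must be installed separately and on your PATH;
# webdriver.Chrome() / webdriver.Firefox() work the same way
driver = webdriver.PhantomJS()
driver.get("https://example.com/dynamic-page")   # placeholder URL
html = driver.page_source    # the HTML *after* the Javascript has run
driver.quit()
# html can now be parsed with BeautifulSoup (or rvest) as usual
```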