---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# scrapex
Web scraping and API tutorials tend to become obsolete as soon as the target website changes its structure or goes down indefinitely. `scrapex` saves specific web paths of certain websites locally so that scraping examples remain reproducible in the long term. In addition, `scrapex` contains local copies of custom APIs, built with the `plumber` package, for educational purposes. To keep the package as lightweight as possible, some images/figures/maps have been excluded from the local copies of websites. Aside from that, most examples should preserve the rough skeleton of the original website.
## Installation
`scrapex` will probably never make it to CRAN since it is expected to keep growing in size as new examples come along, so you need to install the development version from [GitHub](https://github.com/) with:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("cimentadaj/scrapex")
```
## Website Example
`scrapex` contains websites that have been saved locally for users to use as examples. Each example has a single function which returns the local links to the HTML pages. For example, for the Spanish schools example (see `?spanish_schools_ex` for its documentation), we can obtain the links with `spanish_schools_ex()`:
```{r example}
library(scrapex)
library(xml2)
school_websites <- spanish_schools_ex()
school_websites
```
To browse the actual website behind one of these links, use `prep_browser` together with `browseURL`:
```{r, eval = FALSE}
browseURL(prep_browser(school_websites)[1])
```
To actually scrape information from these websites, read them as usual with `read_html`.
```{r}
# You can read them with `read_html`
read_html(school_websites[1])
```
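From here on, the usual `xml2` workflow applies. As a minimal sketch (the `//title` XPath is only a generic illustration; the real school pages have their own structure that you would inspect first):
```{r, eval = FALSE}
school <- read_html(school_websites[1])

# Extract the page title as a quick sanity check; swap "//title" for
# the XPath of the node you actually want to scrape.
xml_text(xml_find_all(school, "//title"))
```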
## API example
Every API in `scrapex` is exposed through an exported function that launches it locally. These functions start with the `api_` prefix. For example, the local copy of the API for Barcelona's public bicycle system can be launched like this:
```{r}
library(scrapex)
library(httr2)
bicing <- api_bicing()
```
You'll see that launching the API prints a link to its documentation as well as its main URL. Users can then make requests just as if this were a real API. For example:
```{r}
rt_bicing <- paste0(bicing$api_web, "/api/v1/real_time_bicycles")
resp_output <-
rt_bicing %>%
request() %>%
req_perform() %>%
resp_body_json() %>%
lapply(data.frame) %>%
do.call(rbind, .)
resp_output
```
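The API runs as a background process for the duration of your R session. If the object returned by `api_bicing()` exposes the underlying `callr` process (the field name `process` below is an assumption; inspect `names(bicing)` first), you could shut it down manually once you're done:
```{r, eval = FALSE}
# Hypothetical: stop the background plumber process when you're done.
# The `process` field name is an assumption; check the returned object.
bicing$process$kill()
```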
## Collaboration
This package is a work in progress and welcomes more examples from different backgrounds. If you want to contribute to the package, you can send a pull request following these guidelines:
* Provide one function whose name clearly states the substantive theme of your example (for example, `spanish_schools_ex` for web scraping or `api_*` for an API). For web scraping, this function should return the full local path of the HTML file for each single page of your example (**not the full local path to the complete website**); for an API, it should return a list with the `callr` process and the API website (see the sketches after this list).
* For web scraping, all of your local example files should be in a folder located in `inst/extdata` with the same name as the main function. Within this folder, there should be only one `.R` file, showing how the download was done. We aim to keep the package as lightweight as possible, so it is important to delete large files and complicated (long) paths. Keep in mind two things: nested directories with very long names (standard when saving a website) and complicated file names (containing dots, question marks, equal signs, etc.) will throw errors when running `devtools::check()`. For `spanish_schools_ex`, I downloaded each path of the website, identified the relevant HTML file, renamed it to something shorter and valid, moved only that file to the `spanish_schools_ex` folder and deleted the website's complete directory structure. The result is a single HTML file for each path of the website.
* For APIs, internal functions should go within one single `api_*.R` file. The API itself should be located in `/inst/plumber/` (a minimal `plumber` file is sketched after this list).
For web scraping, I encourage anyone looking to open a PR to look at the one for `spanish_schools_ex`, which is called `download_school.R` (note that this is tricky because downloading a single path of a website often results in downloading the **complete** website; look at the example in `download_school.R` to download single paths only). For the API example, I encourage users to look at the file `R/api_coveragedb.R` and `inst/plumber/api_coveragedb/`.
* Each main function should have an accompanying test file with **at least** the same tests as those for `spanish_schools_ex()`; similarly for APIs, with the tests for `api_coveragedb()`.
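To make these guidelines concrete, here is a minimal, hypothetical sketch of what a new web scraping example function could look like (names like `my_example_ex` and the `inst/extdata/my_example_ex` folder are made up; mirror the structure of `spanish_schools_ex` for the real thing):
```{r, eval = FALSE}
# R/my_example_ex.R -- hypothetical example function
my_example_ex <- function() {
  # Folder in inst/extdata/ with the same name as the function,
  # containing one HTML file per saved page plus the download .R script.
  example_dir <- system.file("extdata", "my_example_ex", package = "scrapex")
  list.files(example_dir, pattern = "\\.html$", full.names = TRUE)
}
```
Similarly, a hedged sketch of the API side: the file and endpoint names below are invented, but a `plumber` file stored in `inst/plumber/` looks roughly like this, with the exported `api_*()` function launching it in a background `callr` process and returning that process together with its URL.
```{r, eval = FALSE}
# inst/plumber/api_my_example/plumber.R -- hypothetical plumber file

#* Return some example data
#* @get /api/v1/my_endpoint
function() {
  data.frame(id = 1:2, value = c("a", "b"))
}
```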
If you feel this is not the optimal storage strategy, all suggestions are very welcome! Open an issue and we'll discuss it.