forked from ChrisMuir/geolocChina
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
201 lines (148 loc) · 7.98 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
---
output:
github_document: default
html_document: default
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
geolocChina
========
[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
R package that takes Chinese location/address stings as input and returns geolocation data for each input string. This package does not rely on an external API service to perform geocoding (for info on how it works, see the `How It Works` section of the `README` below).
The functions are designed to work with business name strings and address strings. For each input string, the following values are returned: `Province`, `City`, `County`, `Provincial geocode`, `City geocode`, and `County geocode`. Values are returned as a tidy data frame.
The package is using geo substrings and associated geocodes from these sources:
* [Administrative-divisions-of-China](https://github.com/modood/Administrative-divisions-of-China) on GitHub, put together by user [modood](https://github.com/modood).
* [this](http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/) site from stats.gov.cn.
There are other R packages that offer similar functionality, namely [baidumap](https://github.com/badbye/baidumap) and [poster](https://github.com/Ironholds/poster). For more info on geolocating Chinese strings, see [this](https://pdfs.semanticscholar.org/ca9d/2d09d0a2420a7ce398e14ed43f8cd7464705.pdf) 2016 paper on the subject.
Please [report](https://github.com/ChrisMuir/geolocChina/issues) issues, comments, or feature requests.
Installation
------------
Install from GitHub
```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("ChrisMuir/geolocChina")
```
Example Usage
-------------
```{r}
library(geolocChina)
library(knitr)
terms <- c(
"大连市甘井子区南关岭街道姚工街101号",
"浙江省杭州市余杭区径山镇小古城村",
"大连御洋食品有限公司",
"徐州雅莲连锁超市有限公司",
"四川省南充市阆中市公园路63号",
"同德县鲜肉蔬菜配送行"
)
knitr::kable(
geo_locate(terms)
)
```
How It Works
------------
The geolocation function `geo_locate()` works by using both substring matching and geocodes to validate child regions and infer parent regions. The package ships with internal data that contains region substrings and associated geocodes. Access to the package data is provided via function `get_package_data()`, which returns a tidy data frame.
```{r}
df <- get_package_data()
head(df)
```
The variable `region_type` features four unique values.
```{r}
# Province
head(df[df$region_type == "province", ], 1)
# City
head(df[df$region_type == "city", ], 1)
# County
head(df[df$region_type == "county", ], 1)
# County 2015
head(df[df$region_type == "county_2015", ], 1)
```
`geo_locate()` looks for matching substrings in this order: province, then city, then county. It operates largely on the premise that county geocodes contain a city geocode, and city geocodes contain a provincial geocode. So the county with geocode `330110` resides in the city with geocode `3301`, which resides in the province with geocode `33`.
Here are some different examples showing how `geo_locate()` works. To begin, we'll start with a generic input location string.
```{r}
base_str <- "工业路25号"
```
#### No matches
```{r}
knitr::kable(geo_locate(base_str))
```
#### Province match
Add a province to the `base_str`.
```{r}
x <- paste0("浙江省", base_str)
knitr::kable(geo_locate(x))
```
#### Province and city match
Add a province and a city that exists within the province to the `base_str`.
```{r}
x <- paste0("浙江省", "杭州市", base_str)
knitr::kable(geo_locate(x))
```
It's important to note that, within the return data, the first two digits of the `city_code` match the `province_code`.
On the other hand, if we try to combine a province with a city that does NOT exist within the province, the `city` variables in the output will return `NA`.
```{r}
x <- paste0("浙江省", "大连市", base_str)
knitr::kable(geo_locate(x))
```
This is because the functions attempt to establish a provincial match first, and then any city matches found must have an associated geocode that contain leading digits that match the provincial geocode. Here is the package data observation of the matching city substring.
```{r}
df[df$region == "大连", ]
```
This same principle applies to the relationship between city matches/geocodes and county matches/geocodes, as we'll see below.
#### Province, city, and county match
Add a logical province, city, and county to `base_str`.
```{r}
x <- paste0("浙江省", "杭州市余", "余杭区", base_str)
knitr::kable(geo_locate(x))
```
Just like in the example above, if we try to combine a county with a city that does NOT exist within the city, the `county` variables will return `NA`
```{r}
x <- paste0("浙江省", "杭州市余", "甘井子区", base_str)
knitr::kable(geo_locate(x))
```
```{r}
# Print the pkg data observation related to county "甘井子"
df[df$region == "甘井子", ]
```
#### City match only
Add a city to `base_str`.
```{r}
x <- x <- paste0("杭州市余", base_str)
knitr::kable(geo_locate(x))
```
`geo_locate()` will return a matched province and city. This is because it will first look for a province match and return `NA`. It then looks for a city match (in this example, a city match is found). If a city match is found AND the province is `NA`, the province output values will be inferred from the city geocode. In this example, the matching city geocode is `3301`, the first two digits are taken and assigned as the matching provincial geocode. Then that geocode `33` is used to fetch a corresponding matching province (`浙江`) from the package data.
#### County match only
Add a county to `base_str`
```{r}
x <- paste0("余杭区", base_str)
knitr::kable(geo_locate(x))
```
`geo_locate()` will return a matched province, city, and county. Just like in the last example, when the function cannot find a match for province and city, but a county match is found, the province and city output info will be inferred from the county geocode.
#### 2015 county package data
As part of the internal package data, `geolocChina` ships with a list of counties (and corresponding geocodes) that existed in 2015 and do not currently exist, due to rezoning and county/district lines being redrawn. `geo_locate()` will only use the 2015 county data if there was no city match found. And if a match is found in the 2015 county data, the function will infer the output values for province and/or city, but will leave the county output values as `NA` (as opposed to returning county values that do not currently exist).
```{r}
x <- paste0("北市区", base_str)
knitr::kable(geo_locate(x))
```
#### Two different regions of the same type
If there are multiple matches of the same region type in an input string (say, more than one provincial matches found), then `geo_locate()` will return the region that occupies the earliest character index within the input string (i.e. the substring that appears closest to the beginning of the input string). In the example below, the actual province is `浙江省`, but the road name `辽宁` also matches to a different province in China.
```{r}
x <- "浙江省杭州市辽宁路25号"
knitr::kable(geo_locate(x))
```
The logic behind this comes from the fact that Chinese address/location strings typically have this structure
```
province-city-county-district-street-number
```
Reading left to right, the geographic resolution starts large and gets more and more fine grained.
Using `四川省南充市阆中市公园路63号` as an example, we have
Region | String
-------- | --------------------------------
province | **四川省** 南充市阆中市公园路63号
city | 四川省 **南充市** 阆中市公园路63号
county | 四川省南充市 **阆中市** 公园路63号
```{r}
knitr::kable(geo_locate("四川省南充市阆中市公园路63号"))
```