An inferential approach to Urban Scaling Laws

This repository contains data and the Python implementation of probabilistic models to investigate urban scaling laws [1][2]. These statistical laws state that many observables $y_i$ (e.g., GDP) of the $i=1, 2, \ldots, N$ urban areas in a country (or region) scale with the population $x_i$ as

$$ y_i \sim x_i^{\beta},$$

with $0<\beta<2$. The primary interest is to compare different models, test the validity of the urban scaling law, and estimate the scaling parameter $\beta$.

Models

For the application of models based on cities (C), see Ref. [1] and the Jupyter Notebook (or Open Notebook in Colab).

For the application of models based on the attribution of tokens to individuals (I), which also account for the spatial interaction between urban areas, see Ref. [2] and the Jupyter Notebook (or Open Notebook in Colab).

| Model | Parameters | Spatial interaction (Y/N)? | Cities (C) or Individuals (I) | Formula | Description/Reference |
|---|---|---|---|---|---|
| Per-capita | $\emptyset$ | N | C, I | $y_i = x_i \frac{\sum y_i}{\sum x_i}$ | Fixed per-capita rate, $\beta=1$ [2] |
| Least-square | $\beta, A$ | N | C | $\log(y) = A + \beta \log(x)$ | Least-squares fitting of log-transformed variables [1] |
| Gaussian | $\beta, \alpha, \gamma, \delta$ | N | C | $\mathbb{E}(y\mid x) = \alpha x^{\beta}$, $\mathbb{V}(y\mid x) = \gamma \mathbb{E}(y\mid x)^{\delta}$ | Gaussian $P(y\mid x)$ [1] |
| Log-normal | $\beta, \alpha, \gamma, \delta$ | N | C | $\mathbb{E}(y\mid x) = \alpha x^{\beta}$, $\mathbb{V}(y\mid x) = \gamma \mathbb{E}(y\mid x)^{\delta}$ | Log-normal $P(y\mid x)$ [1] |
| Persons | $\beta$ | N | I | $p(j) \sim x_{c(j)}^{\beta-1}$ | Tokens are attributed to individuals with probability $p(j)$ [1][2] |
| Gravitational | $\beta, \alpha_G$ | Y | I | $a_G = \frac{1}{1+ \left(d/\alpha_G\right)^2}$ | Tokens attributed to individuals with probability $p(j)$, who interact according to $a_G$ depending on distance $d$ [1][2] |
| Exponential | $\beta, \alpha_E$ | Y | I | $a_E = e^{-d \ln(2)/\alpha_E}$ | Tokens attributed to individuals with probability $p(j)$, who interact according to $a_E$ depending on distance $d$ [2] |
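
For orientation, the simplest of these models (Least-square) can be reproduced in a few lines. This is a minimal illustrative sketch using numpy, not the repository's implementation (see the folder src for that):

```python
import numpy as np

def least_square_fit(x, y):
    """Fit log(y) = A + beta * log(x) by least squares.

    np.polyfit returns the highest-degree coefficient first,
    so the slope (beta) comes before the intercept (A).
    """
    beta, A = np.polyfit(np.log(x), np.log(y), 1)
    return beta, A
```

On any of the (x, y) datasets listed below, this returns the familiar log-log slope as an estimate of $\beta$.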

Data

The datasets listed below are available for investigation. The column "Tag" indicates the key used to call the data in our code (e.g., in the notebook). The column "Location?" indicates whether latitude and longitude are available (Y/N).

The area of urban areas is available for the data from Australia, Europe, and the USA. It can be used as an alternative measure $x_i$ of city size or in combination with population; an example of this analysis can be found here. An example of the analysis of COVID-19 cases can be found here.

| Region | Tag | N | Location? | Year | Description | Source |
|---|---|---|---|---|---|---|
| Australia | covid19_NSW | 144 | N | 2021 | COVID-19 cases in the state of NSW | NSW |
| Australia | australia_area | 102 | Y | 2021 | Area | Australian Bureau of Statistics |
| Australia | australia_education | 102 | Y | 2021 | Top bracket in education | Census, Australian Bureau of Statistics |
| Australia | australia_income | 102 | Y | 2021 | Top bracket in income | Census, Australian Bureau of Statistics |
| Brazil | brazil_aids_2010 | 1812 | Y | 2010 | AIDS cases | Brazilian Health Ministry |
| Brazil | brazil_externalCauses_2010 | 5286 | Y | 2010 | Deaths by external causes | Brazilian Health Ministry |
| Brazil | brazil_gdp_2010 | 5565 | Y | 2010 | GDP | Brazilian Health Ministry |
| Brazil | covid19_brazil | 5570 | N | 2021 | COVID-19 cases | Brasil.io and wcota |
| Chile | covid19_chile | 346 | N | 2021 | COVID-19 cases | MinCiencia |
| Europe | eurostat_cinema_seats | 418 | N | 2011 | Cinema seats | Eurostat |
| Europe | eurostat_cinema_attendance | 221 | N | 2011 | Cinema attendance | Eurostat |
| Europe | eurostat_museum_visitors | 443 | N | 2011 | Museum visitors | Eurostat |
| Europe | eurostat_theaters | 398 | N | 2011 | Theaters | Eurostat |
| Europe | eurostat_libraries | 597 | N | 2011 | Libraries | Eurostat |
| Europe | eurostat_landarea | 298 | N | 1990 to 2023 | Land area | Eurostat |
| Europe | eurostat_totalarea | 298 | N | 1990 to 2023 | Total area | Eurostat |
| Europe | eurostat_trademarks | 273 | N | 1996 to 2016 | Trademarks | Eurostat |
| Europe | eurostat_gdp | 238 | N | 2000 to 2021 | GDP | Eurostat |
| Europe | eurostat_gdp_purchasing_power | 238 | N | 2000 to 2021 | GDP, purchasing power adjusted | Eurostat |
| Europe | eurostat_gdp_perperson | 238 | N | 2000 to 2021 | GDP per person | Eurostat |
| Europe | eurostat_gdp_perperson_purchasing_power | 238 | N | 2000 to 2021 | GDP per person, purchasing power adjusted | Eurostat |
| Europe | eurostat_patents | 278 | N | 1977 to 2012 | Patents | Eurostat |
| Europe | eurostat_gva | 255 | N | 1995 to 2021 | GVA | Eurostat |
| Europe | eurostat_burglaries | 217 | N | 2008 to 2020 | Burglaries | Eurostat |
| Europe | eurostat_robberies | 217 | N | 2008 to 2020 | Robberies | Eurostat |
| Europe | eurostat_homicides | 217 | N | 2008 to 2020 | Homicides | Eurostat |
| Europe | eurostat_motor_theft | 217 | N | 2008 to 2020 | Motor theft | Eurostat |
| Europe | eurostat_lower_secondary_education | 286 | N | 2009 to 2022 | Lower secondary education | Eurostat |
| Europe | eurostat_upper_secondary_to_non_tertiary_education | 286 | N | 2009 to 2022 | Upper secondary to non-tertiary education | Eurostat |
| Europe | eurostat_upper_secondary_to_tertiary_education | 286 | N | 2009 to 2022 | Upper secondary to tertiary education | Eurostat |
| Europe | eurostat_tertiary_education | 286 | N | 2009 to 2022 | Tertiary education | Eurostat |
| Germany | germany_gdp | 108 | N | 2012 | GDP | German Statistical Office |
| OECD | oecd_gdp | 275 | N | 2010 | GDP | OECD |
| OECD | oecd_patents | 218 | N | 2008 | Patents filed | OECD |
| UK | uk_income | 100 | N | 2000 to 2011 | Weekly income | Arcaute et al. |
| UK | uk_patents | 93 | N | 2000 to 2011 | Patents filed | Arcaute et al. |
| UK | uk_train | 97 | N | 2000 to 2011 | Train stations | Arcaute et al. |
| USA | usa_gdp | 381 | Y | 2013 | GDP | BEA |
| USA | usa_miles | 459 | Y | 2013 | Length of roads in miles | FHWA |
| USA | covid19_USA | 3131 | N | 2021 | COVID-19 cases | Kaggle |
| USA | usa_area | 938 | N | 2019 | Area | US Census Bureau |
| USA | usa_travel_time | 632 | N | 2010 to 2022 | Travel time | Census |
| USA | usa_poverty | 505 | N | 2010 to 2022 | Poverty | Census |
| USA | usa_mean_income | 632 | N | 2010 to 2022 | Mean income | Census |
| USA | usa_median_income | 632 | N | 2010 to 2022 | Median income | Census |
| USA | usa_highschool_education | 433 | N | 2015 to 2022 | High school education | Census |

The data is stored in the folder data, where more information about its sources and filtering can be found. It is organised as Python packages (e.g., brazil), and each package defines functions in its __init__.py that return the data. The data is always a tuple (x, y) of numpy arrays of the same size, where x is always the population.

For example, to get the population and GDP of Brazilian cities in 2010, use:

```python
import brazil
x, y = brazil.gdp(2010)
```
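
As a quick illustration (a minimal sketch, assuming only numpy), the Per-capita baseline from the model table above can be computed directly from these arrays:

```python
import numpy as np
import brazil

x, y = brazil.gdp(2010)

# Per-capita model: every city gets the country-wide rate, i.e. beta = 1
y_percapita = x * np.sum(y) / np.sum(x)
```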

For spatial data, an additional array (l) indicates the location (latitude and longitude) of each urban area.

Import your own data:

New data can be added as a comma-separated file at

new_dataset/generic_dataset.txt (three columns: city name, $x$, $y$)

or

new_dataset2/generic_dataset.txt (two columns: $x$, $y$)
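
For example, a two-column file could look like the following (hypothetical values; match the delimiter and layout of the generic_dataset.txt already in the repository):

```
350000, 9.8e8
120000, 4.1e8
80000, 1.5e8
```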

For the spatial analysis, import your data as $x$ (population), $y$ (observable), and $\ell$ (latitude and longitude) directly in the notebook.
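
A minimal sketch of the expected arrays (hypothetical values; l holds one (latitude, longitude) pair per urban area):

```python
import numpy as np

x = np.array([1_200_000, 350_000, 80_000])   # population
y = np.array([5.1e9, 9.8e8, 1.5e8])          # observable (e.g., GDP)
l = np.array([[-33.87, 151.21],              # latitude, longitude
              [-37.81, 144.96],
              [-42.88, 147.33]])
```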

Code

The easiest way to interact with and run the code is through the notebooks in the folder notebooks. Follow the link in the "Notebook-*-Colab.ipynb" files to run them in Colab, or download this repository and run them using Jupyter. The source Python code is in the folder src.

Likelihood and minimisation

All inference is performed based on the likelihood of the different models. The module best_parameters.py contains the definitions of the likelihood functions of the models, the minimisation algorithm, and the parameters we use in it. The bootstrap used to estimate error bars is also defined in this module, in minimize_with_errors. The bootstrap for the Persons model is implemented in pvalue_population.py. The likelihood and minimisation of the spatial models appear in spatial.py.
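
To make this concrete, here is a self-contained sketch of maximum-likelihood fitting with bootstrap error bars for the Gaussian model of the table above. It illustrates the general idea under simplifying assumptions (Nelder-Mead optimiser, log-scale parameters to enforce positivity); it is not the repository's implementation, whose likelihoods and optimiser settings are in best_parameters.py:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_nll(params, x, y):
    """Negative log-likelihood of the Gaussian model:
    E(y|x) = alpha * x**beta, V(y|x) = gamma * E(y|x)**delta.
    alpha and gamma are passed on log scale so they stay positive."""
    log_alpha, beta, log_gamma, delta = params
    mean = np.exp(log_alpha) * x**beta
    var = np.exp(log_gamma) * mean**delta
    return np.sum(0.5 * np.log(2 * np.pi * var) + (y - mean) ** 2 / (2 * var))

def fit_beta_with_errors(x, y, n_boot=100, seed=0):
    """Fit beta by minimising the likelihood, then bootstrap the cities
    (resampling with replacement) to estimate its error bar."""
    x0 = [np.log(y.sum() / x.sum()), 1.0, 0.0, 1.0]  # start near per-capita
    beta = minimize(gaussian_nll, x0, args=(x, y), method="Nelder-Mead").x[1]
    rng = np.random.default_rng(seed)
    betas = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))
        betas.append(
            minimize(gaussian_nll, x0, args=(x[idx], y[idx]),
                     method="Nelder-Mead").x[1]
        )
    return beta, np.std(betas)
```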

Analysis

The different analyses we perform, as well as the list of datasets we use, are defined in analysis.py. The general setting is defined in LikelihoodAnalysis and its methods.

For example, to obtain the $\beta$ estimated by the Log-normal model with free $\delta$, together with other statistical information, use:

```python
>>> from analysis import LogNormalAnalysis
>>> analysis = LogNormalAnalysis('brazil_aids_2010', required_successes=512)
>>> analysis.beta[0]   # best estimate of beta
>>> analysis.p_value   # p-value of the model
>>> analysis.bic       # Bayesian information criterion
```
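
Since lower BIC values indicate a better trade-off between likelihood and number of parameters, the same dataset can be run through the other analysis classes in analysis.py and their bic values compared across models.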

You can run the Jupyter Notebook (or Open Notebook in Colab) or run python -m analyze. For example,

```
MODEL=LogNormalAnalysis ERROR_SAMPLES=10 python -m analyze
```

runs the Log-normal model with 10 bootstrap samples on the new dataset. It prints the best $\beta$, the bootstrap error on $\beta$, the p-value, and the BIC for the selected model (the script explains how to select the model).

Pre-computed results are stored in _results. To reproduce some of them, delete the respective analysis in that directory and run (this may take some time):

```
python -m analysis_run
```

This requires some environment variables, which are documented when you run it.

References

This repository contains both data and code from the papers:

[1] Is this scaling nonlinear?, by Jorge C. Leitão, José M. Miotto, Martin Gerlach, and Eduardo G. Altmann, Royal Society Open Science 3, 150649 (2016). | See Notebook | Open Notebook in Colab

[2] Spatial interactions in urban scaling laws, by Eduardo G. Altmann, PLOS ONE 15, e0243390 (2020). | See Notebook | Open Notebook in Colab

And also specific projects:

Results for COVID-19 data, performed by Jimena Espinoza in Semester 2, 2021 | See Notebook | Open Notebook in Colab.

Results considering area and population as measures of city size, performed by Isaac Riad in 2024 | See Notebook.

Contributions are welcome. If you use results from this repository, please cite it and the corresponding publications.
