An inferential approach to Urban Scaling Laws

This repository contains data and a Python implementation of probabilistic models to investigate urban scaling laws [1][2]. These statistical laws state that many observables $y_i$ (e.g., GDP) of $i=1, 2, \ldots, N$ urban areas in a country (or region) scale with the population $x_i$ as

$$ y_i \sim x_i^{\beta},$$

with $0<\beta<2$. The primary interest is to compare different models, test the validity of the urban scaling law, and estimate the scaling parameter $\beta$.
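
As a rough illustration of what estimating $\beta$ involves, the sketch below fits a straight line to log-transformed synthetic data with NumPy. The data, the variable names, and the use of `np.polyfit` are assumptions made for this example; it is not the repository's inference code, which is based on likelihoods.

```python
import numpy as np

# Synthetic data (NOT from the repository): 200 cities with a known exponent.
rng = np.random.default_rng(seed=0)
true_beta, true_A = 1.15, -2.0
x = 10 ** rng.uniform(4, 7, size=200)   # populations between 1e4 and 1e7
log_y = true_A + true_beta * np.log(x) + rng.normal(0.0, 0.1, size=x.size)
y = np.exp(log_y)                        # observable, e.g. GDP

# Least-squares fit of log(y) = A + beta * log(x).
# np.polyfit returns coefficients from the highest degree to the lowest.
beta_hat, A_hat = np.polyfit(np.log(x), np.log(y), deg=1)

print(f"estimated beta = {beta_hat:.3f}, A = {A_hat:.3f}")
```

A value of `beta_hat` above 1 would suggest super-linear scaling and a value below 1 sub-linear scaling, but whether the difference from 1 is meaningful is a model-comparison question.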

Models

For the application of models based on cities (C), see Ref. [1] and the Jupyter Notebook (or Open Notebook in Colab).

For the application of models based on the attribution of tokens to individuals (I), which also account for the spatial interaction between urban areas, see Ref. [2] and the Jupyter Notebook (or Open Notebook in Colab).

Model	Parameters	Spatial interaction (Y/N)?	Cities(C) or Individuals(I)	Formula	Description/Reference
Per-capita	$\emptyset$	N	C,I	$y_i = x_i \frac{\sum y_i}{\sum x_i}$	Fixed per-capita rate $\beta=1$ [2]
Least-squares	$\beta,A$	N	C	$\log(y) = A +\beta \log(x)$	Least-squares fitting of log-transformed variables [1]
Gaussian	$\beta,\alpha,\gamma,\delta$	N	C	$\mathbb{E}(y\mid x) = \alpha x^{\beta}, \mathbb{V}(y\mid x) = \gamma \mathbb{E}(y\mid x)^{\delta}$	Gaussian $P(y\mid x)$ [1]
Log-normal	$\beta,\alpha,\gamma,\delta$	N	C	$\mathbb{E}(y\mid x) = \alpha x^{\beta}, \mathbb{V}(y\mid x) = \gamma \mathbb{E}(y\mid x)^{\delta}$	Log-normal $P(y\mid x)$ [1]
Persons	$\beta$	N	I	$p(j) \sim x_{c(j)}^{\beta-1}$	Tokens are attributed to individuals with probability $p(j)$ [1][2]
Gravitational	$\beta,\alpha_G$	Y	I	$a_G = \frac{1}{1+ \left(\frac{d}{\alpha_G}\right)^2}$	Tokens to individuals with prob. $p(j)$, who interact according to $a_G$ depending on distance $d$ [1] [2]
Exponential	$\beta,\alpha_E$	Y	I	$a_E = e^{- d \ln(2) / \alpha_E}$	Tokens to individuals with prob. $p(j)$, who interact according to $a_E$ depending on distance $d$ [2]
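
The two interaction kernels in the table can be sketched directly from their formulas. The function names and the example distances below are hypothetical and are not taken from spatial.py.

```python
import numpy as np

def gravitational_kernel(d, alpha_G):
    """a_G = 1 / (1 + (d / alpha_G)**2), as in the Gravitational row."""
    d = np.asarray(d, dtype=float)
    return 1.0 / (1.0 + (d / alpha_G) ** 2)

def exponential_kernel(d, alpha_E):
    """a_E = exp(-d * ln(2) / alpha_E), as in the Exponential row."""
    d = np.asarray(d, dtype=float)
    return np.exp(-d * np.log(2.0) / alpha_E)

# Hypothetical distances between urban areas (same unit as alpha).
distances = np.array([0.0, 50.0, 100.0, 200.0])
a_G = gravitational_kernel(distances, alpha_G=100.0)
a_E = exponential_kernel(distances, alpha_E=100.0)

print(a_G)
print(a_E)
```

With these definitions both kernels equal 1 at $d=0$ and 1/2 at $d=\alpha$, so $\alpha_G$ and $\alpha_E$ can be read as the distance at which the interaction drops to half its maximum.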

Data

The datasets listed below are available for investigation. The column "Tag" gives the key used to call the dataset in our code (e.g., in the notebook). The column "Location?" indicates whether the latitude and longitude are available (Y/N).

The area of urban areas is available for data from Australia, Europe, and USA. This data can be used as an alternative measure $x_i$ of city size or in combination with population. An example of this analysis can be found here. An example of the analysis of COVID19 cases can be found here.

Region:	Tag:	N	Location?	Year	Description	Source
Australia
	covid19_NSW	144	N	2021	COVID19 cases in the state of NSW	NSW
	australia_area	102	Y	2021	Area	Australian Bureau of Statistics
	australia_education	102	Y	2021	Top bracket in Education	Census, Australian Bureau of Statistics
	australia_income	102	Y	2021	Top bracket in Income	Census, Australian Bureau of Statistics
Brazil
	brazil_aids_2010	1812	Y	2010	AIDS cases	Brazilian Health Ministry
	brazil_externalCauses_2010	5286	Y	2010	Death by external causes	Brazilian Health Ministry
	brazil_gdp_2010	5565	Y	2010	GDP	Brazilian Health Ministry
	covid19_brazil	5570	N	2021	COVID19 cases	Brasil.io and wcota
Chile
	covid19_chile	346	N	2021	COVID19 cases	MinCiencia
Europe
	eurostat_cinema_seats	418	N	2011	Cinema seats	Eurostat
	eurostat_cinema_attendance	221	N	2011	Attendance to cinemas	Eurostat
	eurostat_museum_visitors	443	N	2011	Visitors to museums	Eurostat
	eurostat_theaters	398	N	2011	Theaters	Eurostat
	eurostat_libraries	597	N	2011	Libraries	Eurostat
	eurostat_landarea	298	N	1990 to 2023	Land Area	Eurostat
	eurostat_totalarea	298	N	1990 to 2023	Total Area	Eurostat
	eurostat_trademarks	273	N	1996 to 2016	Trademarks	Eurostat
	eurostat_gdp	238	N	2000 to 2021	GDP	Eurostat
	eurostat_gdp_purchasing_power	238	N	2000 to 2021	GDP Purchasing Power Adjusted	Eurostat
	eurostat_gdp_perperson	238	N	2000 to 2021	GDP Per Person	Eurostat
	eurostat_gdp_perperson_purchasing_power	238	N	2000 to 2021	GDP Per Person Purchasing Power Adjusted	Eurostat
	eurostat_patents	278	N	1977 to 2012	Patents	Eurostat
	eurostat_gva	255	N	1995 to 2021	GVA	Eurostat
	eurostat_burglaries	217	N	2008 to 2020	Burglaries	Eurostat
	eurostat_robberies	217	N	2008 to 2020	Robberies	Eurostat
	eurostat_homicides	217	N	2008 to 2020	Homicides	Eurostat
	eurostat_motor_theft	217	N	2008 to 2020	Motor Theft	Eurostat
	eurostat_lower_secondary_education	286	N	2009 to 2022	Lower Secondary Education	Eurostat
	eurostat_upper_secondary_to_non_tertiary_education	286	N	2009 to 2022	Upper Secondary to Non Tertiary Education	Eurostat
	eurostat_upper_secondary_to_tertiary_education	286	N	2009 to 2022	Upper Secondary to Tertiary Education	Eurostat
	eurostat_tertiary_education	286	N	2009 to 2022	Tertiary Education	Eurostat
Germany
	germany_gdp	108	N	2012	GDP	German Statistical Office
OECD
	oecd_gdp	275	N	2010	GDP	OECD
	oecd_patents	218	N	2008	Patents filed	OECD
UK
	uk_income	100	N	2000 to 2011	Weekly income	Arcaute et al.
	uk_patents	93	N	2000 to 2011	Patents filed	Arcaute et al.
	uk_train	97	N	2000 to 2011	Train stations	Arcaute et al.
USA
	usa_gdp	381	Y	2013	GDP	BEA
	usa_miles	459	Y	2013	Length of roads in miles	FHWA
	covid19_USA	3131	N	2021	Covid19 cases	Kaggle
	usa_area	938	N	2019	Area	US Census Bureau
	usa_travel_time	632	N	2010 to 2022	Travel Time	Census
	usa_poverty	505	N	2010 to 2022	Poverty	Census
	usa_mean_income	632	N	2010 to 2022	Mean Income	Census
	usa_median_income	632	N	2010 to 2022	Median Income	Census
	usa_highschool_education	433	N	2015 to 2022	High School Education	Census

The data is stored in the folder data, where more information about its sources and filtering can be found. The folder consists of Python packages (e.g. brazil). Each package exposes functions, defined in its __init__.py, that return the data. The data is always a tuple (x, y) of numpy arrays of the same size, where x is always population.

For example, to get the population and GDP of Brazilian cities in 2010, use:

import brazil
x, y = brazil.gdp(2010)

For the spatial data, an additional array (l) indicates the location (latitude and longitude) of the urban area.

Import your own data:

New data can be added as a .csv file to

new_dataset/generic_dataset.txt (for three columns: city name, $x,y$)

or

new_dataset2/generic_dataset.txt (for two columns: $x,y$)

For the spatial analysis, import your results as $x$ (population), $y$ (observable), and $\ell$ (latitude and longitude) directly in the notebook.
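
To check a new file before placing it in one of the two folders above, a small reader along the following lines may help. It is only a sketch: it assumes comma-separated values with no header row, and `load_xy` is a hypothetical helper, not a function of this repository.

```python
import csv
import io

import numpy as np

def load_xy(handle, has_city_names):
    """Read rows of (city, x, y) or (x, y) and return numpy arrays (x, y)."""
    x, y = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        values = row[1:] if has_city_names else row
        x.append(float(values[0]))  # population
        y.append(float(values[1]))  # observable
    return np.asarray(x), np.asarray(y)

# Made-up rows standing in for a three-column generic_dataset.txt.
sample = io.StringIO("CityA,120000,3400.5\nCityB,45000,910.0\n")
x, y = load_xy(sample, has_city_names=True)

print(x, y)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")` and set `has_city_names=False` for the two-column layout.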

Code

The easiest way to interact with and run the code is through the notebooks in the folder notebooks. Follow the link in the "Notebook-*-Colab.ipynb" files to run them in Colab, or download this repository and run them using Jupyter. The Python source code is in the folder src.

Likelihood and minimisation

All inference is performed based on the likelihood of different models. The module best_parameters.py contains the definition of the likelihood functions of the models, the minimization algorithm, and the parameters we use in it. The bootstrap used to estimate error bars is also defined in this module, in minimize_with_errors. The bootstrap for the person model is implemented in pvalue_population.py. The likelihood and minimization of the spatial models appear in spatial.py.
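
The idea behind bootstrap error bars can be sketched as follows: resample the cities with replacement, refit $\beta$ on each resample, and report the spread of the estimates. For brevity this example swaps in a least-squares fit for the likelihood minimization; `fit_beta` and `bootstrap_beta` are hypothetical names and do not reproduce minimize_with_errors.

```python
import numpy as np

def fit_beta(x, y):
    """Slope of the least-squares line log(y) = A + beta * log(x)."""
    beta, _intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return beta

def bootstrap_beta(x, y, n_samples=200, seed=0):
    """Resample cities with replacement and refit beta each time."""
    rng = np.random.default_rng(seed)
    n = len(x)
    betas = np.empty(n_samples)
    for k in range(n_samples):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        betas[k] = fit_beta(x[idx], y[idx])
    # Mean of the bootstrap estimates and their standard deviation (error bar).
    return betas.mean(), betas.std(ddof=1)

# Synthetic data (not repository data) with a known exponent of 0.9.
rng = np.random.default_rng(1)
x = 10 ** rng.uniform(4, 7, size=300)
y = np.exp(0.5 + 0.9 * np.log(x) + rng.normal(0.0, 0.2, size=x.size))

beta_mean, beta_err = bootstrap_beta(x, y)
print(f"beta = {beta_mean:.3f} +/- {beta_err:.3f}")
```

More bootstrap samples give a more stable error estimate at a proportionally higher cost, since each sample requires a full refit.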

Analysis

The different analyses we perform, as well as the list of datasets we use, are defined in analysis.py. The general setting is defined in LikelihoodAnalysis and its methods.

For example, to get $\beta$ estimated by the Log-normal model with free $\delta$, together with other statistical information, use

>>> from analysis import LogNormalAnalysis
>>> analysis = LogNormalAnalysis('brazil_aids_2010', required_successes=512)
>>> analysis.beta[0]
>>> analysis.p_value
>>> analysis.bic
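
For readers who want to see what a Log-normal likelihood with free $\delta$ looks like, the sketch below writes it out from the mean and variance given in the Models table, using the standard conversion from the mean and variance of $y$ to the parameters of $\log y$. The function name and the synthetic data are assumptions for illustration; the repository's own definition lives in best_parameters.py and may differ in detail.

```python
import numpy as np

def lognormal_log_likelihood(params, x, y):
    """Log-likelihood of y given x under a log-normal P(y|x).

    params = (beta, alpha, gamma, delta), with
    E(y|x) = alpha * x**beta and V(y|x) = gamma * E(y|x)**delta.
    """
    beta, alpha, gamma, delta = params
    mean = alpha * x ** beta
    var = gamma * mean ** delta
    # Convert mean and variance of y into the parameters of log(y).
    sigma2 = np.log1p(var / mean ** 2)
    mu = np.log(mean) - 0.5 * sigma2
    log_y = np.log(y)
    return np.sum(-log_y - 0.5 * np.log(2 * np.pi * sigma2)
                  - (log_y - mu) ** 2 / (2 * sigma2))

# Synthetic data drawn from the model itself (not repository data).
rng = np.random.default_rng(0)
true_params = (1.2, 0.01, 0.05, 2.0)  # beta, alpha, gamma, delta
x = 10 ** rng.uniform(4, 7, size=500)
mean = true_params[1] * x ** true_params[0]
sigma2 = np.log1p(true_params[2] * mean ** (true_params[3] - 2))
y = rng.lognormal(mean=np.log(mean) - 0.5 * sigma2, sigma=np.sqrt(sigma2))

ll_true = lognormal_log_likelihood(true_params, x, y)
ll_linear = lognormal_log_likelihood((1.0, 0.01, 0.05, 2.0), x, y)
print(ll_true, ll_linear)
```

Maximizing this function over the four parameters gives the fitted $\beta$. Comparing the maximum with the one obtained under the constraint $\beta=1$ is one way to ask whether the scaling is non-linear.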

You can run the Jupyter Notebook (or Open Notebook in Colab) or run python -m analyze. For example,

MODEL=LogNormalAnalysis ERROR_SAMPLES=10 python -m analyze

runs the LogNormal model with 10 bootstrap samples on the new dataset. It prints the best $\beta$, the bootstrap error for $\beta$, the p-value, and the BIC for the specific model (the script explains how to select the model).
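
As a reminder of how the BIC is used for model comparison, the sketch below implements the textbook definition $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$, where $k$ is the number of free parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood. The function and the numbers are hypothetical; the repository's own BIC may use a different sign or normalization convention.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L_max)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical maximized log-likelihoods for two models on the same 500 cities.
bic_free_beta = bic(log_likelihood=-1200.0, n_params=4, n_obs=500)
bic_fixed_beta = bic(log_likelihood=-1210.0, n_params=3, n_obs=500)

# With this convention, the model with the lower BIC is preferred.
print(bic_free_beta, bic_fixed_beta)
```

The extra parameter is penalized by $\ln n$, so a model with free $\beta$ is preferred only if it improves the log-likelihood by more than $\tfrac{1}{2}\ln n$.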

Pre-computed results are stored in _results. To reproduce some of the results stored in _results, delete the respective analysis in that directory and run the following (this may take some time):

python -m analysis_run

This requires some environment variables, which are documented when you run it.

References

This repository contains both data and code from the papers:

[1] Is this scaling non-linear? by Jorge C. Leitão, José M. Miotto, Martin Gerlach, and Eduardo G. Altmann, Royal Society Open Science 3, 150649 (2016). | See Notebook | Open Notebook in Colab

[2] Spatial Interactions in urban scaling laws, by Eduardo G. Altmann, PLOS ONE 15, e0243390 (2020). | See Notebook | Open Notebook in Colab

And also specific projects:

Results for COVID-19 data, performed by Jimena Espinoza in Semester 2, 2021 | See Notebook | Open Notebook in Colab.

Results considering area and population as measures of city size, performed by Isaac Riad in 2024 | See Notebook.

Contributions are welcome. If results of this repository are used, please cite it and the corresponding publications.
