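This repository contains data and the Python implementation of probabilistic models to investigate urban scaling laws [1][2]. These statistical laws state that many observables $y$ of a city scale with its population $x$ as $y \sim x^{\beta}$, with a scaling exponent $\beta$ that may differ from 1.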
For the application of models based on cities (C), see Ref. [1] and the Jupyter Notebook (or Open Notebook in Colab).
For the application of models based on the attribution of tokens to individuals (I), which also account for the spatial interaction between urban areas, see Ref. [2] and the Jupyter Notebook (or Open Notebook in Colab).
Model | Parameters | Spatial interaction (Y/N)? | Cities (C) or Individuals (I) | Formula | Description/Reference |
---|---|---|---|---|---|
Per-capita | | N | C,I | | Fixed per-capita rate |
Least-square | | N | C | | Least-squares fitting of log-transformed variables [1] |
Gaussian | | N | C | | Gaussian |
Log-normal | | N | C | | Log-normal |
Persons | | N | I | | Tokens are attributed to individuals with probability |
Gravitational | | Y | I | | Tokens to individuals with prob. |
Exponential | | Y | I | | Tokens to individuals with prob. |
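As an illustration of the simplest entry in the table, the least-squares fit of log-transformed variables can be sketched as follows. This is a minimal sketch on synthetic data, not the repository's implementation: the function name `fit_scaling_least_squares` and the synthetic cities are assumptions introduced here, and the scaling form $y = \alpha x^{\beta}$ is the one assumed for the fit.

```python
import numpy as np


def fit_scaling_least_squares(x, y):
    """Fit ln(y) = ln(alpha) + beta * ln(x) by ordinary least squares.

    Hypothetical helper for illustration; not part of the repository.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    log_y = np.log(np.asarray(y, dtype=float))
    # np.polyfit returns coefficients from highest degree to lowest:
    # [slope, intercept] for a degree-1 fit.
    beta, log_alpha = np.polyfit(log_x, log_y, deg=1)
    return float(np.exp(log_alpha)), float(beta)


# Synthetic "cities" (illustrative only): population x and an observable y
# generated with beta = 1.15 and multiplicative log-normal noise.
rng = np.random.default_rng(0)
population = rng.integers(10_000, 5_000_000, size=500)
observable = 0.05 * population**1.15 * rng.lognormal(0.0, 0.2, size=500)

alpha_hat, beta_hat = fit_scaling_least_squares(population, observable)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
```

The estimated `beta_hat` should be close to the value used to generate the data. The probabilistic models in the table replace this simple fit with explicit likelihoods.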
The datasets listed below are available for investigation. The column "tag" indicates the key used to call this data in our code (e.g., in the notebook). The column "Location?" indicates whether the latitude and longitude are available (Y/N).
The area of urban areas is available for data from Australia, Europe, and the USA. This data can be used as an alternative measure of city size.
Region | Tag | N | Location? | Year | Description | Source |
---|---|---|---|---|---|---|
Australia | covid19_NSW | 144 | N | 2021 | COVID19 cases in the state of NSW | NSW |
Australia | australia_area | 102 | Y | 2021 | Area | Australian Bureau of Statistics |
Australia | australia_education | 102 | Y | 2021 | Top bracket in Education | Census, Australian Bureau of Statistics |
Australia | australia_income | 102 | Y | 2021 | Top bracket in Income | Census, Australian Bureau of Statistics |
Brazil | brazil_aids_2010 | 1812 | Y | 2010 | AIDS cases | Brazilian Health Ministry |
Brazil | brazil_externalCauses_2010 | 5286 | Y | 2010 | Death by external causes | Brazilian Health Ministry |
Brazil | brazil_gdp_2010 | 5565 | Y | 2010 | GDP | Brazilian Health Ministry |
Brazil | covid19_brazil | 5570 | N | 2021 | COVID19 cases | Brasil.io and wcota |
Chile | covid19_chile | 346 | N | 2021 | COVID19 cases | MinCiencia |
Europe | eurostat_cinema_seats | 418 | N | 2011 | Cinema seats | Eurostat |
Europe | eurostat_cinema_attendance | 221 | N | 2011 | Attendance to cinemas | Eurostat |
Europe | eurostat_museum_visitors | 443 | N | 2011 | Visitors to museums | Eurostat |
Europe | eurostat_theaters | 398 | N | 2011 | Theaters | Eurostat |
Europe | eurostat_libraries | 597 | N | 2011 | Libraries | Eurostat |
Europe | eurostat_landarea | 298 | N | 1990 to 2023 | Land Area | Eurostat |
Europe | eurostat_totalarea | 298 | N | 1990 to 2023 | Total Area | Eurostat |
Europe | eurostat_trademarks | 273 | N | 1996 to 2016 | Trademarks | Eurostat |
Europe | eurostat_gdp | 238 | N | 2000 to 2021 | GDP | Eurostat |
Europe | eurostat_gdp_purchasing_power | 238 | N | 2000 to 2021 | GDP Purchasing Power Adjusted | Eurostat |
Europe | eurostat_gdp_perperson | 238 | N | 2000 to 2021 | GDP Per Person | Eurostat |
Europe | eurostat_gdp_perperson_purchasing_power | 238 | N | 2000 to 2021 | GDP Per Person Purchasing Power Adjusted | Eurostat |
Europe | eurostat_patents | 278 | N | 1977 to 2012 | Patents | Eurostat |
Europe | eurostat_gva | 255 | N | 1995 to 2021 | GVA | Eurostat |
Europe | eurostat_burglaries | 217 | N | 2008 to 2020 | Burglaries | Eurostat |
Europe | eurostat_robberies | 217 | N | 2008 to 2020 | Robberies | Eurostat |
Europe | eurostat_homicides | 217 | N | 2008 to 2020 | Homicides | Eurostat |
Europe | eurostat_motor_theft | 217 | N | 2008 to 2020 | Motor Theft | Eurostat |
Europe | eurostat_lower_secondary_education | 286 | N | 2009 to 2022 | Lower Secondary Education | Eurostat |
Europe | eurostat_upper_secondary_to_non_tertiary_education | 286 | N | 2009 to 2022 | Upper Secondary to Non Tertiary Education | Eurostat |
Europe | eurostat_upper_secondary_to_tertiary_education | 286 | N | 2009 to 2022 | Upper Secondary to Tertiary Education | Eurostat |
Europe | eurostat_tertiary_education | 286 | N | 2009 to 2022 | Tertiary Education | Eurostat |
Germany | germany_gdp | 108 | N | 2012 | GDP | German Statistical Office |
OECD | oecd_gdp | 275 | N | 2010 | GDP | OECD |
OECD | oecd_patents | 218 | N | 2008 | Patents filed | OECD |
UK | uk_income | 100 | N | 2000 to 2011 | Weekly income | Arcaute et al. |
UK | uk_patents | 93 | N | 2000 to 2011 | Patents filed | Arcaute et al. |
UK | uk_train | 97 | N | 2000 to 2011 | Train stations | Arcaute et al. |
USA | usa_gdp | 381 | Y | 2013 | GDP | BEA |
USA | usa_miles | 459 | Y | 2013 | Length of roads in miles | FHWA |
USA | covid19_USA | 3131 | N | 2021 | Covid19 cases | Kaggle |
USA | usa_area | 938 | N | 2019 | Area | US Census Bureau |
USA | usa_travel_time | 632 | N | 2010 to 2022 | Travel Time | Census |
USA | usa_poverty | 505 | N | 2010 to 2022 | Poverty | Census |
USA | usa_mean_income | 632 | N | 2010 to 2022 | Mean Income | Census |
USA | usa_median_income | 632 | N | 2010 to 2022 | Median Income | Census |
USA | usa_highschool_education | 433 | N | 2015 to 2022 | High School Education | Census |
The data is stored in the folder `data`, where more information about its sources and filtering can be found. It consists of Python packages (e.g. `brazil`). Each package has functions that return the data there, defined in the `__init__.py` of the package.
The data is always a tuple (x, y) of numpy arrays of the same size, where x is always population.
For example, to get the population and GDP of Brazilian cities in 2010, use:

    import brazil
    x, y = brazil.gdp(2010)

For the spatial data, an additional array (l) indicates the location (latitude and longitude) of the urban area.
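Spatial models need distances between urban areas, which can be derived from such a location array. The sketch below computes great-circle (haversine) distances with NumPy. It assumes that `l` has shape `(n, 2)` with latitude and longitude in degrees, which is an assumption made here for illustration; the function name `pairwise_haversine_km` is hypothetical and the repository may compute distances differently.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def pairwise_haversine_km(l):
    """Great-circle distances (km) between all pairs of rows of l.

    Assumes l has shape (n, 2): latitude and longitude in degrees.
    Hypothetical helper for illustration; not part of the repository.
    """
    coords = np.radians(np.asarray(l, dtype=float))
    lat = coords[:, 0][:, None]  # column vector, shape (n, 1)
    lon = coords[:, 1][:, None]
    dlat = lat - lat.T           # broadcast to shape (n, n)
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point excursions outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Approximate coordinates of Sydney, Melbourne, and Brisbane (illustrative).
l = np.array([[-33.87, 151.21], [-37.81, 144.96], [-27.47, 153.03]])
d = pairwise_haversine_km(l)
print(np.round(d))
```

The result is a symmetric matrix with zeros on the diagonal, so `d[i, j]` is the distance between urban areas `i` and `j`.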
New data can be added as a .csv file to
`new_dataset/generic_dataset.txt` (for three columns: city name,
or
`new_dataset2/generic_dataset.txt` (for two columns:
For the spatial analysis, import your results as
The easiest way to interact with and run the code is through the notebooks in the folder `notebooks`. Follow the link in the "Notebook-*-Colab.ipynb" files to run them in Colab, or download this repository and run them using Jupyter. The source Python code is in the folder `src`.
All inference is performed based on the likelihood of different models. The module `best_parameters.py` contains the definition of the likelihood functions of the models, the minimization algorithm, and the parameters we use in it. The bootstrap used to estimate error bars is also defined in this module, in `minimize_with_errors`.
The bootstrap for the person model is implemented in `pvalue_population.py`. The likelihood and minimization of the spatial models appear in `spatial.py`.
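The idea of a bootstrap for error bars can be sketched in a few lines. This is a generic illustration, not the code in `minimize_with_errors`: it resamples cities with replacement and, as a stand-in for the likelihood minimization, uses a least-squares slope in log-log space. The names `estimate_beta` and `bootstrap_beta` and the synthetic data are assumptions introduced here.

```python
import numpy as np


def estimate_beta(x, y):
    """Stand-in estimator: least-squares slope of ln(y) against ln(x).

    In the repository the estimate comes from minimizing a model likelihood;
    this simpler estimator only illustrates the resampling scheme.
    """
    beta, _ = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(beta)


def bootstrap_beta(x, y, samples=200, seed=0, estimator=estimate_beta):
    """Return (estimate, bootstrap standard error) for the scaling exponent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    replicates = np.empty(samples)
    for i in range(samples):
        # Draw n cities with replacement and re-estimate beta on the resample.
        idx = rng.integers(0, n, size=n)
        replicates[i] = estimator(x[idx], y[idx])
    return estimator(x, y), float(replicates.std(ddof=1))


# Synthetic data (illustrative only), generated with beta = 1.1.
rng = np.random.default_rng(1)
x = np.exp(rng.normal(12.0, 1.0, size=300))
y = 0.01 * x**1.1 * rng.lognormal(0.0, 0.3, size=300)

beta_hat, beta_err = bootstrap_beta(x, y)
print(f"beta = {beta_hat:.3f} +/- {beta_err:.3f}")
```

Passing a different `estimator` function is enough to reuse the same resampling loop with another model.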
The different analyses we perform, as well as the list of databases we use, are defined in `analysis.py`. The general setting is defined in `LikelihoodAnalysis` and its respective methods.
For example, to get $\beta$ estimated by the Log-normal model with free $\delta$, along with other statistical information, use:

    >>> from analysis import LogNormalAnalysis
    >>> analysis = LogNormalAnalysis('brazil_aids_2010', required_successes=512)
    >>> analysis.beta[0]
    >>> analysis.p_value
    >>> analysis.bic

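A BIC value is mainly useful for comparing models fitted to the same data: the model with the lower BIC is preferred. The sketch below uses the standard definition $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$ to compare a free exponent against the linear case $\beta = 1$, assuming Gaussian residuals in log space. This is a simplification for illustration; the repository's models and its exact BIC computation may differ, and all function names here are hypothetical.

```python
import numpy as np


def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, with variance set to its MLE."""
    n = residuals.size
    sigma2 = np.mean(residuals**2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)


def bic(log_likelihood, n_params, n_obs):
    """Standard Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * np.log(n_obs) - 2.0 * log_likelihood


def compare_free_vs_linear(x, y):
    """Return (BIC with free beta, BIC with beta fixed to 1)."""
    log_x, log_y = np.log(x), np.log(y)
    n = log_x.size
    # Free model: parameters are ln(alpha), beta, and the noise variance.
    beta, log_alpha = np.polyfit(log_x, log_y, deg=1)
    res_free = log_y - (log_alpha + beta * log_x)
    # Linear model: beta = 1, so only ln(alpha) and the variance are fitted.
    res_linear = log_y - log_x
    res_linear = res_linear - res_linear.mean()
    return (bic(gaussian_log_likelihood(res_free), 3, n),
            bic(gaussian_log_likelihood(res_linear), 2, n))


# Synthetic data (illustrative only), generated with beta = 1.2.
rng = np.random.default_rng(2)
x = np.exp(rng.normal(12.0, 1.0, size=400))
y = 0.02 * x**1.2 * rng.lognormal(0.0, 0.2, size=400)

bic_free, bic_linear = compare_free_vs_linear(x, y)
print(f"BIC free beta: {bic_free:.1f}, BIC beta=1: {bic_linear:.1f}")
```

For data generated with a non-linear exponent, the free-exponent model should obtain the lower BIC despite its extra parameter.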
You can run the Jupyter Notebook (or Open Notebook in Colab), or run `python -m analyze`. For example,

    MODEL=LogNormalAnalysis ERROR_SAMPLES=10 python -m analyze

runs the `LogNormal` model with 10 samples for bootstrap on the new dataset. It prints the best $\beta$, the bootstrap error for $\beta$, the p-value, and the BIC for the specific model (the script explains how to select the model).
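The command above passes its settings through environment variables. How a Python script can read such variables is sketched below. This is a generic illustration and not the code of `analyze`: the function `read_settings`, its default values, and its validation are assumptions made here.

```python
import os


def read_settings(environ=os.environ):
    """Read run settings from a mapping of environment variables.

    Hypothetical helper: the defaults below are illustrative assumptions,
    not the defaults used by the repository's script.
    """
    model = environ.get("MODEL", "LogNormalAnalysis")
    # Environment variables are always strings, so convert explicitly.
    error_samples = int(environ.get("ERROR_SAMPLES", "10"))
    if error_samples < 1:
        raise ValueError("ERROR_SAMPLES must be a positive integer")
    return {"model": model, "error_samples": error_samples}


# Passing a plain dict makes the function easy to try without touching
# the real environment.
settings = read_settings({"MODEL": "LogNormalAnalysis", "ERROR_SAMPLES": "10"})
print(settings)
```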
Pre-computed results are stored in `_results`. If you want to reproduce some of the results stored in `_results`, you can delete the respective analysis in the directory and run (this may take some time):

    python -m analysis_run

This requires some environment variables, which are documented when you run it.
This repository contains both data and code from the papers:
[1] Is this scaling non-linear? by Jorge C. Leitão, José M. Miotto, Martin Gerlach, and Eduardo G. Altmann, Royal Society Open Science 3, 150649 (2016). | See Notebook | Open Notebook in Colab
[2] Spatial interactions in urban scaling laws, by Eduardo G. Altmann, PLOS ONE 15, e0243390 (2020). | See Notebook | Open Notebook in Colab
And also specific projects:
Results for COVID-19 data, obtained by Jimena Espinoza in Semester 2, 2021 | See Notebook | Open Notebook in Colab.
Results considering area and population as measures of city size, obtained by Isaac Riad in 2024 | See Notebook.
Contributions are welcome. If results of this repository are used, please cite it and the corresponding publications.