
Commit 271daa4

Initial commit
0 parents  commit 271daa4

29 files changed: 34,791 additions, 0 deletions

README.md

Lines changed: 52 additions & 0 deletions

# Machine Learning simulates Agent-based Model towards Policy-Making

Submitted. Under review.

### Proposed scheme

![](analysis/Model_Proposal.png)

### Results

![](analysis/graph_sorted_POLICIES_no_policy.png)

### Requires

````
python==3.7 numpy==1.20.2 pandas==1.2.4 matplotlib==3.3.4 scipy==1.6.2 scikit-learn==0.24.2
````
The program:

1. Reads the output of an ABM model and its parameter configuration
2. Creates a socioeconomic optimal output based on two ABM results of the modeler's choice
3. Organizes the data as X and Y matrices
4. Trains a set of Machine Learning algorithms
5. Generates random parameter configurations based on the mean and standard deviation of the original parameters
6. Applies the trained ML algorithms to the set of randomly generated data
7. Outputs the mean values for the actual data, the randomly generated data, and the optimal and non-optimal cases

A schematic sketch of this pipeline is shown below.
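The sketch is illustrative only: the function, the quantile rule used to label runs as optimal, and the choice of classifier are assumptions made for the example, not the actual contents of `main.py` or `machines.py`.

```python
# Illustrative sketch of the seven steps; names, the labelling rule and the
# model choice are assumptions, not the repository's implementation.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def run_pipeline(abm_results: pd.DataFrame, params: list, target_cols: list,
                 n_random: int = 10_000) -> pd.DataFrame:
    # Steps 1-3: label each run as optimal (1) or not (0) from the chosen
    # output indicators, then organize X (parameters) and y (label).
    composite = abm_results[target_cols].mean(axis=1)
    y = (composite >= composite.quantile(0.75)).astype(int)
    X = abm_results[params]

    # Step 4: train an ML surrogate of the ABM.
    model = RandomForestClassifier().fit(X, y)

    # Step 5: draw random configurations from each parameter's mean and std.
    rng = np.random.default_rng(0)
    random_X = pd.DataFrame(
        {p: rng.normal(X[p].mean(), X[p].std(), n_random) for p in params})

    # Steps 6-7: apply the surrogate and report mean parameter values for the
    # predicted optimal (1) and non-optimal (0) cases.
    random_X['Tree'] = model.predict(random_X[params])
    return random_X.groupby('Tree')[params].mean()
```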
The original database is large (63.7 GB).
Thus, we provide pre-processed data to run the program.
The code that performs the data selection, however, is available here in `preparing_data.py`.
## Running the program

`python main.py`

Output will be produced in the `pre_processed` folder.
With access to the original 63.7 GB of data, the target parameters can be changed in `main.py`.
We have chosen GDP and the Gini coefficient as, together, they carry a powerful, simple message: larger production with less inequality.
Further work, with a combination (PCA) of output indicators, is being developed (PolicyMix).

You may change the parameters of the ML in `machines.py`,
or the size of the sample in `generating_random_conf.py`; a sketch of that sampling step is shown below.
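The sampling step is small enough to sketch. The snippet below assumes each parameter is drawn independently from a normal distribution fitted to the original sample, as described in step 5 above; it is not a copy of `generating_random_conf.py`, and the default sample size is arbitrary.

```python
# Hedged sketch of the random-configuration step, assuming independent
# normal draws per parameter; not the actual generating_random_conf.py.
import numpy as np
import pandas as pd


def generate_random_conf(original: pd.DataFrame, n: int = 100_000,
                         seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    return pd.DataFrame({col: rng.normal(original[col].mean(),
                                         original[col].std(), n)
                         for col in original.columns})
```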
## Figures

1. To produce Figure 2, `cd analysis` and run `python read_comparison.py` to generate `IQR.csv`
2. Then, run `python plot_alternative_for_table.py`
3. To produce Figure 4, run `python means_comparison.py` and `python counting.py` to generate the input files
4. Then, run `python plot_z_score_parameters.py`

analysis/IQR.csv

Lines changed: 189 additions & 0 deletions
Large diffs are not rendered by default.

analysis/Model_Proposal.png

97.9 KB

analysis/counting.py

Lines changed: 112 additions & 0 deletions

import pandas as pd

import groups_cols
from groups_cols import abm_dummies as dummies
from groups_cols import abm_params as params


def getting_counting(data, name):
    """ Produces a csv with information regarding each dummy, i.e., when the dummy is active (=1).
    The final csv has three columns: sample size, optimal and non-optimal.
    All columns are percentages of a total: sample size in relation to the whole sample; optimal and
    non-optimal in relation to the subsample of that specific dummy.

    For example, a sample size of 0.12 for POLICIES_buy means that 12% of the sample had that dummy
    active; an optimal of 0.02 and a non-optimal of 0.98 mean that, when that dummy is active and the
    policy used is buy, 98% of those samples fall under the non-optimal category.

    :param data: base csv
    :param name: name of the file
    :return: returns nothing, but saves the csv
    """
    table = pd.DataFrame(columns=['size', 'optimal', 'non_optimal', 'optimal_count', 'non_optimal_count'])
    for key in dummies:
        for each in dummies[key]:
            sample_size = len(data[data[each] == 1]) / len(data)
            optimal = len(data[(data[each] == 1) & (data['Tree'] == 1)])
            non_optimal = len(data[(data[each] == 1) & (data['Tree'] == 0)])
            total = optimal + non_optimal
            print(f'{each}: size {sample_size:.04f}: optimal {optimal/total:.04f}: '
                  f'non-optimal {non_optimal/total:.04f}: optimal_count {optimal} non-optimal_count {non_optimal}')
            table.loc[each, 'size'] = sample_size
            table.loc[each, 'optimal'] = optimal / total
            table.loc[each, 'non_optimal'] = non_optimal / total
            table.loc[each, 'optimal_count'] = optimal
            table.loc[each, 'non_optimal_count'] = non_optimal
    table.to_csv(f'../pre_processed_data/counting_{name}.csv', sep=';')


# Parameters analysis
def coefficient_variation_comparison(simulated, ml):
    """ Compares the ABM simulated results to the ML surrogate results in order to identify the
    differences between the two methods: how far does the mean of the optimal cases fall from the
    full-sample mean, measured in standard deviations?

    Uses the standard score: (optimal sample mean - full sample mean) / full sample standard deviation.

    Also saves the difference between the two scores and the absolute optimal means for simulated and ML.

    :param simulated: the simulated database in csv
    :param ml: the ML surrogate database in csv
    :return: returns nothing, but saves the csv
    """
    table = pd.DataFrame(columns=['simulated_optimal', 'ml_optimal', 'difference'])
    for param in params:
        sim_mean = simulated[param].mean()
        sim_std = simulated[param].std()
        sim_optimal_mean = simulated[simulated['Tree'] == 1][param].mean()
        ml_mean = ml[param].mean()
        ml_std = ml[param].std()
        ml_optimal_mean = ml[ml['Tree'] == 1][param].mean()
        print(f'{param}: {(sim_optimal_mean - sim_mean) / sim_std:.06f}')
        print(f'{param}: {(ml_optimal_mean - ml_mean) / ml_std:.06f}')
        table.loc[param, 'simulated_optimal'] = (sim_optimal_mean - sim_mean) / sim_std
        table.loc[param, 'ml_optimal'] = (ml_optimal_mean - ml_mean) / ml_std
        table.loc[param, 'difference'] = table.loc[param, 'simulated_optimal'] - table.loc[param, 'ml_optimal']
        table.loc[param, 'abs_sim_optimal'] = sim_optimal_mean
        table.loc[param, 'abs_ml_optimal'] = ml_optimal_mean
    table.to_csv('../pre_processed_data/parameters_comparison.csv', sep=';')
    table.reset_index(inplace=True)
    table['Parameters'] = table['index'].map(groups_cols.abm_params_show)
    to_latex = table[['Parameters', 'abs_sim_optimal', 'abs_ml_optimal']]
    to_latex = to_latex.sort_values(by='Parameters')
    to_latex.set_index('Parameters', inplace=True)
    to_latex.to_latex('../pre_processed_data/parameters_comparison_latex.txt',
                      float_format="{:0.3f}".format)


# Parameters analysis
def normalize_and_optimal(simulated, ml):
    """ Min-max normalizes each parameter to [0, 1], then compares the optimal-case means of the
    normalized parameters between the simulated and the ML surrogate databases. """
    table = pd.DataFrame(columns=['z_simulated_optimal', 'z_ml_optimal'])
    for param in params:
        # min-max normalization of the parameter into [0, 1]
        simulated.loc[:, f'n_{param}'] = (simulated[param] - simulated[param].min()) / \
                                         (simulated[param].max() - simulated[param].min())
        ml.loc[:, f'n_{param}'] = (ml[param] - ml[param].min()) / (ml[param].max() - ml[param].min())
        sim_optimal_mean = simulated[simulated['Tree'] == 1][f'n_{param}'].mean()
        ml_optimal_mean = ml[ml['Tree'] == 1][f'n_{param}'].mean()
        print(f'{param}: {sim_optimal_mean:.06f}')
        print(f'{param}: {ml_optimal_mean:.06f}')
        table.loc[param, 'z_simulated_optimal'] = sim_optimal_mean
        table.loc[param, 'z_ml_optimal'] = ml_optimal_mean
        table.loc[param, 'difference'] = sim_optimal_mean - ml_optimal_mean
        table.loc[param, 'abs_difference'] = abs(sim_optimal_mean - ml_optimal_mean)
    table.to_csv('../pre_processed_data/parameters_norm_optimal.csv', sep=';')
    table.reset_index(inplace=True)
    table['Parameters'] = table['index'].map(groups_cols.abm_params_show)
    to_latex = table[['Parameters', 'z_simulated_optimal', 'z_ml_optimal']]
    to_latex = to_latex.sort_values(by='Parameters')
    to_latex.set_index('Parameters', inplace=True)
    to_latex.to_latex('../pre_processed_data/parameters_norm_optimal_latex.txt',
                      float_format="{:0.3f}".format)


if __name__ == '__main__':
    # c holds the simulated ABM results; th holds the ML ('Tree') surrogate results
    th = pd.read_csv('../output/Tree_gdp_index_75_gini_index_25_1000000_temp_stats.csv', sep=';')
    c = pd.read_csv('../output/current_gdp_index_75_gini_index_25_1000000_temp_stats.csv', sep=';')
    c.rename(columns={'0': 'Tree'}, inplace=True)
    getting_counting(th, 'Tree')
    getting_counting(c, 'Current')
    coefficient_variation_comparison(c, th)
    normalize_and_optimal(c, th)
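To make the expected input concrete, the toy frame below shows the shape `getting_counting` operates on: one row per run, 0/1 dummy columns, and a 0/1 `Tree` column flagging the optimal runs. The frame and its numbers are invented purely for illustration; the real inputs are the CSV files loaded in the `__main__` block above.

```python
import pandas as pd

# Invented toy data: four runs, two policy dummies, 'Tree' == 1 marks optimal runs.
toy = pd.DataFrame({'POLICIES_buy':       [1, 1, 0, 0],
                    'POLICIES_no_policy': [0, 0, 1, 1],
                    'Tree':               [1, 0, 0, 0]})

# The per-dummy tallies getting_counting computes, done by hand for one dummy:
each = 'POLICIES_buy'
sample_size = len(toy[toy[each] == 1]) / len(toy)              # 2/4 = 0.50
optimal = len(toy[(toy[each] == 1) & (toy['Tree'] == 1)])      # 1
non_optimal = len(toy[(toy[each] == 1) & (toy['Tree'] == 0)])  # 1
# optimal / (optimal + non_optimal) = 0.5: half of the 'buy' runs are optimal.
```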
(binary file, 370 KB; filename not rendered)

analysis/groups_cols.py

Lines changed: 177 additions & 0 deletions

abm_dummies = {'policies': ['POLICIES_buy',
                            'POLICIES_rent',
                            'POLICIES_wage',
                            'POLICIES_no_policy'],
               'interest': ['INTEREST_fixed',
                            'INTEREST_real',
                            'INTEREST_nominal'],
               'acps': ['PROCESSING_ACPS_BELO HORIZONTE',
                        'PROCESSING_ACPS_FORTALEZA',
                        'PROCESSING_ACPS_PORTO ALEGRE',
                        'PROCESSING_ACPS_CAMPINAS',
                        'PROCESSING_ACPS_SALVADOR',
                        'PROCESSING_ACPS_RECIFE',
                        'PROCESSING_ACPS_SAO PAULO',
                        'PROCESSING_ACPS_JOINVILLE',
                        'PROCESSING_ACPS_CAMPO GRANDE',
                        'PROCESSING_ACPS_JUNDIAI',
                        'PROCESSING_ACPS_FEIRA DE SANTANA',
                        'PROCESSING_ACPS_IPATINGA',
                        'PROCESSING_ACPS_LONDRINA',
                        'PROCESSING_ACPS_SOROCABA',
                        'PROCESSING_ACPS_JOAO PESSOA',
                        'PROCESSING_ACPS_SAO JOSE DO RIO PRETO',
                        'PROCESSING_ACPS_MACEIO',
                        'PROCESSING_ACPS_SAO JOSE DOS CAMPOS',
                        'PROCESSING_ACPS_ILHEUS - ITABUNA',
                        'PROCESSING_ACPS_SAO LUIS',
                        'PROCESSING_ACPS_UBERLANDIA',
                        'PROCESSING_ACPS_MARINGA',
                        'PROCESSING_ACPS_VITORIA',
                        'PROCESSING_ACPS_CUIABA',
                        'PROCESSING_ACPS_BELEM',
                        'PROCESSING_ACPS_NOVO HAMBURGO - SAO LEOPOLDO',
                        'PROCESSING_ACPS_TERESINA',
                        'PROCESSING_ACPS_MANAUS',
                        'PROCESSING_ACPS_BRASILIA',
                        'PROCESSING_ACPS_ARACAJU',
                        'PROCESSING_ACPS_CAMPINA GRANDE',
                        'PROCESSING_ACPS_CAMPOS DOS GOYTACAZES',
                        'PROCESSING_ACPS_CAXIAS DO SUL',
                        'PROCESSING_ACPS_CRAJUBAR',
                        'PROCESSING_ACPS_CURITIBA',
                        'PROCESSING_ACPS_FLORIANOPOLIS',
                        'PROCESSING_ACPS_GOIANIA',
                        'PROCESSING_ACPS_JUIZ DE FORA',
                        'PROCESSING_ACPS_MACAPA',
                        'PROCESSING_ACPS_NATAL',
                        'PROCESSING_ACPS_PELOTAS - RIO GRANDE',
                        'PROCESSING_ACPS_PETROLINA - JUAZEIRO',
                        'PROCESSING_ACPS_RIBEIRAO PRETO',
                        'PROCESSING_ACPS_RIO DE JANEIRO',
                        'PROCESSING_ACPS_SANTOS',
                        'PROCESSING_ACPS_VOLTA REDONDA - BARRA MANSA'],
               'r_licenses': ['T_LICENSES_PER_REGION_False',
                              'T_LICENSES_PER_REGION_True',
                              'T_LICENSES_PER_REGION_random'],
               'days': ['STARTING_DAY_2000-01-01',
                        'STARTING_DAY_2010-01-01'],
               'r_municipal_fund': ['FPM_DISTRIBUTION_False',
                                    'FPM_DISTRIBUTION_True'],
               'r_metro_fund': ['ALTERNATIVE0_False',
                                'ALTERNATIVE0_True']}

abm_dummies_show = {'POLICIES_buy': 'Policy: buy',
                    'POLICIES_rent': 'Policy: rent',
                    'POLICIES_wage': 'Policy: wage',
                    'POLICIES_no_policy': 'Policy: none',
                    'PROCESSING_ACPS_BELO HORIZONTE': 'Belo Horizonte',
                    'PROCESSING_ACPS_FORTALEZA': 'Fortaleza',
                    'PROCESSING_ACPS_PORTO ALEGRE': 'Porto Alegre',
                    'PROCESSING_ACPS_CAMPINAS': 'Campinas',
                    'PROCESSING_ACPS_SALVADOR': 'Salvador',
                    'PROCESSING_ACPS_RECIFE': 'Recife',
                    'PROCESSING_ACPS_SAO PAULO': 'São Paulo',
                    'PROCESSING_ACPS_JOINVILLE': 'Joinville',
                    'PROCESSING_ACPS_CAMPO GRANDE': 'Campo Grande',
                    'PROCESSING_ACPS_JUNDIAI': 'Jundiai',
                    'PROCESSING_ACPS_FEIRA DE SANTANA': 'Feira de Santana',
                    'PROCESSING_ACPS_IPATINGA': 'Ipatinga',
                    'PROCESSING_ACPS_LONDRINA': 'Londrina',
                    'PROCESSING_ACPS_SOROCABA': 'Sorocaba',
                    'PROCESSING_ACPS_JOAO PESSOA': 'João Pessoa',
                    'PROCESSING_ACPS_SAO JOSE DO RIO PRETO': 'SJRP',
                    'PROCESSING_ACPS_MACEIO': 'Maceio',
                    'PROCESSING_ACPS_SAO JOSE DOS CAMPOS': 'SJC',
                    'PROCESSING_ACPS_ILHEUS - ITABUNA': 'Ilheus-Itabuna',
                    'PROCESSING_ACPS_SAO LUIS': 'Sao Luis',
                    'PROCESSING_ACPS_UBERLANDIA': 'Uberlandia',
                    'PROCESSING_ACPS_MARINGA': 'Maringá',
                    'PROCESSING_ACPS_VITORIA': 'Vitória',
                    'PROCESSING_ACPS_CUIABA': 'Cuiabá',
                    'PROCESSING_ACPS_BELEM': 'Belém',
                    'PROCESSING_ACPS_NOVO HAMBURGO - SAO LEOPOLDO': 'NH-SL',
                    'PROCESSING_ACPS_TERESINA': 'Teresina',
                    'PROCESSING_ACPS_MANAUS': 'Manaus',
                    'PROCESSING_ACPS_BRASILIA': 'Brasília',
                    'T_LICENSES_PER_REGION_False': 'Licenses: False',
                    'T_LICENSES_PER_REGION_True': 'Licenses: True',
                    'T_LICENSES_PER_REGION_random': 'Licenses: Random',
                    'STARTING_DAY_2000-01-01': 'Jan. 2000',
                    'STARTING_DAY_2010-01-01': 'Jan. 2010',
                    'FPM_DISTRIBUTION_False': 'FPM: False',
                    'FPM_DISTRIBUTION_True': 'FPM: True',
                    'ALTERNATIVE0_False': 'Alternative0: False',
                    'ALTERNATIVE0_True': 'Alternative0: True',
                    'INTEREST_fixed': 'Interest: fixed',
                    'INTEREST_real': 'Interest: real',
                    'INTEREST_nominal': 'Interest: nominal',
                    'PROCESSING_ACPS_ARACAJU': 'Aracaju',
                    'PROCESSING_ACPS_CAMPINA GRANDE': 'Campina Grande',
                    'PROCESSING_ACPS_CAMPOS DOS GOYTACAZES': 'Campos',
                    'PROCESSING_ACPS_CAXIAS DO SUL': 'Caxias do Sul',
                    'PROCESSING_ACPS_CRAJUBAR': 'Crato',
                    'PROCESSING_ACPS_CURITIBA': 'Curitiba',
                    'PROCESSING_ACPS_FLORIANOPOLIS': 'Florianópolis',
                    'PROCESSING_ACPS_GOIANIA': 'Goiânia',
                    'PROCESSING_ACPS_JUIZ DE FORA': 'Juiz de Fora',
                    'PROCESSING_ACPS_MACAPA': 'Macapá',
                    'PROCESSING_ACPS_NATAL': 'Natal',
                    'PROCESSING_ACPS_PELOTAS - RIO GRANDE': 'Pelotas',
                    'PROCESSING_ACPS_PETROLINA - JUAZEIRO': 'Petrolina-Juazeiro',
                    'PROCESSING_ACPS_RIBEIRAO PRETO': 'Ribeirão Preto',
                    'PROCESSING_ACPS_RIO DE JANEIRO': 'Rio de Janeiro',
                    'PROCESSING_ACPS_SANTOS': 'Santos',
                    'PROCESSING_ACPS_VOLTA REDONDA - BARRA MANSA': 'Volta Redonda',
                    'all': 'All'}

# Parameters left out of abm_params:
# 'CONSTRUCTION_ACC_CASH_FLOW',
# 'LOT_COST',
# 'TAX_PROPERTY',
abm_params = ['HIRING_SAMPLE_SIZE',
              'LABOR_MARKET',
              'LOAN_PAYMENT_TO_PERMANENT_INCOME',
              'MARKUP',
              'MAX_LOAN_TO_VALUE',
              'MUNICIPAL_EFFICIENCY_MANAGEMENT',
              'NEIGHBORHOOD_EFFECT',
              'OFFER_SIZE_ON_PRICE',
              'PCT_DISTANCE_HIRING',
              'PERCENTAGE_ACTUAL_POP',
              'PERCENTAGE_ENTERING_ESTATE_MARKET',
              'PERCENT_CONSTRUCTION_FIRMS',
              'POLICY_COEFFICIENT',
              'POLICY_DAYS',
              'POLICY_QUANTILE',
              'PRIVATE_TRANSIT_COST',
              'PRODUCTIVITY_EXPONENT',
              'PRODUCTIVITY_MAGNITUDE_DIVISOR',
              'PUBLIC_TRANSIT_COST',
              'SIZE_MARKET',
              'STICKY_PRICES',
              'TAX_ESTATE_TRANSACTION',
              'TOTAL_DAYS']

abm_params_show = {'HIRING_SAMPLE_SIZE': 'Hiring sample size',
                   'LABOR_MARKET': 'Frequency of firms entering the labor market',
                   'LOAN_PAYMENT_TO_PERMANENT_INCOME': 'Loan/permanent income ratio',
                   'MARKUP': 'Markup',
                   'MAX_LOAN_TO_VALUE': 'Maximum Loan-to-Value',
                   'MUNICIPAL_EFFICIENCY_MANAGEMENT': 'Municipal efficiency management',
                   'NEIGHBORHOOD_EFFECT': 'Neighborhood effect',
                   'OFFER_SIZE_ON_PRICE': 'Supply-demand effect on real estate prices',
                   'PCT_DISTANCE_HIRING': '% firms analyzing commuting distance',
                   'PERCENTAGE_ACTUAL_POP': '% of population',
                   'PERCENTAGE_ENTERING_ESTATE_MARKET': '% families entering real estate market',
                   'PERCENT_CONSTRUCTION_FIRMS': '% of construction firms',
                   'POLICY_COEFFICIENT': 'Policy coefficient',
                   'POLICY_DAYS': 'Policy days',
                   'POLICY_QUANTILE': 'Policy Quantile',
                   'PRIVATE_TRANSIT_COST': 'Cost of private transit',
                   'PRODUCTIVITY_EXPONENT': 'Productivity: exponent',
                   'PRODUCTIVITY_MAGNITUDE_DIVISOR': 'Productivity: divisor',
                   'PUBLIC_TRANSIT_COST': 'Cost of public transit',
                   'SIZE_MARKET': 'Perceived market size',
                   'STICKY_PRICES': 'Sticky Prices',
                   'TAX_ESTATE_TRANSACTION': 'Tax over estate transactions',
                   'TOTAL_DAYS': 'Total Days'}
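A property implied by this grouping: within each key of `abm_dummies`, the dummies appear to be a one-hot encoding of a single categorical ABM setting (one policy, one interest regime, one metropolitan region, and so on), so exactly one dummy per group should be active in each run. Assuming a DataFrame `df` that carries these columns, a quick sanity check could look like the hypothetical helper below (not part of the repository):

```python
from groups_cols import abm_dummies


def check_one_hot(df):
    """Hypothetical sanity check: each dummy group should sum to 1 per row."""
    for group, cols in abm_dummies.items():
        assert (df[cols].sum(axis=1) == 1).all(), f'{group} is not one-hot'
```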
