-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introducing federal senate script #53
Changes from 3 commits
985d73a
81bbe55
1190390
119f653
bd5f378
7b72148
2613750
a7b9dcb
41dc9db
1199044
999ada3
ca1921a
798bf71
e772328
b1c7b7d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
import os.path | ||
from urllib.request import urlretrieve | ||
import numpy as np | ||
import pandas as pd | ||
|
||
class FederalSenateDataset: | ||
URL = 'http://www.senado.gov.br/transparencia/LAI/verba/{}.csv' | ||
|
||
def __init__(self, path): | ||
self.path = path | ||
|
||
def fetch(self): | ||
urls = [self.URL.format(year) for year in range(2008, 2018)] | ||
filename_from_url = lambda url: 'federal-senate-{}'.format(url.split('/')[-1]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Having a Something among these lines: from urllib.parse import urlsplit
…
def fetch(self):
…
url_paths = (urlsplit(url).path for url in urls)
filenames = map(os.path.basename, url_paths)
… |
||
filenames = map(filename_from_url, urls) | ||
|
||
for url, filename in zip(urls, filenames): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. <3 |
||
csv_file_path = os.path.join(self.path, filename) | ||
urlretrieve(url, csv_file_path) | ||
|
||
def translate(self): | ||
filenames = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oops… this is the 2nd time I see this |
||
for filename in filenames: | ||
csv_path = os.path.join(self.path, filename) | ||
self.__translate_file(csv_path) | ||
|
||
def __translate_file(self, csv_path): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This internal method looks a little big, could we break it to make the parts smaller and more meaningful and clear? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So that is a thing that I was thinking about, but is a translation, so we need it :T |
||
output_file_path = csv_path \ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm afraid this line break here is unnecessary… There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
.replace('.csv', '.xz') | ||
|
||
|
||
data = pd.read_csv(csv_path, | ||
sep=';', | ||
encoding = "ISO-8859-1", | ||
skiprows=1) | ||
|
||
data.columns = map(str.lower, data.columns) | ||
|
||
data.rename(columns={ | ||
'ano': 'year', | ||
'mes': 'month', | ||
'senador': 'congressperson_name', | ||
'tipo_despesa': 'expense_type', | ||
'cnpj_cpf': 'cnpj_cpf', | ||
'fornecedor': 'supplier', | ||
'documento': 'document_id', | ||
'data': 'date', | ||
'detalhamento': 'expense_details', | ||
'valor_reembolsado': 'reimbursement_value', | ||
}, inplace=True) | ||
|
||
data['expense_type'] = data['expense_type'].astype('category') | ||
|
||
data['expense_type'] = \ | ||
data['expense_type'].astype('category') | ||
|
||
categories = { | ||
'Aluguel de imóveis para escritório político, compreendendo despesas concernentes a eles.': | ||
'Rent of real estate for political office, comprising expenses concerning them', | ||
'Aquisição de material de consumo para uso no escritório político, inclusive aquisição ou locação de software, despesas postais, aquisição de publicações, locação de móveis e de equipamentos. ': | ||
'Acquisition of consumables for use in the political office, including acquisition or leasing of software, postal expenses, acquisition of publications, rental of furniture and equipment', | ||
'Contratação de consultorias, assessorias, pesquisas, trabalhos técnicos e outros serviços de apoio ao exercício do mandato parlamentar': | ||
'Recruitment of consultancies, advisory services, research, technical work and other services in support of the exercise of the parliamentary mandate', | ||
'Divulgação da atividade parlamentar': | ||
'Publicity of parliamentary activity', | ||
'Locomoção, hospedagem, alimentação, combustíveis e lubrificantes': | ||
'Locomotion, lodging, food, fuels and lubricants', | ||
'Passagens aéreas, aquáticas e terrestres nacionais': | ||
'National air, water and land transport', | ||
'Serviços de Segurança Privada': | ||
'Private Security Services' | ||
} | ||
|
||
categories = [categories[cat] | ||
for cat in data['expense_type'].cat.categories] | ||
data['expense_type'].cat.rename_categories(categories, | ||
inplace=True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🎉 would remove those extra spaces before There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done! 🎉 |
||
|
||
data.to_csv(output_file_path, compression='xz', index=False, | ||
encoding='utf-8') | ||
|
||
return output_file_path | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
import os | ||
from tempfile import gettempdir | ||
from unittest import main, skipIf, TestCase, TestLoader | ||
|
||
from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset | ||
|
||
if os.environ.get('RUN_INTEGRATION_TESTS') == '1': | ||
TestLoader.sortTestMethodsUsing = None | ||
|
||
class TestFederalSenateDataset(TestCase): | ||
def setUp(self): | ||
self.path = gettempdir() | ||
self.subject = FederalSenateDataset(self.path) | ||
|
||
@skipIf(os.environ.get('RUN_INTEGRATION_TESTS') != '1', | ||
'Skipping integration test') | ||
def test_fetch_saves_raw_files(self): | ||
self.subject.fetch() | ||
names = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)] | ||
for name in names: | ||
file_path = os.path.join(self.path, name) | ||
assert(os.path.exists(file_path)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What if files were there from another round of tests? This test suite need either a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I completely agree that we need to add a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's say your test creates a bizarre file: class TestSomething(TestCase):
def setUp(self):
self.path = gettempdir()
self.file_path = os.path.join(self.path, 'my-bizarre-and-beloved-test-side-effect.pkl')
…
def tearDowm(self):
os.remove(self.file_path)
def test_something(self):
pass # something that creates the bizarre file Everything on Makes sense? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It makes sense, the only thing I need to know is: I need to do it for all files I created in the test, or this only line destroy all of them? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the example There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thank you <3 |
||
|
||
@skipIf(os.environ.get('RUN_INTEGRATION_TESTS') != '1', | ||
'Skipping integration test') | ||
def test_translate_creates_english_versions_for_every_csv(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We test for files existence/absence, but not for translations themselves — this might not be a priority right now, but at least we should be aware of that and register that as an issue. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was thinking about it yesterday, we don't test if the translation was successfully made. I was thinking about opening an issue with that, and let it to be done after the migration. We need to do it in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Do it. |
||
self.subject.translate() | ||
names = ['federal-senate-{}.xz'.format(year) for year in range(2008, 2018)] | ||
for name in names: | ||
file_path = os.path.join(self.path, name) | ||
assert(os.path.exists(file_path)) | ||
|
||
if __name__ == '__main__': | ||
main() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We don't need that I guess — we use a test finder ; ) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We already need this to run the tests individually. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair enough. That wasn't supposed to be like that, but let's not bother about it now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hardcoded
2018
seems like a problem. We could use something like that to avoid bumping this year every new years eve:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I made a constant for first and next year :)
Adapted to what you suggested, soon will be pushing it.