Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing federal senate script #53

Merged
merged 15 commits into from
May 16, 2017
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Empty file.
Empty file.
83 changes: 83 additions & 0 deletions serenata_toolbox/federal_senate/federal_senate_dataset.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
import os.path
from urllib.request import urlretrieve
import numpy as np
import pandas as pd

class FederalSenateDataset:
URL = 'http://www.senado.gov.br/transparencia/LAI/verba/{}.csv'

def __init__(self, path):
self.path = path

def fetch(self):
urls = [self.URL.format(year) for year in range(2008, 2018)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hardcoded 2018 seems like a problem. We could use something like that to avoid bumping this year every new years eve:

from datetime import datedef fetch(self):
        next_year = date.today().year + 1
        urls = [self.URL.format(year) for year in range(2008, next_year)]
        …

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I made a constant for first and next year :)
Adapted to what you suggested, soon will be pushing it.

filename_from_url = lambda url: 'federal-senate-{}'.format(url.split('/')[-1])
Copy link
Collaborator

@cuducos cuducos May 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a lambda this way is not a best practice, but I can handle that. The real problem here is that using urllib.parse.urlsplit and os.path.basename would be way safer ; )

Something among these lines:

from urllib.parse import urlsplitdef fetch(self):
        …
        url_paths = (urlsplit(url).path for url in urls)
        filenames = map(os.path.basename, url_paths)
        …

filenames = map(filename_from_url, urls)

for url, filename in zip(urls, filenames):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<3

csv_file_path = os.path.join(self.path, filename)
urlretrieve(url, csv_file_path)

def translate(self):
filenames = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops… this is the 2nd time I see this range(2008, 20180)! Let's make it a constant (at least a class constant like URL).

for filename in filenames:
csv_path = os.path.join(self.path, filename)
self.__translate_file(csv_path)

def __translate_file(self, csv_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This internal method looks a little big, could we break it to make the parts smaller and more meaningful and clear?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that is a thing that I was thinking about, but is a translation, so we need it :T

output_file_path = csv_path \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this line break here is unnecessary…

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.replace('.csv', '.xz')


data = pd.read_csv(csv_path,
sep=';',
encoding = "ISO-8859-1",
skiprows=1)

data.columns = map(str.lower, data.columns)

data.rename(columns={
'ano': 'year',
'mes': 'month',
'senador': 'congressperson_name',
'tipo_despesa': 'expense_type',
'cnpj_cpf': 'cnpj_cpf',
'fornecedor': 'supplier',
'documento': 'document_id',
'data': 'date',
'detalhamento': 'expense_details',
'valor_reembolsado': 'reimbursement_value',
}, inplace=True)

data['expense_type'] = data['expense_type'].astype('category')

data['expense_type'] = \
data['expense_type'].astype('category')

categories = {
'Aluguel de imóveis para escritório político, compreendendo despesas concernentes a eles.':
'Rent of real estate for political office, comprising expenses concerning them',
'Aquisição de material de consumo para uso no escritório político, inclusive aquisição ou locação de software, despesas postais, aquisição de publicações, locação de móveis e de equipamentos. ':
'Acquisition of consumables for use in the political office, including acquisition or leasing of software, postal expenses, acquisition of publications, rental of furniture and equipment',
'Contratação de consultorias, assessorias, pesquisas, trabalhos técnicos e outros serviços de apoio ao exercício do mandato parlamentar':
'Recruitment of consultancies, advisory services, research, technical work and other services in support of the exercise of the parliamentary mandate',
'Divulgação da atividade parlamentar':
'Publicity of parliamentary activity',
'Locomoção, hospedagem, alimentação, combustíveis e lubrificantes':
'Locomotion, lodging, food, fuels and lubricants',
'Passagens aéreas, aquáticas e terrestres nacionais':
'National air, water and land transport',
'Serviços de Segurança Privada':
'Private Security Services'
}

categories = [categories[cat]
for cat in data['expense_type'].cat.categories]
data['expense_type'].cat.rename_categories(categories,
inplace=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 would remove those extra spaces before inplace=True to vertically align to categories

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! 🎉


data.to_csv(output_file_path, compression='xz', index=False,
encoding='utf-8')

return output_file_path

34 changes: 34 additions & 0 deletions tests/test_federal_senate_dataset.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
import os
from tempfile import gettempdir
from unittest import main, skipIf, TestCase, TestLoader

from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset

if os.environ.get('RUN_INTEGRATION_TESTS') == '1':
TestLoader.sortTestMethodsUsing = None

class TestFederalSenateDataset(TestCase):
def setUp(self):
self.path = gettempdir()
self.subject = FederalSenateDataset(self.path)

@skipIf(os.environ.get('RUN_INTEGRATION_TESTS') != '1',
'Skipping integration test')
def test_fetch_saves_raw_files(self):
self.subject.fetch()
names = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)]
for name in names:
file_path = os.path.join(self.path, name)
assert(os.path.exists(file_path))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if files were there from another round of tests? This test suite need either a tearDown to clean up or to use mocks to avoid writing to file system IMHO…

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree that we need to add a tearDown to clean up, but I don't know how to do it, examples?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say your test creates a bizarre file: /tmp/my-bizarre-and-beloved-test-side-effect.pkl:

class TestSomething(TestCase):

    def setUp(self):
        self.path = gettempdir()
        self.file_path = os.path.join(self.path, 'my-bizarre-and-beloved-test-side-effect.pkl')
        …

    def tearDowm(self):
        os.remove(self.file_path)

    def test_something(self):
        pass  # something that creates the bizarre file

Everything on setUp runs before every test. Everything in tearDown runs after every test. So you can be sure that after every test the bizarre file is deleted. Surely it might need an is os.path.exist, or a try/except if the file wasn't created in every test… but this is the general idea.

Makes sense?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense, the only thing I need to know is: I need to do it for all files I created in the test, or this only line destroy all of them?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the example os.remove will only remove the file from the path passed as an argument ; ) You have to pass the files you need to clean up ; )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you <3


@skipIf(os.environ.get('RUN_INTEGRATION_TESTS') != '1',
'Skipping integration test')
def test_translate_creates_english_versions_for_every_csv(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We test for files existence/absence, but not for translations themselves — this might not be a priority right now, but at least we should be aware of that and register that as an issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it yesterday, we don't test if the translation was successfully made. I was thinking about opening an issue with that, and let it to be done after the migration.

We need to do it in chamber_of_deputies module too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about opening an issue with that, and let it to be done after the migration.

Do it.

self.subject.translate()
names = ['federal-senate-{}.xz'.format(year) for year in range(2008, 2018)]
for name in names:
file_path = os.path.join(self.path, name)
assert(os.path.exists(file_path))

if __name__ == '__main__':
main()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need that I guess — we use a test finder ; )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already need this to run the tests individually.
I will keep that part :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. That wasn't supposed to be like that, but let's not bother about it now.