
Introducing federal senate script #53

Merged · 15 commits · May 16, 2017

Conversation

@anaschwendler (Collaborator) commented May 8, 2017

This is the first test and script for fetching the Federal Senate datasets.
Soon we will be able to add the new datasets to Amazon and use them normally.

I don't think it needs more cleaning, but I will be studying it later.
The translation and compression tasks are already working, and everything was developed using TDD.
Feel free to help :)

        self.path = path

    def fetch(self):
        urls = [self.URL.format(year) for year in range(2008, 2018)]
Collaborator

This hardcoded 2018 seems like a problem. We could use something like this to avoid bumping the year every New Year's Eve:

from datetime import date

def fetch(self):
    next_year = date.today().year + 1
    urls = [self.URL.format(year) for year in range(2008, next_year)]
    …

Collaborator Author

Ok, I made constants for the first and next year :)
Adapted to what you suggested; I will push it soon.

        filename_from_url = lambda url: 'federal-senate-{}'.format(url.split('/')[-1])
        filenames = map(filename_from_url, urls)

        for url, filename in zip(urls, filenames):
Collaborator

<3


    def fetch(self):
        urls = [self.URL.format(year) for year in range(2008, 2018)]
        filename_from_url = lambda url: 'federal-senate-{}'.format(url.split('/')[-1])
@cuducos (Collaborator) commented May 8, 2017

Assigning a lambda like this is not a best practice, but I can live with that. The real point here is that using urllib.parse.urlsplit and os.path.basename would be way safer ; )

Something along these lines:

import os
from urllib.parse import urlsplit

def fetch(self):
    …
    url_paths = (urlsplit(url).path for url in urls)
    filenames = map(os.path.basename, url_paths)
    …

            urlretrieve(url, csv_file_path)

    def translate(self):
        filenames = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)]
Collaborator

Oops… this is the 2nd time I see this range(2008, 2018)! Let's make it a constant (at least a class constant like URL).
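A minimal sketch of that idea, combining it with the dynamic next_year suggestion from above (it assumes the existing URL class constant; everything else is illustrative):

from datetime import date

class FederalSenateDataset:
    FIRST_YEAR = 2008
    NEXT_YEAR = date.today().year + 1

    def fetch(self):
        urls = [self.URL.format(year) for year in range(self.FIRST_YEAR, self.NEXT_YEAR)]
        …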

            self.__translate_file(csv_path)

    def __translate_file(self, csv_path):
        output_file_path = csv_path \
Collaborator

I'm afraid this line break here is unnecessary…

Collaborator Author

done

        names = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)]
        for name in names:
            file_path = os.path.join(self.path, name)
            assert(os.path.exists(file_path))
Collaborator

What if files were there from another round of tests? This test suite needs either a tearDown to clean up or to use mocks to avoid writing to the file system, IMHO…

Collaborator Author

I completely agree that we need to add a tearDown to clean up, but I don't know how to do it. Examples?

Collaborator

Let's say your test creates a bizarre file: /tmp/my-bizarre-and-beloved-test-side-effect.pkl:

import os
from tempfile import gettempdir
from unittest import TestCase

class TestSomething(TestCase):

    def setUp(self):
        self.path = gettempdir()
        self.file_path = os.path.join(self.path, 'my-bizarre-and-beloved-test-side-effect.pkl')
        …

    def tearDown(self):
        os.remove(self.file_path)

    def test_something(self):
        pass  # something that creates the bizarre file

Everything in setUp runs before every test. Everything in tearDown runs after every test. So you can be sure that after every test the bizarre file is deleted. Surely it might need an os.path.exists check, or a try/except, if the file wasn't created in every test… but this is the general idea.
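For instance, a guarded version of that tearDown:

    def tearDown(self):
        if os.path.exists(self.file_path):
            os.remove(self.file_path)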

Makes sense?

Collaborator Author

It makes sense. The only thing I need to know is: do I need to do it for every file I created in the test, or does this single line destroy all of them?

Collaborator

In the example, os.remove will only remove the file at the path passed as an argument ; ) You have to pass each file you need to clean up ; )

Collaborator Author

thank you <3


    @skipIf(os.environ.get('RUN_INTEGRATION_TESTS') != '1',
            'Skipping integration test')
    def test_translate_creates_english_versions_for_every_csv(self):
Collaborator

We test for the files' existence/absence, but not for the translations themselves — this might not be a priority right now, but at least we should be aware of it and register it as an issue.

Collaborator Author

I was thinking about it yesterday: we don't test whether the translation was successfully made. I was thinking about opening an issue for that, and leaving it to be done after the migration.

We need to do it in the chamber_of_deputies module too.

Collaborator

I was thinking about opening an issue for that, and leaving it to be done after the migration.

Do it.

            assert(os.path.exists(file_path))

if __name__ == '__main__':
    main()
Collaborator

We don't need that I guess — we use a test finder ; )

Collaborator Author

done

Collaborator Author

We actually need this to run the tests individually.
I will keep that part :)

Collaborator

Fair enough. That wasn't supposed to be like that, but let's not bother about it now.

@cuducos (Collaborator) commented May 11, 2017

Ok, so now there are two things pending, right?

  • Resolve conflicts
  • Merge datasets by year in a single dataset

@anaschwendler (Collaborator Author) commented May 11, 2017

Yes, that is what @jtemporal and I will be doing now :)

@anaschwendler (Collaborator Author)

On hold until we get the full analysis of the datasets, so we can clean them the right way.

So we'll get back to it, to go further ¯\_(ツ)_/¯

@anaschwendler (Collaborator Author) commented May 15, 2017

Unholding this, because we finally decided something.
All the study is here.

We decided that we will only clean the date and cnpj_cpf fields and do further exploratory work after getting the basics done.

Thanks @jtemporal and @cuducos for all the feedback, everything is close to an end.

@cuducos (Collaborator) left a comment

Are we going to go for step 2 of 2 in this PR, or are we leaving that for a new one?

            urlretrieve(url, file_path)

    def translate(self):
        filenames = ['federal-senate-{}.csv'.format(year) for year in range(self.FIRST_YEAR, self.NEXT_YEAR)]
Collaborator

I'd add something like YEAR_RANGE as a class constant to avoid repeating this range(…).
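A sketch of how that could look, building on the constants introduced earlier (not the final code):

from datetime import date

class FederalSenateDataset:
    FIRST_YEAR = 2008
    NEXT_YEAR = date.today().year + 1
    YEAR_RANGE = range(FIRST_YEAR, NEXT_YEAR)

    def translate(self):
        filenames = ['federal-senate-{}.csv'.format(year) for year in self.YEAR_RANGE]
        …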

Collaborator

I like this idea :)

Collaborator Author

Step 2 is done here! <3
We already merged the datasets and cleaned up :)

Collaborator Author

I'll do this YEAR_RANGE right away!
Thanks for that!

Collaborator Author

YEAR_RANGE done too, great idea, thank you! <3

        categories = [categories[cat]
                      for cat in data['expense_type'].cat.categories]
        data['expense_type'].cat.rename_categories(categories,
                                                        inplace=True)
Collaborator

🎉 I would remove those extra spaces before inplace=True to vertically align it with categories.

Collaborator Author

Done! 🎉

@anaschwendler (Collaborator Author)

Checked all of @cuducos's and @jtemporal's suggestions.
Checking off step 2 of 2 because it's done with the clean() method.


        return reimbursement_path

    def __translate_file(self, csv_path):
Contributor

This internal method looks a little big; could we break it up to make the parts smaller, more meaningful, and clearer?

Collaborator Author

So that is something I was thinking about, but it is a translation, so we need it :T

            'Private Security Services'
        }

        categories = [categories[cat] for cat in data['expense_type'].cat.categories]
Contributor

This line looks like a translation from Portuguese to English, but it's changing the state of the categories variable here. That is making this code more complex than it could be. :/

Maybe it could be done in a function, just to give it a name and help with clarity.
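For example, the renaming could live in a small named helper (a sketch; the function name is just an illustration):

def translate_categories(data, categories):
    """Rename the expense_type categories to their English equivalents."""
    english = [categories[cat] for cat in data['expense_type'].cat.categories]
    data['expense_type'].cat.rename_categories(english, inplace=True)
    return data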

Collaborator Author

Once it gets merged, can you suggest something?
It would be pretty awesome for us! <3

Contributor

Of course I can! I will be very happy to contribute to this awesome project. :)

@lipemorais (Contributor)

I'm missing some unit tests. :(

In order to have a shorter feedback cycle, I think unit tests would be very helpful, like a test for each step of the script. For now I can think of tests for fetch using mocks for urlretrieve, and tests for the clean and translate methods.
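A rough sketch of what a fetch unit test with a mocked urlretrieve could look like (it assumes urlretrieve is imported into the dataset module, as the diff suggests; the test names are just illustrations):

from unittest import TestCase
from unittest.mock import patch

from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset

class TestFederalSenateDatasetFetch(TestCase):

    @patch('serenata_toolbox.federal_senate.federal_senate_dataset.urlretrieve')
    def test_fetch_downloads_a_csv_per_year(self, urlretrieve):
        dataset = FederalSenateDataset('.')
        dataset.fetch()  # no network access: every download goes through the mock
        self.assertTrue(urlretrieve.called)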

@cuducos (Collaborator) commented May 16, 2017

I'm missing some unit tests. :(

Me too, but we've decided to let it go for now and we'll refactor the tests later ; ) Wanna pair on that?

@anaschwendler (Collaborator Author)

Hi @lipemorais!

There is an issue for unit tests; we are rushing to finish the script right now, but we are thinking about it.
The issue is here: #58

@lipemorais (Contributor)

... Wanna pair on that?

Hell yeah! <3

@cuducos (Collaborator) commented May 16, 2017

Anytime next week — I'll drop you a line ; )

@cuducos (Collaborator) commented May 16, 2017

Unfortunately it looks like we still have a bug: the clean method is looking for federal-senate-YYYY.xz, but the original files are saved as .csv.

In [1]: from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset

In [2]: d = FederalSenateDataset('.')

In [3]: d.fetch()

In [4]: d.clean()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-472204f4cada> in <module>()
----> 1 d.clean()

/Users/cuducos/serenata-toolbox/serenata_toolbox/federal_senate/federal_senate_dataset.py in clean(self)
     36         for filename in filenames:
     37             file_path = os.path.join(self.path, filename)
---> 38             data = pd.read_csv(file_path, encoding = "utf-8")
     39             dataset = pd.concat([dataset, data])
     40

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644
--> 645         return _read(filepath_or_buffer, kwds)
    646
    647     parser_f.__name__ = name

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    386
    387     # Create the parser.
--> 388     parser = TextFileReader(filepath_or_buffer, **kwds)
    389
    390     if (nrows is not None) and (chunksize is not None):

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    727             self.options['has_index_names'] = kwds['has_index_names']
    728
--> 729         self._make_engine(self.engine)
    730
    731     def close(self):

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in _make_engine(self, engine)
    920     def _make_engine(self, engine='c'):
    921         if engine == 'c':
--> 922             self._engine = CParserWrapper(self.f, **self.options)
    923         else:
    924             if engine == 'python':

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in __init__(self, src, **kwds)
   1387         kwds['allow_leading_cols'] = self.index_col is not False
   1388
-> 1389         self._reader = _parser.TextReader(src, **kwds)
   1390
   1391         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4025)()

pandas/parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:7578)()

/usr/local/var/pyenv/versions/3.5.2/lib/python3.5/lzma.py in __init__(self, filename, mode, format, check, preset, filters)
    116             if "b" not in mode:
    117                 mode += "b"
--> 118             self._fp = builtins.open(filename, mode)
    119             self._closefp = True
    120             self._mode = mode_code

FileNotFoundError: [Errno 2] No such file or directory: './federal-senate-2008.xz'

@anaschwendler (Collaborator Author) commented May 16, 2017

The clean() method calls fetch() and translate(), and the .xz file is created during the translation of the dataset.
Did you pull the whole PR?
I updated it yesterday :T

So you need to run:
d.fetch()
d.translate()
d.clean()

@cuducos (Collaborator) commented May 16, 2017

The clean() method calls fetch() and translate(), and the .xz file is created during the translation of the dataset.

OMG we must pay attention to #23 so bad…

Did you pull the whole PR?

Sure thing.

So you need to run:
d.fetch()
d.translate()
d.clean()

As you can see in my output, I skipped d.translate()… testing it again and updating this thread.

@anaschwendler (Collaborator Author) commented May 16, 2017

OMG we must pay attention to #23 so bad…

I know, that is a serious thing, we should work on it ASAP.

As you can see in my output, I skipped d.translate()… testing it again and updating this thread.

Thank you :)

@anaschwendler (Collaborator Author)

I just noticed that what I said was kinda wrong.
Fixing it:
@cuducos
The fetch, translate and clean methods are unique, and each does a specific thing:
fetch gets the datasets
translate translates the columns and the categories
clean concatenates the datasets and cleans the date and cnpj_cpf fields

So, what I was trying to say was that in my test, to exercise the clean method, I ran fetch() and translate() before running clean().

Does it make sense now?

@lipemorais (Contributor)

It makes sense to me, but it looks like all of them could be tested in clean as a journey. It also helps with the don't-repeat-yourself principle, and we will spend less time running tests while keeping the same coverage.

@cuducos (Collaborator) commented May 16, 2017

Ok. This is so confusing.

  • The clean method merges the thing?
  • The translate method converts the thing to .xz?

We really should refactor to:

  1. Make methods more atomic (UNIX philosophy: do one thing and do it well) — this will lead us to five methods instead of three: something like fetch, clean, translate, convert_to_xz and merge (sketched below)
  2. End up with meaningful and less confusing method names
  3. Reflect these changes in chamber_of_deputies
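A skeleton of what item 1 could look like (docstrings paraphrase this thread; this is just an illustration, not a final design):

class FederalSenateDataset:

    def fetch(self):
        """Download the original CSV files, one per year."""

    def translate(self):
        """Rename columns and categories from Portuguese to English."""

    def convert_to_xz(self):
        """Compress each translated dataset into an .xz file."""

    def merge(self):
        """Concatenate the per-year datasets into a single one."""

    def clean(self):
        """Clean up the date and cnpj_cpf fields."""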

If you agree, would you mind opening an issue about it?

Besides that, now that I ran things in the right order (fetch, translate and clean), it works; merging it 🎉

@cuducos cuducos merged commit 65ef7bf into master May 16, 2017
@anaschwendler anaschwendler deleted the anaschwendler-introduce-federal-senate-script branch May 16, 2017 14:51
@anaschwendler (Collaborator Author)

The clean method merges the thing?

Yes, it does merge all the datasets, with .xz compression.

The translate method converts the thing to .xz?

Yes, it translates each one of the datasets fetched from the Senate website.

If you agree would you mind opening an issue about it?

No, I totally agree with all of that; we really need to refactor it in chamber_of_deputies too, but I don't know what the best way is.

@lipemorais has raised a flag to help; I just can't wait to see the results :)

Thanks for merging it <3
