
Introducing federal senate script #53

Merged · 15 commits · May 16, 2017

Conversation

@anaschwendler (Collaborator) commented May 8, 2017

This is the first test and script for fetching the Federal Senate datasets.
Soon we will be able to add the new datasets to Amazon and use them normally.

I don't think it needs more cleaning, but I will be studying it later.
The translation and compression tasks are already working, and everything was developed using TDD.
Feel free to help :)

        self.path = path

    def fetch(self):
        urls = [self.URL.format(year) for year in range(2008, 2018)]
Collaborator

This hardcoded 2018 seems like a problem. We could use something like this to avoid bumping the year every New Year's Eve:

from datetime import date

def fetch(self):
    next_year = date.today().year + 1
    urls = [self.URL.format(year) for year in range(2008, next_year)]
    …

Collaborator Author

Ok, I made constants for the first and next year :)
Adapted to what you suggested; I will push it soon.

        filename_from_url = lambda url: 'federal-senate-{}'.format(url.split('/')[-1])
        filenames = map(filename_from_url, urls)

        for url, filename in zip(urls, filenames):
Collaborator

<3


    def fetch(self):
        urls = [self.URL.format(year) for year in range(2008, 2018)]
        filename_from_url = lambda url: 'federal-senate-{}'.format(url.split('/')[-1])
@cuducos (Collaborator) commented May 8, 2017

Assigning a lambda like this is not a best practice, but I can live with that. The real point here is that using urllib.parse.urlsplit and os.path.basename would be way safer ; )

Something along these lines:

import os
from urllib.parse import urlsplit

def fetch(self):
    …
    url_paths = (urlsplit(url).path for url in urls)
    filenames = map(os.path.basename, url_paths)
    …

            urlretrieve(url, csv_file_path)

    def translate(self):
        filenames = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)]
Collaborator

Oops… this is the 2nd time I see this range(2008, 2018)! Let's make it a constant (at least a class constant like URL).
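A minimal sketch of that idea, combining it with the dynamic next_year suggestion from above (it assumes the existing URL class constant; everything else is illustrative):

from datetime import date

class FederalSenateDataset:
    FIRST_YEAR = 2008
    NEXT_YEAR = date.today().year + 1

    def fetch(self):
        urls = [self.URL.format(year) for year in range(self.FIRST_YEAR, self.NEXT_YEAR)]
        …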

            self.__translate_file(csv_path)

    def __translate_file(self, csv_path):
        output_file_path = csv_path \
Collaborator

I'm afraid this line break here is unnecessary…

Collaborator Author

done

        names = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)]
        for name in names:
            file_path = os.path.join(self.path, name)
            assert(os.path.exists(file_path))
Collaborator

What if files were there from another round of tests? This test suite needs either a tearDown to clean up or to use mocks to avoid writing to the file system, IMHO…

Collaborator Author

I completely agree that we need to add a tearDown to clean up, but I don't know how to do it. Examples?

Collaborator

Let's say your test creates a bizarre file: /tmp/my-bizarre-and-beloved-test-side-effect.pkl:

import os
from tempfile import gettempdir
from unittest import TestCase

class TestSomething(TestCase):

    def setUp(self):
        self.path = gettempdir()
        self.file_path = os.path.join(self.path, 'my-bizarre-and-beloved-test-side-effect.pkl')
        …

    def tearDown(self):
        os.remove(self.file_path)

    def test_something(self):
        pass  # something that creates the bizarre file

Everything in setUp runs before every test. Everything in tearDown runs after every test. So you can be sure that after every test the bizarre file is deleted. Surely it might need an os.path.exists check, or a try/except, if the file wasn't created in every test… but this is the general idea.
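For instance, a guarded version of that tearDown:

    def tearDown(self):
        if os.path.exists(self.file_path):
            os.remove(self.file_path)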

Makes sense?

Collaborator Author

It makes sense. The only thing I need to know is: do I need to do it for every file I created in the test, or does this single line destroy all of them?

Collaborator

In the example, os.remove will only remove the file at the path passed as an argument ; ) You have to pass each file you need to clean up ; )

Collaborator Author

thank you <3


    @skipIf(os.environ.get('RUN_INTEGRATION_TESTS') != '1',
            'Skipping integration test')
    def test_translate_creates_english_versions_for_every_csv(self):
Collaborator

We test for the files' existence/absence, but not for the translations themselves — this might not be a priority right now, but at least we should be aware of it and register it as an issue.

Collaborator Author

I was thinking about it yesterday: we don't test whether the translation was successfully made. I was thinking about opening an issue for that, and leaving it to be done after the migration.

We need to do it in the chamber_of_deputies module too.

Collaborator

I was thinking about opening an issue for that, and leaving it to be done after the migration.

Do it.

            assert(os.path.exists(file_path))

if __name__ == '__main__':
    main()
Collaborator

We don't need that I guess — we use a test finder ; )

Collaborator Author

done

Collaborator Author

We actually need this to run the tests individually.
I will keep that part :)

Collaborator

Fair enough. That wasn't supposed to be like that, but let's not bother about it now.

@cuducos (Collaborator) commented May 11, 2017

Ok, so now there are two things pending, right?

  • Resolve conflicts
  • Merge datasets by year in a single dataset

@anaschwendler (Collaborator Author) commented May 11, 2017

Yes, that is what @jtemporal and I will be doing now :)

@anaschwendler (Collaborator Author)

On hold until we get the full analysis of the datasets, so we can clean them the right way.

So we'll get back to it, to go further ¯\_(ツ)_/¯

@anaschwendler (Collaborator Author) commented May 15, 2017

Unholding this, because we finally decided something.
All the study is here.

We decided that we will only clean the date and cnpj_cpf fields and do further exploratory work after getting the basics done.

Thanks @jtemporal and @cuducos for all the feedback, everything is close to an end.

@cuducos (Collaborator) left a comment

Are we going to go for step 2 of 2 in this PR, or are we leaving that for a new one?

            urlretrieve(url, file_path)

    def translate(self):
        filenames = ['federal-senate-{}.csv'.format(year) for year in range(self.FIRST_YEAR, self.NEXT_YEAR)]
Collaborator

I'd add something like YEAR_RANGE as a class constant to avoid repeating this range(…).
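A sketch of how that could look, building on the constants introduced earlier (not the final code):

from datetime import date

class FederalSenateDataset:
    FIRST_YEAR = 2008
    NEXT_YEAR = date.today().year + 1
    YEAR_RANGE = range(FIRST_YEAR, NEXT_YEAR)

    def translate(self):
        filenames = ['federal-senate-{}.csv'.format(year) for year in self.YEAR_RANGE]
        …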

Collaborator

I like this idea :)

Collaborator Author

Step 2 is done here! <3
We already merged the datasets and cleaned up :)

Collaborator Author

I'll do this YEAR_RANGE right away!
Thanks for that!

Collaborator Author

YEAR_RANGE done too, great idea, thank you! <3

        categories = [categories[cat]
                      for cat in data['expense_type'].cat.categories]
        data['expense_type'].cat.rename_categories(categories,
                                                        inplace=True)
Collaborator

🎉 I would remove those extra spaces before inplace=True to vertically align it with categories.

Collaborator Author

Done! 🎉

@anaschwendler (Collaborator Author)

Checked all of @cuducos's and @jtemporal's suggestions.
Checking off step 2 of 2 because it's done with the clean() method.


        return reimbursement_path

    def __translate_file(self, csv_path):
Contributor

This internal method looks a little big; could we break it up to make the parts smaller, more meaningful, and clearer?

Collaborator Author

So that is something I was thinking about, but it is a translation, so we need it :T

            'Private Security Services'
        }

        categories = [categories[cat] for cat in data['expense_type'].cat.categories]
Contributor

This line looks like a translation from Portuguese to English, but it's changing the state of the categories variable here. That is making this code more complex than it could be. :/

Maybe it could be done in a function, just to give it a name and help with clarity.
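For example, the renaming could live in a small named helper (a sketch; the function name is just an illustration):

def translate_categories(data, categories):
    """Rename the expense_type categories to their English equivalents."""
    english = [categories[cat] for cat in data['expense_type'].cat.categories]
    data['expense_type'].cat.rename_categories(english, inplace=True)
    return data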

Collaborator Author

Once it gets merged, can you suggest something?
It would be pretty awesome for us! <3

Contributor

Of course I can! I will be very happy to contribute to this awesome project. :)

@lipemorais (Contributor)

I'm missing some unit tests. :(

In order to have a shorter feedback cycle, I think unit tests would be very helpful, like a test for each step of the script. For now I can think of tests for fetch using mocks for urlretrieve, and tests for the clean and translate methods.
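A rough sketch of what a fetch unit test with a mocked urlretrieve could look like (it assumes urlretrieve is imported into the dataset module, as the diff suggests; the test names are just illustrations):

from unittest import TestCase
from unittest.mock import patch

from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset

class TestFederalSenateDatasetFetch(TestCase):

    @patch('serenata_toolbox.federal_senate.federal_senate_dataset.urlretrieve')
    def test_fetch_downloads_a_csv_per_year(self, urlretrieve):
        dataset = FederalSenateDataset('.')
        dataset.fetch()  # no network access: every download goes through the mock
        self.assertTrue(urlretrieve.called)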

@cuducos (Collaborator) commented May 16, 2017

I'm missing some unit tests. :(

Me too, but we've decided to let it go for now and we'll refactor the tests later ; ) Wanna pair on that?

@anaschwendler (Collaborator Author)

Hi @lipemorais!

There is an issue for unit tests; we are rushing to finish the script right now, but we are thinking about it.
The issue is here: #58

@lipemorais (Contributor)

... Wanna pair on that?

Hell yeah! <3

@cuducos (Collaborator) commented May 16, 2017

Anytime next week — I'll drop you a line ; )

@cuducos (Collaborator) commented May 16, 2017

Unfortunately it looks like we still have a bug: the clean method is looking for federal-senate-YYYY.xz, but the original files are saved as .csv.

In [1]: from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset

In [2]: d = FederalSenateDataset('.')

In [3]: d.fetch()

In [4]: d.clean()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-472204f4cada> in <module>()
----> 1 d.clean()

/Users/cuducos/serenata-toolbox/serenata_toolbox/federal_senate/federal_senate_dataset.py in clean(self)
     36         for filename in filenames:
     37             file_path = os.path.join(self.path, filename)
---> 38             data = pd.read_csv(file_path, encoding = "utf-8")
     39             dataset = pd.concat([dataset, data])
     40

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644
--> 645         return _read(filepath_or_buffer, kwds)
    646
    647     parser_f.__name__ = name

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    386
    387     # Create the parser.
--> 388     parser = TextFileReader(filepath_or_buffer, **kwds)
    389
    390     if (nrows is not None) and (chunksize is not None):

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    727             self.options['has_index_names'] = kwds['has_index_names']
    728
--> 729         self._make_engine(self.engine)
    730
    731     def close(self):

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in _make_engine(self, engine)
    920     def _make_engine(self, engine='c'):
    921         if engine == 'c':
--> 922             self._engine = CParserWrapper(self.f, **self.options)
    923         else:
    924             if engine == 'python':

/Users/cuducos/.virtualenvs/serenata-toolbox/lib/python3.5/site-packages/pandas-0.19.1-py3.5-macosx-10.11-x86_64.egg/pandas/io/parsers.py in __init__(self, src, **kwds)
   1387         kwds['allow_leading_cols'] = self.index_col is not False
   1388
-> 1389         self._reader = _parser.TextReader(src, **kwds)
   1390
   1391         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4025)()

pandas/parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:7578)()

/usr/local/var/pyenv/versions/3.5.2/lib/python3.5/lzma.py in __init__(self, filename, mode, format, check, preset, filters)
    116             if "b" not in mode:
    117                 mode += "b"
--> 118             self._fp = builtins.open(filename, mode)
    119             self._closefp = True
    120             self._mode = mode_code

FileNotFoundError: [Errno 2] No such file or directory: './federal-senate-2008.xz'

@anaschwendler (Collaborator Author) commented May 16, 2017

The clean() method calls fetch() and translate(), and the .xz file is created during the translation of the dataset.
Did you pull the whole PR?
I updated it yesterday :T

So you need to run:
d.fetch()
d.translate()
d.clean()

@cuducos (Collaborator) commented May 16, 2017

The clean() method calls fetch() and translate(), and the .xz file is created during the translation of the dataset.

OMG we must pay attention to #23 so bad…

Did you pull the whole PR?

Sure thing.

So you need to run:
d.fetch()
d.translate()
d.clean()

As you can see in my output, I skipped d.translate()… testing it again and updating this thread.

@anaschwendler (Collaborator Author) commented May 16, 2017

OMG we must pay attention to #23 so bad…

I know, that is a serious thing, we should work on it ASAP.

As you can see in my output, I skipped d.translate()… testing it again and updating this thread.

Thank you :)

@anaschwendler (Collaborator Author)

I just noticed that what I said was kinda wrong.
Fixing it:
@cuducos
The fetch, translate and clean methods are unique, and each does a specific thing:
fetch gets the datasets
translate translates the columns and the categories
clean concatenates the datasets and cleans the date and cnpj_cpf fields

So, what I was trying to say was that in my test, to exercise the clean method, I ran fetch() and translate() before running clean().

Does it make sense now?

@lipemorais (Contributor)

It makes sense to me, but it looks like all of them could be tested in clean as a journey. It also helps with the don't-repeat-yourself principle, and we will spend less time running tests while keeping the same coverage.

@cuducos (Collaborator) commented May 16, 2017

Ok. This is so confusing.

  • The clean method merges the thing?
  • The translate method converts the thing to .xz?

We really should refactor to:

  1. Make methods more atomic (UNIX philosophy: do one thing and do it well) — this will lead us to five methods instead of three: something like fetch, clean, translate, convert_to_xz and merge (sketched below)
  2. End up with meaningful and less confusing method names
  3. Reflect these changes in chamber_of_deputies
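A skeleton of what item 1 could look like (docstrings paraphrase this thread; this is just an illustration, not a final design):

class FederalSenateDataset:

    def fetch(self):
        """Download the original CSV files, one per year."""

    def translate(self):
        """Rename columns and categories from Portuguese to English."""

    def convert_to_xz(self):
        """Compress each translated dataset into an .xz file."""

    def merge(self):
        """Concatenate the per-year datasets into a single one."""

    def clean(self):
        """Clean up the date and cnpj_cpf fields."""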

If you agree, would you mind opening an issue about it?

Besides that, now that I ran things in the right order (fetch, translate and clean), it works; merging it 🎉

@cuducos cuducos merged commit 65ef7bf into master May 16, 2017
@anaschwendler anaschwendler deleted the anaschwendler-introduce-federal-senate-script branch May 16, 2017 14:51
@anaschwendler (Collaborator Author)

The clean method merges the thing?

Yes, it does merge all the datasets, with .xz compression.

The translate method converts the thing to .xz?

Yes, it translates each one of the datasets fetched from the Senate website.

If you agree would you mind opening an issue about it?

No, I totally agree with all of that; we really need to refactor it in chamber_of_deputies too, but I don't know what the best way is.

@lipemorais has raised a flag to help; I just can't wait to see the results :)

Thanks for merging it <3
