Introducing federal senate script #53
Conversation
This is the first test and script for fetching the Federal Senate datasets. Soon we will be able to add the new datasets to Amazon and use them normally. I don't think it needs more cleaning, but I will be studying it later. The translation and compression tasks are already working, and everything was developed using TDD. Feel free to help :)
```python
self.path = path

def fetch(self):
    urls = [self.URL.format(year) for year in range(2008, 2018)]
```
This hardcoded `2018` seems like a problem. We could use something like this to avoid bumping the year every New Year's Eve:
```python
from datetime import date
…

def fetch(self):
    next_year = date.today().year + 1
    urls = [self.URL.format(year) for year in range(2008, next_year)]
    …
```
Ok, I made constants for the first and next year :) Adapted to what you suggested; I'll be pushing it soon.
```python
filename_from_url = lambda url: 'federal-senate-{}'.format(url.split('/')[-1])
filenames = map(filename_from_url, urls)

for url, filename in zip(urls, filenames):
```
<3
```python
def fetch(self):
    urls = [self.URL.format(year) for year in range(2008, 2018)]
    filename_from_url = lambda url: 'federal-senate-{}'.format(url.split('/')[-1])
```
Having a `lambda` this way is not a best practice, but I can handle that. The real problem here is that using `urllib.parse.urlsplit` and `os.path.basename` would be way safer ; ) Something along these lines:
```python
from urllib.parse import urlsplit
…

def fetch(self):
    …
    url_paths = (urlsplit(url).path for url in urls)
    filenames = map(os.path.basename, url_paths)
    …
```
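To illustrate the difference (this URL, including the query string, is made up):

```python
import os
from urllib.parse import urlsplit

url = 'http://www.senado.gov.br/transparencia/2008.csv?session=42'

url.split('/')[-1]                    # '2008.csv?session=42' (the query string leaks in)
os.path.basename(urlsplit(url).path)  # '2008.csv'
```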
```python
urlretrieve(url, csv_file_path)

def translate(self):
    filenames = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)]
```
Oops… this is the 2nd time I see this `range(2008, 2018)`! Let's make it a constant (at least a class constant like `URL`).
```python
self.__translate_file(csv_path)

def __translate_file(self, csv_path):
    output_file_path = csv_path \
```
I'm afraid this line break here is unnecessary…
done
tests/test_federal_senate_dataset.py (outdated)
```python
names = ['federal-senate-{}.csv'.format(year) for year in range(2008, 2018)]
for name in names:
    file_path = os.path.join(self.path, name)
    assert(os.path.exists(file_path))
```
What if files were there from another round of tests? This test suite needs either a `tearDown` to clean up or mocks to avoid writing to the file system, IMHO…
I completely agree that we need to add a `tearDown` to clean up, but I don't know how to do it. Examples?
Let's say your test creates a bizarre file, `/tmp/my-bizarre-and-beloved-test-side-effect.pkl`:

```python
import os
from tempfile import gettempdir
from unittest import TestCase


class TestSomething(TestCase):
    def setUp(self):
        self.path = gettempdir()
        self.file_path = os.path.join(self.path, 'my-bizarre-and-beloved-test-side-effect.pkl')
        …

    def tearDown(self):
        os.remove(self.file_path)

    def test_something(self):
        pass  # something that creates the bizarre file
```
Everything in `setUp` runs before every test. Everything in `tearDown` runs after every test. So you can be sure that after every test the bizarre file is deleted. Surely it might need an `os.path.exists` check, or a `try`/`except`, if the file wasn't created in every test… but this is the general idea.

Makes sense?
It makes sense. The only thing I need to know is: do I need to do it for every file I created in the test, or does this single line destroy all of them?
In the example, `os.remove` will only remove the file at the path passed as an argument ; ) You have to pass every file you need to clean up ; )
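A minimal sketch of a `tearDown` handling several files, assuming the suite keeps track of the paths it creates (the file names here are illustrative):

```python
import os
from tempfile import gettempdir
from unittest import TestCase


class TestFederalSenateDataset(TestCase):
    def setUp(self):
        self.path = gettempdir()
        # Illustrative list of files the tests are expected to create
        self.created_files = [os.path.join(self.path, 'federal-senate-{}.csv'.format(year))
                              for year in (2008, 2009)]

    def tearDown(self):
        for file_path in self.created_files:
            if os.path.exists(file_path):  # guard: not every test creates every file
                os.remove(file_path)
```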
thank you <3
```python
@skipIf(os.environ.get('RUN_INTEGRATION_TESTS') != '1',
        'Skipping integration test')
def test_translate_creates_english_versions_for_every_csv(self):
```
We test for the files' existence/absence, but not for the translations themselves. This might not be a priority right now, but at least we should be aware of it and register it as an issue.
I was thinking about it yesterday: we don't test whether the translation was made successfully. I was thinking about opening an issue for that and leaving it to be done after the migration.

We need to do it in the `chamber_of_deputies` module too.
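For the record, such a test could look roughly like this once the issue is tackled (a sketch: the fixture path and the Portuguese column name are stand-ins, not the dataset's real names):

```python
import pandas as pd
from unittest import TestCase


class TestTranslation(TestCase):
    def test_translate_renames_columns_to_english(self):
        self.subject.translate()  # hypothetical dataset instance built in setUp
        data = pd.read_csv(self.translated_csv_path)  # hypothetical fixture path
        assert 'expense_type' in data.columns      # English header is there
        assert 'TIPO_DESPESA' not in data.columns  # stand-in for the original header
```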
> I was thinking about opening an issue for that and leaving it to be done after the migration.

Do it.
tests/test_federal_senate_dataset.py (outdated)
```python
assert(os.path.exists(file_path))

if __name__ == '__main__':
    main()
```
We don't need that I guess — we use a test finder ; )
done
We still need this to run the tests individually. I will keep that part :)
Fair enough. That wasn't supposed to be like that, but let's not bother about it now.

Ok, so now there are two things pending, right?
Yes, that is what @jtemporal and I will be doing now :)
On hold until we get the full analysis of the datasets, so we can clean them the right way. Then we get back to go further ¯\_(ツ)_/¯
Unholding this, because we finally decided something. We decided that we will only clean the

Thanks @jtemporal and @cuducos for all the feedback, everything is close to an end.
Are we going to go for step 2 of 2 in this PR or are we leaving that for a new one?
```python
urlretrieve(url, file_path)

def translate(self):
    filenames = ['federal-senate-{}.csv'.format(year) for year in range(self.FIRST_YEAR, self.NEXT_YEAR)]
```
I'd add something like `YEAR_RANGE` as a class constant to avoid repeating this `range(…)`.
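Something like this could tie the constants together (a sketch only: the class name and the URL value are assumptions, not the module's real ones):

```python
from datetime import date


class FederalSenateDataset:
    URL = 'http://www.senado.gov.br/transparencia/{}.csv'  # placeholder endpoint
    FIRST_YEAR = 2008
    NEXT_YEAR = date.today().year + 1
    YEAR_RANGE = range(FIRST_YEAR, NEXT_YEAR)

    def fetch(self):
        urls = [self.URL.format(year) for year in self.YEAR_RANGE]
        # … download each file as before
```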
I like this idea :)
Step 2 is done here! <3

We already merged the datasets and cleaned them up :)
I'll do this `YEAR_RANGE` right now! Thanks for that!
`YEAR_RANGE` done too, great idea, thank you! <3
```python
categories = [categories[cat]
              for cat in data['expense_type'].cat.categories]
data['expense_type'].cat.rename_categories(categories,
                                                    inplace=True)
```
🎉 I would remove those extra spaces before `inplace=True` to vertically align it with `categories`.
Done! 🎉
Checked all suggestions from @cuducos and @jtemporal.
```python
return reimbursement_path

def __translate_file(self, csv_path):
```
This internal method looks a little big; could we break it up to make the parts smaller, more meaningful, and clearer?
That is something I was thinking about, but it is a translation, so we need it :T
```python
    'Private Security Services'
}

categories = [categories[cat] for cat in data['expense_type'].cat.categories]
```
This line looks like a translation from Portuguese to English, but it's changing the state of the `categories` variable here, which makes this code more complex than it could be. :/ Maybe it could be done in a function, just to give it a name and help with clarity.
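Something along these lines would keep the behavior but give the step a name (a sketch; `translations` is whatever mapping the method already builds):

```python
def translate_expense_types(data, translations):
    """Returns data with its expense_type categories renamed to English."""
    new_categories = [translations[category]
                      for category in data['expense_type'].cat.categories]
    data['expense_type'].cat.rename_categories(new_categories, inplace=True)
    return data
```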
When it gets merged, can you suggest something? It will be pretty awesome for us! <3
Of course I can! I will be very happy to contribute to this awesome project. :)
I'm missing some unit tests. :( In order to have a shorter feedback cycle, I think unit tests would be very helpful, like a test for each step of the script. For now I can think of tests for `fetch` using mocks for `urlretrieve`, and for the `clean` and `translate` methods.
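For `fetch`, a minimal sketch of what I have in mind, mocking the download away (the import path and patch target below are guesses):

```python
from unittest import TestCase
from unittest.mock import patch

from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset


class TestFetch(TestCase):
    @patch('serenata_toolbox.federal_senate.federal_senate_dataset.urlretrieve')
    def test_fetch_downloads_one_csv_per_year(self, mocked_urlretrieve):
        dataset = FederalSenateDataset('/tmp/')
        dataset.fetch()
        # No network access: we only check how many downloads were requested
        self.assertEqual(len(dataset.YEAR_RANGE), mocked_urlretrieve.call_count)
```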
Me too, but we've decided to let it go for now and we'll refactor the tests later ; ) Wanna pair on that?
Hi @lipemorais! There is an issue for unit tests; we are rushing to finish the script right now, but we are thinking about it.
Hell yeah! <3
Anytime next week, I'll drop you a line ; )
Unfortunately it looks like we still have a bug.
The So you need to run:
OMG we must pay attention to #23 so bad…
Sure thing.
As you can see in my output I skip the
I know, that is a serious thing, we should work on it ASAP.
Thank you :)
I just noticed that what I said was kinda wrong. So, what I was trying to say was that in my test, to test the

Does it make sense right now?
It makes sense to me, but it looks like all of them could be tested in `clean` as a journey. It also helps with the don't-repeat-yourself principle, and we will spend less time running tests while keeping the same coverage.
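A journey-style test along those lines might look like this (a sketch; the import path, class name, and method behavior are assumptions):

```python
from unittest import TestCase

from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset  # guessed path


class TestJourney(TestCase):
    def test_fetch_translate_and_clean_journey(self):
        dataset = FederalSenateDataset('/tmp/')
        dataset.fetch()      # downloads the CSVs
        dataset.translate()  # creates the English versions
        dataset.clean()      # merges and cleans the datasets
        # assertions about the final artifacts would go here
```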
Ok. This is so confusing.
We really need to refactor to
If you agree, would you mind opening an issue about it? Besides that, now that I ran things in the right order (
Yes, it does merge all datasets, with an
Yes, it translates each one of the datasets fetched from the Senate website.
No, I totally agree with all of that, we really need to refactor it in

@lipemorais has raised a flag to help, I just can't wait to see the results :)

Thanks for merging it <3