Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get dataframe #15

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Get dataframe #15

wants to merge 6 commits into from

Conversation

elmotec
Copy link
Contributor

@elmotec elmotec commented Nov 7, 2015

Hey @mortada,

This is a cumulative PR for the fixes to test_search, get_series changes, and the new get_dataframe. Don't know if you are interested and if so whether you'd like all in one commit (that's quite a bit) or in 3 commits as it is. The normal order of the pull requests are:
1- fix-test_search
2- get_series_with_realtime
3- get_dataframe

Cheers

Removed usage of sys.stderr in mock exception which caused the stream to
be closed. This in turn failed the next test.  Simplified test_search to
avoid the zip and the loop.
Arguments realtime_start and realtime_end in get_series() now cause a
pandas.DataFrame to be returned with pandas.MultiIndex for realtime
data.  Added simple test for the new feature and documentation.

Added __init__.py in fredapi.tests so it's correctly interpreted as
a package. Now we could revert to python setup.py test in .travis.yml.

Fixed test_invalid_kwarg_in_get_series() as we sometimes get a
TypeError and sometimes a ValueError. Seems that pandas passes through
whatever exception it gets, might be a good reason for this so we
follow the same policy.

Simplified comparison of dataframe output in tests.
9999-12-31 cannot be converted to pandas.Timestamp because it's too big.
Reason it's prefereable to use pandas.Timestamp than datetime.datetime
is that the former can be used as an index whereas the second cannot.
Retrieve multiple series in one call with get_dataframe.

Reorganized test mock calls.
@mortada
Copy link
Owner

mortada commented Nov 9, 2015

@elmotec one high level comment is that it seems odd to have a function called get_dataframe()

the reason get_series() is called get_series is not because the returned type is a pandas Series, it is because it is getting data for an economic series

it also seems like it can potentially be quite tricky to try to fit different series data into the same index of a DataFrame. Maybe it's best for the user to decide on how to deal with different frequencies

@elmotec
Copy link
Contributor Author

elmotec commented Nov 9, 2015

Fair enough. I guess we could call it get_multi_series() then to reflect the fact that one fetches multiple series in one call to Python (there are still multiple requests to the web API).

With respect to the join of multiple series, I have not pushed the concept much to be honest (it fits my smallish purpose) but pandas can handle quite a bit of data even if the join were to create a lot of rows (memory is usually the bottleneck when I had problems). I've added a frequency argument to force say a quarterly observation for series with a lot of data points.

Also, note that get_series() does not go away, so users can still aggregate series the way they want. The idea is really to have an easy way to retrieve multiple series in a pandas.DataFrame with just one line of code. I know I find value in that.

One last thing: I can see a challenge to backward compatibility as 9999-12-31 is now interpreted as None rather than literally. That was mostly to play nice with pandas.Timestamp format (which does not go that far in the future) so the data could be used in an index (datetime.datetime cannot), and also because no one really expects the series to be revised on 9999-12-31. None in that context is semantically equivalent to NULL for databases, so I thought it was a better choice than the max value of pandas.Timestamp.

@mortada
Copy link
Owner

mortada commented Nov 9, 2015

pandas has a datetime null value called pd.NaT, would it make sense to do that instead of None?

@elmotec
Copy link
Contributor Author

elmotec commented Nov 11, 2015

Right, right. Somehow I missed it even though pandas actually turns None into NaT. Thank you for pointing it.

What's more, checking it out made me realize that all the comparisons to NaT return false and that actually may not be what I want because things like df.query('rt_start < "2015-02-01" and "2015-02-01" < rt_end') don't return the expected result. Maybe I should change rt_end to pd.Timestamp.max for anything beyond pd.Timestamp.max.

Changed the translation of 9999-12-31 in Fred API to a customizable
variable Fred.latest_time_stamp which defaults to pandas.Timestamp.max

Renamed get_dataframe to get_multi_series to reflect what it does rather
than what it returns.

Made Fred.freq_map an internal variable (Fred.__freq_map).

Documentation improvements.
@elmotec
Copy link
Contributor Author

elmotec commented Nov 19, 2015

@mortada, So I made the replacement of 9999-12-31 a class variable. For now, I think it should default to pd.Timestamp.max to allow selection on the resulting dataframe like df[(df.rt_start < my_date) & (my_date < df.rt_end)]. I also renamed the method get_multi_series.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants