MAINT: Sync and fill benchmarks through latest trading day #2044

Merged 2 commits on Dec 21, 2017
23 changes: 8 additions & 15 deletions zipline/data/loader.py
@@ -135,21 +135,8 @@ def load_market_data(trading_day=None, trading_days=None, bm_symbol='SPY',
     first_date = trading_days[0]
     now = pd.Timestamp.utcnow()
 
-    # We expect to have benchmark and treasury data that's current up until
-    # **two** full trading days prior to the most recently completed trading
-    # day.
-    # Example:
-    # On Thu Oct 22 2015, the previous completed trading day is Wed Oct 21.
-    # However, data for Oct 21 doesn't become available until the early morning
-    # hours of Oct 22. This means that there are times on the 22nd at which we
-    # cannot reasonably expect to have data for the 21st available. To be
-    # conservative, we instead expect that at any time on the 22nd, we can
-    # download data for Tuesday the 20th, which is two full trading days prior
-    # to the date on which we're running a test.
-
-    # We'll attempt to download new data if the latest entry in our cache is
-    # before this date.
-    last_date = trading_days[trading_days.get_loc(now, method='ffill') - 2]
+    # we will fill missing benchmark data through latest trading date
+    last_date = trading_days[trading_days.get_loc(now, method='ffill')]
 
     br = ensure_benchmark_data(
         bm_symbol,
@@ -168,6 +155,12 @@ def load_market_data(trading_day=None, trading_days=None, bm_symbol='SPY',
         now,
         environ,
     )
+
+    # combine dt indices and reindex using ffill then bfill
+    all_dt = br.index.union(tc.index)
+    br = br.reindex(all_dt, method='ffill').fillna(method='bfill')
+    tc = tc.reindex(all_dt, method='ffill').fillna(method='bfill')

freddiev4 (Contributor) commented on Dec 13, 2017:

The tradeoff I'm concerned about is that we'll end up with benchmark returns going back much further than 5 years (the maximum history we can get from IEX), and those earlier returns will all be the same value because we're bfilling.

That uses more memory, and it leaves us with benchmark data that isn't accurate before the 5-year cutoff.

What is it that you're trying to get out of reindexing here?
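
To make the concern concrete, here is a minimal sketch with made-up dates and returns (none of these values come from the PR). It shows what forward fill followed by backward fill does to dates that precede the first real observation: they all receive a copy of that first value.

```python
import pandas as pd

# Hypothetical data: the combined index covers six business days,
# but benchmark returns exist only for the last three of them.
all_dt = pd.date_range("2017-12-04", periods=6, freq="B")
benchmark = pd.Series([0.01, -0.02, 0.03], index=all_dt[3:])

# ffill has nothing to carry forward before the first observation, so the
# leading rows stay NaN until bfill copies the first real value backwards.
filled = benchmark.reindex(all_dt, method="ffill").bfill()

print(filled)
```

The first three rows of `filled` are synthetic copies of the Dec 7 return, which is the "same returns before the cutoff" effect described above.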

Contributor Author:

I'm trying to sync the two time series.

The Federal Reserve H.15 report is not released in step with market closes, so the most recent dates will always be missing from the treasury curve time series. In addition, each time series is missing dates that the other one has. Joining the two indices, re-indexing, and then filling the missing data solves both issues.

If the concern is filling data back past 5 years on the benchmark, you could add a line to drop all dates older than the most recent start of the two respective time series.
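
A minimal sketch of that suggestion, assuming both inputs are date-indexed pandas objects sorted by date. The helper name `sync_and_trim` and the sample values are hypothetical and are not part of this PR. Once everything before the later of the two start dates is dropped, forward fill alone is enough, because every remaining date has an earlier observation in both series.

```python
import pandas as pd


def sync_and_trim(br, tc):
    """Align two date-indexed series on a shared index, then drop padded history.

    Hypothetical helper for illustration only.
    """
    all_dt = br.index.union(tc.index)
    # The later of the two first observations: nothing before it is kept.
    start = max(br.index.min(), tc.index.min())
    br = br.reindex(all_dt, method="ffill")
    tc = tc.reindex(all_dt, method="ffill")
    return br.loc[start:], tc.loc[start:]


# Made-up data: the benchmark skips Dec 7, the treasury series lags and skips Dec 6 and Dec 8.
br = pd.Series(
    [0.01, -0.02, 0.03],
    index=pd.to_datetime(["2017-12-05", "2017-12-06", "2017-12-08"]),
)
tc = pd.Series(
    [1.25, 1.27, 1.26],
    index=pd.to_datetime(["2017-12-04", "2017-12-05", "2017-12-07"]),
)

br_synced, tc_synced = sync_and_trim(br, tc)
print(br_synced)
print(tc_synced)
```

Both results share the same four-day index starting on Dec 5, and neither contains values invented for dates before its own first observation.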

freddiev4 (Contributor) commented on Dec 14, 2017:

OK. That makes sense to me and sounds fair. I also did some quick profiling, and memory usage wasn't what I had expected.

Can you amend your commit to use a Commit Prefix? I think MAINT: would be appropriate here, with a shorter commit message. Should be good to merge after that 🙂

Contributor Author:

@freddiev4 this PR is good to go


+
     benchmark_returns = br[br.index.slice_indexer(first_date, last_date)]
     treasury_curves = tc[tc.index.slice_indexer(first_date, last_date)]
     return benchmark_returns, treasury_curves
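
For readers following the `last_date` change in the first hunk, the sketch below shows the lookup on a made-up calendar. The calendar and the value of `now` are assumptions for illustration. It uses `searchsorted` rather than `get_loc(..., method='ffill')`, since, as far as I know, newer pandas releases no longer accept the `method` argument on `get_loc`.

```python
import pandas as pd

# Hypothetical calendar: plain business days standing in for `trading_days`.
trading_days = pd.date_range("2017-12-11", "2017-12-22", freq="B", tz="UTC")
now = pd.Timestamp("2017-12-16 15:00", tz="UTC")  # a Saturday

# Position of the latest session at or before `now`:
# count the sessions <= now, then step back one to get a zero-based position.
pos = trading_days.searchsorted(now, side="right") - 1

last_date_new = trading_days[pos]      # after this PR: the latest session
last_date_old = trading_days[pos - 2]  # before this PR: two sessions earlier

print(last_date_old, last_date_new)
```

With this calendar the old rule expects data only through Wednesday Dec 13, while the new rule expects it through Friday Dec 15 and relies on the fill step to cover dates the source has not published yet.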