Udacity Nanodegree on practical AI applications in algorithmic trading. Designed by WorldQuant.
- Basic Quantitative Trading - Trading with Momentum
- Advanced Quantitative Trading - Breakout Strategy
- Stocks, Indices, and ETFs - Smart Beta and Portfolio Optimization
- Factor Investing and Alpha Research - Alpha Research and Factor Modeling
- Sentiment Analysis with Natural Language Processing
- Advanced Natural Language Processing with Deep Learning
- Combining Multiple Signals for Enhanced Alpha
- Simulating Trades with Historical Data - Backtesting
- Numpy, Pandas, Matplotlib
- Scikit-learn
- Pytorch
- Quantopian/zipline
- Quotemedia
- Started: September 2019
- Target End: February 2020
- Actual End: January 2020
- import pandas, numpy, helper
- Load Quotemedia EOD Price Data
- Resample to Month-end
close_price.resample('M').last()
- Compute Log Return
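- A minimal sketch, assuming close_price is the month-end close DataFrame (dates x tickers) and numpy is imported as np:
log_returns = np.log(close_price) - np.log(close_price.shift(1))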
- Shift Returns
returns.shift(n)
- Generate Trading Signal
- Strategy tried:
For each month-end observation period, rank the stocks by previous returns, from highest to lowest. Select the top-performing stocks for the long portfolio and the bottom-performing stocks for the short portfolio.
for i, row in prev_price.iterrows(): top_stock.loc[i] = row.nlargest(top_n)
- Projected Return
portfolio_returns = (lookahead_returns * (df_long - df_short))/n_stocks
- Statistical Test
- Annualized Rate of Return
(np.exp(portfolio_returns.T.sum().dropna().mean()*12) - 1) * 100
- T-Test
- Null hypothesis (H0): Actual mean return from the signal is zero.
- When p value < 0.05, the null hypothesis is rejected
- One-sample, one-sided t-test (scipy's ttest_1samp returns a two-sided p-value, so halve it for the one-sided test)
(t_value, p_value) = scipy.stats.ttest_1samp(portfolio_returns, 0.0)  # H0: mean return = 0
- import pandas, numpy, helper
- Load Quotemedia EOD Price Data
- The Alpha Research Process
- What feature of markets or investor behaviour would lead to a persistent anomaly that my signal will try to use?
- Example Hypothesis:
- Stocks oscillate in a range without news or significant interest
- Traders seek to sell at the top of the range and buy at the bottom
- When stocks break out of the range,
- the liquidity traders seek to cover their losses, which magnifies the move out of the range
- the move out of the range attracts other investors' interest due to herd behaviour, which favours continuation of the trend
- Process:
- Observe & Research
- Form Hypothesis
- Validate Hypothesis, then go back to Observe & Research
- Code Expression
- Evaluate in-sample
- Evaluate out-of-sample
- Compute Highs and Lows in a Window
- e.g., Rolling max/min for the past 50 days
- Compute Long and Short Signals
- long = close > high, short = close < low, position = long - short
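- A minimal sketch of the breakout signal, assuming close is a date x ticker close-price DataFrame (shift(1) keeps the current day out of the lookback window):
lookback_high = close.shift(1).rolling(window=50).max()
lookback_low = close.shift(1).rolling(window=50).min()
long_signal = (close > lookback_high).astype(np.int64)
short_signal = (close < lookback_low).astype(np.int64)
signal = long_signal - short_signal  # +1 long, -1 short, 0 flat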
- Filter Signals (5, 10, 20 day signal window)
- Check if there was a signal in the past window_size days
has_past_signal = bool(sum(clean_signals[signal_i:signal_i+window_size]))
- Use the current signal if there's no past signal, else 0/False
clean_signals.append(not has_past_signal and current_signal)
- Apply the above to the short leg (signal[signal == -1].fillna(0).astype(int)) and the long leg, then add them up (see the sketch below)
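- A sketch of the full filter, applied to the long and short legs separately (each leg as a 0/1 series); roughly what the course's signal-clearing helper does:
def filter_signals(signals, window_size):
    clean_signals = [0] * window_size  # pad so the slice below always covers the past window
    for signal_i, current_signal in enumerate(signals):
        has_past_signal = bool(sum(clean_signals[signal_i:signal_i + window_size]))
        clean_signals.append(not has_past_signal and current_signal)
    return np.array(clean_signals[window_size:]).astype(np.int64)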
- Lookahead Close Price
- How many days to short or long
close_price.shift(lookahead_days*-1)
- Lookahead Price Return
- Log return between lookahead_price and close_price
- Compute the Signal Return
- signal * lookahead_returns
- Test for Significance
- Plot a histogram of the signal returns
- Check Outliers in the histogram
- Kolmogorov-Smirnov Test (KS-Test)
- Check which stock is causing the outlying returns
- Run KS-Test on a normal distribution against each stock's signal returns
ks_value, p_value = scipy.stats.kstest(rvs=group['signal_return'].values, cdf='norm', args=(mean_all, std_all))
- Find outliers
- Symbols that reject the null hypothesis (p-value less than 0.05)
- Symbols with a KS value above ks_threshold (0.8)
- Remove them by
good_tickers = list(set(close.columns) - set(outlier_tickers))
- Load large dollar volume stocks from quotemedia
- Calculate Index Weights (dollar volume weights)
- Calculate Portfolio Weights based on Dividend
- Calculate Returns, Weighted Returns, Cumulative Returns
- Tracking Error
np.sqrt(252) * np.std(benchmark_returns_daily - etf_returns_daily, ddof=1)
- Calculate the Covariance of Returns
np.cov(returns.fillna(0).values, rowvar=False)
- Calculate optimal weights
- Portfolio Variance: $\sigma^2_p = \mathbf{x^T} \mathbf{P} \mathbf{x}$
cov_quad = cvx.quad_form(x, P)
- Distance from index weights: $\left | \mathbf{x} - \mathbf{index} \right |_2 = \sqrt{\sum_{i=1}^{n}(weight_i - indexWeight_i)^2}$
index_diff = cvx.norm(x - index_weights, p=2)
- Objective function: $\mathbf{x^T} \mathbf{P} \mathbf{x} + \lambda \left | \mathbf{x} - \mathbf{index} \right |_2$
objective = cvx.Minimize(cov_quad + scale * index_diff)
- Constraints
x = cvx.Variable(len(index_weights))
constraints = [x >= 0, sum(x) == 1]
- Optimization
problem = cvx.Problem(objective, constraints)
problem.solve()
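- Putting the optimization together, a minimal sketch (assumptions: returns is a daily returns DataFrame, index_weights is a Series aligned to its columns, and scale is the λ hyperparameter):
import cvxpy as cvx
import numpy as np

def get_optimal_weights(returns, index_weights, scale=2.0):
    P = np.cov(returns.fillna(0).values, rowvar=False)  # covariance of returns
    x = cvx.Variable(len(index_weights))
    objective = cvx.Minimize(cvx.quad_form(x, P) + scale * cvx.norm(x - index_weights.values, p=2))
    constraints = [x >= 0, cvx.sum(x) == 1]
    cvx.Problem(objective, constraints).solve()
    return x.value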
- Rebalance Portfolio over time
- Portfolio Turnover
- $ AnnualizedTurnover = \frac{SumTotalTurnover}{NumberOfRebalanceEvents} \times NumberOfRebalanceEventsPerYear $
- $ SumTotalTurnover = \sum_{t,n}{\left | x_{t,n} - x_{t+1,n} \right |} $, where $ x_{t,n} $ is the weight at time $ t $ for equity $ n $
- $ SumTotalTurnover $ is just a different way of writing $ \sum \left | x_{t_1,n} - x_{t_2,n} \right | $
- Minimum volatility ETF
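- A quick sketch of the turnover calculation (assumptions: all_rebalance_weights is a list of weight arrays, one per rebalance date, and n_rebalance_events_per_year is, e.g., 12 for monthly rebalancing):
sum_total_turnover = np.sum([np.abs(w_next - w_prev).sum()
                             for w_prev, w_next in zip(all_rebalance_weights[:-1], all_rebalance_weights[1:])])
n_rebalance_events = len(all_rebalance_weights) - 1
annualized_turnover = sum_total_turnover / n_rebalance_events * n_rebalance_events_per_year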
- import cvxpy, numpy, pandas, time, matplotlib.pyplot
- Load equities EOD price (zipline.data.bundles)
bundles.register(bundle_name, ingest_func)
bundles.load(bundle_name)
- Build Pipeline Engine
universe = AverageDollarVolume(window_length=120).top(500)
<- 490 Tickers
engine = SimplePipelineEngine(get_loader, calendar, asset_finder)
- Get Returns
data_portal = DataPortal()
get_pricing = data_portal.get_history_window  # pass assets, end date, bar count, frequency, field, data frequency
returns = get_pricing(...).pct_change()[1:].fillna(0)
<- e.g. 5 year: 1256x490
- Fit PCA
pca = sklearn.decomposition.PCA(n_components=20, svd_solver='full')
pca.fit(returns)
pca.components_
<- 20x490
- Factor Betas
pd.DataFrame(pca.components_.T, index=returns.columns.values, columns=np.arange(20))
<- 490x20
- Factor Returns
pd.DataFrame(pca.transform(returns), index=returns.index, columns=np.arange(20))
<- 1256x20
- Factor Covariance Matrix
np.diag(np.var(factor_returns, axis=0, ddof=1)*252)
<- 20x20
- Idiosyncratic Variance Matrix
_common_returns = pd.DataFrame(np.dot(factor_returns, factor_betas.T), returns.index, returns.columns)
_residuals = returns - _common_returns
pd.DataFrame(np.diag(np.var(_residuals)*252), returns.columns, returns.columns)
<- 490x490
- Idiosyncratic Variance Vector
# equivalent to np.dot(idiosyncratic_variance_matrix, np.ones(len(idiosyncratic_variance_matrix)))
pd.DataFrame(np.diag(idiosyncratic_variance_matrix), returns.columns)
- Predict Portfolio Risk using the Risk Model
- $ \sqrt{X^{T}(BFB^{T} + S)X} $ where:
- $ X $ is the portfolio weights
- $ B $ is the factor betas
- $ F $ is the factor covariance matrix
- $ S $ is the idiosyncratic variance matrix
np.sqrt(weight_df.T.dot(factor_betas.dot(factor_cov_matrix).dot(factor_betas.T) + idiosyncratic_var_matrix).dot(weight_df))
- Momentum 1 Year Factor
- Mean Reversion 5 Day Sector Neutral Factor
- Mean Reversion 5 Day Sector Neutral Smoothed Factor
- Overnight Sentiment Factor
- Overnight Sentiment Smoothed Factor
- Combine the Factors to a single Pipeline
pipeline = Pipeline(screen=universe)
pipeline.add(momentum_1yr(252, universe, sector), 'Momentum_1YR')
# ... add the remaining factors the same way ...
all_factors = engine.run_pipeline(pipeline, start, end)
- Get Pricing Data
assets = all_factors.index.levels[1].values.tolist()
- Format Alpha Factors and Pricing for Alphalens
clean_factor_data = {factor_name: alphalens.utils.get_clean_factor_and_forward_returns(factor_series, prices, periods=[1]) for factor_name, factor_series in all_factors.iteritems()}
unixt_factor_data = {factor: factor_data.set_index(pd.MultiIndex.from_tuples([(x.timestamp(), y) for x, y in factor_data.index.values], names=['date', 'asset'])) for factor, factor_data in clean_factor_data.items()}
- Quantile Analysis
- Factor Returns:
(1 + alphalens.performance.factor_returns(factor_data).iloc[:, 0]).cumprod().plot()
- This should generally move up and to the right
- Basis Points Per Day per Quantile
alphalens.performance.mean_return_by_quantile(factor_data)[0].iloc[:, 0].plot.bar()
- Should be monotonic across quantiles; not too much of the return should come from the short side, which is less practical to implement
- Return spread: (Q1 minus Q5) * 252; assuming transaction costs cut this in half, it should be clear that these alphas can only survive in an institutional setting and that leverage will likely need to be applied
- Turnover Analysis
- Light test before full backtest to see the stability of the alphas over time
- Factor Rank Autocorrelation (FRA) should be close to 1
alphalens.performance.factor_rank_autocorrelation(factor_data).plot()
- Sharpe Ratio of the Alphas
pd.Series(data=np.sqrt(252)*factor_returns.mean()/factor_returns.std())
- The Combined Alpha Vector
- Use ML (e.g., a Random Forest) to produce a single score per stock
- A simpler approach is to just average the factors (see the sketch below)
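- A minimal sketch of the averaging approach (the factor column names here are assumed, not the notebook's exact ones):
factor_names = ['Momentum_1YR', 'Mean_Reversion_Sector_Neutral_Smoothed', 'Overnight_Sentiment_Smoothed']
alpha_vector = all_factors[factor_names].mean(axis=1).to_frame('alpha_vector')  # one score per (date, asset)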
- Objective and Constraints
- Objective Function:
- CVXPY objective function that maximizes $ \alpha^T * x \ $, where $ x $ is the portfolio weights and $ \alpha $ is the alpha vector.
cvx.Minimize(-alpha_vector.values.flatten()*weights)
- Constraints
- $ r \leq risk_{\text{cap}}^2 \ $
risk <= self.risk_cap **2
- $ B^T * x \preceq factor_{\text{max}} \ $
factor_betas.T*weights <= self.factor_max
- $ B^T * x \succeq factor_{\text{min}} \ $
factor_betas.T*weights >= self.factor_min
- $ x^T\mathbb{1} = 0 \ $
sum(weights) == 0.0
- $ |x|_1 \leq 1 \ $
sum(cvx.abs(weights)) <= 1.0
- $ x \succeq weights_{\text{min}} \ $
weights >= self.weights_min
- $ x \preceq weights_{\text{max}} $
weights <= self.weights_max
- Where $ x $ is the portfolio weights, $ B $ is the factor betas, and $ r $ is the portfolio risk
- OptimalHoldings(ABC).find()
weights = cvx.Variable(len(alpha_vector))
f = factor_betas.T*weights  # portfolio factor exposures
risk = cvx.quad_form(f, factor_cov_matrix) + cvx.quad_form(weights, idiosyncratic_var_matrix)
prob = cvx.Problem(obj, constraints)
prob.solve(max_iters=500)
optimal_weights = np.asarray(weights.value).flatten()
return pd.DataFrame(data=optimal_weights, index=alpha_vector.index)
- Optimize with a Regularization Parameter
- To enforce diversification, change Objective Function
- CVXPY objective function that maximizes $ \alpha^T * x - \lambda|x|_2 \ $, where $ x $ is the portfolio weights, $ \alpha $ is the alpha vector, and $ \lambda $ is the regularization parameter.
objective = cvx.Minimize(-alpha_vector.values.flatten()*weights + self.lambda_reg*cvx.norm(weights, 2))
- Optimize with Strict Factor Constraints and Target Weighting
- Another common constraint is to take a predefined target weighting $x^*$ (e.g., a quantile portfolio) and solve to get as close to that portfolio as possible while respecting portfolio-level constraints.
- Minimize $ |x - x^*|_2 $, where $ x $ is the portfolio weights and $ x^* $ is the target weighting
objective = cvx.Minimize(cvx.norm(alpha_vector.values.flatten() - weights, 2))
- import nltk, numpy, pandas, pickle, pprint, tqdm.tqdm, bs4.BeautifulSoup, re
nltk.download('stopwords'), nltk.download('wordnet')
- Get 10-k documents
- Limit the number of requests per second with the @limits decorator
feed = BeautifulSoup(requests.get(url).text).feed
entries = [(entry.content.find('filing-href').getText(), ...) for entry in feed.find_all('entry')]
- Download 10-k documents
- Extract Documents
doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_start_position_list = [x.end() for x in doc_start_pattern.finditer(text)]
- Get Document Types
doc_type_pattern = re.compile(r'<TYPE>[^\n]+')
doc_type = doc_type_pattern.findall(doc)[0][len("<TYPE>"):].lower()
- Process the Data
- Clean up
text.lower()
BeautifulSoup(text, 'html.parser').get_text()
- Lemmatize
nltk.stem.WordNetLemmatizer, nltk.corpus.wordnet
- Remove Stopwords
nltk.corpus.stopwords
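- A short sketch of the lemmatize and stopword steps (assumption: word_tokens is a list of lowercase words):
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

wnl = WordNetLemmatizer()
lemmatized = [wnl.lemmatize(w, pos=wordnet.VERB) for w in word_tokens]  # lemmatize as verbs
stop_words = set(stopwords.words('english'))
cleaned = [w for w in lemmatized if w not in stop_words]  # drop stopwords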
- Analysis on 10ks
- Loughran and McDonald sentiment word list
- Negative, Positive, Uncertainty, Litigious, Constraining, Superfluous, Modal
- Sentiment Bag of Words (Count for each ticker, sentiment)
vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='word', vocabulary=sentiment)
X = vectorizer.fit_transform(docs)
features = vectorizer.get_feature_names()
- Jaccard Similarity
sklearn.metrics.jaccard_similarity_score(u, v)
- Get the similarity between neighboring bags of words
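- A sketch, assuming bow is a dense (documents x words) count array for one ticker ordered by filing date, as in the course notebook (newer scikit-learn replaces jaccard_similarity_score with jaccard_score):
bow_bool = bow > 0
jaccard_similarities = [sklearn.metrics.jaccard_similarity_score(bow_bool[i], bow_bool[i + 1])
                        for i in range(len(bow_bool) - 1)]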
- TF-IDF
sklearn.feature_extraction.text.TfidfVectorizer(analyzer='word', vocabulary=sentiments)
- Cosine Similarity
sklearn.metrics.pairwise.cosine_similarity(u, v)
- Get the similarity between neighboring TF-IDF vectors
- Evaluate Alpha Factors
- Use yearly pricing to match the annual publication frequency of 10-Ks
- Turn the sentiment dictionary into a DataFrame so that alphalens can read it
- Alphalens Format
data = alphalens.utils.get_clean_factor_and_forward_returns(df.stack(), pricing, quantiles=5, bins=None, periods=[1])
- Alphalens Format with Unix Timestamp
{factor: data.set_index(pd.MultiIndex.from_tuples([(x.timestamp(), y) for x, y in data.index.values], names=['date', 'asset'])) for factor, data in factor_data.items()}
- Factor Returns
alphalens.performance.factor_returns(data)
- Should move up and to the right
- Basis Points Per Day per Quantile
alphalens.performance.mean_return_by_quantile(data)
- Should be monotonic in quantiles
- Turnover Analysis
- Factor Rank Autocorrelation (FRA) to measure the stability without full backtest
alphalens.performance.factor_rank_autocorrelation(data)
- Sharpe Ratio of the Alphas
- Should be 1 or higher
np.sqrt(252)*factor_returns.mean() / factor_returns.std()
- import json, nltk, os, random, re, torch, torch.nn, torch.optim, torch.nn.functional, numpy
- Import Twits
- json.load()
- Preprocessing the Data
- Pre-Processing
nltk.download('wordnet')
nltk.download('stopwords')
text = message.lower()
text = re.sub(r'https?:\/\/[a-zA-Z0-9@:%._\/+~#=?&;-]*', ' ', text)  # remove URLs
text = re.sub(r'\$[a-zA-Z0-9]*', ' ', text)                          # remove ticker symbols like $AAPL
text = re.sub(r'\@[a-zA-Z0-9]*', ' ', text)                          # remove @usernames
text = re.sub(r'[^a-zA-Z]', ' ', text)                               # keep letters only
tokens = text.split()
wnl = nltk.stem.WordNetLemmatizer()
tokens = [wnl.lemmatize(wnl.lemmatize(word, 'n'), 'v') for word in tokens]
- Bag of Words
counts = Counter(all_words)
bow = sorted(counts, key=counts.get, reverse=True)
- Remove the most common words such as 'the' and 'and' (the top high_cutoff=20 words) and rare words (frequency below low_cutoff=1e-6); see the sketch below
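- A sketch of the frequency filter (assumptions: counts is the Counter above and total_num_words = len(all_words)):
freqs = {word: count / total_num_words for word, count in counts.items()}
filtered_words = [word for word in freqs
                  if freqs[word] > low_cutoff and word not in bow[:high_cutoff]]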
- Create Dictionaries
vocab = {word: ii for ii, word in enumerate(filtered_words, 1)}
id2vocab = {v: k for k, v in vocab.items()}
filtered = [[word for word in message if word in vocab] for message in tokenized]
- Balancing the classes
- 50% is neutral --> make it 20% by dropping some neutral twits
- Remove messages with zero length
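- A sketch of the balancing step (assumptions: token_ids and sentiments are parallel lists and the neutral class is labeled 2):
n_neutral = sum(1 for s in sentiments if s == 2)
keep_prob = (len(sentiments) - n_neutral) / 4 / n_neutral  # keeps roughly 20% neutral
balanced = {'messages': [], 'sentiments': []}
for message, sentiment in zip(token_ids, sentiments):
    if len(message) == 0:
        continue  # drop zero-length messages
    if sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment)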
- Neural Network
- Embed -> RNN -> Dense -> Softmax
- Text Classifier
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size
        self.lstm_layers = lstm_layers

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers, dropout=dropout, batch_first=False)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_(),
                  weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        return hidden

    def forward(self, nn_input, hidden_state):
        nn_input = nn_input.long()
        embeds = self.embedding(nn_input)
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
        lstm_out = lstm_out[-1, :, :]  # keep the last LSTM output (sequence-first layout)
        out = self.dropout(lstm_out)
        out = self.fc(out)
        logps = self.softmax(out)
        return logps, hidden_state
- Training
- DataLoaders and Batching
- Input Tensor shape should be (sequence_length, batch_size)
- Left pad with zeros if a message has fewer tokens than sequence_length
- If a message has more tokens than sequence_length, keep the first sequence_length tokens
- Build a DataLoader as a generator
def dataloader(): yield batch, label_tensor # both variables are torch.tensor()
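- A fuller sketch of the generator (assumptions: messages is a list of token-id lists, labels a list of ints):
def dataloader(messages, labels, sequence_length=20, batch_size=32, shuffle=False):
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[i] for i in indices]
        labels = [labels[i] for i in indices]
    for ii in range(0, len(messages), batch_size):
        batch_messages = messages[ii:ii + batch_size]
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens[:sequence_length])  # keep the first sequence_length tokens
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor  # left-pad with zeros
        label_tensor = torch.tensor(labels[ii:ii + len(batch_messages)])
        yield batch, label_tensor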
- Training and Validation
- Split data to training set and validation set, then check the model
text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=64)))
model = TextClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
hidden = model.init_hidden(64)
logps, hidden = model.forward(text_batch, hidden)
print(logps)
- Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TextClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
model.to(device)
- Train!
epochs = 3
batch_size = 1024
learning_rate = 0.001
clip = 5
print_every = 100
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()
for epoch in range(epochs):
    print('Starting epoch {}'.format(epoch + 1))
    hidden = model.init_hidden(batch_size)
    steps = 0
    for text_batch, labels in dataloader(train_features, train_labels, batch_size=batch_size, sequence_length=20, shuffle=True):
        steps += 1
        if text_batch.size(1) != batch_size:
            break
        hidden = tuple(each.data for each in hidden)  # detach hidden state from the previous batch
        text_batch, labels = text_batch.to(device), labels.to(device)
        hidden = tuple(each.to(device) for each in hidden)
        model.zero_grad()
        output, hidden = model(text_batch, hidden)
        loss = criterion(output, labels)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)  # prevent exploding gradients
        optimizer.step()
        if steps % print_every == 0:
            model.eval()
            valid_losses, accuracy = [], []
            valid_hidden = model.init_hidden(batch_size)
            for text_batch, labels in dataloader(valid_features, valid_labels, batch_size=batch_size, sequence_length=20, shuffle=False):
                if text_batch.size(1) != batch_size:
                    break
                valid_hidden = tuple(each.data for each in valid_hidden)
                text_batch, labels = text_batch.to(device), labels.to(device)
                valid_hidden = tuple(each.to(device) for each in valid_hidden)
                valid_output, valid_hidden = model(text_batch, valid_hidden)
                valid_loss = criterion(valid_output.squeeze(), labels)
                valid_losses.append(valid_loss.item())
                ps = torch.exp(valid_output)
                top_p, top_class = ps.topk(1, dim=1)
                equals = top_class == labels.view(*top_class.shape)
                accuracy.append(torch.mean(equals.type(torch.FloatTensor)).item())
            model.train()
            print("Epoch: {}/{}...".format(epoch + 1, epochs),
                  "Step: {}...".format(steps),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(valid_losses)),
                  "Accuracy: {:.6f}".format(np.mean(accuracy)))
- Making Predictions
- Preprocess, filter non-vocab words, convert words to ids, and add a batch dimension: torch.tensor(tokens).view(-1, 1)
hidden = model.init_hidden(1)
logps, _ = model.forward(text_input, hidden)
pred = torch.exp(logps)
- Testing
- import numpy, pandas, tqdm, matplotlib.pyplot
- Data Pipeline
zipline.data.bundles
- register, load
zipline.pipeline.Pipeline
universe = zipline.pipeline.factors.AverageDollarVolume
zipline.utils.calendar.get_calendar('NYSE')
zipline.pipeline.loaders.USEquityPricingLoader
engine = zipline.pipeline.engine.SimplePipelineEngine
zipline.data.data_portal.DataPortal
- Alpha Factors
- Momentum 1 Year Factor
zipline.pipeline.factors.Returns().demean(groupby=Sector).rank().zscore()
- Mean Reversion 5 Day Sector Neutral Smoothed Factor
unsmoothed = -Returns().demean(groupby=Sector).rank().zscore()
smoothed = zipline.pipeline.factors.SimpleMovingAverage(unsmoothed).rank().zscore()
- Overnight Sentiment Smoothed Factor
- CTO(Returns), TrailingOvernightReturns(Returns)
- Combine the three factors by pipeline.add()
- Features and Labels
- Universal Quant Features
- Stock Volatility 20d, 120d:
pipeline.add(zipline.pipeline.factors.AnnualizedVolatility)
- Stock Dollar Volume 20d, 120d:
pipeline.add(zipline.pipeline.factors.AverageDollarVolume)
- Sector
- Regime Features
- High and low volatility 20d, 120d:
MarketVolatility(CustomFactor)
- High and low dispersion 20d, 120d:
SimpleMovingAverage(MarketDispersion(CustomFactor))
- Target
- 1 Week Return, Quantized:
pipeline.add(Returns().quantiles(2)), pipeline.add(Returns().quantiles(25))
- engine.run_pipeline()
- Date Feature
- January, December, Weekday, Quarter, Qtr-Year, Month End, Month Start, Qtr Start, Qtr End
- One-hot encode Sector
- Shift Target
- IID Check (Independent and Identically Distributed)
- Check rolling autocorrelation between the target and its 1d to 5d shifted versions using
scipy.stats.spearmanr
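- A rough sketch of the idea (assumption: target is a date x ticker DataFrame of the quantized 1-week return):
from scipy.stats import spearmanr

for shift_days in range(1, 6):
    shifted = target.shift(shift_days)
    corr, p = spearmanr(target.iloc[shift_days:].values.flatten(),
                        shifted.iloc[shift_days:].values.flatten(),
                        nan_policy='omit')
    print(shift_days, corr)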
- Train/Validation/Test Splits
- Random Forests
- Visualize a Simple Tree
- clf = sklearn.tree.DecisionTreeClassifier()
- Graph:
IPython.display.display
- Rank features by importance
clf.feature_importances_
- Random Forest
- clf = sklearn.ensemble.RandomForestClassifier()
- Scores:
clf.score(), clf.oob_score_, clf.feature_importances_
- Model Results
- Sharpe Ratios
np.sqrt(252)*factor_returns.mean()/factor_returns.std()
- Factor Returns
alphalens.performance.factor_returns()
- Factor Rank Autocorrelation
alphalens.performance.factor_rank_autocorrelation()
- Scores:
clf.predict_proba()
- Check the above for Training Data and Validation Data
- Overlapping Samples
- Option 1) Drop Overlapping Samples
- Option 2) Use sklearn.ensemble.BaggingClassifier's max_samples with base_clf = DecisionTreeClassifier() (see the sketch below)
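- A minimal sketch of Option 2 (the parameter values are illustrative, not the course's):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base_clf = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=0)
clf = BaggingClassifier(
    base_estimator=base_clf,  # newer scikit-learn renames this parameter to `estimator`
    n_estimators=200,
    max_samples=0.2,          # each tree sees only a fraction of the overlapping samples
    oob_score=True,
    random_state=0)
clf.fit(X_train, y_train)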
- Option 3) Build an ensemble of non-overlapping trees
sklearn.ensemble.VotingClassifier, sklearn.base.clone, sklearn.preprocessing.LabelEncoder, sklearn.utils.Bunch
- Final Model
- Re-Training Model using Training Set + Validation Set
1. Load Price, Covariance, and Factor Exposure data from Barra - data.update(pickle.load())
2. Shift daily returns by 2 days
3. Winsorize
- np.where(x <= a, a, np.where(x >= b, b, x)) and density plot
4. Factor Exposures and Factor Returns
- model = ols (Ordinary Least Squares)
- universe = Market Cap > 1e9, winsorized
- variables: dependent = Daily Return, independent = Factor Exposures
- estimation: Factor Returns
5. Choose 4 Alpha Factors
- 1 Day Reversal, Earnings Yield, Value, Sentiment
6. Merge Previous Portfolio Holdings and add h.opt.previous (filled with 0 where missing)
7. Convert all NaN to 0, and use the median for 0 Specific Risk
8. Build Universe
- (df['IssuerMarketCap'] >= 1e9) | (abs(df['h.opt.previous']) > 0.0)
9. Set Risk Factors (B)
- All Factors - Alpha Factors
- patsy.dmatrices to one-hot encode categories
10. Calculate Specific Variance
- (Specific Risk * 0.01)**2
11. Build Factor Covariance Matrix
- Use only the diagonal (off-diagonal entries set to 0)
12. Estimate Transaction Cost
- Lambda
13. Combine the four Alpha Factors
- sum of the B_alpha columns (design matrix) * 1e-4
14. Define Objective Function
- $$ f(\mathbf{h}) = \frac{1}{2}\kappa \mathbf{h}_t^T\mathbf{Q}^T\mathbf{Q}\mathbf{h}_t + \frac{1}{2} \kappa \mathbf{h}_t^T \mathbf{S} \mathbf{h}_t - \mathbf{\alpha}^T \mathbf{h}_t + (\mathbf{h}_{t} - \mathbf{h}_{t-1})^T \mathbf{\Lambda} (\mathbf{h}_{t} - \mathbf{h}_{t-1}) $$
15. Define Gradient of Objective Function
- $$ f'(\mathbf{h}) = \frac{1}{2}\kappa (2\mathbf{Q}^T\mathbf{Q}\mathbf{h}_t) + \frac{1}{2}\kappa (2\mathbf{S}\mathbf{h}_t) - \mathbf{\alpha} + 2\mathbf{\Lambda}(\mathbf{h}_{t} - \mathbf{h}_{t-1}) $$
16. Optimize Portfolio
- h = scipy.optimize.fmin_l_bfgs_b(func, initial_guess, func_gradient)
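- A sketch of the optimizer call (assumptions: Q is a matrix with Q.T @ Q equal to B F B.T, S and Lambda are 1-D arrays holding the diagonals of S and Λ, alpha is the combined alpha vector, h_previous the previous holdings, kappa the risk-aversion parameter):
def obj_func(h):
    return 0.5 * kappa * h.dot(Q.T.dot(Q)).dot(h) \
         + 0.5 * kappa * h.dot(S * h) \
         - alpha.dot(h) \
         + (h - h_previous).dot(Lambda * (h - h_previous))

def grad_func(h):
    return kappa * (Q.T.dot(Q.dot(h)) + S * h) - alpha + 2 * Lambda * (h - h_previous)

h_optimal, f_min, info = scipy.optimize.fmin_l_bfgs_b(obj_func, h_previous, fprime=grad_func)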
17. Calculate Risk Exposure
- B.T * h
18. Calculate Alpha Exposure
- B_alpha.T * h
19. Calculate Transaction Cost
- $$ tcost = \sum_i^{N} \lambda_{i} (h_{i,t} - h_{i,t-1})^2 $$
20. Build Tradelist
- h - h_previous
21. Save optimal holdings as previous optimal holdings
- h_previous = h
22. Run the Backtest
- Loop #6 to #21 for all the dates
23. PnL Attribution
- $$ {PnL}_{alpha} = f \times b_{alpha} $$
- $$ {PnL}_{risk} = f \times b_{risk} $$
24. Build Portfolio Characteristics
- Calculate the sum of long positions, short positions, net position, gross market value, and dollars traded (see the sketch below)
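- A sketch (assumptions: h is a date x ticker DataFrame of optimal dollar holdings and tradelist = h - h_previous per date):
long_pos = h[h > 0].sum(axis=1)    # sum of long positions
short_pos = h[h < 0].sum(axis=1)   # sum of short positions
net_pos = long_pos + short_pos
gross_market_value = h.abs().sum(axis=1)
dollars_traded = tradelist.abs().sum(axis=1)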