From ec29461d4b4547845d1b7d0f260e376f44fd67f7 Mon Sep 17 00:00:00 2001 From: Aditi Date: Sat, 10 Aug 2024 10:25:37 +0530 Subject: [PATCH] Added Dow_Jones_Industrial_Average_Prediction --- .../DJIA_prediction_using_news.html | 18768 ++++++++++++++++ .../DJIA_prediction_using_news.py | 1103 + .../README.md | 2 + 3 files changed, 19873 insertions(+) create mode 100644 NYSE/Dow_Jones_Industrial_Average_Prediction/DJIA_prediction_using_news.html create mode 100644 NYSE/Dow_Jones_Industrial_Average_Prediction/DJIA_prediction_using_news.py create mode 100644 NYSE/Dow_Jones_Industrial_Average_Prediction/README.md diff --git a/NYSE/Dow_Jones_Industrial_Average_Prediction/DJIA_prediction_using_news.html b/NYSE/Dow_Jones_Industrial_Average_Prediction/DJIA_prediction_using_news.html new file mode 100644 index 00000000..25539ce1 --- /dev/null +++ b/NYSE/Dow_Jones_Industrial_Average_Prediction/DJIA_prediction_using_news.html @@ -0,0 +1,18768 @@ + + + + +DJIA prediction using news + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+
+
+

Dow Jones Industrial Average (DJIA) prediction

We will be predicting the DJIA closing value by using the top 25 headlines for the day.

+

The Dow Jones Industrial Average, Dow Jones, or simply the Dow, is a stock market index that measures the stock performance of 30 large companies listed on stock exchanges in the United States.

+


+ +
+
+
+
+
+
+

Step 1 - Loading the required libraries

+
+
+
+
+
+
In [6]:
+
+
+
import time
+start = time.time()
+import os
+
+import numpy as np # linear algebra
+import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
+
+from sklearn.model_selection import cross_val_score #sklearn features various ML algorithms
+from sklearn.feature_selection import VarianceThreshold
+from sklearn.preprocessing import StandardScaler
+from nltk.sentiment.vader import SentimentIntensityAnalyzer #sentiment analysis 
+from textblob import TextBlob #processing textual data
+
+#plotting
+import plotly.express as px
+import plotly.graph_objects as go
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+#statistics & econometrics
+import statsmodels.tsa.api as smt
+import statsmodels.api as sm
+
+#model fitting and selection
+from sklearn.metrics import mean_squared_error
+from sklearn.metrics import make_scorer
+from sklearn.preprocessing import StandardScaler
+from sklearn.pipeline import Pipeline
+from sklearn.model_selection import GridSearchCV
+from sklearn.model_selection import TimeSeriesSplit
+from sklearn.linear_model import Lasso, Ridge
+from sklearn.ensemble import RandomForestRegressor
+from xgboost.sklearn import XGBRegressor
+
+ +
+
+
+ +
+
+
+
+

Step 2 - Loading the data

+
+
+
+
+
+
In [7]:
+
+
+
df = pd.read_csv("Combined_News_DJIA.csv", sep=',',low_memory=False,
+                    parse_dates=[0])
+
+full_stock = pd.read_csv("upload_DJIA_table.csv",low_memory=False,
+                    parse_dates=[0])
+
+#add the closing stock value to the df - this will be the y variable
+df["Close"]=full_stock.Close
+
+#show what the dataset looks like
+df.head(5)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[7]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DateLabelTop1Top2Top3Top4Top5Top6Top7Top8...Top17Top18Top19Top20Top21Top22Top23Top24Top25Close
02008-08-080b"Georgia 'downs two Russian warplanes' as cou...b'BREAKING: Musharraf to be impeached.'b'Russia Today: Columns of troops roll into So...b'Russian tanks are moving towards the capital...b"Afghan children raped with 'impunity,' U.N. ...b'150 Russian tanks have entered South Ossetia...b"Breaking: Georgia invades South Ossetia, Rus...b"The 'enemy combatent' trials are nothing but......b'Al-Qaeda Faces Islamist Backlash'b'Condoleezza Rice: "The US would not act to p...b'This is a busy day: The European Union has ...b"Georgia will withdraw 1,000 soldiers from Ir...b'Why the Pentagon Thinks Attacking Iran is a ...b'Caucasus in crisis: Georgia invades South Os...b'Indian shoe manufactory - And again in a se...b'Visitors Suffering from Mental Illnesses Ban...b"No Help for Mexico's Kidnapping Surge"17949.369141
12008-08-111b'Why wont America and Nato help us? If they w...b'Bush puts foot down on Georgian conflict'b"Jewish Georgian minister: Thanks to Israeli ...b'Georgian army flees in disarray as Russians ...b"Olympic opening ceremony fireworks 'faked'"b'What were the Mossad with fraudulent New Zea...b'Russia angered by Israeli military sale to G...b'An American citizen living in S.Ossetia blam......b'"Do not believe TV, neither Russian nor Geor...b'Riots are still going on in Montreal (Canada...b'China to overtake US as largest manufacturer'b'War in South Ossetia [PICS]'b'Israeli Physicians Group Condemns State Tort...b' Russia has just beaten the United States ov...b'Perhaps *the* question about the Georgia - R...b'Russia is so much better at war'b"So this is what it's come to: trading sex fo...17929.990234
22008-08-120b'Remember that adorable 9-year-old who sang a...b"Russia 'ends Georgia operation'"b'"If we had no sexual harassment we would hav...b"Al-Qa'eda is losing support in Iraq because ...b'Ceasefire in Georgia: Putin Outmaneuvers the...b'Why Microsoft and Intel tried to kill the XO...b'Stratfor: The Russo-Georgian War and the Bal...b"I'm Trying to Get a Sense of This Whole Geor......b'Why Russias response to Georgia was right'b'Gorbachev accuses U.S. of making a "serious ...b'Russia, Georgia, and NATO: Cold War Two'b'Remember that adorable 62-year-old who led y...b'War in Georgia: The Israeli connection'b'All signs point to the US encouraging Georgi...b'Christopher King argues that the US and NATO...b'America: The New Mexico?'b"BBC NEWS | Asia-Pacific | Extinction 'by man...17694.679688
32008-08-130b' U.S. refuses Israel weapons to attack Iran:...b"When the president ordered to attack Tskhinv...b' Israel clears troops who killed Reuters cam...b'Britain\'s policy of being tough on drugs is...b'Body of 14 year old found in trunk; Latest (...b'China has moved 10 *million* quake survivors...b"Bush announces Operation Get All Up In Russi...b'Russian forces sink Georgian ships '...b'US humanitarian missions soon in Georgia - i...b"Georgia's DDOS came from US sources"b'Russian convoy heads into Georgia, violating...b'Israeli defence minister: US against strike ...b'Gorbachev: We Had No Choice'b'Witness: Russian forces head towards Tbilisi...b' Quarter of Russians blame U.S. for conflict...b'Georgian president says US military will ta...b'2006: Nobel laureate Aleksander Solzhenitsyn...17409.720703
42008-08-141b'All the experts admit that we should legalis...b'War in South Osetia - 89 pictures made by a ...b'Swedish wrestler Ara Abrahamian throws away ...b'Russia exaggerated the death toll in South O...b'Missile That Killed 9 Inside Pakistan May Ha...b"Rushdie Condemns Random House's Refusal to P...b'Poland and US agree to missle defense deal. ...b'Will the Russians conquer Tblisi? Bet on it,......b"Georgia confict could set back Russia's US r...b'War in the Caucasus is as much the product o...b'"Non-media" photos of South Ossetia/Georgia ...b'Georgian TV reporter shot by Russian sniper ...b'Saudi Arabia: Mother moves to block child ma...b'Taliban wages war on humanitarian aid workers'b'Russia: World "can forget about" Georgia\'s...b'Darfur rebels accuse Sudan of mounting major...b'Philippines : Peace Advocate say Muslims nee...17140.240234
+

5 rows Ă— 28 columns

+
+
+ +
+ +
+
+ +
+
+
+
In [8]:
+
+
+
#drop the label column
+df2 = df.drop(['Label'], axis=1)
+
+#view the new dataframe
+df2.head(10)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[8]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DateTop1Top2Top3Top4Top5Top6Top7Top8Top9...Top17Top18Top19Top20Top21Top22Top23Top24Top25Close
02008-08-08b"Georgia 'downs two Russian warplanes' as cou...b'BREAKING: Musharraf to be impeached.'b'Russia Today: Columns of troops roll into So...b'Russian tanks are moving towards the capital...b"Afghan children raped with 'impunity,' U.N. ...b'150 Russian tanks have entered South Ossetia...b"Breaking: Georgia invades South Ossetia, Rus...b"The 'enemy combatent' trials are nothing but...b'Georgian troops retreat from S. Osettain cap......b'Al-Qaeda Faces Islamist Backlash'b'Condoleezza Rice: "The US would not act to p...b'This is a busy day: The European Union has ...b"Georgia will withdraw 1,000 soldiers from Ir...b'Why the Pentagon Thinks Attacking Iran is a ...b'Caucasus in crisis: Georgia invades South Os...b'Indian shoe manufactory - And again in a se...b'Visitors Suffering from Mental Illnesses Ban...b"No Help for Mexico's Kidnapping Surge"17949.369141
12008-08-11b'Why wont America and Nato help us? If they w...b'Bush puts foot down on Georgian conflict'b"Jewish Georgian minister: Thanks to Israeli ...b'Georgian army flees in disarray as Russians ...b"Olympic opening ceremony fireworks 'faked'"b'What were the Mossad with fraudulent New Zea...b'Russia angered by Israeli military sale to G...b'An American citizen living in S.Ossetia blam...b'Welcome To World War IV! Now In High Definit......b'"Do not believe TV, neither Russian nor Geor...b'Riots are still going on in Montreal (Canada...b'China to overtake US as largest manufacturer'b'War in South Ossetia [PICS]'b'Israeli Physicians Group Condemns State Tort...b' Russia has just beaten the United States ov...b'Perhaps *the* question about the Georgia - R...b'Russia is so much better at war'b"So this is what it's come to: trading sex fo...17929.990234
22008-08-12b'Remember that adorable 9-year-old who sang a...b"Russia 'ends Georgia operation'"b'"If we had no sexual harassment we would hav...b"Al-Qa'eda is losing support in Iraq because ...b'Ceasefire in Georgia: Putin Outmaneuvers the...b'Why Microsoft and Intel tried to kill the XO...b'Stratfor: The Russo-Georgian War and the Bal...b"I'm Trying to Get a Sense of This Whole Geor...b"The US military was surprised by the timing ......b'Why Russias response to Georgia was right'b'Gorbachev accuses U.S. of making a "serious ...b'Russia, Georgia, and NATO: Cold War Two'b'Remember that adorable 62-year-old who led y...b'War in Georgia: The Israeli connection'b'All signs point to the US encouraging Georgi...b'Christopher King argues that the US and NATO...b'America: The New Mexico?'b"BBC NEWS | Asia-Pacific | Extinction 'by man...17694.679688
32008-08-13b' U.S. refuses Israel weapons to attack Iran:...b"When the president ordered to attack Tskhinv...b' Israel clears troops who killed Reuters cam...b'Britain\'s policy of being tough on drugs is...b'Body of 14 year old found in trunk; Latest (...b'China has moved 10 *million* quake survivors...b"Bush announces Operation Get All Up In Russi...b'Russian forces sink Georgian ships 'b"The commander of a Navy air reconnaissance s......b'US humanitarian missions soon in Georgia - i...b"Georgia's DDOS came from US sources"b'Russian convoy heads into Georgia, violating...b'Israeli defence minister: US against strike ...b'Gorbachev: We Had No Choice'b'Witness: Russian forces head towards Tbilisi...b' Quarter of Russians blame U.S. for conflict...b'Georgian president says US military will ta...b'2006: Nobel laureate Aleksander Solzhenitsyn...17409.720703
42008-08-14b'All the experts admit that we should legalis...b'War in South Osetia - 89 pictures made by a ...b'Swedish wrestler Ara Abrahamian throws away ...b'Russia exaggerated the death toll in South O...b'Missile That Killed 9 Inside Pakistan May Ha...b"Rushdie Condemns Random House's Refusal to P...b'Poland and US agree to missle defense deal. ...b'Will the Russians conquer Tblisi? Bet on it,...b'Russia exaggerating South Ossetian death tol......b"Georgia confict could set back Russia's US r...b'War in the Caucasus is as much the product o...b'"Non-media" photos of South Ossetia/Georgia ...b'Georgian TV reporter shot by Russian sniper ...b'Saudi Arabia: Mother moves to block child ma...b'Taliban wages war on humanitarian aid workers'b'Russia: World "can forget about" Georgia\'s...b'Darfur rebels accuse Sudan of mounting major...b'Philippines : Peace Advocate say Muslims nee...17140.240234
52008-08-15b"Mom of missing gay man: Too bad he's not a 2...b"Russia: U.S. Poland Missile Deal Won't Go 'U...b"The government has been accused of creating ...b'The Italian government has lashed out at an ...b'Gorbachev: Georgia started conflict in S. Os...b"China fakes more than your girlfriend; 'Ethn...b"The UN's criticism of freedom of expression ...b'Russian general threatens nuclear strike on ...b'Russia can inspect Polish missile defence site'...b'Johann Hari: We need to stop being such cowa...b'US officials have said that their military p...b'Israel clears troops who killed Reuters came...b'Unenforceable laws encourage cops to escalat...b'What Chinese pollution really looks like'b'Hacker Kidnaps and Tortures Informant, Posts...b'Bush Tells Putin: This Aggression Will Not S...b'Georgia is all about the oil pipelines'b'Rivals say they plan to remove Georgian pres...17400.750000
62008-08-18b'In an Afghan prison, the majority of female ...b"Little girl, you're not ugly; they are"b"Pakistan's Musharraf to Resign, Leave the Co...b'Tornado throws a bus in Poland, captured by ...b"Britain's terror laws have left me and my fa...b"Iran 'fires satellite into space'"b'Rights of Non-Muslims restricted by new Mald...b'Tour of Tskhinvali undercuts Russian version...b'The Great Resource War is already underway, ......b' New porn channel lets Canadians strut their...b'The Dangerous Neighbor: Vladimir Putin Takes...b'Israel opinion page: Russians are saner.'b"NATO's Hour"b'Georgian President Saakashvili Eats His Tie ...b'No Chicken Left Behind: Animal RFID Surveill...b'Putin has given us an order that everyone mu...b'National DNA database grows on the genes of ...b'Mayor Asks Ugly Women To Visit His Town'18011.070312
72008-08-19b"Man arrested and locked up for five hours af...b'The US missile defence system is the magic p...b'Schrder lambasted for blaming Russian confli...b'Officials: 10 French soldiers killed near Ka...b'These ten laws make China a totalitarian was...b'Russia seizes US vehicles'b"Muslims are only 4% of Denmark's 5.4 million...b'Taliban Forces Kill 10 French Soldiers and R...b'Assaults, kidnappings and killings of humani......b'16,000 fine for British woman caught sharing...b'102-year-old grandma is oldest person on Fac...b'Today 5 years ago - August 19th 2003. Bombin...b'US national Ken Haywood, whose computer was ...b' Taliban kill 10 French troops near Afghan c...b'Not Everybody Loves Offshore Wind Power in S...b'Taliban Forces Kill 10 French Soldiers and R...b'Pakistan is more democratic than America. 'b'Blaze engulfs Egyptian parliament'17780.830078
82008-08-20b'Two elderly Chinese women have been sentence...b'The Power of Islam: The Human Rights Council...b"We had 55 times more military soldiers in th...b'"I live here on less than a dollar a month" ...b'Russia sends aircraft carrier to Syria.'b'The American people should be eternally grat...b'Abkhazia officially appeals to Russia for in...b'Russia warns of response "beyond diplomacy" ...b'India Sets Aside 40% of Regional Wasteland f......b'Russia has informed Norway that it plans to ...b"'What Are the Aims of this War?': French Opp...b'Bush Covered up Musharraf Ties with Al Qaeda'b'Mikhail Gorbachev: Russia Never Wanted a War'b'Germans urge tougher laws after new privacy ...b'The Time of the Wimps: Dialogue with Russia ...b'1998 Missile Strikes on Bin Laden May Have B...b"For a moment let's forget everything else an...b'The First Solar Radio Station in Argentina'17829.730469
92008-08-21b"British resident held in Guantanamo Bay wins...b'Chinese may have killed 140 Tibetans this we...b'U.S. Navy Ships Head to Georgia'b'Hacker uncovers Chinese olympic fraud'b"If you've ever wondered what Kim Jong Il was...b"Russia's Nuclear Threat Is More Than Words"b'Czech President: "I must protest aloud again...b'50% Of All Food Produced Is Wasted Before It...b"China sentences Alive in Baghdad blogger, GR......b'NATOs decision to freeze relations with Mosc...b'Sweet Sixteen or Fraudulent Fourteen, Hacke...b'If Russias feeling churlish, they can pretty...b'Chinese Gymnasts 14, Official Document Shows'b'Suicide attack kills at least 50 at Pakistan...b'The Abkhazian Parliament has approved an off...b'Georgia, Bulgaria and the Second Balkan War ...b"Terrorist reveals Pak's sinister designs on ...b"International Olympic Committee launches pro...17804.869141
+

10 rows Ă— 27 columns

+
+
+ +
+ +
+
+ +
+
+
+
+

Step 3 - Data cleaning

NA treatment : +We'll check for missing values first. The numerical features (Date, Close) turn out to be complete; the few missing headlines in the text columns will be filled in with a whitespace.

+ +
+
+
+
+
+
In [9]:
+
+
+
#check for NAN
+df2.isnull().sum()
+
+ +
+
+
+ +
+
+ + +
+ +
Out[9]:
+ + + + +
+
Date     0
+Top1     0
+Top2     0
+Top3     0
+Top4     0
+Top5     0
+Top6     0
+Top7     0
+Top8     0
+Top9     0
+Top10    0
+Top11    0
+Top12    0
+Top13    0
+Top14    0
+Top15    0
+Top16    0
+Top17    0
+Top18    0
+Top19    0
+Top20    0
+Top21    0
+Top22    0
+Top23    1
+Top24    3
+Top25    3
+Close    0
+dtype: int64
+
+ +
+ +
+
+ +
+
+
+
+

There are a few headlines missing. Let's fill them in with a whitespace.

+ +
+
+
+
+
+
In [10]:
+
+
+
#replacing nan values with a whitespace
+df2 = df2.replace(np.nan, ' ', regex=True)
+
+#sanity check
+df2.isnull().sum().sum()
+
+ +
+
+
+ +
+
+ + +
+ +
Out[10]:
+ + + + +
+
0
+
+ +
+ +
+
+ +
+
+
+
+

Remove the encoding artifacts : +The headlines carry byte-string prefixes (b' and b") and escaped quotes and backslashes that would only bias the sentiment analysis, so we remove them by replacing them with ''. This can be done with regex.
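As a minimal illustration of what this pattern strips, here it is applied to a single stored headline with plain re.sub (the next cell applies the same pattern to the whole DataFrame via replace):
+import re
+
+raw = "b'BREAKING: Musharraf to be impeached.'"   #one headline exactly as stored in the CSV
+pattern = 'b\"|b\'|\\\\|\\\"'                     #the same pattern used on the DataFrame below
+print(re.sub(pattern, '', raw))                   #-> BREAKING: Musharraf to be impeached.'
Note that a stray trailing quote can survive the cleaning, as is also visible in the cleaned table below.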

+ +
+
+
+
+
+
In [11]:
+
+
+
#Remove the byte-string prefixes and escape characters
+df2 = df2.replace('b\"|b\'|\\\\|\\\"', '', regex=True)
+df2.head(2)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[11]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DateTop1Top2Top3Top4Top5Top6Top7Top8Top9...Top17Top18Top19Top20Top21Top22Top23Top24Top25Close
02008-08-08Georgia 'downs two Russian warplanes' as count...BREAKING: Musharraf to be impeached.'Russia Today: Columns of troops roll into Sout...Russian tanks are moving towards the capital o...Afghan children raped with 'impunity,' U.N. of...150 Russian tanks have entered South Ossetia w...Breaking: Georgia invades South Ossetia, Russi...The 'enemy combatent' trials are nothing but a...Georgian troops retreat from S. Osettain capit......Al-Qaeda Faces Islamist Backlash'Condoleezza Rice: The US would not act to prev...This is a busy day: The European Union has ap...Georgia will withdraw 1,000 soldiers from Iraq...Why the Pentagon Thinks Attacking Iran is a Ba...Caucasus in crisis: Georgia invades South Osse...Indian shoe manufactory - And again in a seri...Visitors Suffering from Mental Illnesses Banne...No Help for Mexico's Kidnapping Surge17949.369141
12008-08-11Why wont America and Nato help us? If they won...Bush puts foot down on Georgian conflict'Jewish Georgian minister: Thanks to Israeli tr...Georgian army flees in disarray as Russians ad...Olympic opening ceremony fireworks 'faked'What were the Mossad with fraudulent New Zeala...Russia angered by Israeli military sale to Geo...An American citizen living in S.Ossetia blames...Welcome To World War IV! Now In High Definition!'...Do not believe TV, neither Russian nor Georgia...Riots are still going on in Montreal (Canada) ...China to overtake US as largest manufacturer'War in South Ossetia [PICS]'Israeli Physicians Group Condemns State Torture'Russia has just beaten the United States over...Perhaps *the* question about the Georgia - Rus...Russia is so much better at war'So this is what it's come to: trading sex for ...17929.990234
+

2 rows Ă— 27 columns

+
+
+ +
+ +
+
+ +
+
+
+
+

Sentiment and subjectivity score extraction : +Now we run the sentiment analysis, extracting the compound score, which ranges from -1 (most negative) to +1 (most positive). I'm going to use the "dirty" texts in this part because VADER can make use of information such as ALL CAPS, punctuation, etc. We'll also calculate the subjectivity of each headline using the TextBlob package.
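A tiny, made-up illustration of why the raw casing and punctuation are worth keeping: VADER boosts the intensity of a word written in ALL CAPS (when the rest of the text is not) and of text ending in exclamation marks.
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+
+demo = SentimentIntensityAnalyzer()
+print(demo.polarity_scores("the deal is good")['compound'])     #baseline compound score
+print(demo.polarity_scores("the deal is GOOD!!!")['compound'])  #capitalisation and punctuation push the score higher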

+

Initialise the VADER analyzer.

+ +
+
+
+
+
+
In [12]:
+
+
+
#Sentiment and subjectivity score extraction
+Anakin = SentimentIntensityAnalyzer()
+
+Anakin.polarity_scores(" ")
+
+ +
+
+
+ +
+
+ + +
+ +
Out[12]:
+ + + + +
+
{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
+
+ +
+ +
+
+ +
+
+
+
+

Write a function that returns the subjectivity score from TextBlob's sentiment output. The subjectivity score can pick up direct quotes in the headlines, and positive news is rarely what gets quoted in a headline.

+ +
+
+
+
+
+
In [13]:
+
+
+
def detect_subjectivity(text):
+    return TextBlob(text).sentiment.subjectivity
+
+detect_subjectivity(" ") #should return 0
+
+ +
+
+
+ +
+
+ + +
+ +
Out[13]:
+ + + + +
+
0.0
+
+ +
+ +
+
+ +
+
+
+
In [14]:
+
+
+
start_vect=time.time()
+print("ANAKIN: 'Intializing the process..'")
+
+#get the name of the headline columns
+cols = []
+for i in range(1,26):
+    col = ("Top{}".format(i))
+    cols.append(col)
+
+
+for col in cols:
+    df2[col] = df2[col].astype(str) # Make sure data is treated as a string
+    df2[col+'_comp']= df2[col].apply(lambda x:Anakin.polarity_scores(x)['compound'])
+    df2[col+'_sub'] = df2[col].apply(detect_subjectivity)
+    print("{} Done".format(col))
+    
+print("VADER: Vaderization completed after %0.2f Minutes"%((time.time() - start_vect)/60))
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
ANAKIN: 'Initializing the process..'
+Top1 Done
+Top2 Done
+Top3 Done
+Top4 Done
+Top5 Done
+Top6 Done
+Top7 Done
+Top8 Done
+Top9 Done
+Top10 Done
+Top11 Done
+Top12 Done
+Top13 Done
+Top14 Done
+Top15 Done
+Top16 Done
+Top17 Done
+Top18 Done
+Top19 Done
+Top20 Done
+Top21 Done
+Top22 Done
+Top23 Done
+Top24 Done
+Top25 Done
+VADER: Vaderization completed after 0.54 Minutes
+
+
+
+ +
+
+ +
+
+
+
In [15]:
+
+
+
#the text isn't required anymore
+df2 = df2.drop(cols,axis=1)
+df2.head(5)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[15]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DateCloseTop1_compTop1_subTop2_compTop2_subTop3_compTop3_subTop4_compTop4_sub...Top21_compTop21_subTop22_compTop22_subTop23_compTop23_subTop24_compTop24_subTop25_compTop25_sub
02008-08-0817949.369141-0.59940.00.00000.000000-0.36120.000000-0.70890.200000...-0.75790.666667-0.62490.0-0.27550.00-0.85190.2000000.12800.0
12008-08-1117929.9902340.81560.0-0.31820.2888890.44040.100000-0.19650.000000...-0.80200.0000000.00000.0-0.31820.00-0.18320.5000000.00000.0
22008-08-1217694.6796880.02581.00.00000.000000-0.78450.833333-0.61241.000000...-0.59940.0000000.52670.00.38180.350.00000.4545450.00000.0
32008-08-1317409.720703-0.71840.0-0.80740.000000-0.63690.000000-0.12800.444444...-0.29600.0000000.49390.0-0.57190.00-0.42150.100000-0.34000.0
42008-08-1417140.2402340.20230.0-0.59940.0000000.68080.400000-0.86890.666667...-0.44040.000000-0.59940.00.17790.00-0.69080.5000000.70960.0
+

5 rows Ă— 52 columns

+
+
+ +
+ +
+
+ +
+
+
+
+

Summarise the compound and subjectivity scores, weighting each headline by its rank (Top1 gets the most weight).
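As a toy sketch of this weighting (three made-up scores instead of 25), the first headline contributes the most to the weighted mean:
+import numpy as np
+
+scores = [0.5, -0.2, 0.1]                    #compound scores for Top1, Top2, Top3
+weights = [3, 2, 1]                          #Top1 gets the largest weight
+print(np.average(scores, weights=weights))   #(0.5*3 - 0.2*2 + 0.1*1) / 6 = 0.2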

+ +
+
+
+
+
+
In [16]:
+
+
+
comp_cols = []
+for col in cols:
+    comp_col = col + "_comp"
+    comp_cols.append(comp_col)
+
+w = np.arange(1,26,1).tolist()
+w.reverse()
+
+weighted_comp = []
+max_comp = []
+min_comp = []
+for i in range(0,len(df)):
+    a = df2.loc[i,comp_cols].tolist()
+    weighted_comp.append(np.average(a, weights=w))
+    max_comp.append(max(a))
+    min_comp.append(min(a))
+
+df2['compound_mean'] = weighted_comp
+df2['compound_max'] = max_comp
+df2['compound_min'] = min_comp
+
+
+sub_cols = []
+for col in cols:
+    sub_col = col + "_sub"
+    sub_cols.append(sub_col)
+    
+weighted_sub = []
+max_sub = []
+min_sub = []
+for i in range(0,len(df2)):
+    a = df2.loc[i,sub_cols].tolist()
+    weighted_sub.append(np.average(a, weights=w))
+    max_sub.append(max(a))
+    min_sub.append(min(a))
+
+df2['subjectivity_mean'] = weighted_sub
+df2['subjectivity_max'] = max_sub
+df2['subjectivity_min'] = min_sub
+
+to_drop = sub_cols+comp_cols
+df2 = df2.drop(to_drop, axis=1)
+
+df2.head(5)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[16]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DateClosecompound_meancompound_maxcompound_minsubjectivity_meansubjectivity_maxsubjectivity_min
02008-08-0817949.369141-0.3503370.2144-0.92600.1636850.6666670.0
12008-08-1117929.990234-0.0852770.8156-0.82710.2029210.7200000.0
22008-08-1217694.679688-0.3183940.5423-0.85910.3740761.0000000.0
32008-08-1317409.720703-0.1620320.5106-0.80740.1763710.9000000.0
42008-08-1417140.240234-0.1948790.7177-0.86890.3196151.0000000.0
+
+
+ +
+ +
+
+ +
+
+
+
+

Step 4 - Exploratory Data Analysis

First, the timeseries of the y (to be predicted) variable will be explored. The series is most likely not stationary, which, however, doesn't worry us in this case, as the models won't be of the classical timeseries methods family.

+ +
+
+
+
+
+
In [17]:
+
+
+
#EDA-time series plot
+fig1 = go.Figure()
+fig1.add_trace(go.Scatter(x=df2.Date, y=df.Close,
+                    mode='lines'))
+title = []
+title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
+                              xanchor='left', yanchor='bottom',
+                              text='Development of stock values from Aug, 2008 to Jun, 2016',
+                              font=dict(family='Arial',
+                                        size=30,
+                                        color='rgb(37,37,37)'),
+                              showarrow=False))
+fig1.update_layout(xaxis_title='Date',
+                   yaxis_title='Closing stock value (in $)',
+                  annotations=title)
+fig1.show()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + +
+ + +
+ +
+ +
+ +
+ + + +
+
+
+ +
+ +
+
+ +
+
+
+
+

It is quite obvious that the timeseries isn't stationary at all; there seems to be a clear downward trend over time. +So let's quantify how non-stationary the series actually is using the Dickey-Fuller test.

+

The Dickey-Fuller test tests the null hypothesis that a unit root is present in the time series data. +In other words, it checks whether the data are stationary: if the test rejects the unit-root null hypothesis, the series can be treated as stationary; if it fails to reject it, the series is non-stationary.
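A minimal sketch of running the test directly (the plotting helper below wraps the same statsmodels call); with the usual 5% threshold, a p-value above 0.05 means the unit-root null cannot be rejected and the series is treated as non-stationary:
+import statsmodels.api as sm
+
+p_value = sm.tsa.stattools.adfuller(df2.Close)[1]   #the second element of the adfuller output is the p-value
+print('Dickey-Fuller p-value: {:.5f}'.format(p_value))
+#p-value >= 0.05 -> we fail to reject the null hypothesis of a unit root (non-stationary)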

+ +
+
+
+
+
+
In [18]:
+
+
+
#function for quick plotting and testing of stationarity
+def stationary_plot(y, lags=None, figsize=(12, 7), style='bmh'):
+    """
+        Plot time series, its ACF and PACF, calculate Dickey–Fuller test
+        
+        y - timeseries
+        lags - how many lags to include in ACF, PACF calculation
+    """
+    if not isinstance(y, pd.Series):
+        y = pd.Series(y)
+        
+    with plt.style.context(style):    
+        fig = plt.figure(figsize=figsize)
+        layout = (2, 2)
+        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
+        acf_ax = plt.subplot2grid(layout, (1, 0))
+        pacf_ax = plt.subplot2grid(layout, (1, 1))
+        
+        y.plot(ax=ts_ax)
+        p_value = sm.tsa.stattools.adfuller(y)[1]
+        ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
+        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
+        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
+        plt.tight_layout()
+
+ +
+
+
+ +
+
+
+
In [19]:
+
+
+
stationary_plot(df2.Close)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + + +
+ +
+ +
+ +
+
+ +
+
+
+
+

Next we look at the compound sentiment scores.

+ +
+
+
+
+
+
In [20]:
+
+
+
fig2 = go.Figure()
+fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_mean,
+                    mode='lines',
+                    name='Mean'))
+fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_max,
+                    mode='lines',
+                    name='Maximum'))
+fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_min,
+                    mode='lines',
+                    name='Minimum'))
+title = []
+title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
+                              xanchor='left', yanchor='bottom',
+                              text='Development of sentiment compound score',
+                               font=dict(family='Arial',
+                                       size=30,
+                                        color='rgb(37,37,37)'),
+                              showarrow=False))
+fig2.update_layout(xaxis_title='Date',
+                   yaxis_title='Compound score',
+                  annotations=title)
+fig2.show()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + +
+
+
+ +
+ +
+
+ +
+
+
+
+

Let's also plot the distribution of the compound score.

+ +
+
+
+
+
+
In [21]:
+
+
+
compm_hist = px.histogram(df2, x="compound_mean")
+compm_hist.show()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + +
+
+
+ +
+ +
+
+ +
+
+
+
+

Now, we take a look at the subjectivity scores.

+ +
+
+
+
+
+
In [22]:
+
+
+
fig3 = go.Figure()
+fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_mean,
+                    mode='lines',
+                    name='Mean'))
+fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_min,
+                    mode='lines',
+                    name='Min'))
+fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_max,
+                    mode='lines',
+                    name='Max'))
+title = []
+title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
+                              xanchor='left', yanchor='bottom',
+                              text='Development of subjectivity score',
+                              font=dict(family='Arial',
+                                        size=30,
+                                        color='rgb(37,37,37)'),
+                              showarrow=False))
+fig3.update_layout(xaxis_title='Date',
+                   yaxis_title='Subjectivity score',
+                  annotations=title)
+fig3.show()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + +
+
+
+ +
+ +
+
+ +
+
+
+
+

Now we plot distribution of the subjectivity scores as well.

+ +
+
+
+
+
+
In [23]:
+
+
+
subm_hist = px.histogram(df2, x="subjectivity_mean")
+subm_hist.show()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + +
+
+
+ +
+ +
+
+ +
+
+
+
+

Next we'll look at some descriptive statistics about the data.

+ +
+
+
+
+
+
In [24]:
+
+
+
df2.describe()
+
+ +
+
+
+ +
+
+ + +
+ +
Out[24]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Closecompound_meancompound_maxcompound_minsubjectivity_meansubjectivity_maxsubjectivity_min
count1989.0000001989.0000001989.0000001989.0000001989.0000001989.0000001989.0
mean13463.032255-0.2133620.656276-0.8856790.2517790.8926450.0
std3144.0069960.1067890.1656630.0628310.0661960.1447960.0
min6547.049805-0.5463830.000000-0.9898000.0808970.4166670.0
25%10913.379883-0.2834810.542300-0.9316000.2055810.7857140.0
50%13025.580078-0.2106050.670500-0.8979000.2499271.0000000.0
75%16478.410156-0.1414340.784500-0.8519000.2970091.0000000.0
max18312.3906250.1663700.962300-0.5719000.4704981.0000000.0
+
+
+ +
+ +
+
+ +
+
+
+
+

Feature selection

We are not going to use many feature-selection methods, since the features were mostly handcrafted. We'll simply look at their variance and the proportion of unique values.

+ +
+
+
+
+
+
In [25]:
+
+
+
def unique_ratio (col):
+    return len(np.unique(col))/len(col)
+
+cols = ['Close', 'compound_mean', 'compound_max', 'compound_min', 'subjectivity_mean', 'subjectivity_max', 'subjectivity_min']
+
+ur = []
+var = []
+for col in cols:
+    ur.append(unique_ratio(df2[col]))
+    var.append(np.var(df2[col]))
+    
+feature_sel = pd.DataFrame({'Column': cols, 
+              'Unique': ur,
+              'Variance': var})
+feature_sel
+
+ +
+
+
+ +
+
+ + +
+ +
Out[25]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ColumnUniqueVariance
0Close0.9944709.879810e+06
1compound_mean1.0000001.139823e-02
2compound_max0.1880342.743036e-02
3compound_min0.1764713.945746e-03
4subjectivity_mean0.9994974.379750e-03
5subjectivity_max0.0839622.095546e-02
6subjectivity_min0.0005030.000000e+00
+
+
+ +
+ +
+
+ +
+
+
+
In [26]:
+
+
+
sel_fig = go.Figure(data=go.Scatter(
+    x=feature_sel.Column,
+    y=feature_sel.Unique,
+    mode='markers',
+    marker=dict(size=(feature_sel.Unique*100)),
+))
+sel_fig.update_layout(title='Ratio of unique values', 
+                      yaxis_title='Unique ratio')
+sel_fig.show()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + +
+
+
+ +
+ +
+
+ +
+
+
+
+

Compound maximum and minimum are potentially less interesting, as only ~18% of their values are unique. The subjectivity maximum also has a low unique ratio, and the subjectivity minimum is almost entirely zero (its variance is zero). We'll drop the subjectivity min and max.

+ +
+
+
+
+
+
In [27]:
+
+
+
drop = ['subjectivity_min', 'subjectivity_max']
+clean_df = df2.drop(drop,axis=1)
+
+ +
+
+
+ +
+
+
+
+

Step 5 - Lag the extracted features

To allow the models to look into the past, we'll add features which are simply copies of existing columns shifted n steps back. In order not to create too many new features, we'll only add lags from the few trading days before the current datapoint: two lags for the headline features and four for the closing value.
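A toy sketch (with made-up closing values) of how such lagged copies are built with pandas shift:
+import pandas as pd
+
+toy = pd.DataFrame({'Close': [100.0, 101.0, 103.0]})
+toy['Close_lag_1'] = toy['Close'].shift(1)   #previous day's close; the first row becomes NaN
+print(toy)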

+ +
+
+
+
+
+
In [28]:
+
+
+
lag_df = clean_df.copy()
+lag_df.head(3)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[28]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DateClosecompound_meancompound_maxcompound_minsubjectivity_mean
02008-08-0817949.369141-0.3503370.2144-0.92600.163685
12008-08-1117929.990234-0.0852770.8156-0.82710.202921
22008-08-1217694.679688-0.3183940.5423-0.85910.374076
+
+
+ +
+ +
+
+ +
+
+
+
In [29]:
+
+
+
to_lag = list(lag_df.columns)
+to_lag_4 = to_lag[1]
+to_lag_1 = to_lag[2:len(to_lag)]
+
+ +
+
+
+ +
+
+
+
In [30]:
+
+
+
#lagging text features two days back
+for col in to_lag_1:
+    for i in range(1,3):
+        new_name = col + ('_lag_{}'.format(i))
+        lag_df[new_name] = lag_df[col].shift(i)
+    
+#lagging closing values 4 days back
+for i in range(1, 5):
+    new_name = to_lag_4 + ('_lag_{}'.format(i))
+    lag_df[new_name] = lag_df[to_lag_4].shift(i)
+
+ +
+
+
+ +
+
+
+
+

This process created rows with NAs at the start of the dataset. These rows have to be removed, since we simply don't have data from before the first observation.

+ +
+
+
+
+
+
In [31]:
+
+
+
#Show how many rows need to be removed
+lag_df.head(10) 
+
+ +
+
+
+ +
+
+ + +
+ +
Out[31]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DateClosecompound_meancompound_maxcompound_minsubjectivity_meancompound_mean_lag_1compound_mean_lag_2compound_max_lag_1compound_max_lag_2compound_min_lag_1compound_min_lag_2subjectivity_mean_lag_1subjectivity_mean_lag_2Close_lag_1Close_lag_2Close_lag_3Close_lag_4
02008-08-0817949.369141-0.3503370.2144-0.92600.163685NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
12008-08-1117929.990234-0.0852770.8156-0.82710.202921-0.350337NaN0.2144NaN-0.9260NaN0.163685NaN17949.369141NaNNaNNaN
22008-08-1217694.679688-0.3183940.5423-0.85910.374076-0.085277-0.3503370.81560.2144-0.8271-0.92600.2029210.16368517929.99023417949.369141NaNNaN
32008-08-1317409.720703-0.1620320.5106-0.80740.176371-0.318394-0.0852770.54230.8156-0.8591-0.82710.3740760.20292117694.67968817929.99023417949.369141NaN
42008-08-1417140.240234-0.1948790.7177-0.86890.319615-0.162032-0.3183940.51060.5423-0.8074-0.85910.1763710.37407617409.72070317694.67968817929.99023417949.369141
52008-08-1517400.750000-0.1431040.4404-0.74810.227282-0.194879-0.1620320.71770.5106-0.8689-0.80740.3196150.17637117140.24023417409.72070317694.67968817929.990234
62008-08-1818011.070312-0.2635460.5106-0.92600.216935-0.143104-0.1948790.44040.7177-0.7481-0.86890.2272820.31961517400.75000017140.24023417409.72070317694.679688
72008-08-1917780.830078-0.3731720.5574-0.87200.256786-0.263546-0.1431040.51060.4404-0.9260-0.74810.2169350.22728218011.07031217400.75000017140.24023417409.720703
82008-08-2017829.730469-0.1971570.4847-0.88070.095403-0.373172-0.2635460.55740.5106-0.8720-0.92600.2567860.21693517780.83007818011.07031217400.75000017140.240234
92008-08-2117804.869141-0.2685220.5719-0.90220.107994-0.197157-0.3731720.48470.5574-0.8807-0.87200.0954030.25678617829.73046917780.83007818011.07031217400.750000
+
+
+ +
+ +
+
+ +
+
+
+
+

Above we can see that the first 4 rows now contain missing values. Let's delete them and reset the index.

+ +
+
+
+
+
+
In [32]:
+
+
+
lag_df = lag_df.drop(lag_df.index[:4])
+lag_df = lag_df.reset_index(drop=True)
+
+#sanity check for NaNs
+lag_df.isnull().sum().sum()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
+
+
+
+
+ +
+ +
Out[32]:
+ + + + +
+
0
+
+ +
+ +
+
+ +
+
+
+
In [33]:
+
+
+
lag_df.head(5)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[33]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DateClosecompound_meancompound_maxcompound_minsubjectivity_meancompound_mean_lag_1compound_mean_lag_2compound_max_lag_1compound_max_lag_2compound_min_lag_1compound_min_lag_2subjectivity_mean_lag_1subjectivity_mean_lag_2Close_lag_1Close_lag_2Close_lag_3Close_lag_4
02008-08-1417140.240234-0.1948790.7177-0.86890.319615-0.162032-0.3183940.51060.5423-0.8074-0.85910.1763710.37407617409.72070317694.67968817929.99023417949.369141
12008-08-1517400.750000-0.1431040.4404-0.74810.227282-0.194879-0.1620320.71770.5106-0.8689-0.80740.3196150.17637117140.24023417409.72070317694.67968817929.990234
22008-08-1818011.070312-0.2635460.5106-0.92600.216935-0.143104-0.1948790.44040.7177-0.7481-0.86890.2272820.31961517400.75000017140.24023417409.72070317694.679688
32008-08-1917780.830078-0.3731720.5574-0.87200.256786-0.263546-0.1431040.51060.4404-0.9260-0.74810.2169350.22728218011.07031217400.75000017140.24023417409.720703
42008-08-2017829.730469-0.1971570.4847-0.88070.095403-0.373172-0.2635460.55740.5106-0.8720-0.92600.2567860.21693517780.83007818011.07031217400.75000017140.240234
+
+
+ +
+ +
+
+ +
+
+
+
+

Step 8 - Model training

Let's train 3 ML models. We'll do this in 2 rounds. First, using the econometric features alone (4 lags of y). +Second, using the information extracted from the headlines (the compound and subjectivity scores and their lags)

+

Models:

+

1)Ridge regression - penalizes large coefficients (L2 regularization); it shrinks them towards zero but never exactly to zero

+

2)Random forest

+

3)XGBoost

+

We'll score all models by mean squared error, as it penalizes larger mistakes more heavily. Before each model is trained, we'll standardize the training data.

+

The first step will be creating folds for cross-validation. We'll use the same folds for all models in order to allow a meta-model to be built later. Since we're working with a time series, the folds cannot be selected randomly. Instead, each fold is a contiguous sequence of data, so the time information is preserved.
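A quick sketch of how TimeSeriesSplit builds such folds on a toy index (3 splits here instead of the 10 used below): every training window ends right before its validation window and grows from fold to fold.
+import numpy as np
+from sklearn.model_selection import TimeSeriesSplit
+
+toy = np.arange(12).reshape(-1, 1)
+for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(toy):
+    print('train:', train_idx, 'test:', test_idx)
+#each training set is a prefix of the series and the test windows move forward in time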

+ +
+
+
+
+
+
In [34]:
+
+
+
# for time-series cross-validation set 10 folds 
+tscv = TimeSeriesSplit(n_splits=10)
+
+ +
+
+
+ +
+
+
+
+

The cost function to minimize is mean squared error, because it penalizes errors quadratically, so larger mistakes cost disproportionately more. The mean absolute percentage error will be used for plotting and easier interpretation, since errors expressed as percentages are easier to understand. Each training set is scaled (standardized) independently to minimize data leakage.

+ +
+
+
+
+
+
In [35]:
+
+
+
def mape(y_true, y_pred): 
+    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
+scorer = make_scorer(mean_squared_error)
+scaler = StandardScaler()   
+
+ +
+
+
+ +
+
+
+
+

Next we split the dataset into training and testing. 20% of the data will be used for testing.

+ +
+
+
+
+
+
In [36]:
+
+
+
def ts_train_test_split(X, y, test_size):
+    """
+        Perform train-test split with respect to time series structure
+    """
+    
+    # get the index after which test set starts
+    test_index = int(len(X)*(1-test_size))
+    
+    X_train = X.iloc[:test_index]
+    y_train = y.iloc[:test_index]
+    X_test = X.iloc[test_index:]
+    y_test = y.iloc[test_index:]
+    
+    return X_train, X_test, y_train, y_test
+
+ +
+
+
+ +
+
+
+
In [37]:
+
+
+
X = lag_df.drop(['Close'],axis=1)
+X.index = X["Date"]
+X = X.drop(['Date'],axis=1)
+y = lag_df.Close
+
+X_train, X_test, y_train, y_test = ts_train_test_split(X, y, test_size = 0.2)
+
+#sanity check
+(len(X_train)+len(X_test))==len(X)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[37]:
+ + + + +
+
True
+
+ +
+ +
+
+ +
+
+
+
In [38]:
+
+
+
#function for plotting coefficients of linear models (those exposing coef_, e.g. Ridge)
+def plotCoef(model,train_x):
+    """
+        Plots sorted coefficient values of the model
+    """
+    
+    coefs = pd.DataFrame(model.coef_, train_x.columns)
+    coefs.columns = ["coef"]
+    coefs["abs"] = coefs.coef.apply(np.abs)
+    coefs = coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
+    
+    plt.figure(figsize=(15, 7))
+    coefs.coef.plot(kind='bar')
+    plt.grid(True, axis='y')
+    plt.hlines(y=0, xmin=0, xmax=len(coefs), linestyles='dashed');
+
+ +
+
+
+ +
+
+
+
+

Step 8.1 - Econometric models

First let's train models using only the lags of the y variable (i.e. Close).

+ +
+
+
+
+
+
In [39]:
+
+
+
econ_cols = [col for col in X_train.columns if col.startswith('Close_lag')] #the four lags of the closing value
+X_train_e = X_train[econ_cols]
+X_test_e = X_test[econ_cols]
+y_train_e = y_train
+y_test_e = y_test
+
+ +
+
+
+ +
+
+
+
In [40]:
+
+
+
econ_perf = pd.DataFrame(columns=['Model','MSE', 'SD'])
+econ_perf
+
+ +
+
+
+ +
+
+ + +
+ +
Out[40]:
+ + + +
+
+ + + + + + + + + + + + +
ModelMSESD
+
+
+ +
+ +
+
+ +
+
+
+
+

Ridge regression

+
+
+
+
+
+
In [41]:
+
+
+
ridge_param = {'model__alpha': list(np.arange(0.001,1,0.001))}
+ridge = Ridge(max_iter=5000)
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', ridge)])
+search_ridge = GridSearchCV(estimator=pipe,
+                          param_grid = ridge_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4
+                         )
+search_ridge.fit(X_train_e, y_train_e)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[41]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model', Ridge(max_iter=5000))]),
+             n_jobs=4,
+             param_grid={'model__alpha': [0.001, 0.002, 0.003, 0.004, 0.005,
+                                          0.006, 0.007, 0.008,
+                                          0.009000000000000001,
+                                          0.010000000000000002, 0.011, 0.012,
+                                          0.013000000000000001,
+                                          0.014000000000000002, 0.015, 0.016,
+                                          0.017, 0.018000000000000002,
+                                          0.019000000000000003, 0.02, 0.021,
+                                          0.022000000000000002, 0.023, 0.024,
+                                          0.025, 0.026000000000000002,
+                                          0.027000000000000003, 0.028, 0.029,
+                                          0.030000000000000002, ...]},
+             scoring=make_scorer(mean_squared_error))
+
+ +
+ +
+
+ +
+
+
+
In [42]:
+
+
+
ridge_e = search_ridge.best_estimator_
+
+#get cv results of the best model + confidence intervals
+from sklearn.model_selection import cross_val_score
+cv_score = cross_val_score(ridge_e, X_train_e, y_train_e, cv=tscv, scoring=scorer)
+econ_perf = econ_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+ridge_e
+
+ +
+
+
+ +
+
+ + +
+ +
Out[42]:
+ + + + +
+
Pipeline(steps=[('scale', StandardScaler()),
+                ('model', Ridge(alpha=0.999, max_iter=5000))])
+
+ +
+ +
+
+ +
+
+
+
In [43]:
+
+
+
plotCoef(ridge_e['model'], X_train_e)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + + +
+ +
+ +
+ +
+
+ +
+
+
+
In [44]:
+
+
+
coefs = ridge_e['model'].coef_
+ridge_coefs = pd.DataFrame({'Coef': coefs,
+                           'Name': list(X_train_e.columns)})
+ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs)
+ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
+ridge_coefs
+
+ +
+
+
+ +
+
+ + +
+ +
Out[44]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CoefName
02041.601673Close_lag_1
1414.480793Close_lag_2
358.219956Close_lag_4
230.646149Close_lag_3
+
+
+ +
+ +
+
+ +
+
+
+
In [45]:
+
+
+
econ_perf
+
+ +
+
+
+ +
+
+ + +
+ +
Out[45]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + +
ModelMSESD
0Ridge16148.543568475.253033
+
+
+ +
+ +
+
+ +
+
+
+
+

Random forest

+
+
+
+
+
+
In [46]:
+
+
+
rf_param = {'model__n_estimators': [10, 100, 300],
+            'model__max_depth': [10, 20, 30, 40],
+            'model__min_samples_split': [2, 5, 10],
+            'model__min_samples_leaf': [1, 2, 3],
+            'model__max_features': ["auto", 'sqrt']}
+
+ +
+
+
+ +
+
+
+
In [47]:
+
+
+
rf = RandomForestRegressor()
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', rf)])
+gridsearch_rf = GridSearchCV(estimator=pipe,
+                          param_grid = rf_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4,
+                          verbose=3
+                         )
+
+ +
+
+
+ +
+
+
+
In [48]:
+
+
+
gridsearch_rf.fit(X_train_e, y_train_e)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
Fitting 10 folds for each of 216 candidates, totalling 2160 fits
+
+
+
+ +
+ +
+ + +
+
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
+[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    3.7s
+[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:   24.7s
+[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   51.2s
+[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.4min
+[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  2.4min
+[Parallel(n_jobs=4)]: Done 1144 tasks      | elapsed:  3.4min
+[Parallel(n_jobs=4)]: Done 1560 tasks      | elapsed:  4.6min
+[Parallel(n_jobs=4)]: Done 2040 tasks      | elapsed:  5.9min
+[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed:  6.2min finished
+
+
+
+ +
+ +
Out[48]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model', RandomForestRegressor())]),
+             n_jobs=4,
+             param_grid={'model__max_depth': [10, 20, 30, 40],
+                         'model__max_features': ['auto', 'sqrt'],
+                         'model__min_samples_leaf': [1, 2, 3],
+                         'model__min_samples_split': [2, 5, 10],
+                         'model__n_estimators': [10, 100, 300]},
+             scoring=make_scorer(mean_squared_error), verbose=3)
+
+ +
+ +
+
+ +
+
+
+
In [49]:
+
+
+
rf_e = gridsearch_rf.best_estimator_
+
+#get cv results of the best model + confidence intervals
+cv_score = cross_val_score(rf_e, X_train_e, y_train_e, cv=tscv, scoring=scorer)
+econ_perf = econ_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+
+ +
+
+
+ +
+
+
+
In [50]:
+
+
+
econ_perf
+
+ +
+
+
+ +
+
+ + +
+ +
Out[50]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + +
ModelMSESD
0Ridge16148.5435608475.253033
1RF175910.783385162425.696179
+
+
+ +
+ +
+
+ +
+
+
+
+

XGBoost

Using the linear booster (gblinear) because tree-based boosters cannot extrapolate beyond the range of target values seen during training, which is a problem for a trending time series

+ +
+
+
+
+
+
In [52]:
+
+
+
xgb_param = {'model__lambda': list(np.arange(0.1,3, 0.1)), #L2 regularisation
+             'model__alpha': list(np.arange(0.1,3, 0.1)),  #L1 regularisation
+            }
+
+ +
+
+
+ +
+
+
+
In [53]:
+
+
+
xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror')
+
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', xgb)])
+gridsearch_xgb = GridSearchCV(estimator=pipe,
+                          param_grid = xgb_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4,
+                          verbose=3
+                         )
+
+ +
+
+
+ +
+
+
+
In [54]:
+
+
+
gridsearch_xgb.fit(X_train_e, y_train_e)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
Fitting 10 folds for each of 841 candidates, totalling 8410 fits
+
+
+
+ +
+ +
+ + +
+
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
+[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.7s
+[Parallel(n_jobs=4)]: Done 380 tasks      | elapsed:    5.3s
+[Parallel(n_jobs=4)]: Done 1020 tasks      | elapsed:   11.3s
+[Parallel(n_jobs=4)]: Done 1916 tasks      | elapsed:   19.8s
+[Parallel(n_jobs=4)]: Done 3068 tasks      | elapsed:   30.7s
+[Parallel(n_jobs=4)]: Done 4476 tasks      | elapsed:   44.0s
+[Parallel(n_jobs=4)]: Done 6140 tasks      | elapsed:   59.7s
+[Parallel(n_jobs=4)]: Done 8060 tasks      | elapsed:  1.3min
+[Parallel(n_jobs=4)]: Done 8410 out of 8410 | elapsed:  1.4min finished
+
+
+
+ +
+ +
Out[54]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model',
+                                        XGBRegressor(base_score=None,
+                                                     booster='gblinear',
+                                                     colsample_bylevel=None,
+                                                     colsample_bynode=None,
+                                                     colsample_bytree=None,
+                                                     feature_selector='shuffle',
+                                                     gamma=None, gpu_id=None,
+                                                     importance_type='gain',
+                                                     interaction_constraints=None,
+                                                     learni...
+                         'model__lambda': [0.1, 0.2, 0.30000000000000004, 0.4,
+                                           0.5, 0.6, 0.7000000000000001, 0.8,
+                                           0.9, 1.0, 1.1, 1.2000000000000002,
+                                           1.3000000000000003,
+                                           1.4000000000000001,
+                                           1.5000000000000002, 1.6,
+                                           1.7000000000000002,
+                                           1.8000000000000003,
+                                           1.9000000000000001, 2.0, 2.1, 2.2,
+                                           2.3000000000000003,
+                                           2.4000000000000004,
+                                           2.5000000000000004, 2.6, 2.7,
+                                           2.8000000000000003,
+                                           2.9000000000000004]},
+             scoring=make_scorer(mean_squared_error), verbose=3)
+
+ +
+ +
+
+ +
+
+
+
In [55]:
+
+
+
xgb_e = gridsearch_xgb.best_estimator_
+
+#get cv results of the best model + confidence intervals
+cv_score = cross_val_score(xgb_e, X_train_e, y_train_e, cv=tscv, scoring=scorer)
+econ_perf = econ_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+xgb_e
+
+ +
+
+
+ +
+
+ + +
+ +
Out[55]:
+ + + + +
+
Pipeline(steps=[('scale', StandardScaler()),
+                ('model',
+                 XGBRegressor(alpha=2.9000000000000004, base_score=0.5,
+                              booster='gblinear', colsample_bylevel=None,
+                              colsample_bynode=None, colsample_bytree=None,
+                              feature_selector='shuffle', gamma=None, gpu_id=-1,
+                              importance_type='gain',
+                              interaction_constraints=None,
+                              lambda=2.9000000000000004, learning_rate=0.5,
+                              max_delta_step=None, max_depth=None,
+                              min_child_weight=None, missing=nan,
+                              monotone_constraints=None, n_estimators=100,
+                              n_jobs=0, num_parallel_tree=None, random_state=0,
+                              reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
+                              subsample=None, tree_method=None,
+                              validate_parameters=1, verbosity=None))])
+
+ +
+ +
+
+ +
+
+
+
In [56]:
+
+
+
print(econ_perf)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
   Model           MSE            SD
+0  Ridge  1.614854e+04  8.475253e+03
+1     RF  1.759108e+05  1.624257e+05
+2    XGB  1.249194e+06  1.982986e+06
+
+
+
+ +
+
+ +
+
+
+
In [57]:
+
+
+
econ_fig = px.scatter(econ_perf, x="Model", y='MSE', color='Model', error_y="SD")
+econ_fig.update_layout(title_text="Performance of models trained on lags of y")
+econ_fig.show()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + +
+
+
+ +
+ +
+
+ +
+
+
+
+

Step 8.2 - NLP models

Let's now try to predict the stock value using only the information extracted from the news headlines.

+ +
+
+
+
+
+
In [58]:
+
+
+
X_train_n = X_train.drop(econ_cols, axis=1)
+X_test_n = X_test.drop(econ_cols, axis=1)
+y_train_n = y_train
+y_test_n = y_test
+
+ +
+
+
+ +
+
+
+
In [59]:
+
+
+
nlp_perf = pd.DataFrame(columns=['Model','MSE', 'SD'])
+nlp_perf
+
+ +
+
+
+ +
+
+ + +
+ +
Out[59]:
+ + + +
+
+ + + + + + + + + + + + +
ModelMSESD
+
+
+ +
+ +
+
+ +
+
+
+
+

Ridge regression

+
+
+
+
+
+
In [60]:
+
+
+
ridge_param = {'model__alpha': list(np.arange(1,10,0.1))}
+ridge = Ridge(max_iter=5000)
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', ridge)
+])
+search_ridge = GridSearchCV(estimator=pipe,
+                          param_grid = ridge_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4
+                         )
+search_ridge.fit(X_train_n, y_train_n)
+
+ +
+
+
+ +
+
+ + +
+ +
Out[60]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model', Ridge(max_iter=5000))]),
+             n_jobs=4,
+             param_grid={'model__alpha': [1.0, 1.1, 1.2000000000000002,
+                                          1.3000000000000003,
+                                          1.4000000000000004,
+                                          1.5000000000000004,
+                                          1.6000000000000005,
+                                          1.7000000000000006,
+                                          1.8000000000000007,
+                                          1.9000000000000008, 2....
+                                          2.300000000000001, 2.4000000000000012,
+                                          2.5000000000000013,
+                                          2.6000000000000014,
+                                          2.7000000000000015,
+                                          2.8000000000000016,
+                                          2.9000000000000017,
+                                          3.0000000000000018, 3.100000000000002,
+                                          3.200000000000002, 3.300000000000002,
+                                          3.400000000000002, 3.500000000000002,
+                                          3.6000000000000023,
+                                          3.7000000000000024,
+                                          3.8000000000000025,
+                                          3.9000000000000026, ...]},
+             scoring=make_scorer(mean_squared_error))
+
+ +
+ +
+
+ +
+
+
+
In [61]:
+
+
+
ridge_n = search_ridge.best_estimator_
+
+#get cv results of the best model + confidence intervals
+cv_score = cross_val_score(ridge_n, X_train_n, y_train_n, cv=tscv, scoring=scorer)
+nlp_perf = nlp_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+ridge_n
+
+ +
+
+
+ +
+
+ + +
+ +
Out[61]:
+ + + + +
+
Pipeline(steps=[('scale', StandardScaler()), ('model', Ridge(max_iter=5000))])
+
+ +
+ +
+
+ +
+
+
+
In [62]:
+
+
+
plotCoef(ridge_n['model'], X_train_n)
+
+coefs = ridge_n['model'].coef_
+ridge_coefs = pd.DataFrame({'Coef': coefs,
+                           'Name': list(X_train_n.columns)})
+ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs)
+ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
+ridge_coefs
+
+ +
+
+
+ +
+
+ + +
+ +
Out[62]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CoefName
9176.487936compound_min_lag_2
8174.531051compound_min_lag_1
4-162.089893compound_mean_lag_1
2155.208150compound_min
0-148.888611compound_mean
5-142.849053compound_mean_lag_2
11124.592908subjectivity_mean_lag_2
3120.468165subjectivity_mean
10119.085025subjectivity_mean_lag_1
1-37.363883compound_max
6-33.016224compound_max_lag_1
7-27.976245compound_max_lag_2
+
+
+ +
+ +
+ +
+ + + + +
+ +
+ +
+ +
+
+ +
+
+
+
In [63]:
+
+
+
mape(y_test, ridge_n.predict(X_test_n))
+
+ +
+
+
+ +
+
+ + +
+ +
Out[63]:
+ + + + +
+
58.838053988800944
+
+ +
+ +
+
+ +
+
+
+
+

Predicted values for the years 2014-2016 using ridge regression.

+ +
+
+
+
+
+
In [64]:
+
+
+
print(ridge_n.predict(X_test_n))
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
[14716.64219864 14705.08819751 13980.10059014 13791.12076183
+ 14676.24671027 14686.22125486 14593.85122921 14000.79046681
+ 14502.44731183 14375.95034239 14688.04942339 14353.51442502
+ 14103.67449935 14116.34933655 14419.39078007 14596.81437977
+ 14614.03005235 14325.79387597 14481.54474328 14572.80654748
+ 14531.77731084 14357.0924025  13967.2853402  14151.41176422
+ 14673.88472455 14670.03226724 14688.91824998 14435.91990249
+ 14351.33827278 14128.78411089 14234.42232088 14861.38549773
+ 15279.58535055 15059.59589981 14443.75073891 13956.83386789
+ 13833.93647897 14014.45214329 14177.00415029 14720.5853178
+ 14915.55849948 15568.88895265 15174.39748957 14926.33275679
+ 14264.52157732 14140.31270066 14304.3287834  14283.84574707
+ 14575.90891318 14359.32563443 14806.19153749 14576.43917536
+ 14838.82908988 14763.04198214 14725.27659296 14738.37829882
+ 14851.09603646 14905.67521348 14676.88580713 14417.72132961
+ 14161.87091563 14535.78451018 14705.28795344 14520.6469565
+ 13937.47425067 13843.63461961 13878.02056841 14098.51448207
+ 14120.64336283 14190.20589615 14404.97163151 14231.25459752
+ 14320.8347799  14601.83043759 14474.05325609 14234.88630444
+ 13901.80475708 14273.04502341 14650.19989369 14855.28028944
+ 14260.70910616 14176.77206658 13708.05352069 14097.52991821
+ 14034.38394449 14030.21329192 13943.83940401 14319.54289611
+ 14564.61636532 14402.8173404  14127.87903737 14211.99920578
+ 14348.07969072 14242.29458968 14209.78296442 14378.57761623
+ 14215.47955442 14676.32372339 14644.97874967 14580.05272873
+ 14240.89240004 14301.72275924 14503.8527384  14203.54851615
+ 13965.13031537 14001.72100416 14025.33226088 14206.08792068
+ 13985.59838413 14133.38392954 14102.86127825 14668.35233182
+ 14562.34518647 14733.07848105 14539.49052096 14406.55151835
+ 13621.85061013 13570.68262424 13844.04619723 14392.31357706
+ 14131.07343929 14155.10529258 13993.26876016 14206.43133202
+ 14378.0046473  14835.46729212 15502.57008323 15218.63571377
+ 15302.13313737 14501.83494644 14419.89651712 13690.09821263
+ 13736.53048418 14203.27115274 14746.76002352 14909.21617424
+ 14950.85832902 15013.15104726 15061.36522134 15329.64720779
+ 14852.68998325 14454.58898225 14126.13446459 14169.47961413
+ 14184.18775849 13929.28814364 14255.14582607 14473.22594173
+ 14994.58803902 14887.01865092 14622.55825852 13943.52493983
+ 14386.61142009 14497.3910554  14459.62348537 14048.30395461
+ 14215.05877884 14548.76790326 14649.05232596 14369.10976287
+ 14402.08193348 14566.3757887  14679.55371263 14637.25178515
+ 14787.07610389 14879.67996445 14655.76977814 14378.14011295
+ 14514.25905664 14978.32104992 15019.48075036 14687.20993749
+ 14374.1472373  14742.13090956 15083.70814544 15466.63780497
+ 15063.23021744 14930.31140312 14444.87098957 14206.67297729
+ 13991.47401831 13839.1571768  14558.41328536 14712.04920657
+ 14804.49699833 14689.45684639 14572.0346241  14737.14276781
+ 14125.0690921  14720.68489501 14416.73934984 14755.84144373
+ 14490.30971554 14497.75072292 13936.11387616 14029.8531488
+ 14255.195171   14518.88950098 14196.59589289 14016.0312194
+ 14364.01782664 14378.56152216 14613.19111015 14269.74363911
+ 14228.68766343 14268.67781576 14283.44982074 14655.46616787
+ 14445.73644251 14764.1914451  14205.36508407 13879.0047777
+ 13555.594486   13943.79790917 14496.40956363 15145.8487519
+ 14704.02171912 14731.08494413 14516.36630965 14660.86706664
+ 14828.55599815 14869.54899713 14763.59567572 14748.72894736
+ 14188.49853281 14062.33483579 13601.02790733 14244.4908774
+ 14807.62033701 15456.86126196 15139.92743817 14965.83853208
+ 14627.98985371 14583.6944843  14706.53636414 15370.06898478
+ 15324.18533868 15059.83533961 14472.26304848 14902.73981264
+ 14963.2916588  15224.73111313 15148.46891322 14759.33088376
+ 14260.42113483 14025.13633076 14037.43433036 14110.66034914
+ 13935.41412928 14354.93482445 14378.8639393  14201.06770483
+ 13876.3725378  14062.34690631 14749.30802824 14843.32722401
+ 14897.42598085 14701.94714147 14758.79143772 14602.68662661
+ 14512.15048566 14249.04705052 14700.25942103 14629.49110556
+ 14782.20657905 14315.66851593 14362.3037637  14530.86873225
+ 14859.40380459 14782.67792277 14673.12335278 14348.64245286
+ 14710.42142076 15066.30048173 15155.08010436 15061.73365505
+ 14630.91512792 14669.42954189 13917.95752001 14173.6034541
+ 14393.45642682 14546.5064162  14230.96499422 13945.86047895
+ 14664.60897622 14993.48311086 15219.03019697 14991.03840957
+ 14829.98144342 14207.07795643 13877.66867487 14095.43888652
+ 14156.83349626 14564.98462163 14194.13788409 14499.05485127
+ 14162.47138374 14599.39197025 14260.44315095 14284.58118755
+ 14058.17928041 14401.32068577 14452.46983661 14585.4343436
+ 14698.2978979  14388.61095702 14235.9497422  14193.03173384
+ 14694.81967522 14852.25057428 14996.07383177 15001.40079986
+ 15119.2831724  14818.32519635 14452.21323673 14232.85313334
+ 14207.10456677 14351.18169279 14467.13773621 14244.23422577
+ 14050.95356768 13742.96324791 14008.06929296 14318.48379181
+ 14625.3835333  14724.02128212 14459.02228065 14345.5808906
+ 14019.11517863 14016.93708785 13943.23270623 13984.79317785
+ 13837.08764262 14240.29164778 14395.54938104 14880.86862355
+ 14353.84979976 14394.75863371 13703.79678625 14264.62871533
+ 14583.03099227 15081.81109235 14710.08933781 14463.9320048
+ 14518.98405573 14556.11787942 14812.49676162 14562.44311589
+ 14763.78967351 14515.52046661 14961.67454224 14851.16049538
+ 15042.54150799 14447.05456312 14184.02425911 13989.37337543
+ 14132.75756794 14370.27333966 14713.10843835 14804.88777045
+ 14890.51580018 14700.75114536 14406.38886716 14054.88234918
+ 13971.0540692  14195.7722975  14548.2479276  14099.33962386
+ 14145.56644245 14153.99855009 14930.22219983 14854.17114332
+ 14793.2380797  14347.79057461 14704.65271046 14644.9258826
+ 14852.22475245 14808.92334506 15390.64191726 14699.66362928
+ 14350.07441366 13730.44021683 13979.53734229 14572.80429856
+ 14976.00761934 15229.14912432 15014.14388732 15103.48335538
+ 15121.3050892  14573.29341893 14097.35318468 13801.1657952
+ 14231.2092012  14296.67260308 14222.26694147 13812.45226506
+ 13808.49358293]
+
+
+
+ +
+
+ +
+
+
+
+
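To make the raw array above easier to read, a short follow-up cell could overlay these predictions on the actual closing values (a sketch only, assuming ridge_n, X_test_n and y_test are still in scope and that go refers to plotly.graph_objects as in the setup cell):

pred_fig = go.Figure()
pred_fig.add_trace(go.Scatter(x=X_test_n.index, y=y_test.values,
                              mode='lines', name='Actual close'))
pred_fig.add_trace(go.Scatter(x=X_test_n.index, y=ridge_n.predict(X_test_n),
                              mode='lines', name='Ridge (NLP features)'))
pred_fig.update_layout(title_text='Ridge (NLP features): predicted vs. actual closing value',
                       xaxis_title='Date', yaxis_title='Closing stock value (in $)')
pred_fig.show()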

Random Forest

+
+
+
+
+
+
In [65]:
+
+
+
rf_param = {'model__n_estimators': [10, 100, 300],
+            'model__max_depth': [10, 20, 30, 40],
+            'model__min_samples_split': [2, 5, 10],
+            'model__min_samples_leaf': [1, 2, 3],
+            'model__max_features': ["auto", 'sqrt']}
+rf = RandomForestRegressor()
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', rf)])
+gridsearch_rf = GridSearchCV(estimator=pipe,
+                          param_grid = rf_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4,
+                          verbose=3
+                         )
+gridsearch_rf.fit(X_train_n, y_train_n)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
Fitting 10 folds for each of 216 candidates, totalling 2160 fits
+
+
+
+ +
+ +
+ + +
+
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
+[Parallel(n_jobs=4)]: Done  38 tasks      | elapsed:   10.0s
+[Parallel(n_jobs=4)]: Done 136 tasks      | elapsed:   49.4s
+[Parallel(n_jobs=4)]: Done 296 tasks      | elapsed:  1.7min
+[Parallel(n_jobs=4)]: Done 520 tasks      | elapsed:  2.2min
+[Parallel(n_jobs=4)]: Done 808 tasks      | elapsed:  3.8min
+[Parallel(n_jobs=4)]: Done 1160 tasks      | elapsed:  5.3min
+[Parallel(n_jobs=4)]: Done 1576 tasks      | elapsed:  7.0min
+[Parallel(n_jobs=4)]: Done 2056 tasks      | elapsed:  9.2min
+[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed:  9.5min finished
+
+
+
+ +
+ +
Out[65]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model', RandomForestRegressor())]),
+             n_jobs=4,
+             param_grid={'model__max_depth': [10, 20, 30, 40],
+                         'model__max_features': ['auto', 'sqrt'],
+                         'model__min_samples_leaf': [1, 2, 3],
+                         'model__min_samples_split': [2, 5, 10],
+                         'model__n_estimators': [10, 100, 300]},
+             scoring=make_scorer(mean_squared_error), verbose=3)
+
+ +
+ +
+
+ +
+
+
+
In [66]:
+
+
+
rf_n = gridsearch_rf.best_estimator_
+
+#get cv results of the best model + confidence intervals
+cv_score = cross_val_score(rf_n, X_train_n, y_train_n, cv=tscv, scoring=scorer)
+nlp_perf = nlp_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+
+ +
+
+
+ +
+
+
+
+

XGBoost

+
+
+
+
+
+
In [67]:
+
+
+
xgb_param = {'model__lambda': list(np.arange(1,10, 1)), #L2 regularisation
+             'model__alpha': list(np.arange(1,10, 1)),  #L1 regularisation
+            }
+xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror')
+
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', xgb)])
+gridsearch_xgb = GridSearchCV(estimator=pipe,
+                          param_grid = xgb_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4,
+                          verbose=3
+                         )
+gridsearch_xgb.fit(X_train_n, y_train_n)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
Fitting 10 folds for each of 81 candidates, totalling 810 fits
+
+
+
+ +
+ +
+ + +
+
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
+[Parallel(n_jobs=4)]: Done  56 tasks      | elapsed:    0.7s
+[Parallel(n_jobs=4)]: Done 440 tasks      | elapsed:    6.6s
+[Parallel(n_jobs=4)]: Done 810 out of 810 | elapsed:   11.1s finished
+
+
+
+ +
+ +
Out[67]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model',
+                                        XGBRegressor(base_score=None,
+                                                     booster='gblinear',
+                                                     colsample_bylevel=None,
+                                                     colsample_bynode=None,
+                                                     colsample_bytree=None,
+                                                     feature_selector='shuffle',
+                                                     gamma=None, gpu_id=None,
+                                                     importance_type='gain',
+                                                     interaction_constraints=None,
+                                                     learni...
+                                                     monotone_constraints=None,
+                                                     n_estimators=100,
+                                                     n_jobs=None,
+                                                     num_parallel_tree=None,
+                                                     random_state=None,
+                                                     reg_alpha=None,
+                                                     reg_lambda=None,
+                                                     scale_pos_weight=None,
+                                                     subsample=None,
+                                                     tree_method=None,
+                                                     validate_parameters=None,
+                                                     verbosity=None))]),
+             n_jobs=4,
+             param_grid={'model__alpha': [1, 2, 3, 4, 5, 6, 7, 8, 9],
+                         'model__lambda': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
+             scoring=make_scorer(mean_squared_error), verbose=3)
+
+ +
+ +
+
+ +
+
+
+
In [68]:
+
+
+
xgb_n = gridsearch_xgb.best_estimator_
+
+#get cv results of the best model + confidence intervals
+cv_score = cross_val_score(xgb_n, X_train_n, y_train_n, cv=tscv, scoring=scorer)
+nlp_perf = nlp_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+xgb_n
+
+ +
+
+
+ +
+
+ + +
+ +
Out[68]:
+ + + + +
+
Pipeline(steps=[('scale', StandardScaler()),
+                ('model',
+                 XGBRegressor(alpha=9, base_score=0.5, booster='gblinear',
+                              colsample_bylevel=None, colsample_bynode=None,
+                              colsample_bytree=None, feature_selector='shuffle',
+                              gamma=None, gpu_id=-1, importance_type='gain',
+                              interaction_constraints=None, lambda=1,
+                              learning_rate=0.5, max_delta_step=None,
+                              max_depth=None, min_child_weight=None,
+                              missing=nan, monotone_constraints=None,
+                              n_estimators=100, n_jobs=0,
+                              num_parallel_tree=None, random_state=0,
+                              reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
+                              subsample=None, tree_method=None,
+                              validate_parameters=1, verbosity=None))])
+
+ +
+ +
+
+ +
+
+
+
In [69]:
+
+
+
print(nlp_perf)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
   Model           MSE            SD
+0  Ridge  8.062476e+06  6.728594e+06
+1     RF  8.257951e+06  7.146497e+06
+2    XGB  8.062751e+06  6.728328e+06
+
+
+
+ +
+
+ +
+
+
+
In [70]:
+
+
+
nlp_fig = px.scatter(nlp_perf, x="Model", y='MSE', color='Model', error_y="SD")
+nlp_fig.update_layout(title_text="Performance of models trained on NLP features")
+nlp_fig.show()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + +
+
+
+ +
+ +
+
+ +
+
+
+
+

The three models perform quite similarly. They might be useful candidates for stacking.
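As a reference point for such a stack, scikit-learn's built-in StackingRegressor could combine the three fitted pipelines directly (a sketch only, assuming ridge_n, rf_n, xgb_n, tscv and the mape helper are in scope; it is not the stacking approach used later in this notebook):

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import ElasticNetCV

nlp_stack = StackingRegressor(
    estimators=[('ridge', ridge_n), ('rf', rf_n), ('xgb', xgb_n)],
    final_estimator=ElasticNetCV(cv=tscv),
    cv=5)  # plain 5-fold here, so the out-of-fold meta-features ignore the time ordering
nlp_stack.fit(X_train_n, y_train_n)
print(mape(y_test_n, nlp_stack.predict(X_test_n)))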

+ +
+
+
+
+
+
+

Econ+NLP models

Let's now use all of the features (econometric lags and NLP scores) together.

+ +
+
+
+
+
+
+

Ridge Regression

+
+
+
+
+
+
In [71]:
+
+
+
en_perf = pd.DataFrame(columns=['Model','MSE', 'SD'])
+en_perf
+
+ +
+
+
+ +
+
+ + +
+ +
Out[71]:
+ + + +
+
+ + + + + + + + + + + + +
ModelMSESD
+
+
+ +
+ +
+
+ +
+
+
+
In [72]:
+
+
+
ridge_param = {'model__alpha': list(np.arange(0.1,1,0.01))}
+ridge = Ridge(max_iter=5000)
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', ridge)])
+search_ridge = GridSearchCV(estimator=pipe,
+                          param_grid = ridge_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4,
+                          verbose=3
+                         )
+search_ridge.fit(X_train, y_train)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
Fitting 10 folds for each of 90 candidates, totalling 900 fits
+
+
+
+ +
+ +
+ + +
+
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
+[Parallel(n_jobs=4)]: Done  56 tasks      | elapsed:    0.3s
+[Parallel(n_jobs=4)]: Done 824 tasks      | elapsed:    4.7s
+[Parallel(n_jobs=4)]: Done 900 out of 900 | elapsed:    5.1s finished
+
+
+
+ +
+ +
Out[72]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model', Ridge(max_iter=5000))]),
+             n_jobs=4,
+             param_grid={'model__alpha': [0.1, 0.11, 0.12, 0.13,
+                                          0.13999999999999999,
+                                          0.14999999999999997,
+                                          0.15999999999999998,
+                                          0.16999999999999998,
+                                          0.17999999999999997,
+                                          0.18999999999999995,
+                                          0.19999999999999996,
+                                          0.2...
+                                          0.23999999999999994,
+                                          0.24999999999999992,
+                                          0.2599999999999999,
+                                          0.2699999999999999,
+                                          0.2799999999999999,
+                                          0.2899999999999999,
+                                          0.29999999999999993,
+                                          0.30999999999999994,
+                                          0.3199999999999999,
+                                          0.32999999999999985,
+                                          0.33999999999999986,
+                                          0.34999999999999987,
+                                          0.3599999999999999,
+                                          0.3699999999999999,
+                                          0.3799999999999999,
+                                          0.3899999999999999, ...]},
+             scoring=make_scorer(mean_squared_error), verbose=3)
+
+ +
+ +
+
+ +
+
+
+
In [73]:
+
+
+
ridge_en = search_ridge.best_estimator_
+
+#get cv results of the best model + confidence intervals
+cv_score = cross_val_score(ridge_en, X_train, y_train, cv=tscv, scoring=scorer)
+en_perf = en_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+ridge_en
+
+ +
+
+
+ +
+
+ + +
+ +
Out[73]:
+ + + + +
+
Pipeline(steps=[('scale', StandardScaler()),
+                ('model', Ridge(alpha=0.9899999999999995, max_iter=5000))])
+
+ +
+ +
+
+ +
+
+
+
In [74]:
+
+
+
coefs = ridge_en['model'].coef_
+ridge_coefs = pd.DataFrame({'Coef': coefs,
+                           'Name': list(X_train.columns)})
+ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs)
+ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
+ridge_coefs
+
+ +
+
+
+ +
+
+ + +
+ +
Out[74]:
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CoefName
122042.016441Close_lag_1
13415.436237Close_lag_2
1559.349649Close_lag_4
1428.210120Close_lag_3
97.691580compound_min_lag_2
07.462194compound_mean
75.646883compound_max_lag_2
1-5.208500compound_max
10-2.649817subjectivity_mean_lag_1
5-2.106836compound_mean_lag_2
81.471173compound_min_lag_1
2-1.411217compound_min
3-1.145039subjectivity_mean
4-0.809486compound_mean_lag_1
6-0.236423compound_max_lag_1
110.020128subjectivity_mean_lag_2
+
+
+ +
+ +
+
+ +
+
+
+
In [75]:
+
+
+
plotCoef(ridge_en['model'], X_train)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + + +
+ +
+ +
+ +
+
+ +
+
+
+
+

Random Forest

+
+
+
+
+
+
In [76]:
+
+
+
rf_param = {'model__n_estimators': [10, 100, 300],
+            'model__max_depth': [10, 20, 30, 40],
+            'model__min_samples_split': [2, 5, 10],
+            'model__min_samples_leaf': [1, 2, 3],
+            'model__max_features': ["auto", 'sqrt']}
+rf = RandomForestRegressor()
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', rf)])
+gridsearch_rf = GridSearchCV(estimator=pipe,
+                          param_grid = rf_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4,
+                          verbose=3
+                         )
+gridsearch_rf.fit(X_train, y_train)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
Fitting 10 folds for each of 216 candidates, totalling 2160 fits
+
+
+
+ +
+ +
+ + +
+
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
+[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    9.8s
+[Parallel(n_jobs=4)]: Done 132 tasks      | elapsed:   48.9s
+[Parallel(n_jobs=4)]: Done 292 tasks      | elapsed:  1.7min
+[Parallel(n_jobs=4)]: Done 516 tasks      | elapsed:  2.4min
+[Parallel(n_jobs=4)]: Done 804 tasks      | elapsed:  4.2min
+[Parallel(n_jobs=4)]: Done 1156 tasks      | elapsed:  5.5min
+[Parallel(n_jobs=4)]: Done 1572 tasks      | elapsed:  7.4min
+[Parallel(n_jobs=4)]: Done 2052 tasks      | elapsed:  9.7min
+[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed: 10.1min finished
+
+
+
+ +
+ +
Out[76]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model', RandomForestRegressor())]),
+             n_jobs=4,
+             param_grid={'model__max_depth': [10, 20, 30, 40],
+                         'model__max_features': ['auto', 'sqrt'],
+                         'model__min_samples_leaf': [1, 2, 3],
+                         'model__min_samples_split': [2, 5, 10],
+                         'model__n_estimators': [10, 100, 300]},
+             scoring=make_scorer(mean_squared_error), verbose=3)
+
+ +
+ +
+
+ +
+
+
+
In [77]:
+
+
+
rf_en = gridsearch_rf.best_estimator_
+
+#get cv results of the best model + confidence intervals
+cv_score = cross_val_score(rf_en, X_train, y_train, cv=tscv, scoring=scorer)
+en_perf = en_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+rf_en
+
+ +
+
+
+ +
+
+ + +
+ +
Out[77]:
+ + + + +
+
Pipeline(steps=[('scale', StandardScaler()),
+                ('model',
+                 RandomForestRegressor(max_depth=40, max_features='sqrt',
+                                       min_samples_leaf=3, min_samples_split=10,
+                                       n_estimators=10))])
+
+ +
+ +
+
+ +
+
+
+
+

XGBoost

+
+
+
+
+
+
In [78]:
+
+
+
xgb_param = {'model__lambda': list(np.arange(1,10, 1)), #L2 regularisation
+             'model__alpha': list(np.arange(1,10, 1)),  #L1 regularisation
+            }
+xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror')
+
+pipe = Pipeline([
+    ('scale', scaler),
+    ('model', xgb)])
+gridsearch_xgb = GridSearchCV(estimator=pipe,
+                          param_grid = xgb_param,
+                          scoring = scorer,
+                          cv = tscv,
+                          n_jobs=4,
+                          verbose=3
+                         )
+gridsearch_xgb.fit(X_train, y_train)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
Fitting 10 folds for each of 81 candidates, totalling 810 fits
+
+
+
+ +
+ +
+ + +
+
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
+[Parallel(n_jobs=4)]: Done  56 tasks      | elapsed:    0.7s
+[Parallel(n_jobs=4)]: Done 440 tasks      | elapsed:    6.0s
+[Parallel(n_jobs=4)]: Done 810 out of 810 | elapsed:   10.8s finished
+
+
+
+ +
+ +
Out[78]:
+ + + + +
+
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
+             estimator=Pipeline(steps=[('scale', StandardScaler()),
+                                       ('model',
+                                        XGBRegressor(base_score=None,
+                                                     booster='gblinear',
+                                                     colsample_bylevel=None,
+                                                     colsample_bynode=None,
+                                                     colsample_bytree=None,
+                                                     feature_selector='shuffle',
+                                                     gamma=None, gpu_id=None,
+                                                     importance_type='gain',
+                                                     interaction_constraints=None,
+                                                     learni...
+                                                     monotone_constraints=None,
+                                                     n_estimators=100,
+                                                     n_jobs=None,
+                                                     num_parallel_tree=None,
+                                                     random_state=None,
+                                                     reg_alpha=None,
+                                                     reg_lambda=None,
+                                                     scale_pos_weight=None,
+                                                     subsample=None,
+                                                     tree_method=None,
+                                                     validate_parameters=None,
+                                                     verbosity=None))]),
+             n_jobs=4,
+             param_grid={'model__alpha': [1, 2, 3, 4, 5, 6, 7, 8, 9],
+                         'model__lambda': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
+             scoring=make_scorer(mean_squared_error), verbose=3)
+
+ +
+ +
+
+ +
+
+
+
In [79]:
+
+
+
xgb_en = gridsearch_xgb.best_estimator_
+
+#get cv results of the best model + confidence intervals
+cv_score = cross_val_score(xgb_en, X_train, y_train, cv=tscv, scoring=scorer)
+en_perf = en_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
+xgb_en
+
+ +
+
+
+ +
+
+ + +
+ +
Out[79]:
+ + + + +
+
Pipeline(steps=[('scale', StandardScaler()),
+                ('model',
+                 XGBRegressor(alpha=9, base_score=0.5, booster='gblinear',
+                              colsample_bylevel=None, colsample_bynode=None,
+                              colsample_bytree=None, feature_selector='shuffle',
+                              gamma=None, gpu_id=-1, importance_type='gain',
+                              interaction_constraints=None, lambda=9,
+                              learning_rate=0.5, max_delta_step=None,
+                              max_depth=None, min_child_weight=None,
+                              missing=nan, monotone_constraints=None,
+                              n_estimators=100, n_jobs=0,
+                              num_parallel_tree=None, random_state=0,
+                              reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
+                              subsample=None, tree_method=None,
+                              validate_parameters=1, verbosity=None))])
+
+ +
+ +
+
+ +
+
+
+
+

Trying to stack econometric and NLP models

+
+
+
+
+
+
In [80]:
+
+
+
from sklearn.model_selection import cross_val_predict
+
+X_train_stack = pd.DataFrame(pd.DataFrame(columns=['econ_r', 'nlp_r']))
+X_train_stack['econ_r'] = cross_val_predict(ridge_e, X_train_e, y_train, cv=10)
+X_train_stack['nlp_r'] = cross_val_predict(ridge_n, X_train_n, y_train, cv=10)
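+# Note: cv=10 above is an unshuffled KFold, so some of these out-of-fold predictions come
+# from models trained on later dates (look-ahead); a time-ordered scheme would avoid this.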
+
+X_test_stack = pd.DataFrame(pd.DataFrame(columns=['econ_r', 'nlp_r']))
+X_test_stack['econ_r'] = ridge_e.predict(X_test_e)
+X_test_stack['nlp_r'] = ridge_n.predict(X_test_n)
+
+X_train_stack.to_csv("Stack_train.csv")
+X_test_stack.to_csv("Stack_test.csv")
+
+from sklearn.linear_model import ElasticNetCV
+stack = ElasticNetCV(cv=tscv)
+stack.fit(X_train_stack, y_train)
+cv_score = cross_val_score(stack, X_train_stack, y_train, cv=tscv, scoring=scorer)
+stack_performance = {'Model':'Stack', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}
+stack_performance
+
+mape(y_test, stack.predict(X_test_stack))
+
+ +
+
+
+ +
+
+ + +
+ +
Out[80]:
+ + + + +
+
1.428419912765189
+
+ +
+ +
+
+ +
+
+
+
In [81]:
+
+
+
coefs = stack.coef_
+ridge_coefs = pd.DataFrame({'Coef': coefs,
+                           'Name': list(X_train_stack.columns)})
+ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs)
+ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
+print(ridge_coefs)
+plotCoef(stack, X_train_stack)
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
      Coef    Name
+0  0.99867  econ_r
+1 -0.00000   nlp_r
+
+
+
+ +
+ +
+ + + + +
+ +
+ +
+ +
+
+ +
+
+
+
+

Model Comparison

+
+
+
+
+
+
+

prediction_compare = pd.DataFrame(columns=['y_true', 'econ_r', 'econ_rf', 'econ_x', 'nlp_r', 'nlp_rf', 'nlp_x', 'comb_r', 'comb_rf', 'comb_x', 'stack'])
+prediction_compare['y_true'] = y_test
+prediction_compare['econ_r'] = ridge_e.predict(X_test_e)
+prediction_compare['econ_rf'] = rf_e.predict(X_test_e)
+prediction_compare['econ_x'] = xgb_e.predict(X_test_e)
+prediction_compare['nlp_r'] = ridge_n.predict(X_test_n)
+prediction_compare['nlp_rf'] = rf_n.predict(X_test_n)
+prediction_compare['nlp_x'] = xgb_n.predict(X_test_n)
+prediction_compare['comb_r'] = ridge_en.predict(X_test)
+prediction_compare['comb_rf'] = rf_en.predict(X_test)
+prediction_compare['comb_x'] = xgb_en.predict(X_test)
+prediction_compare['stack'] = stack.predict(X_test_stack)

+

prediction_compare.sample(3)
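One way to summarise this comparison in a single number per model (a sketch, assuming prediction_compare has been built as above and the mape helper from Step 8 is available):

model_cols = [c for c in prediction_compare.columns if c != 'y_true']
mape_summary = pd.Series({c: mape(prediction_compare['y_true'], prediction_compare[c])
                          for c in model_cols}, name='MAPE (%)').sort_values()
print(mape_summary)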

+ +
+
+
+
+
+
+

Outcomes

->The market arguably behaves like a random walk. Although there is some direction to its movements, there is still quite a bit of randomness in them.

+

->The news that we used might not be the most relevant. Perhaps it would have been better to use news relating to the 30 companies that make up the Dow.

+

->More information could have been included in the model, such as the previous days' price change and the previous days' main headlines (a small sketch of one such feature follows this list).

+

->Many people have worked on this task for years, and companies spend millions of dollars trying to predict the movements of the market. Given the small amount of data we are working with and the simplicity of our models, we shouldn't expect anything too impressive.
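As an illustration of the previous-day change mentioned above, one such extra feature could be built directly from the existing lag_df (a sketch, not used in the models above):

# previous day's percentage change of the close, shifted so that row t only uses information available before day t
lag_df['Close_pct_change_lag_1'] = lag_df['Close'].pct_change().shift(1) * 100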

+ +
+
+
+
+
+ + + + + + diff --git a/NYSE/Dow_Jones_Industrial_Average_Prediction/DJIA_prediction_using_news.py b/NYSE/Dow_Jones_Industrial_Average_Prediction/DJIA_prediction_using_news.py new file mode 100644 index 00000000..3c73d90a --- /dev/null +++ b/NYSE/Dow_Jones_Industrial_Average_Prediction/DJIA_prediction_using_news.py @@ -0,0 +1,1103 @@ +#!/usr/bin/env python +# coding: utf-8 + +# # Dow Jones Industrial Average (DJIA) prediction +# We will be predicting the DJIA closing value by using the top 25 headlines for the day. +# +# The Dow Jones Industrial Average, Dow Jones, or simply the Dow, is a stock market index that measures the stock performance of 30 large companies listed on stock exchanges in the United States. +# +# ![image.png](attachment:image.png) +# +# +# + +# # Step 1 - Importing all the libraries + +# In[6]: + + +import time +start = time.time() +import os + +import numpy as np # linear algebra +import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) + +from sklearn.model_selection import cross_val_score #sklearn features various ML algorithms +from sklearn.feature_selection import VarianceThreshold +from sklearn.preprocessing import StandardScaler +from nltk.sentiment.vader import SentimentIntensityAnalyzer #sentiment analysis +from textblob import TextBlob #processing textual data + +#plotting +import plotly.express as px +import plotly.graph_objects as go +import seaborn as sns +import matplotlib.pyplot as plt + +#statistics & econometrics +import statsmodels.tsa.api as smt +import statsmodels.api as sm + +#model fiiting and selection +from sklearn.metrics import mean_squared_error +from sklearn.metrics import make_scorer +from sklearn.preprocessing import StandardScaler +from sklearn.pipeline import Pipeline +from sklearn.model_selection import GridSearchCV +from sklearn.model_selection import TimeSeriesSplit +from sklearn.linear_model import Lasso, Ridge +from sklearn.ensemble import RandomForestRegressor +from xgboost.sklearn import XGBRegressor + + +# # Step 2 - Importing the dataset + +# In[7]: + + +df = pd.read_csv("Combined_News_DJIA.csv", sep=',',low_memory=False, + parse_dates=[0]) + +full_stock = pd.read_csv("upload_DJIA_table.csv",low_memory=False, + parse_dates=[0]) + +#add the closing stock value to the df - this will be the y variable +df["Close"]=full_stock.Close + +#show how the dataset looks like +df.head(5) + + +# In[8]: + + +#drop the label column +df2 = df.drop(['Label'], axis=1) + +#view the new dataframe +df2.head(10) + + +# # Step 3 - Data cleaning +# NA treatment : +# We'll simply fill the NAs in the numerical features (Date, Close). In the text features we'll fill the missing values with ''. + +# In[9]: + + +#check for NAN +df2.isnull().sum() + + +# There are a few headlines missing. Let's fill them in with a whitespace. + +# In[10]: + + +#replacing nan values with a whitespace +df2 = df2.replace(np.nan, ' ', regex=True) + +#sanity check +df2.isnull().sum().sum() + + +# Remove the HTML tags : +# There are several non-word tags in the headlines that would just bias the sentiment analysis so we need to remove them and replace with ''. This can be done with regex. + +# In[11]: + + +#Remove the HTML tags +df2 = df2.replace('b\"|b\'|\\\\|\\\"', '', regex=True) +df2.head(2) + + +# Sentiment and subjectivity score extraction : +# Now we run the sentiment analysis extracting the compound score that goes from -0.5 (most negative) to 0.5 (most positive). 
I'm going to use the "dirty" texts in this part because VADER can utilize the information such as ALL CAPS, punctuation, etc. We'll also calculate the subjectivity of each headline using the TextBlob package. +# +# Initialise the VADER analyzer. + +# In[12]: + + +#Sentiment and subjectivity score extraction +Anakin = SentimentIntensityAnalyzer() + +Anakin.polarity_scores(" ") + + +# Write a function to save the subjectivity score directly from TextBlob function's output. Subjectivity score might detect direct quotes in the headlines and positive stuff is rarely quoted in the headline. + +# In[13]: + + +def detect_subjectivity(text): + return TextBlob(text).sentiment.subjectivity + +detect_subjectivity(" ") #should return 0 + + +# In[14]: + + +start_vect=time.time() +print("ANAKIN: 'Intializing the process..'") + +#get the name of the headline columns +cols = [] +for i in range(1,26): + col = ("Top{}".format(i)) + cols.append(col) + + +for col in cols: + df2[col] = df2[col].astype(str) # Make sure data is treated as a string + df2[col+'_comp']= df2[col].apply(lambda x:Anakin.polarity_scores(x)['compound']) + df2[col+'_sub'] = df2[col].apply(detect_subjectivity) + print("{} Done".format(col)) + +print("VADER: Vaderization completed after %0.2f Minutes"%((time.time() - start_vect)/60)) + + +# In[15]: + + +#the text isn't required anymore +df2 = df2.drop(cols,axis=1) +df2.head(5) + + +# Summarise the compound and subjectivity scores weighted by rating of the headline (top1 has the most weight) + +# In[16]: + + +comp_cols = [] +for col in cols: + comp_col = col + "_comp" + comp_cols.append(comp_col) + +w = np.arange(1,26,1).tolist() +w.reverse() + +weighted_comp = [] +max_comp = [] +min_comp = [] +for i in range(0,len(df)): + a = df2.loc[i,comp_cols].tolist() + weighted_comp.append(np.average(a, weights=w)) + max_comp.append(max(a)) + min_comp.append(min(a)) + +df2['compound_mean'] = weighted_comp +df2['compound_max'] = max_comp +df2['compound_min'] = min_comp + + +sub_cols = [] +for col in cols: + sub_col = col + "_sub" + sub_cols.append(sub_col) + +weighted_sub = [] +max_sub = [] +min_sub = [] +for i in range(0,len(df2)): + a = df2.loc[i,sub_cols].tolist() + weighted_sub.append(np.average(a, weights=w)) + max_sub.append(max(a)) + min_sub.append(min(a)) + +df2['subjectivity_mean'] = weighted_sub +df2['subjectivity_max'] = max_sub +df2['subjectivity_min'] = min_sub + +to_drop = sub_cols+comp_cols +df2 = df2.drop(to_drop, axis=1) + +df2.head(5) + + +# # Step 4. Exploratory Data Analysis +# First, the timeseries of the y (to be predicted) variable will be explored. It's likely that the timeseries isn't stationary which however doesn't worry us in this case as the models won't be of the classical timeseries methods family. + +# In[17]: + + +#EDA-time series plot +fig1 = go.Figure() +fig1.add_trace(go.Scatter(x=df2.Date, y=df.Close, + mode='lines')) +title = [] +title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05, + xanchor='left', yanchor='bottom', + text='Development of stock values from Aug, 2008 to Jun, 2016', + font=dict(family='Arial', + size=30, + color='rgb(37,37,37)'), + showarrow=False)) +fig1.update_layout(xaxis_title='Date', + yaxis_title='Closing stock value (in $)', + annotations=title) +fig1.show() + + +# It is quite obvious that the timeseries isn't stationary at all. It just seems to be a downwards trend over the time. +# So let's look at how much unstationary the timeseries actually is using Dickey-Fuller Test. 
+# +# In Dickey-Fuller test, we test the null hypothesis that a unit root is present in time series data. +# To make things a bit more clear, this test is checking for stationarity or non-stationary data. The test is trying to reject the null hypothesis that a unit root exists and the data is non-stationary. + +# In[18]: + + +#function for quick plotting and testing of stationarity +def stationary_plot(y, lags=None, figsize=(12, 7), style='bmh'): + """ + Plot time series, its ACF and PACF, calculate Dickey–Fuller test + + y - timeseries + lags - how many lags to include in ACF, PACF calculation + """ + if not isinstance(y, pd.Series): + y = pd.Series(y) + + with plt.style.context(style): + fig = plt.figure(figsize=figsize) + layout = (2, 2) + ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2) + acf_ax = plt.subplot2grid(layout, (1, 0)) + pacf_ax = plt.subplot2grid(layout, (1, 1)) + + y.plot(ax=ts_ax) + p_value = sm.tsa.stattools.adfuller(y)[1] + ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value)) + smt.graphics.plot_acf(y, lags=lags, ax=acf_ax) + smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax) + plt.tight_layout() + + +# In[19]: + + +stationary_plot(df2.Close) + + +# Next we look at the compound sentiment scores. + +# In[20]: + + +fig2 = go.Figure() +fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_mean, + mode='lines', + name='Mean')) +fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_max, + mode='lines', + name='Maximum')) +fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_min, + mode='lines', + name='Minimum')) +title = [] +title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05, + xanchor='left', yanchor='bottom', + text='Development of sentiment compound score', + font=dict(family='Arial', + size=30, + color='rgb(37,37,37)'), + showarrow=False)) +fig2.update_layout(xaxis_title='Date', + yaxis_title='Compound score', + annotations=title) +fig2.show() + + +# Let's also plot the distribution of the compound score. + +# In[21]: + + +compm_hist = px.histogram(df2, x="compound_mean") +compm_hist.show() + + +# Now, we take a look at the subjectivity scores. + +# In[22]: + + +fig3 = go.Figure() +fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_mean, + mode='lines', + name='Mean')) +fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_min, + mode='lines', + name='Min')) +fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_max, + mode='lines', + name='Max')) +title = [] +title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05, + xanchor='left', yanchor='bottom', + text='Development of subjectivity score', + font=dict(family='Arial', + size=30, + color='rgb(37,37,37)'), + showarrow=False)) +fig3.update_layout(xaxis_title='Date', + yaxis_title='Subjectivity score', + annotations=title) +fig3.show() + + +# Now we plot distribution of the subjectivity scores as well. + +# In[23]: + + +subm_hist = px.histogram(df2, x="subjectivity_mean") +subm_hist.show() + + +# Next we'll look at some descriptive statistics about the data. + +# In[24]: + + +df2.describe() + + +# # Feature selection +# We are not going to use many FS methods since the features were mostly handcrafted. So we'll simply look at their variance and proportion of unique values. 
+ +# In[25]: + + +def unique_ratio (col): + return len(np.unique(col))/len(col) + +cols = ['Close', 'compound_mean', 'compound_max', 'compound_min', 'subjectivity_mean', 'subjectivity_max', 'subjectivity_min'] + +ur = [] +var = [] +for col in cols: + ur.append(unique_ratio(df2[col])) + var.append(np.var(df2[col])) + +feature_sel = pd.DataFrame({'Column': cols, + 'Unique': ur, + 'Variance': var}) +feature_sel + + +# In[26]: + + +sel_fig = go.Figure(data=go.Scatter( + x=feature_sel.Column, + y=feature_sel.Unique, + mode='markers', + marker=dict(size=(feature_sel.Unique*100)), +)) +sel_fig.update_layout(title='Ratio of unique values', + yaxis_title='Unique ratio') +sel_fig.show() + + +# Compound maximum and minimum are potentially less interesting as only ~18% of their values are unique. Also maximum of subjectivity has very low values. Minimum subjectivity has contains almost only 0. We'll drop the subjectivity min and max. + +# In[27]: + + +drop = ['subjectivity_min', 'subjectivity_max'] +clean_df = df2.drop(drop,axis=1) + + +# # Step 5 - Lag the extracted features +# To allow the models to look into the past, we'll add features which are essentially just copies of rows from n-steps back. In order to not create too many new features we'll add only features from 1 week prior to the current datapoint. + +# In[28]: + + +lag_df = clean_df.copy() +lag_df.head(3) + + +# In[29]: + + +to_lag = list(lag_df.columns) +to_lag_4 = to_lag[1] +to_lag_1 = to_lag[2:len(to_lag)] + + +# In[30]: + + +#lagging text features two days back +for col in to_lag_1: + for i in range(1,3): + new_name = col + ('_lag_{}'.format(i)) + lag_df[new_name] = lag_df[col].shift(i) + +#lagging closing values 4 days back +for i in range(1, 5): + new_name = to_lag_4 + ('_lag_{}'.format(i)) + lag_df[new_name] = lag_df[to_lag_4].shift(i) + + +# In this process, rows with NAs were created. Unfortunately these rows will have to be removed since we simply don't have the data from the future. + +# In[31]: + + +#Show many rows need to be removed +lag_df.head(10) + + +# Above we can see that the first 4 rows now have missing values. Let's delete them and reset index. + +# In[32]: + + +lag_df = lag_df.drop(lag_df.index[[np.arange(0,4)]]) +lag_df = lag_df.reset_index(drop=True) + +#sanity check for NaNs +lag_df.isnull().sum().sum() + + +# In[33]: + + +lag_df.head(5) + + +# # Step 8 - Model training +# Let's train 3 ML models. We'll do this in 2 rounds. First, using the econometric features alone (7 lags of y). +# Second, including the information extracted from the headlines (compound, subjectivity and their lags) +# +# Models: +# +# 1)Ridge regression - punish model for using too many features but doesn't allow the coeficients drop to zero completely +# +# 2)Random forest +# +# 3)XGBoost +# +# We'll score all models by mean squared error as it gives higher penalty to larger mistakes. And before each model training we'll standardize the training data. +# +# The first step will be creating folds for cross-validation. We'll use the same folds for all models in order to allow for creating a meta-model. Since we're working with timeseries the folds cannot be randomly selected. Instead a fold will be a sequence of data so that we don't lose the time information. + +# In[34]: + + +# for time-series cross-validation set 10 folds +tscv = TimeSeriesSplit(n_splits=10) + + +# The cost function to minimize is mean squared error because this function assigns cost proportionally to the error size. 
The mean absolute percentage error will be used for plotting and easier interpretation. It's much easier to understand the errors of a model in terms of percentage. Each training set is scaled (normalized) independently to minimize data leakage. + +# In[35]: + + +def mape(y_true, y_pred): + return np.mean(np.abs((y_true - y_pred) / y_true)) * 100 +scorer = make_scorer(mean_squared_error) +scaler = StandardScaler() + + +# Next we split the dataset into training and testing. 20% of the data will be used for testing. + +# In[36]: + + +def ts_train_test_split(X, y, test_size): + """ + Perform train-test split with respect to time series structure + """ + + # get the index after which test set starts + test_index = int(len(X)*(1-test_size)) + + X_train = X.iloc[:test_index] + y_train = y.iloc[:test_index] + X_test = X.iloc[test_index:] + y_test = y.iloc[test_index:] + + return X_train, X_test, y_train, y_test + + +# In[37]: + + +X = lag_df.drop(['Close'],axis=1) +X.index = X["Date"] +X = X.drop(['Date'],axis=1) +y = lag_df.Close + +X_train, X_test, y_train, y_test = ts_train_test_split(X, y, test_size = 0.2) + +#sanity check +(len(X_train)+len(X_test))==len(X) + + +# In[38]: + + +#function for plotting coeficients of models (lasso and XGBoost) +def plotCoef(model,train_x): + """ + Plots sorted coefficient values of the model + """ + + coefs = pd.DataFrame(model.coef_, train_x.columns) + coefs.columns = ["coef"] + coefs["abs"] = coefs.coef.apply(np.abs) + coefs = coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1) + + plt.figure(figsize=(15, 7)) + coefs.coef.plot(kind='bar') + plt.grid(True, axis='y') + plt.hlines(y=0, xmin=0, xmax=len(coefs), linestyles='dashed'); + + +# # Step 8.1 - Econometric models +# First let's train models using only the lags of the y variable (i.e. Close). 
+ +# In[39]: + + +econ_cols = list(X_train.columns) +econ_cols = econ_cols[12:17] +X_train_e = X_train[econ_cols] +X_test_e = X_test[econ_cols] +y_train_e = y_train +y_test_e = y_test + + +# In[40]: + + +econ_perf = pd.DataFrame(columns=['Model','MSE', 'SD']) +econ_perf + + +# # Ridge regression + +# In[41]: + + +ridge_param = {'model__alpha': list(np.arange(0.001,1,0.001))} +ridge = Ridge(max_iter=5000) +pipe = Pipeline([ + ('scale', scaler), + ('model', ridge)]) +search_ridge = GridSearchCV(estimator=pipe, + param_grid = ridge_param, + scoring = scorer, + cv = tscv, + n_jobs=4 + ) +search_ridge.fit(X_train_e, y_train_e) + + +# In[42]: + + +ridge_e = search_ridge.best_estimator_ + +#get cv results of the best model + confidence intervals +from sklearn.model_selection import cross_val_score +cv_score = cross_val_score(ridge_e, X_train_e, y_train_e, cv=tscv, scoring=scorer) +econ_perf = econ_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) +ridge_e + + +# In[43]: + + +plotCoef(ridge_e['model'], X_train_e) + + +# In[44]: + + +coefs = ridge_e['model'].coef_ +ridge_coefs = pd.DataFrame({'Coef': coefs, + 'Name': list(X_train_e.columns)}) +ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs) +ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1) +ridge_coefs + + +# In[45]: + + +econ_perf + + +# # Random forest + +# In[46]: + + +rf_param = {'model__n_estimators': [10, 100, 300], + 'model__max_depth': [10, 20, 30, 40], + 'model__min_samples_split': [2, 5, 10], + 'model__min_samples_leaf': [1, 2, 3], + 'model__max_features': ["auto", 'sqrt']} + + +# In[47]: + + +rf = RandomForestRegressor() +pipe = Pipeline([ + ('scale', scaler), + ('model', rf)]) +gridsearch_rf = GridSearchCV(estimator=pipe, + param_grid = rf_param, + scoring = scorer, + cv = tscv, + n_jobs=4, + verbose=3 + ) + + +# In[48]: + + +gridsearch_rf.fit(X_train_e, y_train_e) + + +# In[49]: + + +rf_e = gridsearch_rf.best_estimator_ + +#get cv results of the best model + confidence intervals +cv_score = cross_val_score(rf_e, X_train_e, y_train_e, cv=tscv, scoring=scorer) +econ_perf = econ_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) + + +# In[50]: + + +econ_perf + + +# # XGBoost +# Using linear booster because tree methods don't work with timeseries very well + +# In[52]: + + +xgb_param = {'model__lambda': list(np.arange(0.1,3, 0.1)), #L2 regularisation + 'model__alpha': list(np.arange(0.1,3, 0.1)), #L1 regularisation + } + + +# In[53]: + + +xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror') + +pipe = Pipeline([ + ('scale', scaler), + ('model', xgb)]) +gridsearch_xgb = GridSearchCV(estimator=pipe, + param_grid = xgb_param, + scoring = scorer, + cv = tscv, + n_jobs=4, + verbose=3 + ) + + +# In[54]: + + +gridsearch_xgb.fit(X_train_e, y_train_e) + + +# In[55]: + + +xgb_e = gridsearch_xgb.best_estimator_ + +#get cv results of the best model + confidence intervals +cv_score = cross_val_score(xgb_e, X_train_e, y_train_e, cv=tscv, scoring=scorer) +econ_perf = econ_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) +xgb_e + + +# In[56]: + + +print(econ_perf) + + +# In[57]: + + +econ_fig = px.scatter(econ_perf, x="Model", y='MSE', color='Model', error_y="SD") +econ_fig.update_layout(title_text="Performance of models trained on lags of y") +econ_fig.show() + + +# # Step 8.2 - NLP models +# Let's try now predict the stock 
value using only information from the news headlines. + +# In[58]: + + +X_train_n = X_train.drop(econ_cols, axis=1) +X_test_n = X_test.drop(econ_cols, axis=1) +y_train_n = y_train +y_test_n = y_test + + +# In[59]: + + +nlp_perf = pd.DataFrame(columns=['Model','MSE', 'SD']) +nlp_perf + + +# +# # Ridge regression + +# In[60]: + + +ridge_param = {'model__alpha': list(np.arange(1,10,0.1))} +ridge = Ridge(max_iter=5000) +pipe = Pipeline([ + ('scale', scaler), + ('model', ridge) +]) +search_ridge = GridSearchCV(estimator=pipe, + param_grid = ridge_param, + scoring = scorer, + cv = tscv, + n_jobs=4 + ) +search_ridge.fit(X_train_n, y_train_n) + + +# In[61]: + + +ridge_n = search_ridge.best_estimator_ + +#get cv results of the best model + confidence intervals +cv_score = cross_val_score(ridge_n, X_train_n, y_train_n, cv=tscv, scoring=scorer) +nlp_perf = nlp_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) +ridge_n + + +# In[62]: + + +plotCoef(ridge_n['model'], X_train_n) + +coefs = ridge_n['model'].coef_ +ridge_coefs = pd.DataFrame({'Coef': coefs, + 'Name': list(X_train_n.columns)}) +ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs) +ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1) +ridge_coefs + + +# In[63]: + + +mape(y_test, ridge_n.predict(X_test_n)) + + +# Predicted values for years 2014-2016 using ridge regression. + +# In[64]: + + +print(ridge_n.predict(X_test_n)) + + +# # Random Forest + +# In[65]: + + +rf_param = {'model__n_estimators': [10, 100, 300], + 'model__max_depth': [10, 20, 30, 40], + 'model__min_samples_split': [2, 5, 10], + 'model__min_samples_leaf': [1, 2, 3], + 'model__max_features': ["auto", 'sqrt']} +rf = RandomForestRegressor() +pipe = Pipeline([ + ('scale', scaler), + ('model', rf)]) +gridsearch_rf = GridSearchCV(estimator=pipe, + param_grid = rf_param, + scoring = scorer, + cv = tscv, + n_jobs=4, + verbose=3 + ) +gridsearch_rf.fit(X_train_n, y_train_n) + + +# In[66]: + + +rf_n = gridsearch_rf.best_estimator_ + +#get cv results of the best model + confidence intervals +cv_score = cross_val_score(rf_n, X_train_n, y_train_n, cv=tscv, scoring=scorer) +nlp_perf = nlp_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) + + +# # XGBoost + +# In[67]: + + +xgb_param = {'model__lambda': list(np.arange(1,10, 1)), #L2 regularisation + 'model__alpha': list(np.arange(1,10, 1)), #L1 regularisation + } +xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror') + +pipe = Pipeline([ + ('scale', scaler), + ('model', xgb)]) +gridsearch_xgb = GridSearchCV(estimator=pipe, + param_grid = xgb_param, + scoring = scorer, + cv = tscv, + n_jobs=4, + verbose=3 + ) +gridsearch_xgb.fit(X_train_n, y_train_n) + + +# In[68]: + + +xgb_n = gridsearch_xgb.best_estimator_ + +#get cv results of the best model + confidence intervals +cv_score = cross_val_score(xgb_n, X_train_n, y_train_n, cv=tscv, scoring=scorer) +nlp_perf = nlp_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) +xgb_n + + +# In[69]: + + +print(nlp_perf) + + +# In[70]: + + +nlp_fig = px.scatter(nlp_perf, x="Model", y='MSE', color='Model', error_y="SD") +#nlp_fig.update_layout(title_text="Performance of models trained on NLP features", +nlp_fig.show() + + +# The 3 models are performing quite similary. They might be useful candidates for stacking. 
+ +# # Econ+NLP models +# Let's use all features now + +# # Ridge Regression + +# In[71]: + + +en_perf = pd.DataFrame(columns=['Model','MSE', 'SD']) +en_perf + + +# In[72]: + + +ridge_param = {'model__alpha': list(np.arange(0.1,1,0.01))} +ridge = Ridge(max_iter=5000) +pipe = Pipeline([ + ('scale', scaler), + ('model', ridge)]) +search_ridge = GridSearchCV(estimator=pipe, + param_grid = ridge_param, + scoring = scorer, + cv = tscv, + n_jobs=4, + verbose=3 + ) +search_ridge.fit(X_train, y_train) + + +# In[73]: + + +ridge_en = search_ridge.best_estimator_ + +#get cv results of the best model + confidence intervals +cv_score = cross_val_score(ridge_en, X_train, y_train, cv=tscv, scoring=scorer) +en_perf = en_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) +ridge_en + + +# In[74]: + + +coefs = ridge_en['model'].coef_ +ridge_coefs = pd.DataFrame({'Coef': coefs, + 'Name': list(X_train.columns)}) +ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs) +ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1) +ridge_coefs + + +# In[75]: + + +plotCoef(ridge_en['model'], X_train) + + +# # Random Forest + +# In[76]: + + +rf_param = {'model__n_estimators': [10, 100, 300], + 'model__max_depth': [10, 20, 30, 40], + 'model__min_samples_split': [2, 5, 10], + 'model__min_samples_leaf': [1, 2, 3], + 'model__max_features': ["auto", 'sqrt']} +rf = RandomForestRegressor() +pipe = Pipeline([ + ('scale', scaler), + ('model', rf)]) +gridsearch_rf = GridSearchCV(estimator=pipe, + param_grid = rf_param, + scoring = scorer, + cv = tscv, + n_jobs=4, + verbose=3 + ) +gridsearch_rf.fit(X_train, y_train) + + +# In[77]: + + +rf_en = gridsearch_rf.best_estimator_ + +#get cv results of the best model + confidence intervals +cv_score = cross_val_score(rf_en, X_train, y_train, cv=tscv, scoring=scorer) +en_perf = en_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) +rf_en + + +# # XGBoost + +# In[78]: + + +xgb_param = {'model__lambda': list(np.arange(1,10, 1)), #L2 regularisation + 'model__alpha': list(np.arange(1,10, 1)), #L1 regularisation + } +xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror') + +pipe = Pipeline([ + ('scale', scaler), + ('model', xgb)]) +gridsearch_xgb = GridSearchCV(estimator=pipe, + param_grid = xgb_param, + scoring = scorer, + cv = tscv, + n_jobs=4, + verbose=3 + ) +gridsearch_xgb.fit(X_train, y_train) + + +# In[79]: + + +xgb_en = gridsearch_xgb.best_estimator_ + +#get cv results of the best model + confidence intervals +cv_score = cross_val_score(xgb_en, X_train, y_train, cv=tscv, scoring=scorer) +en_perf = en_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True) +xgb_en + + +# # Trying to stack econometric and NLP models + +# In[80]: + + +from sklearn.model_selection import cross_val_predict + +X_train_stack = pd.DataFrame(pd.DataFrame(columns=['econ_r', 'nlp_r'])) +X_train_stack['econ_r'] = cross_val_predict(ridge_e, X_train_e, y_train, cv=10) +X_train_stack['nlp_r'] = cross_val_predict(ridge_n, X_train_n, y_train, cv=10) + +X_test_stack = pd.DataFrame(pd.DataFrame(columns=['econ_r', 'nlp_r'])) +X_test_stack['econ_r'] = ridge_e.predict(X_test_e) +X_test_stack['nlp_r'] = ridge_n.predict(X_test_n) + +X_train_stack.to_csv("Stack_train.csv") +X_test_stack.to_csv("Stack_test.csv") + +from sklearn.linear_model import ElasticNetCV +stack = ElasticNetCV(cv=tscv) +stack.fit(X_train_stack, y_train) 
+
+cv_score = cross_val_score(stack, X_train_stack, y_train, cv=tscv, scoring=scorer)
+stack_performance = {'Model': 'Stacked', 'MSE': np.mean(cv_score), 'SD': np.std(cv_score)}
+stack_performance
+
+# out-of-sample MAPE of the stacked model
+mape(y_test, stack.predict(X_test_stack))
+
+
+# In[81]:
+
+
+coefs = stack.coef_
+stack_coefs = pd.DataFrame({'Coef': coefs,
+                            'Name': list(X_train_stack.columns)})
+stack_coefs["abs"] = stack_coefs.Coef.apply(np.abs)
+stack_coefs = stack_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
+print(stack_coefs)
+plotCoef(stack, X_train_stack)
+
+
+# # Model Comparison
+
+# prediction_compare = pd.DataFrame(columns=['y_true', 'econ_r', 'econ_rf', 'econ_x', 'nlp_r', 'nlp_rf', 'nlp_x', 'comb_r', 'comb_rf', 'comb_x', 'stack'])
+# prediction_compare['y_true'] = y_test
+# prediction_compare['econ_r'] = ridge_e.predict(X_test_e)
+# prediction_compare['econ_rf'] = rf_e.predict(X_test_e)
+# prediction_compare['econ_x'] = xgb_e.predict(X_test_e)
+# prediction_compare['nlp_r'] = ridge_n.predict(X_test_n)
+# prediction_compare['nlp_rf'] = rf_n.predict(X_test_n)
+# prediction_compare['nlp_x'] = xgb_n.predict(X_test_n)
+# prediction_compare['comb_r'] = ridge_en.predict(X_test)
+# prediction_compare['comb_rf'] = rf_en.predict(X_test)
+# prediction_compare['comb_x'] = xgb_en.predict(X_test)
+# prediction_compare['stack'] = stack.predict(X_test_stack)
+#
+# prediction_compare.sample(3)
+
+# # Outcomes
+# ->The market arguably behaves like a random walk: although its movements have some direction, a large random component remains.
+#
+# ->The news we used might not be the most relevant; headlines specifically about the 30 companies that make up the Dow might have carried a stronger signal.
+#
+# ->More information could have been included in the model, such as the previous days' price changes and their main headlines (see the sketch after these notes).
+#
+# ->People have worked on this task for years and companies spend millions of dollars trying to predict market movements, so we should not expect too much from the small dataset and the deliberately simple models used here.
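+
+# # Appendix: a possible extension
+
+# A hedged sketch, not part of the original pipeline: the notes above suggest using the
+# previous day's change as an input. Assuming `df` from the data-loading step is still in
+# scope with its 'Close' column, lagged price features could be built like this and then
+# joined to X_train / X_test (respecting the same chronological split) before re-running
+# the grid searches.
+lag_feats = pd.DataFrame({
+    'prev_close': df['Close'].shift(1),                # previous day's closing value
+    'prev_return': df['Close'].pct_change().shift(1),  # previous day's percentage change
+})
+lag_feats.head()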
diff --git a/NYSE/Dow_Jones_Industrial_Average_Prediction/README.md b/NYSE/Dow_Jones_Industrial_Average_Prediction/README.md
new file mode 100644
index 00000000..032a3867
--- /dev/null
+++ b/NYSE/Dow_Jones_Industrial_Average_Prediction/README.md
@@ -0,0 +1,2 @@
+# Dow_Jones_Industrial_Average_Prediction
+GROUP PROJECT. Context: we predict the Dow Jones Industrial Average (DJIA) closing value from the top 25 news headlines of each day. The Dow Jones Industrial Average, Dow Jones, or simply the Dow, is a stock market index that measures the stock performance of 30 large companies listed on stock exchanges in the United States. Content: two channels of data are used. News data: historical news headlines crawled from the Reddit WorldNews channel (/r/worldnews); headlines are ranked by Reddit users' votes, and only the top 25 per date are kept (range: 2008-06-08 to 2016-07-01). Stock data: the Dow Jones Industrial Average (DJIA) itself, used to "prove the concept" (range: 2008-08-08 to 2016-07-01). Three .csv data files are used. RedditNews.csv has two columns: the first is the date and the second is the news headline; headlines are ranked from top to bottom by popularity, so there are 25 rows per date. upload_DJIA_table.csv was downloaded directly from Yahoo Finance. Combined_News_DJIA.csv has "Date" in the first column, "Label" in the second, and the news headlines "Top1" to "Top25" in the remaining columns. Projected ML techniques: we use Ridge regression, XGBoost and random forest models to predict the DJIA closing value and then compare their results. Group members: Akhil Teja Balabadruni, Amaan, Varun Iyer
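+
+A minimal sketch for sanity-checking the news file described above (the column layout is assumed from this description, not guaranteed):
+
+```python
+import pandas as pd
+
+news = pd.read_csv("RedditNews.csv")  # assumed to sit next to this README
+print(news.columns.tolist())          # expected: a date column and a headline column
+# rows per date should be overwhelmingly 25, as stated above
+print(news.groupby(news.columns[0]).size().value_counts().head())
+```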