An ETL script written in Python that extracts tech-themed headlines and their corresponding articles from CNN Business - Tech, transforms them by summarizing the articles and applying sentiment analysis, and loads everything into a dataframe.
The final output is a pandas dataframe containing the following columns:
- Company: Lists the companies mentioned in the corresponding article
- Headline: Raw headline string
- Article: Raw article string
- Article Sentiment: Sentiment score of the unmodified article
- Article Sentiment Description: Sentiment grading of the unmodified article
- Summary Sentiment: Sentiment score of the summarized article
- Summary Sentiment Description: Sentiment grading of the summarized article
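As a minimal sketch of the output layout (the column names come from the list above; the single row is an illustrative placeholder, not real scraped data):

```python
import pandas as pd

# Column layout of the final dataframe; the row values below are
# hypothetical examples, not output from the actual pipeline.
columns = [
    'Company', 'Headline', 'Article',
    'Article Sentiment', 'Article Sentiment Description',
    'Summary Sentiment', 'Summary Sentiment Description',
]
df = pd.DataFrame(
    [['ExampleCorp', 'Example headline', 'Example article text',
      0.42, 'Positive', 0.37, 'Positive']],
    columns=columns,
)
print(df.columns.tolist())
```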
beautifulsoup4==4.11.1
nltk==3.8.1
numpy==1.23.2
openai==1.3.4
pandas==1.4.3
Requests==2.31.0
selenium==4.15.2
webdriver_manager==3.8.6
Can also be found in requirements.txt
The companies.csv file provided includes a list of companies that could potentially be mentioned in the headlines.
companies_from_csv = pd.read_csv('companies.csv')
companies_list = companies_from_csv['company'].tolist()
print(len(companies_list))
print(companies_list)
Company names are then stored into a list, companies_list.
Data is extracted from the CNN Tech page using BeautifulSoup:
- Specific sections of the obtained HTML are filtered, and a search for headlines and links is performed.
- The headlines and links are appended to two separate lists, then filtered to remove unwanted characters or strings.
- Headlines are matched against the company names list to find which headlines reference a name from the list; the rest are discarded.
- Company and Headline are tabulated into a dataframe.
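The steps above can be sketched as follows. The HTML snippet, tag names, and class names here are illustrative stand-ins, not CNN's actual markup (which changes over time):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched CNN Tech page.
html = """
<div class="container">
  <a href="/2023/tech/acme-launch"><span class="headline">Acme launches new chip</span></a>
  <a href="/2023/tech/other"><span class="headline">Weather update</span></a>
</div>
"""
companies_list = ['Acme', 'Globex']

soup = BeautifulSoup(html, 'html.parser')
headlines, links = [], []
for a in soup.select('a'):
    span = a.find('span', class_='headline')
    if span is None:
        continue
    headlines.append(span.get_text(strip=True))
    links.append(a['href'])

# Keep only headlines that mention a known company name.
matched = [(c, h) for h in headlines for c in companies_list if c in h]
print(matched)  # → [('Acme', 'Acme launches new chip')]
```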
Selenium was mainly used to delay BeautifulSoup's extraction of the article page contents, since the search results are not loaded instantly by default. Each article was searched for on CNN News Search as a GET query, sorted by relevance. After a 10-second delay following the search, BeautifulSoup was used to scrape the article, which was then placed back into the dataframe.
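A sketch of the GET-query construction and the delay pattern. The base URL and parameter names are assumptions based on the description above, not a verified CNN search API:

```python
import time
from urllib.parse import urlencode

# Assumed search endpoint; verify against the live site before use.
SEARCH_URL = 'https://www.cnn.com/search'

def build_search_url(headline, size=10):
    """Build a GET query URL for CNN News Search for one headline."""
    params = {'q': headline, 'size': size}
    return SEARCH_URL + '?' + urlencode(params)

url = build_search_url('Acme launches new chip')
print(url)
# In the real script, after driver.get(url) the code waits before scraping:
# time.sleep(10)  # give the search results time to load
```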
The function interacts with the OpenAI API to generate language-based responses. It takes a prompt as input (in this case a summarization prompt), sends a request to the OpenAI API, and returns the generated text as the output to be stored in a summaries list.
from openai import OpenAI

client = OpenAI(
    api_key='XXXXXXXXXXXXXXXXXXX',
)

def get_response(prompt):
    completions = client.completions.create(
        model="text-davinci-003",
        prompt=prompt,
        stream=False,
        max_tokens=1024,
        n=1,
        stop=None,
        temperature=0.5,
    )
    message = completions.choices[0].text
    return message
Articles are then run through the function and the resulting summaries are stored in a list. The model text-davinci-003 was used for this purpose.
Note: The API call limit is 3 requests per minute.
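Given the 3-requests-per-minute limit, successive calls to get_response need to be spaced at least 20 seconds apart. A minimal throttling sketch (the class name and interval handling are my own, not part of the original script; a short interval is used here so the example runs quickly):

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        # Sleep just long enough to keep calls min_interval apart.
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# 3 requests per minute -> min_interval=20 in the real script.
throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # real usage: throttle.wait(); get_response(prompt)
elapsed = time.monotonic() - start
```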
This block performs sentiment analysis on a list of articles using the VADER sentiment analysis tool from NLTK. For each article in articles_list, its sentiment is analyzed using the SentimentIntensityAnalyzer, and the compound sentiment score is stored in article_sentiment_list.
Sentiments are then categorized based on their compound score:
- If the score is greater than or equal to 0.05, label it as 'Good News' and sentiment as 'Positive'.
- If the score is less than or equal to -0.05, label it as 'Bad News' and sentiment as 'Negative'.
- Otherwise, label it as 'Neutral' and sentiment as 'Neutral'.
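The categorization above can be sketched as a small helper (the function name is my own; the thresholds come from the list above):

```python
def categorize_sentiment(compound):
    """Map a VADER compound score to the labels described above."""
    if compound >= 0.05:
        return 'Good News', 'Positive'
    if compound <= -0.05:
        return 'Bad News', 'Negative'
    return 'Neutral', 'Neutral'

print(categorize_sentiment(0.42))   # → ('Good News', 'Positive')
print(categorize_sentiment(-0.3))   # → ('Bad News', 'Negative')
print(categorize_sentiment(0.0))    # → ('Neutral', 'Neutral')
```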
The obtained dataframe is then appended to a CSV file, loaded_data.csv:
df.to_csv('loaded_data.csv', mode='a', header=False, index=False)
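Because mode='a' with header=False assumes the file already contains a header row, one common pattern is to write the header only when the file does not exist yet. A sketch under that assumption (with a trimmed column set for illustration):

```python
import os
import pandas as pd

# Illustrative dataframe; the real one carries all seven columns.
df = pd.DataFrame(
    [['Acme', 'Example headline']],
    columns=['Company', 'Headline'],
)

path = 'loaded_data.csv'
# Write the header only on the first run, when the file is absent.
df.to_csv(path, mode='a', header=not os.path.exists(path), index=False)
```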
Creates a log of runs along with their time and date, stored into project_log.txt.
from datetime import datetime

timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second
now = datetime.now() # get current timestamp
timestamp = now.strftime(timestamp_format)
with open("project_log.txt", "a") as f:
    f.write(timestamp + ': ' + "Data appended " + '\n')
- Possibility of adding multiple news sources
- Possibility of eliminating the use of Selenium and merely extracting the headlines and their corresponding links into a dictionary in the extract phase
- Airflow or Cron modification to allow for automated running