- Problem Statement
- Data Sourcing
- Data Preparation
- Data Modeling
- Data Visualization
- Data Analysis
- Insights
- Shareable link
The purpose of this analysis is to gain insights into the number of engagements the #NG30DaysOfLearning hashtag has on Twitter.
For this study, we examined a variety of metrics: the number of tweets, the number of users, the most active users, the most mentioned tools, etc.
- The dataset used for this analysis was scraped from Twitter using the `snscrape` library in a Jupyter Notebook.
- A preview of the dataset and the Python code is shown below:

```python
!pip install snscrape
```
```python
import pandas as pd
import snscrape.modules.twitter as sntwitter

# Search query: both hashtag variants, restricted to the challenge window
query = "(#30DaysOfLearning OR #NG30DaysOfLearning) until:2022-06-26 since:2022-05-05"

tweets = []
limit = 30000

# TwitterSearchScraper accepts a full search query (hashtags plus date operators)
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    tweets.append([tweet.date, tweet.url, tweet.user.username, tweet.sourceLabel,
                   tweet.user.location, tweet.content, tweet.likeCount,
                   tweet.retweetCount, tweet.quoteCount, tweet.replyCount])

df = pd.DataFrame(tweets, columns=['Date', 'TweetURL', 'User', 'Source', 'Location',
                                   'Tweet', 'Likes_Count', 'Retweet_Count',
                                   'Quote_Count', 'Reply_Count'])
df.to_csv('30DLTweets.csv')
df.head()
```
| | Date | TweetURL | User | Source | Location | Tweet | Likes_Count | Retweet_Count | Quote_Count | Reply_Count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-06-25 22:51:18+00:00 | https://twitter.com/poetrineer/status/15408300... | poetrineer | Twitter for Android | Oyo, Nigeria | So as one of my commitment to document my lear... | 0 | 0 | 0 | 1 |
| 1 | 2022-06-25 22:44:10+00:00 | https://twitter.com/poetrineer/status/15408282... | poetrineer | Twitter for Android | Oyo, Nigeria | Finally, here is my updated COVID-19 Data Anal... | 1 | 0 | 0 | 0 |
| 2 | 2022-06-25 19:25:58+00:00 | https://twitter.com/MichealOjuri/status/154077... | MichealOjuri | Twitter Web App | Oyo, Nigeria | #30NGDaysOfLearning\n#30daysoflearning \n#micr... | 0 | 0 | 0 | 0 |
| 3 | 2022-06-25 16:44:36+00:00 | https://twitter.com/oye__aashu/status/15407377... | oye__aashu | Twitter for Android | Nainital, India | Day 4/ #30daysoflearning learned all about arr... | 3 | 0 | 0 | 1 |
| 4 | 2022-06-25 12:49:02+00:00 | https://twitter.com/hsb_data/status/1540678455... | hsb_data | Twitter Web App | New Jersey | Learning about sub queries on @DataCamp (SQL) ... | 1 | 0 | 0 | 0 |
```python
df.describe()
```
| | Likes_Count | Retweet_Count | Quote_Count | Reply_Count |
|---|---|---|---|---|
| count | 683.000000 | 683.000000 | 683.000000 | 683.000000 |
| mean | 15.780381 | 3.812592 | 0.185944 | 1.166911 |
| std | 41.164555 | 12.665561 | 0.737938 | 2.626554 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 8.000000 | 2.000000 | 0.000000 | 1.000000 |
| max | 549.000000 | 248.000000 | 9.000000 | 29.000000 |
The dataset is also available at `30DLTweets`.

Data transformation was done in Power Query, and the dataset was loaded into Microsoft Power BI Desktop for modeling.

The scraped Twitter dataset is given by a single table named `30DLTweets`, which has 10 columns and 680 rows of observations.

The tabulation below shows the `30DLTweets` table with its column names and their descriptions:
| Column Name | Description |
|---|---|
| Date | Represents the date and time of the tweet |
| TweetURL | Describes the tweet URL |
| User | Describes the username of the user |
| Source | Describes the device or app the tweet was sent from |
| Location | Describes the location of the user |
| Tweet | Describes the content of the tweet |
| Likes_Count | Represents the count of likes on the tweet |
| Retweet_Count | Represents the count of retweets of the tweet |
| Quote_Count | Represents the count of quote tweets on the tweet |
| Reply_Count | Represents the count of replies on the tweet |
Data cleaning for the dataset was done in Power Query as follows:

- The `30DLTweets` table was split into one dimension table and two fact tables, called `UserProfile`, `TweetStats`, and `TweetProfile` respectively.
- A calculated column `UserID` was created in each of the tables using the M formula `[User] & "_" & [Source]` (see the sketch after this list).
- Unnecessary columns were removed from each of the tables.
- Each column in the tables was validated to have the correct data type.
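A minimal Power Query (M) sketch of the `UserID` step is shown below; the step name `#"Removed Columns"` is a placeholder for whatever step precedes it in each query:

```m
// Hypothetical step inside each query's Advanced Editor:
// add a UserID key by concatenating the User and Source columns.
UserID = Table.AddColumn(#"Removed Columns", "UserID", each [User] & "_" & [Source], type text)
```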
To ensure the accuracy of the dates in the `Date` column of the `TweetStats` and `TweetProfile` tables, a date table was created for referencing using the M formula:

`{Number.From(List.Min(TweetProfile[Date]))..Number.From(List.Max(TweetProfile[Date]))}`

Here is a breakdown of what the formula does: `List.Min` and `List.Max` return the earliest and latest dates in `TweetProfile[Date]`, `Number.From` converts each date to its underlying serial number, and the `{x..y}` list operator generates every day in between. For the dataset, this makes the date table span the earliest to the latest date in the data: May 9, 2022 - June 25, 2022.

A `Day Name` column was inserted into the date table and renamed to `DayOfTheWeek`, and the date table was named `Calender`.
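A sketch of what the full `Calender` query might look like, assuming the list formula above is its `Source` step (all step names here are illustrative):

```m
let
    // Serial numbers covering the earliest to latest tweet date
    Source = {Number.From(List.Min(TweetProfile[Date]))..Number.From(List.Max(TweetProfile[Date]))},
    // Convert the list to a one-column table and type the column as dates
    ToTable = Table.FromList(Source, Splitter.SplitByNothing(), {"Date"}),
    Typed = Table.TransformColumnTypes(ToTable, {{"Date", type date}}),
    // Day Name column, renamed to DayOfTheWeek as described above
    AddedDay = Table.AddColumn(Typed, "DayOfTheWeek", each Date.DayOfWeekName([Date]), type text)
in
    AddedDay
```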
After the dataset was cleaned and transformed, it was ready to be modeled.
- The `Calender` table was marked as the official date table in the dataset.
- A many-to-one (*:1) relationship was created between the `TweetStats` and the `Calender` tables using the `Date` column in each of the tables.
- A many-to-one (*:1) relationship was created between the `TweetProfile` and the `Calender` tables using the `Date` column in each of the tables.
- A one-to-many (1:*) relationship was created between the `UserProfile` and the `TweetStats` tables using the `UserID` column in each of the tables.
- A one-to-many (1:*) relationship was created between the `UserProfile` and the `TweetProfile` tables using the `UserID` column in each of the tables.
- The relationships formed in the data model make up a Star Schema and are shown below:
Data visualization for the dataset was done in three parts using Microsoft Power BI Desktop:

- The `Content Analysis` page: shows the tools by mention, word cloud, top active users, etc.
- The `Summary` page: shows the total number of tweets, total number of users, tweets by day of the week, etc.
- The `Dashboard` page: combines visualizations from `Content Analysis` and `Summary` to provide answers to the Problem Statement.
Figure 1 shows visualizations from the `Content Analysis` page.

Figure 2 shows visualizations from the `Summary` page.

Figure 3 shows visualizations from the `Dashboard` page.
Measures used in the visualizations are:

- `Total No. of Tweets = COUNT(TweetProfile[Tweet])`
- `Total No. of Users = COUNT(UserProfile[User])`
- `Total No. of Sources = COUNT(UserProfile[Source])`
- `Average Engagement Rate = DIVIDE([Total No. of Users], [Total No. of Tweets])`
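As a quick check on the last measure: with the roughly `234` users and `680` tweets reported in the Data Analysis section, `DIVIDE` returns 234 / 680 ≈ 0.344, which matches the `34%` average engagement rate shown in the report.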
As shown in the Data Visualization section, it can be deduced that:

- There were `680` tweets for the #NG30DaysOfLearning
- There were about `234` users
- The average engagement rate is `34%`
As shown in the Data Visualization section, it can also be deduced that:

- The most active user of the #NG30DaysOfLearning is theoyinbooke, with 110 tweets
- The most mentioned tool is GitHub
You can interact with the report here: