
Utilized Python for data cleaning and analysis of social media usage data, employing libraries like Pandas and Matplotlib. Improved insights for informed decision-making and business strategies.


Clean-and-analyze-social-media-usage-data-with-Python

Twitter Sentiment Analysis


Table of Contents

  • Problem Statement
  • Data Sourcing
  • Data Preparation
  • Data Modeling
  • Data Visualization
  • Data Analysis
  • Insights
  • Shareable Link

Problem Statement

The purpose of this analysis is to gain insight into the engagement the #NG30DaysofLearning hashtag generated on Twitter.

For this study we examined a variety of categories: the number of tweets, the number of users, the most active users, the most mentioned tools, etc.


Data Sourcing

  • The dataset used for this analysis was scraped from Twitter with the snscrape library in a Jupyter notebook
  • A preview of the dataset and the Python code is shown below:
!pip install snscrape
import pandas as pd
import snscrape.modules.twitter as sntwitter

# Search query: both hashtag variants, restricted to the campaign window.
query = "(#30DaysOfLearning OR #NG30DaysOfLearning) until:2022-06-26 since:2022-05-05"
tweets = []
limit = 30000

# TwitterSearchScraper accepts a full search query; TwitterHashtagScraper
# only takes a bare hashtag, so it cannot handle the OR/date operators above.
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    tweets.append([tweet.date, tweet.url, tweet.user.username, tweet.sourceLabel,
                   tweet.user.location, tweet.content, tweet.likeCount,
                   tweet.retweetCount, tweet.quoteCount, tweet.replyCount])

df = pd.DataFrame(tweets, columns=['Date', 'TweetURL', 'User', 'Source', 'Location',
                                   'Tweet', 'Likes_Count', 'Retweet_Count',
                                   'Quote_Count', 'Reply_Count'])

df.to_csv('30DLTweets.csv', index=False)
df.head()
Date TweetURL User Source Location Tweet Likes_Count Retweet_Count Quote_Count Reply_Count
0 2022-06-25 22:51:18+00:00 https://twitter.com/poetrineer/status/15408300... poetrineer Twitter for Android Oyo, Nigeria So as one of my commitment to document my lear... 0 0 0 1
1 2022-06-25 22:44:10+00:00 https://twitter.com/poetrineer/status/15408282... poetrineer Twitter for Android Oyo, Nigeria Finally, here is my updated COVID-19 Data Anal... 1 0 0 0
2 2022-06-25 19:25:58+00:00 https://twitter.com/MichealOjuri/status/154077... MichealOjuri Twitter Web App Oyo, Nigeria #30NGDaysOfLearning\n#30daysoflearning \n#micr... 0 0 0 0
3 2022-06-25 16:44:36+00:00 https://twitter.com/oye__aashu/status/15407377... oye__aashu Twitter for Android Nainital, India Day 4/ #30daysoflearning learned all about arr... 3 0 0 1
4 2022-06-25 12:49:02+00:00 https://twitter.com/hsb_data/status/1540678455... hsb_data Twitter Web App New Jersey Learning about sub queries on @DataCamp (SQL) ... 1 0 0 0
df.describe()
Likes_Count Retweet_Count Quote_Count Reply_Count
count 683.000000 683.000000 683.000000 683.000000
mean 15.780381 3.812592 0.185944 1.166911
std 41.164555 12.665561 0.737938 2.626554
min 0.000000 0.000000 0.000000 0.000000
25% 1.000000 0.000000 0.000000 0.000000
50% 2.000000 0.000000 0.000000 0.000000
75% 8.000000 2.000000 0.000000 1.000000
max 549.000000 248.000000 9.000000 29.000000

The dataset is also available at 30DLTweets


Data Preparation

Data transformation was done in Power Query and the dataset was loaded into Microsoft Power BI Desktop for modeling.

The Twitter sentiment dataset consists of a single table:

  • 30DLTweets, which has 10 columns and 680 rows of observations

The tabulation below shows the 30DLTweets table with its column names and their description:

Column Name     Description
Date            The date and time of the tweet
TweetURL        The URL of the tweet
User            The username of the tweet's author
Source          The device or app the tweet was posted from
Location        The location of the user
Tweet           The content of the tweet
Likes_Count     The number of likes on the tweet
Retweet_Count   The number of retweets of the tweet
Quote_Count     The number of quote tweets of the tweet
Reply_Count     The number of replies to the tweet

Data cleaning for the dataset was done in Power Query as follows:

  • The 30DLTweets table was split into one dimension table and two fact tables:
  1. UserProfile
  2. TweetStats
  3. TweetProfile
  • A calculated column UserID was created in each of the tables using the M-formula UserID = [User] & "_" & [Source]
  • Unnecessary columns were removed from each of the tables
  • Each column in the tables was validated to have the correct data type
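As a rough illustration only (the actual cleaning was done in Power Query), the same split can be sketched in pandas. The column names come from the table description above; the two-row stand-in data and the exact column assignment to each table are assumptions:

```python
import pandas as pd

# Hypothetical two-row stand-in for the scraped 30DLTweets table.
raw = pd.DataFrame({
    "Date": pd.to_datetime(["2022-06-25 22:51:18", "2022-06-25 19:25:58"]),
    "TweetURL": ["https://twitter.com/a/1", "https://twitter.com/b/2"],
    "User": ["poetrineer", "MichealOjuri"],
    "Source": ["Twitter for Android", "Twitter Web App"],
    "Location": ["Oyo, Nigeria", "Oyo, Nigeria"],
    "Tweet": ["tweet one", "tweet two"],
    "Likes_Count": [0, 0],
    "Retweet_Count": [0, 0],
    "Quote_Count": [0, 0],
    "Reply_Count": [1, 0],
})

# Calculated column mirroring the M-formula: UserID = [User] & "_" & [Source]
raw["UserID"] = raw["User"] + "_" + raw["Source"]

# Split into one dimension table and two fact tables, keeping only the
# columns each table needs (the exact column split is illustrative).
user_profile = raw[["UserID", "User", "Source", "Location"]].drop_duplicates()
tweet_stats = raw[["UserID", "Date", "Likes_Count", "Retweet_Count",
                   "Quote_Count", "Reply_Count"]]
tweet_profile = raw[["UserID", "Date", "TweetURL", "Tweet"]]

print(user_profile.shape, tweet_stats.shape, tweet_profile.shape)
```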

To ensure the accuracy of the dates in the Date column of the TweetStats and TweetProfile tables, a date table was created for referencing, using the M-formula:

{Number.From(List.Min(TweetProfile[Date]))..Number.From(List.Max(TweetProfile[Date]))}

Here is a breakdown of what the formula does: it generates a list of serial numbers running from the earliest to the latest date in TweetProfile[Date], so the date table spans exactly the range present in the data: May 9, 2022 - June 25, 2022.

A Day Name column was inserted into the date table and renamed DayOfTheWeek.

The date table was named Calender.
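Outside Power BI, an equivalent date table can be sketched in pandas. This is a minimal illustration, assuming the May 9 to June 25, 2022 range stated above and reusing the report's Calender and DayOfTheWeek names:

```python
import pandas as pd

# One row per calendar day across the range found in the tweet data.
calender = pd.DataFrame({
    "Date": pd.date_range("2022-05-09", "2022-06-25", freq="D")
})

# Equivalent of the inserted Day Name column, renamed DayOfTheWeek.
calender["DayOfTheWeek"] = calender["Date"].dt.day_name()

print(len(calender), calender["DayOfTheWeek"].iloc[0])
```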


Data Modeling

After the dataset was cleaned and transformed, it was ready to be modeled.

  • The Calender table was marked as the official date table in the dataset.
  • A many-to-one (*:1) relationship was created from the TweetStats table to the Calender table using the Date column in each table
  • A many-to-one (*:1) relationship was created from the TweetProfile table to the Calender table using the Date column in each table
  • A one-to-many (1:*) relationship was created from the UserProfile table to the TweetStats table using the UserID column in each table
  • A one-to-many (1:*) relationship was created from the UserProfile table to the TweetProfile table using the UserID column in each table
  • The relationships formed in the data model make up a star schema, shown below:

Data Model
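Not part of the report, but the star-schema joins above can be mimicked in pandas by merging each fact table back to its dimensions; the table and column names follow the model, while the tiny sample rows are hypothetical:

```python
import pandas as pd

# Minimal stand-ins for the dimension and fact tables in the model.
user_profile = pd.DataFrame({
    "UserID": ["poetrineer_Twitter for Android"],
    "User": ["poetrineer"],
})
tweet_stats = pd.DataFrame({
    "UserID": ["poetrineer_Twitter for Android"] * 2,
    "Date": pd.to_datetime(["2022-06-25", "2022-06-25"]),
    "Likes_Count": [0, 1],
})
calender = pd.DataFrame({"Date": pd.date_range("2022-05-09", "2022-06-25")})

# Many-to-one joins: each fact row finds exactly one dimension row,
# as in the star schema's *:1 relationships.
joined = (tweet_stats
          .merge(user_profile, on="UserID", how="left")
          .merge(calender, on="Date", how="left"))
print(joined.shape)
```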


Data Visualization

Data visualization for the dataset was done in three parts using Microsoft Power BI Desktop:

  • Content Analysis: shows the tools by mention, a word cloud, the top active users, etc.
  • Summary: shows the total number of tweets, the total number of users, tweets by day of the week, etc.
  • Dashboard: combines visuals from Content Analysis and Summary to answer the Problem Statement.

Figure 1 shows visualizations from the Content Analysis page

Figure 1
image

Figure 2 shows visualizations from the Summary page

Figure 2
image

Figure 3 shows visualizations from the Dashboard page

Figure 3
image

Data Analysis

Measures used in visualization are:

  • Total No. of Tweets = COUNTA(TweetProfile[Tweet])
  • Total No. of Users = COUNTA(UserProfile[User])
  • Total No. of Sources = COUNTA(UserProfile[Source])
  • Average Engagement Rate = DIVIDE([Total No. of Users], [Total No. of Tweets])

(Note: DAX COUNT only counts numeric or date values, so COUNTA is used here for the text columns; the measure references in DIVIDE are bracketed as DAX requires.)
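The same headline numbers can be approximated in pandas. This is a sketch with a toy four-row stand-in for TweetProfile (the real table has 680 rows); the final line only checks the arithmetic behind the reported engagement rate:

```python
import pandas as pd

# Toy stand-in for TweetProfile.
tweet_profile = pd.DataFrame({
    "User": ["a", "a", "b", "c"],
    "Tweet": ["t1", "t2", "t3", "t4"],
})

total_tweets = int(tweet_profile["Tweet"].count())   # analogue of the tweet-count measure
total_users = int(tweet_profile["User"].nunique())   # distinct users
engagement_rate = total_users / total_tweets         # analogue of DIVIDE(users, tweets)

# Against the reported figures: 234 users / 680 tweets is about 34%.
print(round(234 / 680 * 100))
```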

As shown in the Data Visualization section, it can be deduced that:

  • There were 680 tweets for the #NG30DaysOfLearning
  • There were about 234 users
  • The average engagement rate is about 34%

Insights

From the same visualizations, it can also be deduced that:

  • The most active user of the #NG30DaysOfLearning is theoyinbooke, with 110 tweets
  • The most mentioned tool is GitHub
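A "most active user" ranking like this reduces to a frequency count of the User column. A minimal sketch with a hypothetical three-row sample (in the real data, theoyinbooke leads with 110 tweets):

```python
import pandas as pd

# Hypothetical sample of the User column.
users = pd.Series(["theoyinbooke", "theoyinbooke", "poetrineer"])

# value_counts sorts by frequency; idxmax gives the most frequent user.
counts = users.value_counts()
most_active = counts.idxmax()

print(most_active, counts.max())
```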

Shareable Link

You can interact with the report here:

View Report
