Features
We engineer three classes of features: classic features, semantic features, and topic-based features.
The Twitter user features are largely derived from botornot, with the exception of a few features that are not easily available through the Twitter API. We list and thoroughly describe the features below.
For each of the following text features we derive a set of numerical features (refer to the function getTextFeatures in birdspotter/utils.py):
- name (the user's name)
- location (the user's location, as set by the user)
- description (the user's description, as set by the user)
- status_text (the status text, where a status can be an original tweet or a retweet)[^1]
For each text feature {text}, the derived numerical features are listed below (an illustrative sketch of their computation follows the list):
- {text}_n_chars (the number of characters in the text)
- {text}_n_commas (the number of commas in the text)
- {text}_n_digits (the number of numerical digits in the text)
- {text}_n_exclaims (the number of exclamation points (!) in the text)
- {text}_n_extraspace (the number of extra spaces in the text)
- {text}_n_hashtags (the number of hashtags in the text)
- {text}_n_lowers (the number of lower-case characters in the text)
- {text}_n_mentions (the number of @mentions in the text)
- {text}_n_periods (the number of full stops in the text)
- {text}_n_urls (the number of URLs in the text)[^2]
- {text}_n_words (the number of punctuation-delimited words in the text)
- {text}_n_caps (the number of upper-case characters in the text)
- {text}_n_nonasciis (the number of non-ASCII characters in the text)
- {text}_n_puncts (the number of punctuation characters in the text)
- {text}_n_charsperword (the average number of characters per word in the text)
- {text}_n_lowersp (the lower-case characters as a fraction of all characters in the text)
- {text}_n_capsp (the upper-case characters as a fraction of all characters in the text)
[^1]: The numerical features are averaged across all of a user's tweets.
[^2]: The name features do not include name_n_urls.
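The exact implementation is in getTextFeatures in birdspotter/utils.py; the snippet below is only a minimal sketch of how such counts can be computed with regular expressions. The helper name text_counts and the specific regexes are illustrative assumptions, not the package's actual code.

```python
import re
import string

def text_counts(prefix, text):
    """Illustrative sketch of the per-text numerical features (not birdspotter's actual code)."""
    text = text or ""
    # "Words" here are runs of characters delimited by whitespace or punctuation.
    words = re.findall(r"[^\s" + re.escape(string.punctuation) + r"]+", text)
    n_chars = len(text)
    feats = {
        f"{prefix}_n_chars": n_chars,
        f"{prefix}_n_commas": text.count(","),
        f"{prefix}_n_digits": sum(c.isdigit() for c in text),
        f"{prefix}_n_exclaims": text.count("!"),
        f"{prefix}_n_extraspace": len(re.findall(r"\s{2,}", text)),
        f"{prefix}_n_hashtags": len(re.findall(r"#\w+", text)),
        f"{prefix}_n_lowers": sum(c.islower() for c in text),
        f"{prefix}_n_mentions": len(re.findall(r"@\w+", text)),
        f"{prefix}_n_periods": text.count("."),
        f"{prefix}_n_urls": len(re.findall(r"https?://\S+", text)),
        f"{prefix}_n_words": len(words),
        f"{prefix}_n_caps": sum(c.isupper() for c in text),
        f"{prefix}_n_nonasciis": sum(ord(c) > 127 for c in text),
        f"{prefix}_n_puncts": sum(c in string.punctuation for c in text),
    }
    # Ratios and averages, guarded against empty texts.
    feats[f"{prefix}_n_charsperword"] = sum(len(w) for w in words) / max(len(words), 1)
    feats[f"{prefix}_n_lowersp"] = feats[f"{prefix}_n_lowers"] / max(n_chars, 1)
    feats[f"{prefix}_n_capsp"] = feats[f"{prefix}_n_caps"] / max(n_chars, 1)
    return feats
```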
The following features are not based on text (a sketch of some of the derived features follows the list)[^3]:
- statuses_count (the number of tweets authored by the user over the account's lifetime)
- listed_count (the number of public lists of which the user is a member)
- friends_count (the number of users this user is following)
- verified (a boolean denoting whether this user's account has been verified by Twitter)
- ff_ratio (the ratio of followees to followers of the user)
- years_on_twitter (the number of years the user's account has been on Twitter)
- statuses_rate (the average number of tweets authored per year over the account's lifetime)
- tweets_to_followers (the ratio of tweets authored over the account's lifetime to the number of followers)
- retweet_count (the average number of retweets a tweet/retweet had at the point of authorship/retweet)
- favorite_count (the average number of favourites a tweet/retweet had garnered before authorship/retweet)
- favourites_count (the number of tweets this user has liked over the account's lifetime)
- n_tweets (the total number of tweets within the dataset that the user authored)
- n_retweets (the total number of retweets within the dataset by the user)
- n_quotes (the total number of quote tweets within the dataset by the user)
- n_timeofday (the average hour of the weekday at which this user's tweets/retweets were generated)
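As a rough illustration, some of the derived account-level features above can be computed from a v1.1 Twitter user JSON object as sketched below. Field names follow the Twitter API; the helper name derived_user_features and the exact formulas are assumptions, and birdspotter's implementation may differ.

```python
from datetime import datetime, timezone

def derived_user_features(user):
    """Sketch of some derived (non-text) user features from a v1.1 user JSON object."""
    followers = user.get("followers_count", 0)
    friends = user.get("friends_count", 0)
    statuses = user.get("statuses_count", 0)
    # Twitter v1.1 created_at format, e.g. "Wed Oct 10 20:19:24 +0000 2018".
    created = datetime.strptime(user["created_at"], "%a %b %d %H:%M:%S %z %Y")
    years = max((datetime.now(timezone.utc) - created).days / 365.25, 1e-9)
    return {
        "ff_ratio": friends / max(followers, 1),
        "years_on_twitter": years,
        "statuses_rate": statuses / years,
        "tweets_to_followers": statuses / max(followers, 1),
    }
```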
These features are booleans relating to the source of the tweet/retweet (see the sketch after the footnote below):
['google', 'ifttt', 'facebook', 'ipad', 'lite', 'hootsuite', 'android', 'webclient', 'iphone']
[^3]: Many of these are accessed or derived directly from the user or tweet JSON
object, as described by the Twitter documentation.
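These booleans can be derived from the tweet's source field, which in the v1.1 API is an HTML anchor string naming the client (e.g. "Twitter for iPhone"). The following is a minimal sketch producing one boolean per listed keyword; the matching logic is an assumption, not the package's exact code.

```python
SOURCE_KEYWORDS = ['google', 'ifttt', 'facebook', 'ipad', 'lite',
                   'hootsuite', 'android', 'webclient', 'iphone']

def source_flags(tweet):
    """Sketch: one boolean per source keyword, matched against the tweet's 'source' string."""
    source = (tweet.get("source") or "").lower().replace(" ", "")
    return {kw: kw in source for kw in SOURCE_KEYWORDS}
```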
We use the 300-dimensional FastText word embeddings trained on Wikipedia and news data to generate an appropriate semantic representation of all of a user's tweets and of the user's description.
To generate a description's embedding we average the embeddings of the words within the description.
To generate a single tweet's embedding we average the embeddings of the words within the tweet.
To aggregate the embeddings of all tweets, we average the tweet embeddings across all tweets, resulting in a 300d vector representing a user's tweets.
This results in the semantic features con_w2v{i} and des_w2v{i} for each user, where i ranges over the 300 embedding dimensions.
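A minimal sketch of this averaging is shown below, assuming the fastText wiki-news-300d-1M.vec vectors loaded via gensim and simple whitespace tokenization; both are assumptions, and birdspotter's own preprocessing may differ.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed setup: fastText "wiki-news-300d-1M.vec" vectors in word2vec text format.
wv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

def embed_text(text):
    """Average the 300-d embeddings of the in-vocabulary words of a single text."""
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def embed_user_tweets(tweets):
    """Average the per-tweet embeddings into one 300-d vector per user."""
    return np.mean([embed_text(t) for t in tweets], axis=0)
```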
We calculate the TF-IDF of hashtags, where we treat the concatenation of all of a user's tweets as a document and hashtags as terms. We only keep the TF-IDF of the top 1,000 hashtags, and as such have a 1,000-dimensional topic representation of each user.
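This can be approximated with scikit-learn's TfidfVectorizer, as in the sketch below. It assumes a hypothetical tweets_by_user mapping from user ID to that user's tweet texts; birdspotter's exact tokenization and weighting may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: one "document" per user = all of their tweets concatenated;
# terms are hashtags only, capped at the 1,000 most frequent.
user_docs = {user: " ".join(tweets) for user, tweets in tweets_by_user.items()}

vectorizer = TfidfVectorizer(
    token_pattern=r"#\w+",   # treat hashtags as the only terms
    max_features=1000,       # keep the top 1,000 hashtags
    lowercase=True,
)
topic_matrix = vectorizer.fit_transform(user_docs.values())  # shape: (n_users, <=1000)
```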