An influencer is someone who can shape how a product or service is perceived by sharing an opinion about it. In doing so, they can influence friends, peers, or followers on a particular platform such as Twitter, Facebook, or Instagram. They can also partner with popular brands across the globe and promote their products.
So what about predicting the influence of any user on a particular platform? Interesting, right?
In this blog post we will build such a model and use it to predict an influencer score for users.
Let’s do it for Twitter, in the following steps:
- Deciding the features: choosing the features to extract for building the model, e.g. followers_count, friends_count, listed_count.
- Collecting Twitter ids: fetching the Twitter ids of all kinds of users, both influencers and normal users, and ranking them according to our criteria.
- Feature extraction: extracting the features of the collected users.
- Building the model: building a linear regression model from the extracted features.
- Prediction: predicting the influencer score of a Twitter user with the model.
Deciding the features
How can we say that a user is an influencer? They must show certain characteristics, such as stimulating discussion, proposing ideas, and following up on activities.
These characteristics can be translated into numeric form using the user’s metadata and the content the user publishes.
So we’ll identify the parameters that affect a user’s influencer score. They may vary from platform to platform, since the available metadata depends on the platform.
For any given Twitter user, we can collect and compute the following features.
- total followers (followers_count)
- total following (friends_count)
- total listings (listed_count)
- total favorites (favourites_count)
- total tweets by user (statuses_count)
- hasURL: whether a URL is present in the profile (true or false)
- mention_by_others: how many times user ‘A’ is mentioned in tweets from other authors
- retweet_ratio: number of times a tweet by author ‘A’ is retweeted / total original tweets in DB
- liked_ratio: number of times a tweet by author ‘A’ is liked by other Twitter users
- orig_content_ratio: (total tweets in DB - retweets by author ‘A’) / total tweets in DB
- hashtag_ratio: total tweets from the author that contain one or more hashtags / total tweets in DB
- urls_ratio: total tweets from the author that contain one or more URLs / total tweets in DB
- symbols_ratio: total tweets from the author that contain one or more symbols / total tweets in DB
- mentions_ratio: total tweets from the author that contain one or more @mentions / total tweets in DB
- reputation: a social score that depends only on followers, following, and number of tweets. We compute it as log (total followers \* total followers \* status count / total following)
- retweet_hindex (similar to citation [h-index][hindex link])
- like_hindex (similar to citation [h-index][hindex link])
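To make a few of these definitions concrete, here is a minimal sketch of how the h-index features, the reputation score, and the ratio features could be computed. The tweet records, their field names, and the helper names are made up for illustration. The natural logarithm and the handling of zero counts are assumptions, since the definition above does not specify them.

```python
import math

def h_index(counts):
    """Largest h such that at least h tweets have a count of at least h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def reputation(followers, following, statuses):
    """log(followers * followers * statuses / following)."""
    # Assumption: if any term is zero the score is 0, so that the
    # division and the logarithm are always defined.
    if followers <= 0 or following <= 0 or statuses <= 0:
        return 0.0
    return math.log(followers * followers * statuses / following)

def ratio(tweets, predicate):
    """Fraction of tweets for which predicate(tweet) is true."""
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if predicate(t)) / len(tweets)

# Hypothetical, simplified tweet records for one author.
tweets = [
    {"retweets": 10, "likes": 4, "hashtags": ["ml"], "is_retweet": False},
    {"retweets": 3, "likes": 9, "hashtags": [], "is_retweet": False},
    {"retweets": 1, "likes": 0, "hashtags": ["ai"], "is_retweet": True},
    {"retweets": 5, "likes": 2, "hashtags": [], "is_retweet": False},
]

retweet_hindex = h_index([t["retweets"] for t in tweets])
like_hindex = h_index([t["likes"] for t in tweets])
hashtag_ratio = ratio(tweets, lambda t: len(t["hashtags"]) > 0)
orig_content_ratio = ratio(tweets, lambda t: not t["is_retweet"])

print(retweet_hindex, like_hindex, hashtag_ratio, orig_content_ratio)
print(reputation(followers=1000, following=100, statuses=10))
```

The other ratio features (urls_ratio, symbols_ratio, mentions_ratio) would follow the same pattern with a different predicate.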
Collecting Twitter ids
In order to extract posts and features, we need the Twitter id of each user. For this article I have manually collected nearly 2000 Twitter ids of normal users and 1000 of influencers from here.
After collecting these ids, we need to rank the users manually by assigning each one a score.
Feature Extraction
Using the collected Twitter ids, we need to extract the features of each user’s tweets. We will use the Python library tweepy (https://github.com/tweepy/tweepy) for this purpose. We can install tweepy by running
pip install tweepy
In order to use tweepy, we need the tokens associated with a Twitter developer account. To create a Twitter developer account, follow this link
Log in to your Twitter account → Create an application → Fill out the form → Manage keys and access tokens → Copy the access tokens and keys into the code below.
import tweepy
import sys, os, csv

# Use the tokens from your developer account
CONSUMER_KEY = 'consumer_key'
CONSUMER_SECRET = 'consumer_secret'
OAUTH_TOKEN = 'oauth_token'
OAUTH_TOKEN_SECRET = 'oauth_token_secret'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.secure = True
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
# wait_on_rate_limit makes tweepy sleep until the rate limit resets.
# wait_on_rate_limit_notify is a tweepy 3.x argument; later releases
# may not accept it, in which case it can simply be dropped.
tweepyApi = tweepy.API(auth, parser=tweepy.parsers.JSONParser(),
                       wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
Next we need to extract the tweets; for that, please use these scripts.
Extracting all the features will take several hours, because the Twitter APIs are rate limited. At the end we get a CSV file containing all the extracted features.
The CSV file has one row per user: the id and screen_name columns, one column per extracted feature, and the manually assigned score.
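As a rough illustration of that last step, the sketch below turns user records into CSV rows with the standard csv module. It is not the extraction script referred to above: the helper names and the sample records are hypothetical, and only the profile-level features are covered. An in-memory buffer stands in for the output file.

```python
import csv
import io

# Profile-level counters read straight from the user record.
PROFILE_FIELDS = [
    "followers_count", "friends_count", "listed_count",
    "favourites_count", "statuses_count",
]

def profile_features(user):
    """Pick the profile-level features out of a user record (a plain dict)."""
    row = {"id": user["id"], "screen_name": user["screen_name"]}
    for field in PROFILE_FIELDS:
        row[field] = user.get(field, 0)
    # hasURL: 1 if the profile carries a URL, else 0
    row["hasURL"] = 1 if user.get("url") else 0
    return row

def write_feature_rows(rows, out_file):
    """Write a header line followed by one CSV row per user."""
    writer = csv.DictWriter(out_file, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Made-up user records standing in for API responses.
users = [
    {"id": 1, "screen_name": "user_a", "followers_count": 1200,
     "friends_count": 300, "listed_count": 15, "favourites_count": 80,
     "statuses_count": 950, "url": "https://example.com"},
    {"id": 2, "screen_name": "user_b", "followers_count": 40,
     "friends_count": 500, "listed_count": 0, "favourites_count": 12,
     "statuses_count": 60, "url": None},
]

buffer = io.StringIO()  # a real run would open a file with newline="" instead
write_feature_rows([profile_features(u) for u in users], buffer)
print(buffer.getvalue())
```

The tweet-level features and the score column would be added to each row in the same way before writing.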
Building the Model
Let’s look at how each feature correlates with the influence score. Most of our features show some correlation with it, so we don’t need to drop any of them.
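# features is the pandas DataFrame loaded from the CSV file of extracted features
corr = features.select_dtypes(include='number').corr()
print("Correlation of features with Influence score \n")
print(corr['score'])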
To train and evaluate a linear regression model for the influence score, we split the data set into a training set and a test set in a 3:2 ratio.
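import numpy as np

def make_train_test_set(df, train_test_split_prct, clipping_quantile):
    # Random split: each row lands in the training set with probability
    # train_test_split_prct (0.6 gives roughly a 3:2 split)
    msk = np.random.rand(len(df)) < train_test_split_prct
    train_df = df[msk].copy()
    test_df = df[~msk].copy()

    # Columns to normalize: everything except the identifiers
    fet_list = [x for x in list(df) if x not in ['id', 'screen_name']]
    # Per-column clipping threshold, computed on the training set only
    thres = train_df[fet_list].quantile(clipping_quantile)
    # Avoid dividing by zero for columns whose threshold is 0
    thres = thres.replace(0, 1)
    for col in fet_list:
        # Scale by the threshold, then clip to the 0-1 range
        train_df[col] = (train_df[col] / thres[col]).clip(0, 1)
        test_df[col] = (test_df[col] / thres[col]).clip(0, 1)

    cols = [col for col in list(df) if col not in ['score', 'screen_name', 'id']]
    y_train = train_df['score'].values
    y_test = test_df['score'].values
    X_train = train_df[cols].values
    X_test = test_df[cols].values
    return X_train, X_test, y_train, y_test, thres, fet_list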
Also, most of the features have no upper bound and can take outlier values. For example, a Twitter bot may follow a huge number of accounts while having very few followers. It is therefore better to normalize the data with an upper bound on each feature. Let’s use the 95th-percentile value of each feature as that bound: divide each value by it and clip the result to the 0–1 range (a min-max normalization with the minimum taken as 0).
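A small worked example may help show what this clipping does to an outlier. The sketch below uses only the standard library. The quantile helper interpolates linearly between neighbouring sorted values, which is an assumption about how the percentile is taken, and the follower counts are made up.

```python
def quantile(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

def clip_and_scale(values, threshold):
    """Divide by the threshold and clip the result to the 0-1 range."""
    return [max(0.0, min(1.0, v / threshold)) for v in values]

# Twenty ordinary accounts plus one extreme outlier (made-up numbers).
followers = [10 * i for i in range(1, 21)] + [1_000_000]

threshold = quantile(followers, 0.95)
scaled = clip_and_scale(followers, threshold)

print("threshold:", threshold)
print("smallest:", scaled[0], "outlier:", scaled[-1])
```

Here the threshold comes out at about 200, so the account with a million followers is capped at 1 instead of squeezing every other account towards 0.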
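import json
import pandas as pd
from sklearn import linear_model

# Settings described above: 3:2 train/test split, 95th-percentile clipping.
# features is the DataFrame of extracted features.
TRAIN_FRACTION = 0.6
CLIP_QUANTILE = 0.95

def get_mean_clip_value():
    # Average the clipping thresholds over repeated random splits
    thres_list = []
    for run in range(1, 50):
        X_train, X_test, y_train, y_test, thres, fet_list = make_train_test_set(
            features, TRAIN_FRACTION, CLIP_QUANTILE)
        thres_list.append(thres)
    thres_df = pd.DataFrame(thres_list)
    return thres_df.mean()

def LinearRegressionModel():
    X_train, X_test, y_train, y_test, thres, fet_list = make_train_test_set(
        features, TRAIN_FRACTION, CLIP_QUANTILE)
    regr = linear_model.LinearRegression()
    regr.fit(X_train, y_train)
    # Keep only the input features, in the same order as the columns of X_train
    fet_list = [fet for fet in fet_list if fet != 'score']
    return regr, fet_list, thres

def SerializeLinearRegressionModel(model_path):
    model, fet_list, clip_value = LinearRegressionModel()
    clip_value = get_mean_clip_value()
    jsonObj = {}
    modelDict = {}
    normalizationDict = {}
    # Convert NumPy values to plain floats so that json can serialize them
    coeffDict = dict(zip(fet_list, [float(c) for c in model.coef_]))
    modelDict['coeff'] = coeffDict
    modelDict['intercept'] = float(model.intercept_)
    for fet in fet_list:
        normalizationDict[fet] = float(clip_value[fet])
    jsonObj['LRModel'] = modelDict
    jsonObj['clippingValue'] = normalizationDict
    jsonStr = json.dumps(jsonObj, indent=4, sort_keys=True)
    # json.dumps returns a str, so the file is opened in text mode
    with open(model_path, 'w') as f:
        f.write(jsonStr)

def DeserializeLinearRegressionModel(model_path):
    with open(model_path) as f:
        model = json.loads(f.read())
    return model

model_path = 'LRModelTwitterInfluencer.json'
# blob_service, mycontainer and LRModelTwitterInfleuncerBlob are not defined in
# this post; they refer to a blob storage client configured elsewhere.
blob_service.get_blob_to_path(mycontainer, LRModelTwitterInfleuncerBlob, model_path)
TwitterInfluencerModel = DeserializeLinearRegressionModel(model_path)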
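The post sets aside a test set but does not show how it is used. One possible way to check the fitted model is to compare its predictions on the test rows with the manually assigned scores, for example through the mean squared error and the coefficient of determination (R²). The sketch below computes both with plain Python. The score values are made up, and choosing these two metrics is an assumption, not something the original workflow prescribes.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Assumption: report 0.0 when all true scores are identical
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0

# Made-up normalized scores for four test users.
y_test = [0.1, 0.4, 0.8, 0.9]
y_pred = [0.2, 0.35, 0.7, 0.95]

mse = mean_squared_error(y_test, y_pred)
r2 = r_squared(y_test, y_pred)
print("MSE: {0:.5f}  R^2: {1:.3f}".format(mse, r2))
```

With the scikit-learn model from this section, y_pred would come from regr.predict(X_test) and y_test from the same split.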
Prediction
For any new twitter user, we follow a similar process to predict the influence score.
1. Extract several tweets of the user using tweepy
2. Compute the raw features of the Twitter account using the Twitter API
3. Normalize the features
4. Predict the influence score using our model
def normalize_features(fet_vec, model):
    normalize_fet = {}
    clip_dict = model["clippingValue"]
    for fet in fet_vec.keys():
        if fet in clip_dict.keys():
            # Scale by the stored clipping value and clip to the 0-1 range
            normalize_fet[fet] = max(0, min(1, float(fet_vec[fet])/float(clip_dict[fet])))
    return normalize_fet

def compute_influence_score(feature_vec, model):
    score = 0
    coeff_dict = model["LRModel"]["coeff"]
    intercept = float(model["LRModel"]["intercept"])
    for fet in feature_vec.keys():
        if fet in coeff_dict.keys():
            score = score + float(feature_vec[fet]) * float(coeff_dict[fet])
    score += intercept
    return max(min(1, score), 0)

def predict_influence_score(twitter_usernames, predictionModel):
    for screen_name in twitter_usernames:
        # get_all_tweets and compute_features come from the extraction scripts
        tweets, mentions = get_all_tweets(screen_name)
        features = compute_features(tweets, mentions)
        normalize_feature = normalize_features(features, predictionModel)
        influence_score = compute_influence_score(normalize_feature, predictionModel)
        print("\nInfluence Score of {0}: {1:.3f}\n".format(screen_name, influence_score))
Provide one or more usernames to compute their influence score
twitter_user_names = ['imVkohli', 'StephenCurry30']
predict_influence_score(twitter_user_names, TwitterInfluencerModel)
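To see the arithmetic without calling the Twitter API, here is a small self-contained sketch that applies a model dictionary with the same layout ("LRModel" and "clippingValue") to a hand-made feature vector. The coefficients, clipping values, and raw features are invented for illustration and do not come from a trained model.

```python
def normalize(raw_features, clip_values):
    """Divide each known feature by its clipping value and clip to 0-1."""
    return {
        name: max(0.0, min(1.0, float(value) / float(clip_values[name])))
        for name, value in raw_features.items()
        if name in clip_values
    }

def influence_score(raw_features, model):
    """Intercept plus the weighted sum of normalized features, clipped to 0-1."""
    normalized = normalize(raw_features, model["clippingValue"])
    coeff = model["LRModel"]["coeff"]
    score = float(model["LRModel"]["intercept"])
    for name, value in normalized.items():
        if name in coeff:
            score += value * float(coeff[name])
    return max(0.0, min(1.0, score))

# Invented model with two features only.
toy_model = {
    "LRModel": {
        "coeff": {"followers_count": 0.6, "retweet_hindex": 0.3},
        "intercept": 0.05,
    },
    "clippingValue": {"followers_count": 10000, "retweet_hindex": 20},
}

# Invented raw features: retweet_hindex exceeds its clipping value.
raw = {"followers_count": 2500, "retweet_hindex": 40}

score = influence_score(raw, toy_model)
print("Influence score: {0:.3f}".format(score))
```

Here followers_count normalizes to 0.25 and retweet_hindex is capped at 1, so the score is 0.05 + 0.6 × 0.25 + 0.3 × 1 = 0.5.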
Conclusion
In this post, we learnt one possible approach to assigning an influence score to any Twitter account. The same approach can be applied to other social platforms as well.