Saturday 1 March 2014

Building a natural language classifier

The idea behind this post is to build a classifier that would work based any content. For the purpose of this experiment, we chose twitter data.
In any machine learning experiment, we need to have a training set and a test set. I collected a number of tweets during the period October 2013 to February 2014 using twitter stream API. Total 322,382 tweets were collected during this time period. Total 235,100 users participated in this information exchange. Based on specific events on days, I used the following four tokens to filter the tweets on stream API that are 1) India 2) FamilyGuy 3) FastAndFurious 4) Thanksgiving. The data that I collected contains all the original tweets initiated during this time and all the retweets. Since Twitter does not maintain the flow of retweets across the users, all the retweets point to the original tweet rather than intermediate tweets. Because of this even if a particular tweet is missed but retweets are seen, we can always find the original tweet from retweet since original tweet is always fully embedded in the retweet.
Twitter schema allows users to add metadata to their tweets. This metadata manifests itself in the form of hashtags in the tweets. Hashtags provide a primary means of categorization. Since tweets are limited in size of content that they can carry, users can attach URLs that point to a larger content. We want to evaluate the impact of presence or absence of hashtags and URLs in a tweet.
We parse the tweet data and create a table with following columns for each tweet.

  1. Retweets in less than 10 seconds 
  2. RTs in greater than 10 and less than 30 seconds 
  3. RTs in greater than 30 and less than 1 minute 
  4. RTs in great than 1 min and less than 5 mins 
  5. RTs in great than 5 mins and less than 10 mins 
  6. RTs in greater than 10 mins and less than 30 mins 
  7. RTs in greater than 30 mins and less than 1 hour 
  8. RTs in greater than 1 hour 
  9. Number of hashtags 
  10. Number of URLs 
  11. Total retweets 
  12. Category of popularity of retweets denoted by an interval variable ranging from 1 to 7. 1 being the least popular while 7 being most popular

Since we have already seen that there is a large percentage of Twitter users who primarily just retweet and don’t add any content of their own. Also, Twitter does not really preserve the path traveled by a tweet. For all the further analysis, we only take into account tweets that were created by a user. We ignored the instance of retweets by other users since it is not adding any information.
The next step is to find appropriate categories for the tweet. We use the following set of rules to find appropriate categories for the tweets being examined.
  1. Look at all the hashtags in a tweet. All of these are potential categories
  2. For tweets having more than one hashtag, we only categorize it under the hashtag with the highest frequency
  3. If a tweet does not have any hashtag, we classify it under the URL which is attached to it.
Once we have collected all the hashtags and URLs, we need to eliminate similarities. While deciding on the categories of the user, it is important to notice that many times people do mistakes in typing hashtags, use different spelling and also use related words to categorize a tweet. 
Considering all these issues, we propose the following algorithm to come up with relevant categories of tweets.
  1. We measure the distance between two hashtags to identify similar hashtags. We use Jaro Winkler distance as a measure of distance between two hashtags. One of the reasons for choosing Jaro Winkler distance is the fact that it is best suited for short strings which is the case with Hashtags. At the end of step 1), we have similar hashtags clubbed into a single bucket. For example, we can see that the above algorithm identifies "familyguyxxx familyg familyguyproblems familyguyfans" hashtags as a similar one.  
  2. At the second step, we look at groups of hashtags and then look at the hashtags that occur together very frequently and then we merge that group. For example, a group containing “familyguy” and a group containing “briangriffin” occurs together frequently and we merge them in a single group
  3. We eliminate all the hashtags that seem to occur together but the text contained in those tweets has very high distance. It is a common practice among the spammers to ride popular and trending hashtags and start tweeting by including that hashtag.
Thus we end up with the final list of categories that we can use in final analysis. These are our significant categories. The next step is to build a classifier. We have used a dynamic language model based classifier. The Dynamic Language Model based classifiers use multivariate estimators for the category distribution and dynamic language models for the per-category character sequence estimators. The whole data is segregated into training and test set by dividing it into 90% and 10%. The training set is 90% and the test set is 10%.
Total accuracy of the model thus built is 83.64%. This means that out of all the testing cases, the model is able to successfully classify 83.64% of times accurately. 


Important results
Total cases
57240
True Positives
1064
False Negatives
208
False Positives
208
True Negatives
55760

You can find the complete IEEE Paper by clicking on hyperlink.

No comments:

Post a Comment

How GenAI models like ChatGPT will end up polluting knowledge base of the world.

 Around nine years ago, around the PI day, I wrote a blog post about an ancient method of remembering digits of pi. Here is the  link  to bl...