As more and more social scientists employ algorithms to try to “code” or “annotate” large datasets, the question of which algorithm to use has become a bit of a debate. Some people use out-of-the-box lexicons, and that’s fine. But if you’ve got the time to annotate some data yourself, and are looking to actually build a model, you have to choose an algorithm. So which one works best when you’re trying to work with Tweets?
Tweets, by the very nature of the character limit, are sparse. That is to say, they are meager in the amount of good content they provide. They almost always contain a link, stop words, and Twitter jargon, all of which are generally useless to me without more advanced language processing toolkits. Tweet datasets are also quite large: with 400+ million Tweets sent a day, they’re not hard to come by. So what we have is a lot of barely useful things.
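To make the sparsity point concrete, here is a minimal sketch of how little is left of a Tweet once the link, the mention, and the stop words are stripped. The stop-word list, the regular expressions, and the sample Tweet are all made up for illustration; this is not the preprocessing I actually ran.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on",
              "for", "at", "this", "so", "i", "my", "it"}

URL_RE = re.compile(r"https?://\S+")   # links
MENTION_RE = re.compile(r"@\w+")       # @username mentions


def content_tokens(tweet):
    """Return the lowercase tokens left after dropping links, mentions, 'RT', and stop words."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t != "rt"]


# Invented example Tweet, not taken from my data.
tweet = "RT @someone: This is so good! http://t.co/abc123 #winning"
tokens = content_tokens(tweet)
print(tokens)  # ['good', 'winning']
```

Two usable tokens out of a whole Tweet is typical of what I mean by "sparse": each document touches only a handful of the features in the vocabulary.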
After reading some of the documentation for the algorithms included in Weka, I found that LibLinear has received a lot of praise for its accuracy with large, sparse datasets. It’s also very quick when it comes to training. Sure enough, on my data, I immediately saw a 5% accuracy boost under cross-validation (CV). After playing with some of the algorithm’s parameters, I gained another 1.5% by changing the cost parameter from 1 to 3. This affects performance, but not significantly. For due diligence, I went back and tested all of the non-nominal classifiers in LightSIDE to see if any were able to beat LibLinear. None came close. Interestingly enough, it appears the Bayesian models would be the second-best option.
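If you would rather try the same kind of comparison outside of Weka and LightSIDE, here is a rough sketch in Python. It swaps in scikit-learn's `LinearSVC`, which is built on the liblinear library, and it is not the setup I used. The toy Tweets, labels, helper name `cv_accuracy`, and fold count are all invented for illustration, so the numbers it prints say nothing about real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 4 "positive" and 4 "negative" Tweets (not my dataset).
texts = [
    "love this phone so much",
    "great game tonight love it",
    "best coffee ever great morning",
    "so happy with the new update",
    "hate this traffic so much",
    "worst service ever never again",
    "terrible game tonight so sad",
    "so angry about the new update",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]


def cv_accuracy(cost, texts, labels, folds=4):
    """Mean cross-validated accuracy of a bag-of-words linear SVM with the given cost (C)."""
    model = make_pipeline(
        CountVectorizer(),                    # sparse word-count features
        LinearSVC(C=cost, random_state=0),    # liblinear-backed linear classifier
    )
    scores = cross_val_score(model, texts, labels, cv=folds, scoring="accuracy")
    return scores.mean()


# Compare the default cost of 1 against a cost of 3, as in the text above.
for cost in (1.0, 3.0):
    print(f"C={cost}: mean CV accuracy = {cv_accuracy(cost, texts, labels):.3f}")
```

The vectorizer sits inside the pipeline so that the vocabulary is rebuilt on each training fold, which keeps the held-out fold from leaking into the features.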