As more and more social scientists employ algorithms to "code" or "annotate" large datasets, the question of which algorithm to use has become a bit of a debate. Some people use out-of-the-box lexicons, and that's fine. But if you have the time to annotate some data yourself and want to actually build a model, you have to choose an algorithm. So which one works best when you're working with Tweets?

Tweets, by the very nature of the character limit, are sparse: each one offers only a meager amount of useful content. They almost always contain a link, stop words, and Twitter jargon that are generally useless to me without more advanced language-processing toolkits. Tweet datasets are also quite large: with 400+ million Tweets sent per day, they're not hard to come by. So what we have is a lot of barely useful things.

After reading some of the documentation for the algorithms included in Weka, I found that LibLinear has received a lot of praise for its accuracy on large, sparse datasets. It's also very quick to train. Sure enough, on my data I saw an immediate 5% accuracy boost in cross-validation. After playing with some of the algorithm's parameters, I gained another 1.5% by changing the cost parameter from 1 to 3. This change affects training time, but not significantly. For due diligence, I went back and tested all of the non-nominal classifiers in LightSIDE to see if any could beat LibLinear. None came close. Interestingly enough, the Bayesian models appear to be the second-best option.
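For readers who want to try this outside of Weka or LightSIDE, here is a minimal sketch of the same idea using scikit-learn, whose `LinearSVC` is backed by the LIBLINEAR solver. The toy Tweets and labels below are placeholders, not my data; a real study would use thousands of hand-coded Tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC  # uses the LIBLINEAR solver internally
from sklearn.model_selection import cross_val_score

# Toy stand-in data for illustration only.
tweets = ["great game tonight", "terrible refs ruined it",
          "love this team", "worst loss of the season",
          "amazing comeback win", "awful performance again"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Sparse document-term matrix, the kind of input LIBLINEAR handles well.
X = TfidfVectorizer().fit_transform(tweets)

# Compare the cost parameter C at its default (1) and the value I settled on (3).
for C in (1, 3):
    scores = cross_val_score(LinearSVC(C=C), X, labels, cv=3)
    print(f"C={C}: mean CV accuracy = {scores.mean():.2f}")
```

The cost parameter `C` controls how heavily misclassified training examples are penalized; raising it fits the training data more tightly, which on my dataset bought the extra 1.5% mentioned above, though the right value will vary by dataset.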

Here’s the initial paper on LibLinear, in case anyone finds it helpful.

November 27th, 2012

LibLinear Algorithm & Twitter