Why your Twitter analysis sucks – and what you can do about it

I’ll start with my favourite word in the English language. CONTEXT CONTEXT CONTEXT!!!

My colleagues, friends, and students are at the point of rolling their eyes when I say it. But it is so important, and ignoring it is the main reason your Twitter analysis sucks.

What do I mean by context? If you are going to datafy something (turn a tweet, a representation of a thought, emotion or idea, into data), then you need to think about the context of a) the user and why they tweeted, b) the dataset you are looking at, c) the problem you are trying to solve by datafying that tweet in the first place, and d) the tools you are using. Why?

BECAUSE NATURAL LANGUAGE PROCESSING NEEDS TO BE BESPOKE AND YOUR PRECONCEIVED ASSUMPTIONS WILL TRIP YOU UP.

NLP is like a game of chess: you need a strategy. This means knowing your next 15+ moves (that is my average number of function executions before I have to rework my pipeline).

Time and time again I see this sort of standardised pipeline for preprocessing as a method to get to the juicy bits (modelling), but if you are using these sorts of pipelines then you are going to produce a subpar analysis. Let me give you an example.

Let’s say you are an analyst working on a political campaign and your boss, the candidate, wants to know how people are feeling about a certain issue they are speaking about. You decide to do some sentiment analysis of a collection of tweets on #auspol and find that the sentiment towards the politician on that issue is moderately positive. Great! Now you can tell them they are doing the right thing and to ramp it up.

So they do, but they lose the election…and you get fired. WTF?

Did you think about the context of your sample? Let’s dissect what went wrong with this example, and explore why some critical and lateral thinking is a necessary ingredient in your analysis.

The English language is a nightmare

Tweets amplify the problematic issues around English, arguably the stupidest language on earth. Tweets are generally utter filth. They are noisy, short and use non-standard conventions. I have a love-hate relationship with them.

To demonstrate where this went wrong, I’ll use the Stanford Sentiment Treebank. Grey circles represent neutral sentiment, orange is negative and blue is positive. This is our original tweet (let’s say it was posted shortly after your candidate made a speech about the issue).

Original tweet

Now for the analysis. First, we make the string lowercase, then we have a few options that are ‘standard’ in the NLP preprocessing pipeline.

We can:

  1. Remove the punctuation and special characters, leaving the word that the hashtag is made of; or
  2. Remove the punctuation and special characters, getting rid of the hashtags completely.
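
As a rough sketch, the two options might look like this in Python. The function names and the sample tweet are made up for illustration (they are not the tweet analysed in this post), and the regular expressions are only one of many ways to strip punctuation.

```python
import re


def keep_hashtag_words(tweet: str) -> str:
    """Option 1: lowercase, strip punctuation, keep the word the hashtag is made of."""
    text = tweet.lower()
    # Replace anything that is not a letter, digit or whitespace with a space.
    # This removes the '#' but leaves the hashtag's word in place.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def drop_hashtags(tweet: str) -> str:
    """Option 2: lowercase, remove hashtags completely, then strip punctuation."""
    text = tweet.lower()
    text = re.sub(r"#\w+", " ", text)  # remove '#' together with its word
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


# Made-up tweet, for illustration only.
tweet = "Great speech today. #idiot"

print(keep_hashtag_words(tweet))  # great speech today idiot
print(drop_hashtags(tweet))       # great speech today
```

The only difference between the two outputs is the hashtag's word, which is exactly the piece that can carry the sarcasm.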

Let’s see what happens when we do this.

Option 1. Remove the punctuation and special characters, leaving the word that the hashtag is made of.

Result = Negative sentiment

Great, the sentence is picked up as negative. No issues. What about option 2?

Option 2. Remove the hashtag completely.


Result = Positive sentiment

Ok, that’s not great. Clearly, this sentence carries negative sentiment, but it’s been picked up as positive. That’s because human emotion is incredibly difficult to datafy. People often use hashtags to convey sarcasm, which is inherently negative in sentiment.

You might be thinking, ‘just leave the hashtags in there and that way I can capture the sarcasm’. Well, yes, maybe. But hashtags aren’t always singular words. Let’s expand that hashtag into #idiotpollies as in ‘idiot politicians’.

Typical hashtag

Result = Positive sentiment

And we are back to positive.

This is a hashtag issue. It’s almost impossible to assign sentiment to hashtags. You can split them, but there are a few problems with this. Functions that split a hashtag into its root terms usually split the string at each capitalised letter they come across, e.g. “#AustraliaDayWeekend” becomes “Australia Day Weekend”. You need to adjust your pre-processing pipeline to make sure you don’t lowercase the text before this is done.
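
Here is a minimal sketch of a capital-letter splitter, to show why the order of operations matters. `split_hashtag` is a hypothetical helper written for this post, not a function from any particular library.

```python
import re


def split_hashtag(tag: str) -> str:
    """Split a hashtag into words by inserting a space before each capital letter."""
    body = tag.lstrip("#")
    # Zero-width match before every capital that is not at the start of the string.
    return re.sub(r"(?<!^)(?=[A-Z])", " ", body)


# Splitting first keeps the word boundaries...
print(split_hashtag("#AustraliaDayWeekend"))          # Australia Day Weekend

# ...but lowercasing first throws them away, so there is nothing left to split on.
print(split_hashtag("#AustraliaDayWeekend".lower()))  # australiadayweekend
```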

Splitting would be fine except:

  • It won’t work when the hashtag has no capitals, e.g. #auspol or #melbourneweather;
  • Some hashtags, once split, lose the meaning they had in their concatenated form;
  • If you are doing another analysis, like topic modelling, you would want to treat them as a single ‘word’, so different analyses need different preprocessing (stop reusing the same preprocessed data for a different model);
  • Non-standard or trending words like #ScoMo get caught in the crossfire. #ScoMo for Australian PM Scott Morrison gets turned into “sco mo” and all meaning is lost. The same goes for the long-running hashtag #auspol.

Let’s look at an example.

#CaptainCook is a trending hashtag in Australia and relates to the debate about Australia Day. Not to get into it, but most tweets like this are sarcastic and negative, and also hilarious.

Now when we normalise the tweet by removing punctuation, splitting the hashtags on their capitals and then putting it into lowercase, we get:

“scott morrison has finally confirmed the dress code for australia day citizenship captain cook sco mo auspol”
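
For the curious, here is a sketch of a normalisation step that produces this string. The input tweet below is my reconstruction from the normalised text (its exact punctuation is an assumption), and `normalise` is an illustrative name rather than a library function.

```python
import re


def normalise(tweet: str) -> str:
    """Split hashtags on capitals, remove punctuation, then lowercase."""

    def _split(match: re.Match) -> str:
        # match.group(1) is the hashtag without its leading '#'.
        return re.sub(r"(?<!^)(?=[A-Z])", " ", match.group(1))

    # 1. Split hashtags while the capital letters are still there.
    text = re.sub(r"#(\w+)", _split, tweet)
    # 2. Remove punctuation and special characters.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # 3. Lowercase last and tidy up the whitespace.
    return " ".join(text.lower().split())


# Reconstructed tweet (an assumption, see above).
tweet = ("Scott Morrison has finally confirmed the dress code for "
         "Australia Day citizenship. #CaptainCook #ScoMo #auspol")

print(normalise(tweet))
```

Note how #ScoMo comes out as “sco mo” while #auspol passes through untouched, simply because it has no capitals to split on.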

And when we look for sentiment:

Result = Positive sentiment

Positive again. #ShockHorror.

As a trending hashtag, #CaptainCook is negative and sarcastic. But the word ‘captain’ on its own is positive. If you were working for the Australian PM, you may be in a bit of trouble.

Now what?

We need to think about what words may be tricking the model into thinking this is a positive sentence. Can you guess?

It’s the word “can’t”. Since we removed the punctuation, getting ready to tokenise, the root word “can” is kept. “Can” is labelled as a positive term, hence the positive sentiment.

At this point, if you are still trying to catch me out you might say we can concatenate “can” with the “t” and use “cant”. Ok, let’s try that.

Result = Neutral sentiment

Nope, that returns a neutral sentiment and doesn’t capture the sentiment of the user at all.

You might also say we can manually label hashtags and split according to a dictionary. Sure, you could. And you should, but remember you will never capture all trends – the Twitterverse is too dynamic, and they change too quickly. I’ll go into this in another post.
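
In the meantime, a dictionary-based splitter could be sketched roughly like this. The lookup table and its expansions are hypothetical, hand-written entries, and anything that is not in the table falls back to splitting on capitals.

```python
import re

# Hypothetical hand-labelled entries; a real table needs constant upkeep.
HASHTAG_LOOKUP = {
    "scomo": "scott morrison",
    "auspol": "australian politics",
}


def expand_hashtag(tag: str) -> str:
    """Use the manual dictionary first, then fall back to splitting on capitals."""
    body = tag.lstrip("#")
    known = HASHTAG_LOOKUP.get(body.lower())
    if known is not None:
        return known
    return re.sub(r"(?<!^)(?=[A-Z])", " ", body).lower()


print(expand_hashtag("#ScoMo"))             # scott morrison
print(expand_hashtag("#CaptainCook"))       # captain cook
print(expand_hashtag("#melbourneweather"))  # melbourneweather (not in the table, no capitals)
```

The last line is the catch: a hashtag that is neither labelled nor capitalised still slips through unsplit.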

The answer

Contraction expansion.

Dismiss your flashbacks to year 6 English and stay with me here. By expanding the contraction to “cannot”, which will be separated to “can not”, we save the sentiment…and your job.

Result = Negative sentiment

Finally! “Not” supersedes the positive “can”.
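
A minimal sketch of contraction expansion, assuming a tiny hand-written map (a real pipeline would need a much fuller list). The helper names and the sample sentence are made up for illustration, and I map straight to “can not” to skip the intermediate “cannot” step.

```python
import re

# Small illustrative map; extend it to suit your own data.
CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "don't": "do not",
    "isn't": "is not",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace known contractions with their expanded form."""
    text = text.replace("’", "'")  # treat curly and straight apostrophes alike
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1).lower()], text)


def strip_punctuation(text: str) -> str:
    """Lowercase and replace punctuation with spaces, ready for tokenising."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


sentence = "I can't wait for the speech"  # made-up example

# Without expansion the apostrophe goes and only "can" survives as a word.
print(strip_punctuation(sentence))                       # i can t wait for the speech

# Expanding first keeps the "not" that carries the negation.
print(strip_punctuation(expand_contractions(sentence)))  # i can not wait for the speech
```

The expansion has to run before punctuation is removed, otherwise the apostrophe it relies on is already gone.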

The moral of the story is you need to know your problem, the context of your data, what it means for the pre-processing of your data, and what you may be tripped up on.

I hope this helps you think about new ways to improve your work, and hopefully not get fired!

As always, please share on Twitter and tag us as @data_little for happy coding binges.

Feature photo by JESHOOTS.COM on Unsplash

Thanks to Ed Farrel for his wonderful editing of yet another helpful post from The Little Data Scientist.