Wrangling your data means getting your hands dirty

This is a quick post about why you need to adopt a mindset of iterative and interactive data wrangling.

Iterative data wrangling is the idea that you use your analysis output as a source of information to ‘redo’ your wrangling. Interactive data wrangling is the idea that you manually intervene in the process.

Some schools of thought hold that data mining and cleaning should be as automated as possible. A lot of software has been purpose-built for this process in the last couple of years, and because the DS community is quite friendly, there are a few free tools alongside proprietary offerings like the exxy Alteryx and Trifacta.

These include Google’s OpenRefine and Quadient DataCleaner.

The problem with data wrangling software is that it misses things. Things you don’t even know were there until they mess up your output. If you aren’t aware of what your data looks like, you will have a hard time fixing issues, or even diagnosing them. Often the conclusion is ‘oh, I’d better add some gradient boosting, my F1-score is a little low’. Alternatively, just clean your data better.

The other problem with automatic wrangling software is that the schema is defined for you. This means that you may end up over-cleaning your data and therefore overfitting your models.

Iterative data wrangling is a skill, though. The more iterations you do, the more likely you are to spot, in the output, the mess you have made of your data, and clean it up. Kind of like how you only see the stains on your favourite top after you’ve ironed it and put it on. Shoulda spot cleaned.

For those new to the field, it is a process of failing. The more times you do dumb things, the more you learn and don’t do them again.

I’ll give you an example.

I have a set of tweets that use ‘coz’, ’cause’ and ‘cuz’ instead of ‘because’.

I thought these words needed to be normalised. I could probably have just thrown them away, but I was trying to be overly thorough. So I wrote a script to do this.
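The script was along these lines: a straight string replacement, run before any tokenisation. This is a reconstruction, so the function and variable names are mine, not the originals:

```python
# Slang variants of 'because' to normalise (the curly apostrophe in
# '’cause' is how it appears in the tweets).
VARIANTS = ["coz", "cuz", "’cause"]

def normalise(tweet: str) -> str:
    """Replace each variant with 'because' directly on the raw tweet string."""
    for variant in VARIANTS:
        tweet = tweet.replace(variant, "because")
    return tweet
```

Note that `str.replace` works on substrings, not on words, which matters for what comes next.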

I didn’t check my output too often and only really looked at the top 50 words for each of the topics in the output. Later, when I was visualising things, I saw this:

Jabecausezi bebecause becausey

I realised that because I had altered these terms before I had tokenized (that is, I ran the replacement on the raw string rather than on a list of tokens), I had inserted ‘because’ into the middle of any word containing those substrings, rather than replacing whole words with ‘because’. I never picked it up because the mangled words were made rare-ish. Not infrequent enough to be removed, but not frequent enough to turn up in the top 50 words for each topic.

Rare-ish is not a real word btw.

A silly example, sure, but just one of many dumb mistakes I have made and never repeated. It actually took me a few weeks to pick up on this, and it is the perfect example of why you need to iteratively and interactively clean your data. My coherence scores (I was topic modelling at the time) jumped right up, and even though I don’t really pay much attention to them now (because I review my topics qualitatively), my metric-happy collaborators got a lot more relaxed about the results.

Even Mark Wahlberg recommends it.

Marky is a neat freak

As always, retweet this post @data_little for happy cleaning vibes.

Photo by Averie Woodard. Thanks, Averie, for the use of your work on Unsplash.