Clever kids are all about Machine Learning

This week the LDS editor, Caitie, received a request from James, a high school student in WA, asking a few questions about Machine Learning (ML). It was passed to them from their supervisor. The questions were incredibly well formed, and our editor wanted to share the responses to the student, wistfully wishing their own students would ask such good questions.

BTW, this kid is 17!

Briefly, what area(s) of ML do you work with?

My work involves designing and optimising novel research methods which make use of computational text mining. I look at Natural Language Processing (NLP) pipelines and a technique called topic modelling to conduct research on social media data and discover insights about different events and conversations happening across the world. I blend these insights with known facts and other datasets to answer particular research questions that traditional methods can’t explain.

Are you currently working on a project involving ML? If so, could you explain what it is, your goals, its purpose, etc.?

In my current project, I am taking real-world social media data sets (Twitter) and using a topic model called MetaLDA, built here by our Machine Learning Group, to determine what sorts of discourses are present across 10 years of chatter about the ‘right to be forgotten’.

The right to be forgotten is a new data protection right that’s now built into the General Data Protection Regulation (GDPR) and applies to any company that processes data from EU citizens. That’s pretty much most companies in Australia.

The challenge of Twitter data sets is that they are huge (1 million+ tweets) and what we call sparse and noisy, which I will explain next. Topic models are sensitive to these problems, and so we need to use a sophisticated set of Machine Learning techniques to overcome these issues. The problem is that our ML algorithms don’t fix these problems by themselves. So, we have to look at other aspects of the ML pipeline including:

  • What data we collect
  • How we clean or wrangle it
  • How we process it, and
  • How we optimise the algorithm.

We need to get this right in order to:

  • Find out the topics people are talking about
  • Who those people are
  • How those conversations have changed over time
  • What events affected them.

Additionally, we want to know about how the interactions between tweeters have influenced these changes. So, we need to know how to integrate traditional social science methodologies into a partly automated process of knowledge discovery.

A topic model is a generative probabilistic model which represents documents as a mixture of topics, where the allocation of a document to a topic depends on the words within that document. Latent Dirichlet Allocation (LDA) is the go-to topic model for most studies (Blei, Ng, Jordan & Lafferty, 2003). However, LDA is challenging to optimise when modelling tweets, which are very short documents. The shortness of tweets reduces the efficiency of the LDA inference algorithm, which in turn reduces the accuracy of its topics. MetaLDA, which was developed by our group, is designed specifically for modelling short documents like tweets.
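To make "generative" a little more concrete, here is a minimal sketch of the story LDA tells about how a document comes to exist. It is not MetaLDA and it does no inference; the vocabulary, the two topics and the helper names (`sample_dirichlet`, `generate_document`) are all made up for illustration. Fitting a topic model means running this story in reverse: starting from the words and estimating the hidden mixtures.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, vocab, length, rng):
    """Generate one document by following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # this document's mixture over topics
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
# Hypothetical vocabulary, chosen only for illustration
vocab = ["privacy", "gdpr", "delete", "google", "court", "ruling"]
# Two hypothetical topics, each a probability distribution over the vocabulary
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(2)]
theta, tweet = generate_document(topic_word, alpha=[0.5, 0.5], vocab=vocab, length=8, rng=rng)
print(theta, tweet)
```

With only eight words to go on, there is very little evidence for recovering `theta`, which is one way to see why short documents are hard for this kind of model.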

If applicable, what kind of problems do you encounter in relation to datasets? How do you mitigate these issues?

The challenge of Twitter data sets is that they are huge (1 million+ tweets) and are what we call sparse and noisy. When you look at a tweet, you see it has lots of characters and links in it that we need to remove, because they confuse the algorithm. #hashtags and @usernames are examples. We also need to remove ‘stop words’ such as ‘at’, ‘and’, and ‘the’. But when we remove these, the tweet becomes even shorter. You can see the difference between an unprocessed tweet and a pre-processed tweet below.

This is a big problem for us, and we have to think carefully about how much noise to remove without overly sanitising our data, which may give inaccurate results.
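The kind of cleaning described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the stop-word list is a tiny hand-picked one, the example tweet is invented, and the function name `preprocess` is hypothetical rather than anything from our actual pipeline.

```python
import re

# Tiny illustrative stop-word list; real lists contain hundreds of words
STOP_WORDS = {"at", "and", "the", "a", "is", "to", "of", "in"}

def preprocess(tweet):
    """Lower-case a tweet, strip links, @usernames, #hashtags and punctuation, drop stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # @usernames and #hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation, digits, emoji
    return [tok for tok in text.split() if tok not in STOP_WORDS]

# Invented example tweet
raw = "@EU_Commission The #GDPR is here and the right to be forgotten matters! https://example.com/x"
tokens = preprocess(raw)
print(tokens)  # ['here', 'right', 'be', 'forgotten', 'matters']
```

Thirteen whitespace-separated items go in and five words come out, which shows the trade-off: every rule that removes noise also removes signal (here, the hashtag itself).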

We also have to handle lots of spam and tweets that are unrelated to what we are interested in. For example, we were working with a group of immunologists at UWA who wanted to know what rheumatoid arthritis patients thought about the medications they were using and how they talked about their illness on Twitter.

We had to figure out how to remove the tweets from doctors, marketers and the media. To do this, we looked at three classification algorithms: Naive Bayes, SVM and a simple neural network. The neural network performed the best at this task. But we need to do more work.

Other issues we face with Twitter data are malicious spam bots that are becoming harder and harder to detect. I have been working on an analysis of the political hashtag #auspol in the lead-up to the 2019 election and encountered this problem. Malicious bots try to sway the opinions of Australian voters with ‘fake news’. This happened in the 2016 US Election too and was thought to contribute to President Trump’s win. It is an ongoing problem in social media analytics (the bots, not Trump).
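As an illustration of the simplest of those three approaches, here is a from-scratch multinomial Naive Bayes classifier. It is a sketch only: the four training tweets and the labels "patient" and "other" are invented, and a real study would use thousands of labelled tweets and an established library rather than the hypothetical `train` and `predict` helpers below.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tokens, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero out the score
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training tweets, already tokenised
training = [
    ("my joints hurt again this morning".split(), "patient"),
    ("started my new medication and feel tired".split(), "patient"),
    ("new study shows drug reduces symptoms".split(), "other"),
    ("buy our supplement for joint relief today".split(), "other"),
]
model = train(training)
label = predict(model, "my medication makes my joints hurt".split())
print(label)  # patient
```

The classifier leans on first-person words such as "my", which is roughly how a human reader would tell a patient from a marketer too.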

All data is biased. But unlike what you might read in the news, it’s not always because of the researcher’s own personal biases. Sometimes it’s because we have no choice but to add or omit certain features of the data set. In the GDPR data set, I’ve had to remove all non-English tweets, as the algorithm would get confused. Additionally, none of us speak all 160 languages detected. But since the GDPR was enacted in European countries, a lot of people tweeting about it are tweeting in a language other than English. By removing the non-English tweets, we lose the opinions of those people. The data set then becomes biased towards English-speaking users, primarily from the UK, US, Canada and Australia.
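One modest way to keep this kind of bias visible is to record exactly what a filter throws away. The sketch below assumes each tweet record carries a machine-detected language tag in a field called `lang`; the five records are invented for illustration.

```python
from collections import Counter

# Invented records; "lang" stands in for a machine-detected language tag
tweets = [
    {"text": "The right to be forgotten is overdue", "lang": "en"},
    {"text": "Le droit à l'oubli est essentiel", "lang": "fr"},
    {"text": "Das Recht auf Vergessenwerden", "lang": "de"},
    {"text": "GDPR fines are coming", "lang": "en"},
    {"text": "El derecho al olvido", "lang": "es"},
]

kept = [t for t in tweets if t["lang"] == "en"]
# Count what was dropped, per language, so the loss can be reported
dropped = Counter(t["lang"] for t in tweets if t["lang"] != "en")
share_lost = 1 - len(kept) / len(tweets)
print(f"kept {len(kept)} of {len(tweets)} tweets; lost {share_lost:.0%}", dict(dropped))
```

Reporting the share lost alongside the results does not remove the bias, but it lets readers judge how far the findings can be generalised.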

Do you believe ML systems are appropriate to use in surveillance or law enforcement contexts? How can the negative impacts of ML systems in such contexts be mitigated against?

Hmmm, this is a bit of a hot potato topic, and your question is a little leading, so I’m going to answer, but with a caveat. They should be used, but not before proper ethical, legal and data governance procedures are put in place, with oversight from independent bodies.

Predictive policing has a long history of going wrong. A classic example is the runaway feedback loop. When a particular neighbourhood, let’s say Armadale, has higher drug-related crime statistics than, say, Swanbourne, then the police will be sent to Armadale* to look for drug-related activity more than they would to Swanbourne. With the higher crime rate and increased police presence, there will be more arrests in Armadale than ever before. These crime statistics are fed back into the model for new predictions, but of course, Armadale will come back as even more of a likely target for drug-related crime, and the cycle continues regardless of the relative crime rate compared to the benchmark suburb of Swanbourne. This isn’t great, obviously, but luckily, most police departments have realised this already.

*I’m allowed to pick on Armadale since I lived there for a while.
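The runaway feedback loop is easy to reproduce in a toy simulation. Everything below is an assumption made for illustration: the starting numbers are invented, both suburbs are given the same underlying rate on purpose, and the "model" is simply "patrol wherever the most arrests have been recorded". Real predictive policing systems are more elaborate, but the mechanism is the same.

```python
def simulate(recorded, true_rate, days):
    """Send each day's patrol to the suburb with the most recorded arrests."""
    recorded = dict(recorded)  # copy so the caller's data is untouched
    for _ in range(days):
        target = max(recorded, key=recorded.get)  # the model's "prediction"
        # Arrests are only recorded where police actually patrol
        recorded[target] += true_rate[target]
    return recorded

start = {"Armadale": 12.0, "Swanbourne": 10.0}    # invented starting statistics
true_rate = {"Armadale": 0.5, "Swanbourne": 0.5}  # identical underlying rates (assumed)
end = simulate(start, true_rate, days=100)

share_before = start["Armadale"] / sum(start.values())
share_after = end["Armadale"] / sum(end.values())
print(f"Armadale's share of recorded arrests: {share_before:.0%} -> {share_after:.0%}")
```

Even though the two suburbs are identical by construction, Armadale's share of the recorded arrests climbs from just over half to well over four fifths, because the model only ever collects new data where it already expected to find crime.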

On the other hand, there are some truly amazing examples of ML systems being used in law enforcement. At Monash University, Janis Dalins, who is a PhD student and Australian Federal Police (AFP) Officer, is working with Prof Campbell Willson on AI systems that help identify criminal activity on the Darkweb. The idea is that AI crawlers could make it faster to track down images of crimes against children, a task that currently has to be done manually at significant cost to the AFP’s time and resources and to the mental health of the officers who have to view these images.

COMPAS predicts the likelihood of recidivism for criminals based on various factors. A criticism of the tool is that it oversimplifies the judicial process, is weighed too heavily by judges and can lead to faulty convictions. Do you believe the fault lies with the nature of the tool itself as opposed to the methodology of judges?

I wrote an opinion piece on COMPAS from my perspective as a Data Scientist. I believe that the COMPAS developers tried their best to create a holistic instrument for the benefit of the individual and the court’s efficiency. However, it is being misused. This is out of the developers’ hands. The fault never lies primarily with the tool, but developers must be diligent in updating the software. An example of where this has gone horribly wrong is the ingrained racial or gender-based bias in HR analytics.

Also have a look at this paper for reference.

Caliskan-Islam, A., Bryson, J. J., & Narayanan, A. (2016). Semantics derived automatically from language corpora necessarily contain human biases. arXiv preprint arXiv:1608.07187, 1-14.

Presumably you believe that ML has the potential to be beneficial towards society. Could you illuminate some specific contexts or applications where this is the case, perhaps within your own body of work?

Well, I’ve given you a few examples, but here is another. Turning Point is Australia’s leading addiction treatment and research centre and, together with Monash University, has been successful in receiving a $1.2 million grant through the Google AI Impact Challenge. Prof Wray Buntine and Principal Investigator Prof. Dan Lubman from Turning Point aim to adapt AI methodologies to streamline coding of national ambulance suicide-related attendance data. The resulting data would play a central role in informing public health prevention, policy and intervention, as well as identifying emerging trends, hidden populations and geographical hotspots for targeted responses relating to suicide. The ultimate aim would be to develop a national suicide monitoring system with the potential to set international standards for informing suicide prevention efforts.

As you can imagine, there are some serious ethical and legal implications for this work, but it ultimately aims to reduce the number of lives lost each year in these tragic circumstances.

Teenagers... scary smart.