Overfitting is a widely discussed pitfall in ML. In any ML 101 class, we are taught how to identify and mitigate it in a supervised ML setting. But wait, what about the unsupervised setting? Can we overfit there as well? In my talk, I will show that we sure can, and we will discuss what to do about it.
Throughout my career as a data scientist, I have had the opportunity to work on many diverse Machine Learning (ML) projects. Some of my favourite projects deal with an unsupervised learning setting. At first glance, an unsupervised model may seem simpler to execute, with fewer potential pitfalls along the way. For example, it does not require splitting the data into training, validation and test datasets, and on the face of it, there is no risk of overfitting: we just run our ML model on the dataset, and voilà!
In my experience, however, once the model is deployed to production, we often measure poor performance. How can this be? In this talk, I will show how unsupervised ML models can be overfitted if we are not careful enough. I will demonstrate overfitting on a clustering problem by simulating clustering model selection: we will run several clustering models on the learning set, select the best one, and then see that it may be overfitted. But don’t worry! We will also discuss some ways to mitigate the overfitting peril in an unsupervised setting. I will provide a toolset to help you identify and avoid this situation, and will show how each mitigation impacts our simulation results.
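
To make the idea concrete, here is a minimal sketch of that kind of model selection. It is not the simulation from the talk: the uniform-noise data, the candidate values of k, the selection criterion (lowest mean squared distance to the nearest centroid) and the helper names `fit_kmeans` and `mean_sq_error` are all assumptions chosen for illustration, and the k-means is a deliberately simple, dependency-free version rather than a library implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_kmeans(points, k, rng, n_iter=25):
    """Plain Lloyd's algorithm (illustrative helper); returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if a group ends up empty.
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids


def mean_sq_error(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


rng = random.Random(0)

# Assumption: uniform noise in the unit square, so there is no real
# cluster structure for any model to discover.
data = [(rng.random(), rng.random()) for _ in range(120)]
learning_set, holdout_set = data[:60], data[60:]

candidate_ks = [2, 4, 8, 16]
results = {}
for k in candidate_ks:
    centroids = fit_kmeans(learning_set, k, rng)
    results[k] = (
        mean_sq_error(learning_set, centroids),  # score used for selection
        mean_sq_error(holdout_set, centroids),   # score on unseen data
    )

# Naive selection: pick the model with the best score on the learning set.
best_k = min(results, key=lambda k: results[k][0])

for k, (learn_err, holdout_err) in results.items():
    marker = " <- selected" if k == best_k else ""
    print(f"k={k:2d}  learning={learn_err:.4f}  holdout={holdout_err:.4f}{marker}")
```

Comparing the two columns for the selected model gives a rough sense of how optimistic the learning-set score is. Scoring the candidates on data that was not used for fitting, as the hold-out column does here, is one simple way to spot the gap before deployment.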