Thoughts on Data Science, ML and Startups

Machine Learning Design Patterns: Problem Representation Part 2

In the first part of this Problem Representation series, we saw that representing a seemingly regression problem as a classification problem can increase performance. We also saw that constructing the label in a specific way can increase performance further, but the results weren't great - we achieved only about 30% correlation. To improve performance, we will split our problem - separate low-popularity and medium-to-high-popularity songs and treat them separately. This post is therefore about separating unpopular tracks as well as we can. For this, we will take a look at two ML design patterns from the book - Rebalancing and Ensembles. As before, the code to reproduce these results is on GitHub.

Task: Classifying songs as unpopular vs. popular

Looking at the distribution of popularity for our dataset (tracks produced from 2011 onwards), we see that a small portion of the songs is very unpopular, with a popularity score below 20:

[Figure: popularity distribution with low scores highlighted]

Low scores are outliers - only around 4% of all popularity scores in our dataset fall below 20. However, our track popularity models developed in part 1 of this series still tried to accommodate these low scores. Our models might perform better if we remove these "outlier" tracks before predicting a song's popularity. To do this, we need to classify songs as unpopular - those with a popularity of 20 or less - or popular - all the others. Since only 4% of the tracks in our dataset have such a low popularity score, this is a highly imbalanced problem. To solve it, we will try out a couple of ML design patterns - Rebalancing and Ensembles.

Rebalancing and Ensemble Design Patterns

The Rebalancing Design Pattern addresses the problem of an imbalanced dataset in three ways:

* undersampling the majority class - randomly or selectively discarding data from the majority class;
* oversampling the minority class - producing additional data points for the minority class based on some heuristic;
* weighting classes - giving more weight to errors on the minority class to steer the model toward better predictions.

The Ensemble Design Pattern combines multiple machine learning models trained on subsamples of the data to reduce bias, variance, or both. There are three main approaches to ensembles:

* Bagging (bootstrap aggregating) - great for reducing variance in ML models. We train several models on random samples of our dataset and average their outputs (or take a majority vote).
* Boosting - used for bias reduction, as it constructs an ensemble model more powerful than any of its members.
* Stacking - yet another way to combine models: a model is trained on top of the other models' outputs to find the best weighting of those outputs.

We will use bagging for our problem, since each model will be trained on little data and our goal is to reduce the variance of the combined score.

Solving the task

We will go through the experiments to see which approach produces the best result for our use case. First, let's look a bit more closely at the data we are working with and define the evaluation metrics.


You will find a bit more EDA in the notebook referenced above.

We have 19,788 tracks collected for the years 2011-2020. We split this dataset randomly into train and test sets of 15,830 and 3,958 tracks. We have 11 audio features with which to predict popularity. Based on the plot above, let us mark tracks with a popularity of 20 or less as unpopular and the other songs as popular. With this definition, the summary statistics for the label are:

Statistic Popularity (train) Popularity (test)
Count 15,830 3,958
Mean 0.042 0.041
Std 0.201 0.200

The distribution of the response is similar in the training and test sets.
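The label construction and the random split described above can be sketched as follows; the dataframe and column names here are assumptions standing in for the real 19,788-track dataset from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the real track dataset
df = pd.DataFrame({"popularity": [5, 18, 20, 21, 35, 80]})
# unpopular = popularity score of 20 or less
df["unpopular"] = (df["popularity"] <= 20).astype(int)
train, test = train_test_split(df, test_size=0.2, random_state=42)
print(df["unpopular"].tolist())  # → [1, 1, 1, 0, 0, 0]
```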

Evaluation metrics

Our problem is a standard classification problem - the only thing we care about is classifying the examples in the test set as accurately as possible. Since our dataset is highly imbalanced, accuracy is not an appropriate metric. Instead, we will use precision, recall, and f1-score for evaluation, as these metrics better capture model performance in our setting.
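All three metrics can be computed with a single scikit-learn call; here is a quick sketch on a toy prediction vector.

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
# average="binary" reports the metrics for the positive (minority) class
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.667 0.667 0.667
```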

Let us start our experiments with the naive approach.

Naive classification

The naive approach is to modify neither our dataset nor the classifier's hyper-parameters (the class weights). Since we evaluate our models on precision, recall, and f1-score, we need to select a score threshold. We will do this by looking at a recall-precision-f1 graph like the one below.


The best point we can find from the above graph is 0.402 recall, 0.564 precision, and 0.470 f1-score. The metrics mentioned above correspond to the threshold of 0.205.
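The threshold search behind graphs like this can be sketched with scikit-learn's precision-recall curve; the synthetic 96/4 dataset below is an illustrative stand-in for our features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# synthetic data with roughly our dataset's 96/4 class imbalance
X, y = make_classification(n_samples=5000, n_features=11, weights=[0.96],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

scores = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, scores)
# f1 at every candidate threshold; guard against a 0/0 at the curve's end
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = int(np.argmax(f1[:-1]))  # the last curve point has no threshold
print(f"best threshold {thresholds[best]:.3f}, f1 {f1[best]:.3f}")
```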

We have our baseline. Next, let us explore various ways to account for imbalance in our dataset and see the impact on model performance.

Weighting classes

One way to achieve rebalancing is through the algorithm parameter rather than through changes to the training data. We can penalize errors made on the minority class examples more, and in this way, bias our algorithm to learn a better representation of the minority class.

After running a few experiments, we achieved the best f1-score with a minority-to-majority weight ratio of 7-to-1. The performance of the model trained with this weighting is as follows.


The above graph's optimal point is an f1-score of 0.489 (a 4% improvement over the naive approach), with a precision of 0.524 and a recall of 0.457. We achieve these metrics at a threshold of 0.545.
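In code, the weighting is a single parameter on the estimator; the random forest here is an illustrative stand-in for the notebook's model, and any scikit-learn classifier exposing class_weight works the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=11, weights=[0.96],
                           random_state=42)
# errors on the minority class (label 1) are penalised 7x more heavily
clf = RandomForestClassifier(class_weight={0: 1, 1: 7}, random_state=42).fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # scores to threshold as before
```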

Now let us see what impact rebalancing the data brings.

Rebalancing the dataset

Dealing with imbalanced datasets could be a series of blog posts in itself. Here we will look into a couple of techniques for undersampling, and later for oversampling, to get a sense of how these techniques perform. Overall, throwing out data is not wise, so we expect oversampling to come out on top. Let's see what our experiments show.

We will use a great package to make this process easy for us - imbalanced-learn.

Random undersampling

Random undersampling is the most straightforward approach we can take to alter our dataset - randomly discarding some of the majority class records until we reach a specified balance level. Running a few experiments for different values of this balance, we found that the best performing ratio is 0.2 - undersampling until the balance is 20% minority and 80% majority examples (the original dataset is 4% minority and 96% majority).


The above graph's optimal point is an f1-score of 0.484, with a precision of 0.469 and a recall of 0.500, achieved at a threshold of 0.423. This approach gives us a model with a slightly different precision-recall tradeoff - better recall and somewhat lower precision. With the f1-score being almost the same, we can later pick the model whose tradeoff better fits our use case. Next, let us investigate a slightly cleverer undersampling technique.

NearMiss undersampling

The NearMiss method is based on a paper by Zhang J. and Mani I. (2003). There are three variations of the algorithm:

1. NearMiss-1 - selects majority class examples with the smallest average distance to the three closest minority class examples.
2. NearMiss-2 - selects majority class examples with the smallest average distance to the three farthest minority class examples.
3. NearMiss-3 - selects a given number of the closest majority class examples for each minority class example.

After running nine experiments for each of the above variations, we achieved the best result with version 3 and an equal sampling strategy - downsampling the majority class to a 50/50 ratio.


Given how much data we discard, it is no surprise that the results are worse than those of the other methods. We achieve the best performance at a threshold of 0.692: a precision of 0.483, a recall of 0.341, and an f1-score of 0.400. Let us see if oversampling can improve on this.


It is no surprise that we lose model performance as we discard records from our dataset. Our dataset is not very big to begin with, so next we will try a couple of oversampling methods - SMOTE and ADASYN.

SMOTE oversampling

SMOTE was proposed by Chawla et al. (2002) as a means of improving model performance on imbalanced datasets and as an alternative to random oversampling with replacement. Instead of duplicating actual records, this approach creates synthetic examples of the minority class. Let's see how it performs on our dataset.

Running several experiments with different oversampling ratios, we find that oversampling the dataset to the 20/80 level gives us the best performing model.


The best performance is at a threshold of 0.567: a precision of 0.500, a recall of 0.396, and an f1-score of 0.442. Oversampling with SMOTE produced a better model than NearMiss undersampling. However, it still does not outperform the naive model, the weighted model, or the model trained on randomly undersampled data. Let us investigate another oversampling technique.

ADASYN oversampling

ADASYN is the algorithm proposed in Haibo et al. (2008). Its main difference from SMOTE is that ADASYN generates more synthetic examples for the minority points that are harder to learn, instead of treating every minority class example equally.

As with SMOTE, ADASYN oversampling produces the best model with the dataset balance of 20/80.


As is evident from the graph, oversampling with ADASYN gives us a model that outperforms just one other - the NearMiss model - with a precision of 0.453, a recall of 0.378, and an overall f1-score of 0.412.

Ensembles

The Ensemble design pattern is useful for many problems, not just imbalanced datasets. Training and aggregating multiple models that have seen different parts of the training dataset makes performance more stable. Different ensemble techniques target bias or variance issues, and some algorithms already incorporate them - Random forest (bagging) and Boosted trees (boosting). With an imbalanced dataset, however, there is a twist.

We want to be smart about selecting the subsample of the dataset used to train each model in our ensemble. One way to be smart about downsampling is to downsample only the majority class. Every model then gets trained on a slightly different dataset, while each of them keeps all the minority class information. We repeat this process as many times as needed for stable performance.

By running several experiments, we found that the best ensemble for our song popularity task has 45 models. That's a lot, but the models are tree-based algorithms, so inference is not snail-slow.


The ensemble approach gives us the third-best model, with a precision of 0.481, a recall of 0.470, and an f1-score of 0.475 at a threshold of 0.733.
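The majority-only downsampling ensemble described above can be sketched as follows; the member model (a shallow decision tree) and the 50/50 per-member balance are illustrative assumptions, not the notebook's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=11, weights=[0.96],
                           random_state=42)
minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

rng = np.random.default_rng(0)
n_models = 45
scores = np.zeros(len(X))
for seed in range(n_models):
    # every member keeps all minority rows plus a fresh random slice of the
    # majority class of the same size (a 50/50 per-member balance)
    sampled = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, sampled])
    member = DecisionTreeClassifier(max_depth=5, random_state=seed)
    member.fit(X[idx], y[idx])
    scores += member.predict_proba(X)[:, 1]
scores /= n_models  # bagged probability of a track being unpopular
```

Because each member sees the full minority class, no minority information is thrown away, while the per-member majority subsamples keep the training sets balanced.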

Results

Putting all our results together, we get the following table:

Approach Precision Recall F1-Score Notes
Weighting training examples 0.524 0.457 0.489 Best performing weighting at 7-to-1; threshold - 0.545
Random undersampling 0.469 0.500 0.484 Best performing undersampling to 20/80; threshold - 0.423
Ensemble 0.481 0.470 0.475 Best performing ensemble was with 45 models; threshold - 0.733
Naive 0.564 0.402 0.470 Threshold - 0.205
SMOTE oversampling 0.500 0.396 0.442 Best performing oversampling at 20/80; threshold - 0.567
ADASYN oversampling 0.453 0.378 0.412 Best performing oversampling at 20/80; threshold - 0.578
NearMiss undersampling 0.483 0.341 0.400 Best performing undersampling was done with version 3 to 50/50; threshold - 0.692

Evaluating models on the highest f1-score achieved, we have two close contenders with slightly different performance profiles. The model built by weighting training examples has higher precision but lower recall, while the model built by randomly undersampling the majority class in the training data has higher recall and lower precision. The next two models (the 3rd and 4th by f1-score) are also close in f1-score but differ in their performance profiles: the ensemble model is relatively balanced, with recall and precision close to each other, while the naive model has the highest precision among the selected models, with recall barely above 0.4.

We set out on this imbalanced classification journey to "clean" unpopular tracks from our dataset before predicting popularity for the "normal" tracks. Applying our top-performing model gives the following popularity distribution:


Our model removes around 60% of the unpopular songs. We would like to remove more, but we don't yet know whether this will be enough to impact the overall popularity prediction performance. We will investigate this in Part 3 of this Problem Representation series, where we will bring together the last two posts and combine them with the Cascade design pattern.

Other Posts in Machine Learning Design Patterns Series