Thoughts on Data Science, ML and Startups

Machine Learning Design Patterns: Problem Representation Part 1

In my previous post, I discussed the data representation patterns presented in Machine Learning Design Patterns by V. Lakshmanan, S. Robinson & M. Munn. In this post, I would like to talk about the next topic in the book - problem representation. After taking care of our data representation, this is the next logical step (and the next chapter in the book). Problem representation is probably the most critical decision to make for an ML problem - how we choose to model a given problem will define how well our solution performs. The good news is that we do not need to make the right decision from the start - as with everything in ML, it is an iterative process. When you find yourself struggling to solve your problem with regression, try classification (always try classification if you can).

I will do it differently this time - instead of just discussing patterns, I will define a task which we will solve using different design patterns. This way, we will be able to compare results and see the influence of problem representation.

In this post, I will concentrate on Reframing and Neutral Class design patterns. Next time I will cover Rebalancing and Ensemble design patterns. But first, let's define our task.

Task: Predict song popularity

To illustrate the design patterns mentioned above, I will try to predict track popularity on Spotify using only track (mostly audio) features such as danceability, liveness, and tempo. You can download the data, with the full list of features used, from Kaggle. You can also find a notebook to reproduce the results here.

Popularity is a Spotify metric calculated for each track based on the number of plays and those plays' recency. Given the above definition, we expect that new songs will be more popular on average. So we will limit ourselves to the tracks released in the past decade. Again, for more details, see the notebook.

Reframing and Neutral Class Design Patterns

As already mentioned, in this post I would like to discuss and illustrate two problem representation design patterns - Reframing and Neutral Class. The first should be bread and butter for any data scientist, but the second I haven't seen anywhere else so far.

Solving the task

The task is simple - predict which songs are popular. The popularity rating is an integer from 0 to 100. It's not a real-valued target, as a house price would be, but regression is still a reasonable approach here. However, if the only thing we want to predict is whether a song is popular (not the rating itself), we can make this task a classification problem by thresholding the popularity index.

Data

You can find more on data in the notebook referenced above.

We have 19,788 tracks collected for the years 2011-2020 and 11 audio features that we will use to predict a song's popularity. We split this dataset randomly into train and test sets - 15,830 and 3,958 tracks fall in each, respectively.
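A random split like this can be sketched with sklearn's train_test_split. The arrays below are random placeholders with the same shape as the dataset, and test_size=0.2 and the seed are assumptions chosen for illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the dataset described above:
# 19,788 tracks, 11 audio features, integer popularity between 0 and 100.
rng = np.random.default_rng(0)
X = rng.random((19_788, 11))
y = rng.integers(0, 101, size=19_788)

# test_size=0.2 and random_state=42 are assumptions for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 19,788 rows, a 20% test share gives 15,830 training and 3,958 test rows, which is consistent with the counts above.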

The popularity has the following distributions for the train and test splits:

popularity-distribution

Not ideal, but close. Statistics are close too:

Statistic   Popularity (train)   Popularity (test)
Count       15,830               3,958
Mean        58.89                58.58
Std         15.30                15.14
Min         0                    0
25th        54                   53
50th        61                   60
75th        67                   67
Max         99                   100

Evaluation

For these types of problems, I use rank correlation as an evaluation metric. Since I want to know which tracks are (or will be) popular, what I care about is whether a higher predicted score indicates higher actual popularity - and that is exactly what a rank correlation measures. I will use Spearman's Rho and Kendall's Tau with their corresponding p-values for evaluation.
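Both metrics are available in scipy.stats. Here is a minimal sketch of the evaluation step; the helper name rank_correlations and the toy numbers are made up for illustration.

```python
from scipy.stats import kendalltau, spearmanr


def rank_correlations(y_true, y_score):
    """Return (coefficient, p-value) pairs for Spearman's Rho and Kendall's Tau."""
    rho, rho_p = spearmanr(y_true, y_score)
    tau, tau_p = kendalltau(y_true, y_score)
    return {
        "spearman_rho": (float(rho), float(rho_p)),
        "kendall_tau": (float(tau), float(tau_p)),
    }


# Toy example: five tracks, with one adjacent pair ranked in the wrong order.
actual_popularity = [10, 35, 50, 62, 80]
predicted_score = [0.1, 0.3, 0.2, 0.7, 0.9]

result = rank_correlations(actual_popularity, predicted_score)
for name, (coef, p_value) in result.items():
    print(f"{name}: {coef:.3f} (p={p_value:.3f})")
```

Only the ordering of the predictions matters for these metrics, so the same function works for regression outputs and for class probabilities.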

Regression

As mentioned, regression is a reasonable approach to model popularity. Using sklearn's GradientBoostingRegressor to model popularity gives us the following results.

Predicted vs. Actual Scatter Plot

predicted-vs-actual-popularity-regression

Correlation coefficients

Metric Coefficient p-value
Spearman's Rho 0.270 5.52e-67
Kendall's Tau 0.184 6.47e-66

There is a statistically significant correlation between predicted and actual popularity. But let's see if we can do better.
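For reference, the regression baseline can be sketched as follows. The data here is synthetic (not the Spotify features), and the default hyperparameters and seeds are assumptions, so the numbers it prints will not match the table above.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the track data (NOT the real Spotify features):
# 11 random features, with popularity driven partly by the first one.
rng = np.random.default_rng(0)
X = rng.random((2000, 11))
noise = rng.normal(0, 12, size=2000)
y = np.clip(np.round(60 + 40 * (X[:, 0] - 0.5) + noise), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Default hyperparameters are an assumption for this sketch.
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X_train, y_train)
predicted = reg.predict(X_test)

# Compare the ordering of predictions with the ordering of actual values.
rho, rho_p = spearmanr(y_test, predicted)
tau, tau_p = kendalltau(y_test, predicted)
print(f"Spearman's Rho: {rho:.3f} (p={rho_p:.2e})")
print(f"Kendall's Tau:  {tau:.3f} (p={tau_p:.2e})")
```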

Classification

We can reframe our original regression problem as classification by thresholding the popularity index to create classes for our model to predict. Classification often works better than regression in this setting because we simplify the task: the model only has to separate high values from low ones rather than predict the exact rating. If the only thing we need is a relative ordering of the songs, the scores from the classification model are enough.

Split at the median

The most straightforward approach to creating binary classes is to split the scores at the median value. With this approach, we get nicely balanced training data. It might not work if your dataset values are heavily skewed towards one end of the scale. In that case, you will have to experiment to see whether training on the imbalanced dataset works out OK, or whether you should employ some rebalancing design patterns (the subject of my next post!).
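A median split can be sketched like this. The data is again synthetic, and the choice of GradientBoostingClassifier, the strict "greater than the median" rule, and the seeds are assumptions for illustration. The score we evaluate is the predicted probability of the high class.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the track data (NOT the real Spotify features).
rng = np.random.default_rng(0)
X = rng.random((2000, 11))
noise = rng.normal(0, 12, size=2000)
y = np.clip(np.round(60 + 40 * (X[:, 0] - 0.5) + noise), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Threshold at the median of the TRAINING popularity only.
threshold = np.median(y_train)
labels_train = (y_train > threshold).astype(int)  # 1 = high, 0 = low

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, labels_train)

# Column 1 of predict_proba corresponds to class 1 (high popularity).
score = clf.predict_proba(X_test)[:, 1]

# Evaluate against the ORIGINAL popularity values, not the binary labels.
rho, _ = spearmanr(y_test, score)
tau, _ = kendalltau(y_test, score)
print(f"Spearman's Rho: {rho:.3f}, Kendall's Tau: {tau:.3f}")
```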

Predicted vs. Actual Scatter Plot

predicted-vs-actual-popularity-classification-simple

Correlation coefficients

Metric Coefficient p-value
Spearman's Rho 0.307 5.62e-87
Kendall's Tau 0.210 2.68e-85

That is roughly a 14% relative improvement in both Spearman's and Kendall's correlations (from 0.270 to 0.307 and from 0.184 to 0.210, respectively). Not bad for simple thresholding. Let's see if we can improve upon this result.

Middle values removed

Another trick that I always experiment with is simplifying the problem for the algorithm by removing intermediate values. Depending on your question, this might help the model distinguish between good and bad examples. However, it means that we are discarding some of our data. So this tradeoff improves results only when losing the discarded data costs less than we gain from the removed ambiguity.

Predicted vs. Actual Scatter Plot

predicted-vs-actual-popularity-classification-middle-removed

Correlation coefficients

Metric Coefficient p-value
Spearman's Rho 0.303 7.06e-85
Kendall's Tau 0.207 1.22e-82

In our example, I kept the top and bottom 40 percent of the dataset and dropped the middle, and as we can see, this ended up hurting performance compared to the median split. I also varied the amount of data I removed, and this approach always had a detrimental effect on model performance in this case.
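Dropping the middle of the distribution can be sketched as below. The data is synthetic, the 0.4 and 0.6 quantile cutoffs mirror the 40 percent figure above, and the classifier and seeds are assumptions. Note that only the training set is filtered; the test set is scored in full.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the track data (NOT the real Spotify features).
rng = np.random.default_rng(0)
X = rng.random((2000, 11))
noise = rng.normal(0, 12, size=2000)
y = np.clip(np.round(60 + 40 * (X[:, 0] - 0.5) + noise), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Keep roughly the bottom 40% and the top 40% of training tracks.
low_cut, high_cut = np.quantile(y_train, [0.4, 0.6])
keep = (y_train <= low_cut) | (y_train >= high_cut)
X_kept, y_kept = X_train[keep], y_train[keep]
labels_kept = (y_kept >= high_cut).astype(int)  # 1 = high, 0 = low

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_kept, labels_kept)

# Score every test track, including those with middling popularity.
score = clf.predict_proba(X_test)[:, 1]
rho, _ = spearmanr(y_test, score)
print(f"kept {keep.sum()} of {len(y_train)} training rows, rho={rho:.3f}")
```

Because popularity is an integer, ties at the cutoffs mean slightly more than 80% of the rows can survive the filter.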

Classification with Neutral Class

The above example tried to remove the ambiguity introduced by splitting a continuous response at a threshold, but it hurt model performance because we removed part of the data. The Neutral Class design pattern takes care of that - instead of eliminating some data, we give it a class of its own and use it for training. Then, at inference time, we only look at the probability of the high class.

The Neutral Class design pattern is also useful when we train models on the output of human labelers, as in medical imaging applications. When human labelers do not agree, we can represent that uncertainty to our model via a neutral class.
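Here is a minimal sketch of the pattern on synthetic data. The width of the neutral band (between the 0.4 and 0.6 quantiles), the classifier, and the seeds are all assumptions for illustration. The model is trained on three classes, but only the probability of the high class is used as the score.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

LOW, NEUTRAL, HIGH = 0, 1, 2

# Synthetic stand-in for the track data (NOT the real Spotify features).
rng = np.random.default_rng(0)
X = rng.random((2000, 11))
noise = rng.normal(0, 12, size=2000)
y = np.clip(np.round(60 + 40 * (X[:, 0] - 0.5) + noise), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Instead of dropping the middle band, label it as its own class.
low_cut, high_cut = np.quantile(y_train, [0.4, 0.6])
labels_train = np.full(len(y_train), NEUTRAL)
labels_train[y_train <= low_cut] = LOW
labels_train[y_train >= high_cut] = HIGH

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, labels_train)

# At inference time, use only the probability of the high class.
proba = clf.predict_proba(X_test)
high_col = list(clf.classes_).index(HIGH)
score = proba[:, high_col]

rho, _ = spearmanr(y_test, score)
print(f"Spearman's Rho using P(high): {rho:.3f}")
```

Looking up the column through clf.classes_ avoids relying on the class order in the predict_proba output.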

Predicted vs. Actual Scatter Plot

predicted-vs-actual-popularity-classification-neutral-class

Correlation coefficients

Metric Coefficient p-value
Spearman's Rho 0.308 7.61e-88
Kendall's Tau 0.211 6.66e-86

As you can see, we get an additional increase of roughly 0.5% in Spearman's and Kendall's metrics compared to our simple classification approach. It won't make or break your ML project, but the size of the effect will depend on the dataset. And with such a simple change, any positive impact is welcome.

Discussion

Looking at the scatterplots (or even at the label distribution), we see that we could employ more tricks to increase model performance. We could train one model to distinguish 0 (or very low) scores from the higher ones and then introduce a second model that predicts the higher scores based on the first model's output. This is called the Cascade design pattern, and I will write about it in the future.

Summary

This post explored two ML design patterns - Reframing and Neutral Class. We have shown that these can work in tandem: we reframe the problem at hand from regression to classification and then add a neutral class to help our model better distinguish high values from low ones. These two steps add a nice performance boost if we can live with an estimate of the probability of the high class rather than an estimate of the metric in question.

Next, we will look at Rebalancing and Ensembles design patterns. I think they are useful daily since most of the exciting problems are imbalanced by nature (predicting high-value customers, churners, etc.).

Other Posts in Machine Learning Design Patterns Series