Machine Learning Design Patterns: Data Representation
Design patterns are a set of best practices and solutions to common problems. Machine learning engineers, like engineers in other disciplines, can benefit immensely from following such idioms. In this and the following posts, I will discuss the ML patterns outlined in Machine Learning Design Patterns by V. Lakshmanan, S. Robinson, and M. Munn.
Data Representation Design Patterns
Let us start with Data Representation Patterns. These patterns focus on the feature engineering part of the ML workflow. It would be a stretch to call some simple and commonly used techniques a pattern:
1. Linear transformations: min-max scaling, clipping, and z-score normalization;
2. Non-linear transformations: logarithms, taking a root of a value, histogram equalization, and the Box-Cox transform;
3. Categorical feature handling: one-hot encoding (this might be a pattern, but it's just too common these days, and there are quite a few better approaches);
4. Handling arrays of categorical features: array statistics.
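A few of the simple techniques above can be sketched in a handful of NumPy lines. This is only an illustration on a made-up feature; the clipping bounds are hypothetical and would normally come from domain knowledge or from the training data.

```python
import numpy as np

# Toy numerical feature with one outlier (made-up values).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: map the observed range onto [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Clipping: cap values at hypothetical bounds to tame the outlier.
clipped = np.clip(x, 0.0, 10.0)

# Z-score normalization: zero mean and unit variance.
z_score = (x - x.mean()) / x.std()

# Logarithm: compress a long right tail (log1p also handles zeros).
log_scaled = np.log1p(x)
```

Note how the single outlier squeezes the min-max scaled values of the other points towards zero, which is one reason clipping or a logarithm is often applied first.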
The patterns that the authors describe in the book, and that we will discuss here, are Embeddings, Feature Cross, Multimodal Input, and Hashed Feature.
Embeddings
An embedding is a learnable representation that maps a high-cardinality feature into a lower-dimensional space while preserving information. I would even say that embeddings can enhance categorical features by encoding them in a way that makes the ML task easier for a learning algorithm. These days, in the middle of the deep learning boom, it seems that everybody knows to use an image embedding in an ML task when an image is a contributing factor. However, this pattern is not only about that type of embedding. Let us see what problems this pattern addresses.
The problem with categorical features is that simple conversion techniques do not capture relationships between classes and rely on the algorithm to distill those relationships. This holds whatever categorical feature you have:
* A day of the week: if day numbering starts from Sunday, Saturday and Sunday will be far apart in numerical value, while they are most likely close to each other in meaning;
* A book or a food category: we can one-hot encode them, but we cannot encode relationships among the classes this way;
* Very high cardinality features, such as a fine-grained catalog of an e-shop, which can be impractical to one-hot encode.
All categorical features would benefit from embedding. Cardinality also plays a role: with a vast vocabulary, a one-hot encoding can become prohibitively large.
There is also the problem of incorporating unstructured data into our algorithm. Text, image, and audio are rich sources of information but are not easy to incorporate in algorithms that are not specifically built for those types of input.
Embeddings are a great way to encode categorical or unstructured data so that other algorithms can use a better representation of the feature.
Embedding categorical features is probably the best and easiest way to improve model performance. By learning an embedding for any categorical feature, we extract information on how categories relate to each other for the task we are trying to learn. Embedding boosts the algorithm's performance and has a nice side effect - we can use learned embeddings in different learning tasks.
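To make the idea concrete, here is a minimal sketch of what an embedding lookup does mechanically. The embedding size and the random weights are assumptions for illustration only; in a real model the table would be learned together with the rest of the network.

```python
import numpy as np

days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
index = {day: i for i, day in enumerate(days)}

embedding_dim = 3  # hypothetical size, chosen only for the example
rng = np.random.default_rng(seed=0)
# Random values stand in for weights that training would normally learn.
embedding_table = rng.normal(size=(len(days), embedding_dim))


def one_hot(day):
    """Sparse representation: one position per category."""
    vec = np.zeros(len(days))
    vec[index[day]] = 1.0
    return vec


def embed(day):
    """Dense representation: one row of the embedding table."""
    return embedding_table[index[day]]


# Every pair of distinct one-hot vectors is equally far apart, so the
# encoding itself cannot express that Saturday is "closer" to Sunday.
dist_sat_sun = np.linalg.norm(one_hot("Sat") - one_hot("Sun"))
dist_sat_wed = np.linalg.norm(one_hot("Sat") - one_hot("Wed"))

# An embedding lookup is equivalent to multiplying the one-hot vector
# by the table, just without materializing the sparse vector.
same = np.allclose(one_hot("Sat") @ embedding_table, embed("Sat"))
```

Once trained, rows of the table for related categories tend to end up close to each other, which is also what makes the table reusable in other tasks.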
Embedding unstructured data is almost the only way to include that data in learning models. Suppose we have a problem that would benefit from including unstructured information. In that case, we could use one of the pre-trained models for images (one of many trained on ImageNet) or text (e.g., GloVe/Word2Vec word vectors) and feed the resulting vectors into any other model together with structured data.
The embedding pattern is unique - there is no real alternative to it. We can encode categories as integers or one-hot encode them, but the results are not even close in terms of model performance. If possible, we should always use the Embedding pattern; the only question is which type of embedding to include.
Feature Cross
Combining feature values and making every combination a different feature helps our algorithm learn relationships between characteristics faster.
When our features relate in non-linear ways, we can improve our model by providing those non-linear relationships explicitly, by "crossing" the features.
One typical example is the time of day and the day of the week for some event. For instance, suppose we want to predict the demand for bikes at a specific bike-sharing spot in a city. Having just the time of day and the day of the week as separate features might not be enough, and an explicit AND relationship could improve our model.
A feature cross is a simple combination of two or more categorical features (or bucketed numerical ones). For example, if we have the day of the week (Monday, Tuesday, etc.) and the time of day (1 PM, 2 PM, etc.) as features, we could build a combined day-and-time feature (Monday 1 PM, Monday 2 PM, Tuesday 1 PM, etc.). Crossing features increases cardinality considerably; therefore, using this pattern with the embedding pattern could yield even better results.
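A feature cross of day and hour can be sketched in plain Python as below. The `cross` helper and the `Mon_13` naming scheme are my own illustrative choices, not something prescribed by the book.

```python
import itertools

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
hours = list(range(24))


def cross(day, hour):
    """Combine two categorical values into one crossed category."""
    return f"{day}_{hour:02d}"


# Vocabulary of the crossed feature: one entry per (day, hour) pair.
crossed_vocab = {
    cross(day, hour): i
    for i, (day, hour) in enumerate(itertools.product(days, hours))
}

# A Monday 1 PM event now maps to its own category id.
monday_1pm_id = crossed_vocab[cross("Mon", 13)]
```

Two features with 7 and 24 values turn into one feature with 168 values, which shows how quickly cardinality grows and why the crossed feature is a natural candidate for an embedding.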
Feature crosses are a simple way to introduce non-linearity in our models and help them learn relationships faster. This pattern works even better when combined with the embeddings pattern discussed above. However, we should carefully consider which features to cross. Since this pattern increases the model's complexity quite a bit, we should not go and cross all our features.
Multimodal Input
When we have multiple representations of the phenomenon that we want to model, we should include all those representations in our algorithm.
Many algorithms that are available online are designed for a specific type of input - image models (ResNets), text models (all the transformers out there), audio models (I haven't used any myself so far). However, there is a significant class of problems where we would like to use several types of inputs: a combination of structured and unstructured data. For example, modeling social media campaign results involves all the above input types.
By employing the embedding pattern, we can join different types of input into a single model. We can use any of the pre-trained models for image or text input, extract the last layer features, and concatenate them with numerical or categorical features. For example, if we classify a complicated situation from a picture, having metadata about that situation (date and time, weather, geographical location, etc.) can significantly increase our model's accuracy. Additionally, this pattern is useful with the same data in different representations, and bucketing is the simplest example. If we have distance as a continuous feature in a dataset, bucketing could help us learn non-linear relationships: maybe both very short and very long distances correlate with our outcome.
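Both ideas, concatenating modalities and adding a bucketed view of the same value, can be sketched with NumPy. Everything here is a stand-in: the image embedding is random noise in place of a real pre-trained model's output, and the metadata values and bucket boundaries are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for last-layer features from a pre-trained image model.
image_embedding = rng.normal(size=8)

# Hypothetical structured metadata: hour of day and temperature.
tabular = np.array([14.0, 21.5])

# The same distance in two representations: raw and bucketed.
distance_km = 0.4
bucket_edges = np.array([1.0, 5.0, 20.0])  # hypothetical boundaries
bucket = np.digitize(distance_km, bucket_edges)  # 0 means "below 1 km"
distance_one_hot = np.eye(len(bucket_edges) + 1)[bucket]

# One flat input vector that a downstream model can consume.
model_input = np.concatenate(
    [image_embedding, tabular, [distance_km], distance_one_hot]
)
```

The downstream model then sees 8 image features, 2 metadata features, the raw distance, and 4 bucket indicators side by side.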
As data scientists/machine learning engineers, we should always seek new features to add to our models. By seeking out features from different modalities of the phenomenon, we can build better models. Similarly, we can increase our model's performance by presenting the same information from a different angle.
Hashed Feature
The hashed feature design pattern is an interesting approach meant to address cold start, incomplete vocabulary, and model size problems.
Having high cardinality categorical features poses three main challenges when building an ML model:
* Not all categories might exist in the training dataset;
* The number of categories might be prohibitively big;
* The cold-start problem.
The proposed solution is a deterministic hash function (the authors of the book propose FarmHash). We would hash a given categorical feature into a pre-defined number of buckets and would use the bucket as a feature instead of the original value. With this approach, we tick all the above boxes:
* All categories would get a bucket, even those that were not present in the training dataset;
* We control how many buckets we hash our values into;
* Even new values of the feature that weren't available at training time would get handled (our model would not error out).
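The bucketing step can be sketched as follows. Note that this sketch swaps in SHA-256 from Python's standard library instead of FarmHash, purely to stay dependency-free; the `hash_bucket` helper and the `airport_…` values are hypothetical. Python's built-in `hash()` would be a poor choice here because it is salted per process for strings and therefore not deterministic across runs.

```python
import hashlib


def hash_bucket(value: str, num_buckets: int) -> int:
    """Deterministically map a categorical value to a bucket id."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


num_buckets = 10

# Values seen during training and a value that shows up only later
# are handled by exactly the same code path.
seen = [hash_bucket(f"airport_{i}", num_buckets) for i in range(100)]
unseen = hash_bucket("airport_brand_new", num_buckets)

# With 100 values and 10 buckets, unrelated values must share buckets.
distinct_buckets = len(set(seen))
```

The last line is the catch: collisions are guaranteed once there are more values than buckets, and nothing ensures that values sharing a bucket have anything in common.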
This pattern is the least appealing of all the data representation design patterns. There might be situations where it is necessary, but it should be a last resort. Because hashing assigns buckets to our feature values arbitrarily, we might group very different values together, and our model would suffer. So, if possible, I would go with anything other than hashing, for example describing high cardinality features with metadata about their values or with descriptive statistics. We could then group the values based on those statistics and use the groups as input for future values, which would handle the cold start problem.
On the other hand, this is the only new design pattern, so I am delighted that the authors decided to include this.
This post is the first in the series about ML design patterns. The book contains several more chapters I intend to write about, including Problem Representation Design Pattern, Model Training Patterns, Model Serving Design Patterns, Reproducibility Design Patterns, and Responsible AI Design Patterns.