Fraud Detection with Minimum Labels: Semi-Supervised Learning

Antons Tocilins-Ruberts
Published in Ravelin Tech Blog
7 min read · Jul 11, 2022

Image by Marija Zaric from Unsplash

This blog continues exploring deep learning methods for tabular data. Previously, we looked at TabNet, which was one of the first deep learning models designed specifically for tabular data. This post focuses on VIME (Value Imputation and Mask Estimation), a self- and semi-supervised learning framework for tabular data.

The purpose of this blog is to give you a deep dive into the architecture and to show you when and how it can be applied to fraud detection. You can find all the notebooks and code in this GitHub repo. It contains a full working example, so we highly recommend checking it out. The fraud detection data is taken from Kaggle and the original VIME repo can be found here.

Why Self/Semi Supervised Learning?

Imagine a situation where only 1% of the entire dataset is labelled. This can happen in domains where labelling the data is prohibitively expensive and/or very time-consuming. For example, in fraud detection only a small sample of transactions is ever reviewed.

It might be tempting to train a fully-supervised model on this 1% of labels, but this can give you a false sense of security. Classical models for tabular datasets (like DNNs or XGBoost) are very proficient at learning a tight decision boundary around the labelled examples, but that boundary can miss the mark on unlabelled datapoints. This kind of problem might be better suited to semi-supervised learning, because it can use both labelled and unlabelled data.

Intuition behind supervised overfitting

The Main Idea

Introduced by Yoon et al. (2020), VIME is a systematic approach to self-supervised and semi-supervised learning for tabular data. The main idea is that the final model is trained in 2 steps:

  1. The data encoder is trained on the unlabelled dataset using a reconstruction task (Self-Supervised Learning)
  2. The trained encoder is then used to train a predictor which is fine-tuned on both labelled and unlabelled datasets (Semi-Supervised Learning)
VIME overview taken from https://vanderschaar-lab.com/papers/NeurIPS2020_VIME.pdf

Self-Supervised Learning

Theory

The idea of self-supervised learning is very simple — we want to use the unlabelled data to learn useful feature representations. This is usually done by training an encoder model to perform some artificial but challenging task. For example, in image domains we might want to reconstruct samples augmented by rotation, noise addition, etc. The same logic applies to VIME where the input samples are corrupted and subsequently recovered. The augmentation process for a feature matrix X is performed using the following steps:

Self-supervised data generation adapted from https://vanderschaar-lab.com/papers/NeurIPS2020_VIME.pdf
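The steps in the figure can be sketched in a few lines of NumPy. This is a minimal sketch that follows the procedure described in the VIME paper, not the code from the repo; the function name `corrupt_features` and the parameter `p_mask` (the probability of corrupting a value) are names chosen here for illustration.

```python
import numpy as np

def corrupt_features(x, p_mask, rng):
    """Return (mask, x_corrupted) for a 2-D feature matrix x."""
    n_rows, n_cols = x.shape
    # Step 1: draw a binary mask; 1 marks a value that will be replaced.
    mask = rng.binomial(1, p_mask, size=(n_rows, n_cols))
    # Step 2: shuffle every column independently, so replacement values
    # come from the same feature but from a different row.
    x_shuffled = np.column_stack(
        [x[rng.permutation(n_rows), j] for j in range(n_cols)]
    )
    # Step 3: take masked positions from the shuffled copy.
    x_corrupted = np.where(mask == 1, x_shuffled, x)
    # A swapped-in value can equal the original, so recompute the mask
    # to mark only the entries that actually changed.
    mask = (x_corrupted != x).astype(int)
    return mask, x_corrupted

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
mask, x_corrupted = corrupt_features(x, p_mask=0.3, rng=rng)
```

The returned `mask` and the original `x` are the two targets of the self-supervised task, while `x_corrupted` is the model input.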

Essentially, we swap some feature values of one observation for the corresponding values of another observation. This type of augmentation is often called swap noise and is similar in spirit to CutMix from the image domain. According to the authors, reconstructing corrupted tabular samples can be quite challenging for the model, so the task is actually split into 2 separate sub-tasks:

  1. Reconstruction of the corruption mask
  2. Reconstruction of the original features.

Each of these sub-tasks has its own model, which takes the shared encoder embeddings as input. Intuitively, these two tasks push the encoder to learn correlations among the features and to output embeddings from which the original data can be recovered.

The loss that gets optimised is calculated using this formula:

Loss = Mask Reconstruction Loss + alpha × Feature Reconstruction Loss

where Mask Reconstruction Loss is binary cross-entropy, Feature Reconstruction Loss is mean squared error and alpha is a hyperparameter to tune.
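As a small worked example, the loss can be computed by hand in NumPy. The sketch assumes the two terms are combined as a weighted sum, with alpha weighting the feature reconstruction term; the function names are illustrative and not taken from the repo.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def self_supervised_loss(mask, mask_prob, x, x_recon, alpha):
    """Mask loss plus alpha times feature loss (assumed weighted sum)."""
    mask_loss = binary_cross_entropy(mask, mask_prob)
    feature_loss = mean_squared_error(x, x_recon)
    return mask_loss + alpha * feature_loss
```

A larger alpha makes the encoder prioritise recovering the original values over detecting which values were corrupted.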

Note that from this whole model, the only part we actually care about is the encoder. After training, it should produce useful feature representations, which are then used in the semi-supervised setting.

Code

Note that the encoder and decoder architectures can be adjusted depending on how complicated your dataset is. You can use more layers, different activations or even a TabNet architecture.

The model returns 2 arrays: the reconstructed mask and the reconstructed features. After preparing the data, the self-supervised model can be trained like any other TensorFlow model. The hyperparameter alpha controls how much weight to give to the feature reconstruction loss.
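A minimal Keras sketch of such a model is shown below, assuming a single dense encoder layer and one dense layer per head. The function name, layer sizes, layer names and the default alpha are illustrative assumptions, not the architecture from the repo.

```python
import tensorflow as tf

def build_self_supervised_model(n_features, hidden_dim=32, alpha=2.0):
    """Shared encoder with a mask head and a feature head (illustrative)."""
    inputs = tf.keras.Input(shape=(n_features,))
    # Shared encoder: the only part that is reused after pre-training.
    embedding = tf.keras.layers.Dense(
        hidden_dim, activation="relu", name="encoder")(inputs)
    # Head 1: probability that each feature value was corrupted.
    mask_hat = tf.keras.layers.Dense(
        n_features, activation="sigmoid", name="mask")(embedding)
    # Head 2: reconstruction of the original feature values.
    feature_hat = tf.keras.layers.Dense(
        n_features, activation="linear", name="feature")(embedding)
    model = tf.keras.Model(inputs, [mask_hat, feature_hat])
    model.compile(
        optimizer="adam",
        loss=["binary_crossentropy", "mse"],  # one loss per output
        loss_weights=[1.0, alpha],            # alpha weights the feature loss
    )
    return model
```

With corrupted inputs `x_corrupted`, the corruption mask `mask` and the original features `x`, training would then look like `model.fit(x_corrupted, [mask, x], epochs=..., batch_size=...)`.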

Evaluation

Mask reconstruction is basically a binary classification task, so it can be evaluated using ROC AUC. For the feature reconstruction we can use RMSE or the correlation coefficient. Some columns are easier to reconstruct than others, so it’s worth looking at the distributions of these metrics across features.
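These per-feature metrics can be sketched as follows. The helper names are illustrative, and the pairwise ROC AUC is written out in NumPy only to keep the snippet dependency-free; on real data a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC is undefined with a single class
    # Pairwise comparison: fine for small samples, quadratic in memory.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def per_feature_metrics(mask, mask_scores, x, x_recon):
    """One dict per column: mask ROC AUC, feature RMSE and correlation."""
    rows = []
    for j in range(x.shape[1]):
        rows.append({
            "feature": j,
            "mask_auc": roc_auc(mask[:, j], mask_scores[:, j]),
            "rmse": float(np.sqrt(np.mean((x[:, j] - x_recon[:, j]) ** 2))),
            "corr": float(np.corrcoef(x[:, j], x_recon[:, j])[0, 1]),
        })
    return rows
```

Plotting a histogram of each metric over the returned rows gives the kind of per-feature distribution described above.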

Self-supervised VIME evaluation plot

Overall, the model we trained seems to be performing relatively well. The reconstructed features are positively correlated with the original ones, and masked values tend to get larger mask scores.

Another way to evaluate the self-supervised model is to look at the embeddings. By training on corrupted data, we teach the model to generate embeddings that are robust to corruption. If a sample is corrupted 5 times, all 5 embeddings should be relatively close to each other in the vector space. Let’s check this hypothesis by corrupting 10 different samples 5 times each and projecting their embeddings to 2-dimensional space using UMAP.
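The same hypothesis can also be checked numerically, without a projection step, by comparing distances between embeddings of the same sample and of different samples. The sketch below is only an illustration: `encode` stands for the trained encoder's prediction function, and the example call uses a random, untrained projection purely as a placeholder so that the snippet runs on its own.

```python
import numpy as np

def corrupt(x, p_mask, rng):
    """Replace a random subset of values with values from other rows."""
    mask = rng.binomial(1, p_mask, size=x.shape)
    shuffled = np.column_stack(
        [rng.permutation(x[:, j]) for j in range(x.shape[1])]
    )
    return np.where(mask == 1, shuffled, x)

def embedding_spread(encode, x, n_repeats, p_mask, rng):
    """Mean embedding distance within the same sample vs across samples."""
    # emb has shape (n_repeats, n_samples, embedding_dim).
    emb = np.stack([encode(corrupt(x, p_mask, rng)) for _ in range(n_repeats)])
    same = np.eye(len(x), dtype=bool)
    within, between = [], []
    for a in range(n_repeats):
        for b in range(a + 1, n_repeats):
            # Distances between all samples of repeat a and repeat b.
            dist = np.linalg.norm(
                emb[a][:, None, :] - emb[b][None, :, :], axis=-1)
            within.append(dist[same].mean())
            between.append(dist[~same].mean())
    return float(np.mean(within)), float(np.mean(between))

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
weights = rng.normal(size=(8, 3))  # placeholder for a trained encoder
within, between = embedding_spread(
    lambda v: np.tanh(v @ weights), x, n_repeats=5, p_mask=0.3, rng=rng)
```

For a well-trained encoder, the within-sample distance should be clearly smaller than the between-sample distance.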

VIME vs Not-trained Embeddings

From the plot above, you can see that the embeddings for the same samples are indeed close to each other. Also note that when the model was not trained with corrupted samples, the embeddings are not well clustered in the vector space. Based on these plots, we can be reasonably confident that the encoder has learned useful information and can be used in the semi-supervised learning part.

Semi-Supervised Learning

Theory

The semi-supervised learning part uses both labelled and unlabelled datasets. With the labelled part it’s business as usual: supervised training on labels using either cross-entropy or MSE loss. The unlabelled training is a bit trickier and consists of the following steps:

Semi-supervised data generation adapted from https://vanderschaar-lab.com/papers/NeurIPS2020_VIME.pdf

The main idea is again to balance 2 objectives — to accurately predict the labelled data and to output consistent predictions for corrupted unlabelled samples. The unsupervised task acts as a regulariser preventing the model from forgetting everything it has learned in the self-supervised part.

Supervised and Unsupervised VIME training

The overall semi-supervised loss is calculated using the following formula:

Loss = Supervised Loss + beta × Unsupervised Loss

where Supervised Loss is categorical cross-entropy for classification tasks and mean squared error for regression tasks, Unsupervised Loss is the variance of per-sample predictions, and beta is a hyperparameter.

Code

As you can see, the architecture doing both tasks is the same. Hence, we just need to add a predictor model on top of the pre-trained encoder to get the final model.

The trickiest part of semi-supervised VIME is the training. We have 2 tasks (labelled and unlabelled) which have completely different datasets and losses. To solve this challenge, we’ve decided to construct a custom training loop using generators as inputs.

Semi-supervised VIME Training Loop
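The data-feeding side of such a loop can be sketched with plain Python generators. This is an assumption about the general shape of the loop rather than the repo's implementation: the function names are illustrative, and the gradient update itself (for example with `tf.GradientTape`) is left out.

```python
import numpy as np

def batch_generator(arrays, batch_size, rng):
    """Yield random mini-batches forever from arrays with equal length."""
    n_rows = len(arrays[0])
    while True:
        # Sample with replacement only if the dataset is smaller than a batch.
        idx = rng.choice(n_rows, size=batch_size, replace=n_rows < batch_size)
        yield tuple(a[idx] for a in arrays)

def training_batches(x_lab, y_lab, x_unlab, batch_size, n_steps, rng):
    """Pair one labelled and one unlabelled batch per training step."""
    labelled = batch_generator((x_lab, y_lab), batch_size, rng)
    unlabelled = batch_generator((x_unlab,), batch_size, rng)
    for _ in range(n_steps):
        x_batch, y_batch = next(labelled)
        (x_unlab_batch,) = next(unlabelled)
        # A training step would compute the supervised loss on the labelled
        # batch, the consistency loss on corrupted copies of the unlabelled
        # batch, and apply a single update on their weighted sum.
        yield x_batch, y_batch, x_unlab_batch
```

Because both generators are infinite, the small labelled set is simply revisited many times while the loop keeps drawing fresh unlabelled batches.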

Evaluation

Let’s see how it performs on the test set, which has all the fraud labels for the following 2 months. Remember, the model was trained on only 1% of labels, so we expect VIME to outperform fully-supervised models in this setting. The 2 models used for comparison are a Multi-Layer Perceptron (MLP) and a Random Forest (RF). We’re comparing with an MLP because we want to see how large the benefit of pre-training using VIME is. We’ve chosen to compare with an RF because it is an industry standard in fraud detection. All the models were trained 10 times to estimate the average performance. The distribution of PR AUC can be seen below.

PR AUC Comparison Plot

We can see that VIME significantly outperforms the fully-supervised models. On average, the MLP has a PR AUC that is 0.05 lower than VIME’s, which suggests that the pre-training tasks do add value to the neural network. In addition, VIME outperforms the RF by 0.015 on average, which can be significant with large transaction volumes and values.

PR AUC Comparison Table

Conclusion

In situations where only a small fraction of a dataset is labelled, semi-supervised learning can lead to better results than fully-supervised models.

This blog has shown how semi-supervised learning can be performed on tabular data using VIME. In addition, we did a deep dive into the architecture and training procedure so that you have a better understanding of what is happening under the hood. Finally, you saw how to train VIME on a fraud-detection dataset in TensorFlow and how to evaluate it. Once again, we encourage you to check out the full notebook in this GitHub repo and let us know if you have any comments, suggestions, or requests for further deep dives.
