Personalised fraud detection
Insurance fraud is expensive and raises insurance prices for all customers, so it is important to detect and prevent. Soft fraud, the exaggeration of legitimate claims, is diffuse and difficult to spot. A sustainable welfare system requires effective measures to limit fraud, including tax avoidance and tax evasion. Furthermore, money laundering is a serious threat to the global economy.
Fraud detection can be seen as a regression/forecasting problem in which fraud (true/false) is the response, possibly together with a potential economic loss. Each case typically comes with a great number of covariates/features, especially if one considers interactions. The data are also class imbalanced: only a small subset of all cases is ever investigated, and the number of investigated fraud cases is generally low compared to the total number of cases. Another challenge is that the data are gathered over time and that their quality may vary. The objective is to produce a trustworthy and interpretable probability of fraud for each new case, using methods that can handle structured and unstructured data, including transactions, relational networks, and other available digital records, in a privacy-responsible setting. Since only a small number of cases are investigated, fraud detection can also be seen as anomaly detection.
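To make the class-imbalance point concrete, the sketch below shows one common remedy: weighting each class inversely to its frequency when scoring predicted fraud probabilities. It is only an illustration under assumed toy data (98 non-fraud cases, 2 fraud cases); the function names are hypothetical and it does not represent the method used in the project.

```python
import math

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * n_c)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Class-weighted average negative log-likelihood for binary labels."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum

# Assumed toy data: 2 % of the cases are fraud.
labels = [0] * 98 + [1] * 2
probs = [0.02] * 100  # a model that always predicts the base rate

weights = balanced_class_weights(labels)
unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)
print(weights, unweighted, weighted)
```

With balanced weights the two fraud cases carry as much total weight as the 98 non-fraud cases, so a model that simply predicts the base rate is penalised far more heavily than under the unweighted loss.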
Network analysis for fraud detection
Fraud often involves more than a single individual. There may be groups of criminals acting together, or one or more criminals who use businesses, financial services, or other (innocent) individuals to perform the fraud. In such settings, network relations play a fundamental role. This is particularly the case for money laundering, where financial transactions, professional roles, and social relations all form networks relevant for modeling. Graph neural networks (GNNs) are a model class that allows neural network models to be built on top of such graph data structures. Through BigInsight, a part-time master's student in the Data Science program at the University of Oslo has been working on this topic for a few years. The project used data from DNB to model and detect money laundering with GNNs operating on a heterogeneous graph consisting of both transaction and professional role networks. In 2022, the thesis was completed with promising results. We also started to write a scientific paper based on and expanding upon this work.
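As a rough illustration of what building a model on top of a heterogeneous graph involves, the sketch below performs one round of neighbour aggregation with a separate mean per edge type. The toy nodes, features, and edge types are invented for the example, and a real GNN would additionally apply learned weight matrices and non-linearities; this is not the architecture from the thesis.

```python
from collections import defaultdict

# Hypothetical toy graph: three parties with two-dimensional features.
features = {
    "alice": [1.0, 0.0],
    "bob": [0.0, 1.0],
    "acme_as": [0.5, 0.5],
}
# Typed edges: (node, node, edge type). Treated as undirected here.
edges = [
    ("alice", "bob", "transaction"),
    ("bob", "acme_as", "transaction"),
    ("alice", "acme_as", "role"),
]

def mean(vectors, dim):
    """Component-wise mean; a zero vector if there are no neighbours."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def message_passing_round(features, edges, edge_types):
    """Concatenate each node's features with per-edge-type neighbour means."""
    dim = len(next(iter(features.values())))
    incoming = defaultdict(lambda: defaultdict(list))
    for a, b, etype in edges:
        incoming[a][etype].append(features[b])
        incoming[b][etype].append(features[a])
    updated = {}
    for node, own in features.items():
        vec = list(own)
        for etype in edge_types:
            vec += mean(incoming[node][etype], dim)
        updated[node] = vec
    return updated

embeddings = message_passing_round(features, edges, ["transaction", "role"])
print(embeddings["alice"])
```

Keeping one aggregate per edge type is what lets the model distinguish, for instance, a transaction partner from a fellow board member.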
Surveying the field of embeddings
Our network analysis work over the past few years has spurred us to survey the field of embeddings within statistics and machine learning. This has resulted in two survey papers. In the first paper, we take the reader all the way from statistical embeddings such as principal components, via non-linear embeddings, topological embeddings and topological data analysis, to embeddings on networks. The paper has been accepted for publication in the renowned review journal Statistical Science. The second paper builds on the first by surveying extensions to embeddings of time series and dynamic networks. That paper has been accepted for publication in the Journal of Time Series Analysis. We believe that this comprehensive knowledge will inspire further work.
Detecting structuring or smurfing
Structuring is the act of parcelling what would otherwise be a large financial transaction into a series of smaller transactions to avoid scrutiny by regulators and law enforcement. Criminal enterprises may employ several agents (“smurfs”) to make the transactions. Structuring appears in money laundering and other financial crimes. Even though this is a known money laundering technique, methods for detecting smurfing are scarce in the scientific literature. We are devising methods to search for and detect smurfing patterns, which can be used as rules directly or as complex features in a machine learning model. In 2022, we fine-tuned the methodology based on feedback from DNB. Small-scale tests on real DNB customer data showed promising results.
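A simple rule of the kind described above could look like the following sketch: flag a beneficiary account when several transfers, each below a reporting threshold, add up to at least that threshold within a short time window. The threshold, window length, minimum count, and record layout are all assumptions made for illustration; the methodology developed with DNB is not reproduced here.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def flag_structuring(transactions, threshold=100_000,
                     window=timedelta(days=7), min_count=3):
    """Return {beneficiary: (start, end, total, senders)} for suspicious windows.

    transactions: iterable of (timestamp, sender, beneficiary, amount).
    All parameter defaults are hypothetical.
    """
    by_beneficiary = defaultdict(list)
    for ts, sender, beneficiary, amount in transactions:
        if amount < threshold:  # only sub-threshold transfers are of interest
            by_beneficiary[beneficiary].append((ts, sender, amount))

    flagged = {}
    for beneficiary, txs in by_beneficiary.items():
        txs.sort(key=lambda t: t[0])
        start, total = 0, 0.0
        for end, (ts, _, amount) in enumerate(txs):
            total += amount
            # Shrink the window from the left until it spans at most `window`.
            while ts - txs[start][0] > window:
                total -= txs[start][2]
                start += 1
            if end - start + 1 >= min_count and total >= threshold:
                senders = sorted({s for _, s, _ in txs[start:end + 1]})
                flagged[beneficiary] = (txs[start][0], ts, total, senders)
                break
    return flagged

day = datetime(2022, 3, 1)
transactions = [
    (day, "smurf_a", "acct_x", 40_000),
    (day + timedelta(days=1), "smurf_b", "acct_x", 35_000),
    (day + timedelta(days=2), "smurf_c", "acct_x", 30_000),
    (day, "payer", "acct_y", 20_000),
    (day + timedelta(days=30), "payer", "acct_y", 20_000),
]
flagged = flag_structuring(transactions)
print(flagged)
```

The same window statistics (count, total, number of distinct senders) could instead be passed to a machine learning model as features rather than used as a hard rule.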
Sentiment analysis for fraud detection
Sentiment analysis is the use of natural language processing (NLP) or text analysis to systematically identify, extract, quantify, and study affective states and subjective information. In the case of fraud, certain sentiments, like “impatient” or “unsatisfied”, or the transitions between them, could be a signal of fraudulent behavior. We have developed a method to predict sentiments in Gjensidige insurance chats. Chats are instant messages that Gjensidige customers can use to ask customer service questions. Detecting sentiments is a difficult problem, since even humans can disagree on which sentiment(s) can be found in a specific text. A variant of the method is already being used by Gjensidige, and there is interest from other BigInsight partners as well. During the summer of 2022, we also tested adapting NLP models to provide complete textual answers to chat questions from Gjensidige customers, with interesting results.
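The idea of using transitions between sentiments as a signal can be sketched as follows. The example assumes that some classifier has already assigned one sentiment label to each message in a chat; the labels and the function name are hypothetical, and the sentiment model itself is not shown.

```python
from collections import Counter

def transition_counts(sentiments):
    """Count consecutive (from, to) sentiment pairs within one chat."""
    return Counter(zip(sentiments, sentiments[1:]))

# Assumed per-message labels for a single chat, in chronological order.
chat = ["neutral", "impatient", "impatient", "unsatisfied"]
counts = transition_counts(chat)
print(counts)
```

Such counts could serve as additional covariates in a fraud model alongside the sentiments themselves.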
Copula regression
Traditional regression methods model the conditional probability of fraud given the covariates directly. In copula regression, which is an emerging field, this conditional model is instead inferred from the joint distribution of the response and the covariates, which is constructed with a copula. This allows for a lot of flexibility, especially in the modelling of interactions. However, the existing inference methods for copula regression handle only rather low dimensions. In BigInsight, we have over the past few years been working on new inference methods that are suitable for the dimensions we encounter in fraud detection.
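For orientation, the step from the joint distribution to the conditional model can be written out in the textbook case of a continuous response $Y$ and continuous covariates $X_1, \dots, X_d$. The notation is introduced here for illustration: $F_Y, F_1, \dots, F_d$ are marginal distribution functions with densities $f_Y, f_1, \dots, f_d$, $c$ is the copula density of $(Y, X_1, \dots, X_d)$, and $c_X$ is the copula density of the covariates alone. A binary fraud response requires a discrete version of the same argument, which is not shown.

```latex
\begin{align}
f(y, x_1, \dots, x_d)
  &= c\bigl(F_Y(y), F_1(x_1), \dots, F_d(x_d)\bigr)\, f_Y(y) \prod_{j=1}^{d} f_j(x_j), \\
f(y \mid x_1, \dots, x_d)
  &= \frac{c\bigl(F_Y(y), F_1(x_1), \dots, F_d(x_d)\bigr)}
          {c_X\bigl(F_1(x_1), \dots, F_d(x_d)\bigr)}\, f_Y(y).
\end{align}
```

The second line follows by dividing the joint density by the covariate density $c_X\bigl(F_1(x_1), \dots, F_d(x_d)\bigr) \prod_{j} f_j(x_j)$, so the covariate marginal densities cancel and all dependence on the covariates enters through the two copula densities.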
Fraud loss
Statistical fraud detection typically aims at extracting a subset of the most suspicious cases (insurance claims, financial transactions, etc.) for further investigation, because limited resources allow investigators to examine only a restricted number k of cases. The most efficient way of allocating these resources is then to try to select the k cases with the highest probability of being fraudulent. Optimizing such a system is not necessarily the same as obtaining the most accurate probability estimate for every case and then ranking the cases. We propose a loss function, denoted the fraud loss, for selecting the model complexity via a tuning parameter. In 2022, we published a paper based on this method in the Journal of Applied Statistics. The paper contains a thorough simulation study showing comparable or better performance when using fraud loss, compared to the traditional approach.
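The difference between ranking the top k cases and estimating probabilities well can be illustrated with a stand-in criterion: count the true frauds among the k highest-scored validation cases and choose the tuning parameter that maximises that count. This is only a sketch of the general idea with invented data and hypothetical function names; it is not the definition of fraud loss given in the paper.

```python
def frauds_in_top_k(probs, labels, k):
    """Number of true frauds among the k cases with the highest scores."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sum(labels[i] for i in ranked[:k])

def select_tuning_parameter(candidates, labels, k):
    """Pick the tuning-parameter value whose scores capture most frauds in the top k.

    candidates maps each tuning-parameter value to validation-set scores.
    """
    return max(candidates,
               key=lambda lam: frauds_in_top_k(candidates[lam], labels, k))

# Assumed validation data: six cases, two of them fraudulent.
labels = [1, 0, 0, 1, 0, 0]
candidates = {
    0.1: [0.9, 0.8, 0.1, 0.2, 0.1, 0.1],  # top 2 contains one fraud
    1.0: [0.6, 0.3, 0.2, 0.7, 0.1, 0.1],  # top 2 contains both frauds
}
best = select_tuning_parameter(candidates, labels, k=2)
print(best)
```

Here only the ordering at the top of the list matters, which is why a model with less accurate probabilities overall can still be the better choice for investigators.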