Categorizing Transactions with Machine Learning and Rules

TL;DR

In this post, I’ll demonstrate how combining a rules-based system with machine learning, specifically a Random Forest, can improve transaction categorization, particularly for incidental and non-recurring transactions. This hybrid approach reduces manual effort and improves accuracy, which helps me make better financial decisions with minimal intervention.

The Problem

Categorizing financial transactions is an essential task for gaining financial insights, whether for personal budgeting or business analytics. Over the years, I’ve explored various approaches to categorize transactions effectively, using rules-based systems. In earlier blog posts, I discussed how rules act as part of the domain and how aligning them with domain-driven practices enhances financial insight.

Accurate transaction categorization is the foundation of these insights: it is crucial for understanding spending habits and planning budgets. Without it, the insights remain incomplete or misleading. I may misinterpret my financial behavior, overlook critical spending patterns, or miss opportunities to optimize budgets.

As the year ends, I like to update my budgets for the upcoming year. This time, I found I hadn’t categorized my expenses for several months, leaving over 400 transactions uncategorized. Many were one-off transactions, like rest-stop purchases during travel, which lacked discernible patterns. This underscored the limitations of my rules-based system.

The most persistent issue is handling incidental and non-recurring transactions. These transactions often slip through the cracks because they don’t adhere to patterns that rules-based systems can capture.

While rules-based systems provide a solid foundation, their limitations—especially with non-recurring transactions—make machine learning an essential complement. In the past, I experimented with a naïve Bayes classifier to tackle this problem. Now, I’m using a more sophisticated approach: a Random Forest model implemented with WEKA.

This post explores how combining rules-based systems with machine learning, particularly Random Forest, can address these challenges effectively.

Rules-Based Categorization: The Starting Point

In my earlier explorations, I detailed how rules can form the backbone of transaction categorization. These rules might look something like:

If description contains ‘supermarket’, categorize as Groceries.
If amount equals recurring monthly subscription, categorize as Subscriptions.
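A minimal Python sketch of how such rules could be expressed as data, assuming a simple Transaction record with only a description and an amount. The names Transaction, Rule, and categorize_by_rules are hypothetical, and the 9.99 subscription amount is made up for illustration; this is not the rule engine from the earlier posts.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transaction:
    description: str
    amount: float


@dataclass
class Rule:
    category: str
    matches: Callable[[Transaction], bool]


# Hypothetical rules mirroring the two examples above; the 9.99 amount is made up.
RULES = [
    Rule("Groceries", lambda t: "supermarket" in t.description.lower()),
    Rule("Subscriptions", lambda t: abs(t.amount - 9.99) < 0.005),
]


def categorize_by_rules(tx: Transaction) -> Optional[str]:
    """Return the category of the first matching rule, or None if no rule fires."""
    for rule in RULES:
        if rule.matches(tx):
            return rule.category
    return None
```

Returning None when nothing matches makes the gap explicit: every transaction that falls through is one the rules cannot explain.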

The rules work well for predictable, recurring transactions but falter when faced with outliers. For example:

A one-off dinner at an upscale restaurant.
A gift purchase categorized as “Miscellaneous.”
So-called PIN transactions that only have a description like “Betaalautomaat 15:33 pasnr. xxxxxx”

For instance, a transaction labeled “Betaalautomaat 15:33 pasnr. xxxxxx” (Dutch for “payment terminal”, a time, and a masked card number) provides no useful context, making categorization difficult without additional metadata or sophisticated preprocessing.

The limitations of rules-based systems:

Scalability: Writing comprehensive rules for every edge case is nearly impossible.
Dynamic Changes: Transaction descriptions and patterns change over time.
Non-Recurring Transactions: Incidental transactions often lack historical data to match against.

Exploring Machine Learning for Transaction Categorization

Machine learning provides an opportunity to address the gaps left by rules. Instead of relying on hardcoded patterns, a model can learn from labeled transaction data and generalize to unseen data. To get started, I chose WEKA, a popular tool for machine learning experimentation, due to its ease of use and variety of algorithms.

1. Data Preparation

What was done:

Exported a dataset of previously categorized transactions.
Features included textual (descriptions, counterparties), numerical (amount), and contextual (date).
Labels represented transaction categories (e.g., Groceries, Miscellaneous).

Why it matters:

These features help the model learn patterns and improve its ability to generalize to unseen data.
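As an illustration of the export step, here is a rough Python sketch that writes labeled transactions to ARFF, the file format WEKA reads natively. It assumes the export targets ARFF (WEKA can also load CSV), that each row is a dict with the keys shown, that dates are ISO formatted, and that category names contain no spaces. The function names quote and to_arff are hypothetical, not the export code used for this post.

```python
def quote(value: str) -> str:
    """Quote a string for ARFF, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(rows, categories):
    """Render labeled transactions as ARFF text.

    Assumes each row has description, counterparty, amount, date
    (yyyy-MM-dd), and category keys; category names must not contain spaces.
    """
    lines = [
        "@relation transactions",
        "",
        "@attribute description string",
        "@attribute counterparty string",
        "@attribute amount numeric",
        '@attribute date date "yyyy-MM-dd"',
        "@attribute category {" + ",".join(categories) + "}",
        "",
        "@data",
    ]
    for r in rows:
        lines.append(",".join([
            quote(r["description"]),
            quote(r["counterparty"]),
            f"{r['amount']:.2f}",
            quote(r["date"]),
            r["category"],
        ]))
    return "\n".join(lines) + "\n"
```

One caveat, as far as I know: most WEKA classifiers do not accept string attributes directly, so the description and counterparty columns would typically go through a filter such as StringToWordVector, or be declared as nominal attributes instead.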

2. Training the Random Forest

Random Forest was chosen for its ability to handle noisy, incidental data effectively. It can work with both numerical and categorical features, and because it averages the votes of many decision trees, it is relatively robust to overfitting, which is a real risk with imbalanced datasets like financial transactions.

3. Challenges Encountered

While the model achieved an 86% accuracy rate, certain categories, like one-off clothing purchases or PIN transactions, posed significant challenges. These transactions suffered from sparse data, inconsistent descriptions, and a lack of useful contextual information. Additionally, categories like ‘Miscellaneous’ often lacked sufficient training examples, while generic descriptions like ‘Betaalautomaat’ were prone to misclassification.

More generally, three challenges stood out:

Sparse and Imbalanced Data: Categories with fewer transactions, such as one-off gift purchases, often provide insufficient examples for the model to learn from, leading to misclassifications.
Text Processing: Transaction descriptions often lacked standardization (e.g., inconsistent vendor names).
Uninformative Descriptions: Default descriptions without useful information, such as the previously mentioned PIN transactions, often lead to misclassifications.

These challenges illustrate the need for further refinements, such as better text tokenization and seamless integration into the codebase.
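A single accuracy figure can hide how badly the sparse categories perform. The following Python sketch computes the share of correctly predicted transactions per category from (actual, predicted) pairs. The function name and the toy data are made up for illustration and are not results from my model.

```python
from collections import Counter


def per_category_recall(pairs):
    """Map each actual category to the fraction of its transactions predicted correctly."""
    totals, correct = Counter(), Counter()
    for actual, predicted in pairs:
        totals[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}


# Made-up toy data: a large, easy category and a small, hard one.
toy_pairs = (
    [("Groceries", "Groceries")] * 8
    + [("Gifts", "Gifts"), ("Gifts", "Miscellaneous")]
)
overall_accuracy = sum(a == p for a, p in toy_pairs) / len(toy_pairs)
recall = per_category_recall(toy_pairs)
```

In this toy set the overall accuracy looks healthy while the rare category is right only half the time, which is the same pattern the sparse categories above are prone to.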

Next Steps

Currently, descriptions are treated as raw strings, which limits their effectiveness as features. Tokenizing descriptions — breaking them into meaningful components such as words, numbers, or patterns — would enable the model to better identify recurring patterns in transaction text, including vendor names or keywords.
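To make the idea concrete, here is a small Python sketch of one possible tokenizer. It lowercases the description, keeps words, and replaces times and digit runs with placeholder tokens so that “15:33” and “09:12” look the same to the model. The pattern, the placeholder names, and the function name tokenize are assumptions for illustration, not the tokenization I have implemented.

```python
import re

# Times such as 15:33 first, then runs of letters, then runs of digits.
TOKEN_PATTERN = re.compile(r"\d{1,2}:\d{2}|[^\W\d_]+|\d+")


def tokenize(description: str) -> list[str]:
    """Lowercase a description and split it into word, <time>, and <num> tokens."""
    tokens = []
    for raw in TOKEN_PATTERN.findall(description.lower()):
        if ":" in raw:
            tokens.append("<time>")
        elif raw.isdigit():
            tokens.append("<num>")
        else:
            tokens.append(raw)
    return tokens
```

Collapsing times and numbers into placeholders is a design choice: it throws away detail that rarely identifies a vendor, so the remaining word tokens carry more of the signal.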

In addition, I need to deploy the machine learning model in my application to automate transaction categorization using a domain-driven approach.

Bridging Rules and Machine Learning

Instead of abandoning rules entirely, combining them with machine learning could offer the best of both worlds:

Use rules for predictable patterns and recurring transactions.
Leverage machine learning for edge cases and incidental transactions.

The synergy between rules and machine learning lies in their complementary strengths. Rules excel at handling predictable, recurring patterns, while machine learning offers adaptability for unpredictable edge cases. Together, they create a robust system that reduces manual intervention and improves accuracy.

For instance, a rule might flag all transactions containing “Electric Company” as “Utilities”, while the machine learning model handles a transaction like “Cozy Bistro” by analyzing contextual clues to classify it as “Dining”.
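The division of labor could be sketched in Python as follows: rules run first, the model is consulted only when no rule fires, and low-confidence predictions are left for manual review. The confidence threshold and the review fallback are my assumptions rather than part of the current setup, and stub_model is a stand-in for the trained Random Forest, not a real classifier.

```python
from typing import Callable, Optional, Tuple

RuleFn = Callable[[str], Optional[str]]
ModelFn = Callable[[str], Tuple[str, float]]


def electric_rule(description: str) -> Optional[str]:
    """Rule from the example above: 'Electric Company' means Utilities."""
    return "Utilities" if "electric company" in description.lower() else None


def stub_model(description: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a trained model: returns (category, confidence)."""
    if "bistro" in description.lower():
        return "Dining", 0.8
    return "Miscellaneous", 0.3


def categorize(description: str, rules: list[RuleFn], model: ModelFn,
               min_confidence: float = 0.6) -> Tuple[str, str]:
    """Return (category, source), where source is 'rule', 'model', or 'review'."""
    for rule in rules:
        category = rule(description)
        if category is not None:
            return category, "rule"
    category, confidence = model(description)
    if confidence >= min_confidence:
        return category, "model"
    # Assumption: uncertain predictions are queued for manual categorization.
    return "Uncategorized", "review"
```

Recording the source alongside the category makes it easy to see later how much work the rules, the model, and manual review each account for.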

Conclusion

As financial transactions grow more complex, combining rules with machine learning offers a scalable, adaptable solution for categorization. Refining preprocessing techniques and iterating on the model can further minimize errors and maximize insights.

Have you faced similar challenges with transaction categorization or other domain-specific problems? Share your experiences and the strategies or tools that worked for you—I’d love to exchange ideas!