Mark van Venrooij

Mark van Venrooij's blog


Quick update

A long overdue update from my side. As mentioned in a previous post, I got married last year. On top of that, we welcomed a newborn daughter this year. So my life has changed a lot in the last 12 months.

Update on my path to financial independence

We saved a bit less than last year’s goal of 25%, basically because we bought a nursery and all the other things my daughter needs. However, having a young daughter easily compensates for that. The savings as a percentage of my FI goal basically hit the target, as the stock market was pretty good. Ignoring any market changes and interest, it would take approximately 36 years to become financially independent, given the total savings of last year. The goals for this year are a bit more modest than last year’s. It should be possible to reach a 20% savings rate, which would bring me to 18% of my FI goal.
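The 36-year estimate can be reproduced with simple arithmetic. The sketch below is one plausible reading of that calculation, not necessarily the exact formula used: the part of the FI goal that is still missing is divided by one year of savings, ignoring returns and interest. The function name and the amounts are hypothetical; the goal is normalised to 100 units so that 14 units correspond to the 14% reached so far.

```python
def years_to_fi(fi_goal, saved_so_far, yearly_savings):
    """Years needed to reach fi_goal at a constant yearly_savings, ignoring returns."""
    if yearly_savings <= 0:
        raise ValueError("yearly_savings must be positive")
    remaining = max(fi_goal - saved_so_far, 0)
    return remaining / yearly_savings


# Hypothetical amounts: goal = 100 units, 14 units saved, 2.4 units added per year.
years = years_to_fi(fi_goal=100, saved_so_far=14, yearly_savings=2.4)
print(round(years))  # 86 / 2.4 is about 35.8, which rounds to 36
```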

Savings rate 2016

Savings rate year: 18.5%
Savings as percentage of FI goal: 14%
Years till financial independence, using the total savings of the last 12 months: 36

Financial goals 2017

Savings rate: 20%
Savings as percentage of FI goal: 18%

Naive Bayesian Classifier on transactions

For a long time I have had the idea to enhance my BudgetApp application with machine learning techniques. I would like to use these techniques to categorize my transactions.

Let me first explain how my current solution works. I have a set of rules that assign a category to a transaction. In principle there are two kinds of rules: rules based on the description of a transaction and rules based on the contra account number of the transaction. The first rule that matches a transaction is used to categorize it.

The rules engine works quite well. About 90% of all transactions are recurring, so these can be categorized automatically. But the moment you get a new kind of recurring transaction, you have to create a new rule so that it will be categorized automatically in the future. My total ruleset now contains about 100 rules. Over a few years I have collected about 1400 transactions that were categorized with this mechanism. After first trying an implementation of the K-nearest-neighbor algorithm, I ended up with a Bayesian classifier implementation, which works quite well.
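To make the first-match behaviour of the rules engine concrete, here is a minimal sketch. It is not the BudgetApp code: the class names, field names and example rules are all hypothetical, and matching on a description is assumed to be a case-insensitive substring test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    description: str
    contra_account: str


@dataclass(frozen=True)
class Rule:
    category: str
    description_contains: Optional[str] = None  # rule on the description
    contra_account: Optional[str] = None        # rule on the contra account

    def matches(self, tx):
        if self.contra_account is not None:
            return tx.contra_account == self.contra_account
        if self.description_contains is not None:
            return self.description_contains.lower() in tx.description.lower()
        return False


def categorize(tx, rules):
    """Return the category of the first matching rule, or None if no rule matches."""
    for rule in rules:
        if rule.matches(tx):
            return rule.category
    return None


# Hypothetical rules; their order matters because the first match wins.
rules = [
    Rule(category="rent", contra_account="NL00RENT0001"),
    Rule(category="groceries", description_contains="supermarket"),
]
print(categorize(Transaction("PIN Supermarket Utrecht", "0"), rules))
```

A transaction that matches no rule comes back as None, which corresponds to the moment a new rule has to be written by hand.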

So two weeks ago I finally decided to try machine learning techniques. With all the categorized transactions I have a great data set for supervised learning, which I can also use to validate my solution. There are a few basic assumptions for my solution. I don’t mind if the chosen solution is unable to classify a transaction, but it should produce very few incorrect classifications. The tool is used to create a budget, and if too many transactions are classified incorrectly, the budget might be incorrect.
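Because an unclassified transaction is acceptable but an incorrect one is not, it helps to tally three outcomes separately during validation. The sketch below assumes a classifier is any function that returns a category or None; the function names and the toy data are hypothetical.

```python
from collections import Counter


def evaluate(classify, labelled):
    """Tally correct, unclassified and incorrect results.

    labelled is an iterable of (features, expected_category) pairs;
    classify(features) returns a category, or None when it cannot decide.
    """
    tally = Counter(correct=0, unclassified=0, incorrect=0)
    for features, expected in labelled:
        predicted = classify(features)
        if predicted is None:
            tally["unclassified"] += 1
        elif predicted == expected:
            tally["correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally


def precision_when_classified(tally):
    """Share of correct answers among the transactions that did get a category."""
    decided = tally["correct"] + tally["incorrect"]
    return tally["correct"] / decided if decided else 0.0


# Toy classifier: a lookup table that knows only two inputs.
toy_classifier = {"a": "x", "b": "q"}.get
toy_tally = evaluate(toy_classifier, [("a", "x"), ("b", "y"), ("c", "z")])
print(dict(toy_tally), precision_when_classified(toy_tally))
```

Keeping "unclassified" out of the precision figure reflects the stated preference: a classifier that abstains often but is rarely wrong is better for budgeting than one that always answers.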

One of the easiest classification algorithms to implement is K-nearest-neighbor. The algorithm might be slow if you have many data points, but my total data set seems small (only 1400 entries). The main part of this algorithm is constructing a distance function that determines how similar the transaction to be categorized is to every other transaction. Looking at the previous solution, I reckon there are two major features that are used to categorize a transaction: the description and the contra account. So the distance function should probably use these two features. I think the contra account is a good place to start. Basically, each different contra account should indicate another category.

While running this on the real data I noticed that I got a lot of incorrect categorizations, and more importantly that it took a few minutes to calculate the category of 10 transactions based on the ‘learned’ input of about 1000 transactions. So the solution is actually pretty slow, even on my small data set. Another problem is that I can’t think of a way to construct a distance function for descriptions. I don’t find it acceptable that categorizing a dozen transactions takes minutes when the old way only takes seconds. I checked whether I could improve my calculations, but found that I hadn’t made any major mistakes. So now I have an extra requirement: the solution should be about as fast as the rules solution.
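For illustration, a K-nearest-neighbor classifier on the contra account alone could look like the sketch below. The distance function actually used is not described in this post, so the 0/1 distance, the function names and the sample data are assumptions.

```python
from collections import Counter


def account_distance(account_a, account_b):
    """Assumed distance: 0 for the same contra account, 1 otherwise."""
    return 0.0 if account_a == account_b else 1.0


def knn_classify(contra_account, training, k=3):
    """Majority vote among the k training rows closest to contra_account.

    training is a list of (contra_account, category) pairs. Every call scans
    and sorts the whole training set, which is why this approach gets slow.
    """
    if not training:
        return None
    nearest = sorted(
        training, key=lambda row: account_distance(contra_account, row[0])
    )[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data.
training = [
    ("NL01", "groceries"),
    ("NL01", "groceries"),
    ("NL02", "rent"),
    ("NL01", "gifts"),
]
print(knn_classify("NL01", training, k=3))
```

Note that with a 0/1 distance an unknown contra account is equally far from every training row, so the vote is effectively arbitrary. That is one way such a classifier can produce many incorrect categorizations.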

Back to the drawing board it is. I need to select a different classification algorithm. After spending some time on the Wikipedia pages, I think a naive Bayesian classifier might help me. It is pretty straightforward to implement. Furthermore, it seems to be quite effective and efficient for similar problems (like spam filtering). A naive Bayesian classifier assumes that, given the category, the presence or absence of a feature is independent of the presence or absence of any other feature. This assumption is often wrong, but in practice it seems to work pretty well. Let’s implement this thing.

Starting with the contra account number: what is the probability that a new transaction with contra account 123 belongs to category x? I think it should be the number of transactions with contra account 123 and category x in the training set, divided by the number of all transactions with contra account 123. I can’t think of a better way to find the probabilities for this feature.

After implementing this, I use about 10% of all known transactions as the training set and try to categorize the remaining transactions to validate my solution. The results are really promising: 533 transactions are categorized correctly, 697 can’t be categorized based on the contra account number (basically the contra account number for these transactions is zero or not in my training set), and only 11 transactions are classified incorrectly. After some investigation of these incorrectly classified transactions, I notice that 10 of them fall in the category “salary” while it should be “expense claim”. As my probability only takes the account number as a feature, this makes sense: I have far more salary payments than expense claim payments. This problem should be resolved if I add more features to my probability calculations. If I add the description, the distinction can probably be made. I might include the transaction amount as well, as these amounts are quite far apart.
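The count-based estimate described above can be sketched as follows. This is a minimal illustration with hypothetical function names and data, not the actual implementation; it returns None for a contra account that does not occur in the training set.

```python
from collections import Counter, defaultdict


def train_account_model(training):
    """Count categories per contra account from (contra_account, category) pairs."""
    counts = defaultdict(Counter)
    for account, category in training:
        counts[account][category] += 1
    return counts


def classify_by_account(model, account):
    """Return (most likely category, estimated probability) for a contra account.

    The probability is count(account and category) / count(account).
    Unknown accounts yield (None, 0.0).
    """
    category_counts = model.get(account)
    if not category_counts:
        return None, 0.0
    category, hits = category_counts.most_common(1)[0]
    return category, hits / sum(category_counts.values())


# Hypothetical data: one account pays both salary and expense claims.
model = train_account_model([
    ("123", "salary"),
    ("123", "salary"),
    ("123", "expense claim"),
    ("456", "rent"),
])
print(classify_by_account(model, "123"))
```

With only the account as a feature, every payment from account 123 is labelled "salary", including the expense claims. This is the same kind of mistake as the 10 misclassified transactions mentioned above.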

Implementing the probability for the descriptions is similar to the contra account number: the number of transactions with word y in the description and category x in the training set, divided by the number of all transactions with word y in the description. This works only for a single word in the description. To combine multiple words, these probabilities should be multiplied.

OK, implemented, time to validate again: 912 transactions are classified successfully, 208 can’t be classified and 121 are classified incorrectly. The incorrect classifications worry me. If I look into the details, I see that most of them are classified as groceries, while the actual category can be many things, e.g. gifts. The problem here is that most of the incorrectly classified transactions are “pin” transactions, as we call them in the Netherlands: transactions paid with my debit card. They are categorized as groceries because their description fields are so similar; they all contain the same words.
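A sketch of the per-word probabilities and their product is shown below. Several details are assumptions that the post does not specify: descriptions are split on whitespace and lower-cased, words that never occur in the training set are ignored, and a transaction stays unclassified when every category ends up with a score of zero. All names and sample descriptions are hypothetical.

```python
from collections import Counter, defaultdict
from math import prod


def tokenize(description):
    return description.lower().split()


def train_word_model(training):
    """Count categories per word from (description, category) pairs."""
    counts = defaultdict(Counter)
    for description, category in training:
        for word in set(tokenize(description)):
            counts[word][category] += 1
    return counts


def classify_by_description(model, description):
    """Multiply count(word and category) / count(word) over all known words."""
    known = [w for w in set(tokenize(description)) if w in model]
    if not known:
        return None
    categories = sorted({c for w in known for c in model[w]})
    scores = {
        category: prod(model[w][category] / sum(model[w].values()) for w in known)
        for category in categories
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical descriptions: most "pin" payments in the training set are groceries.
word_model = train_word_model([
    ("pin albert heijn", "groceries"),
    ("pin albert heijn", "groceries"),
    ("pin bookstore", "gifts"),
    ("salary acme", "salary"),
])
print(classify_by_description(word_model, "pin flowershop"))
```

The last call shows the weakness described above: the only known word is "pin", which mostly occurs in grocery payments, so a debit card payment at an unknown shop is labelled as groceries.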

At this moment the solution cannot replace the ruleset, mainly due to the high number of incorrect classifications. However, I want to include other features as well. The amount should fix some of the problems.

P.S. As proof that the old system is not ideal either: I found 1 transaction that was incorrectly classified in the old system!

Leaky abstractions in JPA/EclipseLink


In our team we use JPA, EclipseLink to be precise. Now that we have started stress testing, we notice something weird: the child relations of an entity are not updated if we only save the child. We can probably prevent this with some configuration, an annotation or some other means, but we use an ORM precisely so that we don’t have to think about these things.

Let me give you some code samples. There are two entities involved: ParentEntity and ChildEntity:

Furthermore, we have a ParentEntityRepository object that manages access to the entities.

Sometimes we need to add a new child to an existing parent. The easiest way to do that is to create a child, set the parent, and save the child.

The problem now is that the new child does not show up under the parent in our front-end, even though in the database the child has been added with the correct parent attached to it. After some time (EclipseLink cache expiration?) the results are correct.

I think this is really disturbing. You use an ORM to have an abstraction from your data layer, but it seems it’s a leaky one. I can’t remember Hibernate having the same behavior, but my memory might be misleading me ;-).

P.S. The solution is to add the child to the parent and then save the parent, but I don’t think that should be necessary.

Coen Damen pointed me to the EclipseLink F.A.Q., which describes another solution: calling refresh() on the parent. Still, I think this should be managed by the framework, as I don’t want to think about database relations and when to refresh an entity.

Devnology codefest: Space Invaders quine

You saved the world

As a participant in the Devnology Code Fest Space Invaders, I knew immediately what my solution should be.

Some time ago I visited the Hot or Not Software Craftsmanship session by @KevlinHenney. In that session Kevlin introduced me to the principle of a quine and showed some examples. One of the examples was the Qlobe – a quine that shows a globe in the source code that rotates 45 degrees when executed.

So my contribution should be a Space Invaders quine. Furthermore, my Ruby skills need some practice, so it will be a Ruby Space Invaders quine. This is also a good excuse to finally finish reading Programming Ruby: The Pragmatic Programmers’ Guide. A few possible game states are shown below; the full solution can be found on GitHub.

Review: The Clean Coder


Last week I finished reading The Clean Coder, written by @unclebobmartin. The book’s subtitle, “A Code of Conduct for Professional Programmers”, summarizes the book quite well.

Throughout the book @unclebobmartin talks about his own experiences during his career. He also illustrates the points he wants to make with excellent example discussions between (project) managers and developers. These example stories are quite entertaining and make the book easy to read.

The first few chapters discuss how a professional developer behaves as a person. It’s all about managing expectations and making sure that you’re in the right state to do your job. After that, some practices that improve the quality of your work are introduced. Example practices are TDD, testing in general and estimation. The last chapters discuss how to behave in a team. An embarrassing paragraph in these chapters is called “Degrees of failure”. In that part @unclebobmartin basically explains how the education system fails to educate software developers.

It’s a nice book to read and has nice suggestions on how software developers can improve their work and become craftsmen. However, many of these tips are not new to me. They can be found in the enormous blob of blogs about agile / lean / etc. development. Only the estimation chapter gave me some really new information. Because the book is written in @unclebobmartin’s distinctive style, I still recommend it. The Pragmatic Programmer is in a different league, though.


© 2017 Mark van Venrooij
