Mark van Venrooij

Mark van Venrooij's blog

Author: Mark van Venrooij (page 2 of 11)

Review 2015, plans 2016

Has another year really flown by? My calendar definitely says yes. So it is time to look back at 2015 and make plans for 2016.


Good things I want to continue

As you can probably see from my previous posts, I am somewhat obsessed with my personal finance situation. At least that is what the people around me tell me. I just like numbers and statistics; in other words I’m a nerd, and I’m perfectly happy with that. At the beginning of 2015 I made plans to be financially independent in approximately 25 years, and I have been able to stick to this plan since I made it. I won’t go into the details here (that sounds worth a post of its own). Some people around me tell me this is impossible. Maybe that is true, but planning for it at least makes me less dependent on my job in the future.

At the beginning of the year I had a job quite far from home. I was lucky to find a job with more responsibilities closer to home. You can find details about this on my LinkedIn profile. I love this job and hope I’m able to keep it in the coming year.

Things I want to improve

One thing I totally failed at in 2015 is blogging. If I look at my post list for 2015 it tells me there are exactly 0 posts so far. So the plan for 2016 is to have a monthly post. Not sure what the content will be, but in this post at least I found 1 subject to explore.

In 2015 I was very busy with my job and lots of other stuff that was really important. I totally forgot to sharpen the knife. So in 2016 I need to spend more time and effort on learning new things. This includes professional skills and personal skills.

Yes, I was busy, and I totally forgot about my photography hobby. I made some nice pictures, but not nearly enough considering the fun it brings me.

Plans 2016

So there are some things to improve on, as I mentioned in my review of 2015: blog at least once a month, learn new things such as machine learning, and spend at least 1 day a month making very nice photos.

The biggest plan, however, is to marry my girlfriend. This will not only affect 2016 but probably the rest of my life. After being together for more than 10 years I feel this is the best way forward. It will affect my plans for financial independence in a negative way, and the planning will cost a lot of time that could harm my other plans, but I think this is totally worth it.

Naive Bayesian Classifier on transactions

For a long time I have had the idea of enhancing my BudgetApp application with machine learning techniques. I would like to use these techniques to categorize my transactions. Let me first explain how my current solution works. I have a set of rules that are used to assign a category to a transaction. In principle there are two kinds of rules: rules based on the description of a transaction and rules based on the contra account number of the transaction. The first rule that matches the transaction is used to categorize it. The rules engine works quite well. About 90% of all transactions are recurring, so these can be categorized automatically. But the moment you get a new kind of recurring transaction, you have to create a new rule so that future occurrences are categorized automatically. My total ruleset now contains about 100 rules. Over a few years I have collected about 1400 transactions that are categorized with the above mechanism. After trying an implementation of the k-nearest-neighbour algorithm, I finally settled on a Bayesian classifier implementation, which works quite well.
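The rule-based approach described above can be sketched roughly as follows. This is a minimal Python illustration and not the BudgetApp code: the `Transaction` and `Rule` types, their field names, and the two example rules are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transaction:
    description: str
    contra_account: str


@dataclass
class Rule:
    category: str
    description_contains: Optional[str] = None  # rule on the description text
    contra_account: Optional[str] = None        # rule on the contra account

    def matches(self, tx: Transaction) -> bool:
        if self.description_contains is not None:
            return self.description_contains.lower() in tx.description.lower()
        if self.contra_account is not None:
            return self.contra_account == tx.contra_account
        return False


def categorize(tx: Transaction, rules: List[Rule]) -> Optional[str]:
    """Return the category of the first matching rule, or None if none match."""
    for rule in rules:
        if rule.matches(tx):
            return rule.category
    return None


# Hypothetical rules, for illustration only.
RULES = [
    Rule(category="salary", contra_account="123"),
    Rule(category="groceries", description_contains="supermarket"),
]
```

Because the first match wins, the order of the rules matters, and a transaction that matches no rule stays uncategorized until a new rule is written for it.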

So 2 weeks ago I finally decided to try using machine learning techniques. With all the categorized transactions I have a great set of data that I can use for supervised learning, and also to validate my solution. There are a few basic assumptions for my solution. I don’t mind if the chosen solution is unable to classify a transaction, but it should make very few incorrect classifications. The tool is used to create a budget, and if too many transactions are incorrectly classified, the budget might be incorrect.

One of the easiest classification algorithms to implement is k-nearest-neighbour. The algorithm might be slow if you have many data points, but my total data set seems small (only 1400 entries). The main part of this algorithm is constructing a distance function that determines how similar the transaction to categorize is to all other transactions. Looking at the previous solution, I reckon there are two major features of a transaction that are used to categorize it: description and contra account. So the distance function should probably use these two things. I think the contra account is a good place to start. Basically, each different contra account should indicate another category. While running this on the real data I noticed that I got a lot of incorrect categorizations, and more importantly that it took a few minutes to calculate the category of 10 transactions based on the ‘learned’ input of about 1000 transactions. So the solution is actually pretty slow, even on my small data set. Another problem is that I can’t think of a way to construct a description distance function. I don’t find it acceptable that categorizing a dozen transactions takes minutes when the old way only takes seconds. I checked whether I could improve my calculations, but found I hadn’t made any major mistakes. Now I have an extra requirement: the solution should be about as fast as the rules solution.
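A minimal Python sketch of the k-nearest-neighbour idea with a distance based on the contra account only. This is not the code from the experiment: the 0/1 distance, the `(contra_account, category)` tuple layout, and the default `k` are assumptions chosen for illustration.

```python
from collections import Counter


def contra_distance(a: str, b: str) -> int:
    """Assumed distance: 0 for the same contra account, 1 otherwise."""
    return 0 if a == b else 1


def knn_category(contra_account, training, k=3):
    """training is a list of (contra_account, category) pairs.

    Returns the majority category among the k nearest training
    transactions. sorted() is stable, so ties in distance keep
    their training order.
    """
    nearest = sorted(
        training, key=lambda row: contra_distance(contra_account, row[0])
    )[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]
```

Every new transaction is compared with every training transaction, so the work grows with the size of the data set. With this particular distance an unknown contra account is equally far from everything and the vote becomes arbitrary, which is one way such a distance function can produce incorrect categorizations.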

Back to the drawing board it is. I need to select a different classification algorithm. After spending some time on the Wikipedia pages, I think a naive Bayesian classifier might help me. It is pretty straightforward to implement. Furthermore, it seems to be quite effective and efficient for similar problems (like spam filtering). A naive Bayesian classifier assumes that the presence or absence of a feature is independent of the presence or absence of any other feature. This assumption is often wrong, but in practice it seems to work pretty well. Let’s implement this thing.

Starting with the contra account number. What is the probability that a new transaction with contra account 123 belongs to category x? I think it should be the number of transactions with contra account 123 and category x in the training set, divided by the number of all transactions with contra account 123. I can’t think of a better way to find the probabilities for this feature. After implementing this, I use about 10% of all known transactions as training set and try to categorize the remaining transactions to validate my solution. The results are really promising: 533 transactions are categorized correctly, 697 can’t be categorized based on the contra account number (basically the contra account number for these transactions is zero or not in my training set), and only 11 transactions are classified incorrectly. After some investigation of these incorrect transactions I notice that 10 of them fall in the category “salary” while it should be “expense claim”. As my probability only takes the account number as a feature, this makes sense: I have far more salary payments than expense claim payments. This problem should be resolved if I add more features to my probability calculations. If I add the description, the distinction can probably be made. I might include the transaction amount as well, as these amounts are quite far apart.
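The count ratio described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual implementation: the function names and the `(contra_account, category)` input format are invented for the example.

```python
from collections import Counter, defaultdict


def train_contra_counts(training):
    """training: iterable of (contra_account, category) pairs."""
    counts = defaultdict(Counter)
    for contra_account, category in training:
        counts[contra_account][category] += 1
    return counts


def category_probabilities(counts, contra_account):
    """Estimate P(category | contra account) as a ratio of counts.

    Returns an empty dict when the account was never seen in training,
    i.e. the transaction cannot be categorized on this feature.
    """
    per_category = counts.get(contra_account)
    if not per_category:
        return {}
    total = sum(per_category.values())
    return {category: n / total for category, n in per_category.items()}


def classify_by_contra(counts, contra_account):
    probabilities = category_probabilities(counts, contra_account)
    if not probabilities:
        return None  # unknown account: leave uncategorized
    return max(probabilities, key=probabilities.get)
```

With three salary payments and one expense claim from the same account, the estimate is 0.75 for salary and 0.25 for expense claim, so the expense claim would be labelled as salary. That is the kind of error a single-feature estimate cannot avoid.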

Implementing the probability for the descriptions is similar to the contra account number: the number of transactions with word y in the description and category x in the training set, divided by the number of all transactions with word y in the description. This works only for a single word in the description. To combine multiple words, these probabilities should be multiplied. OK, implemented, time to validate again: 912 transactions classified successfully, 208 can’t be classified and 121 classified incorrectly. The incorrect classifications worry me. If I look into the details I see that most of them are classified as groceries, but the actual category can be many things, e.g. gifts. The problem here is that most incorrect transactions are “pin” transactions, as we call them in the Netherlands: transactions paid with my debit card. They are categorized as groceries due to similarities in the description field. They all contain the same words.

At this moment the solution cannot replace the ruleset, mainly due to the high number of incorrect classifications. However, I want to include other features as well. The amount should fix some problems.

P.S. As proof that the old system is not ideal either, I found 1 transaction that was incorrectly classified in the old system!
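A rough Python sketch of the word-based estimate with multiplication over the words of a description. It follows the count ratio described in the text rather than a textbook naive Bayes formulation, and it is not the original code: the function names, the whitespace tokenization, and the choice to skip unknown words are assumptions of this example.

```python
from collections import Counter, defaultdict


def train_word_counts(training):
    """training: iterable of (description, category) pairs."""
    counts = defaultdict(Counter)
    for description, category in training:
        for word in set(description.lower().split()):
            counts[word][category] += 1
    return counts


def classify_by_description(counts, description):
    """Multiply per-word estimates of P(category | word) and pick the best.

    Words never seen in training are skipped (an assumption of this
    sketch). Returns None when no known word is present.
    """
    known = [w for w in set(description.lower().split()) if w in counts]
    if not known:
        return None
    categories = set()
    for word in known:
        categories.update(counts[word])
    scores = {}
    for category in categories:
        score = 1.0
        for word in known:
            total = sum(counts[word].values())
            score *= counts[word][category] / total
        scores[category] = score
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

If most training descriptions containing “pin” are groceries, a new “pin” transaction at an unknown shop is scored on that one shared word and ends up as groceries too, which mirrors the misclassifications described above.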

Is income equality in companies a key to success?

In his talk, Richard Wilkinson speaks about the correlation that was found between (in)equality in countries and their performance in healthcare etc. Bottom line: in the developed world, countries with more equality have a longer life expectancy compared to the more unequal countries. The inequality metric is a good indicator for many more “good” things. Correlation is of course not causation, but he also mentions some possible reasons why it could be causation.

Now I wonder if the same thing holds true for companies. How would you handle such research? First you need to find a metric to calculate the (in)equality in a company. One possibility is to compare the best paid employee’s salary with that of the lowest paid employee, e.g. the best paid employee earns 40 times the amount the worst paid employee gets. Maybe this metric is too limited, as it only compares the extremes. Furthermore, you need to determine the success of the company, ideally with some metrics. Profit is of course one, but I think it shouldn’t be the only one.

Searching for some answers to the question, I found an article saying that the CEO of a company should earn about 20 times the salary of the lowest paid worker. I think looking only at the CEO is probably too limited as well. Does anybody know of research that hints at a correlation between equality and company success, or maybe the opposite?

Leaky abstractions in JPA/EclipseLink

In our team we use JPA, EclipseLink to be precise. Now that we have started stress testing, we notice something weird. The child relations of an entity are not updated if we only save the child. We can probably prevent it with some configuration, annotation or other means, but we use an ORM to make sure we don’t have to think about these things.

Let me give you some code samples. There are two entities involved: ParentEntity and ChildEntity:

Furthermore, we have a ParentEntityRepository object that manages access to the entities.

Sometimes we need to add a new child to an existing parent. The easiest way to do that is to create a child, set the parent, and save the child.

The problem now is that the new child is not added to the parent in our front-end, but in the database the child is added and has the correct parent attached to it. After some time (EclipseLink cache expiration?) the results are correct.

I think this is really disturbing. You use an ORM to have an abstraction from your data layer, but it seems it’s a leaky one. I can’t remember Hibernate having the same behavior, though my memory might be misleading me ;-).

P.S. The solution is to add the child to the parent and then save the parent, but I don’t think that should be the case.

Coen Damen pointed me to the EclipseLink F.A.Q. with another solution by calling refresh() on the parent. Still I think it should be managed by the framework as I don’t want to think about database relations and when to refresh an entity.
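The effect, and why refreshing helps, can be mimicked with a toy model in Python. To be clear, this is not JPA or EclipseLink code and says nothing about how EclipseLink is implemented. It only assumes, as suspected above, that the stale result comes from a cached copy of the parent. The class and method names are made up.

```python
class ToyOrm:
    """Toy stand-in for an ORM with an entity cache. Not JPA or EclipseLink."""

    def __init__(self):
        self.child_rows = []    # the "database table": (child_name, parent_id)
        self.parent_cache = {}  # parent_id -> cached list of child names

    def find_children(self, parent_id):
        # Served from the cache when present, like a cached entity.
        if parent_id not in self.parent_cache:
            self.parent_cache[parent_id] = [
                name for name, pid in self.child_rows if pid == parent_id
            ]
        return self.parent_cache[parent_id]

    def save_child(self, name, parent_id):
        # Writes the row but does not touch the cached parent.
        self.child_rows.append((name, parent_id))

    def refresh(self, parent_id):
        # Drop the cached copy so the next read goes to the "database".
        self.parent_cache.pop(parent_id, None)
```

In this model, saving a second child after the parent has been read once leaves `find_children` returning the old list until `refresh` is called, even though the row is already in `child_rows`.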

Coderetreat in December solves quine problem in March

You saved the world - space invaders defeated

Yesterday I spoke to @MMz_ about my Space Invaders quine. As it was a general meeting and a lot of people attended, we didn’t have time to look into the solution properly. So I promised him I would explain the details at a later moment. I think more people would like to know how the technical solution works, so I decided to write this blog post.

The first step in the program is to read the “field”. The last statements in the file are:

Basically the space invaders field is initialized and next_move is called.

Below you can find a simplified version of the next_move code.

So the invaders are moved, a new state of the field is calculated, the darts (bullets/lasers) are moved upwards, a new field is calculated and finally the player puts a new dart in the field as the shot is calculated.

While initializing the SpaceInvaders class, the field is pre-processed. In order to move the invaders and the darts, it is easier if the input field is split into two separate fields: one containing only the darts and another containing only the invaders. This makes moving each type of object a lot easier.

The darts field is a String that contains line-ends, spaces and darts (i). To move all darts up basically the only thing that has to be done is remove the first line of the String and add a new line at the bottom. In code this is like:
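A minimal Python sketch of this step, offered as an illustration and not as the code from the quine. It assumes the darts field has no trailing line end and that all lines have the same width; the function name is made up.

```python
def move_darts_up(darts_field: str) -> str:
    """Shift every dart one row up.

    The top row is dropped and a blank row of the same width is
    appended at the bottom.
    """
    lines = darts_field.split("\n")
    blank = " " * len(lines[-1])
    return "\n".join(lines[1:] + [blank])
```

A dart that was already on the top row simply disappears, because that row is the one being removed.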

The next step is calculating the resulting field. The top line of the field doesn’t change, and neither do the bottom lines starting with the line #XX ……. So these lines are just copied. For the middle part, the dart field and the invaders field are overlaid. For each position in the field, if a dart and an invader are found on the same position, both are removed from the resulting field. The actual overlay code looks like this:
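The overlay idea can be sketched in Python as below. This is a hedged illustration, not the quine’s own code: it assumes both fields have exactly the same shape, that darts are `i`, and that invaders are `*`; the function name and parameters are invented for the example.

```python
def overlay(darts_field: str, invaders_field: str,
            dart: str = "i", invader: str = "*") -> str:
    """Combine two equally sized fields into one.

    A dart and an invader on the same position cancel out and leave
    an empty space.
    """
    result = []
    for d, v in zip(darts_field, invaders_field):
        if d == dart and v == invader:
            result.append(" ")   # hit: both are removed
        elif d == dart:
            result.append(dart)
        else:
            result.append(v)     # invader, space or line end
    return "".join(result)
```

Because the two fields are walked character by character, line ends line up automatically as long as both strings have identical dimensions.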

Next step is to move the invaders. Initially the invaders move right. But if the wall is “touched” they move one line down. After the move down the invaders move left till the left wall is “touched” resulting in an “s”-like pattern.

The current direction of the invaders should be “remembered” by the source code. If you look at the last character of the first line in the field you see the “r”. This is the current direction.

Moving the invaders right is actually quite easy. The rightmost character of the field is certainly an empty place; otherwise the wall would have been “touched” and the invaders would move down. If each line in the invaders field is put in an array of characters and that array is rotated, the invaders are moved right. The only thing left is to make a new string from that array. In code this looks like:

Moving left works similarly, only rotating 1 position to the other side. Moving down is similar to moving darts up. One extra thing for moving invaders down is to detect if any invader reaches the “player line”. If this happens, the user has failed to defend the earth.

The last step is to let the player shoot. When running the program it is possible to supply a number as an argument. This argument is used to determine where the player shoots. Shooting is just replacing the player line with the new version.
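The rotation trick can be illustrated with a short Python sketch. It is not the original code, and the function name is made up; it relies on the standard library’s `collections.deque.rotate`, and on the assumption stated in the text that the rightmost character of each line is empty.

```python
from collections import deque


def move_invaders_right(invaders_field: str) -> str:
    """Rotate every line one position to the right.

    Assumes the rightmost character of each line is an empty space,
    so nothing visible wraps around to the left edge.
    """
    moved = []
    for line in invaders_field.split("\n"):
        chars = deque(line)
        chars.rotate(1)  # positive = right; rotate(-1) would move left
        moved.append("".join(chars))
    return "\n".join(moved)
```

The empty character that falls off the right edge reappears on the left, so the line keeps its width.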

Last step is to make the code a quine. I simplified the next_move method shown above. There are some extra steps. These steps are to see if the player has won/lost and also to calculate & write the next state of the source code. At this moment the new field is calculated. So the “only” thing left is some String manipulation in the source code to replace the old field, with the new field. With a regexp that is quite easy. The code looks like this:
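A minimal Python sketch of such a replacement, assuming the field sits between two marker lines in the source text. The markers `# FIELD-START` and `# FIELD-END` and the function name are hypothetical and only serve the example; the actual quine may delimit its field differently.

```python
import re

# Hypothetical markers delimiting the field inside the source text.
FIELD_PATTERN = re.compile(r"(# FIELD-START\n).*?(\n# FIELD-END)", re.DOTALL)


def replace_field(source: str, new_field: str) -> str:
    """Return the source text with the embedded field swapped for new_field."""
    # A function is used as the replacement so that backslashes in the
    # field are not interpreted as regex escapes.
    return FIELD_PATTERN.sub(
        lambda m: m.group(1) + new_field + m.group(2), source, count=1
    )
```

While experimenting, writing the result to a separate file instead of the source file itself keeps the original intact if the pattern turns out to be wrong.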

Now that you’ve seen my solution, some closing thoughts. The first step in my solution was moving the invaders right. Initially my approach was to read the field and use a regexp to match a line with “*” in it, remove the last space character and add a new space on the left. Getting this to work was not really difficult. After implementing moving down and moving left, I decided to introduce darts and to move these darts up. I couldn’t find out how to get that working. After thinking about it for half an hour or so, I decided I couldn’t fix it. I stopped coding at that moment, to come back to it later. As someone put it before:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The morning after I suddenly had an idea how to solve it. I remembered that during the global day of code retreat in December some people solved Conway’s Game of Life by overlaying multiple copies of the initial state. I’m convinced that attending the global day of code retreat helped me create this quine. Waiting a day to get the solution is for me just like the “Thoughtful walking” technique described in Pragmatic thinking and learning.

P.S. If someone else is trying to make a quine and write it to a file, please make sure you don’t overwrite your source file until you’re ready to release your solution. If you are replacing things in the original source file and you make a mistake (I know you don’t make mistakes), you lose the original version. It happened to me several times. Thank God I had Git and committed a lot. The full code can be found on GitHub.

© 2019 Mark van Venrooij