Mark van Venrooij

Mark van Venrooij's blog

Continuing Domain-Driven Design

tl;dr

In this post I continue my progress towards a DDD version of my favorite pet project. In this iteration I add a way to check the budget against the real expenses. The code is on GitHub and the picture contains the new model.

Continuing Domain-Driven Design

In my previous blog post I started to apply Domain-Driven Design to my favorite learning project. The quick summary of that post is that I started to create a Ubiquitous language and designed two aggregates.

Checking the budget

The main goal for the application is to check my expenditures against a budget for a certain category. To quote (a part of) the ubiquitous language:

To achieve financial insight the application needs to import and categorize financial transactions. I want to set budgets for a year and compare them to the actual amounts.

To check whether my expenditures are still on budget I would like to call a method on Budget such as budget.remaining(). The easiest way to calculate budget.remaining() is to subtract the amount used for a budget from the amount planned.
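A minimal sketch of that calculation in Java. The field names amountPlanned and amountUsed follow the naming chosen later in this post; the updateAmountUsed method is an assumption made for illustration and is not taken from the code on GitHub.

```java
import java.math.BigDecimal;

/** Sketch of a yearly budget that can report how much is left. */
final class Budget {
    private final BigDecimal amountPlanned;
    private BigDecimal amountUsed = BigDecimal.ZERO;

    Budget(BigDecimal amountPlanned) {
        this.amountPlanned = amountPlanned;
    }

    /** Hypothetical setter: the amount used is assumed to be computed elsewhere. */
    void updateAmountUsed(BigDecimal amountUsed) {
        this.amountUsed = amountUsed;
    }

    /** Planned minus used; a negative result means the budget is exceeded. */
    BigDecimal remaining() {
        return amountPlanned.subtract(amountUsed);
    }
}
```

BigDecimal is used here instead of double so that amounts of money are not subject to binary rounding errors.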

To implement this method I need information that is stored in two aggregates. The actual expenditures are stored in the Category aggregate and the budget amount in Budget. I found several ways to implement the method:

  • Add the category repository to the budget aggregate.
  • Rethink the aggregate boundaries.
  • Make an application service to coordinate.
  • Make a domain service that calls the repository from the budget.
  • Using eventual consistency in combination with domain events.

Add the category repository to the budget aggregate

I could add the category repository to Budget, but that mixes responsibilities. It would work, but from what I read online it is generally considered bad practice.

Rethink the aggregate boundaries

It is also possible to rethink my original design of aggregates. I could make budget the aggregate root that includes category and thus the transactions. This would make it easy to calculate the budget remaining. However Budget and Category have a different life cycle. A Budget is bound to a specific year while a Category typically will live for many years. Therefore this option is not the right one.

Application service to coordinate everything and do the calculation

I can create an application service that will coordinate the change and let it update both the category and the budget. In my opinion that sounds too much like procedural coding and is not a viable option.

Make a domain service that calls the repository from the budget

Another option is a domain service that indirectly calls the repository. Linking a domain service to an entity is allowed. However, this is not really different from including the repository directly; it just adds another indirection. Therefore it is not an option for me.

Using eventual consistency in combination with domain events

The final option is raising an event every time a transaction is added. A separate event listener can then update the budget. This way I realize separation of concerns. The downside: some “infrastructure” is needed before it actually works. This is for me the cleanest option and thus the way forward.

However in this option there are some extra things to think about. A category spans multiple years and a budget is specific for a single year. So the question is which budget needs updating. There are several options for solving this.

  1. Loop through all yearly budgets for a specific category and update all budgets based on the new amount.
  2. Make the event specific for a year.

For now I have chosen to loop through the budgets for a specific category, mainly because this is the easiest to implement. Otherwise I would need to compose a complex event targeted at updating budgets, one that includes at least the transaction year or, worse, the whole transaction. That feels like mixing the aggregates a bit. In addition, a more generic event is easier to reuse later if needed. The consequence of this approach is that performance will take a hit, because every yearly budget of the category is updated for each event.

Work to be done

Currently only the field amount is stored in Budget. So now I need to rename amount to amountPlanned and introduce amountUsed. Furthermore I need to create an AmountUsedUpdater that acts as an event listener to update amountUsed. I didn’t use amountSpent because I also have some budgets for my income.
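The listener could be sketched roughly as follows. Only the names AmountUsedUpdater, amountPlanned and amountUsed come from this post. The TransactionAdded event, the list standing in for a budget repository, and the lookup function that returns the amount used per category and year are all assumptions for illustration; how the per-year amount is actually derived is not described here.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical generic event: it names the category only, not the year or the transaction. */
final class TransactionAdded {
    final String categoryName;

    TransactionAdded(String categoryName) {
        this.categoryName = categoryName;
    }
}

/** Minimal budget holding the two amounts described above. */
final class Budget {
    final String categoryName;
    final int year;
    final BigDecimal amountPlanned;
    BigDecimal amountUsed = BigDecimal.ZERO;

    Budget(String categoryName, int year, BigDecimal amountPlanned) {
        this.categoryName = categoryName;
        this.year = year;
        this.amountPlanned = amountPlanned;
    }
}

/** Event listener that refreshes amountUsed on every yearly budget of the category. */
final class AmountUsedUpdater {
    private final List<Budget> budgets; // stands in for a budget repository
    private final BiFunction<String, Integer, BigDecimal> amountUsedFor; // (category, year) -> total

    AmountUsedUpdater(List<Budget> budgets,
                      BiFunction<String, Integer, BigDecimal> amountUsedFor) {
        this.budgets = budgets;
        this.amountUsedFor = amountUsedFor;
    }

    /** Loops through all budgets and updates those that belong to the event's category. */
    void on(TransactionAdded event) {
        for (Budget budget : budgets) {
            if (budget.categoryName.equals(event.categoryName)) {
                budget.amountUsed = amountUsedFor.apply(budget.categoryName, budget.year);
            }
        }
    }
}
```

Because the event carries no year, every yearly budget of the category is recalculated on each event, which is the performance cost accepted above.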

Conclusion

In this blog post I show the way I implement the comparison between the budget and the actual expenditures. The code is on GitHub. The picture below shows the domain model.

Domain-Driven Design model

The new model

Applying Domain-Driven Design

tl;dr

In this post I apply Domain-Driven Design to my favorite pet project. The picture above contains the resulting model. The resulting code can be found on GitHub.

My attempt at Domain-Driven Design

As mentioned in my previous post I need to learn Domain-Driven Design for my new job. In this post I apply what I have learned so far to an application that gives me insight into my financial situation. The domain is very familiar to me as I have used it before to learn new things. The application mainly tells me where I spend my money. Next to that I set yearly budgets for each category at the beginning of a year. During that year I compare the budget against the actual situation to assess whether I’m still on track. Ideally the application also incorporates some functionality to calculate my savings rate and some other metrics related to reaching financial independence.

Ubiquitous language

Eric Evans’ book emphasizes the importance of the ubiquitous language. So let me try to define the ubiquitous language for this application.

To achieve financial insight the application needs to import and categorize financial transactions. I want to set budgets for a year and compare them to the actual amounts. Furthermore based on the actual numbers the savings rate and years till financial independence must be calculated. In order to calculate the years till financial independence there is something to assess my savings and investments.

The bold words seem fundamental concepts to me. These words should play a major role in my domain. Probably I miss a couple of concepts right now, but that is fine. I will refine the ubiquitous language later. The most interesting part of the application to me is the last part: calculating the metrics on FI. However these are a result. In order to achieve FI earlier there are two possibilities:

  • Increase your income
  • Reduce your spending

The application will not help in getting more income. Getting insight into your expenses can help to see when budgets are exceeded and where expenses are high. So for now I first focus on categorizing financial transactions and creating budgets.

Please note that for my personal finances I don’t use double-entry bookkeeping as it seems overkill. Next to that I ignore any cash transactions. Not categorizing cash transactions will not influence my insights much as I don’t use a lot of cash, and keeping all receipts is too much of a burden to me.

Entities versus value objects

After defining the ubiquitous language the next step in Domain-Driven Design is to try to make the model. Important parts of the model are the entities and value objects. As the focus is on categorizing and budgeting I find at least three objects of interest: Category, Transaction and Budget. To me a Category is an entity, mainly because over time the amount of money spent in a Category will change as new transactions are added. Furthermore there must be a Budget, or more specifically a Budget per Category per year. Budgets are entities as pretty much anything about them can change over time.

If I look at Transaction I’m a bit in doubt. During the life cycle of my financial administration a transaction will not change, because transactions originate from an external source: my bank. Next to that I’m only concerned about the values of a Transaction. If I need to make a guess I think I need an account number, a contra account, the date, the amount and the description of the Transaction. These fields are also the important ones in my naive Bayesian classifier to categorize my transactions. For now I choose to make Transaction a value object. This might be revisited later.
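A value object of this kind could be sketched in Java as an immutable class whose equality is based on all of its fields. The five fields are the ones guessed at above; their types and names are assumptions for illustration.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Objects;

/** Immutable value object: two transactions with the same field values are interchangeable. */
final class Transaction {
    private final String accountNumber;
    private final String contraAccount;
    private final LocalDate date;
    private final BigDecimal amount;
    private final String description;

    Transaction(String accountNumber, String contraAccount, LocalDate date,
                BigDecimal amount, String description) {
        this.accountNumber = Objects.requireNonNull(accountNumber);
        this.contraAccount = Objects.requireNonNull(contraAccount);
        this.date = Objects.requireNonNull(date);
        this.amount = Objects.requireNonNull(amount);
        this.description = Objects.requireNonNull(description);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Transaction)) return false;
        Transaction that = (Transaction) other;
        // Note: BigDecimal.equals is scale-sensitive, so 10.0 and 10.00 differ here.
        return accountNumber.equals(that.accountNumber)
                && contraAccount.equals(that.contraAccount)
                && date.equals(that.date)
                && amount.equals(that.amount)
                && description.equals(that.description);
    }

    @Override
    public int hashCode() {
        return Objects.hash(accountNumber, contraAccount, date, amount, description);
    }
}
```

There are no setters: a Transaction that needs to differ is simply a different Transaction.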

Identities & associations

For Category the name is a good identity. There is a unidirectional association from category to zero or more Transactions. Next to that there is a unidirectional association between a Budget and a Category. Because setting a Budget without being able to categorize real Transactions doesn’t make sense, it is not possible to have a Budget without a corresponding Category. Over the years there are many Budgets per Category, however per year there can be only 1. This means that for the identity of the Budget it is probably wise to include the year. To me combining the Category and a year seems a good choice as identity for the Budget.

Aggregates

Time to define some aggregates in the model. Typically a Budget is created at the beginning of each year and Categories are created once (and maybe adjusted once in a while). However during the year I import and categorize my Transactions. For me this leads to a logical split: Budget becomes an aggregate with only one entity, and Category and Transactions form a second aggregate with Category as the aggregate root. As Budget is a separate aggregate root I don’t want to include the Category as an object. I just want to have a reference to a Category through its identity. This has implications for the Budget and its identity. Its identity is now made of a Category name and a year.
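As a sketch, a Budget entity that refers to its Category only by name and is identified by the combination of category name and year could look like this. The method names are hypothetical. Unlike a value object, equality looks only at the identity, so the amount can change without the Budget becoming a different Budget.

```java
import java.math.BigDecimal;
import java.util.Objects;

/** Budget entity: identified by category name plus year; everything else may change. */
final class Budget {
    private final String categoryName; // reference to the Category aggregate by identity only
    private final int year;
    private BigDecimal amount;

    Budget(String categoryName, int year, BigDecimal amount) {
        this.categoryName = Objects.requireNonNull(categoryName);
        this.year = year;
        this.amount = Objects.requireNonNull(amount);
    }

    void changeAmount(BigDecimal newAmount) {
        this.amount = Objects.requireNonNull(newAmount);
    }

    BigDecimal amount() {
        return amount;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Budget)) return false;
        Budget that = (Budget) other;
        return year == that.year && categoryName.equals(that.categoryName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(categoryName, year);
    }
}
```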

Repositories

Time to choose the repositories. My first attempt is a repository per aggregate. Given the requirements this makes sense to me now. In the future I probably want to persist my data somehow. Maybe with a relational database, maybe an object database or maybe just plain old CSV files. For exploring and building the domain this is not relevant. Furthermore some future features may give me more information about which storage technology will suit my needs. For now I will just use a plain old Java implementation of my repository, using the collection classes.
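Such a collection-backed repository might look like the sketch below. The class and method names are assumptions for illustration and are not taken from the code on GitHub; the Category class is reduced to its identity.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Minimal stand-in for the Category aggregate root; only the identity matters here. */
final class Category {
    private final String name;

    Category(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }
}

/** Repository kept entirely in memory, keyed by the category name. */
final class InMemoryCategoryRepository {
    private final Map<String, Category> categoriesByName = new HashMap<>();

    /** Adds the category, or replaces an existing one with the same name. */
    void save(Category category) {
        categoriesByName.put(category.name(), category);
    }

    Optional<Category> findByName(String name) {
        return Optional.ofNullable(categoriesByName.get(name));
    }

    /** Read-only view, so callers cannot bypass save(). */
    Collection<Category> findAll() {
        return Collections.unmodifiableCollection(categoriesByName.values());
    }
}
```

Swapping this for a database-backed implementation later would only require extracting an interface with the same methods.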

Creation of aggregates

Creating a category is straightforward as there are no special requirements here, so a constructor will suffice. However for creating a Budget we need to make sure that there actually is a Category available. To check if a Category exists I need to find the Category by name using the CategoryRepository. I could make an association in Budget to that repository, but that seems bad practice. I think creating a BudgetFactory makes sense in this case.
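A rough sketch of such a factory is shown below. To keep the example self-contained, the lookup in the CategoryRepository is represented by a Predicate on the category name; that, and the choice to throw an IllegalArgumentException, are assumptions for illustration.

```java
import java.math.BigDecimal;
import java.util.function.Predicate;

/** Minimal budget referring to its category by name. */
final class Budget {
    final String categoryName;
    final int year;
    final BigDecimal amount;

    Budget(String categoryName, int year, BigDecimal amount) {
        this.categoryName = categoryName;
        this.year = year;
        this.amount = amount;
    }
}

/** Creates budgets only for categories that are known to exist. */
final class BudgetFactory {
    private final Predicate<String> categoryExists; // e.g. backed by a CategoryRepository lookup

    BudgetFactory(Predicate<String> categoryExists) {
        this.categoryExists = categoryExists;
    }

    Budget create(String categoryName, int year, BigDecimal amount) {
        if (!categoryExists.test(categoryName)) {
            throw new IllegalArgumentException("Unknown category: " + categoryName);
        }
        return new Budget(categoryName, year, amount);
    }
}
```

This keeps the repository dependency in the factory, so the Budget itself stays free of it.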

Conclusion

Given the small number of objects in the domain now I don’t find it necessary to split these classes among different modules yet. The model created seems to cover my basic needs of storing the budgets and categorizing transactions. The complete design is shown below.

The resulting DDD model

Next to that the resulting code, including unit tests, can be found on GitHub. However I didn’t implement comparing budgets to actual expenditures yet. Next to that I only implemented the model so at this moment there is no application to run. These are topics for another blog post.

Learning domain-driven design

At my new job the department is applying Domain-Driven Design. I read Domain-Driven Design – Tackling Complexity in the Heart of Software by Eric Evans a couple of years back. However, as I never used it, I need to learn how to apply it. So in order to really learn DDD I started to reread Eric Evans’ book and plan on reading Implementing Domain-Driven Design by Vaughn Vernon. As theory and practice are only equal in theory, I want to apply DDD to my favorite pet project. To recap an earlier blog post, many years ago I started categorizing all my financial transactions. At first I built a tool myself to give me insight into my spending, later I bought something to do this for me. However this is now the domain I know pretty well and use to learn new things.

Current progress

So far I have read up to chapter 5 of Eric Evans’ book. The key things I learned in the first chapters:

  • Make domain concepts explicit by capturing the knowledge in the design
  • Define a ubiquitous language using domain terms
  • The feedback loop from coding back to the design is often missing
  • The domain must be isolated from the user interface, db glue and other supporting code in order to achieve separation of concerns
  • The smart UI cannot be combined with DDD

Moving forward

My plan is to finish reading both books before the end of the year. (I hope I can find the time, given that I have a young daughter.) At the same time I plan to work on my pet project. As I learned long ago, a good way to learn things is to teach others. So I want to share my progress in implementing DDD on my pet project through this blog. Expect some new blog posts on this topic in the coming weeks: basically me struggling to apply the DDD concepts.

Naive Bayesian Classifier on transactions

For a long time I have had the idea to enhance my BudgetApp application with machine learning techniques. I would like to use these techniques to categorize my transactions. Let me first explain how my current solution works. I have a set of rules that are used to assign a category to a transaction. In principle there are two kinds of rules: rules based on the description of a transaction and rules based on the contra account number of the transaction. The first rule that matches the transaction is used to categorize the transaction. The rules engine works quite well. About 90% of all transactions are recurring so these can be categorized automatically. But the moment you get a new kind of transaction that will be recurring, you have to create a new rule so that it will be categorized automatically in the future. My total ruleset now contains about 100 rules. In a few years I collected about 1400 transactions that are categorized with the above mechanism. After trying an implementation of the K-nearest-neighbor algorithm, I finally use a Bayesian classifier implementation which works quite well.
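The first-match behaviour of such a rules engine can be sketched as follows. All class and method names are hypothetical, and the case-insensitive substring match on the description is an assumption; the post only states that there are description rules and contra account rules, and that the first matching rule wins.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Just the two transaction fields the rules look at. */
final class Transaction {
    final String description;
    final String contraAccount;

    Transaction(String description, String contraAccount) {
        this.description = description;
        this.contraAccount = contraAccount;
    }
}

/** A rule pairs a condition on the transaction with the category to assign. */
final class Rule {
    final Predicate<Transaction> matches;
    final String category;

    Rule(Predicate<Transaction> matches, String category) {
        this.matches = matches;
        this.category = category;
    }

    static Rule descriptionContains(String text, String category) {
        String needle = text.toLowerCase();
        return new Rule(t -> t.description.toLowerCase().contains(needle), category);
    }

    static Rule contraAccountIs(String account, String category) {
        return new Rule(t -> t.contraAccount.equals(account), category);
    }
}

final class RuleEngine {
    private final List<Rule> rules;

    RuleEngine(List<Rule> rules) {
        this.rules = rules;
    }

    /** Returns the category of the first matching rule, or empty if no rule matches. */
    Optional<String> categorize(Transaction transaction) {
        for (Rule rule : rules) {
            if (rule.matches.test(transaction)) {
                return Optional.of(rule.category);
            }
        }
        return Optional.empty();
    }
}
```

Because the first match wins, the order of the rules matters: a more specific rule has to come before a more general one.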

So two weeks ago I finally decided to try using machine learning techniques. With all the categorized transactions I have a great set of data that I can use for supervised learning, and that can also be used to validate my solution. There are a few basic assumptions for my solution. I don’t mind if the chosen solution is not able to classify a transaction, but it should make very few incorrect classifications. The tool is used to create a budget and if too many transactions are incorrectly classified, the budget might be incorrect.

One of the easiest classification algorithms to implement is K nearest neighbour. The algorithm might be slow if you have many data points, but my total data set seems to be small (only 1400 entries). The main part of this algorithm is to construct a distance function that determines how similar the new transaction to categorize is compared to all other transactions. Looking at the previous solution I reckon that there are two major features of a transaction that are used to categorize a transaction: description and contra account. So probably the distance function should use these two things to categorize a transaction. I think that the contra account is a good place to start. Basically each different contra account should indicate another category. While running this on the real data I notice that I get a lot of incorrect categorizations in my set, and more importantly that it takes a few minutes to calculate the category of 10 transactions based on the ‘learned’ input of about 1000 transactions. So the solution is actually pretty slow even on my small data set. Another problem I have is that I can’t think of a way to construct a description distance function. I don’t find it acceptable that categorizing a dozen transactions takes minutes as the old way only takes seconds. After checking if I could improve my calculations I found I didn’t make any major mistakes. Now I have an extra requirement: the solution should be about as fast as the rules solution.

Back to the drawing board it is. I need to select a different classification algorithm. After spending some time on the Wikipedia pages I think a naive Bayesian classifier might help me. It is pretty straightforward to implement. Furthermore it seems to be quite effective and efficient for similar problems (like spam filtering). A naive Bayesian classifier assumes that the absence or presence of a feature is independent of the presence or absence of any other feature. This assumption is often wrong, but in practice it seems to work pretty well. Let’s implement this thing.

Starting with the contra account number. What is the probability that a new transaction with contra account 123 belongs to category x? I think it should be the number of transactions with contra account 123 and category x in the training set divided by the number of all transactions with contra account 123. I can’t think of a better way to find the probabilities on this feature. After implementing this thing I use about 10% of all known transactions as training set and try to categorize the remaining transactions to validate my solution. The results are really promising: 533 transactions are categorized correctly, 697 couldn’t be categorized based on the contra account number (basically the contra account number for these transactions is zero or not in my training set), and only 11 transactions are classified incorrectly. After some investigation on these incorrect transactions I notice that 10 of them fall in the category “salary” while it should be “expense claim”. As my probability only takes the account number as feature this makes sense: both kinds of payment come from the same contra account, and I have far more salary payments than expense claim payments. This problem should be resolved if I add more features to my probability calculations. Probably if I add the description the distinction can be made. I might include the transaction amount as well, as these amounts are quite far apart.
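The counting described above can be sketched in Java as follows. The class and method names are hypothetical, and picking the category with the highest count as the classification is an assumption; the post only defines the probability itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Counts how often each contra account was seen per category in the training set. */
final class ContraAccountClassifier {
    // contra account -> (category -> number of training transactions)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void train(String contraAccount, String category) {
        counts.computeIfAbsent(contraAccount, k -> new HashMap<>())
              .merge(category, 1, Integer::sum);
    }

    /** Transactions with this account and category, divided by all transactions with this account. */
    double probability(String contraAccount, String category) {
        Map<String, Integer> perCategory = counts.get(contraAccount);
        if (perCategory == null) {
            return 0.0;
        }
        int total = 0;
        for (int count : perCategory.values()) {
            total += count;
        }
        return perCategory.getOrDefault(category, 0) / (double) total;
    }

    /** Most frequent category, or empty when the contra account is not in the training set. */
    Optional<String> classify(String contraAccount) {
        Map<String, Integer> perCategory = counts.get(contraAccount);
        if (perCategory == null) {
            return Optional.empty();
        }
        return perCategory.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

With three salary payments and one expense claim from the same contra account, the probability for “salary” is 0.75, so every payment from that account is classified as salary. That is the kind of misclassification observed above.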

Implementing the probability of the descriptions is similar to the contra account number: the number of transactions with word y in the description and category x in the training set divided by the number of all transactions with word y in the description. This works only for a single word in the description. To combine multiple words these probabilities should be multiplied. OK, implemented, time to validate again: 912 transactions classified successfully, 208 can’t be classified and 121 classified incorrectly. The incorrect classifications worry me. If I look into the details I see that most of them are classified as groceries but the actual category can be many things, e.g. gifts. The problem here is that most incorrect transactions are “pin” transactions as we call that in the Netherlands: transactions paid with my debit card. The transactions are categorized as groceries due to similarities in the description field. They all contain the same words.
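A sketch of the per-word probabilities and their product is given below. The names are hypothetical, and two details are assumptions that the post does not specify: descriptions are split on whitespace and lower-cased, and words that never occurred in the training set are skipped instead of forcing the score to zero.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Scores a description for a category by multiplying per-word probabilities. */
final class DescriptionScorer {
    // word -> (category -> number of training transactions whose description contains the word)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void train(String description, String category) {
        for (String word : words(description)) {
            counts.computeIfAbsent(word, k -> new HashMap<>())
                  .merge(category, 1, Integer::sum);
        }
    }

    /** Product of the per-word probabilities; 0.0 when none of the words is known. */
    double score(String description, String category) {
        double product = 1.0;
        boolean anyKnownWord = false;
        for (String word : words(description)) {
            Map<String, Integer> perCategory = counts.get(word);
            if (perCategory == null) {
                continue; // assumption: unseen words are skipped
            }
            int total = 0;
            for (int count : perCategory.values()) {
                total += count;
            }
            product *= perCategory.getOrDefault(category, 0) / (double) total;
            anyKnownWord = true;
        }
        return anyKnownWord ? product : 0.0;
    }

    /** Distinct lower-cased words, so a word counts once per transaction. */
    private static Set<String> words(String description) {
        Set<String> result = new HashSet<>();
        for (String word : description.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

If two of three training descriptions containing “pin” are groceries, a new “pin” payment at an unknown shop scores 2/3 for groceries and 1/3 for gifts. Under these assumptions, that reproduces the bias towards groceries described above.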

At this moment the solution cannot replace the ruleset mainly due to the high number of incorrect classifications. However I want to include other features as well. The amount should fix some problems.

P.S. As proof that the old system is not ideal either, I found one transaction that was incorrectly classified in the old system!

Leaky abstractions in JPA/EclipseLink

In our team we use JPA, EclipseLink to be precise. Now that we have started stress testing we notice something weird. The child relations of an entity are not updated if we only save the child. Probably we can prevent this with some configuration, annotation or other means, but we use an ORM precisely so that we don’t have to think about these things.

Let me give you some code samples. There are two entities involved: ParentEntity and ChildEntity:

Furthermore we have a ParentEntityRepository object that manages access to the entities.

Sometimes we need to add a new child to an existing parent. The easiest way to do that is to create a child, set the parent and save the child.

The problem now is that the new child is not added to the parent in our front-end, but in the database the child is added and has the correct parent attached to it. After some time (EclipseLink cache expiration?) the results are correct.

I think this is really disturbing. You use an ORM to have an abstraction from your data layer, but it seems it’s a leaky one. I can’t remember Hibernate having the same behavior, though my memory might be misleading me ;-).

P.S. The solution is to add the child to the parent and then save the parent, but I don’t think that should be the case.
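The workaround amounts to keeping both sides of the relation in sync in memory before saving. Below is a plain-Java sketch of that idea. The JPA mapping annotations and the repository call are deliberately left out so the example compiles on its own, and the addChild helper is a hypothetical name, not part of the original entities.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java stand-in for the parent entity; JPA annotations are omitted. */
final class ParentEntity {
    private final List<ChildEntity> children = new ArrayList<>();

    /** Updates both sides of the relation, so the parent object already knows its new child. */
    void addChild(ChildEntity child) {
        children.add(child);
        child.setParent(this);
    }

    List<ChildEntity> getChildren() {
        return Collections.unmodifiableList(children);
    }
}

/** Plain-Java stand-in for the child entity. */
final class ChildEntity {
    private ParentEntity parent;

    void setParent(ParentEntity parent) {
        this.parent = parent;
    }

    ParentEntity getParent() {
        return parent;
    }
}
```

After calling parent.addChild(child), the parent would be saved through the repository instead of the child.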

Edit:
Coen Damen pointed me to the EclipseLink F.A.Q. with another solution by calling refresh() on the parent. Still I think it should be managed by the framework as I don’t want to think about database relations and when to refresh an entity.

© 2018 Mark van Venrooij