30.6.15

Performance optimizations in SOA - part 1

It's been a while since my last post here, but during my absence I was mainly working on performance optimizations in SOA (Service Oriented Architecture), and I would like to share some of my conclusions from those few months.

Whenever you have to deal with a Foundation Library (a kind of library that is widely used by many back-ends) and you want to be sure that the pull requests you are integrating do not decrease the overall performance of the library, then apart from strict Code Reviews -> MEASURE, measure and one more time measure your library performance. I spent a week or two building a performance dashboard (Python + JS + CSS/HTML5, plus C++ for the library code) for one of our core libraries, a key library for data encoding/decoding. The final result looks (more or less) like this:


Example of Performance Dashboard for Foundation Library
What to measure? And how to measure it?

Well, here things start to get a bit more complicated. Depending on the language you are using, you may or may not be able to measure memory usage (let's exclude the case where you have your own memory map allocator that pre-allocates memory per object, and you free memory from that allocator manually). So firstly, try to measure the most common use cases that your library provides, usually the API of the Library Manager. Secondly: do you have a cache system in your library? If yes, then measure both the first read and the second read (from cache). As an integrator you need a tool that lets you say whether the library is heading in the right or the wrong direction -> try to build trend graphs that take all previous test runs into account, so that after a few months you can take a look and tell your manager whether the work you put into performance optimizations is going in the right or the wrong direction ;).
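A minimal Python sketch of the first read versus second read measurement could look like the following. ToyLibrary, its read method and the fake decoding work are hypothetical stand-ins invented for illustration, not the real library; only the timing pattern with time.perf_counter is the point.

```python
import time


def time_call(func, *args):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def measure_first_and_second_read(read, key):
    """Time the first (uncached) and the second (cached) read of one key."""
    _, first = time_call(read, key)
    _, second = time_call(read, key)
    return first, second


class ToyLibrary:
    """Hypothetical library with an internal cache (assumption, for the demo)."""

    def __init__(self):
        self._cache = {}
        self.decode_calls = 0

    def _decode(self, key):
        # Pretend this is the expensive decoding step.
        self.decode_calls += 1
        return sum(i * i for i in range(50_000))

    def read(self, key):
        # Only decode when the key is not cached yet.
        if key not in self._cache:
            self._cache[key] = self._decode(key)
        return self._cache[key]


if __name__ == "__main__":
    lib = ToyLibrary()
    first, second = measure_first_and_second_read(lib.read, "some-key")
    print(f"first read:  {first * 1e3:.3f} ms")
    print(f"second read: {second * 1e3:.3f} ms (from cache)")
```

Storing both numbers for every run is what later lets you plot them as two separate lines on the trend graph.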

 
Trend graph for library

If you successfully build your performance dashboard, then you have to ensure that the tests you run are always run under the same conditions. Always use the same machine with the same configuration for performance measurements, and don't measure during core hours, when other people are probably using the same machine. The best option is a separate test environment; otherwise use cron and schedule the tests somewhere in the night hours. To reduce random factors, remove outliers from the graph. Never run just once if you are building trend graphs: run the tests multiple times and build an average, which should visualize the real library trends much better.
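As a rough sketch of "run many times, drop outliers, then average", here is one possible way to do it in Python. The use of Tukey's fences (1.5 times the interquartile range) is my assumption for this example; any other outlier rule would fit the same pattern, and the function names are made up for illustration.

```python
import statistics


def remove_outliers(samples, k=1.5):
    """Keep only samples inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    if len(samples) < 4:
        # Too few points to estimate quartiles in a meaningful way.
        return list(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [s for s in samples if low <= s <= high]


def summarize(samples):
    """Return (mean of the kept samples, number of dropped outliers)."""
    kept = remove_outliers(samples)
    return statistics.mean(kept), len(samples) - len(kept)


if __name__ == "__main__":
    # Timings in milliseconds from repeated runs of one test (made-up data).
    timings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 55.0]
    mean, dropped = summarize(timings)
    print(f"mean = {mean:.2f} ms, dropped {dropped} outlier(s)")
```

The averaged value is what goes onto the trend graph as a single point for that night's run.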

Let's say that you succeeded and you are now able to monitor your library's performance. Now it's time for graph analysis, which, believe me, is the most interesting part of the work when dealing with huge foundation libraries, as it gives you an idea of what is really going on in your library and how your library behaves. Take a look and try to guess what is going on:


Did you see these "fragmentation" jumps? :)

Any guess? Well, I will answer this "issue" in the last image.

Two different paths? What is going on? :)


There could be many reasons for such behavior, but it usually means that some elements depend on other elements (when you deal with Lazy Layers it's very important to remove such dependencies as much as possible -> dependency injection could be a good choice here).

Well, this one is easy: O(n^2)

Frankly speaking, I think we all see what is going on here: a bad algorithm in the BOM model. But maybe we cannot do it better? If you can... refactor immediately!
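If you prefer a number over eyeballing the curve, one option is to estimate the growth exponent from your measurements: fit a straight line to log(time) against log(number of elements), and the slope is roughly the exponent. This is only a sketch under the assumption that time grows like a power of the input size; the function name and the sample data are made up.

```python
import math


def growth_exponent(sizes, times):
    """Least-squares slope of log(time) versus log(size).

    A slope near 1 suggests linear behavior, a slope near 2 suggests O(n^2).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Synthetic timings that grow quadratically with the number of elements.
    sizes = [100, 200, 400, 800, 1600]
    times = [n * n * 1e-6 for n in sizes]
    print(f"estimated exponent: {growth_exponent(sizes, times):.2f}")
```

Real measurements are noisy, so treat the result as a hint for where to look rather than a proof of complexity.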

Fragmentation, and the cache system looks broken?


The fragmentation observed here is really due to internal STL memory allocation (quite often in this case it's connected to boost::multi_index structures; here you can observe how those structures behave). The first and second read (from cache) are OK in this case, as the cache has already been refilled. This is why it's also important to take a look at the numbers on the scale, otherwise we may spend some time searching for an issue where there is no issue at all. Look at the numbers on the scale -> Always.

A so-called mess :))))



If you get graphs like the last one, then you are definitely not the happiest man on earth, because analysis of such graphs is usually highly complicated. We can say that something happens to the library when we reach a specific number of elements, ~2000, but to find out what is really going on and where those outliers come from, you have to spend a bit more time with tools like valgrind and dig deep inside the code. But hey! At least you know where to look. Without measurements you won't be able to reveal most of those performance issues.

Always Measure!