
30.6.15

Performance optimizations in SOA - part 1

It's been a while since my last post here, but during my absence I was mainly working on performance optimizations in SOA (Service Oriented Architecture), and I would like to share with you some of my conclusions from those few months.

Whenever you have to deal with a Foundation Library (a kind of library that is widely used by many back-ends) and you want to be sure that the pull requests you are integrating are not decreasing the overall performance of the library, then apart from strict code reviews -> MEASURE, measure and one more time measure your library's performance. I spent a week or two building our performance dashboard (Python+JS+CSS/HTML5 for the dashboard, C++ for the library code) for one of our core libraries, a key library for data encoding/decoding. The final result looks (more or less) like this:


Example of Performance Dashboard for Foundation Library
What to measure? And how to measure it?

Well, here things start to be a bit more complicated. Depending on the language you are using, you may or may not be able to measure memory usage (let's exclude the case where you have your own memory-map allocator that pre-allocates memory per object and you manually free memory from it). So firstly, try to measure the most common use cases that your library provides, usually the API from the Library Manager. Secondly: do you have a cache system in your library? If yes, then measure the first read and the second read (from cache). As an integrator you have to have a tool that allows you to say whether the library is heading in the wrong or the right direction -> try to build trend graphs taking into account all previous tests, so that after a few months you can take a look and tell your manager whether the work you put into performance optimizations is going in the right or wrong direction ;).

 
Trend graph for library

If you successfully build your performance dashboard, then you have to ensure that the tests you are running always run under the same conditions. Always use the same machine with the same configuration for performance measurements, and don't run them during core hours when other people are probably using the same machine you use for tests. The best option here is a separate test environment; otherwise use cron and schedule the tests somewhere in the nightly hours. To reduce random factors, remove outliers from the graph. Never run just once if you are building trend graphs; run the tests multiple times and build an average, which should visualize the real library trends much better.

Let's say that you succeeded and are now able to monitor your library's performance. Now it's time for graph analysis, which, believe me, is the most interesting part of the work when dealing with huge foundation libraries, as it gives you an idea of what is really going on in your library and how your library behaves. Take a look and try to guess what is going on:


Did you see this "fragmentation" jumps ? :)

Any guess? Well, I will answer this "issue" with the last image.

Two different paths? What is going on? :)


There could be many reasons for such behavior, but it usually means that some elements are dependent on other elements (when you deal with lazy layers it's very important to remove such dependencies as much as possible -> dependency injection could be a good choice here).

Well, this one is easy: O(n^2)

Frankly speaking, I think we can all see what is going on here: a bad algorithm in the BOM model. But maybe we cannot do it better? If you can... refactor immediately!
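For illustration, here is the classic shape of such a refactoring; the duplicate check is a made-up stand-in, not the actual BOM code, but the O(n^2) -> O(n) pattern is the same:

```cpp
#include <unordered_set>
#include <vector>

// O(n^2): for each element, scan all previous elements.
bool has_duplicate_quadratic(const std::vector<int>& parts) {
    for (std::size_t i = 0; i < parts.size(); ++i)
        for (std::size_t j = 0; j < i; ++j)
            if (parts[i] == parts[j]) return true;
    return false;
}

// Refactored to O(n) on average with a hash set.
bool has_duplicate_linear(const std::vector<int>& parts) {
    std::unordered_set<int> seen;
    for (int p : parts)
        if (!seen.insert(p).second) return true; // insert failed -> already seen
    return false;
}
```

On a trend graph, the first version produces exactly the parabola from the image above; the second flattens it into a line.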

Fragmentation, and the cache system looks broken?


The fragmentation observed here is really due to internal STL memory allocation (quite often in this case it's connected to boost::multi_index structures; here you can observe how those structures behave). The first and second read (from cache) are OK in this case, as the cache had already been refilled. This is why it's also important to take a look at the numbers on the scale; otherwise we may spend time searching for an issue where there is no issue at all. Look at the numbers on the scale -> always.

A so-called mess :))))



If you get graphs like the last one, then you are definitely not the happiest man on earth, because analysis of such graphs is usually highly complicated. We can say that something happens with the library when we reach a specific number of elements (~2000), but to find out what is really going on and where those outliers come from, you have to spend a bit more time with tools like valgrind and dig deep inside the code. But hey! At least you know where to look; without measurements you won't be able to reveal most of those performance issues.

Always Measure!



11.3.15

PLT (Procedure Linkage Table) and tcmalloc part 3

Here we go again!
This time let's take a look at what the PLT and GOT have to do with tcmalloc at all. Should we worry when going to production with our new tcmalloc allocator?

In short: YES, we should. Why? Well, let's find out.

At the beginning we need some basic knowledge about dynamic linking under a Linux system with glibc, so let's start with that. To keep it simple I will only consider DT_RUNPATH, not the old and currently deprecated DT_RPATH (but please pay attention: in production environments it's quite common to have old deprecated variables...). When ld.so is trying to load a dynamic shared library, there are a few paths that ld.so will consider, in the following order, and it is indeed very important to understand this when you deal with shared libraries:

  • firstly, paths from the environment variable $LD_LIBRARY_PATH
  • then paths from the caller's DT_RUNPATH (the caller being the ELF object that requires the library, loaded through an ELF dependency or dlopen).
  • DT_RUNPATH is a set of paths, hardcoded into the binary at link time, that help in the path resolution of dynamic libraries at runtime. At link time it's controlled by the -rpath ld flag or the environment variable LD_RUN_PATH.
  • then from the ld.so cache. Libraries present in the cache are found from paths given in /etc/ld.so.conf and files from /etc/ld.so.conf.d/
  • and /lib and /usr/lib as a last resort

OK, we have some basic knowledge. Let's briefly explain what the PLT is, as we will need this knowledge later. One more time, briefly: it's the Procedure Linkage Table. This mechanism is used to speed up process startup. It allows a position-independent code (PIC) object (a dynamically linked shared library) to lazily rely on foreign symbols defined by other ELF objects. Exactly: LAZY. That means a PLT symbol is only resolved the very first time it is used, at the cost of one more indirection before accessing the GOT (Global Offset Table). If your symbol relies only on the GOT, it's faster at runtime but slower in the startup phase; and since usually we use only a small subset of symbols, not all of them, lazy resolution pays off.
A bit more detailed info here:
http://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/

Some symbols in ELF can be declared as weak objects. It means that it's possible to provide a redefinition of that symbol in another ELF object. This is exactly what static linkage does: it keeps only the strong symbols after linkage, so there is no conflict at link time. For dynamic libraries everything looks different, mainly because dynamic linkage knows nothing about weak and strong symbols. When resolving a symbol dynamically, the first encountered definition wins; it could be weak or strong, it does not matter: FIRST always wins. It's clear that order matters when you redefine symbols! So it's clear now that if you want to link your tcmalloc dynamically, you have to link it before glibc, otherwise you are one more time doomed :-)

Now lets dive deeper..

Did you ever wonder why you are able to redefine malloc/free/realloc/calloc? If not, then maybe re-read the text above; if yes, then you should already know. Indeed, all allocation symbols are marked as weak in glibc, and this is why we are able to use tcmalloc instead of the standard one! But this is not the end! Remember that ld.so is still here, and it has to call malloc/free too! Which version will it call when resolving dynamic linkage? The dynamic linker needs malloc, but in order to call malloc there must already be one loaded. Sounds weird, but to link something dynamically you need at least one working malloc before you can redefine it. To solve this chicken-and-egg problem, glibc always calls the PLT version of malloc/calloc/realloc/free, so that the address of the actual implementation can be rewritten at runtime with a single write. At first, malloc@plt points to the glibc malloc implementation. Later on, when the new malloc implementation is loaded and initialized (http://www.delorie.com/gnu/docs/glibc/libc_34.html), malloc@plt will point to the new implementation. Same for free@plt.

Now it's probably a bit clearer compared to our knowledge from the first post about tcmalloc.
Can we now go even deeper? We can, so let's try to find out why you should be very careful when you change the default malloc.

Very early during process startup, ld.so calls _dl_init_paths, which initializes the search paths for the current executable being loaded. It allocates, through malloc@plt (pointing to the glibc malloc), structures to store data from $LD_LIBRARY_PATH and DT_RUNPATH. When looking for a dynamic object, ld.so sequentially calls open_path with paths from $LD_LIBRARY_PATH, then DT_RUNPATH. If for some reason the library could not be loaded using these paths, then the structure is freed using free@plt and the pointer is set to NULL (why is it freed?). What is important here? dlopen! open_path can be called at any time. So if it is called after tcmalloc has been initialized, free@plt points to tc_free, but the data was allocated with glibc's malloc. BAM! core.

What I told you is now fixed in glibc 2.18 (https://www.sourceware.org/ml/libc-alpha/2013-04/msg00308.html). But did you check your glibc version? :-) If it's newer than that, forget about it and go to prod without any worries (really? ;-) )

This fix prevents free@plt calls for the DT_RUNPATH structures (because free@plt could now be tc_free).

To sum up:
  • Check your glibc version
  • Check your LD_LIBRARY_PATH and all other paths
  • Take your time to understand how tcmalloc works and why it works at all. Otherwise don't use it, because you won't be able to say why it fails if it starts to.
What about LD_PRELOAD? Well, just read this presentation (http://lca2009.linux.org.au/slides/172.pdf) to master it.

Credits go to the Amadeus MDW team, which did a good job investigating some of the tricky parts presented here.

Rgds

$(TA)$

Comments $\TeX$ mode $ON$

10.3.15

tcmalloc part 2

We are back again,

In this second 'chapter' about tcmalloc I'm going to show you some results from real production code. I will focus on a simple comparison of malloc and tcmalloc performance in NON-multithreaded software. Is it worth switching from your default malloc to a custom one? I told you in the first malloc post that it's not necessarily obvious... Let's find out.

Firstly, let's check how the plain glibc malloc behaves in our test library:

Let's take a look at some numbers on how long it takes to evaluate some basic library functions:

Performance test create        : 0:6000
Performance test pop           : 0:18000
Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test               : 0:93668000
Performance test Cache         : 0:142341000
Performance test               : 0:150872000 (not hitting the cache)
Performance test for Test      : 0:8803000 (hitting cache)
Performance test for Test      : 0:8770000 (hitting cache)
Performance test for Test      : 0:8797000 (hitting cache)

Take a look at callgrind outputs:





We clearly see that the ld library responsible for dynamic linking is taking most of the processing time, but, just as importantly, malloc itself seems to be working very heavily. Can we gain anything in terms of speed if we simply replace our default malloc with a custom tcmalloc? Of course there are a few configuration options, such as building tcmalloc with a bigger page size (which will in fact use a bit more memory than we need, but should also be a bit faster, as fewer tcmalloc calls are needed in that case). But first, let's look at how tcmalloc behaves with its default configuration. We will not use dynamic linking here; we will run our test scenario using the special LD_PRELOAD variable, to be sure that we don't encounter any bad memory deallocation, which usually leads to a system core.

 For more detailed info about LD_PRELOAD you can go here: https://blog.cryptomilk.org/2014/07/21/what-is-preloading/

In the next post I will try to explain how glibc works and why LD_PRELOAD does the job. Also, it's not exactly true that tcmalloc replaces malloc completely; well, it's partially true... but for now let's come back to our tcmalloc scenario.

TCmalloc results:

Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test               : 0:91996000
Performance test Cache         : 0:138960000
Performance test               : 0:147004000
Performance test for Test      : 0:8265000
Performance test for Test      : 0:8231000
Performance test for Test      : 0:8260000


Sounds promising! 150872000 - 147004000 = 3868000 nanoseconds faster, ~2.56% faster.

Let's also take a look at our callgrind outputs:






libc has a bit less to do, and you won't find malloc/free/calloc on its list; those are now handled by tcmalloc. You can clearly see here how tcmalloc behaves compared to plain malloc. And it's now quite clear (if it wasn't yet?) that pthread is a must for this tcmalloc implementation. You can try a tcmalloc build with the TLS flag off yourself and compare the results; please share them if you have some, it would be nice to see how it behaves. I'm also going to test it with various flags in some spare time.

We've seen that we can gain something even in a single-threaded library. Some internal tests show that we can gain much more when switching to a custom malloc in a multithreaded environment. The choice is yours! But maybe it's also worth considering and comparing jemalloc and lockless? For sure I will post some results from those two in comparison with tcmalloc and malloc. Stay tuned!

In the next post I'm going to show you how ld, libc, LD_PRELOAD and free/malloc work when dealing with dynamic and static libraries, and why LD_PRELOAD works at all.
One more time: stay tuned, and follow me on twitter if you don't want to miss it :-)




Rgds
$(TA)$

Comments $\TeX$ mode $ON$.

tcmalloc

During the last week I had the pleasure of working with the custom memory allocator from google perf tools: http://goog-perftools.sourceforge.net/. It's called tcmalloc, so let me write a few words about it.

Tcmalloc is a replacement for glibc malloc, which today is widely used in most standard operator new implementations. It's faster mainly because it's lock-free, and it's lock-free because it keeps per-thread caches. Here we run into some trouble, mainly because not all systems support the kind of MT that the tcmalloc implementation uses. There is something called TLS (Thread Local Storage), and it could be that your system does not support TLS. If this is the case, you should in my opinion try to find another solution. Why?
Indeed, it's possible to switch off TLS in config.h before you run ./configure on tcmalloc; that will definitely switch off TLS for you, but then you will not see any speed-ups from tcmalloc, so there is no point. Secondly, it's definitely designed for a multithreaded environment; if you are running on one simple thread... well, I will show you some real results in the next post so you can decide whether it's worth your time.

So firstly, when should you use it?
Only when you really have to. Indeed, it sounds nice to replace some basic function with another one and get a 5% average speed-up. But only in theory. In practice it may lead you to many memory disasters, especially when you are dealing with dynamically linked libraries.
Why dynamically linked?
Well, let's say you are working on a huge project with many libraries; it's obvious that compiling your library with tcmalloc won't work correctly with the other libraries. You will quickly see problems related to freeing memory that is not exactly yours (allocated by tcmalloc but deleted by plain free, or the opposite).
There is a way to get rid of this, and it's called LD_PRELOAD: a kind of workaround to ensure that all libraries are loaded with tcmalloc even if they were not compiled with it. Sounds good? Yes, indeed: you just need to do a basic export LD_PRELOAD=/path_to_lib.so and that's it. Rebuild. Have fun. This is currently the only practical way to make it work if you are working on a complex project with many libraries and your backend links to a lot of stuff that is not on your side. LD_PRELOAD ensures that dlopen will take care of any dynamically linked lib and will replace malloc with tcmalloc.

To sum up the most important parts:
  • It's designed to excel in multithreaded environments, but it may also give some good results in a simple single-threaded application, mainly because it gives you the possibility to set the page size manually, so you can configure it to your needs.
  • gperftools provides more than just tcmalloc, but tcmalloc is a base component for the rest.
  • the google folder in the gperftools package is deprecated; please read the .h comments carefully when you configure it. It's quite common for people to link against the old deprecated google folder in the pkg.
  • if you are going to use tcmalloc with other tools from gperftools, please pay attention: there are a lot of things that can go wrong in this case, especially in a 64-bit environment (stack unwinding, dlopen bugs and so on...).
  • the easiest way to use it is not to link it but to load it with LD_PRELOAD, and for now that is also the safest way. It depends on your needs, but in big projects it will usually be the safest option.
  • switch off TLS in config.h if your system does not support it. But if it doesn't, then why would you want to use tcmalloc?
  • DON'T use dlopen to load tcmalloc; it will not work properly, firstly because you will be forced to switch off TLS support and secondly because you will quickly run into memory-freeing problems.
  • if you get an error like:
    ___tls_get_addr: symbol not found
    it means you are doomed and your system is kinda old ;-) You can still fix it by turning off TLS support, but well... didn't I tell you it's not worth it? ;-)
In the next tcmalloc episode I'm going to present some real examples of performance improvements (or not) :-)

Rgds
$(TA)$

Comments $\TeX$ mode $ON$.