11.3.15

PLT (Procedure Linkage Table) and tcmalloc part 3

Here we go again!
This time let's take a look at what PLT and GOT have to do with tcmalloc at all. Should we worry when going to production with our new tcmalloc allocator?

In short: YES, we should. Why? Well, let's find out.

First we need some basic knowledge about dynamic linking under Linux with glibc, so let's start with that. To keep it simple I will only consider DT_RUNPATH, not the old and now deprecated DT_RPATH (but note that in production environments it's quite common to still run into deprecated variables...). When ld.so tries to load a dynamic shared library, there are a few paths it will consider, in the following order, and it is very important to understand this when you deal with shared libraries:

  • first, paths from the environment variable $LD_LIBRARY_PATH
  • then paths from the caller's DT_RUNPATH (the caller being the ELF object that requested the library, either through an ELF dependency or via dlopen). DT_RUNPATH is a set of paths, hardcoded in the binary at link time, that help resolve dynamic libraries at runtime. At link time it is controlled by the -rpath ld flag, or by the environment variable LD_RUN_PATH
  • then the ld.so cache. Libraries present in the cache are found from the paths given in /etc/ld.so.conf and the files in /etc/ld.so.conf.d/
  • and /lib and /usr/lib as a last resort

OK, we have some basic knowledge; now let's briefly explain what the PLT is, as we will need this knowledge later. Once more, briefly: it's the procedure linkage table. This mechanism is used to speed up process startup. It allows position-independent code (PIC) objects (dynamically linked shared libraries) to lazily rely on foreign symbols defined by other ELF objects. Exactly: LAZY. That means a PLT symbol is only resolved the very first time it is used, at the cost of one more indirection before accessing the GOT (global offset table). If a symbol were instead resolved eagerly through the GOT alone, calls would be faster at runtime but startup would be slower, since every symbol would have to be resolved up front even though a program usually uses only a small subset of them.
A bit more detailed info here:
http://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/

Some symbols in an ELF object can be declared as weak. That means it is possible to provide a redefinition of that symbol in another ELF object. This is exactly what static linkage does: when both a weak and a strong definition exist, it keeps only the strong one, so there is no conflict at link time. For dynamic libraries everything looks different, mainly because the dynamic linker knows nothing about weak and strong symbols. When resolving a symbol dynamically, the first encountered definition wins; weak or strong does not matter, FIRST always wins. So order matters when you redefine symbols! It should be clear now that if you want to link your tcmalloc dynamically you have to link it before glibc, otherwise you are once more doomed :-)

Now let's dive deeper...

Have you ever wondered why you are able to redefine malloc/free/realloc/calloc? If not, then maybe go read the text above; if yes, then you should already know. Indeed, all allocation symbols are marked weak in glibc, and this is why we are able to use tcmalloc instead of the standard one! But this is not the end! Remember that ld.so is still here, and it has to call malloc/free too! Which version will it call when resolving dynamic linkage? The dynamic linker needs malloc, but in order to call malloc there must already be one loaded. Sounds weird, but to link something dynamically you need at least one working malloc before you can redefine it. To solve this, glibc always calls the PLT version of malloc/calloc/realloc/free, so that the address of the actual implementation can easily be rewritten at runtime with a single write. At first, malloc@plt points to the glibc malloc implementation. Later on, when the new malloc implementation is loaded and initialized (http://www.delorie.com/gnu/docs/glibc/libc_34.html), malloc@plt will point to the new implementation. Same for free@plt.

Now it's probably a bit clearer compared to what we knew from the first post about tcmalloc.
Can we now go even deeper? We can, so let's try to find out why you should be very careful when changing the default malloc.

Very early during process startup, ld.so calls _dl_init_paths, which initializes the search paths for the executable being loaded. It allocates, through malloc@plt (pointing to glibc malloc), the structures that store the data from $LD_LIBRARY_PATH and DT_RUNPATH. When looking for a dynamic object, ld.so sequentially calls open_path with paths from $LD_LIBRARY_PATH, then DT_RUNPATH. If for some reason the library could not be loaded using these paths, the structure is freed using free@plt and set to NULL (why is it freed?). What is important here? dlopen! open_path can be called at any time. So if this happens after tcmalloc has been initialized, free@plt points to tc_free, but the data was allocated with glibc's malloc. BAM! Core.

What I described is fixed since glibc 2.18 (https://www.sourceware.org/ml/libc-alpha/2013-04/msg00308.html). But did you check your glibc version? :-) If it's newer than that, forget about it and go to prod without any worries (really? ;-) )

The fix prevents the free@plt call for the DT_RUNPATH structures (because free@plt could by then be tc_free).

Sum up:
  • Check your glibc version
  • Check your LD_LIBRARY_PATH and all the other search paths
  • Take your time to understand how tcmalloc works and why it works at all. Otherwise don't use it, because you won't be able to say why it fails when it starts to.
What about LD_PRELOAD? Well, just read this presentation (http://lca2009.linux.org.au/slides/172.pdf) to master it.

Credits go to the Amadeus MDW team, which did a good job investigating some of the tricky parts presented here.

Rgds

$(TA)$

Comments $\TeX$ mode $ON$

10.3.15

tcmalloc part 2

We are back again,

In this second 'chapter' about tcmalloc I'm going to show you some results from real production code. I will focus on a simple comparison of malloc and tcmalloc performance in NON-multithreaded software. Is it worth switching from your default malloc to a custom one? I told you in the first malloc post that it's not necessarily obvious... Let's find out.

First, let's check how the plain glibc malloc behaves in our test library.

Let's take a look at some numbers on how long it takes to evaluate some basic library functions:

Performance test create        : 0:6000
Performance test pop           : 0:18000
Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test               : 0:93668000
Performance test Cache         : 0:142341000
Performance test               : 0:150872000 (not hitting the cache)
Performance test for Test      : 0:8803000 (hitting the cache)
Performance test for Test      : 0:8770000 (hitting the cache)
Performance test for Test      : 0:8797000 (hitting the cache)

Take a look at the callgrind outputs:

[callgrind output screenshots omitted]

We clearly see that the ld library responsible for dynamic loading takes most of the processing time, but, just as importantly, malloc itself seems to be working very hard. Can we gain anything in terms of speed if we simply replace the default malloc with tcmalloc? There are of course a few ways to configure tcmalloc, e.g. with a bigger page size (which will use a bit more memory than we need, but should also be a bit faster, as fewer tcmalloc calls are needed in that case). But first let's look at how tcmalloc behaves with its default configuration. We will not use dynamic linking here; we will run our test scenario using the special LD_PRELOAD variable, to be sure that we won't hit any bad memory deallocation, which usually leads to a core dump.

 For more detailed info about LD_PRELOAD you can go here: https://blog.cryptomilk.org/2014/07/21/what-is-preloading/

In the next post I will try to explain how glibc works and why LD_PRELOAD does the job. Also, it's not exactly true that tcmalloc replaces malloc completely; well, it's partially true... but for now let's come back to our tcmalloc scenario.

tcmalloc results:

Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test               : 0:91996000
Performance test Cache         : 0:138960000
Performance test               : 0:147004000
Performance test for Test      : 0:8265000
Performance test for Test      : 0:8231000
Performance test for Test      : 0:8260000


Sounds promising! 150872000 - 147004000 = 3868000 nanoseconds faster, i.e. ~2.56% faster.

Let's also take a look at our callgrind outputs:

[callgrind output screenshots omitted]

libc has a bit less to do, and you won't find malloc/free/calloc on its list; those are now handled by tcmalloc. You can clearly see here how tcmalloc behaves compared to plain malloc. And it's now quite clear (if it wasn't yet?) that pthread is a must for this tcmalloc implementation. You can try a tcmalloc build with the TLS flag off yourself to compare the results. Please share them if you do; it would be nice to see how it behaves. I'm also going to test it with various flags in some spare time.

We've seen that we can gain something even in a single-threaded library. Some internal tests show that we can gain much more when switching to a custom malloc in a multithreaded environment. The choice is yours! But maybe it's good to also consider and compare jemalloc and lockless? For sure I will post some results from those two in comparison with tcmalloc and malloc. Stay tuned!

In the next post I'm going to show you how ld, libc, LD_PRELOAD and free/malloc work when dealing with dynamic and static libraries, and why LD_PRELOAD works at all.
One more time: stay tuned, and follow me on twitter if you don't want to miss it :-)




Rgds
$(TA)$

Comments $\TeX$ mode $ON$.

tcmalloc

During the last week I had the pleasure of working with the custom memory allocator from Google perf tools: http://goog-perftools.sourceforge.net/. It's called tcmalloc, and let me write a few words about it.

Tcmalloc is a replacement for glibc malloc, which today sits underneath most standard operator new implementations. It's faster mainly because it avoids locks on the fast path: each thread gets its own cache, so most allocations never touch a shared lock. And here we run into trouble, because not all systems support the kind of multithreading the tcmalloc implementation uses. There is something called TLS (Thread Local Storage), and it may be that your system does not support it. If that is the case, you should in my opinion try to find another solution. Why?
It is indeed possible to switch TLS off in config.h before you run ./configure on tcmalloc, and that will definitely switch TLS off for you, but then you will not see any speed-up from tcmalloc, so there is no point. Secondly, it is definitely designed for multithreaded environments; if you are running on one simple thread... well, I will show you some real results in the next post, and then you can decide whether it is worth your time.

So firstly, when should you use it?
Only when you really have to. It sounds nice to replace some basic function with another one and get a 5% average speed-up, but only in theory. In practice it may lead you to many memory disasters, especially when you are dealing with dynamically linked libraries.
Why dynamically linked?
Well, consider that you are working on a huge project with many libraries: compiling just your own library against tcmalloc won't work correctly with the other libraries. You will quickly see problems like trying to free memory that is not exactly yours (allocated by tcmalloc but deleted by plain free, or the opposite).
There is a way around this, and it's called LD_PRELOAD; it's a kind of workaround to ensure that all libraries are loaded with tcmalloc even if they were not compiled against it. Sounds good? Yes indeed: you just need a basic export LD_PRELOAD=/path/to/libtcmalloc.so and that's it. Restart your binary. Have fun. This is currently the only practical way to make it work when you are on a complex project with many libraries and your backend links against a lot of stuff that is not on your side. LD_PRELOAD ensures that every dynamically loaded library, including anything pulled in via dlopen, resolves malloc to tcmalloc.

To sum up most important parts:
  • It is designed to excel in multithreaded environments, but it may also give some good results in a simple single-threaded application, mainly because it lets you set the page size manually, so you can configure it to your needs.
  • gperftools provides more than just tcmalloc, but tcmalloc is the base component for the rest.
  • The google folder in the gperftools package is deprecated; please read the .h comments carefully when you configure it. It's quite common that people link against the old deprecated google folder in the package.
  • If you are going to use tcmalloc with other tools from gperftools, please pay attention: there are a lot of things that can go wrong in that case, especially in a 64-bit environment (stack unwinding, dlopen bugs and so on).
  • The easiest way to use it is not to link it but to use it with LD_PRELOAD, and for now that is the safest way too. It depends on your needs, but in big projects it will usually be the safest way.
  • Switch off TLS in config.h if your system does not support it. But if it doesn't, why would you want to use tcmalloc at all?
  • DON'T use dlopen to load tcmalloc; it will not work properly. Firstly because you will be forced to switch off TLS support, and secondly because you will quickly run into freed-memory problems.
  • If you get an error like:
    ___tls_get_addr: symbol not found
    it means you are doomed, and your system is kinda old ;-) You can still fix it by turning off TLS support, but well... didn't I tell you that it's not worth it? ;-)
In the next tcmalloc episode I'm going to present some real examples of performance improvements (or not) :-)

Rgds
$(TA)$

Comments $\TeX$ mode $ON$.