10.3.15

tcmalloc part 2

We are back again!

In this second 'chapter' about tcmalloc I'm going to show you some results from real production environment code. I will focus on a simple comparison of malloc and tcmalloc performance in non-multithreaded software. Is it worth switching from your default malloc to a custom one? I told you in the first part of the malloc post that it's not necessarily obvious... Let's find out.

First, let's check how the plain glibc malloc behaves in our test library. Here are some numbers showing how long it takes to run a few basic library functions:

Performance test create        : 0:6000
Performance test pop           : 0:18000
Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test               : 0:93668000
Performance test Cache         : 0:142341000
Performance test               : 0:150872000 (not hitting the cache)
Performance test for Test      : 0:8803000 (hitting the cache)
Performance test for Test      : 0:8770000 (hitting the cache)
Performance test for Test      : 0:8797000 (hitting the cache)
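
The numbers look like seconds:nanoseconds. In case you wonder how such timings can be taken, here is a minimal sketch of the kind of harness that produces them; it assumes clock_gettime() with CLOCK_MONOTONIC, and test_create() is only a hypothetical stand-in for the measured library call, not the real test code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* hypothetical stand-in for the library call under test */
static void test_create(void)
{
    void *p = malloc(128);
    free(p);
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    test_create();
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* elapsed time, printed in the seconds:nanoseconds style shown above */
    long ns = (long)(end.tv_sec - start.tv_sec) * 1000000000L
            + (end.tv_nsec - start.tv_nsec);
    printf("Performance test create : %ld:%ld\n",
           ns / 1000000000L, ns % 1000000000L);
    return 0;
}

(On older glibc you may need to link with -lrt for clock_gettime.)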

Take a look at the callgrind output:

[callgrind output for the glibc malloc run]

We clearly see that the ld library responsible for dynamic linking takes most of the processing time, but, just as importantly, malloc itself seems to be working very hard. Can we gain anything in terms of speed if we simply replace the default malloc with tcmalloc? There are of course a few ways to configure tcmalloc, e.g. with a bigger page size (which will use a bit more memory than we need, but should also be a bit faster, as fewer tcmalloc calls are needed in that case). But first let's look at how tcmalloc behaves with its default configuration. We will not link against tcmalloc here; instead we will run our test scenario using the special LD_PRELOAD variable, to be sure we won't run into any bad memory deallocation, which usually ends in a core dump.
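
To make the scenario concrete, such a preloaded run can look like the lines below. This is only a sketch: the library name and install path depend on your distribution or your own gperftools build, and performance_test is a hypothetical stand-in for our test binary. The bigger page size mentioned above is a gperftools configure option; verify the exact name against ./configure --help in your tree.

# run the unmodified binary with tcmalloc preloaded
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so ./performance_test

# optional: build tcmalloc with a larger page size first
./configure --with-tcmalloc-pagesize=32 && make && make install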

For more detailed info about LD_PRELOAD you can go here: https://blog.cryptomilk.org/2014/07/21/what-is-preloading/

In the next post I will try to explain how glibc works and why LD_PRELOAD does the job. Also, it's not exactly true that tcmalloc replaces malloc completely; well, it's partially true... but for now let's come back to our tcmalloc scenario.
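
To give a small taste of why preloading works at all: the dynamic linker resolves symbols from the preloaded library first, so any shared object exporting a malloc symbol wins over the glibc one. Below is a toy interposer in that spirit; it is an illustration only, not how tcmalloc does it, it covers only malloc, and on some glibc versions dlsym itself allocates, so a robust version needs a static fallback buffer:

#define _GNU_SOURCE
#include <stddef.h>
#include <unistd.h>
#include <dlfcn.h>

/* Toy interposer: resolve the "next" malloc (glibc's) once and log each call.
 * write() is used instead of printf so we don't re-enter malloc through stdio. */
void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t) = NULL;
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    write(2, "malloc\n", 7);
    return real_malloc(size);
}

Build it as a shared object and preload it against any dynamically linked binary:

gcc -shared -fPIC -o toy_malloc.so toy_malloc.c -ldl
LD_PRELOAD=./toy_malloc.so ls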

tcmalloc results:

Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test create        : 0:5000
Performance test pop           : 0:18000
Performance test               : 0:91996000
Performance test Cache         : 0:138960000
Performance test               : 0:147004000
Performance test for Test      : 0:8265000
Performance test for Test      : 0:8231000
Performance test for Test      : 0:8260000


Sounds promising! 150872000 - 147004000 = 3868000 nanoseconds faster, i.e. 3868000 / 150872000 ≈ 2.56% faster on the non-cached test.

Let's also take a look at the callgrind output:

[callgrind output for the tcmalloc run]

libc has a bit less to do, and you won't find malloc/free/calloc on its list; those are now handled by tcmalloc. You can clearly see here how tcmalloc behaves compared to plain malloc. And it's now quite clear (if it wasn't already) that pthread is a must to use the tcmalloc implementation. You can try building a tcmalloc configuration with the TLS flag off to compare the results; please share yours if you have some, it would be nice to see how it behaves. I'm also going to test it with various flags when I find some spare time.
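
The exact spelling of that TLS-related switch varies between gperftools versions, so the safest way is to ask the configure script itself:

./configure --help | grep -i tls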

We've seen that we can gain something even in a single-threaded library. Some internal tests show that we can gain much more when switching to a custom malloc in a multithreaded environment. The choice is yours! But maybe it's also worth considering and comparing jemalloc and lockless? I will for sure post some results from those two in comparison with tcmalloc and malloc. Stay tuned!

In the next post I'm going to show you how ld, libc, LD_PRELOAD and free/malloc work when dealing with dynamic and static libraries, and why LD_PRELOAD works at all.
One more time: stay tuned, and follow me on Twitter if you don't want to miss it :-)




Rgds
$(TA)$

Comments $\TeX$ mode $ON$.
