31.7.15

Patricia/Radix tree - in-memory data manipulation

I was planning to write continuation post about SOA optimizations, but 'accidentally' I had a chance in last few weeks to code something a bit more interesting from my developer perspective. Not well known and not frequently and widely used in most companies (apart from those which are dealing with in-memory operations) : Radix trees!

Storing and retrieving strings in memory is a fundamental problem in computer science. The efficiency of string data structures used for this task is of paramount importance for applications such as in-memory databases, text-based search engines and dictionaries. The burst tree( http://www.cs.uvm.edu/~xwu/wie/CourseSlides/Schips-BurstTries.pdf ) is a leading choice for such tasks, as it can provide fast sorted access to strings. The burst trie, however, uses linked lists as substructures which can result in poor use of CPU cache and main memory. Thus, engineering a fast, compact, and scalable tree for strings remains an open problem. We will take a look here at Radix tree, which is one of possible way how to handle this fundamental problem and you can treat it as introduction to in-memory data handling.

To make a long story short, here is the radix tree structure

Radix tree
As we can see, we have here a kind of structural space optimization, and also quite efficient search structure. Apart from what you see on the picture, it is not binary tree. And here in fact magic happens!

Building a simple radix tree is not much different from non binary tree setup, but if we use this structure wisely we can on top of it create a powerfull searching mechanism that can align our tree to some specific group size, meaning that each tree level will try to be as close to this group size. Is it not fantastic ? Imagine that you are working on a node with limited memory, or when using simple radix tree you want to speed up you searches, for both tasks radix tree level grouping algorithm presented below could be a good choice (But there are better choices also! like HAT-Trie for example)

Comparing this structure with any other balanced tree or hashed structures we are gaining in two places. First is space, and second is number of comparision needed to find desired key. Hashmaps are special here cause you can think that they are faster, which is not necesairlly true if you consider hash algorithm then worst-case time is much higher then the one in radix tree.

#include <iostream>
#include <string.h>
#include <stdio.h>
using namespace std;
struct node
{
    string key;
    int len;
    node* link;
    node* next;
    int before;
    node(const string &x) : link(0), next(0)
    {
        len = x.size() + 1; // Used for split purposes,
        key = x;
    }
    node(const string &x, int n) : link(0), next(0)
    {
        key = x.substr(0,n);
        len = n;
    }
    ~node() {}
};
class stringTree
{
    public:
        node* root;
        int target_size;
    public:
        stringTree(int targetSize):root(0), target_size(targetSize){}
        stringTree(const string& x, int targetSize): root(0), target_size(targetSize){root = internal_insert(root,x);}
    public:
        void insert(const string& x)
        {
            if (root)
                internal_insert(root,x);
            else
                root = internal_insert(root,x);
        }
        void display()
        {
            if (root)
            {
                redesign_postorder(root,root,0,0);
                internal_display(root,0);
            }
        }
        bool find(const string &x)
        {
            if (root)
                return internal_find(root, x, 0)==NULL?false:true;
            return false;
        }
    private:
        int countChilds(node* t)
        {
            int size = 1;
            if (t->len == 1 && t->key==" ") // Hack for ' ', same keys are stored as len=1 empty strings
                size--;
            node* iterator = t;
            while (iterator->next)
            {
                if (iterator->next->before < size)
                    iterator->next->before = size;
                iterator = iterator->next;
                size++;
            }
            return size;
        }
        void redesign_postorder(node* oldt, node* t, int levelSize, int groupSize)
        {
            if (t==NULL) return;
            redesign_postorder(t,t->link,levelSize+1,0);
            redesign_postorder(t,t->next,levelSize,groupSize+1);
            if (oldt->link && groupSize==0)
            {
                int numberOf = countChilds(oldt);
                int numberOfInternal = countChilds(t);
                numberOf+=groupSize;
                numberOf+=oldt->before;
                if (numberOf<target_size && ((numberOf+numberOfInternal)<=target_size))
                    join(oldt);
            }
        }
        void internal_display(node* t,int level)
        {
            if (t == NULL) return;
            for (int i = 0; i < level;i++)
                printf("\t");
            printf("node...: %s \n", t->key.c_str());
            internal_display(t->link,level+1);
            internal_display(t->next,level);
        }
        node* internal_insert(node* t, const string &x, int n=0)
        {
            if( !n ) n = x.size()+1;
            if( !t ) return new node(x);
            int k = prefix(x,n,t->key,t->len);
            if( k==0 )t->next = internal_insert(t->next,x,n);
            else if( k<n )
            {
                if( k<t->len )
                    split(t,k);
                t->link = internal_insert(t->link,x.substr(k),n-k);
            }
            return t;
        }
        int prefix(const string &x, int n, const string &key, int m)
        {
            for( int k=0; k<n; k++ )
                if( k==m || x[k]!=key[k] )
                    return k;
            return n;
        }
        void split(node* t, int k)
        {
            node* p = new node(t->key.substr(k),t->len-k);
            p->link = t->link;
            t->link = p;
            string temp = t->key.substr(0,k);
            t->key = temp;
            t->len = k;
        }
        void join(node* t) // Goign to use it during tree merge to met group size conditions
        {
            node* p = t->link;
            string temp = t->key.substr(0,t->len);
            string temp2 = p->key.substr(0,p->len);
            t->key = temp + temp2;
            t->len += p->len;
            t->link = p->link;
            string tempKey = temp;
            int lenghtTemp = t->len;
            while(t->next)
                t=t->next;
            t->next = p->next;
            while(p->next)
            {
                string tempKey2 = p->next->key.substr(0,p->next->len);
                string tempKey3 = tempKey + tempKey2;
                p->next->key = tempKey3;
                p->next->len += lenghtTemp;
                p=p->next;
            }
        }
        node* internal_find(node* t, const string &x, int n=0)
        {
            if( !n ) n = x.size()+1;
            if( !t ) return 0;
            int k = prefix(x,n,t->key,t->len);
            if( k==0 ) return internal_find(t->next,x,n);
            if( k==n ) return t;
            if( k==t->len ) return internal_find(t->link,x.substr(k),n-k);
            return 0;
        }
};


int main() {

    string testowy = "amazon";
    string testowy2 = "amazonadsystem";
    string testowy3 = "amazonwebapps";
    string testowy4 = "amazon";
    string testowy5 = "amazon-adsystem";
    string testowy6 = "amazonwebapps";
    string testowy7 = "aol";

    stringTree test(testowy,3);
    test.insert(testowy2);
    test.insert(testowy3);
    test.insert(testowy4);
    test.insert(testowy5);
    test.insert(testowy6);
    test.insert(testowy7);

    printf("found: %d\n", test.find(testowy2));
    printf("found: %d\n", test.find("bsabsha"));
    printf("============================\n");
    printf("TREE\n");
    printf("============================\n");
   
    test.display();
    
    return 0;

}

Please pay attention that this code is not even close to ideal, it works but it could be done much better in some places. The intention of the code was just to show you how to build such tree. Moreover if you take a closer look at redesign_postorder  function you can clearly see that postorder evaluation is a bit tricky here, cause easier and more intuitive way will be to start from bottom and go level by level up. Can you see some pros and cons of this solution with postorder ? Also did you manage to see that redesign_postorder is indeed doing level grouping here ? Another interesting question could be: Can instead of tree redesign for specific level we can do it during insertion ? More interesting info and interesting structures based on radix trees could be found for example here:

http://lwn.net/Articles/175432/
http://code.dogmap.org/kart/
http://www-db.in.tum.de/~leis/papers/ART.pdf


Comments $\TeX$ mode $ON$

No comments:

Post a Comment