What’s New ?

The Top 10 favtutor Features You Might Have Overlooked

Read More

Ukkonen's Suffix Tree Algorithm (Python, C++ & Java code)

  • Sep 26, 2022
  • 9 Minutes Read
  • Why Trust Us
    We uphold a strict editorial policy that emphasizes factual accuracy, relevance, and impartiality. Our content is crafted by top technical writers with deep knowledge in the fields of computer science and data science, ensuring each piece is meticulously reviewed by a team of seasoned editors to guarantee compliance with the highest standards in educational content creation and publishing.
  • By Manvi Saxena
Ukkonen's Suffix Tree Algorithm (Python, C++ & Java code)

Data Structures and Algorithms are an integral part of today's software and programming world. Every student wants to excel in this field to get a good placement in big companies. Data Structures include not just arrays, graphs, queues, and linked lists but tries as well. In today's blog, we'll be talking about Suffix Trees, their generalization, and creating them using Ukkonen's algorithm with Python, C++ & Java codes. So, let's get started. 

What are Suffix Trees?

You must be aware of strings. Well, Suffix Trees are a popular tool for analyzing these text strings. A Suffix Tree contains all the suffixes of the given text. It can be known as a digital tree that provides the string's structure and uses certain algorithmic methods to find its subsets. These trees are used to solve a wide number of questions such as longest palindrome, pattern matching, and finding distinct substrings in a given string. 

What are Generalized Suffix Trees?

When we work with suffix trees due to their limitation, we're able to work only with the suffixes of a single string. Moving towards reality, we have to deal with several strings and suffixes at a particular time. How do we sort out this dilemma? Here's when the Generalized Suffix Trees come into the picture! As its name, these trees refer to a more generalized form of already existing suffix trees. 

If there are certain strings like S1, S2, S3,...Sk, then the generalized suffix tree of all these strings will have its leaf node in a format different from others. The node labels are in the form of i:j, where, i mean the position of the string and j refers to the string whose suffix the given string is. You can also take a look at the below illustration to gain a better understanding of the concept: 

Generalized Suffix Trees

Are Suffix Trees the same as Tries? 

Many people confuse Suffix Trees and Tries to be the same, however, that is not true. We have given a very good illustration below to understand how these two are different, take a look!

suffix and tries

After viewing the above image, you must know that these two are not the same and should not be treated likewise. A suffix tree is essentially a compressed version of the trie and a much more efficient one too!

Understanding Ukkonen's Algorithm

In 1995, Esko Ukkonen proposed a linear time algorithm for creating suffix trees i.e., Ukkonen's Algorithm. Initially, an implicit suffix tree is created that contains the first character of the string. Then, as you move forward, it keeps on adding the other characters till the tree is complete. The orderly addition of characters in the tree maintains the on-line property of this algorithm. 

*Implicit Trees are those intermediate trees that we encounter while creating the suffix tree of a given string using Ukkonen's Algorithm. 

We understand that the working example of such a complex algorithm could prove to be a very important step for your learning. But, first read the below 2 statements to understand the basics. 

  1. If you are aware of tries, it's going to be really for you. We are essentially building a similar thing so, eventually, you're going to come across root edges, edges coming out to lead to newer edges and, so on. 
  2. But, as we have mentioned in the above section, the edge labels are not single characters. So, while working you have to keep this in mind. 

Creating Suffix Tree using Ukkonen's Algorithm

We'll begin by learning the easy string suffix trees using this algorithm. Here, we'll use a very simple string that has no special characters: abc. Follow the steps below to create its suffix tree:

  1. The algorithm works for the strings in left to right order. For each character of the string, we move ahead by one step. 
  2. So, beginning from left to right, we first encounter  a. So, we create an edge from the root node to a leaf by labeling it as [0,#]. Here, means the current end. Now, this edge that we just created represents the substring starting at 0 and ending at the current end, 1. 

creating suffix tree using ukkonen's algorithm

     3. Moving over to the second position, we expand the current edge of a to ab labeling it as [0,#]. But, since b is also a suffix, we add another edge for it, [1,#]. We did this modification because we are now at position 1 of the string(indexing begins with 0) and ends till the current end. 

creating suffix tree using ukonnen algorithm 2

    4. We again move forward and encounter c. Now, we again have to update and create a new edge as c is also a suffix of the given string. Look down below for what the tree will look like, now.

moving in suffix tree

Pseudocode of Ukkonen's Algorithm

function insert(T, string, start_index, length){
    i = start_index
    while(i < length)
        if T.arr[string[i]]) is NULL
            T.arr[string[i]] = new node()
            T = T.arr[string[i]]
            while(i < length)
                T.label = T.label+string[i]
                i = i+1
            endwhile
            return 
        endif

        j=0, x=i
        temp = T.arr[string[i]]
        while j < temp.label.length and i < length && temp.label[j] = string[i]
            i = i+1
            j = j+1
        endwhile

        if i = length 
            return
        endif

        if j = temp.label.length
            cnt = 0
            for k = 'a' to 'z'
                if temp.arr[k] = NULL
                    cnt = cnt+1
                endif
            endfor
            if cnt = 27
                while i < length
                    temp.label = temp.label + string[i]
                    i = i+1 
                endwhile
            else
                T = temp
            endifelse       
        else
            newInternalNode = new node()    
            k=0
            while k < j 
                newInternalNode.label = newInternalNode.label + temp.label[k]
                k = k+1
            endwhile
            newInternalNode.arr[string[j]] = new node()
            while k < temp.label.length 
                newInternalNode.arr[string[j]].label+=temp.label[k]
                k = k+1
            endwhile
            for k = 'a' to 'z'
                newInternalNode.arr[string[j]].arr[k] = temp.arr[k]
            endfor
            T.arr[string[x]] = newInternalNode  
            T = T.arr[string[x]]
        endifelse
    endwhile
}

function makeSuffixTree(string, length){
    Trie T
        string = concatenate(string, "{")
    for i = 0 to length
        insertInTrie(T, string, i, length)
}

 

Implementation of Suffix Tree

While implementing suffix trees certain things need to be kept in mind. In this portion, we'll be covering every minute detail you will require to implement a suffix tree. The full suffix tree can be expressed in Theta(n) space if all its edges and nodes can be expressed in the same space.

The parent-child connections among the nodes in a tree are both essential and sensitive, so you need to consider that in great detail. Quite often, programmers use sibling lists in which each node has a pointer pointing to the next child. Other solutions include hash maps, balanced search trees, and sorted or unsorted arrays. 

Below, we have given a table of certain options for implementing suffix trees and their time complexity:

Implementation Extraction Insertion Traversal
Hash Map O(1) O(1) O(n)
Sorted Arrays O(logn) O(n) O(1)
Bitwise Sibling Tree O(logn) O(1) O(1)
Balanced Search Tree O(logn) O(logn) O(1)
Hash Map + Sibling List O(1) O(1) O(1)
Sibling Lists O(n) O(1) O(1)

 

Implementation of Suffix Tree with C++ Code

#include 
#include 
#include 

struct Node {
    std::string sub = "";   // a substring of the input string
    std::vector<int> ch;    // vector of child nodes

    Node() {
        // empty
    }

    Node(const std::string& sub, std::initializer_list<int> children) : sub(sub) {
        ch.insert(ch.end(), children);
    }
};

struct SuffixTree {
    std::vector<Node> nodes;

    SuffixTree(const std::string& str) {
        nodes.push_back(Node{});
        for (size_t i = 0; i < str.length(); i++) {
            addSuffix(str.substr(i));
        }
    }

    void visualize() {
        if (nodes.size() == 0) {
            std::cout << "\n";
            return;
        }

        std::function<void(int, const std::string&)> f;
        f = [&](int n, const std::string & pre) {
            auto children = nodes[n].ch;
            if (children.size() == 0) {
                std::cout << "- " << nodes[n].sub << '\n';
                return;
            }
            std::cout << "+ " << nodes[n].sub << '\n';

            auto it = std::begin(children);
            if (it != std::end(children)) do {
                if (std::next(it) == std::end(children)) break;
                std::cout << pre << "+-";
                f(*it, pre + "| ");
                it = std::next(it);
            } while (true);

            std::cout << pre << "+-";
            f(children[children.size() - 1], pre + "  ");
        };

        f(0, "");
    }

private:
    void addSuffix(const std::string & suf) {
        int n = 0;
        size_t i = 0;
        while (i < suf.length()) {
            char b = suf[i];
            int x2 = 0;
            int n2;
            while (true) {
                auto children = nodes[n].ch;
                if (x2 == children.size()) {
                    // no matching child, remainder of suf becomes new node
                    n2 = nodes.size();
                    nodes.push_back(Node(suf.substr(i), {}));
                    nodes[n].ch.push_back(n2);
                    return;
                }
                n2 = children[x2];
                if (nodes[n2].sub[0] == b) {
                    break;
                }
                x2++;
            }
            // find prefix of remaining suffix in common with child
            auto sub2 = nodes[n2].sub;
            size_t j = 0;
            while (j < sub2.size()) {
                if (suf[i + j] != sub2[j]) {
                    // split n2
                    auto n3 = n2;
                    // new node for the part in common
                    n2 = nodes.size();
                    nodes.push_back(Node(sub2.substr(0, j), { n3 }));
                    nodes[n3].sub = sub2.substr(j); // old node loses the part in common
                    nodes[n].ch[x2] = n2;
                    break; // continue down the tree
                }
                j++;
            }
            i += j; // advance past part in common
            n = n2; // continue down the tree
        }
    }
};

int main() {
    SuffixTree("banana$").visualize();
}

 

Output:

+
+-- banana$
+-+ a
| +-+ na
| | +-- na$
| | +-- $
| +-- $
+-+ na
| +-- na$
| +-- $
+-- $

 

Implementation with Java Code

import java.util.ArrayList;
import java.util.List;

public class SuffixTreeProblem {
    private static class Node {
        String sub = "";                       // a substring of the input string
        List<Integer> ch = new ArrayList<>();  // list of child nodes
    }

    private static class SuffixTree {
        private List<Node> nodes = new ArrayList<>();

        public SuffixTree(String str) {
            nodes.add(new Node());
            for (int i = 0; i < str.length(); ++i) {
                addSuffix(str.substring(i));
            }
        }

        private void addSuffix(String suf) {
            int n = 0;
            int i = 0;
            while (i < suf.length()) {
                char b = suf.charAt(i);
                List<Integer> children = nodes.get(n).ch;
                int x2 = 0;
                int n2;
                while (true) {
                    if (x2 == children.size()) {
                        // no matching child, remainder of suf becomes new node.
                        n2 = nodes.size();
                        Node temp = new Node();
                        temp.sub = suf.substring(i);
                        nodes.add(temp);
                        children.add(n2);
                        return;
                    }
                    n2 = children.get(x2);
                    if (nodes.get(n2).sub.charAt(0) == b) break;
                    x2++;
                }
                // find prefix of remaining suffix in common with child
                String sub2 = nodes.get(n2).sub;
                int j = 0;
                while (j < sub2.length()) {
                    if (suf.charAt(i + j) != sub2.charAt(j)) {
                        // split n2
                        int n3 = n2;
                        // new node for the part in common
                        n2 = nodes.size();
                        Node temp = new Node();
                        temp.sub = sub2.substring(0, j);
                        temp.ch.add(n3);
                        nodes.add(temp);
                        nodes.get(n3).sub = sub2.substring(j);  // old node loses the part in common
                        nodes.get(n).ch.set(x2, n2);
                        break;  // continue down the tree
                    }
                    j++;
                }
                i += j;  // advance past part in common
                n = n2;  // continue down the tree
            }
        }

        public void visualize() {
            if (nodes.isEmpty()) {
                System.out.println("");
                return;
            }
            visualize_f(0, "");
        }

        private void visualize_f(int n, String pre) {
            List<Integer> children = nodes.get(n).ch;
            if (children.isEmpty()) {
                System.out.println("- " + nodes.get(n).sub);
                return;
            }
            System.out.println("? " + nodes.get(n).sub);
            for (int i = 0; i < children.size() - 1; i++) {
                Integer c = children.get(i);
                System.out.print(pre + "??");
                visualize_f(c, pre + "? ");
            }
            System.out.print(pre + "??");
            visualize_f(children.get(children.size() - 1), pre + "  ");
        }
    }

    public static void main(String[] args) {
        new SuffixTree("banana$").visualize();
    }
}

 

Output:

? 
??- banana$
??? a
? ??? na
? ? ??- na$
? ? ??- $
? ??- $
??? na
? ??- na$
? ??- $
??- $

 

Implementation with Python Code

class Node:
    def __init__(self, sub="", children=None):
        self.sub = sub
        self.ch = children or []

class SuffixTree:
    def __init__(self, str):
        self.nodes = [Node()]
        for i in range(len(str)):
            self.addSuffix(str[i:])

    def addSuffix(self, suf):
        n = 0
        i = 0
        while i < len(suf):
            b = suf[i]
            x2 = 0
            while True:
                children = self.nodes[n].ch
                if x2 == len(children):
                    # no matching child, remainder of suf becomes new node
                    n2 = len(self.nodes)
                    self.nodes.append(Node(suf[i:], []))
                    self.nodes[n].ch.append(n2)
                    return
                n2 = children[x2]
                if self.nodes[n2].sub[0] == b:
                    break
                x2 = x2 + 1

            # find prefix of remaining suffix in common with child
            sub2 = self.nodes[n2].sub
            j = 0
            while j < len(sub2):
                if suf[i + j] != sub2[j]:
                    # split n2
                    n3 = n2
                    # new node for the part in common
                    n2 = len(self.nodes)
                    self.nodes.append(Node(sub2[:j], [n3]))
                    self.nodes[n3].sub = sub2[j:] # old node loses the part in common
                    self.nodes[n].ch[x2] = n2
                    break # continue down the tree
                j = j + 1
            i = i + j   # advance past part in common
            n = n2      # continue down the tree

    def visualize(self):
        if len(self.nodes) == 0:
            print ""
            return

        def f(n, pre):
            children = self.nodes[n].ch
            if len(children) == 0:
                print "--", self.nodes[n].sub
                return
            print "+-", self.nodes[n].sub
            for c in children[:-1]:
                print pre, "+-",
                f(c, pre + " | ")
            print pre, "+-",
            f(children[-1], pre + "  ")

        f(0, "")

SuffixTree("banana$").visualize()

 

Output:

+-
 +- -- banana$
 +- +- a
 |  +- +- na
 |  |  +- -- na$
 |  |  +- -- $
 |  +- -- $
 +- +- na
 |  +- -- na$
 |  +- -- $
 +- -- $

 

Time Complexity  

Generally, the implementation of Ukkonen's algorithm for creating a suffix tree of a given string takes O(n^2) or O(n^3) time complexity where n refers to the length of the string. But, if we use a few algorithmic techniques, we can reduce its time complexity to O(n)

Conclusion

Suffix trees are used to solve many problems that involve strings in fields like text editing, free text search, pattern matching, etc. Understanding this concept and working with Ukkonen's algorithm can take you one step closer to your goal of getting placed in a good organization. Therefore, keep practicing these types of concepts that we bring to you at FavTutor. 

FavTutor - 24x7 Live Coding Help from Expert Tutors!

About The Author
Manvi Saxena
I'm Manvi Saxena, a technophile, and technical content writer. Having a curious mind helps me gain exposure and explore the latest technologies. Being a coder myself, it feels fantastic to share my knowledge via tech blogs to help others.