random thoughts

cat /dev/random > ideas

Implement a Simple Trie in C++

| Comments

The following introduction is copied from Wikipedia:

In computer science, a trie, also called digital tree or sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.

Trie is a good choice for bioinformatic applications (for DNA sequence alighment). Also because trie supports ordered traversal, outputing all keys in the trie is actually a pre-order traversal. So obviously trie can be used for sorting, for example Burstsort, it’s even faster than quicksort for large data sets ( Worst case: ).

A simple trie tree looks like:

digraph "Graphviz" { root [label=" "] inn [color=red] in [color=red] to [color=red] tea [color=red] ten [color=red] root -> i [label=i] root -> t [label=t] i -> in [label=n] in -> inn [label=n] t -> to [label=o] t -> te [label=e] te -> tea [label=a] te -> ten [label=n] } Graphviz root i i root->i i t t root->t t inn inn in in in->inn n to to tea tea ten ten i->in n t->to o te te t->te e te->tea a te->ten n
naive trie implementation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
#include <iostream>
#include <string>
using namespace std;
class node {
public:
	node() : is_word(false) {
		memset(children, 0, sizeof(children));
	}
	node *children[26];
	bool is_word;
};
class trie {
public:
	trie() : root(0) {}
	void insert(const string &str) {
		if (!root)
			root = new node;
		node *loc = root;
		for (int i=0; i < str.size(); i++) {
			int idx = str[i] - 'a';
			if (!loc->children[idx]) {
				loc->children[idx] = new node;
			}
			loc = loc->children[idx];
		}
		loc->is_word = true;
	}
	void remove (const string &str) {
		node *loc = find_node(str);
		if (loc) loc->is_word = false;
	}
	bool prefix(const string &str) {
		return !str.empty() && find_node(str) != 0;
	}
	bool search(const string &str) {
		node *loc = find_node(str);
		return loc ? loc->is_word : false;
	}
private:
	node *root;
	node *find_node(const string &str) {
		node *loc = root;
		for (int i=0; i < str.size(); i++) {
			int idx = str[i] - 'a';
			if (!loc->children[idx]) return 0;
			loc = loc->children[idx];
		}
		return loc;
	}
};

Problems

The above implemetation does not use space efficiently. The space complexity is , m = 26 for just English characters, and things get even worse when deal with unicode characters.

Follow-ups and references

  1. Tripple array trie and double array trie
  2. http://en.wikipedia.org/wiki/Ternary_search_tree
  3. BK-tree is an interesting data structure for fuzzy string search, popular for spell checking, see also Damn Cool Algorithms, Part 1: BK-Trees
  4. http://en.wikipedia.org/wiki/Patricia_tree
  5. 双数组Trie树原理
  6. A Succinct Trie, this is a very impressive way to encode trie with bit strings, uses the Succinct data structure.
  7. Space-efficient Static Trees and Graphs Guy Jacobson

Comments