后缀树与后缀数组

现在的位置: 首页 > 综合 > 正文

后缀树与后缀数组

2017年12月17日 ⁄ 综合 ⁄ 共 11669字 ⁄ 字号小中大 ⁄ 评论关闭

我很懒的 @ 2008-03-10 16:47

后缀树和后缀数组简直就是 ACM 选手必备的知识啊，我已经在两次比赛中碰到过相关的问题了。我甚至还写过一篇应用的文章，可是我真是井底之蛙啊，那时我还不知道这个叫后缀数组，还有更好的构造算法，还有很多的应用。最近终于好好在这方面扫了个盲，在此小小地总结一下。

假设有一个长度为 n 的字符串 T[0 ... n)；S(i) 表示 T 的从下标 i 开始的后缀，即 T[i ... n)。那么 T 的后缀数组就是把 S(i) ~ S(n - 1) 这 n 个后缀按字典序排好序的一个数组。它对于查找 T 的子串之类的问题是很有用的。问题就在于怎样快速地把这些后缀排好序。

最简单的方法就是把所有 S(i) 快速排序。快速排序本身的时间是 O(n log n)，但是由于排序的对象是一个个字符串，所以每次比较的时间在最差情况下都会变成线性的（也就是 O(n) 的），因此总的时间在最差情况下可能会升到 O(n²) 左右，这就很慢了。对此，我学到了三个更快的算法。

1. Ukkonen 算法

Ukkonen 算法先用 O(n) 的时间构造一棵后缀树，然后再用 O(n) 的时间从后缀树得到后缀数组。在这个网址，介绍了作者 Esko Ukkonen，并列出了他的一些论文；其中的一篇《On-line
construction of suffix-trees》是可以下载的，里面就讲解了什么是后缀树，怎样在 O(n) 的时间内构造它，以及怎样从它得到后缀数组。

不过我一开始还没发现这篇论文，我是从 Dan Gusfield 的《Algorithms on Strings, Trees and Sequences - COMPUTER SCIENCE AND COMPUTATIONAL BIOLOGY》这本书里学到这个算法的。这本书在中国没的卖，想买的话，可以找代购网站去 Amazon 买。我是在 eMule 上搜到并下载的。这本书中的这节内容讲得还可以，虽然我觉得它示例比较少，但是花了点功夫还是看懂了。学会了之后，原作者的论文我就没有仔细看过了，所以没法评论。

Ukkonen 算法还是比较复杂的，代码比较长；而且后缀树这个结构本身也比较费空间。总而言之，虽然该算法在理论上是最快的，后缀树也是一个很优美的结构，但是在许多实际应用中不是很实惠。

然而，一开始我还不知道别的算法时，还是把它实现了出来（代码 1、代码
2）。（我写了两个版本，它们的不同点在于每个节点的子节点的存放方式。代码 1 是用数组，代码 2 是用链表。用数组的话，查找指定的子节点很快，只要 O(1)；但是比较费空间。用链表的话，省空间，但是查找子节点比较慢，只能线性地查找，不过一般情况下问题不大。实际上，我在 PKU
3415 这道题中，用数组反而比用链表慢，可能前者分配空间所花的时间比较多吧。）

2. DC3 算法

我在 Google 上搜到了这篇论文，《Linear Work Suffix Array Construction》，其中介绍了一个可以在 O(3n) 的时间内构造出后缀数组的算法，叫作
DC3 (Difference Cover mod 3) 算法。

该算法的基本原理大致是这样的。针对所有后缀的前 3 个字符，做基数排序；对于不满 3 个字符的后缀，排序时在后面补 0（这里的 0 是结束符，在 T 中不能出现；0 的字典序最优先）；排序时还要包括进从结束符（即 T[n]）开始的后缀 S(n): “000”。如果所有后缀的前 3 个字符都不完全相同，那么这一次就排好了，最后去掉多余的 “000” 后缀（它一定排在第一个），就得到答案了，时间是 O(3n)。如果存在前 3 个字符相同的，则需要生成一个名次数组 R， R(i) 表示 S(i) 在排好序后位于第几名（名次从
1 开始计），接着再用上述方法递归地求 R[0 ... n] 的后缀数组，其结果和 T 的后缀数组是完全对应的，也就是说 S_R(i) 排在第几位，则 S(i) 也应该排在第几位。但问题是如果这样递归层数多了，时间也就大大增加了。

接下来，在上述算法的基础上，需要一个优化。首先，只对满足 i mod 3 = 1 或 i mod 3 = 2 的那些 S(i) 按照前 3 个字符进行基数排序；如果这其中有前 3 个字符相同的，同样也需要递归地求它们的名次数组的后缀数组。排好了 i mod 3 = 1、2 的后缀之后，就可以得到一个总的名次数组 R，其中那些 i mod 3 = 0 的后缀的名次还是未知的。接着对于所有 i mod 3 = 0 的 S(i)，靠 T[i] 和 R(i + 1) 这两个关键字就可以对它们排序了。最后把排好序的
mod 3 = 1、2 和 mod 3 = 0 的后缀归并起来就是答案了。归并的时候，比较两个后缀 S(i) 和 S(j) 的方法也是看它们的前 3 个字符，如果都相同，那么比较 R(i + 1) 和 R(j + 1)，若不可比（其中有一个是未知的）则再比较 R(i + 2) 和 R(j + 2)。

有了以上的优化，即使当中出现了需要递归的情况，每次递归求解的字符串长度也只有原来的 2 / 3，那么即使递归的层数再多，总的时间之和也是会收敛的。

以上我只是潦草地介绍一下，具体的还是自己看论文吧。论文写得还是蛮清楚的。尤其是最后有一个用 C++ 实现的代码，其中有很多细节实现地很巧妙，很值得学习。

3. 倍增算法

我是从 IOI 2004 国家集训队论文集中的一篇名为《后缀数组》的文章中学到这个算法的。该文章在 Google 上搜得到，讲得还是蛮清楚的。我在此就不多介绍了，请自己看文章。

倍增算法最大的优点是实现简单，速度也还可以，O(n log n)。如果程序的时间要求不是很紧的话，应该作为首选的算法。这里是我对倍增算法的实现。

4. 多个字符串的后缀数组

在很多问题中，都需要求多个字符串的后缀数组，也就是把多个字符串的所有后缀都放在一起排序。这个结构对于查找公共子串之类的问题是很有用的。后缀树是可以表示多个字符串的，但是 DC3 算法和倍增算法都只能求单个字符串的后缀数组。

其实多个字符串的后缀数组可以转化成单个字符串的后缀数组。比如要求 “abc” 和 “def” 这两个字符串的后缀数组，可以转化成求 “abc1def” 的后缀数组。其中 1 是字典顺仅次于结束符 0 的字符，它也不出现在任何字符串中。这样求出来的后缀数组和 “abc” 与 “def” 的后缀数组是等价的；只是多了一个以 1 开头的后缀，但它一定排序在最前面，很容易去掉。在倍增算法中，用 0 替代 1 好像也可以；在 DC3 算法中好像不能用 0 替代 1，但是我忘记怎么重现那个错误了，所以现在也不好说。但是用
1 肯定是没错的，这样符合 “结束符在字符串中不出现” 的原则。

这篇文章我写得比较潦草，因为我引用的几篇文章本身都写得很清楚了，我确实没有什么新发现。所以到此为止吧。

关键词/Tags: acm 后缀数组后缀树 pku 倍增算法 ukkonen dc3

曾经的这一天...

[代码] 用 Ukkonen 算法构造后缀树/后缀数组，O(n)，用数组存放子节点

我很懒的 @ 1985-08-05 15:30

本文是这篇文章的附件。

/////////////////////////////////////////////////////////////////
//Suffix Tree and Suffix Array with Ukkonen Algorithm.
//Store child nodes in array.
/////////////////////////////////////////////////////////////////
#include <cstring>//strlen, memset.
#include <list>//used by suffix tree node.
#include <vector>//used by suffix tree.

using namespace std;

struct Suffix { const char* str; int from; };

const int ALPHABET_SIZE = 26 * 2 + 1;

struct InNode;

struct SfxNode //Suffix tree node
{   const char *l, *r;//[l, r): edge label from parent to this node.
    InNode* prnt;//parent
    virtual bool isLeaf() = 0; };

struct InNode: public SfxNode {//Internal node (non-leaf node)
    //for character X and string A, if this node's label is XA,
    //then the suffix link is the node with label A, if any.
    InNode* sfxLink;//suffix link
    SfxNode* ch[ALPHABET_SIZE];//children
    SfxNode*& child(char c)
        { if ('{post.content}' == c) { return ch[0]; }
          return c < 'a'? ch[c - 'A' + 1]: ch[c - 'a' + 27]; }
    bool isLeaf() { return false; }
};

struct Leaf: public SfxNode//Suffix tree leaf
{ list<int> from;//which string does this suffix belong to.
bool isLeaf() { return true; } };

InNode g_internal[200000 + 100]; //Ask for memory once and allocate
Leaf g_leaf[200000 + 100]; //them my self, to make the
int g_inCnt = 0, g_leafCnt = 0; //tree destruction fast.

InNode* newInNode(const char* l = NULL, const char* r = NULL)
{   InNode* p = &g_internal[g_inCnt++];
    p->l = l; p->r = r; p->sfxLink = p->prnt = NULL;
    memset( p->ch, 0, sizeof(p->ch) );
    return p; }

Leaf* newLeaf(const char* l = NULL, const char* r = NULL)
{   Leaf* p = &g_leaf[g_leafCnt++];
    p->l = l; p->r = r; p->from.clear();
    return p; }

list<int> g_stack;//A stack for the DFS of the tree.

class SuffixTree {
public:
SuffixTree(): m_root( newInNode() ), m_texts(), m_lens() {}
~SuffixTree() { clear(); }

    //Don't free the space of the added string
    //before the last string is added.
    void addText(const char* text) {
        m_text = m_i = text; m_leafCnt = 0; m_p = m_root;
        m_root->l = m_root->r = m_text;
        m_len = strlen(text);
        for (int i = 0; i <= m_len; i++) {
            m_newIn = NULL;
            for (int j = m_leafCnt; j <= i; j++)
                { if ( !extend(m_text + j, m_text + i) ) { break; } }
        }
        m_texts.push_back(m_text); m_lens.push_back(m_len);
    }

void clear() { g_inCnt = g_leafCnt = 0; m_root = newInNode();
m_texts.clear(); m_lens.clear(); }

    //Write the two arrays to construct a suffix array:
    //sfx: the suffixes in lexigraphical order.
    //lcp[i]: longest common prefix of sfx[i - 1] and sfx[i].
    void toSuffixArray(Suffix* sfx, int* lcp) const {
        Node* p = m_root;
        int i = 0, depth = 0, sfxI = 0, cp = 0;
        g_stack.clear(); g_stack.push_back(0);
        while ( !g_stack.empty() ) {
            if ( p->isLeaf() ) {
                Leaf* leaf = (Leaf*)p;
                if (depth > 1) {
                    for (list<int>::iterator it = leaf->from.begin();
                         leaf->from.end() != it; it++)
                    {   sfx[sfxI].from = *it;
                        sfx[sfxI].str = m_texts[*it]+m_lens[*it]-depth+1;
                        lcp[sfxI++] = cp; cp = depth - 1; }
                }
                i = g_stack.back(); i++; g_stack.pop_back();
                depth -= p->r - p->l; p = p->prnt; cp = depth;
            }
            else {
                InNode* in = (InNode*)p;
                while ( i < ALPHABET_SIZE && !in->ch[i] ) { i++; }
                if (ALPHABET_SIZE == i)//All children are visited.
                {   i = g_stack.back(); i++; g_stack.pop_back();
                    depth -= p->r - p->l; p = p->prnt; cp = depth; }
                else { p = in->ch[i]; depth += p->r - p->l;
                       g_stack.push_back(i); i = 0; }
            }
        }
    }

private:
    typedef SfxNode Node;

    //Go along string m_text[l, r) starting from node p.
    void goStr(const char* l, const char* r) {
        m_i = m_p->r;
        while (l < r)
        {   m_p = ( (InNode*)m_p )->child(*l);//There must be a child.
            if (r-l <= m_p->r - m_p->l) { m_i = m_p->l + (r-l); l=r; }
            else { m_i = m_p->r; l += m_p->r - m_p->l; } }
    }

    //Return true if new leaf is added.
    bool extend(const char* i, const char* r) {
        if (m_i < m_p->r) {
            const char* l;
            if (*m_i == *r) {//implicit extension, no new leaf added.
                if (*r) { m_i++; return false; }
                ( (Leaf*)m_p )->from.push_back( m_texts.size() );
                l = r - (m_p->r - m_p->l - 1);
            }
            else {
                //Insert a new internal node and add a new leaf.
                InNode* in = newInNode(m_p->l, m_i);
                m_p->prnt->child(*m_p->l) = in; in->prnt = m_p->prnt;
                in->child(*m_i) = m_p; m_p->prnt = in; m_p->l = m_i;
                Leaf* leaf = newLeaf(r, m_text + m_len + 1);
                in->child(*r) = leaf; leaf->prnt = in;
                leaf->from.push_back( m_texts.size() ); m_leafCnt++;
                //This new internal node may be suffix link of others.
                if (m_newIn) { m_newIn->sfxLink = in; }
                m_p = m_newIn = in;
                l = r - (m_p->r - m_p->l);
            }
            //Find the position of next extension.
            InNode* p = m_p->prnt; m_p = p;
            if (p->sfxLink) { m_p = p->sfxLink; } else { l++; }
            goStr(l, r);
        }
        else {//in condition that m_i == m_p->r
            InNode* p = (InNode*)m_p;//now m_p must be internal.
            if (m_newIn) { m_newIn->sfxLink = p; m_newIn = NULL; }
            Node* ch = p->child(*r);
            if (ch)
            {   if (*r) { m_p = ch; m_i = m_p->l + 1; return false; }
                ( (Leaf*)ch )->from.push_back( m_texts.size() ); }
            else { Leaf* leaf = newLeaf(r, m_text + m_len + 1);
                   p->child(*r) = leaf; leaf->prnt = p; m_leafCnt++;
                   leaf->from.push_back( m_texts.size() ); }
            if (i < r) { m_p = p->sfxLink; goStr(NULL, NULL); }
        }
        return true;
    }

    InNode* m_root;
    vector<const char*> m_texts; vector<int> m_lens;
    //the following members are to help the extensions.
    Node* m_p; InNode* m_newIn;
    const char *m_i, *m_text;
    int m_leafCnt, m_len;
};

//Test suite and usage example
#include <iostream>
int main() {
    Suffix sa[100]; int lcp[100];
    char a[] = "xabxa", b[] = "babxba";
    SuffixTree t; t.addText(a); t.addText(b);
    t.toSuffixArray(sa, lcp);
    int cnt = strlen(a) + strlen(b);
    for (int i = 0; i < cnt; i++)
    {   cout << sa[i].from <<" "<< sa[i].str <<" "<<lcp[i] << endl; }
    return 0; //output: 0 a 0
              //        1 a 1
              //        0 abxa 1
              //        1 abxba 3
              //        1 ba 0
              //        1 babxba 2
              //        0 bxa 1
              //        1 bxba 2
              //        0 xa 0
              //        0 xabxa 2
              //        1 xba 1
}

关键词/Tags: acm 后缀数组后缀树

代码] 用倍增算法构造后缀数组，O(n log n)

我很懒的 @ 1985-08-05 17:04

本文是这篇文章的附件。

/////////////////////////////////////////////////////////////////
//Constructing Suffix Array with Doubling Algorithm, O(n log n).
/////////////////////////////////////////////////////////////////
#include <algorithm>//sort
#include <cstring>//memset

using namespace std;

const int MAX_SFX = 210000;

struct Sfx {
    int i; int key[2];
    bool operator < (const Sfx& s) const
    {   return key[0] < s.key[0]
               || key[0] == s.key[0] && key[1] < s.key[1]; }
};

int g_buf[MAX_SFX + 1];
Sfx g_tempSfx[2][MAX_SFX], *g_sa = g_tempSfx[0];

void cSort(Sfx* in, int n, int key, Sfx* out) {
    int* cnt = g_buf; memset( cnt, 0, sizeof(int) * (n + 1) );
    for (int i = 0; i < n; i++) { cnt[ in[i].key[key] ]++; }
    for (int i = 1; i <= n; i++) { cnt[i] += cnt[i - 1]; }
    for (int i = n - 1; i >= 0; i--)
        { out[ --cnt[ in[i].key[key] ] ] = in[i]; }
}

//Build a suffix array from string 'text' whose length is 'len'.
//write the result into global array 'g_sa'.
void buildSA(char* text, int len) {
    Sfx *temp = g_tempSfx[1];
    int* rank = g_buf;
    for (int i = 0; i < len; i++)
        { g_sa[i].i = g_sa[i].key[1] = i; g_sa[i].key[0] = text[i]; }
    sort(g_sa, g_sa + len);
    for (int i = 0; i < len; i++) { g_sa[i].key[1] = 0; }
    int wid = 1;
    while (wid < len) {
        rank[ g_sa[0].i ] = 1;
        for (int i = 1; i < len; i++)
        {   rank[ g_sa[i].i ] = rank[ g_sa[i - 1].i ];
            if ( g_sa[i-1] < g_sa[i] ) { rank[ g_sa[i].i ]++; } }
        for (int i = 0; i < len; i++)
            { g_sa[i].i = i; g_sa[i].key[0] = rank[i];
              g_sa[i].key[1] = i + wid < len? rank[i + wid]: 0; }
        cSort(g_sa, len, 1, temp); cSort(temp, len, 0, g_sa);
        wid *= 2;
    }
}

int getLCP(char* a, char* b)
{ int l=0; while(*a && *b && *a==*b) { l++; a++; b++; } return l; }

void getLCP(char* text, Sfx* sfx, int len, int* lcp) {
    int* rank = g_buf;
    for (int i=0, r=0; i < len; i++, r++) { rank[ sfx[i].i ] = r; }
    lcp[0] = 0;
    if (rank[0])
        { lcp[ rank[0] ] = getLCP( text, text + sfx[ rank[0]-1 ].i ); }
    for (int i = 1; i < len; i++) {
        if ( !rank[i] ) { continue; }
        if (lcp[ rank[i - 1] ] <= 1)
        { lcp[ rank[i] ] = getLCP( text+i, text+sfx[ rank[i]-1 ].i ); }
        else
        { int L = lcp[ rank[i - 1] ] - 1;
          lcp[rank[i]] = L+getLCP(text+i+L, text+sfx[rank[i]-1].i+L); }
    }
}

//Test suite and usage example
#include <iostream>
using namespace std;
int main() {
    char str[] = "aabbaa{post.content}ababab";
    int from[] = {0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1};
    int lcp[13];
    buildSA(str, 13); getLCP(str, g_sa, 13, lcp);
    for (int i=1; i<13; i++)//The first suffix is useless (empty).
    { cout<<from[g_sa[i].i]<<' '<<str+g_sa[i].i<<' '<<lcp[i]<<endl; }
    return 0;//output: 0 a 0
             //        0 aa 1
             //        0 aabbaa 2
             //        1 ab 1
             //        1 abab 2
             //        1 ababab 4
             //        0 abbaa 2
             //        1 b 0
             //        0 baa 1
             //        1 bab 2
             //        1 babab 3
             //        0 bbaa 1
}

【上篇】Range Minimum Query and Lowest Common Ancestor
【下篇】欧拉回路

作者: demented

该日志由 demented 于6年前发表在综合分类下，最后更新于 2017年12月17日.
转载请注明: 后缀树与后缀数组 | 学步园 +复制链接

抱歉!评论已关闭.

学步园

后缀树与后缀数组

作者: demented

书签

最新文章New

本站推荐

返回首页