串匹配

串匹配

朴素字符串匹配算法

  • 检测每一个位置开始的字符串是否符合要求.
  • 最坏$\Theta((n-m+1)m)$

Rabin-Karp算法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
RABIN-KARP-MATCHER(T, P, d, q)
n = T.length
m = P.length
h = d^{m-1} mod q
p = 0
t_0 = 0
for i = 1 to m // preprocessing
p = (dp + P[i]) mod q
t_0 = (dt_0 + T[i]) mod q
for s = 0 to n - m
if p == t_s
if P[1..m] == T[s+1..s+m]
print "Pattern occurs with shift" s
if s < n - m
t_{s+1} = (d(t_s-T[s+1]h) + T[s+m+1]) mod q
  • 预处理$O(m)$, 最坏运行时间$\Theta((n-m+1)m)$, 平均情况下较好.
  • 将P和T都转化为数字$p$和$t_s$, 左边是高位. t下标可去掉.
  • $p= P[m] + d(P[m-1] + d(P[m-2] + … + d(P[2] + dP[1])…)) \mod q$
  • $t_{s+1} = (d(t_s-T[s+1]h) + T[s+m+1]) \mod q$
  • mod q是为了防止p和$t_s$的值太大, 导致每次算术运算需要的时间不是常数. 因为有mod q, 所以遇到伪命中点还需要进一步判断.
  • 期望运行时间为$O(n) + O(m(v+n/q))$, v为有效偏移量.

利用有限自动机进行字符串匹配

有限自动机

  • 一个有限自动机M是一个5元组$(Q, q_0, A, \Sigma, \delta)$.
    • Q是状态的有限集合
    • $q_0$是初始状态
    • $A\subseteq Q$是一个特殊的接收状态集合
    • $\Sigma$是有限字符输入表
    • $\delta$是一个从$Q\times \Sigma$到Q的函数, 称为M的转移函数
  • 有限自动机从$q_0$开始每次读入输入字符串的一个字符. 于状态q读入a则状态变为$\sigma(q, a)$. 当前状态$q\in A$则M接受了迄今为止读入的字符串. 未被接受的输入称为被拒绝的字符串.
  • 终态函数$\phi$: $\Sigma^* \to Q$, 满足$\phi(w)$是M在扫描字符串w后终止时的状态. 当且仅当$\phi(w)\in A$, M接受字符串w.
    • $\phi(\epsilon) = q_0$
    • $\phi(wa) = \delta(\phi(w), a)$

      字符串匹配自动机

  • P的后缀函数$\sigma: \Sigma^*\to \{0,1,…,m\}$: $\sigma(x) = max\{k: P_k \sqsupset x\}$
  • 字符串匹配自动机:
    • 状态集合为$\{0,1,…,m\}$. $q_0$是0状态, m是唯一被接受状态.
    • 对任意状态q和字符a
      $\delta(q, a) = \sigma(P_qa)$
  • $\phi(T_i) = \sigma(T_i)$

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    FINITE-AUTOMATON-MATCHER(T, delta, m)
    n = T.length
    q = 0
    for i = 1 to n
    q = delta(q, T[i])
    if q == m
    print "Pattern occurs with shift " i - m

    COMPUTE-TRANSITION-FUNCTION(P, Sigma)
    m = P.length
    for q = 0 to m
    for each character a in Sigma
    k = min(m + 1, q + 2)
    repeat
    k = k - 1
    until P_k is suffix of P_q a
    delta(q, a) = k
    return delta
  • 预处理时间可以改进到$O(m|\Sigma|)$.

  • 匹配时间$\Theta(n)$

Knuth-Morris-Pratt 算法(KMP)

  • 预处理$\Theta(m)$, 匹配$\Theta(n)$.
  • P的前缀函数$\pi: \{1, 2, …, m\}\to \{0, 1, …, m - 1\}$满足
    $\pi[q] = max\{k: k < q且P_k\sqsupset P_q\}$
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    KMP-MATCHER(T, P)
    n = T.length
    m = P.length
    pi = COMPUTE-PREFIX-FUNCTION(P)
    q = 0 // number of characters matched
    for i = 1 to n // scan the text from left to right
    while q > 0 and P[q+1] != T[i] // next character does not match
    q = pi[q]
    if P[q+1] == T[i] //next character matches
    q = q + 1
    if q == m
    print "Pattern occurs with shift " i - m
    q = pi[q] // look for the next match

    COMPUTE-PREFIX-FUNCTION(P)
    m = P.length
    pi[1] = 0
    k = 0
    for q = 2 to m
    while k > 0 and P[k+1] != P[q]
    k = pi[k]
    if P[k+1] == P[q]
    k = k + 1
    pi[q] = k
    return pi