串匹配
朴素字符串匹配算法
- 检测每一个位置开始的字符串是否符合要求.
- 最坏$\Theta((n-m+1)m)$
Rabin-Karp算法
1 | RABIN-KARP-MATCHER(T, P, d, q) |
- 预处理$O(m)$, 最坏运行时间$\Theta((n-m+1)m)$, 平均情况下较好.
- 将P和T都转化为数字$p$和$t_s$, 左边是高位. t下标可去掉.
- $p= P[m] + d(P[m-1] + d(P[m-2] + … + d(P[2] + dP[1])…)) \mod q$
- $t_{s+1} = (d(t_s-T[s+1]h) + T[s+m+1]) \mod q$
- mod q是为了防止p和$t_s$的值太大, 导致每次算术运算需要的时间不是常数. 因为有mod q, 所以遇到伪命中点还需要进一步判断.
- 期望运行时间为$O(n) + O(m(v+n/q))$, v为有效偏移量.
利用有限自动机进行字符串匹配
有限自动机
- 一个有限自动机M是一个5元组$(Q, q_0, A, \Sigma, \delta)$.
- Q是状态的有限集合
- $q_0$是初始状态
- $A\subseteq Q$是一个特殊的接收状态集合
- $\Sigma$是有限字符输入表
- $\delta$是一个从$Q\times \Sigma$到Q的函数, 称为M的转移函数
- 有限自动机从$q_0$开始每次读入输入字符串的一个字符. 于状态q读入a则状态变为$\sigma(q, a)$. 当前状态$q\in A$则M接受了迄今为止读入的字符串. 未被接受的输入称为被拒绝的字符串.
- 终态函数$\phi$: $\Sigma^* \to Q$, 满足$\phi(w)$是M在扫描字符串w后终止时的状态. 当且仅当$\phi(w)\in A$, M接受字符串w.
- P的后缀函数$\sigma: \Sigma^*\to \{0,1,…,m\}$: $\sigma(x) = max\{k: P_k \sqsupset x\}$
- 字符串匹配自动机:
- 状态集合为$\{0,1,…,m\}$. $q_0$是0状态, m是唯一被接受状态.
- 对任意状态q和字符a
$\delta(q, a) = \sigma(P_qa)$
$\phi(T_i) = \sigma(T_i)$
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18FINITE-AUTOMATON-MATCHER(T, delta, m)
n = T.length
q = 0
for i = 1 to n
q = delta(q, T[i])
if q == m
print "Pattern occurs with shift " i - m
COMPUTE-TRANSITION-FUNCTION(P, Sigma)
m = P.length
for q = 0 to m
for each character a in Sigma
k = min(m + 1, q + 2)
repeat
k = k - 1
until P_k is suffix of P_q a
delta(q, a) = k
return delta预处理时间可以改进到$O(m|\Sigma|)$.
- 匹配时间$\Theta(n)$
Knuth-Morris-Pratt 算法(KMP)
- 预处理$\Theta(m)$, 匹配$\Theta(n)$.
- P的前缀函数$\pi: \{1, 2, …, m\}\to \{0, 1, …, m - 1\}$满足
$\pi[q] = max\{k: k < q且P_k\sqsupset P_q\}$1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25KMP-MATCHER(T, P)
n = T.length
m = P.length
pi = COMPUTE-PREFIX-FUNCTION(P)
q = 0 // number of characters matched
for i = 1 to n // scan the text from left to right
while q > 0 and P[q+1] != T[i] // next character does not match
q = pi[q]
if P[q+1] == T[i] //next character matches
q = q + 1
if q == m
print "Pattern occurs with shift " i - m
q = pi[q] // look for the next match
COMPUTE-PREFIX-FUNCTION(P)
m = P.length
pi[1] = 0
k = 0
for q = 2 to m
while k > 0 and P[k+1] != P[q]
k = pi[k]
if P[k+1] == P[q]
k = k + 1
pi[q] = k
return pi