$ ls ~yifei/notes/

大规模字符串的匹配

Posted on:2018-07-19 18:35

Last modified:2023-01-19 17:28

问题：假设我们有一组比较长的文本，每一个文本都有几十 k 左右，还有一些敏感关键词需要删除，大概有几千，然后需要在这些文本中把关键词找出来。

暴力搜索

import re

compiled_words = [re.compile(r"\b" + word + r"\b") for word in my20000words]

for sentence in sentences:
  for word in compiled_words:
    print(re.sub(word, "***", sentence))

改进 1：把正则组合起来

pattern = "\b(word1|word2|word3)\b"

for sentence in sentences:
  print(re.sub(pattern, "***", sentence))

改进 2：使用 Trie 优化正则

对于数组：['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']，使用上面的方法，我们可能会写出正则：

"\b(foobar|foobah|fooxar|foozap|fooza)\b"

但是这并不是最优的正则，应该使用前缀树的思想来合并单词，形成下面的正则：

r"\bfoo(?:ba[hr]|xar|zap?)\b"

具体的方法可以看这里：https://stackoverflow.com/a/42789508/1061155

改进 3：基于集合的搜索

import re

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

word_pattern = re.compile("\w+")

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

参考

All contents are under the CC-BY-NC-SA license, if not otherwise specified.

Opinions expressed here are solely my own and do not express the views or opinions of my employer.

友情链接: MySQL 教程站