Problem: suppose we have a collection of fairly long texts, each around a few tens of kilobytes, and a list of a few thousand sensitive keywords that must be removed. We need to find these keywords in the texts and strip them out.

The most straightforward approach is to compile one regex per keyword and apply them to every sentence one by one:
import re

# my20000words is the list of banned words; compile one \b-anchored pattern per word
compiled_words = [re.compile(r"\b" + word + r"\b") for word in my20000words]

for sentence in sentences:
    for word in compiled_words:
        sentence = word.sub("***", sentence)
    print(sentence)
A much faster variant is to join all the keywords into a single alternation, so each sentence is scanned only once:

pattern = r"\b(word1|word2|word3)\b"
for sentence in sentences:
    print(re.sub(pattern, "***", sentence))
For the list ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza'], the method above would produce the regex:

r"\b(foobar|foobah|fooxar|foozap|fooza)\b"

But this is not the optimal pattern. We should merge the words with a trie (prefix tree), which produces the regex below:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

The full method is described here: https://stackoverflow.com/a/42789508/1061155
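The linked answer implements this with a Trie class; below is a minimal sketch of the same idea under my own assumptions, using nested dicts as the trie (build_trie and trie_to_regex are hypothetical names, not from the answer):

import re

def build_trie(words):
    # Nested dicts as a trie; the "" key marks the end of a word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root

def trie_to_regex(node):
    # Recursively turn a trie node into a regex fragment.
    if set(node) == {""}:
        return ""                       # nothing left to match
    optional = "" in node               # a word may also end at this node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if len(branches) == 1:
        body = branches[0]
        if optional and len(body) > 1:
            body = "(?:" + body + ")"   # group so "?" applies to the whole tail
    elif all(len(b) == 1 for b in branches):
        body = "[" + "".join(branches) + "]"        # merge single chars into a class
    else:
        body = "(?:" + "|".join(branches) + ")"
    return body + "?" if optional else body

words = ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']
pattern = re.compile(r"\b" + trie_to_regex(build_trie(words)) + r"\b")
print(pattern.pattern)                               # \bfoo(?:ba[hr]|xar|zap?)\b
print(pattern.sub("***", "a foobar and a fooza"))    # a *** and a ***

Building the pattern this way means words sharing a prefix are matched along one path instead of being tried as thousands of independent alternatives.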
The same StackOverflow thread also suggests flipping the problem: instead of searching the text for each banned word, split the text into words with \w+ and check every word against a set of banned words, which is an O(1) lookup:

import re

# banned_words is assumed to be a set of lowercase words, so membership tests are O(1)
def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""          # drop the banned word
    else:
        return word        # keep everything else

word_pattern = re.compile(r"\w+")

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)
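For illustration, here is how the inputs might look, continuing from the block above (banned_words and sentences are made-up data, not from the original post):

banned_words = {"foobar", "fooza"}    # hypothetical data; lowercase, since the callback lowercases
sentences = ["A foobar walked by, then a fooza."]

for sentence in sentences:
    print(word_pattern.sub(delete_banned_words, sentence))
    # -> "A  walked by, then a ."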