October 27, 2020

Regular Expressions (regex) for Information Extraction: Worked Examples

regex

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

Case 1: Extract the longest string in a document that matches a term list

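The regex package extends the basic syntax of re. Among other things it supports POSIX matching (leftmost longest), enabled with the (?p) flag, which returns the longest alternative that matches at the leftmost position instead of the first alternative that matches.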

>>> # Normal matching.
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>
>>> # POSIX matching.
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'>

A Chinese example: the term list contains two words, 精华 and 精华露, and for the sentence below we expect 精华露 to be returned:

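import regex as re  # third-party regex package, aliased to re
# (?p) turns on POSIX leftmost-longest matching
mat = re.compile(r'(?p)(精华|精华露)')
print(mat.findall("倩碧焕妍活力精华露"))
# ['精华露']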
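When the third-party regex package is not available, a similar longest-term result can usually be obtained from the standard re module by ordering the alternation from longest to shortest term. The sketch below assumes a small in-memory term list; build_term_pattern is a hypothetical helper name, not part of either library. This is ordered alternation rather than true POSIX matching, but for a flat list of literal terms the result should be the same.

```python
import re  # standard library, not the third-party regex package

def build_term_pattern(terms):
    """Compile an alternation that prefers the longest term at each position."""
    # Longest first, so '精华露' is tried before its prefix '精华'.
    ordered = sorted(set(terms), key=len, reverse=True)
    # re.escape keeps terms containing metacharacters (e.g. 'C++') literal.
    return re.compile("|".join(re.escape(t) for t in ordered))

terms = ["精华", "精华露"]
pattern = build_term_pattern(terms)
matches = pattern.findall("倩碧焕妍活力精华露")
print(matches)  # ['精华露']
```

Sorting once when the pattern is built keeps the matching step itself unchanged, which may be convenient when the term list is loaded from a file.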

Case 2: Extract every term from the term list that matches in the document, including nested terms at the same position

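Same term list and sentence as above, but this time both 精华 and 精华露 should be returned. The alternation is wrapped in a lookbehind, so each match is zero-width and a term is captured at every position where one ends. The lookbehind here has variable width, which the regex package accepts but the standard re module rejects.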

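import regex as re  # third-party regex package, aliased to re
# Zero-width lookbehind: captures the term that ends at each position
mat = re.compile(r'(?<=(精华|精华露))')
print(mat.findall("倩碧焕妍活力精华露"))
# ['精华', '精华露']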
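A lookbehind pattern of this kind yields at most one term per end position, so two terms that end on the same character (for instance a hypothetical pair 华露 and 精华露) would not both be reported. One possible workaround, sketched here with the standard re module, is to scan for each term separately and collect the spans; find_all_terms is an illustrative name, and the approach assumes the term list is small enough that one pass per term is acceptable.

```python
import re

def find_all_terms(terms, text):
    """Return (start, end, term) for every occurrence of every term, nested or not."""
    hits = []
    for term in terms:
        # The lookahead makes each match zero-width, so overlapping
        # occurrences of the same term are found as well.
        pattern = re.compile("(?=(" + re.escape(term) + "))")
        for m in pattern.finditer(text):
            hits.append((m.start(1), m.end(1), term))
    return sorted(hits)

hits = find_all_terms(["精华", "精华露"], "倩碧焕妍活力精华露")
print(hits)  # [(6, 8, '精华'), (6, 9, '精华露')]
```

Returning spans alongside the terms makes it possible to tell nested matches apart from separate occurrences elsewhere in the text.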

Case 3: Extract a target pattern that is preceded or followed by certain trigger words

Method 1: the trigger word comes before the target

# The trigger 电话号码: is matched literally; group 1 captures the digits after it
m = re.search(r'电话号码:(\d+)', '电话号码:14412234111')
print(m.group(1))
# 14412234111

Method 2: the trigger word comes after the target

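# The lookahead requires the trigger ' 86+' after the digits without consuming it.
# The plus sign is escaped: an unescaped 86+ would mean "8 followed by one or more 6".
m = re.search(r'(\d+)(?=\s86\+)', '14412234111 86+')
print(m.group(1))
# 14412234111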
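The two methods can be combined into a single helper, sketched here with the standard re module. It uses a lookbehind for the leading trigger and a lookahead for the trailing one, so the match itself is just the digits. The names AFTER_LABEL, BEFORE_CODE and extract_number are hypothetical, and the sketch assumes the same two trigger strings as the examples above.

```python
import re

# Trigger before the value: a fixed-width lookbehind, which standard re accepts.
AFTER_LABEL = re.compile(r"(?<=电话号码:)\d+")
# Trigger after the value: a lookahead; the '+' is escaped so it is literal.
BEFORE_CODE = re.compile(r"\d+(?=\s86\+)")

def extract_number(text):
    """Return the first digit run adjacent to either trigger, or None."""
    for pattern in (AFTER_LABEL, BEFORE_CODE):
        m = pattern.search(text)
        if m:
            return m.group(0)
    return None

print(extract_number("电话号码:14412234111"))  # 14412234111
print(extract_number("14412234111 86+"))      # 14412234111
```

Because the triggers sit inside lookaround assertions, no capture group is needed and group(0) is already the extracted value.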

Permalink: http://57km.cc/post/how to apply regex with information extraction.html
