Regular Expressions

fengchen, filed under Computer

2023-07-15, about 2400 words, estimated reading time 5 minutes

Regular Expressions

Regular Expression Syntax

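Metacharacters

Metacharacter	Description
.	The period matches any single character except a line break. "`.ar`" => The car parked in the garage.
[ ]	Character class. Matches any single character listed between the square brackets. "`[Tt]he`" => The car parked in the garage.
[^ ]	Negated character class. Matches any single character not listed between the square brackets. "`[^c]ar`" => The car parked in the garage.
*	Matches zero or more repetitions of the item before the `*`. "`[a-z]*`" => The car parked in the garage #21.
+	Matches one or more repetitions of the item before the `+`. "`c.+t`" => The fat cat sat on the mat.
?	Makes the item before the `?` optional. "`[T]?he`" => The car is parked in the garage.
{n,m}	Matches num repetitions of the preceding character or character class (n <= num <= m). At least 2 and at most 3 digits in the range 0-9: "`[0-9]{2,3}`" => The number was 9.9997 but we rounded it off to 10.0. At least 2 digits: "`[0-9]{2,}`" => The number was 9.9997 but we rounded it off to 10.0. Exactly 3 digits: "`[0-9]{3}`" => The number was 9.9997 but we rounded it off to 10.0.
(xyz)	Group. Matches the characters xyz in exactly that order. "`(c`…
|	Alternation. Matches either the expression before the symbol or the expression after it. "`(T`…
\	Escape character. Matches a reserved character literally: `[ ] ( ) { } . * + ? ^ $ \ |` "`(f`…
^	Matches at the start of a line. "`[T`…
$	Matches at the end of a line. "`(at\.)`" => The fat cat. sat. on the mat. "`(at\.$)`" => The fat cat. sat. on the mat.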
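As a quick check of the table, the sketch below runs several of its patterns over the same sample sentences. It assumes Python's `re` module; `re.findall` is used only as a convenient way to list every match, and the variable names are illustrative.

```python
import re

car = "The car parked in the garage."
fat = "The fat cat sat on the mat."
num = "The number was 9.9997 but we rounded it off to 10.0."

# . matches any single character except a line break
dot_matches = re.findall(r".ar", car)            # ['car', 'par', 'gar']

# [ ] matches one character from the set, [^ ] one character outside it
set_matches = re.findall(r"[Tt]he", car)         # ['The', 'the']
negated_matches = re.findall(r"[^c]ar", car)     # ['par', 'gar']

# + is greedy: the match runs from the first 'c' to the last 't'
plus_matches = re.findall(r"c.+t", fat)          # ['cat sat on the mat']

# {n,m}, {n,} and {n} bound the number of repetitions
two_to_three = re.findall(r"[0-9]{2,3}", num)    # ['999', '10']
at_least_two = re.findall(r"[0-9]{2,}", num)     # ['9997', '10']
exactly_three = re.findall(r"[0-9]{3}", num)     # ['999']
```

Note that `[0-9]{2,3}` finds `999` and leaves the trailing `7` unmatched, because a single digit is below the minimum of two.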

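Shorthand Character Classes

Shorthand	Description
.	Any character except a line break
\w	Matches letters, digits and the underscore; equivalent to `[a-zA-Z0-9_]`
\W	Matches non-word characters, i.e. symbols; equivalent to `[^\w]`
\d	Matches a digit: `[0-9]`
\D	Matches a non-digit: `[^\d]`
\s	Matches a whitespace character; equivalent to `[\t\n\f\r\p{Z}]`
\S	Matches a non-whitespace character: `[^\s]`
\f	Matches a form feed
\n	Matches a line feed
\r	Matches a carriage return
\t	Matches a tab
\v	Matches a vertical tab
\p	Matches CR/LF (equivalent to `\r\n`), used to match DOS line terminators. This meaning is tool-specific: in engines such as PCRE and Java, `\p{...}` instead introduces a Unicode property class, as in the `\p{Z}` above.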
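The shorthand classes can be tried out with a small sketch like the following. It assumes Python's `re` module, and the sample strings are made up for illustration.

```python
import re

text = "Room 21,\tfloor 3\nuser_name = fengchen"

# \d+ collects runs of digits
digits = re.findall(r"\d+", text)

# \w+ collects runs of letters, digits and underscores
words = re.findall(r"\w+", text)

# \s+ matches any run of whitespace (space, tab, newline, ...)
fields = re.split(r"\s+", "a \t b\nc")

# \W matches a single character that \w does not match
non_word = re.findall(r"\W", "a_b-c d")
```

Here `user_name` stays in one piece because the underscore belongs to `\w`, while `=` and `,` are skipped.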

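Zero-Width Assertions (Lookaround)

Symbol	Description
?=	Positive lookahead: the text that follows must match. "`(T`…
?!	Negative lookahead: the text that follows must not match. "`(T`…
?<=	Positive lookbehind: the text that precedes must match. "`(?<=(T`…
?<!	Negative lookbehind: the text that precedes must not match. "`(?<!(T`…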
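The four assertions can be illustrated with a short sketch. It assumes Python's `re` module, and the patterns are chosen for this example rather than taken from the table. Python requires lookbehind patterns to have a fixed width, which `[Tt]he\s` satisfies.

```python
import re

sentence = "The fat cat sat on the mat."

# Positive lookahead: "The"/"the" only when followed by " fat"
followed_by_fat = re.findall(r"[Tt]he(?=\sfat)", sentence)        # ['The']

# Negative lookahead: "The"/"the" only when NOT followed by " fat"
not_followed_by_fat = re.findall(r"[Tt]he(?!\sfat)", sentence)    # ['the']

# Positive lookbehind: a word only when preceded by "The " or "the "
after_the = re.findall(r"(?<=[Tt]he\s)\w+", sentence)             # ['fat', 'mat']

# Negative lookbehind: "?at" words NOT preceded by "The " or "the "
not_after_the = re.findall(r"(?<![Tt]he\s)[a-z]at", sentence)     # ['cat', 'sat']
```

The assertions consume no characters, so the text they test never appears in the returned matches.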

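Flags (Optional)

Flag	Description
i	Case-insensitive matching. "`/The/gi`" => The fat cat sat on the mat.
g	Global search: find every match instead of stopping after the first. "`/.(at)/gi`" => The fat cat sat on the mat.
m	Multiline: the anchors `^` and `$` match at the start and end of every line, not only at the start and end of the whole input.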
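The `/pattern/flags` notation above is JavaScript style. As a rough sketch of the same behaviour in Python's `re` module (an assumption of this example), `i` corresponds to `re.IGNORECASE` and `m` to `re.MULTILINE`. There is no `g` flag: the choice between `re.search` and `re.findall` decides whether one match or all matches are returned.

```python
import re

sentence = "The fat cat sat on the mat."
lines = "one two\nthree four"

# i: case-insensitive matching
case_sensitive = re.findall(r"the", sentence)                     # ['the']
case_insensitive = re.findall(r"the", sentence, re.IGNORECASE)    # ['The', 'the']

# g: first match only versus every match
first_only = re.search(r".at", sentence).group()                  # 'fat'
all_matches = re.findall(r".at", sentence)                        # ['fat', 'cat', 'sat', 'mat']

# m: ^ matches at the start of every line instead of only the first
single_line = re.findall(r"^\w+", lines)                          # ['one']
multi_line = re.findall(r"^\w+", lines, re.MULTILINE)             # ['one', 'three']
```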

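Greedy and Lazy Matching

Quantifiers are greedy by default: they match the longest possible substring.

Appending `?` to a quantifier makes it lazy: it stops at the first position where the rest of the pattern can match.

"`(.*at)`" => The fat cat sat on the mat.
"`(.*?at)`" => The fat cat sat on the mat.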
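The difference is easiest to see by running both patterns on the same sentence. This is a minimal sketch assuming Python's `re` module.

```python
import re

sentence = "The fat cat sat on the mat."

# Greedy: .* grabs as much as possible, so the match ends at the LAST "at"
greedy = re.search(r"(.*at)", sentence).group(1)    # 'The fat cat sat on the mat'

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST "at"
lazy = re.search(r"(.*?at)", sentence).group(1)     # 'The fat'

# With findall, the lazy pattern splits the sentence at every "at"
lazy_all = re.findall(r".*?at", sentence)           # ['The fat', ' cat', ' sat', ' on the mat']
```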

Regular Expression Operations in C++

Matching

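#include <regex>
bool regex_match (const basic_string<charT,ST,SA>& s,
          const basic_regex<charT,traits>& rgx,
          regex_constants::match_flag_type flags = regex_constants::match_default);
/**
	s: the target string, i.e. the string to be tested against the regular expression.
	rgx: a basic_regex object holding the pattern to match, built from a pattern string. The common specializations are:
	(1) typedef basic_regex<char>    regex;  // narrow characters (the usual choice)
	(2) typedef basic_regex<wchar_t> wregex; // wide characters
	flags: one or more constants that control how rgx is matched; usually left at the default value.
	Returns true only if the entire string s matches rgx, false otherwise.
*/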

Searching

bool regex_search (const basic_string<charT,ST,SA>& s,
          const basic_regex<charT,traits>& rgx,
          regex_constants::match_flag_type flags = regex_constants::match_default);
      // Parameters have the same meaning as in regex_match. This overload does not report the matched text; it only tells whether some substring of s matches the pattern.
bool regex_search (const basic_string<charT,ST,SA>& s,
          match_results<typename basic_string<charT,ST,SA>::const_iterator,Alloc>& m,
          const basic_regex<charT,traits>& rgx,
          regex_constants::match_flag_type flags = regex_constants::match_default);
      // Same parameters plus m, a match_results object that receives the overall match and each sub-expression match. The return value is still a bool; the matched ranges are read from m as iterator pairs.
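The split between whole-string matching and searching is not specific to C++. As a rough parallel only, and assuming Python's `re` module, `re.fullmatch` plays the role of `regex_match`, `re.search` that of `regex_search`, and the returned match object that of `match_results`. The sample string is made up for illustration.

```python
import re

text = "subject line"

# Like regex_match: the pattern must cover the whole string
whole_ok = re.fullmatch(r"(sub)(.*)", text) is not None    # True
whole_fail = re.fullmatch(r"sub", text) is not None        # False: "ject line" is left over

# Like regex_search: the pattern may match any substring
found = re.search(r"line", text)

# The match object carries the matched range, much as match_results does
found_span = found.span()                                  # (8, 12)
```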

Replacing

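basic_string<charT,ST,SA> regex_replace (const basic_string<charT,ST,SA>& s,
                                         const basic_regex<charT,traits>& rgx,
                                         const charT* fmt,
                                         regex_constants::match_flag_type flags = regex_constants::match_default);
// s: the string to operate on
// rgx: the regular expression to match
// fmt: the replacement format string that is substituted for each match
// flags: constants that control how matching and replacement are performed
// Returns a new string with the matches replaced; by default, if nothing matches, the result equals the original string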

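Regular Expression Operations in Python

General steps for using the re module

First, use the compile function to compile the regular expression string into a Pattern object.
Then, use the methods of the Pattern object to match or search the text and obtain the result (a Match object).
Finally, use the attributes and methods of the Match object to read the information you need and carry on with any further processing.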
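The three steps above can be sketched as follows. The pattern and the sample string are hypothetical and chosen only to give the Match object something to report.

```python
import re

# Step 1: compile the pattern string into a Pattern object
pattern = re.compile(r"(\w+)@(\w+)\.com")

# Step 2: call a Pattern method; search returns a Match object or None
match = pattern.search("contact: alice@example.com")

# Step 3: read the results from the Match object
if match is not None:
    whole = match.group(0)     # 'alice@example.com'
    user = match.group(1)      # 'alice'
    domain = match.group(2)    # 'example'
    span = match.span()        # (9, 26)
```

Checking for `None` before using the result matters, because every matching method returns `None` when the pattern is not found.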

The compile function

Compiles a regular expression and returns a Pattern object.

import re
re.compile(pattern, flags=0)

pattern: the regular expression to compile
flags: optional; selects the matching mode, such as ignoring case or multiline mode
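A small sketch of passing flags to `re.compile`: flags are combined with `|`, and they then apply to every method call on the compiled pattern. The sample text is made up for illustration.

```python
import re

# Ignore case, and let ^ match at the start of every line
pattern = re.compile(r"^the \w+", re.IGNORECASE | re.MULTILINE)

text = "The fat cat\nthe mat\nover the hill"

# The third line contains "the hill", but not at the start of the line
hits = pattern.findall(text)                        # ['The fat', 'the mat']

# The same compiled pattern can be reused on other strings
reused = pattern.match("THE END") is not None       # True
```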

match

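Matches a pattern at the start of a string. If the pattern does not match at the start, match() returns None; the match must begin at the first character of the string.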

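re.match(pattern, string, flags=0)

pattern: the regular expression to match
string: the string to match against
flags: flag bits that control how matching is done, such as case sensitivity or multiline mode

Returns a match object, not the matched text. If the pattern does not match starting at the first character, re.match() returns None even when a later part of the string would match.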
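The start-of-string rule can be checked with a short sketch; `re.search` is included only for contrast, since it scans forward through the string.

```python
import re

msg = "name:Alice,age:6,score:80"

# Matches: the string starts with "name"
at_start = re.match(r"name", msg)

# None: "age" occurs in the string, but not at the first character
not_at_start = re.match(r"age", msg)

# search finds the same pattern wherever it first occurs
found_later = re.search(r"age", msg)

matched_text = at_start.group()     # 'name'
later_span = found_later.span()     # (11, 14)
```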

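Each pair of parentheses normally forms a capture group. Call group(n) to extract the string matched by group n, and groups() to get a tuple holding the strings of all groups from 1 upward. The group numbers run from 0 to the number of groups:

0: the whole string matched by the regular expression.
1: the part of the match captured by the first () in the pattern.
2: the part of the match captured by the second () in the pattern.
…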

import re
msg = 'name:Alice,age:6,score:80'

# Raw string, so that \w and \d reach the regex engine unchanged
obj = re.match(r'name:(\w+),age:(\d+)', msg)
print(obj.group(0))    # name:Alice,age:6  the whole matched string
print(obj.group(1))    # Alice   first group
print(obj.group(2))    # 6       second group
print(obj.groups())    # ('Alice', '6')
print(obj.span())      # (0, 16)  start and end index of the match

search

findall

finditer

split

sub

Replaces the matches of a pattern in a string.

def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)

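pattern: the regular expression pattern string.
repl: the replacement for each match of pattern; it can be a string or a function.
string: the original string to be searched and modified.
count: optional; the maximum number of replacements, which must be a non-negative integer. The default 0 replaces every match.
flags: optional; the matching mode used at compile time (ignore case, multiline mode and so on), given as a number. The default is 0.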
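The parameters can be seen at work in a minimal sketch. The input strings are made up for illustration, and the function passed as `repl` receives a Match object and returns the replacement text.

```python
import re

text = "a1b22c333"

# count=0 (the default) replaces every match
replaced_all = re.sub(r"\d+", "#", text)                      # 'a#b#c#'

# count=1 replaces only the first match
replaced_once = re.sub(r"\d+", "#", text, count=1)            # 'a#b22c333'

# repl as a function: double every number found
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)   # 'a2b44c666'

# repl as a string may refer to capture groups with \1, \2, ...
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")       # 'host@user'

# flags is passed by keyword so it is not mistaken for count
any_case = re.sub(r"cat", "dog", "Cat cat", flags=re.IGNORECASE)    # 'dog dog'
```

Passing `flags` by keyword avoids a common slip: given positionally as the fourth argument, a flag value would be interpreted as `count`.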