Python Regular Expressions: Matching Groups

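Matching groups

Character	Function
`|`	Matches either the expression on its left or the one on its right
`(ab)`	Treats the characters inside the parentheses as a group
`\num`	Refers back to the string matched by group number num
`(?P<name>)`	Gives the group an alias (a named group)
`(?P=name)`	Refers back to the string matched by the group named name

Matching either of two expressions, like an OR condition: |

When searching, we do not always look for exactly one thing; sometimes any one of several alternatives is acceptable. The earlier sections only covered matching a single pattern.

Requirement: match the numbers from 0 to 100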

# coding=utf-8
import re

In [3]: re.match(r'[1-9]?\d','8').group()
Out[3]: '8'

In [5]: re.match(r'[1-9]?\d','78').group()
Out[5]: '78'

# Wrong result: [1-9] cannot match 0, so the optional part matches nothing,
# \d matches the 0, and the match ends there. Only '0' is returned, not '08'.
In [6]: re.match(r'[1-9]?\d','08').group()
Out[6]: '0'

# Fixed by anchoring with $: '08' no longer matches at all, so ret is None
# and the else branch reports that the value is out of range.
In [14]: ret = re.match(r'[1-9]?\d$|100','08')

In [15]: if ret:
    ...:     print(ret.group())
    ...: else:
    ...:     print("Not between 0 and 100")
    ...:
Not between 0 and 100

# If the first character class is changed to [0-9], '08' does match.
In [9]: re.match(r'[0-9]?\d$','08').group()
Out[9]: '08'

# '100' fails here because the pattern only covers one or two digits,
# so | is needed to add another alternative.
In [17]: re.match(r'[1-9]?\d$','100').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 re.match(r'[1-9]?\d$','100').group()

AttributeError: 'NoneType' object has no attribute 'group'

# Adding | provides a second alternative that handles 100.
In [18]: re.match(r'[1-9]?\d$|100','100').group()
Out[18]: '100'
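The session above can be condensed into a small helper. This is only a sketch: the names `ZERO_TO_HUNDRED` and `in_range` are made up for illustration, and it swaps `re.match` plus `$` for `re.fullmatch`, which anchors both alternatives at the end of the string (in the original pattern, `$` applies only to the first alternative).

```python
import re

# Alternation: either the literal "100", or an optional leading 1-9 followed by one digit.
# The names below are hypothetical and chosen only for this example.
ZERO_TO_HUNDRED = re.compile(r"100|[1-9]?\d")


def in_range(text):
    """Return True if text is a number from 0 to 100 written without a leading zero."""
    # fullmatch requires the whole string to match, so no explicit ^ or $ is needed.
    return ZERO_TO_HUNDRED.fullmatch(text) is not None


for candidate in ["8", "78", "08", "100", "101"]:
    print(candidate, in_range(candidate))
```

With this form, strings such as "08" and "101" are rejected because the pattern must consume the entire input.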

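Treating the characters in parentheses as a group: (ab)

As shown above, | performs an OR match, but so far nothing limits its scope. What does that mean? The example below makes it clear.

Requirement: match 163, 126, and qq email addresses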

# coding=utf-8
import re

# Start by matching a single 163 email address.
In [19]: re.match(r'\w{4,20}@163\.com','test@163.com').group()
Out[19]: 'test@163.com'

# To accept 163, 126, and qq addresses, can we simply add | ? The results say no.
In [20]: re.match(r'\w{4,20}@163|qq|126\.com','test@163.com').group()
Out[20]: 'test@163'

In [21]: re.match(r'\w{4,20}@163|qq|126\.com','qq').group()
Out[21]: 'qq'

In [22]: re.match(r'\w{4,20}@163|qq|126\.com','126.com').group()
Out[22]: '126.com'

# These three results show that | splits the whole pattern into three separate
# alternatives: \w{4,20}@163, qq, and 126\.com.
# That is clearly not what we want: the scope of | has not been limited.
# Parentheses solve this by limiting the scope of the alternation.
# With (163|qq|126), the | applies only inside the parentheses.
In [23]: re.match(r'\w{4,20}@(163|qq|126)\.com','126.com').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 re.match(r'\w{4,20}@(163|qq|126)\.com','126.com').group()

AttributeError: 'NoneType' object has no attribute 'group'

# A bare 'qq' no longer matches either.
In [24]: re.match(r'\w{4,20}@(163|qq|126)\.com','qq').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 re.match(r'\w{4,20}@(163|qq|126)\.com','qq').group()

AttributeError: 'NoneType' object has no attribute 'group'

# Now try valid email addresses and check the results.
In [25]: re.match(r'\w{4,20}@(163|qq|126)\.com','test@163.com').group()
Out[25]: 'test@163.com'

In [26]: re.match(r'\w{4,20}@(163|qq|126)\.com','test@qq.com').group()
Out[26]: 'test@qq.com'

In [27]: re.match(r'\w{4,20}@(163|qq|126)\.com','test@126.com').group()
Out[27]: 'test@126.com'

# All three providers (163, 126, qq) now match correctly.
# Finally, an address from an unlisted provider (hostmail) does not match.
In [28]: re.match(r'\w{4,20}@(163|qq|126)\.com','test@hostmail.com').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 re.match(r'\w{4,20}@(163|qq|126)\.com','test@hostmail.com').group()

AttributeError: 'NoneType' object has no attribute 'group'
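Because the parentheses also capture what they match, the same pattern can report which provider an address belongs to. The following is a minimal sketch; `EMAIL` and `provider` are hypothetical names, and `re.fullmatch` is used here (instead of `re.match`) so that trailing text after `.com` is rejected as well.

```python
import re

# The group limits the scope of | and captures the provider name at the same time.
# EMAIL and provider are illustrative names, not part of the original session.
EMAIL = re.compile(r"\w{4,20}@(163|qq|126)\.com")


def provider(address):
    """Return '163', 'qq', or '126' for a matching address, otherwise None."""
    match = EMAIL.fullmatch(address)
    # Guarding against None avoids the AttributeError seen in the session above.
    return match.group(1) if match else None


print(provider("test@163.com"))
print(provider("test@hostmail.com"))
```

Checking the match object before calling `group()` is the usual way to avoid the `'NoneType' object has no attribute 'group'` error.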

Requirement: match 11-digit mobile numbers that do not end in 4 or 7

In [29]: tels = ["13800001234", "18916844321", "10086", "18800007777"]

In [35]: for tel in tels:
    ...:     ret = re.match(r'^1\d{9}[0-35-68-9]$',tel)
    ...:     if ret:
    ...:         print("%s does not end in 4 or 7" % ret.group())
    ...:     else:
    ...:         print("%s is not a number we are looking for" % tel)
    ...:
13800001234 is not a number we are looking for
18916844321 does not end in 4 or 7
10086 is not a number we are looking for
18800007777 is not a number we are looking for
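The loop above can also be written as a reusable filter. This is a sketch under the same rule (a leading 1, nine more digits, and a last digit other than 4 or 7); `MOBILE` and `numbers_not_ending_in_4_or_7` are names invented for the example.

```python
import re

# A leading 1, nine more digits, then a final digit that is neither 4 nor 7.
# [0-35-689] is the same set of digits as [0-35-68-9] in the session above.
MOBILE = re.compile(r"1\d{9}[0-35-689]")


def numbers_not_ending_in_4_or_7(tels):
    """Keep only the 11-digit numbers whose last digit is not 4 or 7."""
    # fullmatch makes the ^ and $ anchors unnecessary.
    return [tel for tel in tels if MOBILE.fullmatch(tel)]


tels = ["13800001234", "18916844321", "10086", "18800007777"]
print(numbers_not_ending_in_4_or_7(tels))
```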

Extracting the area code and the phone number

In [36]: re.match(r'\d{3,4}-?\d+','0755-12345678').group()
Out[36]: '0755-12345678'

In [37]: re.match(r'(\d{3,4})-?(\d+)','0755-12345678').group()
Out[37]: '0755-12345678'

In [38]: re.match(r'(\d{3,4})-?(\d+)','0755-12345678').group(1)
Out[38]: '0755'

In [39]: re.match(r'(\d{3,4})-?(\d+)','0755-12345678').group(2)
Out[39]: '12345678'

# Another approach uses a negated character class. Inside [], a leading ^ means
# "not", so [^-] matches any character except -, and [^-]* stops at the first -.
In [50]: re.match(r'[^-]*','0755-12345678').group()
Out[50]: '0755'

In [51]: re.match(r'[-]*','0755-12345678').group()
Out[51]: ''

In [52]: re.match(r'[^-]*','0755-12345678').group()
Out[52]: '0755'

In [53]: re.match(r'[^j]*','0755j12345678').group()
Out[53]: '0755'

In [54]: re.match(r'[^-]*','0755j12345678').group()
Out[54]: '0755j12345678'

In [55]: re.match(r'[^j]*','0755j12345678').group()
Out[55]: '0755'

In [60]: re.match(r'[^-]*-?\d+','0755-12345678').group()
Out[60]: '0755-12345678'

In [61]: re.match(r'([^-]*)-?(\d+)','0755-12345678').group()
Out[61]: '0755-12345678'

In [62]: re.match(r'([^-]*)-?(\d+)','0755-12345678').group(1)
Out[62]: '0755'

In [63]: re.match(r'([^-]*)-?(\d+)','0755-12345678').group(2)
Out[63]: '12345678'

# The advantage of this form is that it captures whatever precedes the - sign, not only digits.
In [64]: re.match(r'([^-]*)-?(\d+)','Abcdasdasd-12345678').group(2)
Out[64]: '12345678'

In [65]: re.match(r'([^-]*)-?(\d+)','Abcdasdasd-12345678').group(1)
Out[65]: 'Abcdasdasd'
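Instead of calling `group(1)` and `group(2)` separately, the match object's `groups()` method returns all captured groups as a tuple. A minimal sketch, where `PHONE` and `split_phone` are hypothetical names introduced for illustration:

```python
import re

# Group 1 captures the area code, group 2 the subscriber number.
PHONE = re.compile(r"(\d{3,4})-?(\d+)")


def split_phone(text):
    """Return (area_code, number) for a matching string, otherwise None."""
    match = PHONE.fullmatch(text)
    if match is None:
        return None
    # groups() returns every captured group in order as a tuple.
    area_code, number = match.groups()
    return area_code, number


print(split_phone("0755-12345678"))
print(split_phone("010-12345678"))
```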

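Referencing the string matched by group num: \num

This feature is used often when writing crawlers that match HTML elements in web pages. Here is an example: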

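Requirement: match <html>hello beauty</html>

# coding=utf-8
import re

# First, match just the opening tag.
In [66]: re.match(r'<[a-zA-Z]*>','<html>hello beauty</html>').group()
Out[66]: '<html>'

# Appending .* does match all remaining characters, but it makes it hard
# to capture the individual parts with () groups later.
In [67]: re.match(r'<[a-zA-Z]*>.*','<html>hello beauty</html>').group()
Out[67]: '<html>hello beauty</html>'

# \w matches letters, digits, and underscores, but not spaces or tabs,
# so the match stops after hello.
In [68]: re.match(r'<[a-zA-Z]*>\w*','<html>hello beauty</html>').group()
Out[68]: '<html>hello'

# Adding \s between the two \w* handles the space. What remains is the closing tag.
In [77]: re.match(r'<[a-zA-Z]*>\w*\s\w*','<html>hello beauty</html>').group()
Out[77]: '<html>hello beauty'

# Append a rule for the closing tag and the whole string matches.
In [78]: re.match(r'<[a-zA-Z]*>\w*\s\w*</[a-zA-Z]*>','<html>hello beauty</html>').group()
Out[78]: '<html>hello beauty</html>'

# However, the closing-tag rule accepts any letters. What if the closing tag differs from the opening tag?
In [80]: re.match(r'<[a-zA-Z]*>\w*\s\w*</[a-zA-Z]*>','<html>hello beauty</body>').group()
Out[80]: '<html>hello beauty</body>'

# This still matches, which is not what we want: the closing tag should also be html.
# Tag names vary from page to page, so hard-coding one tag name is not a solution.

# The right idea: whatever appears in the first <> must appear again in the closing tag.
# Referencing the text captured by the group achieves this. Note that the pattern
# must be a raw string, written r"...", so that \1 reaches the regex engine intact.
In [89]: re.match(r'<([a-zA-Z]*)>\w*\s\w*</\1>','<html>hello beauty</html>').group()
Out[89]: '<html>hello beauty</html>'

In [90]: re.match(r'<([a-zA-Z]*)>\w*\s\w*</\1>','<html>hello beauty</html>').group(1)
Out[90]: 'html'

# The pattern above ends with \1, which reuses whatever the first parenthesized group matched.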
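The backreference idea can be sketched as a small helper that only accepts text wrapped in a matching pair of tags. The names `PAIRED_TAG` and `inner_text` are assumptions made for this example, and the pattern is slightly more general than the one in the session: `[a-zA-Z]+` requires a non-empty tag name and `(.*)` captures any inner text.

```python
import re

# Group 1 captures the tag name; \1 demands the same name in the closing tag.
# The pattern is a raw string so that \1 is not treated as a string escape.
PAIRED_TAG = re.compile(r"<([a-zA-Z]+)>(.*)</\1>")


def inner_text(html):
    """Return the text between a matching open/close tag pair, otherwise None."""
    match = PAIRED_TAG.fullmatch(html)
    return match.group(2) if match else None


print(inner_text("<html>hello beauty</html>"))
print(inner_text("<html>hello beauty</body>"))
```

A mismatched closing tag makes the backreference fail, so the second call returns None rather than a misleading match.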

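As shown above, a () group can be referenced later in the same pattern. When a pattern contains many groups, writing \1, \2, \3 becomes inconvenient, so groups can also be given names.

Referencing groups by alias: (?P<name>) (?P=name)

Character	Function
`(?P<name>)`	Gives the group an alias (a named group)
`(?P=name)`	Refers back to the string matched by the group named name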

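Requirement: match <html><h1>baidu.com</h1></html>

# coding=utf-8
import re

In [92]: re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>", "<html><h1>baidu.com</h1></html>").group()
Out[92]: '<html><h1>baidu.com</h1></html>'

# Change the closing h tag to h2 so that it no longer matches h1, and confirm that the match fails.
In [94]: re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>", "<html><h1>baidu.com</h2></html>").group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>", "<html><h1>baidu.com</h2></html>").group()

AttributeError: 'NoneType' object has no attribute 'group'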

This approach is worth knowing about, but in most cases \1 and \2 are enough to get the job done.
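Where named groups tend to pay off is in reading the results back, since `groupdict()` returns the captures keyed by name. A minimal sketch; the group names `outer`, `inner`, and `text` and the sample strings are assumptions chosen for illustration:

```python
import re

# (?P<name>...) defines a named group; (?P=name) refers back to what it matched.
# The names outer, inner, and text are hypothetical.
NESTED = re.compile(
    r"<(?P<outer>\w+)><(?P<inner>\w+)>(?P<text>.*)</(?P=inner)></(?P=outer)>"
)

match = NESTED.fullmatch("<html><h1>baidu.com</h1></html>")
if match:
    # groupdict() maps each group name to the text it captured.
    print(match.groupdict())

# A closing tag that differs from its opening tag makes the whole match fail.
mismatch = NESTED.fullmatch("<html><h1>baidu.com</h2></html>")
print(mismatch)
```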
