-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathpython3_crash9.html
More file actions
493 lines (439 loc) · 45.7 KB
/
python3_crash9.html
File metadata and controls
493 lines (439 loc) · 45.7 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
<!DOCTYPE html>
<html lang="cn">
<head>
<meta charset="utf-8" />
<title>[雪峰磁针石博客]python3快速入门教程3文本处理1正则表达式</title>
<link rel="stylesheet" href="/theme/css/main.css" />
</head>
<body id="index" class="home">
<header id="banner" class="body">
<h1><a href="/">python自动化测试人工智能 </a></h1>
<nav><ul>
<li><a href="/category/ba-zi.html">八字</a></li>
<li><a href="/category/ce-shi.html">测试</a></li>
<li><a href="/category/ce-shi-kuang-jia.html">测试框架</a></li>
<li><a href="/category/common.html">common</a></li>
<li><a href="/category/da-shu-ju.html">大数据</a></li>
<li><a href="/category/feng-shui.html">风水</a></li>
<li><a href="/category/ji-qi-xue-xi.html">机器学习</a></li>
<li><a href="/category/jie-meng.html">解梦</a></li>
<li><a href="/category/linux.html">linux</a></li>
<li class="active"><a href="/category/python.html">python</a></li>
<li><a href="/category/shu-ji.html">书籍</a></li>
<li><a href="/category/shu-ju-fen-xi.html">数据分析</a></li>
<li><a href="/category/zhong-cao-yao.html">中草药</a></li>
<li><a href="/category/zhong-yi.html">中医</a></li>
</ul></nav>
</header><!-- /#banner -->
<section id="content" class="body">
<article>
<header>
<h1 class="entry-title">
<a href="/python3_crash9.html" rel="bookmark"
title="Permalink to [雪峰磁针石博客]python3快速入门教程3文本处理1正则表达式">[雪峰磁针石博客]python3快速入门教程3文本处理1正则表达式</a></h1>
</header>
<div class="entry-content">
<footer class="post-info">
<abbr class="published" title="2018-09-02T10:45:00+08:00">
Published: 日 02 九月 2018
</abbr>
<address class="vcard author">
By <a class="url fn" href="/author/andrew.html">andrew</a>
</address>
<p>In <a href="/category/python.html">python</a>.</p>
</footer><!-- /.post-info --> <h2 id="_1">简介</h2>
<p>-- 后面有很多习题,可以先做题目再来看文章</p>
<p>参考资料:https://docs.python.org/3/howto/index.html</p>
<p>正则表达式(Regular expressions REs或regexes或regex patterns)本质是小的且高度专业化的编程语言。它嵌入到 Python 中,调用使用re模块。需要指定一些规则来描述那些你希望匹配的字符串集合。这些字符串集合可能包含英语句子、e-mail地址、TeX 命令,或任何你想要的东东。然后可以提出问题,例如“字符串是否匹配该模式?”或“模式是否匹配字符串?”。 您还可以使用RE修改字符串或以各种方式拆分它。</p>
<p>正则表达式模式被编译成字节码,然后由 C 语言写的匹配引擎执行。对于高级的使用,你可能需要关注匹配引擎是如何执行给定RE,并通过一定的方式来编写RE,以便产生运行得更快的字节码。</p>
<p>正则表达式语言小而严格,不是所有的字符处理都可以使用正则表达式。还有一些任务,可以使用正则表达式来完成,但是表达式非常复杂。在这种情况下编写 Python 代码来处理会更好些;尽管 Python 代码比精巧的正则表达式执行起来会慢一些,但可能会更容易理解。</p>
<h2 id="_2">简单的模式</h2>
<h3 id="_3">字符匹配</h3>
<p>大多数字母和字符会匹配它们自身。举个例子,正则表达式test将完全匹配字符串test(你可以启用不区分大小写模式,该正则表达式可以匹配Test或TEST)。</p>
<p>这条规则有例外; 有些字符是特殊的元字符,不匹配自身。 相反,它们匹配一些与众不同的东西,或者通过重复它们或改变它们的含义来影响RE的其他部分。</p>
<p>这是元字符的完整列表</p>
<div class="highlight"><pre><span></span><span class="o">.</span> <span class="o">^</span> <span class="err">$</span> <span class="o">*</span> <span class="o">+</span> <span class="err">?</span> <span class="p">{</span> <span class="p">}</span> <span class="p">[</span> <span class="p">]</span> \ <span class="o">|</span> <span class="p">(</span> <span class="p">)</span>
</pre></div>
<p>方括号 [ ]指定用于存放你需要匹配的字符集合。例如 [abc] 会匹配字符 a,b 或 c;[a-c] 可以实现相同功能。后者使用范围来表示与前者相同的字符集合。如果你想只匹配小写字母,你的 RE 可能写成 [a-z]。</p>
<p>注意方括号元字符在中不会触发特殊功能例如 [akm$] 会匹配字符 'a','k','m' 或 '$'。</p>
<p>你还可以匹配方括号中未列出的所有其它字符,开头添加 ^即可,例如 [^5] 会匹配除了 '5' 之外的任何字符。</p>
<p>元字符转义:前面加上一个反斜杠,以消除它们的特殊功能:[,\ 。</p>
<p>反斜杠后边跟一些字符还可以表示特殊的意义,例如表示十进制数字,表示所有的字母或者表示非空白的字符集合。</p>
<p>让我们来举个例子:\w 匹配任何单词字符。如果正则表达式以字节的形式表示,这相当字符类 [a-zA-Z0-9_];如果正则表达式是一个字符串,\w 会匹配所有 Unicode 数据库(unicodedata 模块提供)中标记为字母的字符。你可以在编译正则表达式的时候,通过提供 re.ASCII 表示进一步限制 \w 的定义。</p>
<p>批注:re.ASCII 标志使得 \w 只能匹配 ASCII 字符,不要忘了,Python3 是 Unicode 的。</p>
<p>下边列举一些反斜杠加字符构成的特殊含义:</p>
<p>特殊字符 | 含义
| --- | --- |
\number |匹配相同编号的组。组从1开始编号。例如,(.+) \1匹配'the the'或'55 55',但不匹配'thethe'(注意组后面的空格)。此特殊序列只能用于匹配前99个组。如果数字的第一个数字是0,或者数字是3个八进制数字长,则不会将其解释为组匹配,而是解释为具有八进制值编号的字符。在方括号内,所有数字转义都被视为字符。默认支持Unicode,中文标点也会作为边界。可以启用re.ASCII。
\d |匹配任何十进制数字,相当于类[0-9]
\D |与 \d 相反,匹配任何非十进制数字的字符,相当于 [^0-9]
\s | 匹配任何空白字符(包含空格、换行符、制表符等),相当于类 [\t\n\r\f\v]
\S |与 \s 相反,匹配任何非空白字符,当相于类 [^\t\n\r\f\v]
\w |匹配任何字符,设置了re.ASCII的情况下等同于[a-zA-Z0-9_]
\W |与 \w 相反,设置了re.ASCII的情况下等同于[^a-zA-Z0-9_]
\b |匹配单词的开始或结束
\B |与 \b 相反
\A |仅匹配字符串的开始。
\Z |仅匹配字符串末尾。</p>
<p>例如 [\s,.] 匹配任何空白字符(/s 的特殊含义),',' 或 ‘.’ 。</p>
<p>转义字符:</p>
<div class="highlight"><pre><span></span>\<span class="n">a</span> \<span class="n">b</span> \<span class="n">f</span> \<span class="n">n</span>
\<span class="n">r</span> \<span class="n">t</span> \<span class="n">u</span> \<span class="n">U</span>
\<span class="n">v</span> \<span class="n">x</span> \\
</pre></div>
<p>'.'匹配除了换行符以外的任何字符。如果设置re.DOTALL,它将匹配包括换行在内的任何字符。</p>
<h3 id="_4">重复</h3>
<p>'*' : 指定前一个字符匹配零次或者多次。</p>
<p>例如ca*t将匹配 ct(0 个 a),cat(1 个字符 a),caaat(3 个字符 a),等等。需要注意的是,由于受到 C 语言的 int 类型大小的内部限制,正则表达式引擎会限制字符 'a' 的重复个数不超过 20 亿个;不过,通常我们工作也用不到那么大的数据,你未必有这么大的内存。</p>
<p>默认的是贪婪的,当你重复匹配一个 RE 时,匹配引擎会尝试尽可能多的去匹配。直到RE不匹配或者到了结尾,匹配引擎就会回退一个字符,然后再继续尝试。</p>
<p>表达式 a[bcd]*b,首先需要匹配字符 'a',然后零个到多个 [bcd],最后以 'b' 结尾。那现在想象一下,这个 RE 匹配字符串 abcbd 会怎样?</p>
<p>步骤 |匹配 |说明
| --- | --- | --- |
1| a| 匹配 RE 的第一个字符 'a'
2| abcbd| 引擎在符合规则的情况下尽可能地匹配 [bcd]<em>,直到该字符串的结尾
3| 失败 引擎尝试匹配 RE 最后一个字符 ‘b’,但当前位置已经是字符串的结尾,所以失败告终
4| abcb| 回退,所以 [bcd]</em> 匹配少一个字符
5| 失败| 再一次尝试匹配 RE 最后一个字符 'b',但字符串最后一个字符是 'd',所以失败告终
6| abc| 再次回退,所以 [bcd]* 这次只匹配 'bc'
7| abcb| 再一次尝试匹配这符 'b',这一次字符串当前位置指向的字符正好是 'b',匹配成功。最终,RE 匹配的结果是 abcb 。</p>
<p>另一个实现重复的元字符是 +,用于指定前一个字符匹配一次或者多次。</p>
<p>要特别注意 * 和 + 的区别:* 匹配的是零次或者多次,所以被重复的内容可能压根儿不会出现;+ 至少需要出现一次。例如 ca+t 会匹配 cat 和 caaat,但不会匹配 ct 。</p>
<p>还有两个表示重复的元字符,一个是问号?: 字符匹配零次或者一次。</p>
<p>元字符 {m,n}(m 和 n 都是十进制数),必须匹配 m 次到 n 次之间。例如 a/{1,3}b 会匹配 a/b,a//b 和 a///b 。但不会匹配 ab(没有斜杠);也不会匹配 a////b(斜杠超过三个)。</p>
<p>你可以省略 m 或者 n,这样的话,引擎会用合理的值代替。 m默认为0;n 默认为 20 亿。</p>
<p>{0,}跟’*‘ 是一样的;{1,}跟‘+’是一样的;{0,1} 跟?是一样的。不过还是鼓励大家记住并使用 * 、+ 和 ?,因为这些字符更短并且更容易阅读。</p>
<h3 id="_5">习题</h3>
<p>1, 下面那些不是python3正则表达式的元字符:</p>
<p>A $ B - C * D ? E /</p>
<p>参考答案:B E</p>
<p>2,python3正则表达式r'\bfoo\b'匹配下面哪些字符串</p>
<p>A 'foo' B 'foo.' C '(foo)' D 'bar foo baz' E 'foobar' F 'foo3'</p>
<p>参考答案:A B C D</p>
<p>3,python3正则表达式r'\bfoo\b'匹配下面哪些字符串</p>
<p>A 'foo,' B 'foo。' C '(foo!' D 'bar foo baz' E 'foobar' F 'foo3'</p>
<p>参考答案:A B C D</p>
<p>4,下面python3正则表达式元字符的描述哪些是错误的。
A. 默认\w不能匹配汉字
B. 默认\w能匹配汉字
C. 默认.能匹配换行符
D. 默认.不能匹配换行符</p>
<p>参考答案:A C </p>
<h2 id="_6">使用正则表达式</h2>
<p>Python通过re模块提供正则表达式支持,并可将正则表达式编译成对象,用它们来进行匹配。</p>
<h3 id="_7">编译正则表达式</h3>
<p>正则表达式编译为模式对象,该对象拥有各种方法,如查找模式匹配或者执行字符串替换。</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">re</span>
<span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'ab*'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">p</span>
<span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'ab*'</span><span class="p">)</span>
</pre></div>
<p>re.compile()也接受可选的flags 参数,用于开启各种特殊功能和语法变化。</p>
<p>简单的例子:</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'ab*'</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">IGNORECASE</span><span class="p">)</span>
</pre></div>
<p>正则表达式作为字符串参数传给 re.compile() 。由于正则表达式并不是 Python语言的核心部分,因此没有为它提供特殊的语法支持,所以正则表达式只能以字符串的形式表示。</p>
<h3 id="_8">原始字符串</h3>
<p>'\section' 要用'\\section'表示。不巧,Python 字符串也使用 '\' 表示字符 '\'。原始字符串则方便得多:r"\section"即可。</p>
<h3 id="_9">匹配</h3>
<p>模式对象的常用方法:</p>
<p>方法 |功能
| --- | --- |
match() |判断正则表达式是否从开始处匹配一个字符串
search() |遍历字符串,找到正则表达式匹配的第一个位置
findall() |遍历字符串,找到正则表达式匹配的所有位置,并以列表的形式返回
finditer() |遍历字符串,找到正则表达式匹配的所有位置,并以迭代器的形式返回</p>
<p>如果没有找到任何匹配的话,match() 和 search() 会返回 None;如果匹配成功,则会返回匹配对象(match object),包含所有匹配的信息:例如从哪儿开始,到哪儿结束,匹配的子字符串等等。</p>
<p>python自带了正则表达式调试工具redemo.py:</p>
<p><img alt="图片.png" src="https://upload-images.jianshu.io/upload_images/10819934-725670e086543b42.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240"></p>
<p>实例:</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">re</span>
<span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'[a-z]+'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">p</span>
<span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'[a-z]+'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">p</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s2">""</span><span class="p">))</span>
<span class="bp">None</span>
<span class="o">>>></span> <span class="n">m</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s1">'tempo'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">m</span>
<span class="o"><</span><span class="n">re</span><span class="o">.</span><span class="n">Match</span> <span class="nb">object</span><span class="p">;</span> <span class="n">span</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">match</span><span class="o">=</span><span class="s1">'tempo'</span><span class="o">></span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">()</span>
<span class="s1">'tempo'</span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">start</span><span class="p">(),</span> <span class="n">m</span><span class="o">.</span><span class="n">end</span><span class="p">()</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">span</span><span class="p">()</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
</pre></div>
<p>匹配对象包含了很多方法和属性,以下几个是最重要的:</p>
<p>方法 |功能
| --- | --- |
group() |返回匹配的字符串
start() |返回匹配的开始位置
end() |返回匹配的结束位置
span() |返回元组表示匹配位置(start, end) </p>
<p>match的start总是0,search() 方法则不一定了:</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s1">'::: message'</span><span class="p">))</span>
<span class="bp">None</span>
<span class="o">>>></span> <span class="n">m</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'::: message'</span><span class="p">);</span> <span class="k">print</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
<span class="o"><</span><span class="n">re</span><span class="o">.</span><span class="n">Match</span> <span class="nb">object</span><span class="p">;</span> <span class="n">span</span><span class="o">=</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">11</span><span class="p">),</span> <span class="n">match</span><span class="o">=</span><span class="s1">'message'</span><span class="o">></span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">()</span>
<span class="s1">'message'</span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">span</span><span class="p">()</span>
<span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">11</span><span class="p">)</span>
</pre></div>
<p>在实际应用中,最常用的方法是将匹配对象存放在局部变量中,并检查其返回值是否为 None 。</p>
<div class="highlight"><pre><span></span><span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span> <span class="o">...</span> <span class="p">)</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">match</span><span class="p">(</span> <span class="s1">'string goes here'</span> <span class="p">)</span>
<span class="k">if</span> <span class="n">m</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s1">'Match found: '</span><span class="p">,</span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">())</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s1">'No match'</span><span class="p">)</span>
</pre></div>
<p>有两个方法可以返回所有的匹配结果,一个是 findall(),一个是 finditer() 。</p>
<p>findall() 返回的是一个列表:</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s1">'\d+'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">p</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">'12 drummers drumming, 11 pipers piping, 10 lords a-leaping'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'12'</span><span class="p">,</span> <span class="s1">'11'</span><span class="p">,</span> <span class="s1">'10'</span><span class="p">]</span>
</pre></div>
<p>而 finditer() 则是将匹配对象作为迭代器返回:</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">iterator</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">finditer</span><span class="p">(</span><span class="s1">'12 drummers drumming, 11 ... 10 ...'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">iterator</span>
<span class="o"><</span><span class="n">callable_iterator</span> <span class="nb">object</span> <span class="n">at</span> <span class="mi">0</span><span class="n">x</span><span class="o">...></span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">match</span> <span class="ow">in</span> <span class="n">iterator</span><span class="p">:</span>
<span class="o">...</span> <span class="k">print</span><span class="p">(</span><span class="n">match</span><span class="o">.</span><span class="n">span</span><span class="p">())</span>
<span class="o">...</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="p">(</span><span class="mi">22</span><span class="p">,</span> <span class="mi">24</span><span class="p">)</span>
<span class="p">(</span><span class="mi">29</span><span class="p">,</span> <span class="mi">31</span><span class="p">)</span>
</pre></div>
<h3 id="_10">模块级函数</h3>
<p>re 模块同时还提供了一些全局函数,例如 match(),search(),findall(),sub() 等。这些函数跟模式对象同名方法一样,只是增加了为正则表达式字符串的第一个参数。</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="sa">r</span><span class="s1">'From\s+'</span><span class="p">,</span> <span class="s1">'Fromage amk'</span><span class="p">))</span>
<span class="bp">None</span>
<span class="o">>>></span> <span class="n">re</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="sa">r</span><span class="s1">'From\s+'</span><span class="p">,</span> <span class="s1">'From amk Thu May 14 19:12:10 1998'</span><span class="p">)</span>
<span class="o"><</span><span class="n">re</span><span class="o">.</span><span class="n">Match</span> <span class="nb">object</span><span class="p">;</span> <span class="n">span</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">match</span><span class="o">=</span><span class="s1">'From '</span><span class="o">></span>
</pre></div>
<p>这些函数帮你自动创建模式对象,并调用相关的函数。它们还将编译好的模式对象存放在缓存中,以便将来可以快速地直接调用。</p>
<p>在循环中大量的使用正则表达式,预编译的话可以节省一些函数调用,用模式的方法更好。但如果是循环外部,因为有内部缓存机制,两者效率相差无几。</p>
<h3 id="_11">编译标识</h3>
<p>编译标识可修改正则表达式的工作方式。在 re 模块下,编译标识均有完整名和简写,例如 IGNORECASE 简写是 I。</p>
<p>标志 |含义
| --- | --- |
ASCII,A |使得转义符号 \w,\b,\s 和 \d 只能匹配 ASCII 字符
DOTALL,S |使得 . 匹配任何符号,包括换行符
IGNORECASE,I |不区分大小写
LOCALE,L |支持当前语言(区域)设置
MULTILNE,M |多行匹配,影响 ^ 和 $
VERBOSE,X(for'extended') |启用详细的正则表达式, 更简洁且更容易理解。 </p>
<p>下面我们来详细讲解一下它们的含义:</p>
<ul>
<li>I</li>
</ul>
<p>IGNORECASE</p>
<p>不区分大小写。举个例子,正则表达式[a-z] 或[A-Z]会匹配对应的小写字母会匹配52个ASCII字母。还有4个非ASCII字母:'İ'(U + 0130,拉丁大写字母I上面带点),'ı'(U + 0131,拉丁文小写字母无点i),'s'(U + 017F,拉丁文小 字母长s)和'K'(U + 212A,开尔文标志)。 Spam 将匹配'Spam','spam','spAM'或'ſpam'(后者仅在Unicode模式下匹配)。 此小写不考虑当前区域设置; 如果您还设置了LOCALE标志,它也会。</p>
<ul>
<li>L</li>
</ul>
<p>LOCALE</p>
<p>使得 \w,\W,\b 和 \B和大小写依赖当前的语言(区域)环境,而不是 Unicode 数据库。</p>
<p>区域设置是 C 语言的功能,主要作用是消除不同语言之间的差异。例如你正在处理的是法文文本,你想使用 \w+ 来匹配单词,但是 \w 只是匹配 [A-Za-z] 中的单词,并不会匹配那些法文特殊符号。python3已经推荐使用Unicode,不建议使用LOCALE。</p>
<ul>
<li>M</li>
</ul>
<p>MULTILNE</p>
<p>通常 ^ 只匹配字符串的开头,而 \$ 匹配字符串的结尾。该标识设置时,^ 不仅匹配字符串的开头,还匹配行首;$ 不仅匹配字符的结尾,还匹配行尾。</p>
<ul>
<li>S</li>
</ul>
<p>DOTALL</p>
<p>. 可以匹配任何字符,包括换行符。如果不使用这个标志,. 将匹配除换行的所有字符。</p>
<ul>
<li>A</li>
</ul>
<p>ASCII
使得 \w,\W,\b,\B,\s 和 \S 只匹配 ASCII 字符,而不是全Unicode匹配。这个标识仅对Unicode模式有意义,并忽略字节模式。</p>
<ul>
<li>X</li>
</ul>
<p>VERBOSE</p>
<p>这个标识使你的正则表达式可以写得更可读灵活,空格会被忽略(除了出现在字符类中和使用反斜杠转义的空格);同时允许你在正则表达式字符串使用注释,# 符号后边的内容是注释(除了出现在字符类中和使用反斜杠转义的 #)。</p>
<p>下边是使用 re.VERBOSE 的例子,:</p>
<div class="highlight"><pre><span></span><span class="n">charref</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s2">"""</span>
<span class="s2"> &[#] # Start of a numeric entity reference</span>
<span class="s2"> (</span>
<span class="s2"> 0[0-7]+ # Octal form</span>
<span class="s2"> | [0-9]+ # Decimal form</span>
<span class="s2"> | x[0-9a-fA-F]+ # Hexadecimal form</span>
<span class="s2"> )</span>
<span class="s2"> ; # Trailing semicolon</span>
<span class="s2">"""</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">VERBOSE</span><span class="p">)</span>
<span class="c1"># 没有注释</span>
<span class="n">charref</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"&#(0[0-7]+"</span>
<span class="s2">"|[0-9]+"</span>
<span class="s2">"|x[0-9a-fA-F]+);"</span><span class="p">)</span>
</pre></div>
<h3 id="_12">习题</h3>
<p>1,下面关于python3 正则模式的说哪些是错误的?</p>
<p>A. match() 可以返回多个匹配
B. search() 必须从开头开始匹配
C. findall() 以迭代器的形式返回
D. finditer() 以列表的形式返回</p>
<p>答案:ABCD</p>
<p>2,python3 正则表达式re.compile('spam', re.IGNORECASE)匹配哪些字符串?
A.'Spam' B. 'spam' C. 'spAM' D. 'ſpam' E.'pam'</p>
<p>答案:ABCD</p>
<h2 id="_13">更多关于模式</h2>
<h3 id="_14">更多元字符</h3>
<ul>
<li>I</li>
</ul>
<p>或操作符,对两个正则表达式进行操作。如果 A 和 B 是正则表达式,A | B 会匹配 A 或 B 中出现的任何字符。为了能够更加合理的工作,| 的优先级非常低。例如 Crow|Servo匹配'Crow'或'Servo',,而不是匹配'Cro'、 'w' 、'S'、'ervo'.。</p>
<p>同样,我们使用 | 来匹配 '|' 字符本身;或者包含在一个字符类中,像这样 [|] 。</p>
<ul>
<li>^</li>
</ul>
<p>匹配字符串开始。在 MULTILNE 中,每当遇到换行符就会立刻进行匹配。如果你只希望匹配位于字符串开头的单词 From,那么你的正则表达式可以写为 ^From</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'^From'</span><span class="p">,</span> <span class="s1">'From Here to Eternity'</span><span class="p">))</span>
<span class="o"><</span><span class="n">re</span><span class="o">.</span><span class="n">Match</span> <span class="nb">object</span><span class="p">;</span> <span class="n">span</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">match</span><span class="o">=</span><span class="s1">'From'</span><span class="o">></span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'^From'</span><span class="p">,</span> <span class="s1">'Reciting From Memory'</span><span class="p">))</span>
<span class="bp">None</span>
</pre></div>
<ul>
<li>$</li>
</ul>
<p>匹配字符串的结束和行尾。</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'}$'</span><span class="p">,</span> <span class="s1">'{block}'</span><span class="p">))</span>
<span class="o"><</span><span class="n">re</span><span class="o">.</span><span class="n">Match</span> <span class="nb">object</span><span class="p">;</span> <span class="n">span</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">),</span> <span class="n">match</span><span class="o">=</span><span class="s1">'}'</span><span class="o">></span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'}$'</span><span class="p">,</span> <span class="s1">'{block} '</span><span class="p">))</span>
<span class="bp">None</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'}$'</span><span class="p">,</span> <span class="s1">'{block}</span><span class="se">\n</span><span class="s1">'</span><span class="p">))</span>
<span class="o"><</span><span class="n">re</span><span class="o">.</span><span class="n">Match</span> <span class="nb">object</span><span class="p">;</span> <span class="n">span</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">),</span> <span class="n">match</span><span class="o">=</span><span class="s1">'}'</span><span class="o">></span>
</pre></div>
<ul>
<li>A</li>
</ul>
<p>匹配字符串开始,如果没有设置 MULTILINE 标志的时候,\A 和 ^ 的功能是一样的;但如果设置了 MULTILNE 标志, ^ 会对字符串的每一行都进行匹配。</p>
<ul>
<li>Z</li>
</ul>
<p>匹配字符串的结束。</p>
<ul>
<li>b</li>
</ul>
<p>单词边界,这是只匹配单词的开始和结尾的零宽断言。单词定义为字母数字的序列,所以单词的结束指的是空格或者非字母数字的字符。</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s1">'\bclass\b'</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'no class at all'</span><span class="p">))</span>
<span class="o"><</span><span class="n">re</span><span class="o">.</span><span class="n">Match</span> <span class="nb">object</span><span class="p">;</span> <span class="n">span</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span> <span class="n">match</span><span class="o">=</span><span class="s1">'class'</span><span class="o">></span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'the declassified algorithm'</span><span class="p">))</span>
<span class="bp">None</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'one subclass is'</span><span class="p">))</span>
<span class="bp">None</span>
</pre></div>
<p>在使用这些特殊的序列的时候,有两点是需要注意的:第一是Python 的字符串跟正则表达式在有些字符上是有冲突的。比如说在 Python 中,\b 表示的是退格符,ASCII 码值是 8。</p>
<p>下边例子中,我们故意不写表示原始字符串的 'r',结果确实大相庭径:</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'</span><span class="se">\b</span><span class="s1">class</span><span class="se">\b</span><span class="s1">'</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'no class at all'</span><span class="p">))</span>
<span class="bp">None</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'</span><span class="se">\b</span><span class="s1">'</span> <span class="o">+</span> <span class="s1">'class'</span> <span class="o">+</span> <span class="s1">'</span><span class="se">\b</span><span class="s1">'</span><span class="p">))</span>
<span class="o"><</span><span class="n">re</span><span class="o">.</span><span class="n">Match</span> <span class="nb">object</span><span class="p">;</span> <span class="n">span</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">7</span><span class="p">),</span> <span class="n">match</span><span class="o">=</span><span class="s1">'</span><span class="se">\x08</span><span class="s1">class</span><span class="se">\x08</span><span class="s1">'</span><span class="o">></span>
</pre></div>
<p>第二点是在字符类中不能使用这个断言。在字符类中,\b 只是用来表示退格符。</p>
<ul>
<li>\B</li>
</ul>
<p>与 \b 的含义相反,\B 表示非单词边界的位置。</p>
<h3 id="_15">分组</h3>
<p>仅仅知道正则表达式是否匹配是不够的,正则表达式通常使用分组的方式分别对不同内容进行匹配。</p>
<p>下例子将 RFC-822 头用 “:” 号分成名字和值:</p>
<div class="highlight"><pre><span></span><span class="n">From</span><span class="p">:</span> <span class="n">author</span><span class="nd">@example.com</span>
<span class="n">User</span><span class="o">-</span><span class="n">Agent</span><span class="p">:</span> <span class="n">Thunderbird</span> <span class="mf">1.5</span><span class="o">.</span><span class="mf">0.9</span> <span class="p">(</span><span class="n">X11</span><span class="o">/</span><span class="mi">20061227</span><span class="p">)</span>
<span class="n">MIME</span><span class="o">-</span><span class="n">Version</span><span class="p">:</span> <span class="mf">1.0</span>
<span class="n">To</span><span class="p">:</span> <span class="n">editor</span><span class="nd">@example.com</span>
</pre></div>
<p>先用正则表达式匹配整个RFC-822头,使用一个组来匹配头的名字,另一个组匹配名字对应的值。</p>
<p>正则表达式使用元字符'(', ')' 来划分组。'(', ')' 元字符跟数学表达式中的小括号含义差不多;它们将包含在内部的表达式组合在一起,所以你可以对组的内容使用重复操作的元字符,例如 *,+,? 或者 {m,n} 。</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'(ab)*'</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s1">'ababababab'</span><span class="p">)</span><span class="o">.</span><span class="n">span</span><span class="p">())</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</pre></div>
<p>用'(',')'表示的组也捕获它们匹配的文本的起始和结束索引; 这可以通过将参数传递给ggroup(),start(),end() 和 span() 来检索。 组从0开始编号。组0始终存在; 它是整个RE,因此匹配对象方法都将组0作为其默认参数。</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'(a)b'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">m</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s1">'ab'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">()</span>
<span class="s1">'ab'</span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="s1">'ab'</span>
</pre></div>
<p>子组从左到右进行编号,子组也允许嵌套,我们可以通过从左往右来统计左括号 ( 来确定子组的序号。</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'(a(b)c)d'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">m</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s1">'abcd'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="s1">'abcd'</span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="s1">'abc'</span>
<span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="s1">'b'</span>
</pre></div>
<p>group() 方法可传入多个子组的序号:</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="p">(</span><span class="s1">'b'</span><span class="p">,</span> <span class="s1">'abc'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">)</span>
</pre></div>
<p>groups() 方法一次性返回所有的子组匹配的字符串:</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">m</span><span class="o">.</span><span class="n">groups</span><span class="p">()</span>
<span class="p">(</span><span class="s1">'abc'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">)</span>
</pre></div>
<p>反引用可以在后面的位置使用先前匹配过的内容,用法是反斜杠加上数字。例如 \1 表示引用前边成功匹配的序号为 1 的子组。</p>
<div class="highlight"><pre><span></span><span class="o">>>></span> <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s1">'\b(\w+)\s+\1\b'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">p</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">'Paris in the the spring'</span><span class="p">)</span><span class="o">.</span><span class="n">group</span><span class="p">()</span>
<span class="s1">'the the'</span>
</pre></div>
<h3 id="_16">习题</h3>
<p>1, python3正则表达式compile('Crow|Servo')匹配哪些内容:</p>
<p>A 'Crow' B 'foo。' C Servw' D 'Cro' E 'Serv' F 'Servo'</p>
<h3 id="_17">重复</h3>
<p>参考答案:A F</p>
<p>2, python3正则表达式要匹配'|',可以采用如下哪些方法:</p>
<p>A '||' B '|' C '[|]' D '$|' </p>
<p>参考答案:B C </p>
<p>3,关于python3正则表达式,下面哪些说法是正确的</p>
<p>A re.findall(r'^From', 'From Here to Eternity\nFrom', re.MULTILINE) 返回长度为2的列表
B re.findall(r'^From', 'From Here to Eternity\nFrom') 返回长度为2的列表
C re.findall(r'^From', ' From Here to Eternity\nFrom', re.MULTILINE) 返回长度为2的列表
D re.findall(r'\bFrom', ' From Here to Eternity\nFrom', re.MULTILINE) 返回长度为2的列表</p>
<p>参考答案:A D</p>
<p>4,关于python3正则表达式,下面哪些说法是正确的</p>
<p>A re.search(r'\bclass\b', 'no class at all')能成功匹配
B re.search('\bclass\b', 'no class at all')能成功匹配
C re.search(r'\bclass\b', 'no class at all')不能能成功匹配
D re.search('\bclass\b', 'no class at all')不能成功匹配</p>
<p>参考答案:A D</p>
<p>待整理: notepad++ 正则 http://docs.notepad-plus-plus.org/index.php/Regular_Expressions</p>
<h2 id="_18">参考资料</h2>
<ul>
<li>python测试等IT技术支持qq群: 144081101(后期会录制视频存在该群群文件) 591302926 567351477 钉钉免费群:21745728</li>
<li>道家技术-手相手诊看相中医等钉钉群21734177 qq群:391441566 184175668 338228106 看手相、面相、舌相、抽签、体质识别。服务费50元每人次起。请联系钉钉或者微信pythontesting</li>
<li><a href="https://china-testing.github.io/python3_crash8.html">本文最新版本地址</a></li>
<li><a href="https://github.com/china-testing/python-api-tesing">本文涉及的python测试开发库</a> 谢谢点赞!</li>
<li><a href="https://github.com/china-testing/python-api-tesing/blob/master/books.md">本文相关海量书籍下载</a></li>
<li><a href="https://china-testing.github.io/testing_training.html">接口自动化性能测试线上培训大纲</a></li>
</ul>
</div><!-- /.entry-content -->
</article>
</section>
<section id="extras" class="body">
<div class="blogroll">
<h2>links</h2>
<ul>
<li><a href="https://china-testing.github.io/testing_training.html">自动化性能接口测试线上及深圳培训与项目实战 qq群:144081101 591302926</a></li>
<li><a href="http://blog.sciencenet.cn/blog-2604609-1112306.html">pandas数据分析scrapy爬虫 521070358 Py人工智能pandas-opencv 6089740</a></li>
<li><a href="http://blog.sciencenet.cn/blog-2604609-1112306.html">中医解梦看相八字算命qq群 391441566 csdn书籍下载-python爬虫 437355848</a></li>
</ul>
</div><!-- /.blogroll -->
</section><!-- /#extras -->
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
Proudly powered by <a href="http://getpelican.com/">Pelican</a>, which takes great advantage of <a href="http://python.org">Python</a>.
</address><!-- /#about -->
<p>The theme is by <a href="http://coding.smashingmagazine.com/2009/08/04/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
</footer><!-- /#contentinfo -->
</body>
</html>