
Regular Expressions - Simple Pattern Matching

news 2025/8/29 5:42:19

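I. Test Data

II. Simple Pattern Matching

1. Matching literals

2. Matching digits and non-digit characters

3. Matching word and non-word characters

4. Matching whitespace

5. Matching any character

6. Matching word boundaries

7. Matching zero or more characters

8. Single-line mode and multi-line mode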

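I. Test Data

The text used here comes with the book 《学习正则表达式》 (Learning Regular Expressions). It is the first few lines of Samuel Taylor Coleridge's poem "The Rime of the Ancyent Marinere", as collected in Lyrical Ballads (London, J.&A. Arch, 1798). To demonstrate single-line and multi-line mode, the table deliberately holds both a single row that contains the whole text with embedded newline characters (ASCII 10) and eleven separate rows, one per line of text, without newlines.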

drop table if exists t_regexp;
create table t_regexp(a text);
insert into t_regexp values (
'THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
towards the South Pole; and how from thence she made her course to the tropical
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?');
insert into t_regexp values ('THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.'),
('ARGUMENT.'),
('How a Ship having passed the Line was driven by Storms to the cold Country'),
('towards the South Pole; and how from thence she made her course to the tropical'),
('Latitude of the Great Pacific Ocean; and of the strange things that befell;'),
('and in what manner the Ancyent Marinere came back to his own Country.'),
('I.'),
('1       It is an ancyent Marinere,'),
('2          And he stoppeth one of three:'),
('3       "By thy long grey beard and thy glittering eye'),
('4          "Now wherefore stoppest me?');

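II. Simple Pattern Matching

1. Matching literals

To match a string literal, just use ordinary characters. For example, regexp_like(a,'Ship') matches the rows whose column a contains the text Ship. Matching is case-insensitive by default (more precisely, case sensitivity follows the collation of the arguments unless the c or i match type is given). The result is as follows: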

mysql> select a from t_regexp where regexp_like(a,'Ship')\G
*************************** 1. row ***************************
a: THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
towards the South Pole; and how from thence she made her course to the tropical
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?
*************************** 2. row ***************************
a: How a Ship having passed the Line was driven by Storms to the cold Country
2 rows in set (0.00 sec)
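
The default case handling can be checked on string constants instead of the test table. This is a minimal sketch, assuming MySQL 8.0 with its default case-insensitive collation (utf8mb4_0900_ai_ci); the sample strings are chosen only for illustration.

```sql
-- Default: case sensitivity follows the collation (case-insensitive here)
select regexp_like('How a Ship having passed', 'ship');        -- 1 under the assumed collation
-- Match type 'c' forces a case-sensitive comparison
select regexp_like('How a Ship having passed', 'ship', 'c');   -- 0
select regexp_like('How a Ship having passed', 'Ship', 'c');   -- 1
```

Passing 'c' as the third argument is a simple way to make a literal match exact without changing the column's collation.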

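2. Matching digits and non-digit characters

The following three queries are equivalent; each matches the rows in which column a contains a digit.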

select a from t_regexp where regexp_like(a,'[0123456789]');
select a from t_regexp where regexp_like(a,'[0-9]');
select a from t_regexp where regexp_like(a,'\\d');

Match rows that start with a digit:

select a from t_regexp where regexp_like(a,'^\\d');

Match rows that consist of digits only:

select a from t_regexp where regexp_like(a,'^\\d+$');

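A character class lets you match exactly the characters you want. The digit shorthand \d is shorter, but it is less powerful and less flexible than a character class. Use a character class when \d is unavailable (not every regex flavor supports it) or when you want to match specific digits; otherwise \d is fine because it is more concise.

The following four queries are equivalent; each matches the rows in which column a contains at least one non-digit character.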

select a from t_regexp where regexp_like(a,'[^0123456789]');
select a from t_regexp where regexp_like(a,'[^0-9]');
select a from t_regexp where regexp_like(a,'[^\\d]');
select a from t_regexp where regexp_like(a,'\\D');

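Match rows that contain no digits at all (this is not the same as letters only, because spaces and punctuation are non-digits too):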

select * from t_regexp where regexp_like(a,'^\\D+$');

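To match a non-digit character, use the shorthand \D with an uppercase D. Note that inside a character class (square brackets) a leading ^ no longer means start of line; it negates the class, meaning "match anything except these characters".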
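
The difference between "contains a non-digit" and "contains only non-digits" is easy to overlook, so here is a minimal sketch on string constants. The sample strings are illustrative only and are not taken from a query against t_regexp.

```sql
-- '^\\D+$': every character from start to end is a non-digit
select regexp_like('ARGUMENT.', '^\\D+$');   -- 1: no digits, although '.' is not a letter
select regexp_like('2 And he', '^\\D+$');    -- 0: the string contains a digit
-- '\\D' alone only asks for at least one non-digit somewhere
select regexp_like('2 And he', '\\D');       -- 1
-- '^\\d+$': every character is a digit
select regexp_like('1798', '^\\d+$');        -- 1
```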

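3. Matching word and non-word characters

The shorthand \w matches any word character. The difference between \D and \w is that \D also matches spaces and punctuation (quotation marks, hyphens, backslashes, square brackets, and so on), whereas \w matches only letters, digits, and the underscore. In an English-language context, the character class equivalent to \w is: [_a-zA-Z0-9]

\W matches non-word characters: spaces, punctuation, and any other character that is not a letter, a digit, or the underscore. The following character class matches the same thing: [^_a-zA-Z0-9]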

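The following table lists more character shorthands.

Shorthand	Description
\a	Alert (bell)
[\b]	Backspace character
\c x	Control character
\d	Digit character
\D	Non-digit character
\w	Word character
\W	Non-word character
\0	Null character
\x xx	Hexadecimal value of a character
\o xxx	Octal value of a character
\u xxx	Unicode value of a character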

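Match emoji characters (the ranges cover the common emoji blocks; space_user is a separate table, not part of the test data above):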

select userid,nickname from space_user where regexp_like(nickname,'(\\ud83c[\\udf00-\\udfff])|(\\ud83d[\\udc00-\\ude4f\\ude80-\\udeff])|[\\u2600-\\u2B55]') limit 10;

\w does not match punctuation (the query below returns 0, 1, 1):

select regexp_like('()','\\w'),regexp_like('()','\\W'),regexp_like('()','\\D');

Match an e-mail address:

select regexp_like('wxy0327@sohu.com','\\w[-\\w.+]*@([A-Za-z0-9][-A-Za-z0-9]+\\.)+[A-Za-z]{2,14}');
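
Because the pattern is written inside a MySQL string literal, every regex backslash has to be doubled, including the one that escapes a literal dot. The sketch below shows the difference on made-up strings; it assumes the NO_BACKSLASH_ESCAPES SQL mode is not enabled.

```sql
-- One backslash: MySQL reads '\.' as '.', so the regex dot matches any character
select regexp_like('axb', 'a\.b');    -- 1
-- Two backslashes: the regex engine receives \. and matches a literal dot only
select regexp_like('axb', 'a\\.b');   -- 0
select regexp_like('a.b', 'a\\.b');   -- 1
```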

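4. Matching whitespace

\s matches what the character class [ \t\n\r] matches: the space, the tab (\t), the newline (\n), and the carriage return (\r). Many engines, including the ICU library used by MySQL 8.0, also let \s match the form feed and a few other whitespace characters. \s has an uppercase counterpart as well: to match a non-whitespace character, use \S, [^ \t\n\r], or [^\s].

The following table lists shorthands for common and less common whitespace characters.

Shorthand	Description
\f	Form feed
\h	Horizontal whitespace
\H	Non-horizontal whitespace
\n	Newline
\r	Carriage return
\s	Whitespace
\S	Non-whitespace
\t	Horizontal tab
\v	Vertical tab
\V	Non-vertical whitespace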
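
A short sketch of \s in practice, using string constants modelled on the test data. It relies only on regexp_replace and regexp_like; note that '\n' inside a MySQL string literal is an actual newline character.

```sql
-- Collapse each run of whitespace into a single space
select regexp_replace('1       It is an ancyent Marinere,', '\\s+', ' ');
-- expected: 1 It is an ancyent Marinere,

-- The embedded newline counts as whitespace
select regexp_like('ARGUMENT.\nI.', '\\s');   -- 1
select regexp_like('ARGUMENT.', '\\s');       -- 0
```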

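5. Matching any character

One way to match any character is the dot (U+002E). The dot matches every character except line terminators, with a few exceptions. To match the whole phrase THE RIME you could write eight dots, but the quantifier form is preferable: .{8}

This expression matches the first two words and the space between them, but only loosely. Try it at https://www.dute.org/regex to see what "loosely" means here: it matches one group of eight characters after another, end to end, leaving out only the last few characters of the target text.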
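
The "one group of eight after another" behavior can also be seen in MySQL with regexp_substr, whose fourth argument selects the occurrence. This is a minimal sketch on the first words of the poem's title line, written as a string constant.

```sql
-- First run of eight characters
select regexp_substr('THE RIME OF THE ANCYENT MARINERE', '.{8}');         -- 'THE RIME'
-- Second run (pos = 1, occurrence = 2) starts right after the first
select regexp_substr('THE RIME OF THE ANCYENT MARINERE', '.{8}', 1, 2);   -- ' OF THE '
```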

6. Matching word boundaries

Next, let's match word boundaries together with the first and last letters of a word:

\bA.{5}T\b

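You can see a subtle difference:

This expression is more specific (keep the notion of specificity in mind; it is an important concept): it matches the word ANCYENT. The shorthand \b matches a word boundary without consuming any characters; the characters A and T pin down the first and last letters of the sequence; .{5} matches any five characters; and the final \b matches the word's other boundary.

Now try this shorthand-based pattern: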

\b\w{7}\b

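The result is shown in the figure below: every word made of exactly seven word characters is matched, such as ANCYENT and Country.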
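
Both patterns can be tried directly in MySQL. The sketch below uses regexp_substr on string constants and assumes the default case-insensitive collation; remember that \b must be written as \\b inside a string literal.

```sql
select regexp_substr('THE RIME OF THE ANCYENT MARINERE', '\\bA.{5}T\\b');   -- 'ANCYENT'
select regexp_substr('THE RIME OF THE ANCYENT MARINERE', '\\b\\w{7}\\b');   -- 'ANCYENT'
-- Without the boundaries, seven word characters can be taken from inside a longer word
select regexp_substr('MARINERE', '\\w{7}');         -- 'MARINER'
select regexp_substr('MARINERE', '\\b\\w{7}\\b');   -- NULL: the word has eight characters
```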

7. Matching zero or more characters

Finally, let's match zero or more characters:

.*

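This is equivalent to [^\n]* (or to [^\n\r]* when the carriage return is also treated as a line terminator). Similarly, the dot can be combined with the "one or more" quantifier (+):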

.+
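
A minimal sketch of how the two quantifiers behave, again on illustrative string constants. The first query shows the dot stopping at a newline; the other two show the difference between "zero or more" and "one or more".

```sql
-- .* stops at the first newline
select regexp_substr('ARGUMENT.\nI.', '.*');   -- 'ARGUMENT.'
-- After 'ab' nothing is left: .* accepts that, .+ needs at least one more character
select regexp_like('ab', '^ab.*$');   -- 1
select regexp_like('ab', '^ab.+$');   -- 0
```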

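8. Single-line mode and multi-line mode

Single-line mode makes the wildcard dot . match every character, including newlines. Multi-line mode makes ^ and $ match at the start and end of every line in the string. As these definitions show, the two modes control the behavior of . and of ^ $ respectively; they do not mean "one line" versus "several lines", as the names suggest. The two modes are therefore not mutually exclusive: you can use both, either one, or neither, and MySQL regular expressions use neither by default. Dotall is really a better name for single-line mode; the next article gives a dotall example, and here the test data is used to illustrate multi-line mode. The requirement is to wrap every line that starts with T or t in the HTML tags <h1> and </h1>.

select regexp_replace(a,'(^T.*$)','<h1>$1</h1>',1,0,'im') from t_regexp limit 1\G

The result is as follows; the first and fourth lines are tagged, as expected.

<h1>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.</h1>
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
<h1>towards the South Pole; and how from thence she made her course to the tropical</h1>
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?

Arguments of regexp_replace:

a: the column that holds the source string.
(^T.*$): the regular expression; it matches a line that starts with T and uses parentheses to capture the text in a group.
<h1>$1</h1>: the replacement expression; it wraps the text captured as $1 in h1 tags.
1: the position at which the search starts; the default is 1.
0: which occurrence to replace; the default is 0, meaning all occurrences.
im: the match type; i means case-insensitive and m means multi-line mode. Without m, ^ and $ match only at the start and end of the whole string, and because . does not match a newline, ^T.*$ can then no longer match the individual lines of this multi-line value.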
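
The effect of the occurrence and match-type arguments is easier to see on a tiny value. The sketch below uses a made-up three-line string ('Tab', 'top', 'box' separated by newlines) and angle brackets instead of HTML tags; only the regexp_replace arguments described above are used.

```sql
-- With 'm', ^ and $ anchor at every line; 'i' lets T also match t
select regexp_replace('Tab\ntop\nbox', '(^T.*$)', '<$1>', 1, 0, 'im');
-- expected lines: <Tab>, <top>, box

-- occurrence = 2 replaces only the second match
select regexp_replace('Tab\ntop\nbox', '(^T.*$)', '<$1>', 1, 2, 'im');
-- expected lines: Tab, <top>, box

-- Without 'm', ^ and $ anchor only at the ends of the whole string,
-- and . cannot cross a newline, so nothing is replaced
select regexp_replace('Tab\ntop\nbox', '(^T.*$)', '<$1>', 1, 0, 'i');
```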

Now change the requirement: wrap every line in the HTML tags <h1> and </h1>.

select regexp_replace(a,'(^.*$)','<h1>$1</h1>',1,0,'im') from t_regexp limit 1\G

The result is as follows:

<h1>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.</h1>
<h1>ARGUMENT.</h1>
<h1>How a Ship having passed the Line was driven by Storms to the cold Country</h1>
<h1>towards the South Pole; and how from thence she made her course to the tropical</h1>
<h1>Latitude of the Great Pacific Ocean; and of the strange things that befell;</h1>
<h1>and in what manner the Ancyent Marinere came back to his own Country.</h1>
<h1>I.</h1>
<h1>1       It is an ancyent Marinere,</h1>
<h1>2          And he stoppeth one of three:</h1>
<h1>3       "By thy long grey beard and thy glittering eye</h1>
<h1>4          "Now wherefore stoppest me?</h1>

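Explanation of ^.*$ inside the capture group:

^ matches the position before the first character of the string (in multi-line mode, of each line).
$ matches the position after the last character of the string (in multi-line mode, of each line).
. matches any single character except a newline.
* matches the preceding element zero or more times.

So ^.*$ matches any characters, zero or more times, from start to end. In other words, it matches everything from the beginning of the string (here, of each line) to its end. Note that the . is required; without it, only a pair of tags is appended at the end of each line:

THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.<h1></h1>
ARGUMENT.<h1></h1>
How a Ship having passed the Line was driven by Storms to the cold Country<h1></h1>
towards the South Pole; and how from thence she made her course to the tropical<h1></h1>
Latitude of the Great Pacific Ocean; and of the strange things that befell;<h1></h1>
and in what manner the Ancyent Marinere came back to his own Country.<h1></h1>
I.<h1></h1>
1       It is an ancyent Marinere,<h1></h1>
2          And he stoppeth one of three:<h1></h1>
3       "By thy long grey beard and thy glittering eye<h1></h1>
4          "Now wherefore stoppest me?<h1></h1>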
