I. Import the re library

To use regular expressions in Python, import the re library:
import re
Regular expression support lives in the re library. Regular expressions are typically used to search for, and replace, text that matches a pattern (rule).

II. Steps for using regular expressions

1. Find the rule (pattern) that the target text follows;

2. Express that rule with regular expression symbols;

3. Extract the information. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.

III. Common basic symbols in regular expressions

1. Dot “.”

    A dot stands for any single character other than the newline character (\n), including but not limited to English letters, digits, Chinese characters, English punctuation and Chinese punctuation.

2. Asterisk “*”

    An asterisk means that the subexpression before it (an ordinary character, another regular expression symbol, or a group of symbols) is repeated zero or more times.

3. Question mark “?”

    A question mark means that the subexpression before it appears zero times or once. Note that this is the English (ASCII) question mark, not the full-width Chinese one.

4. Backslash “\”

    A backslash cannot be used on its own in a regular expression, nor in an ordinary Python string. It is always combined with another character, either to turn a special symbol into an ordinary one (for example “\.” matches a literal dot) or to turn an ordinary character into a special one (for example “\n”, the newline).

5. Digit “\d”

    In a regular expression, “\d” stands for a single digit. Again, although “\d” is written as a backslash followed by the letter d, it must be read as one regular expression symbol.

6. Parentheses “()”

    Parentheses extract the part of the match that falls inside them.
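
The six symbols above can be tried out with a short sketch like the following. The sample strings and variable names are made up for illustration; only re.findall from the standard library is used.

```python
import re

# "." matches any single character except a newline
dot = re.findall(r"a.c", "abc a-c a\nc")              # ['abc', 'a-c']

# "*" repeats the preceding subexpression zero or more times
star = re.findall(r"ab*", "a ab abbb")                # ['a', 'ab', 'abbb']

# "?" makes the preceding subexpression optional (zero times or once)
optional = re.findall(r"colou?r", "color colour")     # ['color', 'colour']

# "\." turns the special dot into an ordinary dot; "\d" is one digit
escaped = re.findall(r"\d\.\d", "3.5 and 4x2")        # ['3.5']
digits = re.findall(r"\d", "a1b22")                   # ['1', '2', '2']

# "()" extracts only the part inside the parentheses
captured = re.findall(r"id=(\d*)", "id=42; id=7")     # ['42', '7']

for name, value in [("dot", dot), ("star", star), ("optional", optional),
                    ("escaped", escaped), ("digits", digits),
                    ("captured", captured)]:
    print(name, value)
```

Writing the patterns as raw strings (r"...") keeps Python itself from interpreting the backslashes before the re library sees them.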

IV. Examples of common regular expressions

1. “.*?” (match anything, as little as possible)

For example: '<title>(.*?)</title>' extracts the title of a web page.
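
A minimal sketch of the title pattern, using an invented HTML snippet. It also shows why the trailing “?” matters: without it, “.*” is greedy and runs on to the last closing tag.

```python
import re

# Hypothetical page fragment with two <title> elements
html = "<title>First</title><p>text</p><title>Second</title>"

# Non-greedy: each match stops at the nearest </title>
lazy = re.findall(r"<title>(.*?)</title>", html)

# Greedy: the single match stretches to the last </title>
greedy = re.findall(r"<title>(.*)</title>", html)

print(lazy)    # ['First', 'Second']
print(greedy)  # ['First</title><p>text</p><title>Second']
```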

2. “\w” matches a word character (for ASCII text, [A-Za-z0-9_]); "+" matches the preceding character one or more times.
For example: suppose a person's email address is [email protected]. How do we extract it from a large amount of text?
pattern: \w+@\w+\.com

Think about it: if the address is [email protected], with one more dot-separated part after the @, how do we match it?
pattern: \w+@(\w+\.)?\w+\.com

Here "?" means that the group in parentheses before it is matched zero times or once. "()" marks the enclosed content as a group; groups are numbered in order from the start of the pattern to its end. Because the group is matched zero times or once, the part in parentheses is optional, so this pattern matches both of the email formats above.

Extension: \w+@(\w+\.)*\w+\.com is more powerful still, because "*" matches the group zero or more times.
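
The three email patterns can be compared side by side. The addresses below are invented placeholders (the ones in the original post are not recoverable), and full_matches is a helper introduced only for this sketch. It uses re.finditer and group(0) because re.findall would return just the contents of the parenthesized group.

```python
import re

# Hypothetical addresses with 0, 1 and 2 extra subdomain levels
text = "alice@example.com, bob@mail.example.com, carol@a.b.example.com"

simple = r"\w+@\w+\.com"
optional = r"\w+@(\w+\.)?\w+\.com"
repeated = r"\w+@(\w+\.)*\w+\.com"

def full_matches(pattern, string):
    """Return the full text of every match, ignoring groups."""
    return [m.group(0) for m in re.finditer(pattern, string)]

simple_hits = full_matches(simple, text)
optional_hits = full_matches(optional, text)
repeated_hits = full_matches(repeated, text)

print(simple_hits)    # ['alice@example.com']
print(optional_hits)  # ['alice@example.com', 'bob@mail.example.com']
print(repeated_hits)  # all three addresses
```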

V. Core functions of the re library

1. compile() function (optional)

•     Function definition: compile(pattern, flags=0)

•     Function description: compiles the regular expression pattern and returns a regular expression object.

Why compile the pattern? "Core Python Programming" explains it this way:

Using a precompiled code object is faster than using a string directly, because the interpreter must compile a string into a code object before it can execute code given in string form.
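
A small sketch of compiling once and reusing the resulting object. The sample lines are invented; the point is only that the compiled object exposes the same operations (findall, match, search) as methods.

```python
import re

# Compile the pattern once...
number = re.compile(r"\d+")

# ...then reuse the compiled object on many strings
sample_lines = ["order 66", "no digits here", "rooms 101 and 202"]
found = [number.findall(line) for line in sample_lines]

# Equivalent calls that pass the pattern string every time
same = [re.findall(r"\d+", line) for line in sample_lines]

print(found)  # [['66'], [], ['101', '202']]
print(found == same)
```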

2. match() function

•     Function definition: match(pattern, string, flags=0)

•     Function description: tries to match pattern only at the beginning of the string. On success it returns a match object (there is only one result); otherwise it returns None.

This raises a question: why does the returned result (result1) contain so much information? Only the last part looks like the text we wanted to match. How do we extract it?
Don't worry. What we have at this point is a match object, and its content has to be pulled out with specific methods. The section "Methods of match objects" further down covers this, so keep reading.
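
The original example code is not available, so the following is only a reconstruction under assumptions: the input strings are invented, and the name result1 is borrowed from the text above.

```python
import re

# Digits at the very beginning of the string: match() succeeds
result1 = re.match(r"\d+", "12345abc")
print(result1)          # a match object showing span and matched text
print(result1.group())  # 12345

# Digits exist, but not at the beginning: match() returns None
result2 = re.match(r"\d+", "abc12345")
print(result2)          # None
```

Printing result1 directly shows the whole match object, which is why the output looks cluttered; group() returns just the matched text.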

3. search() function

* Function definition: search(pattern, string, flags=0)
* Function description: works like match(), except that search() is not tied to the beginning of the string; it scans the string and returns the first match found at any position. If no position matches, it returns None; otherwise it returns a match object.
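
A minimal sketch with an invented input string, showing that search() finds the first match wherever it starts:

```python
import re

text = "abc12345def678"

# search() scans forward and stops at the first match
first = re.search(r"\d+", text)
print(first.group(), first.span())  # 12345 (3, 8)

# No digits anywhere: search() returns None
missing = re.search(r"\d+", "no digits")
print(missing)  # None
```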

4. findall() function

* Function definition: findall(pattern, string, flags=0)
* Function description: finds every occurrence of the regular expression pattern in the string and returns the matches as a list.

Comparing match, search and findall: findall differs from the other two in that it returns a list of all non-overlapping matches (repeated matches are kept, not removed). If no match is found, it returns an empty list.
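
The three functions can be run on the same input to see the difference. The strings here are illustrative only.

```python
import re

text = "a1b22c333"

at_start = re.match(r"\d+", text)           # None: text starts with "a"
first = re.search(r"\d+", text).group()     # '1'
every = re.findall(r"\d+", text)            # ['1', '22', '333']

# Matches never overlap, but identical matches are all reported
pairs = re.findall(r"aa", "aaaa")           # ['aa', 'aa']
repeats = re.findall(r"\d", "1212")         # ['1', '2', '1', '2']

# No match at all gives an empty list, not None
nothing = re.findall(r"\d+", "abc")         # []

print(at_start, first, every, pairs, repeats, nothing)
```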

VI. Methods of match objects (extraction)

The values returned by the re module functions above fall into two kinds:

*      A match object: for example <_sre.SRE_Match object; span=(0, 5), match='12345'>. The functions that return match objects are match, search and finditer (finditer returns an iterator that yields them).
*      A list of matches: the function that returns a list is findall.
Match object methods therefore apply only to the results of match, search and finditer, not to the result of findall.

Match objects have two commonly used methods, group and groups, plus a few position-related ones such as start, end and span.

1. group method

* Method definition: group(num=0)
* Method description: returns the entire match, or the subgroup with the given number

This is where the grouping concept mentioned earlier comes in.

The point of grouping is this: we often want not only the whole string that matched, but also a specific substring inside it. Suppose the entire string is “I 12345+abcde” but we only want “abcde”; we can enclose the corresponding part of the pattern in (). In this way you can group a pattern however you like and extract exactly what you want.
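
A sketch of group and the position methods on the string quoted above. The pattern is an assumption made for this illustration, since the original example code is not available.

```python
import re

text = "I 12345+abcde"

# Two groups: the digits before "+" and the word characters after it
m = re.search(r"(\d+)\+(\w+)", text)

whole = m.group()      # same as m.group(0): the entire match
number = m.group(1)    # first group
letters = m.group(2)   # second group

print(whole)    # 12345+abcde
print(number)   # 12345
print(letters)  # abcde

# Position information for the whole match and for a single group
print(m.start(), m.end(), m.span())  # 2 13 (2, 13)
print(m.span(2))                     # (8, 13)
```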

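2. groups method

* Method definition: groups(default=None)
* Method description: returns a tuple containing all subgroups of the match. If the pattern contains no groups, an empty tuple is returned; a group that did not take part in the match is filled with default. (If the match itself failed, there is no match object, only None.)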
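
A short sketch of groups, reusing the shape of the email pattern from earlier with one invented address. The optional middle group does not take part in this match, which shows what the default argument is for.

```python
import re

# Three groups; the middle one (an extra subdomain) is optional
m = re.search(r"(\w+)@(\w+\.)?(\w+)\.com", "alice@example.com")

parts = m.groups()               # unmatched group becomes None
filled = m.groups(default="")    # or whatever default is given

# A pattern without any parentheses yields an empty tuple
empty = re.search(r"\d+", "abc123").groups()

print(parts)   # ('alice', None, 'example')
print(filled)  # ('alice', '', 'example')
print(empty)   # ()
```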

VII. Flags of the re module

The commonly used flags of the re module are:

* re.I: matching is case-insensitive; (commonly used)
* re.L: \w, \W, \b, \B, \s, \S are matched according to the current locale (in Python 3 this flag can only be used with bytes patterns);
* re.M: ^ and $ match at the beginning and end of each line in the target string, rather than only at the beginning and end of the whole string;
* re.S: "." (dot) normally matches any single character except \n (newline); with this flag, "." matches every character, newline included; (commonly used)
* re.X: whitespace in the pattern is ignored unless it is escaped with a backslash or inside a character class, and # together with all the text after it on that line is treated as a comment. This allows comments in the pattern and improves readability;

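Note:

* If the pattern is compiled with compile, the flags must be passed to the compile function. Passing flags together with a compiled pattern to a module-level function such as re.findall raises an error, and the second positional argument of a compiled pattern's own methods is a start position, not flags.
If the pattern is not compiled, the flags can be passed directly to the matching function, for example findall.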
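
The flags and the note above can be checked with a sketch like this one. The sample text and variable names are invented for illustration.

```python
import re

text = "Hello\nWORLD hello"

# re.I: ignore case
ignore_case = re.findall(r"hello", text, re.I)      # ['Hello', 'hello']

# re.M: ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, re.M)       # ['Hello', 'WORLD']

# re.S: "." also matches the newline between the two lines
without_s = re.search(r"Hello.WORLD", text)         # None
with_s = re.search(r"Hello.WORLD", text, re.S)      # a match object

# re.X: spaces and "#" comments inside the pattern are ignored
verbose = re.compile(r"""
    (\d+)   # digits before the plus sign
    \+      # a literal plus sign
    (\w+)   # word characters after it
""", re.X)
verbose_groups = verbose.search("I 12345+abcde").groups()

# With compile(), the flag belongs in the compile() call
compiled = re.compile(r"hello", re.I)
compiled_hits = compiled.findall(text)              # ['Hello', 'hello']

# Flags plus an already compiled pattern are rejected
try:
    re.findall(compiled, text, re.I)
    raised = False
except ValueError:
    raised = True

print(ignore_case, line_starts, without_s, with_s is not None)
print(verbose_groups, compiled_hits, raised)
```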

Appendix: list of regular expression syntax

 

 
