I. Importing the re library
Python uses the re library to work with regular expressions.
import re
Regular expressions in the re library are typically used to search for, or replace, text that matches a pattern (rule).
II. How to use regular expressions
1. Look for the rule;
2. Express the rule with regular expression symbols;
3. Extract the information. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.
III. Common basic symbols in regular expressions
1. The dot “.”
A dot stands for any single character except the newline (\n), including but not limited to English letters, digits, Chinese characters, and English and Chinese punctuation.
2. The asterisk “*”
An asterisk means the subexpression before it (an ordinary character, another regular expression symbol, or a group of symbols) is repeated 0 or more times.
3. The question mark “?”
A question mark means the subexpression before it appears 0 times or 1 time. Note that this is the English (ASCII) question mark, not the full-width Chinese one.
4. The backslash “\”
A backslash cannot be used on its own in a regular expression, or anywhere else in Python. It is combined with another character either to turn a special symbol into an ordinary one or to turn an ordinary character into a special one, for example “\n”.
5. Digits “\d”
In regular expressions, “\d” stands for one digit. Although “\d” is made up of a backslash and the letter d, it must be treated as a single regular expression symbol.
6. Parentheses “()”
Parentheses extract the content matched inside them.
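The six symbols above can be tried out one at a time. The following is a minimal sketch; the sample strings are made up for illustration and are not from the original article.

```python
import re

# "." matches any single character except a newline.
dot = re.findall(r"a.c", "abc a-c a\nc")

# "*" repeats the preceding subexpression zero or more times.
star = re.findall(r"ab*", "a ab abbb")

# "?" makes the preceding subexpression optional (zero or one time).
optional = re.findall(r"colou?r", "color colour")

# A backslash turns the special "." into an ordinary full stop.
escaped = re.findall(r"3\.14", "3.14 3x14")

# "\d" matches exactly one digit.
digit = re.findall(r"\d", "a1b22")

# Parentheses extract only the part matched inside them.
group = re.findall(r"id=(\d*)", "id=42 id=7")

print(dot)       # ['abc', 'a-c']
print(star)      # ['a', 'ab', 'abbb']
print(optional)  # ['color', 'colour']
print(escaped)   # ['3.14']
print(digit)     # ['1', '2', '2']
print(group)     # ['42', '7']
```

Writing the patterns as raw strings (r"...") keeps Python from interpreting the backslashes before the re library sees them.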
IV. Examples of common regular expressions
1. .*? (match anything, as little as possible)
For example: '<title>(.*?)</title>' extracts the title of a web page.
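To see why the "?" after ".*" matters, the sketch below compares the lazy pattern with its greedy counterpart. The HTML string is a made-up sample with two title tags, chosen only to make the difference visible.

```python
import re

# Hypothetical sample page containing two <title> elements.
html = ("<html><head><title>Example page</title></head>"
        "<body><title>Second</title></body></html>")

# Lazy: ".*?" stops at the first closing tag it can reach.
lazy = re.findall(r"<title>(.*?)</title>", html)

# Greedy: ".*" runs on to the last closing tag in the string.
greedy = re.findall(r"<title>(.*)</title>", html)

print(lazy)    # ['Example page', 'Second']
print(greedy)  # ['Example page</title></head><body><title>Second']
```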
2. \w matches a word character [A-Za-z0-9_]; "+" matches the preceding character 1 or more times
For example: a person's mailbox looks like [email protected]. How do we extract it from a large amount of text?
pattern: \w+@\w+\.com
Think about it: if the mailbox is [email protected], how do we match it?
pattern: \w+@(\w+\.)?\w+\.com
Here "?" means that the parenthesized group before it is matched 0 times or 1 time. "()" marks the enclosed content as a group; groups are numbered in order from the start of the pattern string to its end. Because the group is matched 0 times or 1 time, the part in parentheses is optional, so this pattern can match both of the mailbox formats above.
Extension: the pattern \w+@(\w+\.)*\w+\.com is even more powerful, because "*" matches 0 or more times.
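The three mailbox patterns can be compared side by side. This is a sketch only: the addresses are hypothetical, and whole_matches is a helper introduced here, not part of the re library. It uses finditer and group() because findall returns only the group contents when a pattern contains parentheses.

```python
import re

# Hypothetical addresses with zero, one and two subdomain parts.
text = "alice@example.com, bob@mail.example.com, carol@a.b.example.com"

simple = r"\w+@\w+\.com"
optional = r"\w+@(\w+\.)?\w+\.com"
repeated = r"\w+@(\w+\.)*\w+\.com"

def whole_matches(pattern, s):
    """Return the full text of every match (helper defined for this sketch)."""
    return [m.group() for m in re.finditer(pattern, s)]

simple_hits = whole_matches(simple, text)
optional_hits = whole_matches(optional, text)
repeated_hits = whole_matches(repeated, text)

# With a group in the pattern, findall returns the group, not the whole match.
group_only = re.findall(optional, text)

print(simple_hits)    # ['alice@example.com']
print(optional_hits)  # ['alice@example.com', 'bob@mail.example.com']
print(repeated_hits)  # all three addresses
print(group_only)     # ['', 'mail.']
```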
V. Core functions of the re library
1. compile() function (optional)
• Function definition: compile(pattern, flags=0)
• Function description: compiles the regular expression pattern and returns a regular expression object.
Why compile the pattern? "Core Python Programming" explains it this way:
Using a precompiled code object is faster than using a string directly, because the interpreter must compile a string into a code object before it can execute code given in string form.
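A minimal sketch of compiling once and reusing the resulting object; the sample lines are invented for illustration. Since the re module also keeps a cache of recently used patterns, the speed gain in small scripts is likely to be modest, and the main benefit is often readability.

```python
import re

# Compile the pattern once, then reuse the object for every line.
digits = re.compile(r"\d+")

lines = ["order 66", "no number", "room 101"]  # made-up sample data
found = []
for line in lines:
    m = digits.search(line)   # method of the compiled object
    if m:
        found.append(m.group())

print(found)  # ['66', '101']
```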
2. match() function
• Function definition: match(pattern, string, flags=0)
• Function description: matches the pattern only at the beginning of the string. On success it returns a match object (there is only one result); otherwise it returns None.
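The original example is not reproduced here, so the following is a sketch of the kind of call involved. The sample strings and the pattern \d+ are assumptions chosen for illustration.

```python
import re

# The digits are at the beginning, so match() succeeds.
result1 = re.match(r"\d+", "12345abcde")
print(result1)   # prints a match object with its span and matched text

# The digits are not at the beginning, so match() returns None.
result2 = re.match(r"\d+", "abcde12345")
print(result2)   # None
```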
This raises a question: why does result1 contain so much? It looks as if only the last part is the text we wanted to match. How do we extract it?
Don't worry. What we have at this point is a match object, and its content has to be extracted with specific methods. The section "Methods of match objects" below deals with this, so keep reading.
3. search() function
* Function definition: search(pattern, string, flags=0)
* Function description: works like match(), except that search() does not start only from the beginning; it finds the first match anywhere in the string. If no position in the string matches, it returns None; otherwise it returns a match object.
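The difference from match() shows up as soon as the target is not at the start of the string. A small sketch with a made-up sample string:

```python
import re

text = "abcde12345"  # sample string; the digits are not at the beginning

at_start = re.match(r"\d+", text)    # None: match() only looks at position 0
anywhere = re.search(r"\d+", text)   # finds the first run of digits

print(at_start)           # None
print(anywhere.group())   # 12345
print(anywhere.span())    # (5, 10)
```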
4. findall() function
* Function definition: findall(pattern, string [,flags])
* Function description: finds every occurrence of the regular expression pattern in the string and returns a list of matches.
Comparing the usage of match, search and findall: findall differs from match and search in that it returns a list of all non-overlapping matches. If no match is found, it returns an empty list.
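The three functions can be run on the same input to compare what each returns. This is a sketch with invented sample strings.

```python
import re

text = "a1b22c333"

m = re.match(r"\d+", text)          # None: the string starts with a letter
s = re.search(r"\d+", text)         # match object for the first hit only
all_hits = re.findall(r"\d+", text)  # every hit, as a list of strings

# Matches never overlap, but repeated values are all kept.
pairs = re.findall(r"aa", "aaaa")
repeats = re.findall(r"\d", "1212")

# No match at all gives an empty list, not None.
nothing = re.findall(r"\d+", "no digits here")

print(m)          # None
print(s.group())  # 1
print(all_hits)   # ['1', '22', '333']
print(pairs)      # ['aa', 'aa']
print(repeats)    # ['1', '2', '1', '2']
print(nothing)    # []
```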
VI. Methods of match objects (extraction)
The return values of the re module functions above fall into two types:
* A match object: something like <_sre.SRE_Match object; span=(0, 5), match='12345'> above (recent Python versions print it as <re.Match object; ...>). The functions that return match objects are match, search and finditer (which yields one match object per match).
* A list of matches: findall is the function that returns a list.
Therefore, match object methods apply only to the results of match, search and finditer, not to those of findall.
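A short sketch of the distinction, using a made-up sample string: finditer yields match objects that carry methods, while findall yields plain strings.

```python
import re

text = "a1b22c333"

# finditer yields match objects, so group() and span() are available.
details = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

# findall yields plain strings, which have no group() method.
plain = re.findall(r"\d+", text)
has_group_method = hasattr(plain[0], "group")

print(details)           # [('1', (1, 2)), ('22', (3, 5)), ('333', (6, 9))]
print(plain)             # ['1', '22', '333']
print(has_group_method)  # False
```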
Match objects have two commonly used methods, group and groups, plus several position-related ones such as start, end and span, which report where the match is located in the string.
1. group method
* Method definition: group(num=0)
* Method description: returns the entire match, or the subgroup with the given number.
Look at the following example :
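The original example is not reproduced here. As a sketch, assume the sample string "I 12345+abcde" and a hypothetical pattern with two groups, one for the digits and one for the letters:

```python
import re

text = "I 12345+abcde"
# Hypothetical pattern: a word, a space, digits, a literal "+", word characters.
m = re.match(r"\w+ (\d+)\+(\w+)", text)

whole = m.group()     # same as m.group(0): the entire match
first = m.group(1)    # content of the first pair of parentheses
second = m.group(2)   # content of the second pair of parentheses

print(whole)    # I 12345+abcde
print(first)    # 12345
print(second)   # abcde

# Position methods accept a group number as well.
print(m.start(2), m.end(2), m.span(2))   # 8 13 (8, 13)
```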
This is where the grouping concept mentioned earlier comes in.
The point of grouping is that we often want not just the whole matched string but also a specific substring within it. In the example above, the whole string is “I 12345+abcde”, but what I want is “abcde”, so we enclose that part in (). In this way you can group a pattern however you like and extract exactly what you want.
2. groups method
* Method definition: groups(default=None)
* Method description: returns a tuple containing all the subgroups of the match; if the pattern contains no groups, an empty tuple is returned.
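A brief sketch of groups() in a few situations, with sample strings and patterns chosen for illustration. Note that a failed match gives None rather than a match object, so there is nothing to call groups() on in that case.

```python
import re

# Two groups, both matched.
both = re.match(r"(\d+)\+(\w+)", "12345+abcde").groups()

# A pattern without parentheses gives an empty tuple.
no_groups = re.match(r"\d+", "12345").groups()

# A group that did not take part in the match is filled with the default.
m = re.match(r"(\d+)(\+)?", "12345")
with_none = m.groups()
with_default = m.groups(default="")

# A failed match returns None, not a match object.
failed = re.match(r"(\d+)", "abc")

print(both)          # ('12345', 'abcde')
print(no_groups)     # ()
print(with_none)     # ('12345', None)
print(with_default)  # ('12345', '')
print(failed)        # None
```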
VII. Attributes (flags) of the re module
The commonly used flags of the re module are as follows:
* re.I: matching is case-insensitive; (commonly used)
* re.L: \w, \W, \b, \B, \s and \S are matched according to the current locale (in Python 3 this flag can be used only with bytes patterns);
* re.M: ^ and $ match the beginning and end of each line in the target string, rather than strictly the beginning and end of the whole string;
* re.S: “.” (dot) normally matches any single character except \n (newline); with this flag “.” matches every character, newline included; (commonly used)
* re.X: whitespace in the pattern is ignored, and so are # and all text after it on the same line, unless escaped with a backslash or placed inside a character class; this allows comments and improves readability;
Note:
* If the pattern is compiled with compile, the flags must be passed to the compile function; passing them to a matching function such as re.findall together with the compiled pattern raises an error;
If compile is not used, the flags can be passed directly to a matching function such as findall.
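The effect of each flag, and the rule about where to pass it, can be checked with a small sketch. The sample text is invented for illustration.

```python
import re

text = "Hello\nWORLD"

ignore_case = re.findall(r"hello", text, re.I)

# Without re.M, ^ and $ refer to the whole string; with it, to each line.
single_line = re.findall(r"^\w+$", text)
multi_line = re.findall(r"^\w+$", text, re.M)

# Without re.S the dot stops at the newline; with it the dot crosses it.
dot_default = re.findall(r"Hello.WORLD", text)
dot_all = re.findall(r"Hello.WORLD", text, re.S)

# re.X allows whitespace and comments inside the pattern.
verbose = re.compile(r"""
    [A-Z]      # one capital letter
    [a-z]+     # followed by lowercase letters
""", re.X)
verbose_hit = verbose.findall(text)

# With compile(), the flags go into compile(); combine them with "|".
compiled = re.compile(r"hello.world", re.I | re.S)
compiled_hit = compiled.findall(text)

# Passing flags to re.findall along with a compiled pattern raises ValueError.
try:
    re.findall(compiled, text, re.I)
    error_raised = False
except ValueError:
    error_raised = True

print(ignore_case, single_line, multi_line)
print(dot_default, dot_all, verbose_hit, compiled_hit, error_raised)
```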
Appendix:
List of regular expression syntax