Automate the Boring Stuff with Python


by Al Sweigart

  >>> mo.group()
  '415-555-9999'

  On the other hand, findall() will not return a Match object but a list of strings—as long as there are no groups in the regular expression. Each string in the list is a piece of the searched text that matched the regular expression. Enter the following into the interactive shell:

  >>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
  >>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
  ['415-555-9999', '212-555-0000']

  If there are groups in the regular expression, then findall() will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex. To see findall() in action, enter the following into the interactive shell (notice that the regular expression being compiled now has groups in parentheses):

  >>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
  >>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
  [('415', '555', '9999'), ('212', '555', '0000')]

  To summarize what the findall() method returns, remember the following:

  When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].

  When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].
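  The two cases can be checked side by side in a short script. This is a minimal sketch; the variable names are illustrative and not from the book, and the third pattern adds the single-group case, where findall() returns a plain list of that group's strings rather than one-item tuples.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# No groups: findall() returns a list of whole-match strings.
no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
whole_matches = no_groups.findall(text)

# Several groups: findall() returns a list of tuples, one string per group.
with_groups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
group_tuples = with_groups.findall(text)

# Exactly one group: findall() returns a list of that group's strings.
one_group = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
area_codes = one_group.findall(text)

print(whole_matches)  # ['415-555-9999', '212-555-0000']
print(group_tuples)   # [('415', '555', '9999'), ('212', '555', '0000')]
print(area_codes)     # ['415', '212']
```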

  Character Classes

  In the earlier phone number regex example, you learned that \d could stand for any numeric digit. That is, \d is shorthand for the regular expression (0|1|2|3|4|5|6|7|8|9). There are many such shorthand character classes, as shown in Table 7-1.

  Table 7-1. Shorthand Codes for Common Character Classes

  Shorthand character class   Represents
  \d                          Any numeric digit from 0 to 9.
  \D                          Any character that is not a numeric digit from 0 to 9.
  \w                          Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)
  \W                          Any character that is not a letter, numeric digit, or the underscore character.
  \s                          Any space, tab, or newline character. (Think of this as matching “space” characters.)
  \S                          Any character that is not a space, tab, or newline.

  Character classes are nice for shortening regular expressions. The character class [0-5] will match only the numbers 0 to 5; this is much shorter than typing (0|1|2|3|4|5).

  For example, enter the following into the interactive shell:

  >>> xmasRegex = re.compile(r'\d+\s\w+')
  >>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
  ['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']

  The regular expression \d+\s\w+ will match text that has one or more numeric digits (\d+), followed by a whitespace character (\s), followed by one or more letter/digit/underscore characters (\w+). The findall() method returns all matching strings of the regex pattern in a list.
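  The uppercase shorthand classes behave as the complements of the lowercase ones. The sketch below uses a made-up sample string and illustrative variable names to show a lowercase class next to its uppercase counterpart.

```python
import re

text = 'Order 66: 3 items, 2 boxes'  # hypothetical sample text

digits = re.findall(r'\d+', text)      # runs of digit characters
non_digits = re.findall(r'\D+', text)  # runs of everything that is not a digit
words = re.findall(r'\w+', text)       # runs of letters, digits, or underscores

print(digits)      # ['66', '3', '2']
print(non_digits)  # ['Order ', ': ', ' items, ', ' boxes']
print(words)       # ['Order', '66', '3', 'items', '2', 'boxes']
```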

  Making Your Own Character Classes

  There are times when you want to match a set of characters but the shorthand character classes (\d, \w, \s, and so on) are too broad. You can define your own character class using square brackets. For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase. Enter the following into the interactive shell:

  >>> vowelRegex = re.compile(r'[aeiouAEIOU]')
  >>> vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')
  ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

  You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers.

  Note that inside the square brackets, the normal regular expression symbols are not interpreted as such. This means you do not need to escape the ., *, ?, or () characters with a preceding backslash. For example, the character class [0-5.] will match digits 0 to 5 and a period. You do not need to write it as [0-5\.].
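  A quick way to confirm this is to compare the escaped and unescaped forms on the same text. This is a small sketch with an invented sample string; the variable names are illustrative.

```python
import re

# Inside square brackets the period is a literal period, not a wildcard.
unescaped = re.compile(r'[0-5.]')
escaped = re.compile(r'[0-5\.]')  # the backslash is allowed but unnecessary

sample = 'v3.14x9'  # hypothetical sample text
unescaped_hits = unescaped.findall(sample)
escaped_hits = escaped.findall(sample)

print(unescaped_hits)  # ['3', '.', '1', '4']
print(escaped_hits)    # ['3', '.', '1', '4']
```

  Both classes match the same characters: the letters v and x and the digit 9 fall outside the class, so they are skipped.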

  By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class. For example, enter the following into the interactive shell:

  >>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
  >>> consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')
  ['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

  Now, instead of matching every vowel, we’re matching every character that isn’t a vowel.

  The Caret and Dollar Sign Characters

  You can also use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ($) at the end of the regex to indicate the string must end with this regex pattern. And you can use the ^ and $ together to indicate that the entire string must match the regex—that is, it’s not enough for a match to be made on some subset of the string.

  For example, the r'^Hello' regular expression string matches strings that begin with 'Hello'. Enter the following into the interactive shell:

  >>> beginsWithHello = re.compile(r'^Hello')
  >>> beginsWithHello.search('Hello world!')
  <_sre.SRE_Match object; span=(0, 5), match='Hello'>
  >>> beginsWithHello.search('He said hello.') == None
  True

  The r'\d$' regular expression string matches strings that end with a numeric character from 0 to 9. Enter the following into the interactive shell:

  >>> endsWithNumber = re.compile(r'\d$')
  >>> endsWithNumber.search('Your number is 42')
  <_sre.SRE_Match object; span=(16, 17), match='2'>
  >>> endsWithNumber.search('Your number is forty two.') == None
  True

  The r'^\d+$' regular expression string matches strings that both begin and end with one or more numeric characters. Enter the following into the interactive shell:

  >>> wholeStringIsNum = re.compile(r'^\d+$')
  >>> wholeStringIsNum.search('1234567890')
  <_sre.SRE_Match object; span=(0, 10), match='1234567890'>
  >>> wholeStringIsNum.search('12345xyz67890') == None
  True
  >>> wholeStringIsNum.search('12 34567890') == None
  True

  The last two search() calls in the previous interactive shell example demonstrate how the entire string must match the regex if ^ and $ are used.

  I always confuse the meanings of these two symbols, so I use the mnemonic “Carrots cost dollars” to remind myself that the caret comes first and the dollar sign comes last.
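  One common use of ^ and $ together is input validation. The helper below is a hypothetical sketch (the function name is not from the book) that wraps the whole-string pattern in a function returning True or False.

```python
import re

whole_string_is_num = re.compile(r'^\d+$')

def is_all_digits(text):
    """Return True if the pattern ^\\d+$ matches text.

    Note: $ also matches just before a single trailing newline,
    so '123\\n' counts as a match here.
    """
    return whole_string_is_num.search(text) is not None

print(is_all_digits('1234567890'))     # True
print(is_all_digits('12345xyz67890'))  # False
print(is_all_digits('12 34567890'))    # False
```

  Comparing against None with "is not None" is the usual Python idiom; the compiled pattern's fullmatch() method is a related option when the entire string must match.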

  The Wildcard Character

  The . (or dot) character in a regular expression is called a wildcard and will match any character except for a newline. For example, enter the following into the interactive shell:

  >>> atRegex = re.compile(r'.at')
  >>> atRegex.findall('The cat in the hat sat on the flat mat.')
  ['cat', 'hat', 'sat', 'lat', 'mat']

  Remember that the dot character will match just one character, which is why the match for the text flat in the previous example matched only lat. To match an actual dot, escape the dot with a backslash: \.
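  The difference between the wildcard dot and an escaped dot shows up when the text contains something that only looks like a file extension. This is a minimal sketch with invented sample text and illustrative names.

```python
import re

text = 'report.txt notes_txt'  # hypothetical sample text

wildcard = re.compile(r'.txt')      # any one character, then 'txt'
literal_dot = re.compile(r'\.txt')  # an actual period, then 'txt'

wildcard_hits = wildcard.findall(text)
literal_hits = literal_dot.findall(text)

print(wildcard_hits)  # ['.txt', '_txt']
print(literal_hits)   # ['.txt']
```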

  Matching Everything with Dot-Star

  Sometimes you will want to match everything and anything. For example, say you want to match the string 'First Name:', followed by any and all text, followed by 'Last Name:', and then followed by anything again. You can use the dot-star (.*) to stand in for that “anything.” Remember that the dot character means “any single character except the newline,” and the star character means “zero or more of the preceding character.”

  Enter the following into the interactive shell:

  >>> nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
  >>> mo = nameRegex.search('First Name: Al Last Name: Sweigart')
  >>> mo.group(1)
  'Al'
  >>> mo.group(2)
  'Sweigart'

  The dot-star uses greedy mode: It will always try to match as much text as possible. To match any and all text in a nongreedy fashion, use the dot, star, and question mark (.*?). Like with curly brackets, the question mark tells Python to match in a nongreedy way.

  Enter the following into the interactive shell to see the difference between the greedy and nongreedy versions:

  >>> nongreedyRegex = re.compile(r'<.*?>')
  >>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
  >>> mo.group()
  '<To serve man>'
  >>> greedyRegex = re.compile(r'<.*>')
  >>> mo = greedyRegex.search('<To serve man> for dinner.>')
  >>> mo.group()
  '<To serve man> for dinner.>'

  Both regexes roughly translate to “Match an opening angle bracket, followed by anything, followed by a closing angle bracket.” But the string '<To serve man> for dinner.>' has two possible matches for the closing angle bracket. In the nongreedy version of the regex, Python matches the shortest possible string: '<To serve man>'. In the greedy version, Python matches the longest possible string: '<To serve man> for dinner.>'.
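  The same contrast appears with tag-like text, where greedy matching tends to swallow everything between the first opening bracket and the last closing bracket. The sample string and names below are invented for illustration.

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # hypothetical sample text

greedy = re.compile(r'<.*>')
nongreedy = re.compile(r'<.*?>')

greedy_match = greedy.search(html).group()
nongreedy_matches = nongreedy.findall(html)

print(greedy_match)       # <b>bold</b> and <i>italic</i>
print(nongreedy_matches)  # ['<b>', '</b>', '<i>', '</i>']
```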

  Matching Newlines with the Dot Character

  The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character.

  Enter the following into the interactive shell:

  >>> noNewlineRegex = re.compile('.*')
  >>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
  'Serve the public trust.'
  >>> newlineRegex = re.compile('.*', re.DOTALL)
  >>> newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
  'Serve the public trust.\nProtect the innocent.\nUphold the law.'

  The regex noNewlineRegex, which did not have re.DOTALL passed to the re.compile() call that created it, will match everything only up to the first newline character, whereas newlineRegex, which did have re.DOTALL passed to re.compile(), matches everything. This is why the newlineRegex.search() call matches the full string, including its newline characters.

  Review of Regex Symbols

  This chapter covered a lot of notation, so here’s a quick review of what you learned:

  The ? matches zero or one of the preceding group.

  The * matches zero or more of the preceding group.

  The + matches one or more of the preceding group.

  The {n} matches exactly n of the preceding group.

  The {n,} matches n or more of the preceding group.

  The {,m} matches 0 to m of the preceding group.

  The {n,m} matches at least n and at most m of the preceding group.

  {n,m}? or *? or +? performs a nongreedy match of the preceding group.

  ^spam means the string must begin with spam.

  spam$ means the string must end with spam.

  The . matches any character, except newline characters.

  \d, \w, and \s match a digit, word, or space character, respectively.

  \D, \W, and \S match anything except a digit, word, or space character, respectively.

  [abc] matches any character between the brackets (such as a, b, or c).

  [^abc] matches any character that isn’t between the brackets.
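  The quantifier entries in this review can be exercised in one pass. The table of patterns, sample strings, and expected results below is a made-up sketch, not taken from the book; each row is checked with re.findall().

```python
import re

# Each row: (pattern, sample text, expected result of re.findall)
examples = [
    (r'colou?r',   'color colour', ['color', 'colour']),  # ? : zero or one
    (r'ha*',       'h ha haaa',    ['h', 'ha', 'haaa']),  # * : zero or more
    (r'ha+',       'h ha haaa',    ['ha', 'haaa']),       # + : one or more
    (r'\d{3}',     '12 123 1234',  ['123', '123']),       # {n} : exactly n
    (r'\d{2,}',    '1 12 1234',    ['12', '1234']),       # {n,} : n or more
    (r'\d{2,3}',   '1 12 1234',    ['12', '123']),        # {n,m} : n to m, greedy
    (r'\d{2,3}?',  '1 12 1234',    ['12', '12', '34']),   # {n,m}? : nongreedy
]

for pattern, text, expected in examples:
    found = re.findall(pattern, text)
    assert found == expected, (pattern, found)
    print(pattern, '->', found)
```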

  Case-Insensitive Matching

  Normally, regular expressions match text with the exact casing you specify. For example, the following regexes match completely different strings:

  >>> regex1 = re.compile('RoboCop')
  >>> regex2 = re.compile('ROBOCOP')
  >>> regex3 = re.compile('robOcop')
  >>> regex4 = re.compile('RobocOp')

  But sometimes you care only about matching the letters without worrying whether they’re uppercase or lowercase. To make your regex case-insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile(). Enter the following into the interactive shell:

  >>> robocop = re.compile(r'robocop', re.I)
  >>> robocop.search('RoboCop is part man, part machine, all cop.').group()
  'RoboCop'
  >>> robocop.search('ROBOCOP protects the innocent.').group()
  'ROBOCOP'
  >>> robocop.search('Al, why does your programming book talk about robocop so much?').group()
  'robocop'

  Substituting Strings with the sub() Method

  Regular expressions can not only find text patterns but can also substitute new text in place of those patterns. The sub() method for Regex objects is passed two arguments. The first argument is the string that replaces each match. The second is the string to search, in which every match of the regular expression is replaced. The sub() method returns a string with the substitutions applied.

  For example, enter the following into the interactive shell:

  >>> namesRegex = re.compile(r'Agent \w+')
  >>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
  'CENSORED gave the secret documents to CENSORED.'

  Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean “Enter the text of group 1, 2, 3, and so on, in the substitution.”

  For example, say you want to censor the names of the secret agents by showing just the first letters of their names. To do this, you could use the regex Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The \1 in that string will be replaced by whatever text was matched by group 1, that is, the (\w) group of the regular expression.

  >>> agentNamesRegex = re.compile(r'Agent (\w)\w*')
  >>> agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
  'A**** told C**** that E**** knew B**** was a double agent.'
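  Group references can also reorder matched text. As a sketch with invented sample dates and illustrative names, the following rewrites year-month-day dates as month/day/year by referring to all three groups in the replacement string.

```python
import re

date_regex = re.compile(r'(\d{4})-(\d{2})-(\d{2})')  # year, month, day

text = 'Released 2015-04-14, revised 2019-11-12.'  # hypothetical sample text

# \2 is the month, \3 is the day, \1 is the year.
reordered = date_regex.sub(r'\2/\3/\1', text)

print(reordered)  # Released 04/14/2015, revised 11/12/2019.
```

  The replacement is written as a raw string so that \1, \2, and \3 reach sub() as group references instead of being read as Python escape sequences.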

  Managing Complex Regexes

  Regular expressions are fine if the text pattern you need to match is simple. But matching complicated text patterns might require long, convoluted regular expressions. You can mitigate this by telling the re.compile() function to ignore whitespace and comments inside the regular expression string. This “verbose mode” can be enabled by passing the variable re.VERBOSE as the second argument to re.compile().

  Now instead of a hard-to-read regular expression like this:

  phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)')

  you can spread the regular expression over multiple lines with comments like this:

  phoneRegex = re.compile(r'''(
      (\d{3}|\(\d{3}\))?            # area code
      (\s|-|\.)?                    # separator
      \d{3}                         # first 3 digits
      (\s|-|\.)                     # separator
      \d{4}                         # last 4 digits
      (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
      )''', re.VERBOSE)

  Note how the previous example uses the triple-quote syntax (''') to create a multiline string so that you can spread the regular expression definition over many lines, making it much more legible.

  The comment rules inside the regular expression string are the same as regular Python code: The # symbol and everything after it to the end of the line are ignored. Also, the extra spaces inside the multiline string for the regular expression are not considered part of the text pattern to be matched. This lets you organize the regular expression so it’s easier to read.
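  To see the verbose pattern at work, it can be run against a line of text. This is a sketch under stated assumptions: the sample sentence and the variable names are invented, and because the pattern contains groups, findall() returns tuples, so the code keeps only the first item of each tuple (the outermost group, which holds the whole number).

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

text = 'Call 415-555-9999 or (212) 555-0000 ext. 123.'  # hypothetical sample

# Group 1 wraps the entire pattern, so item 0 of each tuple is the full match.
numbers = [groups[0] for groups in phone_regex.findall(text)]

print(numbers)  # ['415-555-9999', '(212) 555-0000 ext. 123']
```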

  Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

  What if you want to use re.VERBOSE to write comments in your regular expression but also want to use re.IGNORECASE to ignore capitalization? Unfortunately, the re.compile() function takes only a single value as its second argument. You can get around this limitation by combining the re.IGNORECASE, re.DOTALL, and re.VERBOSE variables using the pipe character (|), which in this context is known as the bitwise or operator.

  So if you want a regular expression that’s case-insensitive and lets the dot character match newlines, you would form your re.compile() call like this:

  >>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)

  All three options for the second argument will look like this:

  >>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)

  This syntax is a little old-fashioned and originates from early versions of Python. The details of the bitwise operators are beyond the scope of this book, but check out the resources at http://nostarch.com/automatestuff/ for more information. You can also pass other options for the second argument; they’re uncommon, but you can read more about them in the resources, too.
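  A combined-flags pattern can be tried on a multiline string to watch all three options take effect at once. The pattern, sample text, and names here are invented for illustration.

```python
import re

combined = re.compile(r'''
    robocop   # the name, in any casing (re.IGNORECASE)
    .*        # anything, newlines included (re.DOTALL)
    law       # greedy, so this is the last "law" in the text
    ''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

text = 'RoboCop must:\nServe the public trust.\nUphold the LAW.'  # hypothetical

combined_match = combined.search(text)
print(repr(combined_match.group()))
# 'RoboCop must:\nServe the public trust.\nUphold the LAW'

# Without re.DOTALL the dot stops at the first newline, so nothing matches.
single_line = re.compile(r'robocop.*law', re.IGNORECASE)
print(single_line.search(text))  # None
```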

  Project: Phone Number and Email Address Extractor

  Say you have the boring task of finding every phone number and email address in a long web page or document. If you manually scroll through the page, you might end up searching for a long time. But if you had a program that could search the text in your clipboard for phone numbers and email addresses, you could simply press CTRL-A to select all the text, press CTRL-C to copy it to the clipboard, and then run your program. It could replace the text on the clipboard with just the phone numbers and email addresses it finds.

  Whenever you’re tackling a new project, it can be tempting to dive right into writing code. But more often than not, it’s best to take a step back and consider the bigger picture. I recommend first drawing up a high-level plan for what your program needs to do. Don’t think about the actual code yet—you can worry about that later. Right now, stick to broad strokes.

  For example, your phone and email address extractor will need to do the following:

  Get the text off the clipboard.

  Find all phone numbers and email addresses in the text.

 
