Automate the Boring Stuff with Python


by Al Sweigart


  Calling isPhoneNumber() with the argument '415-555-4242' will return True. Calling isPhoneNumber() with 'Moshi moshi' will return False; the first test fails because 'Moshi moshi' is not 12 characters long.

  You would have to add even more code to find this pattern of text in a larger string. Replace the last four print() function calls in isPhoneNumber.py with the following:

  message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
  for i in range(len(message)):
      chunk = message[i:i+12]  # ➊
      if isPhoneNumber(chunk):  # ➋
          print('Phone number found: ' + chunk)
  print('Done')

  When this program is run, the output will look like this:

  Phone number found: 415-555-1011
  Phone number found: 415-555-9999
  Done

  On each iteration of the for loop, a new chunk of 12 characters from message is assigned to the variable chunk ➊. For example, on the first iteration, i is 0, and chunk is assigned message[0:12] (that is, the string 'Call me at 4'). On the next iteration, i is 1, and chunk is assigned message[1:13] (the string 'all me at 41').

  You pass chunk to isPhoneNumber() to see whether it matches the phone number pattern ➋, and if so, you print the chunk.

  Continue to loop through message, and eventually the 12 characters in chunk will be a phone number. The loop goes through the entire string, testing each 12-character piece and printing any chunk it finds that satisfies isPhoneNumber(). Once we’re done going through message, we print Done.
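
  The sliding-window scan described above can be collected into a single function. This is a minimal sketch: isPhoneNumber() is defined earlier in the book and is not shown in this section, so the version below is an assumed reconstruction that checks for three digits, a hyphen, three digits, a hyphen, and four digits. The name find_phone_numbers() is a hypothetical helper, not part of the original program.

```python
def isPhoneNumber(text):
    # Assumed reconstruction: True only for text shaped like 415-555-4242.
    if len(text) != 12:
        return False
    for i, char in enumerate(text):
        if i in (3, 7):
            # Positions 3 and 7 must hold the two hyphens.
            if char != '-':
                return False
        elif not char.isdecimal():
            return False
    return True


def find_phone_numbers(message):
    # Hypothetical helper: test every 12-character chunk of message.
    found = []
    for i in range(len(message)):
        chunk = message[i:i+12]
        if isPhoneNumber(chunk):
            found.append(chunk)
    return found


message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for number in find_phone_numbers(message):
    print('Phone number found: ' + number)
print('Done')
```

  Chunks taken near the end of message are shorter than 12 characters, so the length check rejects them without any special handling.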

  While the string in message is short in this example, it could be millions of characters long and the program would still run in less than a second. A similar program that finds phone numbers using regular expressions would also run in less than a second, but regular expressions make it quicker to write these programs.

  Finding Patterns of Text with Regular Expressions

  The previous phone number–finding program works, but it uses a lot of code to do something limited: The isPhoneNumber() function is 17 lines but can find only one pattern of phone numbers. What about a phone number formatted like 415.555.4242 or (415) 555-4242? What if the phone number had an extension, like 415-555-4242 x99? The isPhoneNumber() function would fail to validate them. You could add yet more code for these additional patterns, but there is an easier way.

  Regular expressions, called regexes for short, are descriptions for a pattern of text. For example, a \d in a regex stands for a digit character, that is, any single numeral 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three digits, a hyphen, three more digits, another hyphen, and four digits. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d regex.

  But regular expressions can be much more sophisticated. For example, adding a 3 in curly brackets ({3}) after a pattern is like saying, “Match this pattern three times.” So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format.
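
  As a quick check that the long and short spellings describe the same pattern, the sketch below compiles both and tries them on a few strings. It relies on the re module and the search() method, both of which are introduced in the sections that follow; the sample strings and variable names are illustrative assumptions.

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

samples = ['415-555-4242', '415-555-424', 'Moshi moshi', '4155554242']
results = []
for text in samples:
    # search() returns None when the pattern is not found anywhere in text.
    long_ok = long_form.search(text) is not None
    short_ok = short_form.search(text) is not None
    results.append((long_ok, short_ok))
    print(repr(text), long_ok, short_ok)
```

  Both regexes agree on every sample: only the first string contains the full phone number pattern.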

  Creating Regex Objects

  All the regex functions in Python are in the re module. Enter the following into the interactive shell to import this module:

  >>> import re

  Note

  Most of the examples that follow in this chapter will require the re module, so remember to import it at the beginning of any script you write or any time you restart IDLE. Otherwise, you’ll get a NameError: name 're' is not defined error message.

  Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).

  To create a Regex object that matches the phone number pattern, enter the following into the interactive shell. (Remember that \d means “a digit character” and \d\d\d-\d\d\d-\d\d\d\d is the regular expression for the correct phone number pattern.)

  >>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

  Now the phoneNumRegex variable contains a Regex object.

  Passing Raw Strings to re.compile()

  Remember that escape characters in Python use the backslash (\). The string value '\n' represents a single newline character, not a backslash followed by a lowercase n. You need to enter the escape character \\ to print a single backslash. So '\\n' is the string that represents a backslash followed by a lowercase n. However, by putting an r before the first quote of the string value, you can mark the string as a raw string, which does not escape characters.

  Since regular expressions frequently use backslashes in them, it is convenient to pass raw strings to the re.compile() function instead of typing extra backslashes. Typing r'\d\d\d-\d\d\d-\d\d\d\d' is much easier than typing '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'.
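
  The difference between ordinary and raw strings is easy to confirm by measuring string lengths. The following is a small sketch; the variable names raw_pattern and escaped_pattern are assumptions made for illustration.

```python
import re

print(len('\n'))     # 1: a single newline character
print(len('\\n'))    # 2: a backslash followed by n
print(len(r'\n'))    # 2: the raw string keeps the backslash as typed

# Both spellings produce the same pattern text, so either can be compiled.
raw_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
escaped_pattern = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
print(raw_pattern == escaped_pattern)    # True

phoneNumRegex = re.compile(raw_pattern)
print(phoneNumRegex.search('My number is 415-555-4242.').group())
```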

  Matching Regex Objects

  A Regex object’s search() method searches the string it is passed for any matches to the regex. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method returns a Match object. Match objects have a group() method that will return the actual matched text from the searched string. (I’ll explain groups shortly.) For example, enter the following into the interactive shell:

  >>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
  >>> mo = phoneNumRegex.search('My number is 415-555-4242.')
  >>> print('Phone number found: ' + mo.group())
  Phone number found: 415-555-4242

  The mo variable name is just a generic name to use for Match objects. This example might seem complicated at first, but it is much shorter than the earlier isPhoneNumber.py program and does the same thing.

  Here, we pass our desired pattern to re.compile() and store the resulting Regex object in phoneNumRegex. Then we call search() on phoneNumRegex and pass search() the string we want to search for a match. The result of the search gets stored in the variable mo. In this example, we know that our pattern will be found in the string, so we know that a Match object will be returned. Knowing that mo contains a Match object and not the null value None, we can call group() on mo to return the match. Writing mo.group() inside our print() call displays the whole match, 415-555-4242.

  Review of Regular Expression Matching

  While there are several steps to using regular expressions in Python, each step is fairly simple.

  1. Import the regex module with import re.

  2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)

  3. Pass the string you want to search into the Regex object’s search() method. This returns a Match object, or None if the pattern is not found.

  4. Call the Match object’s group() method to return a string of the actual matched text.
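
  The steps above can be sketched as one short script. The helper name first_phone_number() is hypothetical, and the None check is an addition for the case where search() finds nothing, since calling group() on None would raise an AttributeError.

```python
import re                                               # Step 1

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')   # Step 2


def first_phone_number(text):
    # Hypothetical helper: return the first match in text, or None.
    mo = phoneNumRegex.search(text)                     # Step 3
    if mo is None:
        # No Match object was returned, so there is nothing to call group() on.
        return None
    return mo.group()                                   # Step 4


print(first_phone_number('My number is 415-555-4242.'))
print(first_phone_number('No number here.'))
```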

  Note

  While I encourage you to enter the example code into the interactive shell, you should also make use of web-based regular expression testers, which can show you exactly how a regex matches a piece of text that you enter. I recommend the tester at http://regexpal.com/.

  More Pattern Matching with Regular Expressions

  Now that you know the basic steps for creating and finding regular expression objects with Python, you’re ready to try some of their more powerful pattern-matching capabilities.

  Grouping with Parentheses

  Say you want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.

  The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text. Enter the following into the interactive shell:

  >>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
  >>> mo = phoneNumRegex.search('My number is 415-555-4242.')
  >>> mo.group(1)
  '415'
  >>> mo.group(2)
  '555-4242'
  >>> mo.group(0)
  '415-555-4242'
  >>> mo.group()
  '415-555-4242'

  If you would like to retrieve all the groups at once, use the groups() method (note the plural form of the name).

  >>> mo.groups()
  ('415', '555-4242')
  >>> areaCode, mainNumber = mo.groups()
  >>> print(areaCode)
  415
  >>> print(mainNumber)
  555-4242

  Since mo.groups() returns a tuple of multiple values, you can use the multiple-assignment trick to assign each value to a separate variable, as in the previous areaCode, mainNumber = mo.groups() line.

  Parentheses have a special meaning in regular expressions, but what do you do if you need to match a parenthesis in your text? For instance, maybe the phone numbers you are trying to match have the area code set in parentheses. In this case, you need to escape the ( and ) characters with a backslash. Enter the following into the interactive shell:

  >>> phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
  >>> mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
  >>> mo.group(1)
  '(415)'
  >>> mo.group(2)
  '555-4242'

  The \( and \) escape characters in the raw string passed to re.compile() will match actual parenthesis characters.
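
  Escaped and grouping parentheses can also be nested the other way around. As a sketch of a variation that is not in the original example, placing the grouping parentheses inside the escaped ones captures only the digits of the area code:

```python
import re

# \( and \) match literal parentheses; the unescaped pair inside them forms group 1.
phoneNumRegex = re.compile(r'\((\d\d\d)\) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

print(mo.group(1))   # 415
print(mo.group(2))   # 555-4242
print(mo.group())    # (415) 555-4242
```

  The literal parentheses are still part of the overall match returned by mo.group(), but they are left out of group 1.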

  Matching Multiple Groups with the Pipe

  The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

  When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object. Enter the following into the interactive shell:

  >>> heroRegex = re.compile(r'Batman|Tina Fey')
  >>> mo1 = heroRegex.search('Batman and Tina Fey.')
  >>> mo1.group()
  'Batman'
  >>> mo2 = heroRegex.search('Tina Fey and Batman.')
  >>> mo2.group()
  'Tina Fey'

  Note

  You can find all matching occurrences with the findall() method that’s discussed in The findall() Method.

  You can also use the pipe to match one of several patterns as part of your regex. For example, say you wanted to match any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it would be nice if you could specify that prefix only once. This can be done with parentheses. Enter the following into the interactive shell:

  >>> batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
  >>> mo = batRegex.search('Batmobile lost a wheel')
  >>> mo.group()
  'Batmobile'
  >>> mo.group(1)
  'mobile'

  The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns just the part of the matched text inside the first parentheses group, 'mobile'. By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match.

  If you need to match an actual pipe character, escape it with a backslash, like \|.

  Optional Matching with the Question Mark

  Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. The ? character flags the group that precedes it as an optional part of the pattern. For example, enter the following into the interactive shell:

  >>> batRegex = re.compile(r'Bat(wo)?man')
  >>> mo1 = batRegex.search('The Adventures of Batman')
  >>> mo1.group()
  'Batman'
  >>> mo2 = batRegex.search('The Adventures of Batwoman')
  >>> mo2.group()
  'Batwoman'

  The (wo)? part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both 'Batwoman' and 'Batman'.

  Using the earlier phone number example, you can make the regex look for phone numbers that do or do not have an area code. Enter the following into the interactive shell:

  >>> phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
  >>> mo1 = phoneRegex.search('My number is 415-555-4242')
  >>> mo1.group()
  '415-555-4242'
  >>> mo2 = phoneRegex.search('My number is 555-4242')
  >>> mo2.group()
  '555-4242'

  You can think of the ? as saying, “Match zero or one of the group preceding this question mark.”
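
  One detail worth checking is what an optional group reports when it is absent. In the sketch below, group(1) returns None when the area code is missing, which gives a simple way to tell the two cases apart; the has_area_code variable is an illustrative assumption.

```python
import re

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

mo1 = phoneRegex.search('My number is 415-555-4242')
mo2 = phoneRegex.search('My number is 555-4242')

print(mo1.group(1))   # 415-
print(mo2.group(1))   # None

# A group that did not take part in the match is reported as None.
has_area_code = mo2.group(1) is not None
print(has_area_code)  # False
```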

  If you need to match an actual question mark character, escape it with \?.

  Matching Zero or More with the Star

  The * (called the star or asterisk) means “match zero or more”: the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again. Let’s look at the Batman example again.

  >>> batRegex = re.compile(r'Bat(wo)*man')
  >>> mo1 = batRegex.search('The Adventures of Batman')
  >>> mo1.group()
  'Batman'
  >>> mo2 = batRegex.search('The Adventures of Batwoman')
  >>> mo2.group()
  'Batwoman'
  >>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
  >>> mo3.group()
  'Batwowowowoman'

  For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for 'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)* matches four instances of wo.

  If you need to match an actual star character, prefix the star in the regular expression with a backslash, \*.

  Matching One or More with the Plus

  While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Enter the following into the interactive shell, and compare it with the star regexes in the previous section:

  >>> batRegex = re.compile(r'Bat(wo)+man')
  >>> mo1 = batRegex.search('The Adventures of Batwoman')
  >>> mo1.group()
  'Batwoman'
  >>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
  >>> mo2.group()
  'Batwowowowoman'
  >>> mo3 = batRegex.search('The Adventures of Batman')
  >>> mo3 == None
  True

  The regex Bat(wo)+man will not match the string 'The Adventures of Batman' because at least one wo is required by the plus sign.
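
  To see the question mark, star, and plus side by side, the following sketch runs all three Batman regexes against the same three names and records whether each one matches. The results dictionary and loop structure are assumptions made for illustration.

```python
import re

patterns = [r'Bat(wo)?man', r'Bat(wo)*man', r'Bat(wo)+man']
names = ['Batman', 'Batwoman', 'Batwowowowoman']

results = {}
for pattern in patterns:
    regex = re.compile(pattern)
    # True when search() returns a Match object, False when it returns None.
    results[pattern] = [regex.search(name) is not None for name in names]
    print(pattern, results[pattern])
```

  The ? regex rejects 'Batwowowowoman' (more than one wo), the * regex accepts all three names, and the + regex rejects 'Batman' (no wo at all).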

  If you need to match an actual plus sign character, prefix the plus sign with a backslash to escape it: \+.
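
  Escaping works the same way for each of the special characters covered so far. The sketch below matches literal plus, question mark, pipe, and star characters; the sample strings are made up for illustration. It also uses re.escape(), a standard library function not covered in the text above, which adds the backslashes automatically.

```python
import re

# Each special character is escaped with a backslash so it matches itself.
literalRegex = re.compile(r'\d\+\d=\d\?')
mo = literalRegex.search('Quiz: 2+2=4? Yes.')
print(mo.group())    # 2+2=4?

pipeRegex = re.compile(r'a\|b\*')
print(pipeRegex.search('x a|b* y').group())    # a|b*

# re.escape() returns the string with its special characters backslash-escaped.
starsRegex = re.compile(re.escape('***'))
print(starsRegex.search('rated *** by readers').group())    # ***
```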

  Matching Specific Repetitions with Curly Brackets

  If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

  Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

  You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly brackets can help make your regular expressions shorter. These two regular expressions match identical patterns:

  (Ha){3}
  (Ha)(Ha)(Ha)

  And these two regular expressions also match identical patterns:

  (Ha){3,5}
  ((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

  Enter the following into the interactive shell:

  >>> haRegex = re.compile(r'(Ha){3}')
  >>> mo1 = haRegex.search('HaHaHa')
  >>> mo1.group()
  'HaHaHa'
  >>> mo2 = haRegex.search('Ha')
  >>> mo2 == None
  True

  Here, (Ha){3} matches 'HaHaHa' but not 'Ha'. Since it doesn’t match 'Ha', search() returns None.
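
  The open-ended forms of the curly brackets can be tried the same way. This is a short sketch with made-up test strings; the variable names are assumptions.

```python
import re

laugh = 'HaHaHaHaHaHa'   # six repetitions of Ha

atLeastThree = re.compile(r'(Ha){3,}')
print(atLeastThree.search(laugh).group())    # all six: HaHaHaHaHaHa
print(atLeastThree.search('HaHa') is None)   # True: only two repetitions

upToFive = re.compile(r'(Ha){,5}')
print(upToFive.search(laugh).group())        # stops at five: HaHaHaHaHa
# Zero repetitions are allowed, so the regex matches an empty string here.
print(repr(upToFive.search('Hello').group()))   # ''
```

  Because {,5} permits zero repetitions, (Ha){,5} finds a match in every string, even one that contains no Ha at all.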

  Greedy and Nongreedy Matching

  Since (Ha){3,5} can match three, four, or five instances of Ha in the string 'HaHaHaHaHa', you may wonder why calling group() on the resulting Match object returns 'HaHaHaHaHa' instead of the shorter possibilities. After all, 'HaHaHa' and 'HaHaHaHa' are also valid matches of the regular expression (Ha){3,5}.

  Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The nongreedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

  Enter the following into the interactive shell, and notice the difference between the greedy and nongreedy forms of the curly brackets searching the same string:

  >>> greedyHaRegex = re.compile(r'(Ha){3,5}')
  >>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
  >>> mo1.group()
  'HaHaHaHaHa'
  >>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
  >>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
  >>> mo2.group()
  'HaHaHa'

  Note that the question mark can have two meanings in regular expressions: declaring a nongreedy match or flagging an optional group. These meanings are entirely unrelated.

  The findall() Method

  In addition to the search() method, Regex objects also have a findall() method. While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string. To see how search() returns a Match object only on the first instance of matching text, enter the following into the interactive shell:

  >>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
  >>> mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
