Automate the Boring Stuff with Python

by Al Sweigart

Paste them onto the clipboard.

Now you can start thinking about how this might work in code. The code will need to do the following:

Use the pyperclip module to copy and paste strings.

Create two regexes, one for matching phone numbers and the other for matching email addresses.

Find all matches, not just the first match, of both regexes.

Neatly format the matched strings into a single string to paste.

Display some kind of message if no matches were found in the text.

This list is like a road map for the project. As you write the code, you can focus on each of these steps separately. Each step is fairly manageable and expressed in terms of things you already know how to do in Python.

Step 1: Create a Regex for Phone Numbers

First, you have to create a regular expression to search for phone numbers. Create a new file, enter the following, and save it as phoneAndEmail.py:

    #! python3
    # phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

    import pyperclip, re

    phoneRegex = re.compile(r'''(
        (\d{3}|\(\d{3}\))?                # area code
        (\s|-|\.)?                        # separator
        (\d{3})                           # first 3 digits
        (\s|-|\.)                         # separator
        (\d{4})                           # last 4 digits
        (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
        )''', re.VERBOSE)

    # TODO: Create email regex.

    # TODO: Find matches in clipboard text.

    # TODO: Copy results to the clipboard.

The TODO comments are just a skeleton for the program. They’ll be replaced as you write the actual code.

The phone number begins with an optional area code, so the area code group is followed by a question mark. Since the area code can be just three digits (that is, \d{3}) or three digits within parentheses (that is, \(\d{3}\)), you should have a pipe joining those parts. You can add the regex comment # area code to this part of the multiline string to help you remember what (\d{3}|\(\d{3}\))? is supposed to match.

The phone number separator character can be a space (\s), hyphen (-), or period (\.), so these parts should also be joined by pipes. The next few parts of the regular expression are straightforward: three digits, followed by another separator, followed by four digits. The last part is an optional extension made up of any number of spaces followed by ext, x, or ext., followed by two to five digits.
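To check the pattern before wiring it to the clipboard, it can help to run it against a few strings. The following is a minimal sketch, assuming made-up sample numbers and a hypothetical helper named first_match(); neither is part of the project file.

```python
import re

# Same pattern as phoneRegex in the listing above.
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)


def first_match(text):
    """Return the whole matched phone number, or None if nothing matches.

    Hypothetical helper, used only for this illustration.
    """
    mo = phone_regex.search(text)
    return mo.group(0) if mo else None


# Made-up sample strings covering several formats.
samples = ['415-555-1234', '(415) 555-1234', '555.1234',
           '415 555 1234 x99', '5551234']

for sample in samples:
    print(repr(sample), '->', first_match(sample))
```

The last sample produces None because the separator between the first three digits and the last four is not optional in this pattern.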

Step 2: Create a Regex for Email Addresses

You will also need a regular expression that can match email addresses. Make your program look like the following:

    #! python3
    # phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

    import pyperclip, re

    phoneRegex = re.compile(r'''(
    --snip--

    # Create email regex.
    emailRegex = re.compile(r'''(
      ➊ [a-zA-Z0-9._%+-]+      # username
      ➋ @                      # @ symbol
      ➌ [a-zA-Z0-9.-]+         # domain name
        (\.[a-zA-Z]{2,4})      # dot-something
        )''', re.VERBOSE)

    # TODO: Find matches in clipboard text.

    # TODO: Copy results to the clipboard.

The username part of the email address ➊ is one or more characters that can be any of the following: lowercase and uppercase letters, numbers, a dot, an underscore, a percent sign, a plus sign, or a hyphen. You can put all of these into a character class: [a-zA-Z0-9._%+-].

The domain and username are separated by an @ symbol ➋. The domain name ➌ has a slightly less permissive character class with only letters, numbers, periods, and hyphens: [a-zA-Z0-9.-]. And last will be the “dot-com” part (technically known as the top-level domain), which can really be dot-anything. This is between two and four characters.

The format for email addresses has a lot of weird rules. This regular expression won’t match every possible valid email address, but it’ll match almost any typical email address you’ll encounter.
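The limits of this pattern are easier to see with a quick experiment. The sketch below assumes a hypothetical helper named find_email() and made-up addresses (apart from info@nostarch.com, which appears later in this chapter); it is only an illustration.

```python
import re

# Same pattern as emailRegex in the listing above.
email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)


def find_email(text):
    """Return the first email-like match in text, or None (hypothetical helper)."""
    mo = email_regex.search(text)
    return mo.group(0) if mo else None


print(find_email('Write to info@nostarch.com today.'))  # info@nostarch.com
print(find_email('first.last+tag@mail.example.org'))    # whole address
print(find_email('curator@example.museum'))             # curator@example.muse
print(find_email('no at-sign here'))                    # None
```

The third call shows one of the typical cases the pattern misses: a top-level domain longer than four letters is cut off after the fourth letter, because the final group accepts only two to four characters.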

Step 3: Find All Matches in the Clipboard Text

Now that you have specified the regular expressions for phone numbers and email addresses, you can let Python’s re module do the hard work of finding all the matches on the clipboard. The pyperclip.paste() function will get a string value of the text on the clipboard, and the findall() regex method will return a list of tuples.

Make your program look like the following:

    #! python3
    # phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

    import pyperclip, re

    phoneRegex = re.compile(r'''(
    --snip--

    # Find matches in clipboard text.
    text = str(pyperclip.paste())
  ➊ matches = []
  ➋ for groups in phoneRegex.findall(text):
        phoneNum = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phoneNum += ' x' + groups[8]
        matches.append(phoneNum)
  ➌ for groups in emailRegex.findall(text):
        matches.append(groups[0])

    # TODO: Copy results to the clipboard.

There is one tuple for each match, and each tuple contains strings for each group in the regular expression. Remember that group 0 matches the entire regular expression, so the group at index 0 of the tuple is the one you are interested in.

As you can see at ➊, you’ll store the matches in a list variable named matches. It starts off as an empty list, and a couple of for loops append the matches to it. For the email addresses, you append group 0 of each match ➌. For the matched phone numbers, you don’t want to just append group 0. While the program detects phone numbers in several formats, you want the phone number appended to be in a single, standard format. The phoneNum variable contains a string built from groups 1, 3, 5, and 8 of the matched text ➋. (These groups are the area code, first three digits, last four digits, and extension.)
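To see what the tuples returned by findall() look like, here is a small sketch that replaces the clipboard with a made-up string. It assumes a compact, single-line version of the phone pattern with the same groups, so the tuple indexes line up with the program above.

```python
import re

# Compact, single-line form of the phone pattern with the same nine groups.
phone_regex = re.compile(
    r'((\d{3}|\(\d{3}\))?(\s|-|\.)?(\d{3})(\s|-|\.)(\d{4})'
    r'(\s*(ext|x|ext.)\s*(\d{2,5}))?)'
)

# Made-up text standing in for the clipboard contents.
text = 'Call 415-555-1234 or (415) 555-9999 ext 42.'

found = phone_regex.findall(text)
print(found[0])
# ('415-555-1234', '415', '-', '555', '-', '1234', '', '', '')

matches = []
for groups in found:
    # Indexes 1, 3, and 5 hold the area code, first 3 digits, and last 4 digits.
    phone_num = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':          # index 8 holds the extension digits, if any
        phone_num += ' x' + groups[8]
    matches.append(phone_num)

print(matches)
# ['415-555-1234', '(415)-555-9999 x42']
```

Groups that did not take part in a match show up as empty strings, which is why the program compares groups[8] against ''. Note that an area code written with parentheses keeps them in the standardized result.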

Step 4: Join the Matches into a String for the Clipboard

Now that you have the email addresses and phone numbers as a list of strings in matches, you want to put them on the clipboard. The pyperclip.copy() function takes only a single string value, not a list of strings, so you call the join() method on matches.

To make it easier to see that the program is working, let’s print any matches you find to the terminal. And if no phone numbers or email addresses were found, the program should tell the user this.

Make your program look like the following:

    #! python3
    # phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

    --snip--
    for groups in emailRegex.findall(text):
        matches.append(groups[0])

    # Copy results to the clipboard.
    if len(matches) > 0:
        pyperclip.copy('\n'.join(matches))
        print('Copied to clipboard:')
        print('\n'.join(matches))
    else:
        print('No phone numbers or email addresses found.')
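The join-and-report logic can also be tried without touching the clipboard. This is a minimal sketch, assuming a hypothetical function named build_report() that returns the text the program would print; the real program calls pyperclip.copy() and print() directly.

```python
def build_report(matches):
    """Return the message for a list of matched strings (hypothetical helper)."""
    if len(matches) > 0:
        # '\n'.join() turns the list into one string, one match per line.
        return 'Copied to clipboard:\n' + '\n'.join(matches)
    return 'No phone numbers or email addresses found.'


print(build_report(['800-420-7240', 'info@nostarch.com']))
print(build_report([]))
```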

Running the Program

For an example, open your web browser to the No Starch Press contact page at http://www.nostarch.com/contactus.htm, press CTRL-A to select all the text on the page, and press CTRL-C to copy it to the clipboard. When you run this program, the output will look something like this:

    Copied to clipboard:
    800-420-7240
    415-863-9900
    415-863-9950
    info@nostarch.com
    media@nostarch.com
    academic@nostarch.com
    help@nostarch.com

Ideas for Similar Programs

Identifying patterns of text (and possibly substituting them with the sub() method) has many different potential applications.

Find website URLs that begin with http:// or https://.

Clean up dates in different date formats (such as 3/14/2015, 03-14-2015, and 2015/3/14) by replacing them with dates in a single, standard format.

Remove sensitive information such as Social Security or credit card numbers.

Find common typos such as multiple spaces between words, accidentally accidentally repeated words, or multiple exclamation marks at the end of sentences. Those are annoying!!
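As a starting point for these ideas, the sketch below shows one possible pattern for each. All of the patterns, sample strings, and the to_iso() helper are assumptions made up for illustration, and each is deliberately simple: the date pattern handles only month-first dates, and writing the result as YYYY-MM-DD is just one choice of standard format.

```python
import re

# Find website URLs that begin with http:// or https://.
url_regex = re.compile(r'https?://\S+')
urls = url_regex.findall('See https://nostarch.com and http://example.com/page for details.')
print(urls)

# Rewrite month-first dates such as 3/14/2015 or 03-14-2015 as YYYY-MM-DD.
date_regex = re.compile(r'\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b')


def to_iso(mo):
    """Build the replacement text from the month, day, and year groups."""
    month, day, year = mo.groups()
    return f'{year}-{int(month):02d}-{int(day):02d}'


dates = date_regex.sub(to_iso, 'Due 3/14/2015, revised 03-14-2015.')
print(dates)

# Replace anything shaped like a Social Security number.
ssn_regex = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
redacted = ssn_regex.sub('XXX-XX-XXXX', 'SSN on file: 123-45-6789')
print(redacted)

# Fix common typos: repeated words, runs of spaces, repeated exclamation marks.
deduped = re.sub(r'\b(\w+)\s+\1\b', r'\1', 'accidentally accidentally repeated words')
spaces = re.sub(r' {2,}', ' ', 'multiple   spaces  between words')
bangs = re.sub(r'!{2,}', '!', 'Those are annoying!!')
print(deduped, spaces, bangs, sep='\n')
```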

Summary

While a computer can search for text quickly, it must be told precisely what to look for. Regular expressions allow you to specify the precise patterns of characters you are looking for. In fact, some word processing and spreadsheet applications provide find-and-replace features that allow you to search using regular expressions.

The re module that comes with Python lets you compile Regex objects. These values have several methods: search() to find a single match, findall() to find all matching instances, and sub() to do a find-and-replace substitution of text.
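As a compact reminder of how the three methods differ, here is a short sketch using a made-up pattern and sentence.

```python
import re

# Hypothetical pattern: a capital letter followed by one or more word characters.
word_regex = re.compile(r'[A-Z]\w+')
text = 'Alice met Bob and Carol.'

first = word_regex.search(text)          # Match object for the first match, or None
print(first.group())                     # Alice

everything = word_regex.findall(text)    # list of strings (the pattern has no groups)
print(everything)                        # ['Alice', 'Bob', 'Carol']

replaced = word_regex.sub('X', text)     # new string with every match replaced
print(replaced)                          # X met X and X.
```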

There’s a bit more to regular expression syntax than is described in this chapter. You can find out more in the official Python documentation at http://docs.python.org/3/library/re.html. The tutorial website http://www.regular-expressions.info/ is also a useful resource.

Now that you have expertise manipulating and matching strings, it’s time to dive into how to read from and write to files on your computer’s hard drive.

Practice Questions

1. What is the function that creates Regex objects?

2. Why are raw strings often used when creating Regex objects?

3. What does the search() method return?

4. How do you get the actual strings that match the pattern from a Match object?

5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group 0 cover? Group 1? Group 2?

6. Parentheses and periods have specific meanings in regular expression syntax. How would you specify that you want a regex to match actual parentheses and period characters?

7. The findall() method returns a list of strings or a list of tuples of strings. What makes it return one or the other?

8. What does the | character signify in regular expressions?

9. What two things does the ? character signify in regular expressions?

10. What is the difference between the + and * characters in regular expressions?

11. What is the difference between {3} and {3,5} in regular expressions?

12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?

13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?

14. How do you make a regular expression case-insensitive?

15. What does the . character normally match? What does it match if re.DOTALL is passed as the second argument to re.compile()?

16. What is the difference between .* and .*??

17. What is the character class syntax to match all numbers and lowercase letters?

18. If numRegex = re.compile(r'\d+'), what will numRegex.sub('X', '12 drummers, 11 pipers, five rings, 3 hens') return?

19. What does passing re.VERBOSE as the second argument to re.compile() allow you to do?

20. How would you write a regex that matches a number with commas for every three digits? It must match the following:

'42'

'1,234'

'6,368,745'

but not the following:

'12,34,567' (which has only two digits between the commas)

'1234' (which lacks commas)

21. How would you write a regex that matches the full name of someone whose last name is Nakamoto? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following:

'Satoshi Nakamoto'

'Alice Nakamoto'

'RoboCop Nakamoto'

but not the following:

'satoshi Nakamoto' (where the first name is not capitalized)

'Mr. Nakamoto' (where the preceding word has a nonletter character)

'Nakamoto' (which has no first name)

'Satoshi nakamoto' (where Nakamoto is not capitalized)

22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:

'Alice eats apples.'

'Bob pets cats.'

'Carol throws baseballs.'

'Alice throws Apples.'

'BOB EATS CATS.'

but not the following:

'RoboCop eats apples.'

'ALICE THROWS FOOTBALLS.'

'Carol eats 7 cats.'

Practice Projects

For practice, write programs to do the following tasks.

Strong Password Detection

Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password is defined as one that is at least eight characters long, contains both uppercase and lowercase characters, and has at least one digit. You may need to test the string against multiple regex patterns to validate its strength.

Regex Version of strip()

Write a function that takes a string and does the same thing as the strip() string method. If no other arguments are passed other than the string to strip, then whitespace characters will be removed from the beginning and end of the string. Otherwise, the characters specified in the second argument to the function will be removed from the string.

* * *

[1] Cory Doctorow, “Here’s what ICT should really teach kids: how to do regular expressions,” Guardian, December 4, 2012, http://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions/.

Chapter 8. Reading and Writing Files

Variables are a fine way to store data while your program is running, but if you want your data to persist even after your program has finished, you need to save it to a file. You can think of a file’s contents as a single string value, potentially gigabytes in size. In this chapter, you will learn how to use Python to create, read, and save files on the hard drive.

Files and File Paths

A file has two key properties: a filename (usually written as one word) and a path. The path specifies the location of a file on the computer. For example, there is a file on my Windows 7 laptop with the filename project.docx in the path C:\Users\asweigart\Documents. The part of the filename after the last period is called the file’s extension and tells you a file’s type. project.docx is a Word document, and Users, asweigart, and Documents all refer to folders (also called directories). Folders can contain files and other folders. For example, project.docx is in the Documents folder, which is inside the asweigart folder, which is inside the Users folder. Figure 8-1 shows this folder organization.

Figure 8-1. A file in a hierarchy of folders
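The pieces of the example path can also be pulled apart in code. The following is a minimal sketch that assumes the ntpath module, the Windows flavor of os.path in the standard library, so that the results are the same on any operating system.

```python
import ntpath  # Windows flavor of os.path, importable on any operating system

path = r'C:\Users\asweigart\Documents\project.docx'

folder = ntpath.dirname(path)       # everything before the last backslash
filename = ntpath.basename(path)    # everything after the last backslash
stem, extension = ntpath.splitext(filename)
drive, rest = ntpath.splitdrive(path)

print(folder)      # C:\Users\asweigart\Documents
print(filename)    # project.docx
print(extension)   # .docx
print(drive)       # C:

# The folder names from the root down to the file's own folder.
folders = rest.strip('\\').split('\\')[:-1]
print(folders)     # ['Users', 'asweigart', 'Documents']
```

Note that splitext() includes the period in the extension it returns.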

The C:\ part of the path is the root folder, which contains all other folders. On Windows, the root folder is named C:\ and is also called the C: drive. On OS X and Linux, the root folder is /. In this book, I’ll be using the Windows-style root folder, C:\. If you are entering the interactive shell examples on OS X or Linux, enter / instead.

Additional volumes, such as a DVD drive or USB thumb drive, will appear differently on different operating systems. On Windows, they appear as new, lettered root drives, such as D:\ or E:\. On OS X, they appear as new folders under the /Volumes folder. On Linux, they appear as new folders under the /mnt (“mount”) folder. Also note that while folder names and filenames are not case sensitive on Windows and OS X, they are case sensitive on Linux.

Backslash on Windows and Forward Slash on OS X and Linux

On Windows, paths are written using backslashes (\) as the separator between folder names. OS X and Linux, however, use the forward slash (/) as their path separator. If you want your programs to work on all operating systems, you will have to write your Python scripts to handle both cases.

Fortunately, this is simple to do with the os.path.join() function. If you pass it the string values of individual file and folder names in your path, os.path.join() will return a string with a file path using the correct path separators. Enter the following into the interactive shell:

    >>> import os
    >>> os.path.join('usr', 'bin', 'spam')
    'usr\\bin\\spam'

I’m running these interactive shell examples on Windows, so os.path.join('usr', 'bin', 'spam') returned 'usr\\bin\\spam'. (Notice that the backslashes are doubled because each backslash needs to be escaped by another backslash character.) If I had called this function on OS X or Linux, the string would have been 'usr/bin/spam'.
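You can compare both results on a single machine. This sketch assumes the standard library’s ntpath and posixpath modules, which are the Windows and OS X/Linux versions of os.path; os.path itself is whichever of the two matches the operating system running the code.

```python
import ntpath      # Windows-style path functions
import os
import posixpath   # OS X- and Linux-style path functions

parts = ['usr', 'bin', 'spam']

windows_path = ntpath.join(*parts)
posix_path = posixpath.join(*parts)
native_path = os.path.join(*parts)   # matches one of the two above

print(repr(windows_path))   # 'usr\\bin\\spam' (repr shows the escaped backslashes)
print(windows_path)         # usr\bin\spam
print(posix_path)           # usr/bin/spam
```

Printing the string directly shows single backslashes; the doubled backslashes appear only in the quoted form that the interactive shell displays.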

The os.path.join() function is helpful if you need to create strings for filenames. These strings will be passed to several of the file-related functions introduced in this chapter. For example, the following code joins names from a list of filenames to the end of a folder’s name:

    >>> myFiles = ['accounts.txt', 'details.csv', 'invite.docx']
    >>> for filename in myFiles:
            print(os.path.join('C:\\Users\\asweigart', filename))
    C:\Users\asweigart\accounts.txt
    C:\Users\asweigart\details.csv
    C:\Users\asweigart\invite.docx

The Current Working Directory
