Automate the Boring Stuff with Python

by Al Sweigart

Document objects have an add_picture() method that will let you add an image to the end of the document. Say you have a file zophie.png in the current working directory. You can add zophie.png to the end of your document with a width of 1 inch and height of 4 centimeters (Word can use both imperial and metric units) by entering the following:

>>> doc.add_picture('zophie.png', width=docx.shared.Inches(1), height=docx.shared.Cm(4))

The first argument is a string of the image’s filename. The optional width and height keyword arguments will set the width and height of the image in the document. If left out, the width and height will default to the normal size of the image.

You’ll probably prefer to specify an image’s height and width in familiar units such as inches and centimeters, so you can use the docx.shared.Inches() and docx.shared.Cm() functions when you’re specifying the width and height keyword arguments.

Summary

Text information isn’t just for plaintext files; in fact, it’s pretty likely that you deal with PDFs and Word documents much more often. You can use the PyPDF2 module to read and write PDF documents. Unfortunately, reading text from PDF documents might not always result in a perfect translation to a string because of the complicated PDF file format, and some PDFs might not be readable at all. In these cases, you’re out of luck unless future updates to PyPDF2 support additional PDF features.

Word documents are more reliable, and you can read them with the python-docx module. You can manipulate text in Word documents via Paragraph and Run objects. These objects can also be given styles, though they must be from the default set of styles or styles already in the document. You can add new paragraphs, headings, breaks, and pictures to the document, though only to the end.

Many of the limitations that come with working with PDFs and Word documents are because these formats are meant to be nicely displayed for human readers, rather than easy to parse by software. The next chapter takes a look at two other common formats for storing information: JSON and CSV files. These formats are designed to be used by computers, and you’ll see that Python can work with these formats much more easily.

Practice Questions

1. A string value of the PDF filename is not passed to the PyPDF2.PdfFileReader() function. What do you pass to the function instead?

2. What modes do the File objects for PdfFileReader() and PdfFileWriter() need to be opened in?

3. How do you acquire a Page object for About This Book from a PdfFileReader object?

4. What PdfFileReader variable stores the number of pages in the PDF document?

5. If a PdfFileReader object’s PDF is encrypted with the password swordfish, what must you do before you can obtain Page objects from it?

6. What methods do you use to rotate a page?

7. What method returns a Document object for a file named demo.docx?

8. What is the difference between a Paragraph object and a Run object?

9. How do you obtain a list of Paragraph objects for a Document object that’s stored in a variable named doc?

10. What type of object has bold, underline, italic, strike, and outline variables?

11. What is the difference between setting the bold variable to True, False, or None?

12. How do you create a Document object for a new Word document?

13. How do you add a paragraph with the text 'Hello there!' to a Document object stored in a variable named doc?

14. What integers represent the levels of headings available in Word documents?

Practice Projects

For practice, write programs that do the following.

PDF Paranoia

Using the os.walk() function from Chapter 9, write a script that will go through every PDF in a folder (and its subfolders) and encrypt the PDFs using a password provided on the command line. Save each encrypted PDF with an _encrypted.pdf suffix added to the original filename. Before deleting the original file, have the program attempt to read and decrypt the file to ensure that it was encrypted correctly.

Then, write a program that finds all encrypted PDFs in a folder (and its subfolders) and creates a decrypted copy of the PDF using a provided password. If the password is incorrect, the program should print a message to the user and continue to the next PDF.

Custom Invitations as Word Documents

Say you have a text file of guest names. This guests.txt file has one name per line, as follows:

Prof. Plum
Miss Scarlet
Col. Mustard
Al Sweigart
RoboCop

Write a program that would generate a Word document with custom invitations that look like Figure 13-11.

Since Python-Docx can use only those styles that already exist in the Word document, you will have to first add these styles to a blank Word file and then open that file with Python-Docx. There should be one invitation per page in the resulting Word document, so call add_break() to add a page break after the last paragraph of each invitation. This way, you will need to open only one Word document to print all of the invitations at once.

Figure 13-11. The Word document generated by your custom invite script

You can download a sample guests.txt file from http://nostarch.com/automatestuff/.

Brute-Force PDF Password Breaker

Say you have an encrypted PDF that you have forgotten the password to, but you remember it was a single English word. Trying to guess your forgotten password is quite a boring task. Instead you can write a program that will decrypt the PDF by trying every possible English word until it finds one that works. This is called a brute-force password attack. Download the text file dictionary.txt from http://nostarch.com/automatestuff/. This dictionary file contains over 44,000 English words with one word per line.

Using the file-reading skills you learned in Chapter 8, create a list of word strings by reading this file. Then loop over each word in this list, passing it to the decrypt() method. If this method returns the integer 0, the password was wrong and your program should continue to the next password. If decrypt() returns 1, then your program should break out of the loop and print the hacked password. You should try both the uppercase and lowercase form of each word. (On my laptop, going through all 88,000 uppercase and lowercase words from the dictionary file takes a couple of minutes. This is why you shouldn’t use a simple English word for your passwords.)

Chapter 14. Working with CSV Files and JSON Data

In Chapter 13, you learned how to extract text from PDF and Word documents. These files were in a binary format, which required special Python modules to access their data. CSV and JSON files, on the other hand, are just plaintext files. You can view them in a text editor, such as IDLE’s file editor. But Python also comes with the special csv and json modules, each providing functions to help you work with these file formats.

CSV stands for “comma-separated values,” and CSV files are simplified spreadsheets stored as plaintext files. Python’s csv module makes it easy to parse CSV files.

JSON (pronounced “JAY-sawn” or “Jason”; it doesn’t matter how, because either way people will say you’re pronouncing it wrong) is a format that stores information as JavaScript source code in plaintext files. (JSON is short for JavaScript Object Notation.) You don’t need to know the JavaScript programming language to use JSON files, but the JSON format is useful to know because it’s used in many web applications.

The CSV Module

Each line in a CSV file represents a row in the spreadsheet, and commas separate the cells in the row. For example, the spreadsheet example.xlsx from http://nostarch.com/automatestuff/ would look like this in a CSV file:

4/5/2015 13:34,Apples,73
4/5/2015 3:41,Cherries,85
4/6/2015 12:46,Pears,14
4/8/2015 8:59,Oranges,52
4/10/2015 2:07,Apples,152
4/10/2015 18:10,Bananas,23
4/10/2015 2:40,Strawberries,98

I will use this file for this chapter’s interactive shell examples. You can download example.csv from http://nostarch.com/automatestuff/ or enter the text into a text editor and save it as example.csv.

CSV files are simple, lacking many of the features of an Excel spreadsheet. For example, CSV files

Don’t have types for their values—everything is a string

Don’t have settings for font size or color

Don’t have multiple worksheets

Can’t specify cell widths and heights

Can’t have merged cells

Can’t have images or charts embedded in them

The advantage of CSV files is simplicity. CSV files are widely supported by many types of programs, can be viewed in text editors (including IDLE’s file editor), and are a straightforward way to represent spreadsheet data. The CSV format is exactly as advertised: It’s just a text file of comma-separated values.

Since CSV files are just text files, you might be tempted to read them in as a string and then process that string using the techniques you learned in Chapter 8. For example, since each cell in a CSV file is separated by a comma, maybe you could just call the split() method on each line of text to get the values. But not every comma in a CSV file represents the boundary between two cells. CSV files also have their own set of escape characters to allow commas and other characters to be included as part of the values. The split() method doesn’t handle these escape characters. Because of these potential pitfalls, you should always use the csv module for reading and writing CSV files.
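To make the pitfall concrete, here is a minimal sketch comparing split() with the csv module on a single line. The line itself is made up for illustration, and io.StringIO merely stands in for an open file so the example runs on its own.

```python
import csv
import io

# A made-up CSV line whose first cell contains a comma, so it is quoted.
line = '"Hello, world!",eggs,bacon,ham'

# Naive approach: split() treats every comma as a cell boundary.
naiveCells = line.split(',')

# csv.reader() accepts any iterable of lines; io.StringIO stands in for a file.
parsedCells = next(csv.reader(io.StringIO(line)))

print(naiveCells)   # five pieces, with stray quote characters left in
print(parsedCells)  # four cells, with the quoted comma kept inside one value
```

The split() call breaks 'Hello, world!' into two pieces, while csv.reader() keeps it as one cell.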

Reader Objects

To read data from a CSV file with the csv module, you need to create a Reader object. A Reader object lets you iterate over lines in the CSV file. Enter the following into the interactive shell, with example.csv in the current working directory:

➊ >>> import csv
➋ >>> exampleFile = open('example.csv')
➌ >>> exampleReader = csv.reader(exampleFile)
➍ >>> exampleData = list(exampleReader)
➎ >>> exampleData
[['4/5/2015 13:34', 'Apples', '73'], ['4/5/2015 3:41', 'Cherries', '85'],
['4/6/2015 12:46', 'Pears', '14'], ['4/8/2015 8:59', 'Oranges', '52'],
['4/10/2015 2:07', 'Apples', '152'], ['4/10/2015 18:10', 'Bananas', '23'],
['4/10/2015 2:40', 'Strawberries', '98']]

The csv module comes with Python, so we can import it ➊ without having to install it first.

To read a CSV file with the csv module, first open it using the open() function ➋, just as you would any other text file. But instead of calling the read() or readlines() method on the File object that open() returns, pass it to the csv.reader() function ➌. This will return a Reader object for you to use. Note that you don’t pass a filename string directly to the csv.reader() function.

The most direct way to access the values in the Reader object is to convert it to a plain Python list by passing it to list() ➍. Using list() on this Reader object returns a list of lists, which you can store in a variable like exampleData. Entering exampleData in the shell displays the list of lists ➎.

Now that you have the CSV file as a list of lists, you can access the value at a particular row and column with the expression exampleData[row][col], where row is the index of one of the lists in exampleData, and col is the index of the item you want from that list. Enter the following into the interactive shell:

>>> exampleData[0][0]
'4/5/2015 13:34'
>>> exampleData[0][1]
'Apples'
>>> exampleData[0][2]
'73'
>>> exampleData[1][1]
'Cherries'
>>> exampleData[6][1]
'Strawberries'

exampleData[0][0] goes into the first list and gives us the first string, exampleData[0][2] goes into the first list and gives us the third string, and so on.
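Because every value in a CSV file is read as a string, a cell such as '73' has to be converted before you can do arithmetic with it. The sketch below holds the example rows in memory with io.StringIO so that it is self-contained. Treating the third column as a quantity is an assumption made for illustration, since the file has no headers.

```python
import csv
import io

# The same rows as example.csv, kept in memory so the sketch runs on its own.
EXAMPLE_CSV = """4/5/2015 13:34,Apples,73
4/5/2015 3:41,Cherries,85
4/6/2015 12:46,Pears,14
4/8/2015 8:59,Oranges,52
4/10/2015 2:07,Apples,152
4/10/2015 18:10,Bananas,23
4/10/2015 2:40,Strawberries,98
"""

exampleData = list(csv.reader(io.StringIO(EXAMPLE_CSV)))

# Row 0, column 2 is the string '73'; int() turns it into a number.
firstQuantity = int(exampleData[0][2])

# Assumption: column 2 is a quantity, so add it up across all rows.
totalQuantity = sum(int(row[2]) for row in exampleData)

print(firstQuantity)
print(totalQuantity)
```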

Reading Data from Reader Objects in a for Loop

For large CSV files, you’ll want to use the Reader object in a for loop. This avoids loading the entire file into memory at once. For example, enter the following into the interactive shell:

>>> import csv
>>> exampleFile = open('example.csv')
>>> exampleReader = csv.reader(exampleFile)
>>> for row in exampleReader:
        print('Row #' + str(exampleReader.line_num) + ' ' + str(row))

Row #1 ['4/5/2015 13:34', 'Apples', '73']
Row #2 ['4/5/2015 3:41', 'Cherries', '85']
Row #3 ['4/6/2015 12:46', 'Pears', '14']
Row #4 ['4/8/2015 8:59', 'Oranges', '52']
Row #5 ['4/10/2015 2:07', 'Apples', '152']
Row #6 ['4/10/2015 18:10', 'Bananas', '23']
Row #7 ['4/10/2015 2:40', 'Strawberries', '98']

After you import the csv module and make a Reader object from the CSV file, you can loop through the rows in the Reader object. Each row is a list of values, with each value representing a cell.

The print() function call prints the number of the current row and the contents of the row. To get the row number, use the Reader object’s line_num variable, which contains the number of the current line.

The Reader object can be looped over only once. To reread the CSV file, you must reopen the file (or rewind it to the start) and call csv.reader() again to create a new Reader object.
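The following sketch shows one way to see this behavior. It uses io.StringIO as a stand-in for the File object that open() would return, with two rows borrowed from example.csv. Rewinding with seek(0) is one option; reopening the file would work as well.

```python
import csv
import io

# io.StringIO stands in for the File object returned by open('example.csv').
exampleFile = io.StringIO('4/5/2015 13:34,Apples,73\n4/5/2015 3:41,Cherries,85\n')

exampleReader = csv.reader(exampleFile)
firstPass = list(exampleReader)    # consumes every row
secondPass = list(exampleReader)   # nothing left: the reader is exhausted

# Rewind the file and build a fresh Reader object to read the rows again.
exampleFile.seek(0)
thirdPass = list(csv.reader(exampleFile))

print(len(firstPass), len(secondPass), len(thirdPass))
```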

Writer Objects

A Writer object lets you write data to a CSV file. To create a Writer object, you use the csv.writer() function. Enter the following into the interactive shell:

>>> import csv
➊ >>> outputFile = open('output.csv', 'w', newline='')
➋ >>> outputWriter = csv.writer(outputFile)
>>> outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham'])
21
>>> outputWriter.writerow(['Hello, world!', 'eggs', 'bacon', 'ham'])
32
>>> outputWriter.writerow([1, 2, 3.141592, 4])
16
>>> outputFile.close()

First, call open() and pass it 'w' to open a file in write mode ➊. This will create the object you can then pass to csv.writer() ➋ to create a Writer object.

On Windows, you’ll also need to pass a blank string for the open() function’s newline keyword argument. For technical reasons beyond the scope of this book, if you forget to set the newline argument, the rows in output.csv will be double-spaced, as shown in Figure 14-1.

Figure 14-1. If you forget the newline='' keyword argument in open(), the CSV file will be double-spaced.

The writerow() method for Writer objects takes a list argument. Each value in the list is placed in its own cell in the output CSV file. The return value of writerow() is the number of characters written to the file for that row (including newline characters).

This code produces an output.csv file that looks like this:

spam,eggs,bacon,ham
"Hello, world!",eggs,bacon,ham
1,2,3.141592,4

Notice how the Writer object automatically escapes the comma in the value 'Hello, world!' with double quotes in the CSV file. The csv module saves you from having to handle these special cases yourself.
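As a rough check of this behavior, the sketch below writes two of the rows to an in-memory io.StringIO object (standing in for the file opened with newline='') and then reads the text back with csv.reader(). The variable names are only illustrative.

```python
import csv
import io

# io.StringIO stands in for open('output.csv', 'w', newline='').
outputFile = io.StringIO(newline='')
outputWriter = csv.writer(outputFile)

# writerow() returns the number of characters written, row ending included.
firstCount = outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham'])
secondCount = outputWriter.writerow(['Hello, world!', 'eggs', 'bacon', 'ham'])

# The raw text shows the quotes the Writer added around the cell with a comma.
rawText = outputFile.getvalue()

# Reading the text back strips the quotes and restores the original value.
rowsReadBack = list(csv.reader(io.StringIO(rawText, newline='')))

print(repr(rawText))
print(rowsReadBack)
```

The quotes exist only in the file's text; the value that comes back from the Reader object is the original 'Hello, world!' string.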

The delimiter and lineterminator Keyword Arguments

Say you want to separate cells with a tab character instead of a comma and you want the rows to be double-spaced. You could enter something like the following into the interactive shell:

>>> import csv
>>> csvFile = open('example.tsv', 'w', newline='')
➊ >>> csvWriter = csv.writer(csvFile, delimiter='\t', lineterminator='\n\n')
>>> csvWriter.writerow(['apples', 'oranges', 'grapes'])
23
>>> csvWriter.writerow(['eggs', 'bacon', 'ham'])
16
>>> csvWriter.writerow(['spam', 'spam', 'spam', 'spam', 'spam', 'spam'])
31
>>> csvFile.close()

This changes the delimiter and line terminator characters in your file. The delimiter is the character that appears between cells on a row. By default, the delimiter for a CSV file is a comma. The line terminator is the character that comes at the end of a row. By default, the line terminator is a newline (csv.writer() writes it as '\r\n'). You can change these characters to different values by using the delimiter and lineterminator keyword arguments with csv.writer().

Passing delimiter='\t' and lineterminator='\n\n' ➊ changes the character between cells to a tab and the character between rows to two newlines. We then call writerow() three times to give us three rows.

This produces a file named example.tsv with the following contents:

apples	oranges	grapes

eggs	bacon	ham

spam	spam	spam	spam	spam	spam

Now that our cells are separated by tabs, we’re using the file extension .tsv, for tab-separated values.
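Reading such a file back works the same way as reading a CSV file, provided csv.reader() is given the matching delimiter keyword argument. The sketch below is a minimal illustration using in-memory text that mirrors the first two rows of example.tsv. Note that the blank lines produced by double-spacing come back as empty lists, which this example filters out.

```python
import csv
import io

# Tab-delimited, double-spaced text mirroring the start of example.tsv.
tsvText = 'apples\toranges\tgrapes\n\neggs\tbacon\tham\n\n'

# io.StringIO stands in for open('example.tsv', newline='').
tsvReader = csv.reader(io.StringIO(tsvText, newline=''), delimiter='\t')

# Blank lines are returned as empty lists, so keep only the non-empty rows.
rows = [row for row in tsvReader if row]

print(rows)
```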

Project: Removing the Header from CSV Files

Say you have the boring job of removing the first line from several hundred CSV files. Maybe you’ll be feeding them into an automated process that requires just the data and not the headers at the top of the columns. You could open each file in Excel, delete the first row, and resave the file—but that would take hours. Let’s write a program to do it instead.

The program will need to open every file with the .csv extension in the current working directory, read in the contents of the CSV file, and rewrite the contents without the first row to a file of the same name. This will replace the old contents of the CSV file with the new, headless contents.
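One possible shape for such a program is sketched below. It is not necessarily how the finished project is organized: the function name removeHeader, the demo file, and its header names are all made up for illustration, and the demo runs in a temporary folder so that no real files are touched.

```python
import csv
import os
import tempfile


def removeHeader(folder='.'):
    """Strip the first row from every .csv file in folder, in place."""
    for filename in os.listdir(folder):
        if not filename.endswith('.csv'):
            continue  # skip non-CSV files
        path = os.path.join(folder, filename)

        # Read every row except the first into memory.
        with open(path, newline='') as csvFile:
            rows = list(csv.reader(csvFile))[1:]

        # Overwrite the same file with the remaining rows.
        with open(path, 'w', newline='') as csvFile:
            csv.writer(csvFile).writerows(rows)


# Demo in a throwaway folder; the header names here are hypothetical.
with tempfile.TemporaryDirectory() as demoFolder:
    demoPath = os.path.join(demoFolder, 'demo.csv')
    with open(demoPath, 'w', newline='') as f:
        f.write('Date,Fruit,Quantity\r\n4/5/2015 13:34,Apples,73\r\n')
    removeHeader(demoFolder)
    with open(demoPath, newline='') as f:
        remaining = f.read()

print(repr(remaining))
```

Because this version overwrites each file, a mistake in the program would destroy the original data, so try it on copies first.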

Note

As always, whenever you write a program that modifies files, be sure to back up the files first, just in case your program does not work the way you expect it to. You don’t want to accidentally erase your original files.

At a high level, the program must do the following:
