Automate the Boring Stuff with Python


by Al Sweigart


  Download this PDF from http://nostarch.com/automatestuff/, and enter the following into the interactive shell:

  >>> import PyPDF2
  >>> pdfFileObj = open('meetingminutes.pdf', 'rb')
  >>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
➊ >>> pdfReader.numPages
  19
➋ >>> pageObj = pdfReader.getPage(0)
➌ >>> pageObj.extractText()
  'OOFFFFIICCIIAALL BBOOAARRDD MMIINNUUTTEESS Meeting of March 7, 2015 \n The Board of Elementary and Secondary Education shall provide leadership and create policies for education that expand opportunities for children, empower families and communities, and advance Louisiana in an increasingly competitive global market. BOARD of ELEMENTARY and SECONDARY EDUCATION '

  First, import the PyPDF2 module. Then open meetingminutes.pdf in read binary mode and store it in pdfFileObj. To get a PdfFileReader object that represents this PDF, call PyPDF2.PdfFileReader() and pass it pdfFileObj. Store this PdfFileReader object in pdfReader.

  The total number of pages in the document is stored in the numPages attribute of a PdfFileReader object ➊. The example PDF has 19 pages, but let’s extract text from only the first page.

  To extract text from a page, you need to get a Page object, which represents a single page of a PDF, from a PdfFileReader object. You can get a Page object by calling the getPage() method ➋ on a PdfFileReader object and passing it the page number of the page you’re interested in—in our case, 0.

  PyPDF2 uses a zero-based index for getting pages: The first page is page 0, the second is page 1, and so on. This is always the case, even if pages are numbered differently within the document. For example, say your PDF is a three-page excerpt from a longer report, and its pages are numbered 42, 43, and 44. To get the first page of this document, you would want to call pdfReader.getPage(0), not getPage(42) or getPage(1).
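
  If you only know the number printed on a page, a little arithmetic gives you the index PyPDF2 expects. The following is a minimal sketch; the function printed_to_index() and its parameters are invented for this illustration and are not part of PyPDF2.

```python
def printed_to_index(printed_page, first_printed_page):
    """Convert a page number printed in the document to a zero-based index.

    first_printed_page is the number printed on the document's first page.
    """
    index = printed_page - first_printed_page
    if index < 0:
        raise ValueError('page %s comes before the first page' % printed_page)
    return index

# For the three-page excerpt numbered 42, 43, and 44:
print(printed_to_index(42, 42))  # 0, so call pdfReader.getPage(0)
print(printed_to_index(44, 42))  # 2, so call pdfReader.getPage(2)
```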

  Once you have your Page object, call its extractText() method to return a string of the page’s text ➌. The text extraction isn’t perfect: The text Charles E. “Chas” Roemer, President from the PDF is absent from the string returned by extractText(), and the spacing is sometimes off. Still, this approximation of the PDF text content may be good enough for your program.
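
  The same two calls work for every page, not just the first. As a rough sketch, the hypothetical helper below (extract_all_text() is a name made up here) loops over all the pages of a reader and collects their text. It assumes the object passed in behaves like the PdfFileReader object described above, with a numPages attribute and a getPage() method.

```python
def extract_all_text(pdf_reader):
    """Return a list with one string of extracted text per page."""
    page_texts = []
    for page_num in range(pdf_reader.numPages):
        page_obj = pdf_reader.getPage(page_num)
        page_texts.append(page_obj.extractText())
    return page_texts
```

  With the pdfReader object from the interactive shell session, extract_all_text(pdfReader)[0] would be the string extracted from the first page.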

  Decrypting PDFs

  Some PDF documents have an encryption feature that will keep them from being read until whoever is opening the document provides a password. Enter the following into the interactive shell with the PDF you downloaded, which has been encrypted with the password rosebud:

  >>> import PyPDF2
  >>> pdfReader = PyPDF2.PdfFileReader(open('encrypted.pdf', 'rb'))
➊ >>> pdfReader.isEncrypted
  True
  >>> pdfReader.getPage(0)
➋ Traceback (most recent call last):
    File "", line 1, in
      pdfReader.getPage()
    --snip--
    File "C:\Python34\lib\site-packages\PyPDF2\pdf.py", line 1173, in getObject
      raise utils.PdfReadError("file has not been decrypted")
  PyPDF2.utils.PdfReadError: file has not been decrypted
➌ >>> pdfReader.decrypt('rosebud')
  1
  >>> pageObj = pdfReader.getPage(0)

  All PdfFileReader objects have an isEncrypted attribute that is True if the PDF is encrypted and False if it isn’t ➊. Any attempt to call a function that reads the file before it has been decrypted with the correct password will result in an error ➋.

  To read an encrypted PDF, call the decrypt() function and pass the password as a string ➌. After you call decrypt() with the correct password, you’ll see that calling getPage() no longer causes an error. If given the wrong password, the decrypt() function will return 0 and getPage() will continue to fail. Note that the decrypt() method decrypts only the PdfFileReader object, not the actual PDF file. After your program terminates, the file on your hard drive remains encrypted. Your program will have to call decrypt() again the next time it is run.
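
  In a program, you may want to check the return value of decrypt() rather than wait for getPage() to fail later. Here is one possible sketch; decrypt_if_needed() is a hypothetical helper, and it assumes a reader with the isEncrypted attribute and decrypt() method described above.

```python
def decrypt_if_needed(pdf_reader, password):
    """Decrypt pdf_reader in place; raise ValueError on a wrong password."""
    if not pdf_reader.isEncrypted:
        return pdf_reader  # Nothing to do for unencrypted PDFs.
    if pdf_reader.decrypt(password) == 0:
        # decrypt() returned 0, which means the password did not match.
        raise ValueError('wrong password')
    return pdf_reader
```

  Calling decrypt_if_needed(pdfReader, 'rosebud') on the reader for encrypted.pdf should leave it ready for getPage(), while a wrong password raises an error right away.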

  Creating PDFs

  PyPDF2’s counterpart to PdfFileReader objects is PdfFileWriter objects, which can create new PDF files. But PyPDF2 cannot write arbitrary text to a PDF like Python can do with plaintext files. Instead, PyPDF2’s PDF-writing capabilities are limited to copying pages from other PDFs, rotating pages, overlaying pages, and encrypting files.

  PyPDF2 doesn’t allow you to directly edit a PDF. Instead, you have to create a new PDF and then copy content over from an existing document. The examples in this section will follow this general approach:

  Open one or more existing PDFs (the source PDFs) into PdfFileReader objects.

  Create a new PdfFileWriter object.

  Copy pages from the PdfFileReader objects into the PdfFileWriter object.

  Finally, use the PdfFileWriter object to write the output PDF.

  Creating a PdfFileWriter object creates only a value that represents a PDF document in Python. It doesn’t create the actual PDF file. For that, you must call the PdfFileWriter’s write() method.

  The write() method takes a regular File object that has been opened in write-binary mode. You can get such a File object by calling Python’s open() function with two arguments: the string of what you want the PDF’s filename to be and 'wb' to indicate the file should be opened in write-binary mode.

  If this sounds a little confusing, don’t worry—you’ll see how this works in the following code examples.
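
  As a preview, the last step might be wrapped up like this. It is only a sketch: save_pdf() is a made-up name, and the pdf_writer argument is assumed to be a PdfFileWriter object (or anything else with a write() method that accepts a File object). The with statement is used here instead of an explicit close() call so the file is closed even if write() fails.

```python
def save_pdf(pdf_writer, filename):
    """Write the PDF held by pdf_writer to filename."""
    # Open the output file in write-binary mode, as write() requires.
    with open(filename, 'wb') as output_file:
        pdf_writer.write(output_file)
```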

  Copying Pages

  You can use PyPDF2 to copy pages from one PDF document to another. This allows you to combine multiple PDF files, cut unwanted pages, or reorder pages.

  Download meetingminutes.pdf and meetingminutes2.pdf from http://nostarch.com/automatestuff/ and place the PDFs in the current working directory. Enter the following into the interactive shell:

  >>> import PyPDF2
  >>> pdf1File = open('meetingminutes.pdf', 'rb')
  >>> pdf2File = open('meetingminutes2.pdf', 'rb')
➊ >>> pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
➋ >>> pdf2Reader = PyPDF2.PdfFileReader(pdf2File)
➌ >>> pdfWriter = PyPDF2.PdfFileWriter()
  >>> for pageNum in range(pdf1Reader.numPages):
➍         pageObj = pdf1Reader.getPage(pageNum)
➎         pdfWriter.addPage(pageObj)
  >>> for pageNum in range(pdf2Reader.numPages):
➍         pageObj = pdf2Reader.getPage(pageNum)
➎         pdfWriter.addPage(pageObj)
➏ >>> pdfOutputFile = open('combinedminutes.pdf', 'wb')
  >>> pdfWriter.write(pdfOutputFile)
  >>> pdfOutputFile.close()
  >>> pdf1File.close()
  >>> pdf2File.close()

  Open both PDF files in read binary mode and store the two resulting File objects in pdf1File and pdf2File. Call PyPDF2.PdfFileReader() and pass it pdf1File to get a PdfFileReader object for meetingminutes.pdf ➊. Call it again and pass it pdf2File to get a PdfFileReader object for meetingminutes2.pdf ➋. Then create a new PdfFileWriter object, which represents a blank PDF document ➌.

  Next, copy all the pages from the two source PDFs and add them to the PdfFileWriter object. Get the Page object by calling getPage() on a PdfFileReader object ➍. Then pass that Page object to your PdfFileWriter’s addPage() method ➎. These steps are done first for pdf1Reader and then again for pdf2Reader. When you’re done copying pages, write a new PDF called combinedminutes.pdf by passing a File object to the PdfFileWriter’s write() method ➏.

  Note

  The addPage() method only adds pages to the end of a PdfFileWriter object, so add your pages in the order you want them to appear in the output.

  You have now created a new PDF file that combines the pages from meetingminutes.pdf and meetingminutes2.pdf into a single document. Remember that the File object passed to PyPDF2.PdfFileReader() needs to be opened in read-binary mode by passing 'rb' as the second argument to open(). Likewise, the File object passed to the PdfFileWriter's write() method needs to be opened in write-binary mode with 'wb'.
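
  Cutting or reordering pages uses the same getPage() and addPage() calls, just with a different list of page numbers. The sketch below shows one way to do it; copy_pages() is a hypothetical helper, and its arguments are assumed to behave like the PdfFileReader and PdfFileWriter objects used in this section.

```python
def copy_pages(pdf_reader, pdf_writer, page_nums):
    """Add the pages listed in page_nums to pdf_writer, in that order."""
    # Check every page number first so nothing is added if one is invalid.
    for page_num in page_nums:
        if not 0 <= page_num < pdf_reader.numPages:
            raise IndexError('no page %s in this PDF' % page_num)
    for page_num in page_nums:
        pdf_writer.addPage(pdf_reader.getPage(page_num))
```

  For example, copy_pages(pdf1Reader, pdfWriter, [2, 0]) would add the third page followed by the first page and leave out everything else.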

  Rotating Pages

  The pages of a PDF can also be rotated in 90-degree increments with the rotateClockwise() and rotateCounterClockwise() methods. Pass one of the integers 90, 180, or 270 to these methods. Enter the following into the interactive shell, with the meetingminutes.pdf file in the current working directory:

  >>> import PyPDF2
  >>> minutesFile = open('meetingminutes.pdf', 'rb')
  >>> pdfReader = PyPDF2.PdfFileReader(minutesFile)
➊ >>> page = pdfReader.getPage(0)
➋ >>> page.rotateClockwise(90)
  {'/Contents': [IndirectObject(961, 0), IndirectObject(962, 0),
  --snip--
  }
  >>> pdfWriter = PyPDF2.PdfFileWriter()
  >>> pdfWriter.addPage(page)
➌ >>> resultPdfFile = open('rotatedPage.pdf', 'wb')
  >>> pdfWriter.write(resultPdfFile)
  >>> resultPdfFile.close()
  >>> minutesFile.close()

  Here we use getPage(0) to select the first page of the PDF ➊, and then we call rotateClockwise(90) on that page ➋. We write a new PDF with the rotated page and save it as rotatedPage.pdf ➌.

  The resulting PDF will have one page, rotated 90 degrees clockwise, as in Figure 13-2. The return values from rotateClockwise() and rotateCounterClockwise() contain a lot of information that you can ignore.
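
  To rotate a whole document rather than a single page, you could put the rotateClockwise() call inside a loop. This is a minimal sketch under the same assumptions as the session above; rotate_all_pages() is a name invented for the example, and the check on degrees simply mirrors the three values listed earlier.

```python
def rotate_all_pages(pdf_reader, pdf_writer, degrees):
    """Rotate every page clockwise by degrees and add it to pdf_writer."""
    if degrees not in (90, 180, 270):
        raise ValueError('degrees must be 90, 180, or 270')
    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        page.rotateClockwise(degrees)  # The return value can be ignored.
        pdf_writer.addPage(page)
```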

  Figure 13-2. The rotatedPage.pdf file with the page rotated 90 degrees clockwise

  Overlaying Pages

  PyPDF2 can also overlay the contents of one page over another, which is useful for adding a logo, timestamp, or watermark to a page. With Python, it’s easy to add watermarks to multiple files and only to pages your program specifies.

  Download watermark.pdf from http://nostarch.com/automatestuff/ and place the PDF in the current working directory along with meetingminutes.pdf. Then enter the following into the interactive shell:

  >>> import PyPDF2
  >>> minutesFile = open('meetingminutes.pdf', 'rb')
➊ >>> pdfReader = PyPDF2.PdfFileReader(minutesFile)
➋ >>> minutesFirstPage = pdfReader.getPage(0)
➌ >>> pdfWatermarkReader = PyPDF2.PdfFileReader(open('watermark.pdf', 'rb'))
➍ >>> minutesFirstPage.mergePage(pdfWatermarkReader.getPage(0))
➎ >>> pdfWriter = PyPDF2.PdfFileWriter()
➏ >>> pdfWriter.addPage(minutesFirstPage)
➐ >>> for pageNum in range(1, pdfReader.numPages):
          pageObj = pdfReader.getPage(pageNum)
          pdfWriter.addPage(pageObj)
  >>> resultPdfFile = open('watermarkedCover.pdf', 'wb')
  >>> pdfWriter.write(resultPdfFile)
  >>> minutesFile.close()
  >>> resultPdfFile.close()

  Here we make a PdfFileReader object of meetingminutes.pdf ➊. We call getPage(0) to get a Page object for the first page and store this object in minutesFirstPage ➋. We then make a PdfFileReader object for watermark.pdf ➌ and call mergePage() on minutesFirstPage ➍. The argument we pass to mergePage() is a Page object for the first page of watermark.pdf.

  Now that we’ve called mergePage() on minutesFirstPage, minutesFirstPage represents the watermarked first page. We make a PdfFileWriter object ➎ and add the watermarked first page ➏. Then we loop through the rest of the pages in meetingminutes.pdf and add them to the PdfFileWriter object ➐. Finally, we open a new PDF called watermarkedCover.pdf and write the contents of the PdfFileWriter to the new PDF.

  Figure 13-3 shows the results. Our new PDF, watermarkedCover.pdf, has all the contents of the meetingminutes.pdf, and the first page is watermarked.
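
  To watermark only the pages your program specifies, the mergePage() call can sit behind an if statement inside a single loop. The following sketch assumes objects like those in the session above; add_watermark() and its pages_to_mark parameter are hypothetical names chosen for this example.

```python
def add_watermark(pdf_reader, watermark_page, pdf_writer, pages_to_mark):
    """Copy every page to pdf_writer, watermarking those in pages_to_mark."""
    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        if page_num in pages_to_mark:
            page.mergePage(watermark_page)  # Changes the page in place.
        pdf_writer.addPage(page)
```

  Passing [0] as pages_to_mark would watermark only the first page, as in the interactive shell example.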

  Figure 13-3. The original PDF (left), the watermark PDF (center), and the merged PDF (right)

  Encrypting PDFs

  A PdfFileWriter object can also add encryption to a PDF document. Enter the following into the interactive shell:

  >>> import PyPDF2
  >>> pdfFile = open('meetingminutes.pdf', 'rb')
  >>> pdfReader = PyPDF2.PdfFileReader(pdfFile)
  >>> pdfWriter = PyPDF2.PdfFileWriter()
  >>> for pageNum in range(pdfReader.numPages):
          pdfWriter.addPage(pdfReader.getPage(pageNum))
➊ >>> pdfWriter.encrypt('swordfish')
  >>> resultPdf = open('encryptedminutes.pdf', 'wb')
  >>> pdfWriter.write(resultPdf)
  >>> resultPdf.close()

  Before calling the write() method to save to a file, call the encrypt() method and pass it a password string ➊. PDFs can have a user password (allowing you to view the PDF) and an owner password (allowing you to set permissions for printing, commenting, extracting text, and other features). The user password and owner password are the first and second arguments to encrypt(), respectively. If only one string argument is passed to encrypt(), it will be used for both passwords.

  In this example, we copied the pages of meetingminutes.pdf to a PdfFileWriter object. We encrypted the PdfFileWriter with the password swordfish, opened a new PDF called encryptedminutes.pdf, and wrote the contents of the PdfFileWriter to the new PDF. Before anyone can view encryptedminutes.pdf, they’ll have to enter this password. You may want to delete the original, unencrypted meetingminutes.pdf file after ensuring its copy was correctly encrypted.
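
  If you want the user password and owner password to differ, pass both strings to encrypt(). The sketch below shows the idea; encrypt_pages() is a made-up helper, and its arguments are assumed to be a PdfFileReader and PdfFileWriter like the ones in the example above.

```python
def encrypt_pages(pdf_reader, pdf_writer, user_password, owner_password):
    """Copy all pages to pdf_writer, then set both passwords on it."""
    for page_num in range(pdf_reader.numPages):
        pdf_writer.addPage(pdf_reader.getPage(page_num))
    # The user password comes first and the owner password second.
    pdf_writer.encrypt(user_password, owner_password)
```

  Remember that encrypt() must still be called before the writer's write() method saves the file.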

  Project: Combining Select Pages from Many PDFs

  Say you have the boring job of merging several dozen PDF documents into a single PDF file. Each of them has a cover sheet as the first page, but you don’t want the cover sheet repeated in the final result. Even though there are lots of free programs for combining PDFs, many of them simply merge entire files together. Let’s write a Python program to customize which pages you want in the combined PDF.

  At a high level, here’s what the program will do:

  Find all PDF files in the current working directory.

  Sort the filenames so the PDFs are added in order.

  Write each page, excluding the first page, of each PDF to the output file.

  In terms of implementation, your code will need to do the following:

  Call os.listdir() to find all the files in the working directory and remove any non-PDF files.

  Call Python’s sort() list method to alphabetize the filenames.

  Create a PdfFileWriter object for the output PDF.

  Loop over each PDF file, creating a PdfFileReader object for it.

  Loop over each page (except the first) in each PDF file.

  Add the pages to the output PDF.

  Write the output PDF to a file named allminutes.pdf.

  For this project, open a new file editor window and save it as combinePdfs.py.

  Step 1: Find All PDF Files

  First, your program needs to get a list of all files with the .pdf extension in the current working directory and sort them. Make your code look like the following:

  #! python3
  # combinePdfs.py - Combines all the PDFs in the current working directory into
  # a single PDF.
➊ import PyPDF2, os
  # Get all the PDF filenames.
  pdfFiles = []
  for filename in os.listdir('.'):
      if filename.endswith('.pdf'):
➋         pdfFiles.append(filename)
➌ pdfFiles.sort(key = str.lower)
➍ pdfWriter = PyPDF2.PdfFileWriter()
  # TODO: Loop through all the PDF files.
  # TODO: Loop through all the pages (except the first) and add them.
  # TODO: Save the resulting PDF to a file.

  After the shebang line and the descriptive comment about what the program does, this code imports the os and PyPDF2 modules ➊. The os.listdir('.') call will return a list of every file in the current working directory. The code loops over this list and adds only those files with the .pdf extension to pdfFiles ➋. Afterward, this list is sorted in alphabetical order with the key = str.lower keyword argument to sort() ➌.

  A PdfFileWriter object is created to hold the combined PDF pages ➍. Finally, a few comments outline the rest of the program.
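
  The filename-gathering step can be tried on its own, without PyPDF2. Here is a small sketch of it as a function; find_pdf_files() is a hypothetical name, and unlike the program above it lowercases each filename before checking the extension, which is an extra assumption about what you want matched. The key=str.lower argument keeps capitalized names from sorting ahead of all the lowercase ones.

```python
import os

def find_pdf_files(folder='.'):
    """Return the .pdf filenames in folder, sorted without regard to case."""
    pdf_files = []
    for filename in os.listdir(folder):
        # Lowercasing first also matches names such as REPORT.PDF.
        if filename.lower().endswith('.pdf'):
            pdf_files.append(filename)
    pdf_files.sort(key=str.lower)
    return pdf_files
```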

  Step 2: Open Each PDF

  Now the program must read each PDF file in pdfFiles. Add the following to your program:

  #! python3
  # combinePdfs.py - Combines all the PDFs in the current working directory into
  # a single PDF.
  import PyPDF2, os
  # Get all the PDF filenames.
  pdfFiles = []
  --snip--
  # Loop through all the PDF files.
  for filename in pdfFiles:
      pdfFileObj = open(filename, 'rb')
      pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
      # TODO: Loop through all the pages (except the first) and add them.
  # TODO: Save the resulting PDF to a file.

  For each PDF, the loop opens a filename in read-binary mode by calling open() with 'rb' as the second argument. The open() call returns a File object, which gets passed to PyPDF2.PdfFileReader() to create a PdfFileReader object for that PDF file.

  Step 3: Add Each Page

  For each PDF, you’ll want to loop over every page except the first. Add this code to your program:

  #! python3
  # combinePdfs.py - Combines all the PDFs in the current working directory into
  # a single PDF.
  import PyPDF2, os
  --snip--
  # Loop through all the PDF files.
  for filename in pdfFiles:
  --snip--
      # Loop through all the pages (except the first) and add them.
➊     for pageNum in range(1, pdfReader.numPages):
          pageObj = pdfReader.getPage(pageNum)
          pdfWriter.addPage(pageObj)
  # TODO: Save the resulting PDF to a file.

  The code inside the for loop copies each Page object individually to the PdfFileWriter object. Remember, you want to skip the first page. Since PyPDF2 considers 0 to be the first page, your loop should start at 1 ➊ and then go up to, but not include, the integer in pdfReader.numPages.
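
  You can check which page numbers this range produces without opening any PDFs. The short sketch below uses a made-up function, pages_after_cover(), purely to show the values the loop would visit.

```python
def pages_after_cover(num_pages):
    """Return the zero-based page numbers that follow the cover sheet."""
    return list(range(1, num_pages))

print(pages_after_cover(4))  # [1, 2, 3]
print(pages_after_cover(1))  # [] because a one-page PDF is all cover sheet
```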

  Step 4: Save the Results

  After these nested for loops are done looping, the pdfWriter variable will contain a PdfFileWriter object with the pages for all the PDFs combined. The last step is to write this content to a file on the hard drive. Add this code to your program:

  #! python3
  # combinePdfs.py - Combines all the PDFs in the current working directory into
  # a single PDF.
  import PyPDF2, os
  --snip--
  # Loop through all the PDF files.
  for filename in pdfFiles:
  --snip--
      # Loop through all the pages (except the first) and add them.
      for pageNum in range(1, pdfReader.numPages):
      --snip--
  # Save the resulting PDF to a file.
  pdfOutput = open('allminutes.pdf', 'wb')
  pdfWriter.write(pdfOutput)
  pdfOutput.close()

  Passing 'wb' to open() opens the output PDF file, allminutes.pdf, in write-binary mode. Then, passing the resulting File object to the write() method creates the actual PDF file. A call to the close() method finishes the program.
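
  One thing to watch for: because allminutes.pdf is saved in the current working directory and ends with .pdf, running the program a second time would find it along with the source PDFs. A possible guard, sketched here with the hypothetical function exclude_output(), is to filter the output filename out of the list before looping over it.

```python
def exclude_output(pdf_files, output_name='allminutes.pdf'):
    """Return pdf_files without the combined output file."""
    return [name for name in pdf_files if name.lower() != output_name.lower()]

print(exclude_output(['allminutes.pdf', 'jan.pdf', 'feb.pdf']))
# ['jan.pdf', 'feb.pdf']
```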

 
