Automate the Boring Stuff with Python
Page 29
>>> import openpyxl >>> wbFormulas = openpyxl.load_workbook('writeFormula.xlsx') >>> sheet = wbFormulas.get_active_sheet() >>> sheet['A3'].value '=SUM(A1:A2)' >>> wbDataOnly = openpyxl.load_workbook('writeFormula.xlsx', data_only=True) >>> sheet = wbDataOnly.get_active_sheet() >>> sheet['A3'].value 500
Here, when load_workbook() is called with data_only=True, the A3 cell shows 500, the result of the =SUM(A1:A2) formula, rather than the text of the formula.
Excel formulas offer a level of programmability for spreadsheets but can quickly become unmanageable for complicated tasks. For example, even if you’re deeply familiar with Excel formulas, it’s a headache to try to decipher what =IFERROR(TRIM(IF(LEN(VLOOKUP(F7, Sheet2!$A$1:$B$10000, 2, FALSE))>0,SUBSTITUTE(VLOOKUP(F7, Sheet2!$A$1:$B$10000, 2, FALSE), “ ”, “”),“”)), “”) actually does. Python code is much more readable.
Adjusting Rows and Columns
In Excel, adjusting the sizes of rows and columns is as easy as clicking and dragging the edges of a row or column header. But if you need to set a row or column’s size based on its cells’ contents or if you want to set sizes in a large number of spreadsheet files, it will be much quicker to write a Python program to do it.
Rows and columns can also be hidden entirely from view. Or they can be “frozen” in place so that they are always visible on the screen and appear on every page when the spreadsheet is printed (which is handy for headers).
Setting Row Height and Column Width
Worksheet objects have row_dimensions and column_dimensions attributes that control row heights and column widths. Enter this into the interactive shell:
>>> import openpyxl >>> wb = openpyxl.Workbook() >>> sheet = wb.get_active_sheet() >>> sheet['A1'] = 'Tall row' >>> sheet['B2'] = 'Wide column' >>> sheet.row_dimensions[1].height = 70 >>> sheet.column_dimensions['B'].width = 20 >>> wb.save('dimensions.xlsx')
A sheet’s row_dimensions and column_dimensions are dictionary-like values; row_dimensions contains RowDimension objects and column_dimensions contains ColumnDimension objects. In row_dimensions, you can access one of the objects using the number of the row (in this case, 1 or 2). In column_dimensions, you can access one of the objects using the letter of the column (in this case, A or B).
The dimensions.xlsx spreadsheet looks like Figure 12-6.
Figure 12-6. Row 1 and column B set to larger heights and widths
Once you have the RowDimension object, you can set its height. Once you have the ColumnDimension object, you can set its width. The row height can be set to an integer or float value between 0 and 409. This value represents the height measured in points, where one point equals 1/72 of an inch. The default row height is 12.75. The column width can be set to an integer or float value between 0 and 255. This value represents the number of characters at the default font size (11 point) that can be displayed in the cell. The default column width is 8.43 characters. Columns with widths of 0 or rows with heights of 0 are hidden from the user.
Merging and Unmerging Cells
A rectangular area of cells can be merged into a single cell with the merge_cells() sheet method. Enter the following into the interactive shell:
>>> import openpyxl >>> wb = openpyxl.Workbook() >>> sheet = wb.get_active_sheet() >>> sheet.merge_cells('A1:D3') >>> sheet['A1'] = 'Twelve cells merged together.' >>> sheet.merge_cells('C5:D5') >>> sheet['C5'] = 'Two merged cells.' >>> wb.save('merged.xlsx')
The argument to merge_cells() is a single string of the top-left and bottom-right cells of the rectangular area to be merged: 'A1:D3' merges 12 cells into a single cell. To set the value of these merged cells, simply set the value of the top-left cell of the merged group.
When you run this code, merged.xlsx will look like Figure 12-7.
Figure 12-7. Merged cells in a spreadsheet
To unmerge cells, call the unmerge_cells() sheet method. Enter this into the interactive shell.
>>> import openpyxl >>> wb = openpyxl.load_workbook('merged.xlsx') >>> sheet = wb.get_active_sheet() >>> sheet.unmerge_cells('A1:D3') >>> sheet.unmerge_cells('C5:D5') >>> wb.save('merged.xlsx')
If you save your changes and then take a look at the spreadsheet, you’ll see that the merged cells have gone back to being individual cells.
Freeze Panes
For spreadsheets too large to be displayed all at once, it’s helpful to “freeze” a few of the top rows or leftmost columns onscreen. Frozen column or row headers, for example, are always visible to the user even as they scroll through the spreadsheet. These are known as freeze panes. In OpenPyXL, each Worksheet object has a freeze_panes attribute that can be set to a Cell object or a string of a cell’s coordinates. Note that all rows above and all columns to the left of this cell will be frozen, but the row and column of the cell itself will not be frozen.
To unfreeze all panes, set freeze_panes to None or 'A1'. Table 12-3 shows which rows and columns will be frozen for some example settings of freeze_panes.
Table 12-3. Frozen Pane Examples
freeze_panes setting
Rows and columns frozen
sheet.freeze_panes = 'A2'
Row 1
sheet.freeze_panes = 'B1'
Column A
sheet.freeze_panes = 'C1'
Columns A and B
sheet.freeze_panes = 'C2'
Row 1 and columns A and B
sheet.freeze_panes = 'A1' or sheet.freeze_panes = None
No frozen panes
Make sure you have the produce sales spreadsheet from http://nostarch.com/automatestuff/. Then enter the following into the interactive shell:
>>> import openpyxl >>> wb = openpyxl.load_workbook('produceSales.xlsx') >>> sheet = wb.get_active_sheet() >>> sheet.freeze_panes = 'A2' >>> wb.save('freezeExample.xlsx')
If you set the freeze_panes attribute to 'A2', row 1 will always be viewable, no matter where the user scrolls in the spreadsheet. You can see this in Figure 12-8.
Figure 12-8. With freeze_panes set to 'A2', row 1 is always visible even as the user scrolls down.
Charts
OpenPyXL supports creating bar, line, scatter, and pie charts using the data in a sheet’s cells. To make a chart, you need to do the following:
Create a Reference object from a rectangular selection of cells.
Create a Series object by passing in the Reference object.
Create a Chart object.
Append the Series object to the Chart object.
Optionally, set the drawing.top, drawing.left, drawing.width, and drawing.height variables of the Chart object.
Add the Chart object to the Worksheet object.
The Reference object requires some explaining. Reference objects are created by calling the openpyxl.charts.Reference() function and passing three arguments:
The Worksheet object containing your chart data.
A tuple of two integers, representing the top-left cell of the rectangular selection of cells containing your chart data: The first integer in the tuple is the row, and the second is the column. Note that 1 is the first row, not 0.
A tuple of two integers, representing the bottom-right cell of the rectangular selection of cells containing your chart data: The first integer in the tuple is the row, and the second is the column.
Figure 12-9 shows some sample coordinate arguments.
Figure 12-9. From left to right: (1, 1), (10, 1); (3, 2), (6, 4); (5, 3), (5, 3)
Enter this interactive shell example to create a bar chart and add it to the spreadsheet:
>>> import openpyxl >>> wb = openpyxl.Workbook() >>> sheet = wb.get_active_sheet() >>> for i in range(1, 11): # create some data in column A sheet['A' + str(i)] = i >>> refObj = openpyxl.charts.Reference(sheet, (1, 1), (10, 1)) >>> seriesObj = openpyxl.charts.Series(refObj, title='First series') >>> chartObj = openpyxl.charts.BarChart() >>> chartObj.append(seriesObj) >>> chartObj.drawing.top = 50 # set the position >>> chartObj.drawing.left = 100 >>> chartObj.drawing.width = 300 # set the size >>> chartObj.drawing.height = 200 >>> sheet.add_chart(chartObj) >>
> wb.save('sampleChart.xlsx')
This produces a spreadsheet that looks like Figure 12-10.
Figure 12-10. A spreadsheet with a chart added
We’ve created a bar chart by calling openpyxl.charts.BarChart(). You can also create line charts, scatter charts, and pie charts by calling openpyxl.charts.LineChart(), openpyxl.charts.ScatterChart(), and openpyxl.charts.PieChart().
Unfortunately, in the current version of OpenPyXL (2.1.4), the load_workbook() function does not load charts in Excel files. Even if the Excel file has charts, the loaded Workbook object will not include them. If you load a Workbook object and immediately save it to the same .xlsx filename, you will effectively remove the charts from it.
Summary
Often the hard part of processing information isn’t the processing itself but simply getting the data in the right format for your program. But once you have your spreadsheet loaded into Python, you can extract and manipulate its data much faster than you could by hand.
You can also generate spreadsheets as output from your programs. So if colleagues need your text file or PDF of thousands of sales contacts transferred to a spreadsheet file, you won’t have to tediously copy and paste it all into Excel.
Equipped with the openpyxl module and some programming knowledge, you’ll find processing even the biggest spreadsheets a piece of cake.
Practice Questions
For the following questions, imagine you have a Workbook object in the variable wb, a Worksheet object in sheet, a Cell object in cell, a Comment object in comm, and an Image object in img.
Q:
1. What does the openpyxl.load_workbook() function return?
Q:
2. What does the get_sheet_names() workbook method return?
Q:
3. How would you retrieve the Worksheet object for a sheet named 'Sheet1'?
Q:
4. How would you retrieve the Worksheet object for the workbook’s active sheet?
Q:
5. How would you retrieve the value in the cell C5?
Q:
6. How would you set the value in the cell C5 to "Hello"?
Q:
7. How would you retrieve the cell’s row and column as integers?
Q:
8. What do the get_highest_column() and get_highest_row() sheet methods return, and what is the data type of these return values?
Q:
9. If you needed to get the integer index for column 'M', what function would you need to call?
Q:
10. If you needed to get the string name for column 14, what function would you need to call?
Q:
11. How can you retrieve a tuple of all the Cell objects from A1 to F1?
Q:
12. How would you save the workbook to the filename example.xlsx?
Q:
13. How do you set a formula in a cell?
Q:
14. If you want to retrieve the result of a cell’s formula instead of the cell’s formula itself, what must you do first?
Q:
15. How would you set the height of row 5 to 100?
Q:
16. How would you hide column C?
Q:
17. Name a few features that OpenPyXL 2.1.4 does not load from a spreadsheet file.
Q:
18. What is a freeze pane?
Q:
19. What five functions and methods do you have to call to create a bar chart?
Practice Projects
For practice, write programs that perform the following tasks.
Multiplication Table Maker
Create a program multiplicationTable.py that takes a number N from the command line and creates an N×N multiplication table in an Excel spreadsheet. For example, when the program is run like this:
py multiplicationTable.py 6
... it should create a spreadsheet that looks like Figure 12-11.
Figure 12-11. A multiplication table generated in a spreadsheet
Row 1 and column A should be used for labels and should be in bold.
Blank Row Inserter
Create a program blankRowInserter.py that takes two integers and a filename string as command line arguments. Let’s call the first integer N and the second integer M. Starting at row N, the program should insert M blank rows into the spreadsheet. For example, when the program is run like this:
python blankRowInserter.py 3 2 myProduce.xlsx
... the “before” and “after” spreadsheets should look like Figure 12-12.
Figure 12-12. Before (left) and after (right) the two blank rows are inserted at row 3
You can write this program by reading in the contents of the spreadsheet. Then, when writing out the new spreadsheet, use a for loop to copy the first N lines. For the remaining lines, add M to the row number in the output spreadsheet.
Spreadsheet Cell Inverter
Write a program to invert the row and column of the cells in the spreadsheet. For example, the value at row 5, column 3 will be at row 3, column 5 (and vice versa). This should be done for all cells in the spreadsheet. For example, the “before” and “after” spreadsheets would look something like Figure 12-13.
Figure 12-13. The spreadsheet before (top) and after (bottom) inversion
You can write this program by using nested for loops to read in the spreadsheet’s data into a list of lists data structure. This data structure could have sheetData[x][y] for the cell at column x and row y. Then, when writing out the new spreadsheet, use sheetData[y][x] for the cell at column x and row y.
Text Files to Spreadsheet
Write a program to read in the contents of several text files (you can make the text files yourself) and insert those contents into a spreadsheet, with one line of text per row. The lines of the first text file will be in the cells of column A, the lines of the second text file will be in the cells of column B, and so on.
Use the readlines() File object method to return a list of strings, one string per line in the file. For the first file, output the first line to column 1, row 1. The second line should be written to column 1, row 2, and so on. The next file that is read with readlines() will be written to column 2, the next file to column 3, and so on.
Spreadsheet to Text Files
Write a program that performs the tasks of the previous program in reverse order: The program should open a spreadsheet and write the cells of column A into one text file, the cells of column B into another text file, and so on.
Chapter 13. Working with PDF and word Documents
PDF and Word documents are binary files, which makes them much more complex than plaintext files. In addition to text, they store lots of font, color, and layout information. If you want your programs to read or write to PDFs or Word documents, you’ll need to do more than simply pass their filenames to open().
Fortunately, there are Python modules that make it easy for you to interact with PDFs and Word documents. This chapter will cover two such modules: PyPDF2 and Python-Docx.
PDF Documents
PDF stands for Portable Document Format and uses the .pdf file extension. Although PDFs support many features, this chapter will focus on the two things you’ll be doing most often with them: reading text content from PDFs and crafting new PDFs from existing documents.
The module you’ll use to work with PDFs is PyPDF2. To install it, run pip install PyPDF2 from the command line. This module name is case sensitive, so make sure the y is lowercase and everything else is uppercase. (Check out Appendix A for full details about installing third-party modules.) If the module was installed correctly, running import PyPDF2 in the interactive shell shouldn’t display any errors.
The Problematic PDF Format
While PDF files are great for laying out text in a way that’s easy for people to print and read, they’re not straightforward for software to parse into plaintext. As such, PyPDF2 might make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. There isn�
��t much you can do about this, unfortunately. PyPDF2 may simply be unable to work with some of your particular PDF files. That said, I haven’t found any PDF files so far that can’t be opened with PyPDF2.
Extracting Text from PDFs
PyPDF2 does not have a way to extract images, charts, or other media from PDF documents, but it can extract text and return it as a Python string. To start learning how PyPDF2 works, we’ll use it on the example PDF shown in Figure 13-1.
Figure 13-1. The PDF page that we will be extracting text from