Automate the Boring Stuff with Python
A string of the current folder’s name
A list of strings of the folders in the current folder
A list of strings of the files in the current folder
(By current folder, I mean the folder for the current iteration of the for loop. The current working directory of the program is not changed by os.walk().)
Just like you can choose the variable name i in the code for i in range(10):, you can also choose the variable names for the three values listed earlier. I usually use the names foldername, subfolders, and filenames.
When you run this program, it will output the following:
The current folder is C:\delicious
SUBFOLDER OF C:\delicious: cats
SUBFOLDER OF C:\delicious: walnut
FILE INSIDE C:\delicious: spam.txt

The current folder is C:\delicious\cats
FILE INSIDE C:\delicious\cats: catnames.txt
FILE INSIDE C:\delicious\cats: zophie.jpg

The current folder is C:\delicious\walnut
SUBFOLDER OF C:\delicious\walnut: waffles

The current folder is C:\delicious\walnut\waffles
FILE INSIDE C:\delicious\walnut\waffles: butter.txt
Since os.walk() returns lists of strings for the subfolder and filename variables, you can use these lists in their own for loops. Replace the print() function calls with your own custom code. (Or if you don’t need one or both of them, remove the for loops.)
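As one illustration of swapping in your own code for the print() calls, here is a minimal sketch that collects the path of every .txt file in a folder tree. The temporary folder, the file names, and the txt_files list are assumptions made up for this example so that it can run on any computer; they are not part of the program described above.

```python
import os
import tempfile

# Build a small throwaway tree so the sketch runs anywhere.
# (The folder and file names here are illustrative only.)
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'cats'))
os.makedirs(os.path.join(root, 'walnut', 'waffles'))
for relative_path in ['spam.txt',
                      os.path.join('cats', 'catnames.txt'),
                      os.path.join('walnut', 'waffles', 'butter.txt')]:
    with open(os.path.join(root, relative_path), 'w') as f:
        f.write('placeholder')

# Collect the full path of every .txt file under root.
txt_files = []
for foldername, subfolders, filenames in os.walk(root):
    for filename in filenames:
        if filename.endswith('.txt'):
            # foldername is the folder for this iteration, so joining it
            # with filename gives a path that works from anywhere.
            txt_files.append(os.path.join(foldername, filename))

for path in sorted(txt_files):
    print(path)
```

Because the subfolders list is not needed here, the sketch simply ignores it rather than looping over it.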
Compressing Files with the zipfile Module
You may be familiar with ZIP files (with the .zip file extension), which can hold the compressed contents of many other files. Compressing a file reduces its size, which is useful when transferring it over the Internet. And since a ZIP file can also contain multiple files and subfolders, it’s a handy way to package several files into one. This single file, called an archive file, can then be, say, attached to an email.
Your Python programs can both create and open (or extract) ZIP files using functions in the zipfile module. Say you have a ZIP file named example.zip that has the contents shown in Figure 9-2.
You can download this ZIP file from http://nostarch.com/automatestuff/ or just follow along using a ZIP file already on your computer.
Figure 9-2. The contents of example.zip
Reading ZIP Files
To read the contents of a ZIP file, first you must create a ZipFile object (note the capital letters Z and F). ZipFile objects are conceptually similar to the File objects you saw returned by the open() function in the previous chapter: They are values through which the program interacts with the file. To create a ZipFile object, call the zipfile.ZipFile() function, passing it a string of the .zip file’s filename. Note that zipfile is the name of the Python module, and ZipFile() is the name of the function.
For example, enter the following into the interactive shell:
>>> import zipfile, os
>>> os.chdir('C:\\')    # move to the folder with example.zip
>>> exampleZip = zipfile.ZipFile('example.zip')
>>> exampleZip.namelist()
['spam.txt', 'cats/', 'cats/catnames.txt', 'cats/zophie.jpg']
>>> spamInfo = exampleZip.getinfo('spam.txt')
>>> spamInfo.file_size
13908
>>> spamInfo.compress_size
3828
➊ >>> 'Compressed file is %sx smaller!' % (round(spamInfo.file_size / spamInfo.compress_size, 2))
'Compressed file is 3.63x smaller!'
>>> exampleZip.close()
A ZipFile object has a namelist() method that returns a list of strings for all the files and folders contained in the ZIP file. These strings can be passed to the getinfo() ZipFile method to return a ZipInfo object about that particular file. ZipInfo objects have their own attributes, such as file_size and compress_size in bytes, which hold integers of the original file size and compressed file size, respectively. While a ZipFile object represents an entire archive file, a ZipInfo object holds useful information about a single file in the archive.
The command at ➊ calculates how efficiently example.zip is compressed by dividing the original file size by the compressed file size and prints this information using a string formatted with %s.
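If you don't have example.zip handy, the following sketch shows the same namelist() and getinfo() calls in a form that runs on its own. It first builds a small archive in a temporary folder; the archive contents and the sizes dictionary are assumptions invented for this illustration. It also uses a with statement, which closes the ZipFile object for you, instead of calling close() by hand.

```python
import os
import tempfile
import zipfile

# Create a small archive to inspect (names and contents are illustrative).
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, 'example.zip')
with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as new_zip:
    new_zip.writestr('spam.txt', 'spam ' * 1000)
    new_zip.writestr('cats/catnames.txt', 'Zophie\nPooka\n')

# Record the original and compressed size of every entry.
sizes = {}
with zipfile.ZipFile(zip_path) as example_zip:
    for name in example_zip.namelist():
        info = example_zip.getinfo(name)   # a ZipInfo object for one entry
        sizes[name] = (info.file_size, info.compress_size)
        print('%s: %s bytes, %s bytes compressed'
              % (name, info.file_size, info.compress_size))
```

Highly repetitive text such as the spam.txt contents here compresses very well, so expect its compressed size to be far smaller than its original size.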
Extracting from ZIP Files
The extractall() method for ZipFile objects extracts all the files and folders from a ZIP file into the current working directory.
>>> import zipfile, os
>>> os.chdir('C:\\')    # move to the folder with example.zip
>>> exampleZip = zipfile.ZipFile('example.zip')
➊ >>> exampleZip.extractall()
>>> exampleZip.close()
After running this code, the contents of example.zip will be extracted to C:\. Optionally, you can pass a folder name to extractall() to have it extract the files into a folder other than the current working directory. If the folder passed to the extractall() method does not exist, it will be created. For instance, if you replaced the call at ➊ with exampleZip.extractall('C:\\delicious'), the code would extract the files from example.zip into a newly created C:\delicious folder.
The extract() method for ZipFile objects will extract a single file from the ZIP file. Continue the interactive shell example (if you have already called exampleZip.close(), create the ZipFile object again first):
>>> exampleZip.extract('spam.txt')
'C:\\spam.txt'
>>> exampleZip.extract('spam.txt', 'C:\\some\\new\\folders')
'C:\\some\\new\\folders\\spam.txt'
>>> exampleZip.close()
The string you pass to extract() must match one of the strings in the list returned by namelist(). Optionally, you can pass a second argument to extract() to extract the file into a folder other than the current working directory. If this second argument is a folder that doesn’t yet exist, Python will create the folder. The value that extract() returns is the absolute path to which the file was extracted.
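Here is a small, self-contained sketch of both extraction methods. It works inside a temporary folder rather than C:\, and the archive contents and the folder names everything and some/new/folders are assumptions chosen for the example.

```python
import os
import tempfile
import zipfile

# Build a tiny archive to extract from (contents are illustrative).
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, 'example.zip')
with zipfile.ZipFile(zip_path, 'w') as new_zip:
    new_zip.writestr('spam.txt', 'Hello')
    new_zip.writestr('cats/catnames.txt', 'Zophie')

all_dir = os.path.join(work_dir, 'everything')
single_dir = os.path.join(work_dir, 'some', 'new', 'folders')

with zipfile.ZipFile(zip_path) as example_zip:
    # Extract everything into a folder that does not exist yet;
    # extractall() creates it.
    example_zip.extractall(all_dir)
    # Extract one member; extract() returns the path it wrote to.
    extracted_path = example_zip.extract('spam.txt', single_dir)

print(extracted_path)
print(os.path.exists(os.path.join(all_dir, 'cats', 'catnames.txt')))
```

Passing a name that is not in namelist() to extract() raises a KeyError, so it can be worth checking the list first when the name comes from user input.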
Creating and Adding to ZIP Files
To create your own compressed ZIP files, you must open the ZipFile object in write mode by passing 'w' as the second argument. (This is similar to opening a text file in write mode by passing 'w' to the open() function.)
When you pass a path to the write() method of a ZipFile object, Python will compress the file at that path and add it into the ZIP file. The write() method’s first argument is a string of the filename to add. The compress_type keyword argument is the compression type parameter, which tells the computer what algorithm it should use to compress the files; you can always just set this value to zipfile.ZIP_DEFLATED. (This specifies the deflate compression algorithm, which works well on most types of data.) Enter the following into the interactive shell:
>>> import zipfile
>>> newZip = zipfile.ZipFile('new.zip', 'w')
>>> newZip.write('spam.txt', compress_type=zipfile.ZIP_DEFLATED)
>>> newZip.close()
This code will create a new ZIP file named new.zip that has the compressed contents of spam.txt.
Keep in mind that, just as with writing to files, write mode will erase all existing contents of a ZIP file. If you want to simply add files to an existing ZIP file, pass 'a' as the second argument to zipfile.ZipFile() to open the ZIP file in append mode.
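To see the difference between write mode and append mode, consider this minimal sketch. The temporary folder and the eggs.txt file are assumptions added for the illustration, and the arcname argument (a standard write() parameter that the shell example above does not use) keeps the temporary folder's path out of the archive.

```python
import os
import tempfile
import zipfile

# Make two small files to add (names and contents are illustrative).
work_dir = tempfile.mkdtemp()
for name in ['spam.txt', 'eggs.txt']:
    with open(os.path.join(work_dir, name), 'w') as f:
        f.write('Hello, world!\n' * 100)

zip_path = os.path.join(work_dir, 'new.zip')

# 'w' creates the archive, or erases it if it already exists.
with zipfile.ZipFile(zip_path, 'w') as new_zip:
    new_zip.write(os.path.join(work_dir, 'spam.txt'), arcname='spam.txt',
                  compress_type=zipfile.ZIP_DEFLATED)

# 'a' keeps what is already in the archive and adds to it.
with zipfile.ZipFile(zip_path, 'a') as new_zip:
    new_zip.write(os.path.join(work_dir, 'eggs.txt'), arcname='eggs.txt',
                  compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(zip_path) as new_zip:
    names = new_zip.namelist()
print(names)
```

If the second block had used 'w' instead of 'a', the archive would end up holding only eggs.txt.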
Project: Renaming Files with American-Style Dates to European-Style Dates
Say your boss emails you thousands of files with American-style dates (MM-DD-YYYY) in their names and needs them renamed to European-style dates (DD-MM-YYYY). This boring task could take all day to do by hand! Let’s write a program to do it instead.
Here’s what the program does:
It searches all the filenames in the current working directory for American-style dates.
When one is found, it renames the file with the month and day swapped to make it European-style.
This means the code will need to do the following:
Create a regex that can identify the text pattern of American-style dates.
Call os.listdir() to find all the files in the working directory.
Loop over each filename, using the regex to check whether it has a date.
If it has a date, rename the file with shutil.move().
For this project, open a new file editor window and save your code as renameDates.py.
Step 1: Create a Regex for American-Style Dates
The first part of the program will need to import the necessary modules and create a regex that can identify MM-DD-YYYY dates. The to-do comments will remind you what’s left to write in this program. Typing them as TODO makes them easy to find using IDLE’s CTRL-F find feature. Make your code look like the following:
#! python3
# renameDates.py - Renames filenames with American MM-DD-YYYY date format
# to European DD-MM-YYYY.

➊ import shutil, os, re

# Create a regex that matches files with the American date format.
➋ datePattern = re.compile(r"""^(.*?) # all text before the date
    ((0|1)?\d)-                     # one or two digits for the month
    ((0|1|2|3)?\d)-                 # one or two digits for the day
    ((19|20)\d\d)                   # four digits for the year
    (.*?)$                          # all text after the date
➌   """, re.VERBOSE)

# TODO: Loop over the files in the working directory.
# TODO: Skip files without a date.
# TODO: Get the different parts of the filename.
# TODO: Form the European-style filename.
# TODO: Get the full, absolute file paths.
# TODO: Rename the files.
From this chapter, you know the shutil.move() function can be used to rename files: Its arguments are the name of the file to rename and the new filename. Because this function exists in the shutil module, you must import that module ➊.
But before renaming the files, you need to identify which files you want to rename. Filenames with dates such as spam4-4-1984.txt and 01-03-2014eggs.zip should be renamed, while filenames without dates such as littlebrother.epub can be ignored.
You can use a regular expression to identify this pattern. After importing the re module at the top, call re.compile() to create a Regex object ➋. Passing re.VERBOSE for the second argument ➌ will allow whitespace and comments in the regex string to make it more readable.
The regular expression string begins with ^(.*?) to match any text at the beginning of the filename that might come before the date. The ((0|1)?\d) group matches the month. The first digit can be either 0 or 1, so the regex matches 12 for December but also 02 for February. This digit is also optional so that the month can be 04 or 4 for April. The group for the day is ((0|1|2|3)?\d) and follows similar logic; 3, 03, and 31 are all valid numbers for days. (Yes, this regex will accept some invalid dates such as 4-31-2014, 2-29-2013, and 0-15-2014. Dates have a lot of thorny special cases that can be easy to miss. But for simplicity, the regex in this program works well enough.)
While 1885 is a valid year, you can just look for years in the 20th or 21st century. This will keep your program from accidentally matching nondate filenames with a date-like format, such as 10-10-1000.txt.
The (.*?)$ part of the regex will match any text that comes after the date.
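Before going further, you may want to confirm that the pattern accepts and rejects the filenames mentioned above. The following sketch does that. The variable names date_pattern, filenames, and has_date are choices made for this example; the regex is the one described in this step.

```python
import re

date_pattern = re.compile(r"""^(.*?)   # all text before the date
    ((0|1)?\d)-                        # one or two digits for the month
    ((0|1|2|3)?\d)-                    # one or two digits for the day
    ((19|20)\d\d)                      # four digits for the year
    (.*?)$                             # all text after the date
    """, re.VERBOSE)

filenames = ['spam4-4-1984.txt', '01-03-2014eggs.zip',
             'littlebrother.epub', '10-10-1000.txt']

# search() returns None when the filename has no American-style date.
has_date = {name: date_pattern.search(name) is not None for name in filenames}
for name in filenames:
    print(name, has_date[name])
```

The first two filenames match, while littlebrother.epub and 10-10-1000.txt do not, the latter because its year starts with neither 19 nor 20.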
Step 2: Identify the Date Parts from the Filenames
Next, the program will have to loop over the list of filename strings returned from os.listdir() and match them against the regex. Any files that do not have a date in them should be skipped. For filenames that have a date, the matched text will be stored in several variables. Fill in the first three TODOs in your program with the following code:
#! python3
# renameDates.py - Renames filenames with American MM-DD-YYYY date format
# to European DD-MM-YYYY.

--snip--

# Loop over the files in the working directory.
for amerFilename in os.listdir('.'):
    mo = datePattern.search(amerFilename)

    # Skip files without a date.
➊   if mo == None:
➋       continue

➌   # Get the different parts of the filename.
    beforePart = mo.group(1)
    monthPart = mo.group(2)
    dayPart = mo.group(4)
    yearPart = mo.group(6)
    afterPart = mo.group(8)

--snip--
If the Match object returned from the search() method is None ➊, then the filename in amerFilename does not match the regular expression. The continue statement ➋ will skip the rest of the loop and move on to the next filename.
Otherwise, the various strings matched in the regular expression groups are stored in variables named beforePart, monthPart, dayPart, yearPart, and afterPart ➌. The strings in these variables will be used to form the European-style filename in the next step.
To keep the group numbers straight, try reading the regex from the beginning and count up each time you encounter an opening parenthesis. Without thinking about the code, just write an outline of the regular expression. This can help you visualize the groups. For example:
datePattern = re.compile(r"""^(1) # all text before the date
    (2 (3) )-                      # one or two digits for the month
    (4 (5) )-                      # one or two digits for the day
    (6 (7) )                       # four digits for the year
    (8)$                           # all text after the date
    """, re.VERBOSE)
Here, the numbers 1 through 8 represent the groups in the regular expression you wrote. Making an outline of the regular expression, with just the parentheses and group numbers, can give you a clearer understanding of your regex before you move on with the rest of the program.
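Another way to check your count is to let Python number the groups for you. This is only a sketch: the sample filename and the numbered_groups dictionary are assumptions for illustration, and the regex is written on one line without re.VERBOSE to keep the example short.

```python
import re

date_pattern = re.compile(
    r'^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-((19|20)\d\d)(.*?)$')

mo = date_pattern.search('01-03-2014eggs.zip')

# groups() returns groups 1 through 8 in order. Numbering starts at 1
# because group 0 is the entire match.
numbered_groups = dict(enumerate(mo.groups(), start=1))
for number, text in numbered_groups.items():
    print(number, repr(text))
```

Groups 2, 4, 6, and 8 hold the month, day, year, and trailing text, which is why the program asks for exactly those numbers. Groups 3, 5, and 7 hold only the inner pieces (the optional leading digit or the century).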
Step 3: Form the New Filename and Rename the Files
As the final step, concatenate the strings in the variables made in the previous step to form the European-style date: The day comes before the month. Fill in the three remaining TODOs in your program with the following code:
#! python3
# renameDates.py - Renames filenames with American MM-DD-YYYY date format
# to European DD-MM-YYYY.

--snip--

    # Form the European-style filename.
➊   euroFilename = beforePart + dayPart + '-' + monthPart + '-' + yearPart + afterPart

    # Get the full, absolute file paths.
    absWorkingDir = os.path.abspath('.')
    amerFilename = os.path.join(absWorkingDir, amerFilename)
    euroFilename = os.path.join(absWorkingDir, euroFilename)

    # Rename the files.
➋   print('Renaming "%s" to "%s"...' % (amerFilename, euroFilename))
➌   #shutil.move(amerFilename, euroFilename)   # uncomment after testing
Store the concatenated string in a variable named euroFilename ➊. Then, pass the original filename in amerFilename and the new euroFilename variable to the shutil.move() function to rename the file ➌.
This program has the shutil.move() call commented out and instead prints the filenames that will be renamed ➋. Running the program like this first can let you double-check that the files are renamed correctly. Then you can uncomment the shutil.move() call and run the program again to actually rename the files.
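If you would rather not comment and uncomment a line, one possible variation is to put the logic in functions and control the renaming with a parameter. The sketch below is an assumption-laden rearrangement, not the program from this project: the function names american_to_european() and rename_dates(), the dry_run parameter, and the temporary demo folder are all invented for the example.

```python
import os
import re
import shutil
import tempfile

date_pattern = re.compile(
    r'^(.*?)((0|1)?\d)-((0|1|2|3)?\d)-((19|20)\d\d)(.*?)$')


def american_to_european(filename):
    """Return filename with MM-DD-YYYY swapped to DD-MM-YYYY, or None."""
    mo = date_pattern.search(filename)
    if mo is None:
        return None
    before, month, day, year, after = mo.group(1, 2, 4, 6, 8)
    return before + day + '-' + month + '-' + year + after


def rename_dates(folder, dry_run=True):
    """Return (old, new) name pairs; rename the files unless dry_run."""
    planned = []
    for amer_name in sorted(os.listdir(folder)):
        euro_name = american_to_european(amer_name)
        if euro_name is None:
            continue  # skip files without a date
        planned.append((amer_name, euro_name))
        if not dry_run:
            shutil.move(os.path.join(folder, amer_name),
                        os.path.join(folder, euro_name))
    return planned


# Try it on a throwaway folder.
demo_dir = tempfile.mkdtemp()
for name in ['spam04-30-1984.txt', 'littlebrother.epub']:
    open(os.path.join(demo_dir, name), 'w').close()

print(rename_dates(demo_dir))                 # dry run: nothing changes
print(rename_dates(demo_dir, dry_run=False))  # actually renames
print(sorted(os.listdir(demo_dir)))
```

Keeping dry_run=True as the default means that a careless call only reports what would happen, which mirrors the test-first approach described above.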
Ideas for Similar Programs
There are many other reasons why you might want to rename a large number of files.
To add a prefix to the start of the filename, such as adding spam_ to rename eggs.txt to spam_eggs.txt
To change filenames with European-style dates to American-style dates
To remove the zeros from files such as spam0042.txt
Project: Backing Up a Folder into a ZIP File
Say you’re working on a project whose files you keep in a folder named C:\AlsPythonBook. You’re worried about losing your work, so you’d like to create ZIP file “snapshots” of the entire folder. You’d like to keep different versions, so you want the ZIP file’s filename to increment each time it is made; for example, AlsPythonBook_1.zip, AlsPythonBook_2.zip, AlsPythonBook_3.zip, and so on. You could do this by hand, but it is rather annoying, and you might accidentally misnumber the ZIP files’ names. It would be much simpler to run a program that does this boring task for you.
For this project, open a new file editor window and save it as backupToZip.py.
Step 1: Figure Out the ZIP File’s Name
The code for this program will be placed into a function named backupToZip(). This will make it easy to copy and paste the function into other Python programs that need this functionality. At the end of the program, the function will be called to perform the backup. Make your program look like this:
#! python3
# backupToZip.py - Copies an entire folder and its contents into
# a ZIP file whose filename increments.

➊ import zipfile, os

def backupToZip(folder):
    # Backup the entire contents of "folder" into a ZIP file.

    folder = os.path.abspath(folder)   # make sure folder is absolute

    # Figure out the filename this code should use based on
    # what files already exist.
➋   number = 1
➌   while True:
        zipFilename = os.path.basename(folder) + '_' + str(number) + '.zip'
        if not os.path.exists(zipFilename):
            break
        number = number + 1

➍   # TODO: Create the ZIP file.

    # TODO: Walk the entire folder tree and compress the files in each folder.
    print('Done.')

backupToZip('C:\\delicious')
Do the basics first: Add the shebang (#!) line, describe what the program does, and import the zipfile and os modules ➊.
Define a backupToZip() function that takes just one parameter, folder. This parameter is a string path to the folder whose contents should be backed up. The function will determine what filename to use for the ZIP file it will create; then the function will create the file, walk the folder tree, and add each of the subfolders and files to the ZIP file. Write TODO comments for these steps in the source code to remind yourself to do them later ➍.
The first part, naming the ZIP file, uses the base name of the absolute path of folder. If the folder being backed up is C:\delicious, the ZIP file’s name should be delicious_N.zip, where N = 1 is the first time you run the program, N = 2 is the second time, and so on.
You can determine what N should be by checking whether delicious_1.zip already exists, then checking whether delicious_2.zip already exists, and so on. Use a variable named number for N ➋, and keep incrementing it inside the loop that calls os.path.exists() to check whether the file exists ➌. The first nonexistent filename found will cause the loop to break, since it will have found the filename of the new zip.
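To watch the numbering logic work on its own, here is a sketch that pulls it into a separate function and runs it against a temporary folder holding two pretend backups. The function name next_zip_filename(), its search_dir parameter, and the empty placeholder files are assumptions for this illustration. The project's code checks the current working directory instead of taking a search folder.

```python
import os
import tempfile


def next_zip_filename(folder, search_dir='.'):
    """Return the first '<basename>_N.zip' name not present in search_dir."""
    base = os.path.basename(os.path.abspath(folder))
    number = 1
    while True:
        zip_filename = base + '_' + str(number) + '.zip'
        if not os.path.exists(os.path.join(search_dir, zip_filename)):
            return zip_filename   # first unused name found
        number = number + 1


# Pretend two backups already exist in a throwaway directory.
backup_dir = tempfile.mkdtemp()
for existing in ['delicious_1.zip', 'delicious_2.zip']:
    open(os.path.join(backup_dir, existing), 'w').close()

next_name = next_zip_filename('delicious', search_dir=backup_dir)
print(next_name)
```

Note that the loop stops at the first gap it finds: if delicious_1.zip were deleted while delicious_2.zip remained, the next backup would be named delicious_1.zip again.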
Step 2: Create the New ZIP File
Next let’s create the ZIP file. Make your program look like the following:
#! python3
# backupToZip.py - Copies an entire folder and its contents into
# a ZIP file whose filename increments.

--snip--
    while True:
        zipFilename = os.path.basename(folder) + '_' + str(number) + '.zip'
        if not os.path.exists(zipFilename):
            break
        number = number + 1

    # Create the ZIP file.
    print('Creating %s...' % (zipFilename))
➊   backupZip = zipfile.ZipFile(zipFilename, 'w')

    # TODO: Walk the entire folder tree and compress the files in each folder.
    print('Done.')

backupToZip('C:\\delicious')