Automate the Boring Stuff with Python
Page 33
Find all the CSV files in the current working directory.
Read in the full contents of each file.
Write out the contents, skipping the first line, to a new CSV file.
At the code level, this means the program will need to do the following:
Loop over a list of files from os.listdir(), skipping the non-CSV files.
Create a CSV Reader object and read in the contents of the file, using the line_num attribute to figure out which line to skip.
Create a CSV Writer object and write out the read-in data to the new file.
For this project, open a new file editor window and save it as removeCsvHeader.py.
Step 1: Loop Through Each CSV File
The first thing your program needs to do is loop over a list of all CSV filenames for the current working directory. Make your removeCsvHeader.py look like this:
#! python3 # removeCsvHeader.py - Removes the header from all CSV files in the current # working directory. import csv, os os.makedirs('headerRemoved', exist_ok=True) # Loop through every file in the current working directory. for csvFilename in os.listdir('.'): if not csvFilename.endswith('.csv'): ➊ continue # skip non-csv files print('Removing header from ' + csvFilename + '...') # TODO: Read the CSV file in (skipping first row). # TODO: Write out the CSV file.
The os.makedirs() call will create a headerRemoved folder where all the headless CSV files will be written. A for loop on os.listdir('.') gets you partway there, but it will loop over all files in the working directory, so you’ll need to add some code at the start of the loop that skips filenames that don’t end with .csv. The continue statement ➊ makes the for loop move on to the next filename when it comes across a non-CSV file.
Just so there’s some output as the program runs, print out a message saying which CSV file the program is working on. Then, add some TODO comments for what the rest of the program should do.
Step 2: Read in the CSV File
The program doesn’t remove the first line from the CSV file. Rather, it creates a new copy of the CSV file without the first line. Since the copy’s filename is the same as the original filename, the copy will overwrite the original.
The program will need a way to track whether it is currently looping on the first row. Add the following to removeCsvHeader.py.
#! python3 # removeCsvHeader.py - Removes the header from all CSV files in the current # working directory. --snip-- # Read the CSV file in (skipping first row). csvRows = [] csvFileObj = open(csvFilename) readerObj = csv.reader(csvFileObj) for row in readerObj: if readerObj.line_num == 1: continue # skip first row csvRows.append(row) csvFileObj.close() # TODO: Write out the CSV file.
The Reader object’s line_num attribute can be used to determine which line in the CSV file it is currently reading. Another for loop will loop over the rows returned from the CSV Reader object, and all rows but the first will be appended to csvRows.
As the for loop iterates over each row, the code checks whether readerObj.line_num is set to 1. If so, it executes a continue to move on to the next row without appending it to csvRows. For every row afterward, the condition will be always be False, and the row will be appended to csvRows.
Step 3: Write Out the CSV File Without the First Row
Now that csvRows contains all rows but the first row, the list needs to be written out to a CSV file in the headerRemoved folder. Add the following to removeCsvHeader.py:
#! python3 # removeCsvHeader.py - Removes the header from all CSV files in the current # working directory. --snip-- # Loop through every file in the current working directory. ➊ for csvFilename in os.listdir('.'): if not csvFilename.endswith('.csv'): continue # skip non-CSV files --snip-- # Write out the CSV file. csvFileObj = open(os.path.join('headerRemoved', csvFilename), 'w', newline='') csvWriter = csv.writer(csvFileObj) for row in csvRows: csvWriter.writerow(row) csvFileObj.close()
The CSV Writer object will write the list to a CSV file in headerRemoved using csvFilename (which we also used in the CSV reader). This will overwrite the original file.
Once we create the Writer object, we loop over the sublists stored in csvRows and write each sublist to the file.
After the code is executed, the outer for loop ➊ will loop to the next filename from os.listdir('.'). When that loop is finished, the program will be complete.
To test your program, download removeCsvHeader.zip from http://nostarch.com/automatestuff/ and unzip it to a folder. Run the removeCsvHeader.py program in that folder. The output will look like this:
Removing header from NAICS_data_1048.csv... Removing header from NAICS_data_1218.csv... --snip-- Removing header from NAICS_data_9834.csv... Removing header from NAICS_data_9986.csv...
This program should print a filename each time it strips the first line from a CSV file.
Ideas for Similar Programs
The programs that you could write for CSV files are similar to the kinds you could write for Excel files, since they’re both spreadsheet files. You could write programs to do the following:
Compare data between different rows in a CSV file or between multiple CSV files.
Copy specific data from a CSV file to an Excel file, or vice versa.
Check for invalid data or formatting mistakes in CSV files and alert the user to these errors.
Read data from a CSV file as input for your Python programs.
JSON and APIs
JavaScript Object Notation is a popular way to format data as a single human-readable string. JSON is the native way that JavaScript programs write their data structures and usually resembles what Python’s pprint() function would produce. You don’t need to know JavaScript in order to work with JSON-formatted data.
Here’s an example of data formatted as JSON:
{"name": "Zophie", "isCat": true, "miceCaught": 0, "napsTaken": 37.5, "felineIQ": null}
JSON is useful to know, because many websites offer JSON content as a way for programs to interact with the website. This is known as providing an application programming interface (API). Accessing an API is the same as accessing any other web page via a URL. The difference is that the data returned by an API is formatted (with JSON, for example) for machines; APIs aren’t easy for people to read.
Many websites make their data available in JSON format. Facebook, Twitter, Yahoo, Google, Tumblr, Wikipedia, Flickr, Data.gov, Reddit, IMDb, Rotten Tomatoes, LinkedIn, and many other popular sites offer APIs for programs to use. Some of these sites require registration, which is almost always free. You’ll have to find documentation for what URLs your program needs to request in order to get the data you want, as well as the general format of the JSON data structures that are returned. This documentation should be provided by whatever site is offering the API; if they have a “Developers” page, look for the documentation there.
Using APIs, you could write programs that do the following:
Scrape raw data from websites. (Accessing APIs is often more convenient than downloading web pages and parsing HTML with Beautiful Soup.)
Automatically download new posts from one of your social network accounts and post them to another account. For example, you could take your Tumblr posts and post them to Facebook.
Create a “movie encyclopedia” for your personal movie collection by pulling data from IMDb, Rotten Tomatoes, and Wikipedia and putting it into a single text file on your computer.
You can see some examples of JSON APIs in the resources at http://nostarch.com/automatestuff/.
The JSON Module
Python’s json module handles all the details of translating between a string with JSON data and Python values for the json.loads() and json.dumps() functions. JSON can’t store every kind of Python value. It can contain values of only the following data types: strings, integers, floats, Booleans, lists, dictionaries, and NoneType. JSON cannot represent Python-specific objects, such as File objects, CSV Reader or Writer objects, Regex objects, or Selenium WebElement objects.
Reading JSON with the loads() Function
&nb
sp; To translate a string containing JSON data into a Python value, pass it to the json.loads() function. (The name means “load string,” not “loads.”) Enter the following into the interactive shell:
>>> stringOfJsonData = '{"name": "Zophie", "isCat": true, "miceCaught": 0, "felineIQ": null}' >>> import json >>> jsonDataAsPythonValue = json.loads(stringOfJsonData) >>> jsonDataAsPythonValue {'isCat': True, 'miceCaught': 0, 'name': 'Zophie', 'felineIQ': None}
After you import the json module, you can call loads() and pass it a string of JSON data. Note that JSON strings always use double quotes. It will return that data as a Python dictionary. Python dictionaries are not ordered, so the key-value pairs may appear in a different order when you print jsonDataAsPythonValue.
Writing JSON with the dumps() Function
The json.dumps() function (which means “dump string,” not “dumps”) will translate a Python value into a string of JSON-formatted data. Enter the following into the interactive shell:
>>> pythonValue = {'isCat': True, 'miceCaught': 0, 'name': 'Zophie', 'felineIQ': None} >>> import json >>> stringOfJsonData = json.dumps(pythonValue) >>> stringOfJsonData '{"isCat": true, "felineIQ": null, "miceCaught": 0, "name": "Zophie" }'
The value can only be one of the following basic Python data types: dictionary, list, integer, float, string, Boolean, or None.
Project: Fetching Current Weather Data
Checking the weather seems fairly trivial: Open your web browser, click the address bar, type the URL to a weather website (or search for one and then click the link), wait for the page to load, look past all the ads, and so on.
Actually, there are a lot of boring steps you could skip if you had a program that downloaded the weather forecast for the next few days and printed it as plaintext. This program uses the requests module from Chapter 11 to download data from the Web.
Overall, the program does the following:
Reads the requested location from the command line.
Downloads JSON weather data from OpenWeatherMap.org.
Converts the string of JSON data to a Python data structure.
Prints the weather for today and the next two days.
So the code will need to do the following:
Join strings in sys.argv to get the location.
Call requests.get() to download the weather data.
Call json.loads() to convert the JSON data to a Python data structure.
Print the weather forecast.
For this project, open a new file editor window and save it as quickWeather.py.
Step 1: Get Location from the Command Line Argument
The input for this program will come from the command line. Make quickWeather.py look like this:
#! python3 # quickWeather.py - Prints the weather for a location from the command line. import json, requests, sys # Compute location from command line arguments. if len(sys.argv) < 2: print('Usage: quickWeather.py location') sys.exit() location = ' '.join(sys.argv[1:]) # TODO: Download the JSON data from OpenWeatherMap.org's API. # TODO: Load JSON data into a Python variable.
In Python, command line arguments are stored in the sys.argv list. After the #! shebang line and import statements, the program will check that there is more than one command line argument. (Recall that sys.argv will always have at least one element, sys.argv[0], which contains the Python script’s filename.) If there is only one element in the list, then the user didn’t provide a location on the command line, and a “usage” message will be provided to the user before the program ends.
Command line arguments are split on spaces. The command line argument San Francisco, CA would make sys.argv hold ['quickWeather.py', 'San', 'Francisco,', 'CA']. Therefore, call the join() method to join all the strings except for the first in sys.argv. Store this joined string in a variable named location.
Step 2: Download the JSON Data
OpenWeatherMap.org provides real-time weather information in JSON format. Your program simply has to download the page at http://api.openweathermap.org/data/2.5/forecast/daily?q=
#! python3 # quickWeather.py - Prints the weather for a location from the command line. --snip-- # Download the JSON data from OpenWeatherMap.org's API. url ='http://api.openweathermap.org/data/2.5/forecast/daily?q=%s&cnt=3' % (location) response = requests.get(url) response.raise_for_status() # TODO: Load JSON data into a Python variable.
We have location from our command line arguments. To make the URL we want to access, we use the %s placeholder and insert whatever string is stored in location into that spot in the URL string. We store the result in url and pass url to requests.get(). The requests.get() call returns a Response object, which you can check for errors by calling raise_for_status(). If no exception is raised, the downloaded text will be in response.text.
Step 3: Load JSON Data and Print Weather
The response.text member variable holds a large string of JSON-formatted data. To convert this to a Python value, call the json.loads() function. The JSON data will look something like this:
{'city': {'coord': {'lat': 37.7771, 'lon': -122.42}, 'country': 'United States of America', 'id': '5391959', 'name': 'San Francisco', 'population': 0}, 'cnt': 3, 'cod': '200', 'list': [{'clouds': 0, 'deg': 233, 'dt': 1402344000, 'humidity': 58, 'pressure': 1012.23, 'speed': 1.96, 'temp': {'day': 302.29, 'eve': 296.46, 'max': 302.29, 'min': 289.77, 'morn': 294.59, 'night': 289.77}, 'weather': [{'description': 'sky is clear', 'icon': '01d', --snip--
You can see this data by passing weatherData to pprint.pprint(). You may want to check http://openweathermap.org/ for more documentation on what these fields mean. For example, the online documentation will tell you that the 302.29 after 'day' is the daytime temperature in Kelvin, not Celsius or Fahrenheit.
The weather descriptions you want are after 'main' and 'description'. To neatly print them out, add the following to quickWeather.py.
! python3 # quickWeather.py - Prints the weather for a location from the command line. --snip-- # Load JSON data into a Python variable. weatherData = json.loads(response.text) # Print weather descriptions. ➊ w = weatherData['list'] print('Current weather in %s:' % (location)) print(w[0]['weather'][0]['main'], '-', w[0]['weather'][0]['description']) print() print('Tomorrow:') print(w[1]['weather'][0]['main'], '-', w[1]['weather'][0]['description']) print() print('Day after tomorrow:') print(w[2]['weather'][0]['main'], '-', w[2]['weather'][0]['description'])
Notice how the code stores weatherData['list'] in the variable w to save you some typing ➊. You use w[0], w[1], and w[2] to retrieve the dictionaries for today, tomorrow, and the day after tomorrow’s weather, respectively. Each of these dictionaries has a 'weather' key, which contains a list value. You’re interested in the first list item, a nested dictionary with several more keys, at index 0. Here, we print the values stored in the 'main' and 'description' keys, separated by a hyphen.
When this program is run with the command line argument quickWeather.py San Francisco, CA, the output looks something like this:
Current weather in San Francisco, CA: Clear - sky is clear Tomorrow: Clouds - few clouds Day after tomorrow: Clear - sky is clear
(The weather is one of the reasons I like living in San Francisco!)
Ideas for Similar Programs
Accessing weather data can form the basis for many types of programs. You can create similar programs to do the following:
Collect weather forecasts for several campsites or hiking trails to see which one will have the best weather.
Schedule a program to regularly check the weather and send you a frost alert if you need to move your plants indoors. (Chapter 15 covers scheduling, and Chapter 16 explains how to send email.)
Pull weather data from multiple sites to show all at once, or calculate and show the average of the multiple weather predictions.
Summary
CSV and JSON are common plaintext formats for storing data. They are easy for programs t
o parse while still being human readable, so they are often used for simple spreadsheets or web app data. The csv and json modules greatly simplify the process of reading and writing to CSV and JSON files.
The last few chapters have taught you how to use Python to parse information from a wide variety of file formats. One common task is taking data from a variety of formats and parsing it for the particular information you need. These tasks are often specific to the point that commercial software is not optimally helpful. By writing your own scripts, you can make the computer handle large amounts of data presented in these formats.
In Chapter 15, you’ll break away from data formats and learn how to make your programs communicate with you by sending emails and text messages.
Practice Questions
Q:
1. What are some features Excel spreadsheets have that CSV spreadsheets don’t?
Q: