There was a problem: 404 Client Error: Not Found
Always call raise_for_status() after calling requests.get(). You want to be sure that the download has actually worked before your program continues.
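If you don't want a failed download to crash your program, you can wrap the raise_for_status() line in try and except statements. An error message like the one at the start of this section comes from code along these lines (a sketch, using a URL assumed not to exist):

import requests

res = requests.get('https://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

This way, the program prints the problem and keeps running instead of stopping with a traceback.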
Saving Downloaded Files to the Hard Drive
From here, you can save the web page to a file on your hard drive with the standard open() function and write() method. There are some slight differences, though. First, you must open the file in write binary mode by passing the string 'wb' as the second argument to open(). Even if the page is in plaintext (such as the Romeo and Juliet text you downloaded earlier), you need to write binary data instead of text data in order to maintain the Unicode encoding of the text.
Unicode Encodings
Unicode encodings are beyond the scope of this book, but you can learn more about them from these web pages:
Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): http://www.joelonsoftware.com/articles/Unicode.html
Pragmatic Unicode: http://nedbatchelder.com/text/unipain.html
To write the web page to a file, you can use a for loop with the Response object’s iter_content() method.
>>> import requests
>>> res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
>>> res.raise_for_status()
>>> playFile = open('RomeoAndJuliet.txt', 'wb')
>>> for chunk in res.iter_content(100000):
        playFile.write(chunk)

100000
78981
>>> playFile.close()
The iter_content() method returns “chunks” of the content on each iteration through the loop. Each chunk is of the bytes data type, and you get to specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size, so pass 100000 as the argument to iter_content().
The file RomeoAndJuliet.txt will now exist in the current working directory. Note that while the filename on the website was rj.txt, the file on your hard drive has a different filename. The requests module simply handles downloading the contents of web pages. Once the page is downloaded, it is simply data in your program. Even if you were to lose your Internet connection after downloading the web page, all the page data would still be on your computer.
The write() method returns the number of bytes written to the file. In the previous example, there were 100,000 bytes in the first chunk, and the remaining part of the file needed only 78,981 bytes.
To review, here's the complete process for downloading and saving a file (a consolidated script sketch follows the list):
Call requests.get() to download the file.
Call open() with 'wb' to create a new file in write binary mode.
Loop over the Response object’s iter_content() method.
Call write() on each iteration to write the content to the file.
Call close() to close the file.
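Putting those steps together, a consolidated sketch of the whole download looks like this (using the same Romeo and Juliet URL as before):

import requests

url = 'https://automatetheboringstuff.com/files/rj.txt'
res = requests.get(url)                        # 1. Download the file.
res.raise_for_status()                         #    Stop if the download failed.
playFile = open('RomeoAndJuliet.txt', 'wb')    # 2. Open the file in write binary mode.
for chunk in res.iter_content(100000):         # 3. Loop over 100,000-byte chunks.
    playFile.write(chunk)                      # 4. Write each chunk to the file.
playFile.close()                               # 5. Close the file.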
That’s all there is to the requests module! The for loop and iter_content() stuff may seem complicated compared to the open()/write()/close() workflow you’ve been using to write text files, but it’s to ensure that the requests module doesn’t eat up too much memory even if you download massive files. You can learn about the requests module’s other features from http://requests.readthedocs.org/.
HTML
Before you pick apart web pages, you’ll learn some HTML basics. You’ll also see how to access your web browser’s powerful developer tools, which will make scraping information from the Web much easier.
Resources for Learning HTML
Hypertext Markup Language (HTML) is the format that web pages are written in. This chapter assumes you have some basic experience with HTML, but if you need a beginner tutorial, I suggest one of the following sites:
http://htmldog.com/guides/html/beginner/
http://www.codecademy.com/tracks/web/
https://developer.mozilla.org/en-US/learn/html/
A Quick Refresher
In case it’s been a while since you’ve looked at any HTML, here’s a quick overview of the basics. An HTML file is a plaintext file with the .html file extension. The text in these files is surrounded by tags, which are words enclosed in angle brackets. The tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags. For example, the following HTML will display Hello world! in the browser, with Hello in bold:
<strong>Hello</strong> world!
This HTML will look like Figure 11-1 in a browser.
Figure 11-1. Hello world! rendered in the browser
The opening <strong> tag says that the enclosed text will appear in bold. The closing </strong> tag tells the browser where the end of the bold text is.
There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the angle brackets. For example, the <a> tag encloses text that should be a link. The URL that the text links to is determined by the href attribute. Here's an example:
Al's free <a href="http://inventwithpython.com">Python books</a>.
This HTML will look like Figure 11-2 in a browser.
Figure 11-2. The link rendered in the browser
Some elements have an id attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element’s id attribute using the browser’s developer tools is a common task in writing web scraping programs.
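For example, an element such as the following (an illustrative snippet) carries an id that no other element on the page should share, which makes it easy to target:

<span id="author">Al Sweigart</span>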
Viewing the Source HTML of a Web Page
You’ll need to look at the HTML source of the web pages that your programs will work with. To do this, right-click (or CTRL-click on OS X) any web page in your web browser, and select View Source or View page source to see the HTML text of the page (see Figure 11-3). This is the text your browser actually receives. The browser knows how to display, or render, the web page from this HTML.
Figure 11-3. Viewing the source of a web page
I highly recommend viewing the source HTML of some of your favorite sites. It’s fine if you don’t fully understand what you are seeing when you look at the source. You won’t need HTML mastery to write simple web scraping programs—after all, you won’t be writing your own websites. You just need enough knowledge to pick out data from an existing site.
Opening Your Browser’s Developer Tools
In addition to viewing a web page's source, you can look through a page's HTML using your browser's developer tools. In Chrome and Internet Explorer for Windows, the developer tools are already installed, and you can press F12 to make them appear (see Figure 11-4). Pressing F12 again will make the developer tools disappear. In Chrome, you can also bring up the developer tools by selecting View▸Developer▸Developer Tools. In OS X, pressing ⌘-OPTION-I will open Chrome's Developer Tools.
Figure 11-4. The Developer Tools window in the Chrome browser
In Firefox, you can bring up the Web Developer Tools Inspector by pressing CTRL-SHIFT-C on Windows and Linux or by pressing ⌘-OPTION-C on OS X. The layout is almost identical to Chrome’s developer tools.
In Safari, open the Preferences window, and on the Advanced pane check the Show Develop menu in the menu bar option. After it has been enabled, you can bring up the developer tools by pressing ⌘-OPTION-I.
After enabling or installing the developer tools in your browser, you can right-click any part of the web page and select Inspect Element from the context menu to bring up the HTML responsible for that part of the page. This will be helpful when you begin to parse HTML for your web scraping programs.
Don’t Use Regular Expressions to Parse HTML
Locating a specific piece of HTML in a string seems like a perfect case for regular expressions. However, I advise you against it. There are many different ways that HTML can be formatted and still be considered valid HTML, but trying to capture all these possible variations in a regular expression can be tedious and error prone. A module developed specifically for parsing HTML, such as Beautiful Soup, will be less likely to result in bugs.
You can find an extended argument for why you shouldn't parse HTML with regular expressions at http://stackoverflow.com/a/1732454/1893164/.
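To see the problem concretely, consider two snippets that a browser treats identically (hypothetical examples). A hand-written regular expression copes with one but not the other, while Beautiful Soup normalizes both:

import re
import bs4

html1 = '<a href="http://example.com">click</a>'
html2 = "<a  HREF='http://example.com' >click</a>"

# A naive regular expression matches only the first form.
pattern = re.compile(r'<a href="([^"]*)">')
print(bool(pattern.search(html1)))  # True
print(bool(pattern.search(html2)))  # False: spacing, case, and quote style break it

# Beautiful Soup handles both variations.
for html in (html1, html2):
    soup = bs4.BeautifulSoup(html, 'html.parser')
    print(soup.a.get('href'))       # 'http://example.com' both times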
Using the Developer Tools to Find HTML Elements
Once your program has downloaded a web page using the requests module, you will have the page’s HTML content as a single string value. Now you need to figure out which part of the HTML corresponds to the information on the web page you’re interested in.
This is where the browser’s developer tools can help. Say you want to write a program to pull weather forecast data from http://weather.gov/. Before writing any code, do a little research. If you visit the site and search for the 94105 ZIP code, the site will take you to a page showing the forecast for that area.
What if you’re interested in scraping the temperature information for that ZIP code? Right-click where it is on the page (or CONTROL-click on OS X) and select Inspect Element from the context menu that appears. This will bring up the Developer Tools window, which shows you the HTML that produces this particular part of the web page. Figure 11-5 shows the developer tools open to the HTML of the temperature.
Figure 11-5. Inspecting the element that holds the temperature text with the developer tools
From the developer tools, you can see that the HTML responsible for the temperature part of the web page is <p class="myforecast-current-lrg">59°F</p>. This is exactly what you were looking for! It seems that the temperature information is contained inside a <p> element with the myforecast-current-lrg class. Now that you know what you're looking for, the BeautifulSoup module will help you find it in the string.
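As a preview of where this is heading, a sketch of that lookup might read as follows. (The search URL and the myforecast-current-lrg class name are assumptions about how weather.gov is currently structured; either could change.)

import requests, bs4

# Assumption: this URL pattern returns the forecast page for a ZIP code.
res = requests.get('https://forecast.weather.gov/zipcity.php?inputstring=94105')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('.myforecast-current-lrg')  # assumed class for the temperature
if elems:
    print('Temperature:', elems[0].getText())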
Parsing HTML with the BeautifulSoup Module
Beautiful Soup is a module for extracting information from an HTML page (and is much better for this purpose than regular expressions). The BeautifulSoup module’s name is bs4 (for Beautiful Soup, version 4). To install it, you will need to run pip install beautifulsoup4 from the command line. (Check out Appendix A for instructions on installing third-party modules.) While beautifulsoup4 is the name used for installation, to import Beautiful Soup you run import bs4.
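To confirm the installation worked, you can import the module and print its version string (a quick sketch; bs4 stores its version in the __version__ attribute):

import bs4

print(bs4.__version__)  # prints the installed version, such as '4.3.2'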
For this chapter, the Beautiful Soup examples will parse (that is, analyze and identify the parts of) an HTML file on the hard drive. Open a new file editor window in IDLE, enter the following, and save it as example.html. Alternatively, download it from http://nostarch.com/automatestuff/.
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
As you can see, even a simple HTML file involves many different tags and attributes, and matters quickly get confusing with complex websites. Thankfully, Beautiful Soup makes working with HTML much easier.
Creating a BeautifulSoup Object from HTML
The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse. The bs4.BeautifulSoup() function returns a BeautifulSoup object. Enter the following into the interactive shell while your computer is connected to the Internet:
>>> import requests, bs4
>>> res = requests.get('http://nostarch.com')
>>> res.raise_for_status()
>>> noStarchSoup = bs4.BeautifulSoup(res.text)
>>> type(noStarchSoup)
<class 'bs4.BeautifulSoup'>
This code uses requests.get() to download the main page from the No Starch Press website and then passes the text attribute of the response to bs4.BeautifulSoup(). The BeautifulSoup object that it returns is stored in a variable named noStarchSoup.
You can also load an HTML file from your hard drive by passing a File object to bs4.BeautifulSoup(). Enter the following into the interactive shell (make sure the example.html file is in the working directory):
>>> exampleFile = open('example.html')
>>> exampleSoup = bs4.BeautifulSoup(exampleFile)
>>> type(exampleSoup)
<class 'bs4.BeautifulSoup'>
Once you have a BeautifulSoup object, you can use its methods to locate specific parts of an HTML document.
Finding an Element with the select() Method
You can retrieve a web page element from a BeautifulSoup object by calling the select() method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings.
A full discussion of CSS selector syntax is beyond the scope of this book (there’s a good selector tutorial in the resources at http://nostarch.com/automatestuff/), but here’s a short introduction to selectors. Table 11-2 shows examples of the most common CSS selector patterns.
Table 11-2. Examples of CSS Selectors
Selector passed to the select() method   Will match...
soup.select('div')                       All elements named <div>
soup.select('#author')                   The element with an id attribute of author
soup.select('.notice')                   All elements that use a CSS class attribute named notice
soup.select('div span')                  All elements named <span> that are within an element named <div>
soup.select('div > span')                All elements named <span> that are directly within an element named <div>
soup.select('input[name]')               All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]')      All elements named <input> that have an attribute named type with value button
The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element.
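As a quick illustration, here is a sketch of a combined selector run against the example.html file saved earlier:

import bs4

soup = bs4.BeautifulSoup(open('example.html').read(), 'html.parser')
# 'p span' matches every <span> element nested inside a <p> element.
for elem in soup.select('p span'):
    print(str(elem))  # prints '<span id="author">Al Sweigart</span>'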
The select() method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary. Using the example.html file from earlier, enter the following into the interactive shell:
>>> import bs4
>>> exampleFile = open('example.html')
>>> exampleSoup = bs4.BeautifulSoup(exampleFile.read())
>>> elems = exampleSoup.select('#author')
>>> type(elems)
<class 'list'>
>>> len(elems)
1
>>> type(elems[0])
<class 'bs4.element.Tag'>
>>> elems[0].getText()
'Al Sweigart'
>>> str(elems[0])
'<span id="author">Al Sweigart</span>'
>>> elems[0].attrs
{'id': 'author'}
This code will pull the element with id="author" out of our example HTML. We use select('#author') to return a list of all the elements with id="author". We store this list of Tag objects in the variable elems, and len(elems) tells us there is one Tag object in the list; there was one match. Calling getText() on the element returns the element’s text, or inner HTML. The text of an element is the content between the opening and closing tags: in this case, 'Al Sweigart'.
Passing the element to str() returns a string with the starting and closing tags and the element’s text. Finally, attrs gives us a dictionary with the element’s attribute, 'id', and the value of the id attribute, 'author'.
You can also pull all the <p> elements from the BeautifulSoup object. Enter this into the interactive shell:

>>> pElems = exampleSoup.select('p')
>>> str(pElems[0])
'<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>'
>>> pElems[0].getText()
'Download my Python book from my website.'
>>> str(pElems[1])
'<p class="slogan">Learn Python the easy way!</p>'
>>> pElems[1].getText()
'Learn Python the easy way!'
>>> str(pElems[2])
'<p>By <span id="author">Al Sweigart</span></p>'
>>> pElems[2].getText()
'By Al Sweigart'

This time, select() gives us a list of three matches, which we store in pElems. Using str() on pElems[0], pElems[1], and pElems[2] shows you each element as a string, and using getText() on each element shows you its text.
Getting Data from an Element’s Attributes
The get() method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value. Using example.html, enter the following into the interactive shell:
>>> import bs4
>>> soup = bs4.BeautifulSoup(open('example.html'))
>>> spanElem = soup.select('span')[0]
>>> str(spanElem)
'<span id="author">Al Sweigart</span>'
>>> spanElem.get('id')
'author'
>>> spanElem.get('some_nonexistent_addr') == None
True
>>> spanElem.attrs
{'id': 'author'}
Here we use select() to find any <span> elements and then store the first matched element in spanElem. Passing the attribute name 'id' to get() returns the attribute's value, 'author'.
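The same approach works for any attribute. For instance, a short sketch that reads the href attribute from the link in example.html:

import bs4

soup = bs4.BeautifulSoup(open('example.html').read(), 'html.parser')
linkElem = soup.select('a')[0]
print(linkElem.get('href'))  # prints 'http://inventwithpython.com'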
Project: “I’m Feeling Lucky” Google Search
Whenever I search a topic on Google, I don’t look at just one search result at a time. By middle-clicking a search result link (or clicking while holding CTRL), I open the first several links in a bunch of new tabs to read later. I search Google often enough that this workflow—opening my browser, searching for a topic, and middle-clicking several links one by one—is tedious. It would be nice if I could simply type a search term on the command line and have my computer automatically open a browser with all the top search results in new tabs. Let’s write a script to do this.