by Al Sweigart
This is what your program does:
Gets search keywords from the command line arguments.
Retrieves the search results page.
Opens a browser tab for each result.
This means your code will need to do the following:
Read the command line arguments from sys.argv.
Fetch the search result page with the requests module.
Find the links to each search result.
Call the webbrowser.open() function to open the web browser.
Open a new file editor window and save it as lucky.py.
Step 1: Get the Command Line Arguments and Request the Search Page
Before coding anything, you first need to know the URL of the search result page. By looking at the browser’s address bar after doing a Google search, you can see that the result page has a URL like https://www.google.com/search?q=SEARCH_TERM_HERE. The requests module can download this page and then you can use Beautiful Soup to find the search result links in the HTML. Finally, you’ll use the webbrowser module to open those links in browser tabs.
Make your code look like the following:
#! python3
# lucky.py - Opens several Google search results.

import requests, sys, webbrowser, bs4

print('Googling...')    # display text while downloading the Google page
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

# TODO: Retrieve top search result links.

# TODO: Open a browser tab for each result.
The user will specify the search terms using command line arguments when they launch the program. These arguments will be stored as strings in a list in sys.argv.
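For example, if you launched the program as lucky beautiful soup (an illustrative invocation), sys.argv would hold the script name followed by each argument. Here is a quick interactive sketch of what the program's code then does with that list:

>>> import sys
>>> sys.argv = ['lucky.py', 'beautiful', 'soup']    # simulating the command line arguments
>>> ' '.join(sys.argv[1:])    # skip sys.argv[0], the script's own name
'beautiful soup'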
Step 2: Find All the Results
Now you need to use Beautiful Soup to extract the top search result links from your downloaded HTML. But how do you figure out the right selector for the job? For example, you can’t just search for all <a> tags, because there are lots of links you don’t care about in the HTML. Instead, you must inspect the search result page with the browser’s developer tools to try to find a selector that will pick out only the links you want.
After doing a Google search for Beautiful Soup, you can open the browser’s developer tools and inspect some of the link elements on the page. They look incredibly complicated, something like this (with the long attribute values abbreviated): <a href="/url?..."><em>Beautiful Soup</em>: We called him Tortoise because he taught us.</a>
It doesn’t matter that the element looks incredibly complicated. You just need to find the pattern that all the search result links have. But this <a> element doesn’t have anything that easily distinguishes it from the nonsearch result <a> elements on the page.
Make your code look like the following:
#! python3
# lucky.py - Opens several Google search results.

import requests, sys, webbrowser, bs4

--snip--

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text)

# Open a browser tab for each result.
linkElems = soup.select('.r a')
If you look up a little from the <a> element, though, there is an element like this: <h3 class="r">. Looking through the rest of the HTML source, it looks like the r class is used only for search result links. You don’t have to know what the CSS class r is or what it does. You’re just going to use it as a marker for the <a> element you are looking for. You can create a BeautifulSoup object from the downloaded page’s HTML text and then use the selector '.r a' to find all <a> elements that are within an element that has the r CSS class.
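One small caveat: depending on your installed version of Beautiful Soup, bs4.BeautifulSoup(res.text) may print a warning about no parser being explicitly specified. If that happens, you can silence it by naming a parser yourself (a minor addition to the code above; 'html.parser' comes with Python):

soup = bs4.BeautifulSoup(res.text, 'html.parser')    # explicit parser avoids the warning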
Step 3: Open Web Browsers for Each Result
Finally, we’ll tell the program to open web browser tabs for our results. Add the following to the end of your program:
#! python3
# lucky.py - Opens several Google search results.

import requests, sys, webbrowser, bs4

--snip--

# Open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
By default, you open the first five search results in new tabs using the webbrowser module. However, the user may have searched for something that turned up fewer than five results. The soup.select() call returns a list of all the elements that matched your '.r a' selector, so the number of tabs you want to open is either 5 or the length of this list (whichever is smaller).
The built-in Python function min() returns the smallest of the integer or float arguments it is passed. (There is also a built-in max() function that returns the largest argument it is passed.) You can use min() to find out whether there are fewer than five links in the list and store the number of links to open in a variable named numOpen. Then you can run through a for loop by calling range(numOpen).
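For example, here’s min() and max() in the interactive shell:

>>> min(5, 12)
5
>>> min(5, 3)    # only three links matched, so open only three tabs
3
>>> max(5, 3)
5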
On each iteration of the loop, you use webbrowser.open() to open a new tab in the web browser. Note that the href attribute’s value in the returned <a> elements does not have the initial http://google.com part, so you have to concatenate that to the href attribute’s string value.
Now you can instantly open the first five Google results for, say, Python programming tutorials by running lucky python programming tutorials on the command line! (See Appendix B for how to easily run programs on your operating system.)
Ideas for Similar Programs
The benefit of tabbed browsing is that you can easily open links in new tabs to peruse later. A program that automatically opens several links at once can be a nice shortcut to do the following:
Open all the product pages after searching a shopping site such as Amazon
Open all the links to reviews for a single product
Open the result links to photos after performing a search on a photo site such as Flickr or Imgur
Project: Downloading All XKCD Comics
Blogs and other regularly updating websites usually have a front page with the most recent post as well as a Previous button on the page that takes you to the previous post. Then that post will also have a Previous button, and so on, creating a trail from the most recent page to the first post on the site. If you wanted a copy of the site’s content to read when you’re not online, you could manually navigate to every page and save each one. But this is pretty boring work, so let’s write a program to do it instead.
XKCD is a popular geek webcomic with a website that fits this structure (see Figure 11-6). The front page at http://xkcd.com/ has a Prev button that guides the user back through prior comics. Downloading each comic by hand would take forever, but you can write a script to do this in a couple of minutes.
Here’s what your program does:
Loads the XKCD home page.
Saves the comic image on that page.
Follows the Previous Comic link.
Repeats until it reaches the first comic.
Figure 11-6. XKCD, “a webcomic of romance, sarcasm, math, and language”
This means your code will need to do the following:
Download pages with the requests module.
Find the URL of the comic image for a page using Beautiful Soup.
Download and save the comic image to the hard drive with iter_content().
Find the URL of the Previous Comic link, and repeat.
Open a new file editor window and save it as downloadXkcd.py.
Step 1: Design the Program
If you open the browser’s developer tools and inspect the elements on the page, you’ll find the following:
The URL of the comic’s image file is given by the src attribute of an <img> element.
The <img> element is inside a <div> element with the id attribute set to comic.
The Prev button has a rel HTML attribute with the value prev.
The first comic’s Prev button links to the http://xkcd.com/# URL, indicating that there are no more previous pages.
Make your code look like the following:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'http://xkcd.com'               # starting url
os.makedirs('xkcd', exist_ok=True)    # store comics in ./xkcd
while not url.endswith('#'):
    # TODO: Download the page.

    # TODO: Find the URL of the comic image.

    # TODO: Download the image.

    # TODO: Save the image to ./xkcd.

    # TODO: Get the Prev button's url.

print('Done.')
You’ll have a url variable that starts with the value 'http://xkcd.com' and repeatedly update it (in a while loop) with the URL of the current page’s Prev link. At every step in the loop, you’ll download the comic at url. You’ll know to end the loop when url ends with '#'.
You will download the image files to a folder in the current working directory named xkcd. The call os.makedirs() ensures that this folder exists, and the exist_ok=True keyword argument prevents the function from throwing an exception if this folder already exists. The rest of the code is just comments that outline the rest of your program.
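You can see the effect of exist_ok in the interactive shell; note that the exact FileExistsError message will vary by operating system:

>>> import os
>>> os.makedirs('xkcd', exist_ok=True)    # creates the folder
>>> os.makedirs('xkcd', exist_ok=True)    # no error the second time
>>> os.makedirs('xkcd')                   # without exist_ok, this raises an exception
Traceback (most recent call last):
  ...
FileExistsError: ... 'xkcd'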
Step 2: Download the Web Page
Let’s implement the code for downloading the page. Make your code look like the following:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'http://xkcd.com'               # starting url
os.makedirs('xkcd', exist_ok=True)    # store comics in ./xkcd
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text)

    # TODO: Find the URL of the comic image.

    # TODO: Download the image.

    # TODO: Save the image to ./xkcd.

    # TODO: Get the Prev button's url.

print('Done.')
First, print url so that the user knows which URL the program is about to download; then use the requests module’s requests.get() function to download it. As always, you immediately call the Response object’s raise_for_status() method to throw an exception and end the program if something went wrong with the download. Otherwise, you create a BeautifulSoup object from the text of the downloaded page.
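As a quick reminder of what raise_for_status() buys you, here is the kind of exception it raises when a download fails; the nonexistent URL here is purely hypothetical:

>>> import requests
>>> res = requests.get('http://xkcd.com/no_such_page/')    # hypothetical bad URL
>>> res.raise_for_status()
Traceback (most recent call last):
  ...
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://xkcd.com/no_such_page/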
Step 3: Find and Download the Comic Image
Make your code look like the following:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

--snip--

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = 'http:' + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

    # TODO: Save the image to ./xkcd.

    # TODO: Get the Prev button's url.

print('Done.')
From inspecting the XKCD home page with your developer tools, you know that the <img> element for the comic image is inside a <div> element with the id attribute set to comic, so the selector '#comic img' will get you the correct <img> element from the BeautifulSoup object.
A few XKCD pages have special content that isn’t a simple image file. That’s fine; you’ll just skip those. If your selector doesn’t find any elements, then soup.select('#comic img') will return a blank list. When that happens, the program can just print an error message and move on without downloading the image.
Otherwise, the selector will return a list containing one <img> element. You can get the src attribute from this <img> element and pass it to requests.get() to download the comic’s image file.
Step 4: Save the Image and Find the Previous Comic
Make your code look like the following:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

--snip--

    # Save the image to ./xkcd.
    imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)
    imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')

print('Done.')
At this point, the image file of the comic is stored in the res variable. You need to write this image data to a file on the hard drive.
You’ll need a filename for the local image file to pass to open(). The comicUrl will have a value like 'http://imgs.xkcd.com/comics/heartbleed_explanation.png', which you might have noticed looks a lot like a file path. And in fact, you can call os.path.basename() with comicUrl, and it will return just the last part of the URL, 'heartbleed_explanation.png'. You can use this as the filename when saving the image to your hard drive. You join this name with the name of your xkcd folder using os.path.join() so that your program uses backslashes (\) on Windows and forward slashes (/) on OS X and Linux. Now that you finally have the filename, you can call open() to open a new file in 'wb' “write binary” mode.
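Here is that filename logic in the interactive shell; on OS X and Linux, os.path.join() would produce 'xkcd/heartbleed_explanation.png' instead:

>>> import os
>>> comicUrl = 'http://imgs.xkcd.com/comics/heartbleed_explanation.png'
>>> os.path.basename(comicUrl)
'heartbleed_explanation.png'
>>> os.path.join('xkcd', os.path.basename(comicUrl))    # on Windows
'xkcd\\heartbleed_explanation.png'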
Remember from earlier in this chapter that to save files you’ve downloaded using Requests, you need to loop over the return value of the iter_content() method. The code in the for loop writes out chunks of the image data (at most 100,000 bytes each) to the file and then you close the file. The image is now saved to your hard drive.
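Incidentally, a with statement could close the file for you automatically; this is just a stylistic alternative to the open() and close() calls above, not a change to the program’s behavior:

with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)    # write each chunk of at most 100,000 bytes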
Afterward, the selector 'a[rel="prev"]' identifies the <a> element with the rel attribute set to prev, and you can use this <a> element’s href attribute to get the previous comic’s URL, which gets stored in url. Then the while loop begins the entire download process again for this comic.
The output of this program will look like this:
Downloading page http://xkcd.com...
Downloading image http://imgs.xkcd.com/comics/phone_alarm.png...
Downloading page http://xkcd.com/1358/...
Downloading image http://imgs.xkcd.com/comics/nro.png...
Downloading page http://xkcd.com/1357/...
Downloading image http://imgs.xkcd.com/comics/free_speech.png...
Downloading page http://xkcd.com/1356/...
Downloading image http://imgs.xkcd.com/comics/orbital_mechanics.png...
Downloading page http://xkcd.com/1355/...
Downloading image http://imgs.xkcd.com/comics/airplane_message.png...
Downloading page http://xkcd.com/1354/...
Downloading image http://imgs.xkcd.com/comics/heartbleed_explanation.png...
--snip--
This project is a good example of a program that can automatically follow links in order to scrape large amounts of data from the Web. You can learn about Beautiful Soup’s other features from its documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Ideas for Similar Programs
Downloading pages and following links are the basis of many web crawling programs. Similar programs could also do the following:
Back up an entire site by following all of its links.
Copy all the messages off a web forum.
Duplicate the catalog of items for sale on an online store.
The requests and BeautifulSoup modules are great as long as you can figure out the URL you need to pass to requests.get(). However, sometimes this isn’t so easy to find. Or perhaps the website you want your program to navigate requires you to log in first. The selenium module will give your programs the power to perform such sophisticated tasks.
Controlling the Browser with the selenium Module
The selenium module lets Python directly control the browser by programmatically clicking links and filling in login information, almost as though there is a human user interacting with the page. Selenium allows you to interact with web pages in a much more advanced way than Requests and Beautiful Soup; but because it launches a web browser, it is a bit slower and hard to run in the background if, say, you just need to download some files from the Web.
Appendix A has more detailed steps on installing third-party modules.
Starting a Selenium-Controlled Browser
For these examples, you’ll need the Firefox web browser. This will be the browser that you control. If you don’t already have Firefox, you can download it for free from http://getfirefox.com/.
Importing the modules for Selenium is slightly tricky. Instead of import selenium, you need to run from selenium import webdriver. (The exact reason why the selenium module is set up this way is beyond the scope of this book.) After that, you can launch the Firefox browser with Selenium. Enter the following into the interactive shell:
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>
>>> browser.get('http://inventwithpython.com')
You’ll notice that when webdriver.Firefox() is called, the Firefox web browser starts up. Calling type() on browser reveals it’s of the WebDriver data type. And calling browser.get('http://inventwithpython.com') directs the browser to http://inventwithpython.com/. Your browser should look something like Figure 11-7.
Figure 11-7. After calling webdriver.Firefox() and get() in IDLE, the Firefox browser appears.