Python Web Scraping Tutorial
In this tutorial, you will learn about web scraping and how to extract data from websites using Python and Beautiful Soup. The incredible amount of data on the Internet is a rich resource for any field of research or personal interest, and to effectively harvest that data, you'll need to become skilled at web scraping. The Python libraries requests and Beautiful Soup are powerful tools for the job. If you like to learn with hands-on examples and you have a basic understanding of Python and HTML, this tutorial is for you.
In this chapter, let us learn how to perform web scraping on dynamic websites and the concepts involved in detail.
Introduction
Web scraping is a complex task, and the complexity multiplies if the website is dynamic. According to the United Nations Global Audit of Web Accessibility, more than 70% of websites are dynamic in nature and rely on JavaScript for their functionality.
Dynamic Website Example
Let us look at an example of a dynamic website and understand why it is difficult to scrape. Here we will take the example of searching on a website named http://example.webscraping.com/places/default/search. But how can we tell that this website is dynamic in nature? We can judge from the output of the following Python script, which tries to scrape data from the above-mentioned webpage −
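A minimal sketch of such a script, assuming the search results are rendered inside a `<div id="results">` element:

```python
# A minimal sketch; the 'results' div id is an assumption about the page.
import re
import requests

url = 'http://example.webscraping.com/places/default/search'
html = requests.get(url).text

# On a dynamic page this div is empty in the raw HTML, because its
# content is filled in later by JavaScript.
print(re.findall('<div id="results">(.*?)</div>', html, re.DOTALL))
```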
Running this script produces an empty result, which shows that the example scraper failed to extract information: the <div> element we are trying to find is empty, because its content is loaded by JavaScript.
Approaches for Scraping Data from Dynamic Websites
We have seen that the scraper cannot extract the information from a dynamic website, because the data is loaded dynamically with JavaScript. In such cases, we can use the following two techniques for scraping data from dynamic, JavaScript-dependent websites −
- Reverse Engineering JavaScript
- Rendering JavaScript
Reverse Engineering JavaScript
The process called reverse engineering is useful here: it lets us understand how data is loaded dynamically by web pages.
To do this, we need to open the browser's inspect element panel for the specified URL. Next, we click the NETWORK tab to find all the requests made for that web page, including search.json with a path of /ajax. Instead of accessing the AJAX data from the browser via the NETWORK tab, we can also do it with the help of the following Python script −
Example
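A sketch of that request; the query parameters mirror what the NETWORK tab shows for search.json:

```python
# Call the site's AJAX endpoint directly; the parameter names are taken
# from the request visible in the NETWORK tab.
import requests

url = ('http://example.webscraping.com/ajax/search.json'
       '?page=0&page_size=10&search_term=a')
response = requests.get(url)
print(response.json())
```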
The above script allows us to access the JSON response by using Python's json method. Similarly, we can download the raw string response and load it using Python's json.loads method. The following Python script does this: it scrapes all of the countries by searching for the letter 'a' and then iterating over the resulting pages of the JSON responses.
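A sketch; the page size and page count are assumptions about the example site's pagination:

```python
# Iterate the paginated JSON responses for the search term 'a' and collect
# every country name; the page size and range are assumptions.
import json
import requests

PAGE_SIZE = 15
url = ('http://example.webscraping.com/ajax/search.json'
       '?page={}&page_size={}&search_term=a')

countries = set()
for page in range(10):
    response = requests.get(url.format(page, PAGE_SIZE))
    data = json.loads(response.text)          # load the raw string response
    for record in data.get('records', []):
        countries.add(record['country'])

with open('countries.txt', 'w') as f:
    f.write('\n'.join(sorted(countries)))
```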
After running the above script, the scraped records will be saved in a file named countries.txt.
Rendering JavaScript
In the previous section, we reverse engineered the web page to see how its API worked and how we can use it to retrieve the results in a single request. However, we can face the following difficulties when reverse engineering −
- Sometimes websites can be very difficult to reverse engineer. For example, if the website is built with an advanced browser tool such as Google Web Toolkit (GWT), the resulting JS code is machine-generated and difficult to understand and reverse engineer.
- Some higher level frameworks like React.js can make reverse engineering difficult by abstracting already complex JavaScript logic.
The solution to the above difficulties is to use a browser rendering engine that parses HTML, applies the CSS formatting and executes JavaScript to display a web page.
Example
In this example, we are going to render JavaScript using a familiar Python module, Selenium. The following Python code will render a web page with the help of Selenium −
First, we need to import webdriver and its helpers from Selenium as follows −
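```python
# Imports for the Selenium 4 API used in the steps below.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
```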
Now, provide the path of the web driver which we downloaded as per our requirement −
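```python
# The chromedriver path is a placeholder; point it at your own download.
path = r'/path/to/chromedriver'
driver = webdriver.Chrome(service=Service(path))
```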
Now, provide the URL we want to open in the web browser, which is now controlled by our Python script −
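```python
driver.get('http://example.webscraping.com/places/default/search')
```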
Now, we can use the ID of the search textbox to set the element to select −
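```python
# 'search_term' is the assumed ID of the search textbox on the example site.
driver.find_element(By.ID, 'search_term').send_keys('.')
```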
Next, we can use JavaScript to set the content of the select box as follows −
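```python
# Set the page-size select box via JavaScript; the 'page_size' ID is assumed.
js = "document.getElementById('page_size').options[1].text = '100'"
driver.execute_script(js)
```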
The following line of code clicks search on the web page −
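```python
# 'search' is the assumed ID of the search button.
driver.find_element(By.ID, 'search').click()
```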
The next line of code waits up to 45 seconds for the AJAX request to complete −
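```python
# Implicit wait: poll up to 45 seconds for elements while the AJAX runs.
driver.implicitly_wait(45)
```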
Now, for selecting country links, we can use the CSS selector as follows −
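```python
# '#results a' is the assumed selector for the country links.
links = driver.find_elements(By.CSS_SELECTOR, '#results a')
```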
Now the text of each link can be extracted for creating the list of countries −
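```python
countries = [link.text for link in links]
print(countries)
```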
Introduction
We'll cover how to use headless Chrome for web scraping Google Places. Google Places does not strictly require JavaScript, because Google will serve a different response if you disable JavaScript, but for better user emulation when browsing and scraping Google Places, a browser is recommended.
Headless Chrome is essentially the Chrome browser running without a head (no graphical user interface). The benefit is that you can run a headless browser on a server environment that also has no graphical interface attached to it and is normally accessed through shell access. Running headless can also be faster and puts lower overhead on system resources.
Controlling a browser
We need a way to control the browser with code. This can be done through the Chrome DevTools Protocol, or CDP. CDP is essentially a websocket server running on the browser, based on JSON-RPC. Instead of working with CDP directly, we'll use a library called pyppeteer, a Python implementation of the CDP protocol that provides an easier-to-use abstraction. It's inspired by the Node library of the same design, puppeteer.
Setting up
As usual with any of my Python projects, I recommend working in a virtual Python environment, which helps us manage dependencies and versions separately for each application or project. Let's create a virtual environment in our home directory and install the dependencies we need.
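Something along these lines (the environment name is just an example):

```bash
# Create and activate a virtual environment, then install pyppeteer.
python3 -m venv ~/google-places-env
source ~/google-places-env/bin/activate
pip install pyppeteer
```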
Make sure you are running at least Python 3.6.1; 3.5 is end of support. The pyppeteer library will not work with Python 3.6.0, because the websockets library that it depends on does not support that Python version.
Let's create the following folders and files.
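A layout along these lines, inferred from the file paths referenced later in this guide:

```
core/
    __init__.py
    browser.py
    utils.py
google-places/
    __init__.py
    __main__.py
```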
We created a `__main__.py` file, which lets us run the Google Places scraper with the following command (nothing should happen right now):
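```bash
# Running the folder executes google-places/__main__.py. Run this from the
# project root; PYTHONPATH=. makes the core package importable later on.
PYTHONPATH=. python google-places
```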
Launching a headless browser
We need to launch a Chrome browser. By default, pyppeteer will install the latest version of Chromium. It's also possible to just use Chrome, as long as it is installed on your system. The library makes use of `async/await` for concurrency, so in order to use it we import the asyncio package from Python. To launch with Chrome instead of Chromium, add the `executablePath` option to the launch function. Below, we launch the browser, navigate to Google and take a screenshot. The screenshot will be saved in the folder from which you are running the scraper.
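A minimal sketch:

```python
# Launch headless Chromium, visit Google and save a screenshot.
import asyncio
from pyppeteer import launch

async def main():
    # Pass executablePath='/path/to/chrome' to use Chrome instead of Chromium.
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.google.com')
    await page.screenshot({'path': 'screenshot.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```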
Digging in
Let's create some functions in `core/browser.py` to simplify working with a browser and the page. We'll make use of what I believe is an awesome feature in Python for simplifying the management of resources: the context manager. Specifically, we will use an async context manager, a context manager that is able to suspend execution in its enter and exit methods. This feature lets us write code like the snippet below, which handles opening and closing a browser with one line.
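```python
# PageSession is defined next; opening (and, on exit, closing) the browser
# takes a single line.
async with PageSession('https://www.google.com') as session:
    html = await session.page.content()
```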
Let's add the `PageSession` async context manager in the file `core/browser.py`.
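A sketch of the class, built on pyppeteer's launch and page APIs:

```python
# core/browser.py: a sketch of the PageSession async context manager.
from pyppeteer import launch

class PageSession:
    def __init__(self, url):
        self.url = url
        self.browser = None
        self.page = None

    async def __aenter__(self):
        # Launch a headless browser, open a tab and navigate to the URL.
        self.browser = await launch()
        self.page = await self.browser.newPage()
        await self.page.goto(self.url, waitUntil='networkidle2')
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.browser.close()
```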
In our `google-places/__main__.py` file, let's make use of our new `PageSession` and print the HTML content of the final rendered page, with JavaScript executed.
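Something along these lines:

```python
# google-places/__main__.py: print the fully rendered HTML (a sketch).
import asyncio
from core.browser import PageSession

async def main():
    async with PageSession('https://www.google.com') as session:
        print(await session.page.content())

asyncio.get_event_loop().run_until_complete(main())
```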
Run the `google-places` module in your terminal with the same command we used earlier. So now we can launch a browser, open a page (a tab in Chrome), navigate to a website, wait for JavaScript to finish loading and executing, and then close the browser, all with the code above.
Next let's do the following:
- We want to visit `google.com`
- Enter a search query for `pediatrician near 94118`
- Click on google places to see more results
- Scrape results from the page
- Save results to a CSV file
Navigating pages
We want to end up on the following sequence of page navigations so we can pull the data we need.
Let's start by breaking up our code in `google-places/__main__.py` so we can first search, then navigate to Google Places. We also want to clean up some of the string literals, like the Google URL.
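Something along these lines; the XPath expressions are assumptions about Google's markup and may need tweaking:

```python
# Additions to google-places/__main__.py (a sketch).
GOOGLE_URL = 'https://www.google.com'
QUERY = 'pediatrician near 94118'

SEARCH_BAR = '//input[@name="q"]'
SEARCH_BUTTON = '//input[@name="btnK"]'
VIEW_ALL = '//a[contains(., "More places")]'
RESULT = '//div[@role="heading"]'

async def search(page):
    # Type the query into the search bar and click the search button.
    bar = (await page.xpath(SEARCH_BAR))[0]
    await bar.type(QUERY)
    button = (await page.xpath(SEARCH_BUTTON))[0]
    await button.click()
    # Wait for the "view all" button to appear in the results.
    await page.waitForXPath(VIEW_ALL)

async def goto_places(page):
    # Click "view all" to reach Google Places, then wait for the new page.
    view_all = (await page.xpath(VIEW_ALL))[0]
    await view_all.click()
    await page.waitForXPath(RESULT)
```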
In the new code added above, we use XPath to find the search bar, the search button and the view all button that gets us to Google Places:
- Type in the search bar
- Click the search button
- Wait for the view all button to appear
- Click view all button to take us to google places
- Wait for an element on the new page to appear
Scraping the data with Pyppeteer
At this point we should be on the Google Places page, and we can pull the data we want. The navigation flow we followed before is important for emulating a user.
Let's define the data we want to pull from the page.
- Name
- Location
- Phone
- Rating
- Website Link
In `core/browser.py`, let's add two methods to our `PageSession` to help us grab the text and an attribute (the website link for the doctor).
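A sketch of those two methods:

```python
# Methods added to PageSession in core/browser.py (a sketch).
class PageSession:
    # ... __init__, __aenter__ and __aexit__ as defined earlier ...

    async def get_text(self, element):
        # Evaluate JavaScript in the page, as if typed into the Chrome console.
        return await self.page.evaluate('(el) => el.textContent', element)

    async def get_link(self, element):
        # Read the href attribute off an anchor element via the DOM.
        return await self.page.evaluate('(el) => el.href', element)
```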
So we added `get_text` and `get_link`. These two methods evaluate JavaScript on the browser, the same way as if you were to type it in the Chrome console. You can see that they just use the DOM to grab the text of the element or the `href` attribute.
In `google-places/__main__.py`, we will add a few functions that grab the content we care about from the page.
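A sketch of the extraction helpers; every XPath here is an assumption about the page's markup:

```python
# Field XPaths, relative to one doctor's container element (assumptions).
NAME_XPATH = './/div[@role="heading"]'
LOCATION_XPATH = './/span[contains(@class, "address")]'
PHONE_XPATH = './/span[contains(text(), "(")]'
RATING_XPATH = './/span[contains(@aria-label, "stars")]'
LINK_XPATH = './/a[contains(., "Website")]'

async def get_field_text(session, container, xpath):
    # Return the field's text, or None when a doctor has no such field.
    matches = await container.xpath(xpath)
    return await session.get_text(matches[0]) if matches else None

async def get_field_link(session, container, xpath):
    # Return the href of a link field, or None when it is missing.
    matches = await container.xpath(xpath)
    return await session.get_link(matches[0]) if matches else None
```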
We make use of XPath to grab the elements. You can practice XPath in your Chrome browser by pressing `F12`, or by right-clicking an element and choosing Inspect, to open the console.

Why do I use XPath? It's easier to specify complex selectors, because XPath has built-in functions for handling things like finding elements which contain some text, or traversing the tree in various ways.

For the `phone`, `rating` and `link` fields we default to `None` and substitute with `'N/A'`, because not all doctors have a phone number listed, a rating or a link. All of them seem to have a location and a name.

Because there are many doctors listed on the page, we want to find the parent element and loop over each match, then evaluate the XPath we defined above. To do this, let's add two more functions to tie it all together.
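A sketch of those two functions:

```python
import asyncio

# Assumed XPath for the parent element wrapping each doctor's listing.
CONTAINER_XPATH = '//div[@role="article"]'

async def get_doctor_details(session, container):
    # Pull every field for one doctor, substituting 'N/A' for missing ones.
    return {
        'name': await get_field_text(session, container, NAME_XPATH) or 'N/A',
        'location': await get_field_text(session, container, LOCATION_XPATH) or 'N/A',
        'phone': await get_field_text(session, container, PHONE_XPATH) or 'N/A',
        'rating': await get_field_text(session, container, RATING_XPATH) or 'N/A',
        'link': await get_field_link(session, container, LINK_XPATH) or 'N/A',
    }

async def scrape_doctors(session):
    # Find every container element, then evaluate get_doctor_details on each.
    containers = await session.page.xpath(CONTAINER_XPATH)
    tasks = [get_doctor_details(session, c) for c in containers]
    # gather waits for all of the calls to finish concurrently.
    return await asyncio.gather(*tasks)
```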
The entry point here is `scrape_doctors`, which evaluates `get_doctor_details` on each container element. In the code above, we loop over each container element that matched our XPath and get back a `Future` object by calling the function `get_doctor_details`. Because we don't use the `await` keyword, we get back a `Future` object which can be used by the `asyncio.gather` call to evaluate all the `Future` objects in the `tasks` list. The `gather` line allows us to wait for all the `async` calls to finish concurrently.

Let's put this together in our main function. First we search and crawl to the right page, then we scrape with `scrape_doctors`.
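A sketch of the final main function:

```python
async def main():
    async with PageSession(GOOGLE_URL) as session:
        # First search and crawl to the right page, then scrape.
        await search(session.page)
        await goto_places(session.page)
        return await scrape_doctors(session)
```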
Saving the output
In `core/utils.py`, we'll add two functions to help us save our scraped output to a local CSV file.
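A sketch of `core/utils.py`; the field names match the dictionary built in `get_doctor_details`:

```python
# core/utils.py: helpers for writing the scraped records to CSV.
import csv

FIELDS = ['name', 'location', 'phone', 'rating', 'link']

def to_rows(records):
    # Convert the list of dictionaries into rows ordered by FIELDS.
    return [[record[field] for field in FIELDS] for record in records]

def save_csv(path, records):
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(to_rows(records))
```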
Let's import it in `google-places/__main__.py` and save the output of `scrape_doctors` from our main function.
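Something like:

```python
# At the bottom of google-places/__main__.py: run main and save the output.
from core.utils import save_csv

results = asyncio.get_event_loop().run_until_complete(main())
save_csv('pediatricians.csv', results)
```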
We should now have a file called `pediatricians.csv` which contains our output.
Wrapping up
From this guide you should have learned how to use a headless browser to crawl and scrape Google Places while emulating a real user. There's a lot more you can do with headless browsers, such as generating PDFs, taking screenshots and other automation tasks.
Hopefully this guide helped you get started executing JavaScript and scraping with a headless browser. Till next time!