Web Scraping using Python
Web scraping refers to the process of gathering data from the internet, and there are various tools that help us do this. Selenium is one such tool, and it allows us to automate the process. This is particularly useful when one requires large amounts of data from the internet and collecting it manually would be extremely time-consuming and tedious.
Selenium simulates how a human would surf the web, which makes it user-friendly and popular amongst novice coders in comparison to other tools such as Scrapy. Because it drives a real browser, it can also click, type and navigate, which BeautifulSoup, an HTML parser, cannot do on its own. This blog will cover the basics of using Selenium and the various functions that Selenium offers.
The following requirements are necessary to use Selenium:
1. Python: This tutorial uses Python 3 in a Jupyter Notebook, but the code can be adapted to Python 2 with minor changes.
2. Selenium Installation: The following line of code installs Selenium. The leading `!` tells Jupyter to run the line as a terminal command, and `pip` is Python's package manager for installing and managing packages, which comes pre-installed by default.
!pip install selenium
3. Google Chrome: This tutorial will show you how to use Selenium with Google Chrome, but Selenium itself is not limited to Chrome. It can be used with Firefox and Edge as well, although some of the details may change.
4. Driver Installation: Selenium WebDriver allows you to launch and perform tasks in a specific browser. For Chrome, you need to download ChromeDriver from its download page. A quick Google search will lead you to download links for the WebDrivers of other browsers.
Now that we have our requirements in place, let us begin writing our code.
Importing all the required packages in our Python notebook
import time
from selenium import webdriver
Setting up the selenium automated browser ‘driver’
The following line of code opens up a new Chrome Window, automated by the Selenium Webdriver
driver = webdriver.Chrome('''Path where the ChromeDriver has been installed in your computer''')
Calling driver.set_page_load_timeout(300) will stop Selenium from trying to load the webpage if it takes longer than 5 minutes (300 seconds) to load.
Maximizing the window with driver.maximize_window() is good practice. Different browser sizes could mean a different page source, which is what we’re going to use to identify the elements we need Selenium to scrape.
Now let’s load up the URL of the page we want to scrape with driver.get(). For the purpose of this tutorial, we shall scrape the reviews for the book ‘1984’ by George Orwell from Amazon.com.
Using time.sleep(15) will pause the program for 15 seconds, which allows enough time for the page to load all the elements before Selenium starts scraping them. If Selenium looks for an element before the page is fully loaded, it will raise an error.
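The setup steps above can be collected into one small helper. This is only a minimal sketch, not the code from the original notebook: the function name open_page and its parameter names are hypothetical, the URL is left as a placeholder for you to fill in, and driver is assumed to be the Chrome driver created earlier.

```python
import time


def open_page(driver, url, load_timeout=300, settle_seconds=15):
    """Configure the browser window, load `url`, then pause for late-loading elements."""
    # Give up on a page that takes longer than `load_timeout` seconds to load.
    driver.set_page_load_timeout(load_timeout)
    # A consistent window size keeps the page source consistent between runs.
    driver.maximize_window()
    # Navigate the automated browser to the page we want to scrape.
    driver.get(url)
    # Fixed pause so that elements added after the initial load have time to appear.
    time.sleep(settle_seconds)
    return driver
```

You would call it as open_page(driver, url), where url is the address of the product page you want to scrape. A fixed sleep is the simplest option; if the page is slow, increase settle_seconds rather than removing the pause.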
Accessing the elements to be scraped
The page has now loaded in a new browser window – which is controlled by Selenium.
We use the page source code to find the elements we want to scrape. You can open it using ‘Ctrl + Shift + I’, or by right-clicking and then selecting ‘Inspect Element’. In our Python notebook, we use the `driver.find_element_by_` family of methods to access the elements we need. Typing this prefix and pressing the Tab key gives you a list of the available methods.
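A note on versions: Selenium 4.3 and later removed the find_element_by_* methods in favour of a single find_element(By.XPATH, ...) style call. The sketch below is one hedged way to keep this tutorial's code working on either version. The helper name find_one is hypothetical, and the strategy strings are the plain values behind Selenium's By.ID, By.XPATH, By.CLASS_NAME and By.CSS_SELECTOR constants.

```python
def find_one(driver, strategy, value):
    """Find a single element on older and newer Selenium releases alike.

    `strategy` is one of "id", "xpath", "class name" or "css selector".
    """
    # Build the old-style method name, e.g. "class name" -> find_element_by_class_name.
    legacy_name = "find_element_by_" + strategy.replace(" ", "_")
    legacy_method = getattr(driver, legacy_name, None)
    if legacy_method is not None:
        # Older releases: driver.find_element_by_class_name("a-link-normal")
        return legacy_method(value)
    # Selenium 4.3+: driver.find_element(By.CLASS_NAME, "a-link-normal")
    return driver.find_element(strategy, value)
```

For example, find_one(driver, "class name", "a-link-normal") stands in for driver.find_element_by_class_name('a-link-normal'). If you only target a recent Selenium release, importing By from selenium.webdriver.common.by and calling driver.find_element directly is the more usual style.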
Scroll to the bottom of the page to find the link to all the 7000-ish reviews. Let’s ask Selenium to click this link for us, using XPath. An element’s xpath is unique, and it is a good way to access particular elements, like a link or a title or a button.
Right click the link, and then click on inspect element. (Do this action twice when you open the source code for the first time, as the first click will send you to the parent tag in the source code and the second click will lead you to the specific element you want.)
This will open up the source code for the link – you will be able to see a tag which has ‘href’ as an attribute. Right click on that element, select ‘Copy’ and then select ‘Copy XPath’.
Paste this XPath in your code as shown below. The element is saved to the all_reviews_link variable, and the click function enables Selenium to click on that link.
all_reviews_link = driver.find_element_by_xpath('''//*[@id="reviews-medley-footer"]/div/a''')
all_reviews_link.click()
Tada! This should have opened up a new page.
Let’s scrape the title and the author of the book we want the reviews for. In the source code, we can see that the tags holding the text we need have a class attribute. We can use XPath for this as well, as there is only a single element we require, but let’s try this differently.
book_title = driver.find_element_by_class_name('''a-link-normal''')
The text attribute, accessed as book_title.text, gives us the text in between the tags.
Similarly, we can use a CSS selector. The tag in this case is < a >, but there are many < a > tags embedded in the source code. So to pick out our particular element, we can combine the tag and the class in the form ‘tag.class_name’, as shown in the code below.
book_title = driver.find_element_by_css_selector('''a.a-link-normal''')
We can also copy the selector as we copied the XPath, which is especially useful when there are compound class names, as is the case with the author element.
author = driver.find_element_by_css_selector('''#cm_cr-product_info > div > div.a-fixed-left-grid-col.a-col-right > div > div > div.a-fixed-left-grid-col.product-info.a-col-right > div.a-row.product-by-line > a''')
Now, moving on to the reviews. We can scrape the reviewer’s name and the review content. A copied XPath does not work well here, as the XPath for each review element is unique and would match only one review. Using a general identifier that is present in all reviews is better. It often takes a little trial and error to figure out the identifying tag or attribute.
For example, for the reviewer’s name, we can see that the class attribute is shared by all the reviews. Hence, we can use `driver.find_elements_by_class_name`. Note that elements is plural: find_element returns only the first matching element, whereas find_elements returns a list of all matching elements.
Here, the variable username_element_list is a list of elements, whereas usernames holds the text within those element tags. (We slice from index 2 onward, as the first two elements are not part of the actual review list.)
username_element_list = driver.find_elements_by_class_name('''a-profile-name''')
usernames = [element.text for element in username_element_list][2:]
usernames
Another way of getting the element with another attribute that is not ‘class’ is ‘css_selector[attribute=value]’, as seen in the code below.
review_element_list = driver.find_elements_by_css_selector('''span[data-hook="review-body"]''')
reviews = [element.text for element in review_element_list]
reviews
Now that we have the data from the first page, which we can easily store in a data frame, all we have to do is click the next button and collect the next set of reviews, either by running the same code again or by using a for loop.
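The paging loop described above might look roughly like this. Treat it as a sketch under stated assumptions: the function name scrape_review_pages is hypothetical, and the 'li.a-last a' selector for the next-page link is an assumption about the page layout that you should check in the page source. It uses the generic find_elements(strategy, value) call, where "class name" and "css selector" are the string values of Selenium's By.CLASS_NAME and By.CSS_SELECTOR constants.

```python
import time


def scrape_review_pages(driver, max_pages=3, pause_seconds=5):
    """Collect reviewer names and review text from up to `max_pages` review pages."""
    rows = []
    for _ in range(max_pages):
        name_elements = driver.find_elements("class name", "a-profile-name")
        body_elements = driver.find_elements("css selector", 'span[data-hook="review-body"]')
        # Skip the first two names, which are not part of the actual review list.
        names = [element.text for element in name_elements][2:]
        bodies = [element.text for element in body_elements]
        # zip stops at the shorter list, so a mismatch drops the unmatched extras.
        for name, body in zip(names, bodies):
            rows.append({"reviewer": name, "review": body})
        # ASSUMPTION: the next-page link matches this selector; verify it yourself.
        next_links = driver.find_elements("css selector", "li.a-last a")
        if not next_links:
            break  # no next link found, so this was the last page
        next_links[0].click()
        # Give the next page time to load before scraping it.
        time.sleep(pause_seconds)
    return rows
```

The result is a list of dictionaries, one per review, which can be passed straight to pandas.DataFrame if you want a data frame.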
More useful functions offered by Selenium
The back() and forward() functions allow you to navigate within the browser history. However, these are not foolproof and are often prone to error, so it is worth having an alternative on hand.
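The original alternative code is not reproduced here, so the sketch below is an assumption about what such a fallback could look like: it moves through the history with JavaScript via driver.execute_script instead of back() and forward(). The function names go_back and go_forward are hypothetical.

```python
def go_back(driver, use_javascript=False):
    """Move one step back in the browser history."""
    if use_javascript:
        # ASSUMPTION: a JavaScript fallback, not necessarily the original alternative.
        driver.execute_script("window.history.go(-1)")
    else:
        driver.back()


def go_forward(driver, use_javascript=False):
    """Move one step forward in the browser history."""
    if use_javascript:
        # ASSUMPTION: a JavaScript fallback, not necessarily the original alternative.
        driver.execute_script("window.history.go(1)")
    else:
        driver.forward()
```

Calling go_back(driver) behaves like driver.back(), while go_back(driver, use_javascript=True) asks the page itself to step back.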
Another thing to keep in mind is pop-ups. If a pop-up blocks an element you’re trying to scrape, Selenium will throw an error. Remember that Selenium mimics a human surfing the web: if you can’t access the element without scrolling or closing the pop-up, Selenium won’t be able to either. Pop-ups can be closed by finding their close button with XPath, or any of the other methods described above, and clicking it.
A really useful feature of Selenium is that it lets you automate search box usage as well. For example, consider the Amazon search box. We can find it using the id attribute (or the class attribute, or the XPath, whatever floats your boat!)
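One way to handle a pop-up that may or may not appear is sketched below. The function name close_popup_if_present is hypothetical, and the XPath of the close button is something you need to copy from the page source yourself. The string "xpath" is the value of Selenium's By.XPATH constant.

```python
def close_popup_if_present(driver, close_button_xpath):
    """Click the pop-up's close button if it is on the page; return True if clicked."""
    # find_elements returns an empty list, rather than raising, when nothing matches.
    matches = driver.find_elements("xpath", close_button_xpath)
    if not matches:
        return False
    matches[0].click()
    return True
```

Calling this just before scraping an element means the script carries on whether or not the pop-up showed up.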
search_box = driver.find_element_by_id('''twotabsearchtextbox''')
You can also clear the search box using clear(), as in search_box.clear().
Getting Selenium to press the ‘enter’ button requires you to import the Keys module from Selenium.
from selenium.webdriver.common.keys import Keys
search_box.send_keys(Keys.ENTER)
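Putting the search box steps together gives something like the sketch below. The function name search_for and its parameters are hypothetical, and the search text is whatever you choose to pass in. The key to press is passed in as an argument, so you would supply Keys.ENTER from the import above. The string "id" is the value of Selenium's By.ID constant.

```python
def search_for(driver, query, submit_key, box_id="twotabsearchtextbox"):
    """Type `query` into the search box and press `submit_key` (for example Keys.ENTER)."""
    search_box = driver.find_element("id", box_id)
    # Remove any text already in the box before typing.
    search_box.clear()
    # Type the search text, then press the key that submits it.
    search_box.send_keys(query)
    search_box.send_keys(submit_key)
    return search_box
```

A call would look like search_for(driver, "1984 George Orwell", Keys.ENTER), with the query string here being only an illustration.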
Oh, and lastly, if Selenium throws an error, read the error message – you should get an understanding of what went wrong and be able to handle the error after reading this blog.
Thank you for reading! Hopefully, this tutorial has helped you use Selenium for web scraping in Python!