How to Scrape Data from Webpages with Python’s Scrapy

January 21, 2017

In this post I'll show how to gather unstructured information from webpages using Python's open source web crawling framework, Scrapy. Web crawlers have been around since the early days of the internet; in fact, Google started out by visiting links from Stanford's homepage until all 10 million of them had been explored. In the example code, we'll be visiting Reddit's /r/pics sub and grabbing information like title, author, comments, etc. But first, I should do the right thing and caution against excessive use of a scraper where an API (application programming interface) will do just fine. Reddit has an API, accessible from Python through the PRAW library, that you should use for anything that will be making lots of requests. The reason is that servers respond to web crawlers as if they were human visitors, which can slow the servers down, while APIs let programs ask for information in a more efficient manner.
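Just to show the contrast, here is roughly what pulling the same sort of data through PRAW looks like. This is a minimal sketch assuming PRAW 4's interface; the credentials are placeholders you'd get by registering an app at reddit.com/prefs/apps.

import praw

# Placeholder credentials -- register an app with Reddit to get real ones.
reddit = praw.Reddit(client_id='YOUR_ID',
                     client_secret='YOUR_SECRET',
                     user_agent='pics-reader 0.1')

# The API hands back structured data directly; no HTML parsing needed.
for submission in reddit.subreddit('pics').new(limit=10):
    print(submission.title, submission.author, submission.num_comments)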

For an intro to Scrapy and its general workflow, visit the Scrapy tutorial page. Now, let's look at the main part of the code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field
import scrapy

class PicItem(Item):
    # One item per Reddit thread; Field() declares each scraped attribute.
    author = Field()
    user = Field()
    date = Field()
    title = Field()
    url = Field()
    link = Field()
    comments = Field()

# Borrowed and modified from : https://seanmckaybeck.com/scrapy-the-basics.html

class exampleSpider(CrawlSpider):
    name = 'exampleSpider'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/pics']
    custom_settings = {
        'BOT_NAME': 'exampleSpider',
        'DEPTH_LIMIT': 5,       # follow pagination at most 5 pages deep
        'DOWNLOAD_DELAY': 5,    # wait 5 seconds between requests
    }
    # Follow Reddit's "next page" links; the raw string keeps the regex
    # backslashes intact.
    rules = [
        Rule(LinkExtractor(allow=[r'/r/pics/\?count=\d*&after=\w*']),
             callback='parse_item',
             follow=True)
    ]
    
    def parse_item(self, response):
        # Each div.thing on the page is one thread; queue them up and
        # extract details about each.
        selector_list = response.css('div.thing')

        for selector in selector_list:
            ...  # snip

You can receive the full code via email through the form on this page.



This code will load the /r/pics subreddit, queue up a list of threads to visit, and begin visiting them and extracting the additional information (comments, comment author, etc.). It will work its way from most recent to oldest. Only a few settings are initialized here; the download delay is a polite 5 seconds, something I was using just for testing purposes. The Item class (PicItem) is how Scrapy organizes and represents the data being scraped. Think of it as a specialized dictionary, well suited to sloppy web data.
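While you wait for the full code, here is my guess at what the snipped loop body looks like. It's a sketch, not the exact code from the form: the data-* attributes and class names come from old Reddit's markup and are my assumptions.

    def parse_item(self, response):
        for selector in response.css('div.thing'):
            item = PicItem()
            # extract_first() returns None instead of raising if nothing matches
            item['title'] = selector.css('a.title::text').extract_first()
            item['author'] = selector.xpath('@data-author').extract_first()
            item['url'] = selector.xpath('@data-url').extract_first()
            item['comments'] = selector.css('a.comments::text').extract_first()
            yield item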

To find the data, Scrapy looks within the page based on the HTML patterns you tell it to match. To see where your target data is embedded in the HTML, right click the text or value and click Inspect. The full code uses XPath selection; Scrapy also supports CSS selectors, as in the response.css('div.thing') call above. That's not your only option either, as I've seen a fair number of other examples use lxml directly (BeautifulSoup can also use it as a parser). A useful add-on for viewing web page source is Firebug; I believe it has a one-click "copy XPath" tool that gets the XPath structure you need to extract data. You may need to tweak it a little to grab only the relevant information. A good rule of thumb is to dig through the HTML to the field you want and trace the path back up until it is unique. For example, a div element with a span beneath it might occur 28 times on a page, but as you specify more structure (div > span > p > a) you narrow it down to only the data you want.
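You can test candidate patterns interactively by running scrapy shell "https://www.reddit.com/r/pics" and trying them against the live page. The selectors below are illustrations based on old Reddit's markup (my assumptions, not taken from the full code):

# Too broad: matches every link inside any div on the page.
response.xpath('//div//a')

# One node per thread.
response.xpath('//div[contains(@class, "thing")]')

# Narrowed all the way down to just the titles.
response.xpath('//p[@class="title"]/a/text()').extract()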

After creating a new project with "scrapy startproject exampleSpider" (strictly optional here, since runspider executes a self-contained spider file), you can start the spider by navigating to the folder the Python file is located in and typing:

scrapy runspider exampleSpider.py -o data.csv

And it will spit out the data neatly like so:
[screenshot: the exported data.csv opened in a spreadsheet]

However, if you look closely, there are some entries that don't line up. In the picture above, the comment field is split into two columns, shifting the rest of the data over by one. What could be causing it? Let's follow the URL.

[screenshot: the Reddit thread behind the misaligned row]

This post stands out from the others because it contains a semicolon and a parenthesis. Let's check another to see if there's any commonality.

[screenshot: a second thread whose title contains a semicolon]

Another comment with a semicolon that isn't being recorded correctly. This is usually good enough evidence to begin applying a solution. My hunch, which a quick Google search supports, is that the semicolon is being interpreted as a delimiter by the file reader (LibreOffice). A way around this would be to specify a different output format with the "-o" flag. JSON and XML are also great formats, but I tend to use CSV by default because it's a little more universal and because most data analysis courses/tutorials I've come across use data in .csv format.
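Two quick fixes, sketched under the assumption that the misalignment really is the import dialog splitting on semicolons. First, switch formats at export time; Scrapy infers the format from the extension:

scrapy runspider exampleSpider.py -o data.json

Second, if you'd rather stay with CSV, read the file programmatically. Python's csv module splits only on commas and respects the quoting Scrapy writes, so a semicolon inside a title stays in one field:

import csv

# data.csv is the file produced by the runspider command above;
# 'title' matches the PicItem field name used as the column header.
with open('data.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row['title'])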

This was a project that fell by the wayside, so that's all I have for you. The goal of this post was to give curious beginning/intermediate programmers an idea of what a scraper looks like in Python. Also, you have free working code to start with, so you don't have to spend as much time coding something up from scratch. Don't run this code 24/7 with no delays between requests. Do modify it to suit your own curiosity and share the results! I sort of feel like I'm leaving the readers who were looking for an in-depth post hanging, so I'll link to another blog post I came across which goes much more in depth about scraping with Scrapy (you just have to give your email to get the source code). Happy data gathering!

Update: I’ve also written a guide for using requests and BeautifulSoup that might be a little better suited for scraping data from a single page.
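The core of that approach looks roughly like this; it's a sketch assuming old Reddit's markup (div.thing, a.title) and Python's built-in html.parser:

import requests
from bs4 import BeautifulSoup

# Reddit tends to throttle the default python-requests User-Agent,
# so send a descriptive one (the name here is made up).
resp = requests.get('https://www.reddit.com/r/pics',
                    headers={'User-Agent': 'example-scraper 0.1'})
soup = BeautifulSoup(resp.text, 'html.parser')

for thing in soup.select('div.thing'):
    title = thing.select_one('a.title')
    if title is not None:
        print(title.get_text())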

P.S. – Also see scrapinghub. Of course I run into all these resources after I write the code.

