You can scrape a website's content interactively using Scrapy's shell, which is useful for testing or one-off scraping. If you want to automate website scraping, however, you'll need a Scrapy spider.
A spider lets you define in Python how to crawl websites and which data to extract from them. You can run it manually with the scrapy crawl command or schedule it with tools like cron.
Steps to create a Scrapy spider:
- Launch your terminal application.
- Go to the folder where you want to create your Scrapy project.
$ cd ~
- Install Scrapy if you don't already have it installed.
Related: How to install Scrapy on Ubuntu or Debian
Related: How to install Scrapy using pip
- Create a Scrapy project.
$ scrapy startproject simplifiedguide
New Scrapy project 'simplifiedguide', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
    /home/user/simplifiedguide

You can start your first spider with:
    cd simplifiedguide
    scrapy genspider example example.com
- Go to your project's spiders directory.
$ cd simplifiedguide/simplifiedguide/spiders/
- Generate a new spider.
$ scrapy genspider simplified www.simplified.guide
Created spider 'simplified' using template 'basic' in module:
  simplifiedguide.spiders.simplified
The command takes the spider's name and the domain or URL to crawl as its arguments. It creates a basic spider by default, though you can pick another template with the -t option if it better suits your needs (a sketch of the crawl template follows the examples below).
$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
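For comparison, the crawl template produces a spider built on CrawlSpider that follows links according to rules. The sketch below is only an approximation of that template with placeholder values filled in; the empty LinkExtractor and the parse_item body are assumptions for illustration, not what the generator writes verbatim.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ScrapyorgSpider(CrawlSpider):
    name = 'scrapyorg'
    allowed_domains = ['scrapy.org']
    start_urls = ['https://scrapy.org/']

    # Follow every link on the allowed domain and hand each page to parse_item.
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder extraction; replace with selectors for your target pages.
        yield {'url': response.url}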
- Edit the spider as necessary using your preferred text editor (an example edit is sketched after the command below).
$ vi simplified.py
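The generated simplified.py contains only a skeleton with an empty parse method. A minimal sketch of one possible edit is shown below; it extracts each page's title and follows in-domain links. The CSS selectors and the 'title' item field are assumptions for this example, not part of the generated template.

import scrapy


class SimplifiedSpider(scrapy.Spider):
    name = 'simplified'
    allowed_domains = ['www.simplified.guide']
    start_urls = ['https://www.simplified.guide/']

    def parse(self, response):
        # Yield one item per page; the selector depends on the site's actual markup.
        yield {'title': response.css('title::text').get()}

        # Queue every link found on the page for crawling; off-domain links
        # are filtered out automatically because of allowed_domains.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)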
- Use Scrapy's crawl command to test your spider.
$ scrapy crawl simplified
2022-01-19 07:01:59 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-19 07:01:59 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-25-generic-aarch64-with-glibc2.34
2022-01-19 07:01:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-19 07:01:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['simplifiedguide.spiders']}
##### snipped
- Configure your Scrapy spider as necessary.
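One way to configure the spider itself, assuming you want to slow the crawl down and save the scraped items, is to add a custom_settings attribute to the spider class. The values below (a one-second download delay and a JSON feed written to output.json) are illustrative, not defaults.

import scrapy


class SimplifiedSpider(scrapy.Spider):
    name = 'simplified'
    allowed_domains = ['www.simplified.guide']
    start_urls = ['https://www.simplified.guide/']

    # Per-spider settings that override the project-wide values in settings.py.
    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # pause between requests (illustrative value)
        'FEEDS': {
            'output.json': {'format': 'json'},  # export scraped items as JSON
        },
    }

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

Project-wide settings such as BOT_NAME, ROBOTSTXT_OBEY and the spider module paths shown in the crawl log above live in the project's settings.py instead.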
Related: Scrapy