You can scrape a website's content by using Scrapy's shell. The shell is interactive and helpful for testing or single-use scraping. However, if you want to automate website scraping, you'll need to use a Scrapy spider.

A spider lets you define programmatically, in Python, how to crawl websites and extract data from them. You can run it manually with the scrapy crawl command or schedule it with a tool such as cron.

Steps to create a Scrapy spider:

  1. Launch a terminal application.
  2. Go to the folder where you want to create your Scrapy project.
    $ cd ~
  3. Install Scrapy if you don't already have it installed.
  4. Create a Scrapy project.
    $ scrapy startproject simplifiedguide
    New Scrapy project 'simplifiedguide', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
    You can start your first spider with:
        cd simplifiedguide
        scrapy genspider example
  5. Go to your project's spiders directory.
    $ cd simplifiedguide/simplifiedguide/spiders/
  6. Generate a new spider.
    $ scrapy genspider simplified
    Created spider 'simplified' using template 'basic' in module:

    By default, genspider expects a name and a URL for the spider and creates it from the basic template. Use the -t option to choose a different template that better suits your needs, and the -l option to list the available templates.

    $ scrapy genspider -l
    Available templates:
    $ scrapy genspider example
    Created spider 'example' using template 'basic'
    $ scrapy genspider -t crawl scrapyorg
    Created spider 'scrapyorg' using template 'crawl'
  7. Edit the spider as necessary using your preferred text editor.
    $ vi
  8. Use Scrapy's crawl command to test your spider.
    $ scrapy crawl simplified
    2022-01-19 07:01:59 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
    2022-01-19 07:01:59 [scrapy.utils.log] INFO: Versions: lxml, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-25-generic-aarch64-with-glibc2.34
    2022-01-19 07:01:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2022-01-19 07:01:59 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'ROBOTSTXT_OBEY': True,
     'SPIDER_MODULES': ['simplifiedguide.spiders']}
    ##### snipped
  9. Configure your Scrapy spider as necessary.
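
Once the spider works from the command line, the crawl can be automated. The following is a minimal sketch of a wrapper script that a scheduler such as cron could call. The project path, the output file name and the helper names (build_command, run_spider) are assumptions made for illustration; the script simply runs scrapy crawl from inside the project directory.

```python
import subprocess
from pathlib import Path

# Assumed locations; adjust them to your own setup.
PROJECT_DIR = Path.home() / "simplifiedguide"
SPIDER_NAME = "simplified"


def build_command(spider, output=None):
    """Return the argument list for 'scrapy crawl', optionally with an output file."""
    command = ["scrapy", "crawl", spider]
    if output is not None:
        # -O overwrites the output file on each run, while -o appends to it.
        command += ["-O", str(output)]
    return command


def run_spider(spider=SPIDER_NAME, project_dir=PROJECT_DIR, output=None):
    """Run the spider from inside the project directory and return its exit code."""
    result = subprocess.run(build_command(spider, output), cwd=project_dir)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical output file; relative paths resolve inside the project directory.
    raise SystemExit(run_spider(output="items.json"))
```

Keeping the command construction in its own function makes it easy to check what will be executed before handing the script to a scheduler.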

