You can scrape a website's content interactively using Scrapy's shell, which is useful for testing or one-off scraping. If you want to automate website scraping, however, you'll need a Scrapy spider.
A spider lets you define in Python how to crawl websites and which data to extract from them. You can run it manually with the scrapy crawl command or schedule it with tools like cron.
Steps to create a Scrapy spider:
- Launch your terminal application.
- Go to the folder where you want to create your Scrapy project.
$ cd ~
- Install Scrapy if you don't already have it installed.
Related: How to install Scrapy on Ubuntu or Debian
Related: How to install Scrapy using pip
- Create a Scrapy project.
$ scrapy startproject simplifiedguide
New Scrapy project 'simplifiedguide', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
    /home/user/simplifiedguide

You can start your first spider with:
    cd simplifiedguide
    scrapy genspider example example.com
- Go to your project's spiders directory.
$ cd simplifiedguide/simplifiedguide/spiders/
- Generate a new spider.
$ scrapy genspider simplified www.simplified.guide
Created spider 'simplified' using template 'basic' in module:
  simplifiedguide.spiders.simplified
The command takes the spider's name and the domain or URL to crawl as its arguments. It creates a basic spider by default, though you can pick another template with the -t option if it better suits your needs (a sketch of the crawl template follows the examples below).
$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
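For comparison, the crawl template produces a spider built on CrawlSpider that follows links according to rules. The sketch below is only an approximation of that template with placeholder values filled in; the empty LinkExtractor and the parse_item body are assumptions for illustration, not what the generator writes verbatim.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ScrapyorgSpider(CrawlSpider):
    name = 'scrapyorg'
    allowed_domains = ['scrapy.org']
    start_urls = ['https://scrapy.org/']

    # Follow every link on the allowed domain and hand each page to parse_item.
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder extraction; replace with selectors for your target pages.
        yield {'url': response.url}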
- Edit the spider as necessary using your preferred text editor (an example edit is sketched after the command below).
$ vi simplified.py
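The generated simplified.py contains only a skeleton with an empty parse method. A minimal sketch of one possible edit is shown below; it extracts each page's title and follows in-domain links. The CSS selectors and the 'title' item field are assumptions for this example, not part of the generated template.

import scrapy


class SimplifiedSpider(scrapy.Spider):
    name = 'simplified'
    allowed_domains = ['www.simplified.guide']
    start_urls = ['https://www.simplified.guide/']

    def parse(self, response):
        # Yield one item per page; the selector depends on the site's actual markup.
        yield {'title': response.css('title::text').get()}

        # Queue every link found on the page for crawling; off-domain links
        # are filtered out automatically because of allowed_domains.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)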
- Use Scrapy's crawl command to test your spider.
$ scrapy crawl simplified
2022-01-19 07:01:59 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-19 07:01:59 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-25-generic-aarch64-with-glibc2.34
2022-01-19 07:01:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-19 07:01:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['simplifiedguide.spiders']}
##### snipped
- Configure your Scrapy spider as necessary.
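One way to configure the spider itself, assuming you want to slow the crawl down and save the scraped items, is to add a custom_settings attribute to the spider class. The values below (a one-second download delay and a JSON feed written to output.json) are illustrative, not defaults.

import scrapy


class SimplifiedSpider(scrapy.Spider):
    name = 'simplified'
    allowed_domains = ['www.simplified.guide']
    start_urls = ['https://www.simplified.guide/']

    # Per-spider settings that override the project-wide values in settings.py.
    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # pause between requests (illustrative value)
        'FEEDS': {
            'output.json': {'format': 'json'},  # export scraped items as JSON
        },
    }

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

Project-wide settings such as BOT_NAME, ROBOTSTXT_OBEY and the spider module paths shown in the crawl log above live in the project's settings.py instead.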
Related: Scrapy