You can scrape a website's content with Scrapy's shell. The shell is interactive and well suited to testing or one-off scraping. To automate website scraping, however, you'll need to write a Scrapy spider.
A spider lets you define in Python how to crawl websites and extract data from them. You can run it manually with the scrapy crawl command or schedule it with tools like cron.
$ cd ~
Related: How to install Scrapy on Ubuntu or Debian
Related: How to install Scrapy using pip
$ scrapy startproject simplifiedguide
New Scrapy project 'simplifiedguide', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
    /home/user/simplifiedguide

You can start your first spider with:
    cd simplifiedguide
    scrapy genspider example example.com
$ cd simplifiedguide/simplifiedguide/spiders/
$ scrapy genspider simplified www.simplified.guide
Created spider 'simplified' using template 'basic' in module:
  simplifiedguide.spiders.simplified
The genspider command takes a name and a domain for the spider as its arguments. It uses the basic template by default, though you can select another template with the -t option if it better suits your needs.
$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
$ scrapy genspider example example.com
Created spider 'example' using template 'basic'
$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
$ vi simplified.py
$ scrapy crawl simplified
2022-01-19 07:01:59 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-19 07:01:59 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-25-generic-aarch64-with-glibc2.34
2022-01-19 07:01:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-19 07:01:59 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'simplifiedguide', 'NEWSPIDER_MODULE': 'simplifiedguide.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['simplifiedguide.spiders']}
##### snipped
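For scheduled runs with a tool like cron, the scrapy crawl command can be wrapped in a small script. The sketch below is one possible approach using only the Python standard library. The function names build_crawl_command and run_crawl are hypothetical helpers invented for this illustration, and the -O option (overwrite the output file) assumes Scrapy 2.1 or later.

```python
import subprocess
from pathlib import Path


def build_crawl_command(spider_name, output_file=None):
    """Return the argument list for `scrapy crawl <spider_name>`."""
    command = ["scrapy", "crawl", spider_name]
    if output_file is not None:
        # -O overwrites the output file on every run (assumes Scrapy >= 2.1).
        command += ["-O", str(output_file)]
    return command


def run_crawl(project_dir, spider_name, output_file=None):
    """Run the spider from inside the project directory; return the exit code."""
    result = subprocess.run(
        build_crawl_command(spider_name, output_file),
        # scrapy crawl must be started from within the project directory tree.
        cwd=Path(project_dir).expanduser(),
        check=False,
    )
    return result.returncode
```

Calling run_crawl("~/simplifiedguide", "simplified", "simplified.json") from a script would run the spider created above and write its items to simplified.json, and a cron entry could then invoke that script on whatever schedule you need.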
Related: Scrapy