Creating a Scrapy spider turns selector experiments into a reusable crawl entry point that can be run on demand, exported to a feed, or scheduled with the rest of a project. That matters once extraction logic needs a stable name, repeatable start requests, and a callback that can be revised without rebuilding the whole crawler.

In a project-based workflow, scrapy genspider writes a new spider class into the module named by NEWSPIDER_MODULE and seeds it with allowed_domains, start_urls, and an empty parse() method. Running the command from the project root keeps the new spider aligned with the same settings, middleware, feed exports, and robots handling used by the rest of the crawler.
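
The generated file is roughly the following scaffold; the class name, domain, and URL are placeholders for the arguments passed to the command, and parse() is left empty for the callback logic added later:

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["https://example.com"]

        def parse(self, response):
            pass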

Spider names must stay unique inside one project, and the project itself should already exist before a new spider is generated. New projects created with scrapy startproject still enable ROBOTSTXT_OBEY in their generated settings.py, so the first crawl can behave more conservatively than older examples that assume unrestricted requests.
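
The relevant line can be confirmed in the generated settings file before the first crawl:

    # catalogbot/settings.py, as written by scrapy startproject
    ROBOTSTXT_OBEY = True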

Steps to create a Scrapy spider with scrapy genspider:

  1. Change to the Scrapy project root that contains scrapy.cfg.
    $ cd /home/user/sg-work/catalogbot

    If the project does not exist yet, create it first. Related: How to create a Scrapy project

  2. Read the spider module path so the new file lands in the expected package.
    $ scrapy settings --get NEWSPIDER_MODULE
    catalogbot.spiders

    scrapy genspider writes the new class into this module when the command runs from the project root.
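
    The value comes from the generated settings file, where it sits next to SPIDER_MODULES:

    # catalogbot/settings.py, as written by scrapy startproject
    SPIDER_MODULES = ["catalogbot.spiders"]
    NEWSPIDER_MODULE = "catalogbot.spiders"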

  3. Generate the spider scaffold with a unique spider name and the first target URL.
    $ scrapy genspider offers https://catalog.example.net/products
    Created spider 'offers' using template 'basic' in module:
      catalogbot.spiders.offers

    scrapy genspider accepts a full URL as the target, and -t crawl switches to the link-following crawl template when that behavior is needed.
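
    A sketch of the link-following form that the crawl template scaffolds; the allow pattern, class name, and parse_item() fields here are illustrative and would need to match the real site:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class OffersCrawlSpider(CrawlSpider):
        name = "offers_crawl"
        allowed_domains = ["catalog.example.net"]
        start_urls = ["https://catalog.example.net/products"]

        # follow links under /products/ and parse each matched page
        rules = (
            Rule(LinkExtractor(allow=r"/products/"), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"title": response.css("h1::text").get(), "url": response.url}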

  4. List the project spiders to confirm that Scrapy registered the new spider name.
    $ scrapy list
    offers

  5. Edit the generated spider file so the parse() callback yields the fields that matter for the target page.
    $ vi catalogbot/spiders/offers.py
    import scrapy
     
     
    class OffersSpider(scrapy.Spider):
        name = "offers"
        allowed_domains = ["catalog.example.net"]
        start_urls = ["https://catalog.example.net/products"]
     
        def parse(self, response):
            for product in response.css("article.product"):
                href = product.css("h2 a::attr(href)").get()
                yield {
                    "title": product.css("h2 a::text").get(),
                    "price": product.css(".price::text").get(),
                    "url": response.urljoin(href) if href else None,
                }

    Build the selector in the shell first when the markup is unclear. Related: How to use Scrapy shell
    Related: How to use CSS selectors in Scrapy
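
    When the listing spans several pages, the same callback can also queue the next page; a minimal sketch of a drop-in replacement for parse() in the class above, assuming the markup exposes a rel="next" pagination link:

        def parse(self, response):
            for product in response.css("article.product"):
                href = product.css("h2 a::attr(href)").get()
                yield {
                    "title": product.css("h2 a::text").get(),
                    "price": product.css(".price::text").get(),
                    "url": response.urljoin(href) if href else None,
                }
            # assumption: the listing links its next page with rel="next"
            next_href = response.css('a[rel="next"]::attr(href)').get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)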

  6. Run the spider and overwrite the export file with the current crawl results.
    $ scrapy crawl offers -O offers.jsonl
    2026-04-16 06:40:53 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: catalogbot)
    ##### snipped #####
    2026-04-16 06:40:54 [scrapy.core.engine] INFO: Spider opened
    2026-04-16 06:40:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://catalog.example.net/products> (referer: None)
    2026-04-16 06:40:54 [scrapy.extensions.feedexport] INFO: Stored jsonl feed (1 items) in: offers.jsonl
    2026-04-16 06:40:54 [scrapy.core.engine] INFO: Spider closed (finished)

    -O replaces any existing file at that path, while the crawl still obeys project settings such as ROBOTSTXT_OBEY, download delays, middleware, and default headers.
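
    The same export can also be configured once in the project settings instead of passing -O on every run; a minimal sketch using the FEEDS setting, with the output path as an assumption:

    # catalogbot/settings.py: persistent equivalent of "-O offers.jsonl"
    FEEDS = {
        "offers.jsonl": {
            "format": "jsonlines",
            "overwrite": True,
        },
    }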

  7. Inspect the exported feed to confirm that the spider yielded the expected fields.
    $ cat offers.jsonl
    {"title": "Starter widget", "price": "$49", "url": "https://catalog.example.net/products/widget-1"}

Notes

  • The default basic template is enough for one-page extraction or explicit follow-up requests from parse(), while crawl is the better template when the spider should follow links automatically with rules.
  • scrapy startproject still writes ROBOTSTXT_OBEY = True into generated projects even though the built-in default for that setting is False, so the first run can skip pages that older tutorials would fetch.
  • Replace static start_urls with an async def start() method when the spider must build initial requests dynamically, and add start_requests() only when compatibility with Scrapy versions older than 2.13 is required.
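
A minimal sketch of the Scrapy 2.13+ form mentioned in the last note; the page-number query parameter is illustrative and stands in for whatever logic builds the real start requests:

    import scrapy


    class OffersSpider(scrapy.Spider):
        name = "offers"
        allowed_domains = ["catalog.example.net"]

        async def start(self):
            # build the initial requests dynamically instead of listing start_urls
            for page in range(1, 4):
                yield scrapy.Request(
                    f"https://catalog.example.net/products?page={page}",
                    callback=self.parse,
                )

        def parse(self, response):
            for product in response.css("article.product"):
                yield {"title": product.css("h2 a::text").get()}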