Following links from a Scrapy response turns the page that was just downloaded into the next set of requests. That is how a spider moves from listing pages to detail pages or keeps stepping through pagination without hard-coding every absolute URL.
response.follow() builds a new Request by using the current response as the base URL, so relative paths resolve automatically and there is no need to call response.urljoin() first. It accepts a relative URL string, a Link object, an <a> selector, or an attribute selector such as response.css("li.next a::attr(href)")[0], and response.follow_all() remains the compact option when many matched links share the same callback.
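Under the hood this is ordinary relative-URL resolution against the page's own address. A minimal stdlib sketch of the joining behavior response.follow() relies on, using urllib.parse.urljoin with the tutorial's quotes.toscrape.com pages as example URLs:

```python
from urllib.parse import urljoin

# The page that was just downloaded acts as the base URL.
base = "http://quotes.toscrape.com/page/2/"

# A root-relative href resolves against the base, as response.follow() does.
print(urljoin(base, "/author/Albert-Einstein"))
# http://quotes.toscrape.com/author/Albert-Einstein

# A path-relative href steps from the current directory.
print(urljoin(base, "../3/"))
# http://quotes.toscrape.com/page/3/

# An absolute URL passes through unchanged.
print(urljoin(base, "http://example.com/other"))
# http://example.com/other
```

response.follow() does this resolution for you and wraps the result in a Request, which is why spiders can yield the raw href values scraped from the page.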
The usual failures are passing a SelectorList instead of one selector, matching anchors that do not contain the intended href, or copying the whole response.meta dictionary into unrelated follow-up requests. Keep callback-only values in cb_kwargs, and list every expected target host in allowed_domains so OffsiteMiddleware does not quietly drop followed requests.
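The meta-copying pitfall is plain Python aliasing: handing the same dictionary to several requests means they all share one object. A toy sketch without Scrapy (the FakeRequest class here is a stand-in for illustration, not Scrapy's Request):

```python
# Toy stand-in for a request object; not Scrapy's Request class.
class FakeRequest:
    def __init__(self, url, meta):
        self.url = url
        self.meta = meta

shared_meta = {"page": 1}

# Reusing the same dict means every request aliases one object...
requests = [FakeRequest(f"http://example.com/{n}", shared_meta) for n in range(3)]

# ...so a later mutation shows up on all of them.
shared_meta["page"] = 99
print([r.meta["page"] for r in requests])  # [99, 99, 99]
```

Scoping callback-only values to cb_kwargs sidesteps this class of bug and keeps meta reserved for the middleware settings that actually belong there.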
Related: How to use CSS selectors in Scrapy
Related: How to use request callbacks in Scrapy
Steps to follow links from a response in Scrapy:
- Change to the root of the Scrapy project that will run the spider.
$ cd response_follow_link_demo
- Start scrapy shell against a page that contains both detail links and pagination.
$ scrapy shell --nolog 'http://quotes.toscrape.com/'
[s] Available Scrapy objects:
[s]   request    <GET http://quotes.toscrape.com/>
[s]   response   <200 http://quotes.toscrape.com/>
##### snipped #####
>>>
The shell exposes the fetched page as response, which is the same object a spider callback receives.
- Confirm response.follow() turns both an <a> selector and a relative href selector into absolute request URLs.
>>> response.follow(response.css("div.quote span a")[0]).url
'http://quotes.toscrape.com/author/Albert-Einstein'
>>> response.follow(response.css("li.next a::attr(href)")[0]).url
'http://quotes.toscrape.com/page/2/'

response.follow(response.css("li.next a")) is not valid because response.css() returns a SelectorList. Use one selector with [0], loop over the matches, or switch to response.follow_all().
- Replace response_follow_link_demo/spiders/quotes_follow.py with a spider that follows author links into a second callback and reuses parse() for pagination.
import scrapy


class QuotesFollowSpider(scrapy.Spider):
    name = "quotes_follow"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for author_link in response.css("div.quote span a"):
            yield response.follow(author_link, callback=self.parse_author)

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_author(self, response):
        yield {
            "name": response.css("h3.author-title::text").get(default="").strip(),
            "birthdate": response.css("span.author-born-date::text").get(default="").strip(),
            "url": response.url,
        }
Passing an <a> selector lets Scrapy read its href automatically. If a callback needs values from the listing page, pass them with cb_kwargs=... instead of copying the whole response.meta dictionary.
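Values placed in cb_kwargs arrive as plain keyword arguments of the callback. A minimal Scrapy-free sketch of that delivery, where parse_author, listing_quote, and the quote text are illustrative stand-ins:

```python
# Scrapy-free sketch: cb_kwargs become keyword arguments of the callback.
def parse_author(response, listing_quote):
    # listing_quote came from the listing page, not from this response.
    return {"url": response, "quote": listing_quote}

cb_kwargs = {"listing_quote": "A witty saying proves nothing."}

# Roughly what the engine does when the followed response arrives:
item = parse_author("http://quotes.toscrape.com/author/Voltaire", **cb_kwargs)
print(item["quote"])  # A witty saying proves nothing.
```

Because the values are explicit parameters, a typo in a cb_kwargs key fails loudly with a TypeError instead of silently yielding None the way a missing meta lookup can.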
- Run the spider and overwrite the previous export file for the current crawl.
$ scrapy crawl quotes_follow -O authors.json
[scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: response_follow_link_demo)
[scrapy.core.engine] INFO: Spider opened
##### snipped #####
[scrapy.extensions.feedexport] INFO: Stored json feed (50 items) in: authors.json
[scrapy.core.engine] INFO: Spider closed (finished)
-O overwrites the export file on each run so the saved results stay aligned with the current spider code.
- Open the exported feed and confirm the saved URLs belong to author pages rather than the listing pages.
$ cat authors.json
[
{"name": "André Gide", "birthdate": "November 22, 1869", "url": "http://quotes.toscrape.com/author/Andre-Gide/"},
{"name": "Jane Austen", "birthdate": "December 16, 1775", "url": "http://quotes.toscrape.com/author/Jane-Austen/"},
{"name": "Thomas A. Edison", "birthdate": "February 11, 1847", "url": "http://quotes.toscrape.com/author/Thomas-A-Edison/"},
##### snipped #####
]

The author-page URLs confirm that the crawl reached the followed responses and that parse_author() ran on those pages instead of the quote listing.
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
