A user agent is a string that browsers and other HTTP clients use to identify themselves to a web server. It is sent in the User-Agent header of every HTTP request. By default, Scrapy identifies itself as follows:

Scrapy/<version> (+https://scrapy.org)

A web server can be configured to respond differently based on the user agent string. A request from a mobile device, for example, could be served mobile-specific content. However, some web servers are configured to block web scraping traffic altogether, which can be a problem for Scrapy spiders.
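
To make the server-side behavior concrete, here is a minimal sketch of how a server might branch on the user agent string. The function name choose_response and both matching rules are hypothetical and chosen only for illustration; real servers use their own, usually more elaborate, rules.

```python
from typing import Tuple


def choose_response(user_agent: str) -> Tuple[int, str]:
    """Return a (status code, content label) pair for a User-Agent string."""
    ua = user_agent.lower()

    # Hypothetical block rule: reject clients using Scrapy's default identification.
    if ua.startswith("scrapy/"):
        return 403, "blocked"

    # Hypothetical mobile rule: look for common mobile device tokens.
    if any(token in ua for token in ("ipad", "iphone", "android", "mobile")):
        return 200, "mobile"

    # Everything else gets the regular desktop content.
    return 200, "desktop"


scrapy_default = "Scrapy/2.5.1 (+https://scrapy.org)"
ipad = (
    "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
)

print(choose_response(scrapy_default))  # (403, 'blocked')
print(choose_response(ipad))            # (200, 'mobile')
```

Under these assumed rules, the default Scrapy string is rejected while the iPad string is served mobile content, which is why changing the user agent can change what a spider receives.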

One way to avoid the issue is to change Scrapy's user agent string so that it identifies itself as a regular browser instead.

Steps to change user agent for Scrapy:

  1. Fetch a website normally using the scrapy fetch command.
    $ scrapy fetch https://www.simplified.guide
    2022-01-21 19:14:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
    2022-01-21 19:14:23 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
    2022-01-21 19:14:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2022-01-21 19:14:23 [scrapy.crawler] INFO: Overridden settings:
    {}
    2022-01-21 19:14:23 [scrapy.extensions.telnet] INFO: Telnet Password: 8cb518588009f800
    2022-01-21 19:14:23 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats']
    2022-01-21 19:14:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
    ##### snipped

    You can run the same test with scrapy shell or any other Scrapy command.

  2. Get the user agent that you want to use for your Scrapy spider.
  3. Use the --set option to change the USER_AGENT value for your fetch request.
    $ scrapy fetch https://www.simplified.guide --set=USER_AGENT="Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
    2022-01-21 19:15:46 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
    2022-01-21 19:15:46 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
    2022-01-21 19:15:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2022-01-21 19:15:46 [scrapy.crawler] INFO: Overridden settings:
    {'USER_AGENT': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) '
                   'AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}
    2022-01-21 19:15:46 [scrapy.extensions.telnet] INFO: Telnet Password: 4b6d55580e3b313c
    2022-01-21 19:15:46 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats']
    2022-01-21 19:15:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     ##### snipped
  4. Open the configuration file of your Scrapy project using your preferred text editor.
    $ vi simplifiedguide/simplifiedguide/settings.py
  5. Search for the USER_AGENT option.
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'scraper (+http://www.yourdomain.com)'
  6. Uncomment the line and set the value to the user agent of your choice to set it permanently for the spiders in your Scrapy project.
    USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'