How to change user agent for Scrapy spiders

The user agent is a string that a client, such as a browser, uses to identify itself to a web server. It is sent in the User-Agent header of every HTTP request. By default, Scrapy identifies itself as follows:

Scrapy/<version> (+https://scrapy.org)

A web server can be configured to respond differently depending on the user agent string. A request from a mobile device, for example, could be served mobile-specific content. Some web servers, however, are configured to block web scraping traffic altogether, which can be a problem for Scrapy spiders.
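
To make the server-side behavior concrete, here is a minimal sketch of how a server might branch on the user agent string. The function name, the matching rules, and the return labels are all hypothetical and only illustrate the idea; real servers use their own, usually more elaborate, rules.

```python
def classify_user_agent(user_agent: str) -> str:
    """Return a hypothetical handling decision for a user agent string."""
    ua = user_agent.lower()

    # Assumed rule: refuse clients that announce themselves as Scrapy.
    if ua.startswith("scrapy/"):
        return "blocked"

    # Assumed rule: treat common mobile markers as a mobile client.
    if any(token in ua for token in ("mobile", "iphone", "ipad", "android")):
        return "mobile"

    return "desktop"


if __name__ == "__main__":
    for ua in (
        "Scrapy/2.5.1 (+https://scrapy.org)",
        "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
    ):
        print(classify_user_agent(ua), "<-", ua)
```

Under these assumed rules, the default Scrapy string would be rejected, while the iPad string used later in this guide would be treated as a mobile client.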

One way to avoid the issue is to change Scrapy's user agent string so that it identifies itself as a regular browser.

Steps to change user agent for Scrapy:

Fetch a website normally using the scrapy fetch command.

$ scrapy fetch https://www.simplified.guide
2022-01-21 19:14:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-01-21 19:14:23 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:14:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:14:23 [scrapy.crawler] INFO: Overridden settings:
{}
2022-01-21 19:14:23 [scrapy.extensions.telnet] INFO: Telnet Password: 8cb518588009f800
2022-01-21 19:14:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-01-21 19:14:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
##### snipped

You can also run the same test with scrapy shell or any other Scrapy command.

Get the user agent that you want to use for your Scrapy spider.

Related: List of Browser User Agents

Use the --set option to change the USER_AGENT value for your fetch request.

$ scrapy fetch https://www.simplified.guide --set=USER_AGENT="Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
2022-01-21 19:15:46 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-01-21 19:15:46 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:15:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:15:46 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) '
               'AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}
2022-01-21 19:15:46 [scrapy.extensions.telnet] INFO: Telnet Password: 4b6d55580e3b313c
2022-01-21 19:15:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-01-21 19:15:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 ##### snipped
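
The log above shows that the setting was overridden, but not what the server received. One way to check is to point Scrapy at a small local server that echoes the User-Agent header back. The sketch below uses only the Python standard library; the handler class, the helper function names, and port 8000 are assumptions made for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgentHandler(BaseHTTPRequestHandler):
    """Reply to every GET request with the User-Agent header it carried."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request console logging


def start_echo_server(port=0):
    """Serve on 127.0.0.1 in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), EchoUserAgentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_with_user_agent(url, user_agent):
    """Request url with the given User-Agent and return the response body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Skip any configured system proxy, since the echo server is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return response.read().decode("utf-8")
```

After calling start_echo_server(8000) in an interactive Python session, running scrapy fetch --nolog http://127.0.0.1:8000 with the same --set=USER_AGENT option in another terminal should print the user agent string that Scrapy sent.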

Open the configuration file of your Scrapy project using your preferred text editor.
```
$ vi simplifiedguide/simplifiedguide/settings.py
```

Search for the USER_AGENT option.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scraper (+http://www.yourdomain.com)'

Uncomment the line and set its value to the user agent of your choice to permanently set the user agent for the spiders in your Scrapy project.
```
USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
```

Author: Mohd Shakir Zakaria
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
