The user agent is a string that browsers and other HTTP clients use to identify themselves to the web server. It is sent in the request header of every HTTP request. By default, Scrapy identifies itself as follows:
Scrapy/<version> (+https://scrapy.org)
A web server can be configured to respond differently based on the user agent string. A request from a mobile device, for example, could be served mobile-specific content. However, some web servers are configured to block web scraping traffic altogether, which can be a problem for Scrapy spiders.
One way to avoid the issue is to change Scrapy's user agent string so that it identifies itself as a regular browser.
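To illustrate how a server might act on the user agent string, here is a minimal sketch in plain Python. The function name choose_response, the blocking rule, and the list of mobile markers are all hypothetical; real servers use their own rules, which are usually more elaborate.

```python
def choose_response(user_agent: str) -> str:
    """Return a label for the kind of response a server might send."""
    ua = user_agent.lower()
    # Assumed blocking rule: Scrapy's default string starts with "Scrapy/".
    if ua.startswith("scrapy/"):
        return "blocked"
    # Assumed mobile markers; an actual server may check for different ones.
    if any(marker in ua for marker in ("ipad", "iphone", "android", "mobile")):
        return "mobile"
    return "desktop"


ipad_ua = (
    "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
)
print(choose_response("Scrapy/2.5.1 (+https://scrapy.org)"))
print(choose_response(ipad_ua))
```

Under these assumed rules, the default Scrapy string is rejected while the iPad string is treated as a mobile browser.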
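At the HTTP level, changing the user agent only means sending a different User-Agent request header. The sketch below shows the idea with Python's standard urllib module rather than Scrapy itself; it builds a request object without sending anything over the network, and the variable names are illustrative only.

```python
from urllib.request import Request

# Browser-like user agent string to send instead of the client's default.
BROWSER_UA = (
    "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
)

# Build the request with an explicit User-Agent header (nothing is sent yet).
request = Request(
    "https://www.simplified.guide",
    headers={"User-Agent": BROWSER_UA},
)

# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
```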
$ scrapy fetch https://www.simplified.guide
2022-01-21 19:14:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-01-21 19:14:23 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:14:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:14:23 [scrapy.crawler] INFO: Overridden settings:
{}
2022-01-21 19:14:23 [scrapy.extensions.telnet] INFO: Telnet Password: 8cb518588009f800
2022-01-21 19:14:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-01-21 19:14:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
##### snipped
You can also run the same test with shell or any other Scrapy method.
Related: List of Browser User Agents
$ scrapy fetch https://www.simplified.guide --set=USER_AGENT="Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
2022-01-21 19:15:46 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-01-21 19:15:46 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:15:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:15:46 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) '
               'AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}
2022-01-21 19:15:46 [scrapy.extensions.telnet] INFO: Telnet Password: 4b6d55580e3b313c
2022-01-21 19:15:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-01-21 19:15:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
##### snipped
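Because the user agent string contains spaces, parentheses and semicolons, it must reach Scrapy as a single argument. The following sketch, assuming you want to launch the same command from a Python script, builds the argument list for the fetch command shown above. The helper name build_fetch_command is hypothetical; only the standard library is used.

```python
import shlex


def build_fetch_command(url: str, user_agent: str) -> list[str]:
    """Build the argument list for 'scrapy fetch' with a custom user agent."""
    # The whole NAME=VALUE pair is one list element, so no shell quoting is needed.
    return ["scrapy", "fetch", url, f"--set=USER_AGENT={user_agent}"]


command = build_fetch_command(
    "https://www.simplified.guide",
    "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
)

# shlex.join produces a safely quoted string for pasting into a shell.
print(shlex.join(command))
```

The list can be passed as-is to subprocess.run, which avoids shell quoting problems entirely, while the printed string is suitable for a terminal.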
$ vi simplifiedguide/simplifiedguide/settings.py
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scraper (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'