Scrapy identifies itself through its user-agent when scraping websites. By default, the user-agent has the form Scrapy/<version> (+https://scrapy.org).
While this is the correct and polite way to scrape websites, some website owners block Scrapy and other scrapers outright, and one way around this is to change Scrapy's user-agent. You might also want Scrapy to identify itself as belonging to your company, or the website you're scraping might treat the user-agent as a unique identifier and a means of authentication.
Regardless of the reason, you can change Scrapy's user-agent by following these steps:
1. Open the settings.py file in your Scrapy project folder.

   $ vi scrapyproject/settings.py

2. Find the USER_AGENT option.

   # Crawl responsibly by identifying yourself (and your website) on the user-agent
   #USER_AGENT = 'scraper (+http://www.yourdomain.com)'

3. Set USER_AGENT to the user-agent of your choice.

   USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
You can also change the user-agent only for a particular run of a Scrapy spider or shell by passing the -s USER_AGENT option. It supersedes the value in the configuration file.
To manually change the user-agent of your spider, use the above option followed by your user-agent of choice, as in the following example:
$ scrapy shell --help | grep -A1 set\=
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)

$ scrapy shell -s USER_AGENT='myuseragentname' www.example.com
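The way the -s option supersedes the configuration file can be pictured with a small lookup model. This is only an illustrative sketch using Python's standard library, not Scrapy's actual settings implementation. The dictionary names are made up for the example, and the values are the ones used earlier in this article.

```python
from collections import ChainMap

# Illustrative layers only; Scrapy manages its settings internally.
default_settings = {"USER_AGENT": "Scrapy/<version> (+https://scrapy.org)"}
project_settings = {"USER_AGENT": "scraper (+http://www.yourdomain.com)"}  # settings.py
cmdline_settings = {"USER_AGENT": "myuseragentname"}  # -s USER_AGENT=...

# ChainMap returns the value from the first mapping that contains the key,
# so the command-line layer is consulted before settings.py and the default.
effective = ChainMap(cmdline_settings, project_settings, default_settings)
print(effective["USER_AGENT"])

# Without a command-line override, the settings.py value is used instead.
without_override = ChainMap({}, project_settings, default_settings)
print(without_override["USER_AGENT"])
```

The first lookup yields the command-line value and the second falls back to the settings.py value, which mirrors the behaviour described above.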