By default, Scrapy identifies itself in the user-agent when scraping websites. To be exact, the user-agent takes the form Scrapy/<version> (+https://scrapy.org).

While this is the correct and polite way to scrape websites, some website owners outright block Scrapy and other scrapers, and one way to overcome this is to change Scrapy's user-agent. You might also want Scrapy to identify itself as coming from your company, or the website you're scraping might treat the user-agent as a unique identifier and a means of authentication.

Regardless of the reason, you can change Scrapy's user-agent by following these steps:

  1. Edit Scrapy's configuration file using your favorite editor from your Scrapy project folder.
    $ vi scrapyproject/settings.py
  2. Search for the USER_AGENT option.
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'scraper (+http://www.yourdomain.com)'
  3. Uncomment and set the value to the user-agent of your choice.
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
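
If you would rather script the edit than make it by hand, the steps above can be sketched in Python roughly as follows. This is only a minimal sketch using the standard library: the helper name set_user_agent is hypothetical (it is not part of Scrapy), and it assumes settings.py contains at most one USER_AGENT line in the commented or uncommented form shown in step 2.

```python
import re

# Matches a USER_AGENT assignment at the start of a line, commented out or not.
# [ \t]* is used instead of \s* so the match never spans more than one line.
_USER_AGENT_LINE = re.compile(r"^#?[ \t]*USER_AGENT[ \t]*=.*$", re.MULTILINE)


def set_user_agent(settings_text: str, user_agent: str) -> str:
    """Return settings_text with USER_AGENT uncommented and set to user_agent.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    new_line = f"USER_AGENT = {user_agent!r}"
    if _USER_AGENT_LINE.search(settings_text):
        # Replace only the first match. A function replacement avoids
        # backslash-escape processing of the new value.
        return _USER_AGENT_LINE.sub(lambda _match: new_line, settings_text, count=1)
    # No USER_AGENT line found: append one at the end of the file.
    separator = "" if settings_text.endswith("\n") or not settings_text else "\n"
    return settings_text + separator + new_line + "\n"


template = (
    "# Crawl responsibly by identifying yourself (and your website) on the user-agent\n"
    "#USER_AGENT = 'scraper (+http://www.yourdomain.com)'\n"
)
print(set_user_agent(template, "mybot (+https://www.example.com)"))
```

To apply it to a real project, you would read the settings.py from step 1, pass its text through the function, and write the result back.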

You can also change the user-agent for a single run of a Scrapy spider or shell with the -s USER_AGENT option. A value set this way supersedes the one in the configuration file.

To change the user-agent for a single run, pass the option above followed by the user-agent of your choice, as in the following example:

$ scrapy shell --help | grep -A1 set\=
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
$ scrapy shell -s USER_AGENT='myuseragentname' www.example.com
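
The same -s option applies when running a spider with scrapy crawl. If you launch spiders from a Python script, the command line could be assembled along these lines. This is a sketch under assumptions: the function name build_crawl_command and the spider name myspider are placeholders for illustration and are not part of Scrapy.

```python
import shlex
from typing import List


def build_crawl_command(spider_name: str, user_agent: str) -> List[str]:
    """Build the argument list for `scrapy crawl` with a one-off USER_AGENT.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # -s NAME=VALUE overrides the setting for this run only.
    return ["scrapy", "crawl", spider_name, "-s", f"USER_AGENT={user_agent}"]


# "myspider" is a placeholder spider name.
command = build_crawl_command("myspider", "myuseragentname")

# shlex.join quotes arguments where needed, so a user-agent containing
# spaces or parentheses can still be pasted into a shell safely.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) from inside the project folder would start the crawl. Keeping the arguments as a list avoids shell quoting problems with longer browser-style user-agents.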