A user agent is a string that a client, such as a browser, uses to identify itself to a web server. It is sent in a request header with every HTTP request. By default, Scrapy identifies itself as follows:

Scrapy/<version> (+https://scrapy.org)
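
To make the header concrete, here is a minimal sketch that uses only Python's standard library rather than Scrapy. It starts a throwaway local server that echoes back whatever User-Agent it receives, then sends one request with a custom value. The names EchoUserAgentHandler and fetch_reported_user_agent are made up for this illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgentHandler(BaseHTTPRequestHandler):
    """Reply to every GET with the User-Agent header the client sent."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def fetch_reported_user_agent(user_agent):
    """Send one request to a temporary local server and return the
    User-Agent value that the server saw."""
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgentHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_reported_user_agent("custom user agent string"))
```

The server sees exactly the string the client chose to send, which is why changing it on the client side is enough to change how the client is identified.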

The web server can then be configured to respond differently based on the user agent string. A request from a mobile device, for example, could be served mobile-specific content. Some web servers, however, are configured to block web scraping traffic altogether, which is a problem when using Scrapy.
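
The server-side decision described above can be sketched as a simple substring check. This is only an illustration of the idea, not how any particular web server implements it; the marker lists, the function name choose_response, and the sample user agent strings (including the Scrapy version number) are assumptions made for the example.

```python
# Hypothetical markers; real servers use their own, usually longer, rule sets.
BLOCKED_MARKERS = ("scrapy", "python-urllib", "curl")
MOBILE_MARKERS = ("mobile", "android", "iphone")


def choose_response(user_agent):
    """Return "blocked", "mobile", or "desktop" for a user agent string."""
    ua = user_agent.lower()
    # Scraper checks come first so a blocked client never gets content.
    if any(marker in ua for marker in BLOCKED_MARKERS):
        return "blocked"
    if any(marker in ua for marker in MOBILE_MARKERS):
        return "mobile"
    return "desktop"


if __name__ == "__main__":
    samples = [
        "Scrapy/2.11.0 (+https://scrapy.org)",  # example version number
        "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 "
        "Mobile/15E148 Safari/604.1",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/61.0.3163.100 Safari/537.36",
    ]
    for sample in samples:
        print(choose_response(sample), "<-", sample)
```

A check like this only looks at what the client claims to be, so a scraper that sends a browser-style string falls through to the normal content branch.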

One way to avoid the issue is to change Scrapy's user agent string so that it identifies itself as a regular web browser.

Steps to change user agent for Scrapy:

  1. Fetch a website normally using the scrapy fetch command.
    $ scrapy fetch https://www.example.com

    This also works with scrapy shell or any other way of running Scrapy.

  2. Use the set option to change the USER_AGENT value for the fetch request.
    $ scrapy fetch https://www.example.com --set=USER_AGENT="custom user agent string"
  3. Open Scrapy's configuration file using your favorite text editor.
    $ vi scrapyproject/settings.py
  4. Search for the USER_AGENT option.
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'scraper (+http://www.yourdomain.com)'
  5. Remove the initial # to uncomment the line and set the value to the user-agent of your choice.
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'