Creating a Scrapy project gives a crawler a predictable home for spiders, settings, exported items, and middleware changes, which matters once a scrape needs more than one experimental file. A proper project layout also makes it easier to keep crawling logic, output rules, and site-specific settings together as the crawl grows.
The scrapy startproject command creates a top-level working directory with scrapy.cfg plus a Python package that contains settings.py, items.py, pipelines.py, middlewares.py, and the spiders module. After that scaffold exists, project-aware commands such as scrapy settings, scrapy genspider, and scrapy crawl use the new directory as their configuration root.
The project name becomes both the directory name and the Python package name, so it should use letters, numbers, and underscores instead of spaces or hyphens. Scrapy must already be installed before starting, and the generated settings.py enables ROBOTSTXT_OBEY by default, which means new projects respect target-site robots.txt rules until that setting is changed.
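Because the project name must double as a Python package name, a quick way to vet a candidate is Python's own identifier rules. The helper below is a hypothetical sketch, not part of Scrapy:

```python
import keyword

def is_valid_project_name(name: str) -> bool:
    # A Scrapy project name becomes a Python package name, so it must be
    # a valid identifier and must not shadow a reserved keyword.
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_valid_project_name("catalogbot"))     # True
print(is_valid_project_name("price_monitor"))  # True
print(is_valid_project_name("price-monitor"))  # False: hyphen
print(is_valid_project_name("2024crawl"))      # False: leading digit
```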
Related: How to install Scrapy using pip
Related: How to create a Scrapy spider
Steps to create a Scrapy project:
- Change to the directory that will hold the new Scrapy project.
$ cd /home/user/sg-work
- Create the project scaffold with the project name that will become the working directory and Python package.
$ scrapy startproject catalogbot
New Scrapy project 'catalogbot', created in:
    /home/user/sg-work/catalogbot

You can start your first spider with:
    cd catalogbot
    scrapy genspider example example.com
Project names become importable package names, so spaces and hyphens create invalid or awkward module names.
- Change into the new project directory before running project-aware Scrapy commands.
$ cd catalogbot
Commands such as scrapy settings, scrapy genspider, and scrapy crawl expect to run from the project root where scrapy.cfg exists.
- List the generated files to confirm that the project scaffold includes the main package, settings, and spider module.
$ find . -maxdepth 2 -print
.
./scrapy.cfg
./catalogbot
./catalogbot/spiders
./catalogbot/__init__.py
./catalogbot/middlewares.py
./catalogbot/settings.py
./catalogbot/items.py
./catalogbot/pipelines.py
scrapy.cfg points the CLI at the project settings package, while the inner catalogbot directory holds the code that will be imported during crawls.
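For reference, the generated scrapy.cfg is a short INI file; in a typical Scrapy release it looks roughly like the excerpt below (the exact sections can vary by version):

```ini
# scrapy.cfg (generated)
[settings]
default = catalogbot.settings
```

The `default` key is what lets the CLI locate the project's settings module when commands run from this directory.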
- Read the configured bot name to confirm that Scrapy is loading the new project's settings.
$ scrapy settings --get BOT_NAME
catalogbot
If this command fails outside the project root, change back to the directory that contains scrapy.cfg.
- Read the spider module path to confirm where new spider files will be created.
$ scrapy settings --get NEWSPIDER_MODULE
catalogbot.spiders
scrapy genspider uses this module path when it writes a new spider skeleton.
Related: How to create a Scrapy spider
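As an illustration, running scrapy genspider example example.com inside the project writes catalogbot/spiders/example.py with a skeleton roughly like the following (field values and the URL scheme depend on the Scrapy version):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        pass
```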
- Check the generated robots policy before adding crawl targets or request logic.
$ scrapy settings --get ROBOTSTXT_OBEY
True
New projects enable ROBOTSTXT_OBEY in the generated settings.py file even though Scrapy's historical fallback default is False.
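Concretely, the robots policy comes from the generated settings.py rather than Scrapy's built-in defaults; the relevant generated lines look roughly like this (exact comments and formatting vary by version):

```python
# catalogbot/settings.py (excerpt of generated values)
BOT_NAME = "catalogbot"

SPIDER_MODULES = ["catalogbot.spiders"]
NEWSPIDER_MODULE = "catalogbot.spiders"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
```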
Notes
- Use a project name that can be imported cleanly in Python, such as catalogbot or price_monitor, because the same name appears in BOT_NAME, NEWSPIDER_MODULE, and SPIDER_MODULES.
- Keep one project directory per crawler or site family when settings, pipelines, middleware, or item models differ, and use separate spiders inside that project only when they can reasonably share the same configuration base.
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
