Keeping images alongside scraped item data makes product catalogs, archives, and machine-learning datasets reproducible without relying on the source site staying online. Scrapy's ImagesPipeline automates image fetching and storage so spiders only need to extract image URLs.

ImagesPipeline reads image URLs from the image_urls field, schedules media requests through Scrapy's downloader, and saves image files under the directory configured by IMAGES_STORE. After downloads complete, the pipeline populates the images field with metadata such as the original URL, the relative storage path (for example full/<sha1>.jpg), and a checksum for integrity tracking.

The pipeline processes images through Pillow and converts them to JPEG in RGB mode by default, so Pillow must be available in the runtime environment. Large image sets can consume disk and bandwidth quickly, and aggressive concurrency can trigger hotlink protection or rate limits, so keep crawl settings reasonable and store images on a volume with enough space. Image URLs should be absolute (or converted using response.urljoin()) so downloads do not fail on relative paths.
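The relative-to-absolute conversion that response.urljoin() performs follows standard URL resolution, which can be previewed with the standard library alone. A minimal sketch (the URLs below are illustrative, not part of the project):

```python
from urllib.parse import urljoin

# Base URL of the page being parsed (illustrative).
page_url = "http://app.internal.example:8000/gallery/"

# A root-relative src resolves against the host root...
print(urljoin(page_url, "/images/gallery-1.png"))
# http://app.internal.example:8000/images/gallery-1.png

# ...while a page-relative src resolves against the current directory.
print(urljoin(page_url, "thumbs/gallery-1.png"))
# http://app.internal.example:8000/gallery/thumbs/gallery-1.png
```

Passing either form through urljoin() before putting it in image_urls keeps the pipeline from issuing requests for unresolvable relative paths.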

Steps to download images with Scrapy ImagesPipeline:

  1. Open the project's item definitions file.
    $ vi images_demo/items.py
  2. Define an item with image_urls and images fields.
    import scrapy
     
    class GalleryItem(scrapy.Item):
        title = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()
  3. Open the Scrapy settings file.
    $ vi images_demo/settings.py
  4. Set IMAGES_STORE to the directory used for downloaded images.
    IMAGES_STORE = "/root/sg-work/images-store"

    Files are stored under full/ inside IMAGES_STORE.

  5. Enable ImagesPipeline in ITEM_PIPELINES.
    ITEM_PIPELINES = {
        "scrapy.pipelines.images.ImagesPipeline": 1,
    }

    ImagesPipeline requires Pillow; if Pillow is not installed, the pipeline cannot process or store images.

  6. Create the local directory configured in IMAGES_STORE.
    $ mkdir -p /root/sg-work/images-store
  7. Open the spider that extracts image URLs.
    $ vi images_demo/spiders/gallery.py
  8. Yield items that set image_urls to a list of absolute image URLs.
    import scrapy
    from images_demo.items import GalleryItem
     
    class GallerySpider(scrapy.Spider):
        name = "gallery"
        start_urls = ["http://app.internal.example:8000/gallery/"]
     
        def parse(self, response):
            for card in response.css("figure"):
                item = GalleryItem()
                item["title"] = card.css("figcaption::text").get()
                item["image_urls"] = [
                    response.urljoin(url)
                    for url in card.css("img::attr(src)").getall()
                ]
                yield item

    image_urls must always be a list, even when an item has only one image.

  9. Run the spider and write items to a JSON feed to capture downloaded image metadata.
    $ scrapy crawl gallery -O gallery.json
    2026-01-01 09:49:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://files.example.net:8000/images/gallery-2.png> referred in <None>
    2026-01-01 09:49:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://files.example.net:8000/images/gallery-1.png> referred in <None>
    ##### snipped #####

    Stored filenames default to the SHA-1 hash of the image request URL, saved as full/<sha1>.jpg.
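Because the default name is derived from the request URL, the stored path can be predicted before crawling. A sketch that mirrors the default naming scheme using only hashlib (default_image_path is a hypothetical helper, not a Scrapy API):

```python
import hashlib

def default_image_path(url: str) -> str:
    # Mirrors ImagesPipeline's default naming: the SHA-1 hexdigest
    # of the request URL, stored under full/ with a .jpg extension.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"

print(default_image_path("http://files.example.net:8000/images/gallery-2.png"))
```

This is useful for de-duplication checks or for locating a specific image in IMAGES_STORE without re-reading the feed.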

  10. Inspect the exported items to confirm the images field includes the local storage path.
    $ head -n 18 gallery.json
    [
    {"title": "Gallery Image 2", "image_urls": ["http://files.example.net:8000/images/gallery-2.png"], "images": [{"url": "http://files.example.net:8000/images/gallery-2.png", "path": "full/ff26f717928f6093552dcc2faa024598524121cd.jpg", "checksum": "2775f338c469b19c338c4e0ea410271c", "status": "downloaded"}]},
    {"title": "Gallery Image 1", "image_urls": ["http://files.example.net:8000/images/gallery-1.png"], "images": [{"url": "http://files.example.net:8000/images/gallery-1.png", "path": "full/24b14b0355f3bc22f1211a1baefd42641792246f.jpg", "checksum": "2775f338c469b19c338c4e0ea410271c", "status": "downloaded"}]}
    ]
  11. Confirm image files exist in the IMAGES_STORE directory.
    $ ls /root/sg-work/images-store/full
    24b14b0355f3bc22f1211a1baefd42641792246f.jpg
    ff26f717928f6093552dcc2faa024598524121cd.jpg