Item loaders keep field cleanup close to the item schema so a spider callback can stay focused on selectors and link traversal. That becomes useful as soon as one field needs repeatable trimming, prefix cleanup, or URL normalization across more than one response.
In current Scrapy, ItemLoader collects values added through add_css(), add_xpath(), or add_value() into per-field lists, then applies each field's output processor when load_item() builds the finished item. Scrapy's loader class extends the underlying itemloaders library, so the supported import path is from scrapy.loader import ItemLoader, while processors such as MapCompose and TakeFirst come from itemloaders.processors.
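That collect-then-process flow can be sketched with the standard library alone. The names below (MiniLoader, map_compose, take_first) are illustrative stand-ins that mimic the semantics of itemloaders' MapCompose and TakeFirst; they are not Scrapy's implementation.

```python
def map_compose(*funcs):
    """Apply each function to every value in turn, dropping None results,
    mimicking MapCompose semantics."""
    def process(values):
        for func in funcs:
            values = [r for v in values if (r := func(v)) is not None]
        return values
    return process

def take_first(values):
    """Return the first non-empty value, mimicking TakeFirst semantics."""
    return next((v for v in values if v is not None and v != ""), None)

class MiniLoader:
    """Toy loader: input processor per added value, per-field accumulation,
    output processor once at item-build time."""
    def __init__(self, field_in, field_out):
        self.field_in = field_in      # input processor per field
        self.field_out = field_out    # output processor per field
        self.collected = {}           # per-field list of collected values

    def add_value(self, field, raw_values):
        processed = self.field_in[field](raw_values)
        self.collected.setdefault(field, []).extend(processed)

    def load_item(self):
        return {f: self.field_out[f](vals) for f, vals in self.collected.items()}

loader = MiniLoader(
    field_in={"label": map_compose(str.strip)},
    field_out={"label": take_first},
)
loader.add_value("label", ["  My image 1  ", "a later value"])
print(loader.load_item())  # {'label': 'My image 1'}
```

Every add_value() call runs the input processor immediately, but the output processor only sees the accumulated list when the item is built, which is why TakeFirst can collapse several additions into one value.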
Loaders do not fix weak selectors or missing fields on their own. Keep processors small, use field-specific cleanup only where the field genuinely needs it, and confirm the resulting item on one real response before moving the loader into a wider crawl.
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
Steps to use Item Loaders in Scrapy:
- Create a one-file spider that defines the item, the loader, and one loader-driven callback against Scrapy's sample page.
- loader_spider.py
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


def normalize_label(value: str) -> str:
    return value.removeprefix("Name:").strip()


class ImageLinkItem(scrapy.Item):
    label = scrapy.Field()
    href = scrapy.Field()


class ImageLinkLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

    label_in = MapCompose(str.strip, normalize_label)


class LoaderSpider(scrapy.Spider):
    name = "loader"
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }
    start_urls = [
        (
            "https://docs.scrapy.org/en/latest/_static/"
            "selectors-sample1.html"
        ),
    ]

    def parse(self, response):
        for link in response.css("#images a"):
            loader = ImageLinkLoader(
                item=ImageLinkItem(),
                selector=link,
            )
            loader.add_css("label", "::text")
            loader.add_css(
                "href",
                "::attr(href)",
                MapCompose(response.urljoin),
            )
            yield loader.load_item()
label_in overrides the default input processor for the label field only, while the per-call MapCompose(response.urljoin) converts each extracted href into an absolute URL.
Related: How to define item fields in Scrapy
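For HTML responses, response.urljoin resolves the given URL against the response URL (honoring any <base> tag), which matches the behavior of the standard library's urllib.parse.urljoin in the common case, so the normalization can be previewed without running a crawl:

```python
from urllib.parse import urljoin

# The sample page the spider above crawls.
base = "https://docs.scrapy.org/en/latest/_static/selectors-sample1.html"

# A relative href is resolved against the page URL...
print(urljoin(base, "image1.html"))
# https://docs.scrapy.org/en/latest/_static/image1.html

# ...while an already-absolute href passes through unchanged, which is
# why the sample page's http://example.com links appear as-is in the output.
print(urljoin(base, "http://example.com/image1.html"))
# http://example.com/image1.html
```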
- Run the spider and write the normalized items to a JSON file.
$ scrapy runspider loader_spider.py -O image-links.json
##### snipped #####
2026-04-22 06:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
{'href': 'http://example.com/image1.html', 'label': 'My image 1'}
2026-04-22 06:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
{'href': 'http://example.com/image2.html', 'label': 'My image 2'}
##### snipped #####
2026-04-22 06:44:52 [scrapy.extensions.feedexport] INFO: Stored json feed (5 items) in: image-links.json
scrapy runspider is useful for a quick loader test because the Item, loader, and spider can stay in one file until the selectors and processors are stable.
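The exported feed can also be checked programmatically. The validate_items helper below is an illustrative sketch, not part of Scrapy; it flags labels that kept the Name: prefix and hrefs that lack a URL scheme:

```python
import json
from urllib.parse import urlparse

def validate_items(items):
    """Return a list of problems found in exported loader items
    (illustrative helper, not a Scrapy API)."""
    problems = []
    for item in items:
        if item["label"].startswith("Name:"):
            problems.append(f"label not normalized: {item['label']!r}")
        if not urlparse(item["href"]).scheme:
            problems.append(f"href not absolute: {item['href']!r}")
    return problems

# Against the real export: items = json.load(open("image-links.json"))
sample = json.loads(
    '[{"label": "My image 1", "href": "http://example.com/image1.html"},'
    ' {"label": "Name: raw", "href": "image2.html"}]'
)
print(validate_items(sample))  # both problems from the second item are reported
```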
- Open the export and confirm the label field lost the Name: prefix while the href field stayed absolute.
$ cat image-links.json
[
{"label": "My image 1", "href": "http://example.com/image1.html"},
{"label": "My image 2", "href": "http://example.com/image2.html"},
{"label": "My image 3", "href": "http://example.com/image3.html"},
{"label": "My image 4", "href": "http://example.com/image4.html"},
{"label": "My image 5", "href": "http://example.com/image5.html"}
]
TakeFirst() suits single-value fields such as label or href, but repeated fields such as tags or multiple links should keep a list-oriented output processor instead of collapsing to one value.
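itemloaders ships an Identity processor for exactly that multi-valued case. The contrast can be sketched with plain callables that mirror the two behaviors (illustrative, not the library's code; the ArticleLoader fields in the comment are hypothetical):

```python
def take_first(values):
    """Mimics TakeFirst: collapse the collected list to one value."""
    return next((v for v in values if v is not None and v != ""), None)

def identity(values):
    """Mimics Identity: keep every collected value."""
    return values

collected_tags = ["python", "scrapy", "loaders"]

print(take_first(collected_tags))  # python
print(identity(collected_tags))   # ['python', 'scrapy', 'loaders']

# In a real loader class the choice is made per field, e.g.:
#   class ArticleLoader(ItemLoader):
#       title_out = TakeFirst()   # single-value field
#       tags_out = Identity()     # keep every collected tag
```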
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
