Event calendars often expose just enough information on the listing page to attract clicks, while the exact date, venue, registration link, or summary lives on each event detail page. Following those detail pages is what turns a calendar scrape into data that can drive reminders, search indexes, or deduplicated event feeds.
In Scrapy, the listing callback extracts event-detail URLs and month-to-month pagination links from each Response, then schedules those requests with response.follow(). The detail callback extracts the title, the raw datetime attribute from the page’s time element, the location, and the registration URL so the exported item keeps both the human-facing page and the machine-readable schedule value.
Calendars are prone to duplicate listings across month views, mixed time zones, and JavaScript-driven event cards that do not exist in the first HTML response. Confirm selectors in scrapy shell before hard-coding them, keep the crawl rate polite, and switch to the underlying JSON or XHR endpoint when the shell shows only placeholder markup instead of real events.
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
Steps to scrape event calendars with Scrapy:
- Create a new Scrapy project for the calendar crawler.
$ scrapy startproject event_calendar
New Scrapy project 'event_calendar', using template directory '##### snipped #####', created in:
    /home/user/event_calendar

You can start your first spider with:
    cd event_calendar
    scrapy genspider example example.com
- Change to the new project directory.
$ cd event_calendar
- Generate a spider scaffold for the calendar host.
$ scrapy genspider events events.example.net
Created spider 'events' using template 'basic' in module:
  event_calendar.spiders.events
- Probe the calendar listing page in scrapy shell to confirm the event-link and next-page selectors.
$ scrapy shell "https://events.example.net/calendar/"
##### snipped #####
>>> response.css("article.event-card a.event-link::attr(href)").getall()
['/calendar/events/devops-summit.html', '/calendar/events/crawl-workshop.html']
>>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()
'/calendar/page-2.html'
Related: How to use Scrapy shell
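Because the extracted hrefs are relative, it can help to confirm how they resolve before handing them to response.follow(); continuing the same shell session with response.urljoin() is one way to check, sketched here against the URLs returned above.

>>> response.urljoin('/calendar/events/devops-summit.html')
'https://events.example.net/calendar/events/devops-summit.html'
>>> response.urljoin('/calendar/page-2.html')
'https://events.example.net/calendar/page-2.html'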
- Probe one event detail page in scrapy shell to confirm the title, machine-readable time, and registration selectors.
$ scrapy shell "https://events.example.net/calendar/events/devops-summit.html"
##### snipped #####
>>> response.css("h1::text").get()
'DevOps Summit'
>>> response.css("time::attr(datetime)").get()
'2026-05-14T09:00:00-04:00'
>>> response.css("a.register::attr(href), a[href*='register']::attr(href)").get()
'/calendar/register/devops-summit.html'
If the page exposes a datetime attribute, prefer it over scraping only the formatted clock text because it preserves the source timezone offset.
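A quick way to confirm that the offset survives downstream is to parse the captured value with Python's standard library; this is a standalone check outside the spider, using the datetime string from the shell session above.

from datetime import datetime, timezone

# Parse the machine-readable value captured above; the -04:00 offset is kept.
start = datetime.fromisoformat("2026-05-14T09:00:00-04:00")
print(start.tzinfo)                    # UTC-04:00
print(start.astimezone(timezone.utc))  # 2026-05-14 13:00:00+00:00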
- Replace the generated spider with a two-stage crawl that follows each event card into its detail page.
import scrapy


def join_clean_text(texts):
    # Collapse the summary's nested text nodes into one whitespace-normalized string.
    return " ".join(text.strip() for text in texts if text.strip())


class EventsSpider(scrapy.Spider):
    name = "events"
    allowed_domains = ["events.example.net"]
    start_urls = ["https://events.example.net/calendar/"]
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Stage one: follow every event card into its detail page.
        for href in response.css("article.event-card a.event-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)

        # Follow month-to-month pagination back into this same callback.
        next_href = response.css(
            "a[rel='next']::attr(href), a.next::attr(href)"
        ).get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_event(self, response):
        # Stage two: extract the item fields from the event detail page.
        registration_href = response.css(
            "a.register::attr(href), a[href*='register']::attr(href)"
        ).get()
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "start_time": response.css("time::attr(datetime)").get(default="").strip(),
            "location": response.css(".event-location::text").get(default="").strip(),
            "summary": join_clean_text(
                response.css(".event-summary *::text").getall()
            ),
            "registration_url": (
                response.urljoin(registration_href) if registration_href else ""
            ),
            "url": response.url,
        }
Projects created by current versions of scrapy startproject already set ROBOTSTXT_OBEY = True and FEED_EXPORT_ENCODING = "utf-8" in settings.py, so the spider only adds crawl-rate tuning here.
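If the crawl-rate tuning should apply to every spider in the project rather than only this one, the same keys can live in settings.py instead of custom_settings; a minimal sketch of that block, assuming the generated defaults stay otherwise untouched.

# event_calendar/settings.py (excerpt)
ROBOTSTXT_OBEY = True            # already set by the project template
FEED_EXPORT_ENCODING = "utf-8"   # already set by recent project templates

# Project-wide crawl-rate tuning, equivalent to the spider's custom_settings
AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2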
- Update the example domain, start URL, and CSS selectors to match the real calendar markup.
If scrapy shell returns only layout scaffolding or empty event cards, the calendar is likely populated by JavaScript, and the underlying JSON or XHR endpoint is usually a better target than the rendered DOM.
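When the browser's network tab does reveal a JSON endpoint behind the calendar, the detail-following spider can usually be replaced with a much simpler one that reads the payload directly. The sketch below is illustrative only: the /api/events path, the page query parameter, and the events, start, url, and next field names are assumptions, not taken from the example site.

import scrapy


class EventsApiSpider(scrapy.Spider):
    # Hypothetical JSON-endpoint spider; endpoint path and field names are assumed.
    name = "events_api"
    allowed_domains = ["events.example.net"]
    start_urls = ["https://events.example.net/api/events?page=1"]

    def parse(self, response):
        payload = response.json()
        for event in payload.get("events", []):
            yield {
                "title": event.get("title"),
                "start_time": event.get("start"),
                "url": response.urljoin(event.get("url", "")),
            }
        next_page = payload.get("next")
        if next_page:
            yield response.follow(next_page, callback=self.parse)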
- Run the spider with JSON feed export enabled.
$ scrapy crawl events -O events.json
##### snipped #####
2026-04-16 05:55:51 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: events.json
Use -O to replace the existing file on each run, or -o to append items to the end of the target feed. Appending to a file that already holds a complete JSON array leaves it invalid as JSON, so -O is the safer choice for this format, or switch to the jsonlines format when appending is needed.
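The export target can also be fixed in the project instead of being passed on every run; a minimal sketch using the FEEDS setting, mirroring the events.json output used in this step.

# event_calendar/settings.py (excerpt)
# Equivalent of running `scrapy crawl events -O events.json`
FEEDS = {
    "events.json": {
        "format": "json",
        "encoding": "utf-8",
        "overwrite": True,   # behave like -O rather than -o
    },
}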
- Print the first exported event to confirm the detail fields were written to the feed.
$ python3 -c "import json; data=json.load(open('events.json', encoding='utf-8')); print(data[0])"
{'title': 'Crawl Workshop', 'start_time': '2026-05-20T13:30:00-04:00', 'location': 'Remote', 'summary': 'Selector debugging session. Feed export review.', 'registration_url': 'https://events.example.net/calendar/register/crawl-workshop.html', 'url': 'https://events.example.net/calendar/events/crawl-workshop.html'}
Blank start_time values or repeated listing URLs usually mean the detail callback is still scraping the calendar index instead of the event page.
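To check the whole feed rather than only the first item, a short script can count events that were exported without a machine-readable start time; a sketch that assumes the events.json layout shown above.

# check_feed.py - list exported events missing a start_time value
import json

with open("events.json", encoding="utf-8") as feed:
    events = json.load(feed)

missing = [event["url"] for event in events if not event.get("start_time")]
print(f"{len(missing)} of {len(events)} events have no start_time")
for url in missing:
    print("  missing:", url)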
- Compare the total item count with the number of unique event URLs to catch repeated month entries.
$ python3 -c "import json; data=json.load(open('events.json', encoding='utf-8')); print('items:', len(data)); print('unique_urls:', len({item['url'] for item in data}))"
items: 3
unique_urls: 3
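If the two counts diverge because the same event is linked from several month views, one option is to drop repeats at crawl time with an item pipeline keyed on the detail URL; a minimal sketch, assuming the item fields used by this spider. Scrapy's scheduler already filters duplicate request URLs by default, so this pipeline is a second line of defense for cases such as redirects that land different listing links on the same detail page.

# event_calendar/pipelines.py (sketch)
from scrapy.exceptions import DropItem


class DuplicateEventPipeline:
    """Drop events whose detail URL has already been seen in this crawl."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate event: {url}")
        self.seen_urls.add(url)
        return item

Enable the pipeline by adding "event_calendar.pipelines.DuplicateEventPipeline": 300 to ITEM_PIPELINES in settings.py.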
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
