Event calendars usually show only the title and teaser on the month view, while the scheduled start time, venue, and registration link live on each event detail page. Following those detail pages is what turns a calendar crawl into structured data that can feed reminders, search, or syndication workflows.
In Scrapy, the listing callback extracts each event detail URL and the next calendar page from the current Response, then schedules those relative links with response.follow(). A separate detail callback extracts the event title, the machine-readable datetime value, the venue, the cleaned summary text, and the registration URL so the exported item keeps both the schedule value and the source page.
Calendar archives often repeat the same event across month views, mix local time text with timezone-aware datetime attributes, or render cards through JavaScript instead of the first HTML response. Verify selectors in scrapy shell before writing the spider, prefer the time[datetime] value when it exists, and switch to the underlying JSON or XHR endpoint when the first response only contains placeholder markup.
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
$ scrapy startproject event_calendar
New Scrapy project 'event_calendar', using template directory '##### snipped #####', created in:
/home/user/event_calendar
You can start your first spider with:
cd event_calendar
scrapy genspider example example.com
$ cd event_calendar
$ scrapy genspider events events.example.net
Created spider 'events' using template 'basic' in module:
  event_calendar.spiders.events
$ scrapy shell "https://events.example.net/calendar/"
##### snipped #####
>>> response.css("article.event-card a.event-link::attr(href)").getall()
['/calendar/events/devops-summit.html', '/calendar/events/crawl-workshop.html']
>>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()
'/calendar/page-2.html'
Related: How to use Scrapy shell
$ scrapy shell "https://events.example.net/calendar/events/devops-summit.html"
##### snipped #####
>>> response.css("h1::text").get()
'DevOps Summit'
>>> response.css("time::attr(datetime)").get()
'2026-05-14T09:00:00-04:00'
>>> response.css("a.register::attr(href), a[href*='register']::attr(href)").get()
'/calendar/register/devops-summit.html'
Prefer the datetime attribute over scraping only the formatted time text because the attribute keeps the source timezone offset.
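Because the attribute is ISO-8601 with an offset, downstream code can parse it with only the standard library instead of guessing the zone from formatted text. A minimal sketch using the sample value from the shell session above:

```python
from datetime import datetime, timezone

# Sample value copied from the time[datetime] attribute in the shell session.
start = datetime.fromisoformat("2026-05-14T09:00:00-04:00")

print(start.hour)  # 9 -- wall-clock hour as published on the page
print(start.astimezone(timezone.utc).isoformat())
# 2026-05-14T13:00:00+00:00 -- normalized for storage or comparison
```

Keeping the raw attribute string in the feed and normalizing at read time preserves both the source offset and a comparable UTC value.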
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
FEED_EXPORT_ENCODING = "utf-8"
Current scrapy startproject templates already seed polite robots.txt handling and UTF-8 feed export for JSON output; DOWNLOAD_DELAY = 1 is added here to keep the crawl to roughly one request per second per domain.
import scrapy


def join_clean_text(texts):
    return " ".join(text.strip() for text in texts if text.strip())


class EventsSpider(scrapy.Spider):
    name = "events"
    allowed_domains = ["events.example.net"]
    start_urls = ["https://events.example.net/calendar/"]

    def parse(self, response):
        for href in response.css("article.event-card a.event-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)
        next_href = response.css(
            "a[rel='next']::attr(href), a.next::attr(href)"
        ).get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_event(self, response):
        registration_href = response.css(
            "a.register::attr(href), a[href*='register']::attr(href)"
        ).get()
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "start_time": response.css("time::attr(datetime)").get(default="").strip(),
            "location": response.css(".event-location::text").get(default="").strip(),
            "summary": join_clean_text(
                response.css(".event-summary *::text").getall()
            ),
            "registration_url": (
                response.urljoin(registration_href) if registration_href else ""
            ),
            "url": response.url,
        }
response.follow() can use the relative event and next-page links directly, while the exported registration_url still needs response.urljoin() because it is being stored as data instead of scheduled as a request.
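response.urljoin() is a thin wrapper around urllib.parse.urljoin with response.url as the base, so the resolution can be checked outside Scrapy. A small sketch using the URLs from the shell session above:

```python
from urllib.parse import urljoin

# Detail-page URL and root-relative href from the shell session.
page_url = "https://events.example.net/calendar/events/devops-summit.html"
registration_href = "/calendar/register/devops-summit.html"

# Root-relative hrefs resolve against the scheme and host of the base URL.
print(urljoin(page_url, registration_href))
# https://events.example.net/calendar/register/devops-summit.html
```

Storing the absolute form means consumers of the feed never need to know which page each href came from.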
If scrapy shell returns only layout scaffolding or empty event cards, the calendar is likely populated by JavaScript, and the underlying JSON or XHR endpoint is usually a stronger target than the rendered DOM.
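In that case the detail extraction becomes plain dictionary handling. A hedged sketch, assuming a hypothetical payload shape with an "events" list (the field names below are illustrative, not taken from any real endpoint); inside a spider the dict would come from response.json():

```python
def parse_event_payload(payload, base_url="https://events.example.net"):
    """Map one hypothetical JSON payload to the item shape the HTML
    spider yields. All field names here are assumptions."""
    items = []
    for event in payload.get("events", []):
        items.append({
            "title": event.get("title", ""),
            "start_time": event.get("start", ""),
            "location": event.get("venue", ""),
            "summary": event.get("description", ""),
            "registration_url": base_url + event.get("registration_path", ""),
            "url": base_url + event.get("path", ""),
        })
    return items


# Illustrative payload mirroring the fields seen on the detail pages.
sample = {"events": [{
    "title": "Crawl Workshop",
    "start": "2026-05-20T13:30:00-04:00",
    "venue": "Remote",
    "description": "Selector debugging session.",
    "registration_path": "/calendar/register/crawl-workshop.html",
    "path": "/calendar/events/crawl-workshop.html",
}]}
print(parse_event_payload(sample)[0]["registration_url"])
# https://events.example.net/calendar/register/crawl-workshop.html
```

Keeping the output shape identical to the HTML spider's items means the rest of the pipeline does not care which source produced them.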
$ scrapy crawl events -O events.json
2026-04-22 07:23:29 [scrapy.core.engine] INFO: Spider opened
2026-04-22 07:23:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://events.example.net/calendar/> (referer: None)
2026-04-22 07:23:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://events.example.net/calendar/page-2.html> (referer: https://events.example.net/calendar/)
2026-04-22 07:23:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://events.example.net/calendar/events/crawl-workshop.html>
{'title': 'Crawl Workshop', 'start_time': '2026-05-20T13:30:00-04:00', 'location': 'Remote', 'summary': 'Selector debugging session. Feed export review.', 'registration_url': 'https://events.example.net/calendar/register/crawl-workshop.html', 'url': 'https://events.example.net/calendar/events/crawl-workshop.html'}
2026-04-22 07:23:44 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: events.json
2026-04-22 07:23:45 [scrapy.core.engine] INFO: Spider closed (finished)
Use -O to replace the previous export file on each run, or -o when the workflow really needs append behavior.
$ python3 -c "import json; data=json.load(open('events.json', encoding='utf-8')); print(data[0])"
{'title': 'Crawl Workshop', 'start_time': '2026-05-20T13:30:00-04:00', 'location': 'Remote', 'summary': 'Selector debugging session. Feed export review.', 'registration_url': 'https://events.example.net/calendar/register/crawl-workshop.html', 'url': 'https://events.example.net/calendar/events/crawl-workshop.html'}
Blank start_time values or repeated listing URLs usually mean the detail callback is still scraping the calendar index instead of the event page.
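A quick field-level check catches that failure mode before the feed reaches anything downstream. A minimal sketch that flags items with a blank required field (sample records are inlined here instead of reading events.json):

```python
def find_blank(items, field):
    """Return the URLs of items whose given field is empty or whitespace."""
    return [item["url"] for item in items if not item.get(field, "").strip()]


items = [
    {"url": "https://events.example.net/calendar/events/crawl-workshop.html",
     "start_time": "2026-05-20T13:30:00-04:00"},
    # A listing page scraped by mistake typically has no time[datetime].
    {"url": "https://events.example.net/calendar/",
     "start_time": ""},
]
print(find_blank(items, "start_time"))
# ['https://events.example.net/calendar/']
```

Any listing URL appearing in that output is a sign the detail callback was attached to the wrong link selector.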
$ python3 -c "import json; data=json.load(open('events.json', encoding='utf-8')); print('items:', len(data)); print('unique_urls:', len({item['url'] for item in data}))"
items: 3
unique_urls: 3
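Because month views can repeat the same event, it is also worth normalizing the exported items when counts and unique URLs diverge. A small sketch that keeps the first item per detail URL, preserving crawl order:

```python
def dedupe_by_url(items):
    """Keep the first occurrence of each detail URL, preserving order."""
    seen = {}
    for item in items:
        seen.setdefault(item["url"], item)
    return list(seen.values())


# Illustrative feed with one event repeated across two month views.
items = [
    {"url": "https://events.example.net/calendar/events/devops-summit.html"},
    {"url": "https://events.example.net/calendar/events/crawl-workshop.html"},
    {"url": "https://events.example.net/calendar/events/devops-summit.html"},
]
print(len(dedupe_by_url(items)))
# 2
```

Scrapy's request dupefilter already avoids fetching the same detail URL twice within one run, so this pass mainly matters for feeds accumulated across runs with -o.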