Event calendars publish schedules for meetups, webinars, classes, and conferences, but the key details are often split between a listing view and individual event pages. Extracting those details into structured data enables search, reminders, deduplication, and downstream analytics without manual copying.
Scrapy crawls the calendar listing URL, selects event links with CSS selectors, follows each event detail page, and yields one structured item per event. Pagination or month navigation can be captured by following a next link so multiple pages of events are collected in a single run.
Calendars frequently mix time zones, recurring occurrences, and repeated listings across months, so storing the original timestamp plus the event URL reduces drift and double-counting. Some calendars render events with JavaScript, which can leave the raw HTML response empty even when a browser shows event cards; scraping the underlying JSON endpoint is often more reliable than scraping the rendered DOM. Keep the crawl polite with throttling and narrow selectors so unrelated site sections are not spidered.
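One way to apply that advice is to treat the event URL plus the original timestamp as the event's identity, so a listing repeated across monthly pages is counted once. A minimal sketch, assuming items shaped like the spider output later in this article (the `url` and `start_time` field names are that assumption):

```python
# Treat (url, original start_time) as an event's identity so the same
# event repeated across monthly listing pages is kept only once.
def dedupe(events):
    seen = set()
    unique = []
    for event in events:
        key = (event.get("url"), event.get("start_time"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(event)
    return unique

events = [
    {"url": "/events/a.html", "start_time": "2026-02-10T09:00:00-05:00"},
    {"url": "/events/a.html", "start_time": "2026-02-10T09:00:00-05:00"},
    {"url": "/events/b.html", "start_time": "2026-03-05T13:30:00-05:00"},
]
unique = dedupe(events)  # the repeated listing is dropped
```

Keying on the original timestamp rather than a normalized one avoids merging distinct occurrences of a recurring event that happen to share a URL pattern.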
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
$ scrapy startproject event_calendar
New Scrapy project 'event_calendar', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
/root/sg-work/event_calendar
$ cd event_calendar
$ scrapy genspider events app.internal.example
Created spider 'events' using template 'basic' in module:
  event_calendar.spiders.events
$ scrapy shell "http://app.internal.example:8000/events/"
>>> response.css("article.event a::attr(href)").getall()
['/events/data-summit.html', '/events/crawl-workshop.html']
>>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()
None
If selectors return empty while a browser shows events, the calendar is likely JavaScript-rendered and scraping the underlying API endpoint is usually required.
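In that case the mapping from API records to items can stay identical to the HTML path. A stdlib-only sketch of that mapping; the payload field names ("title", "start", "url") are assumptions about a hypothetical API, and in a real spider this would run inside a parse callback pointed at the JSON endpoint found in the browser's network tab:

```python
import json
from urllib.parse import urljoin

# Map raw API records to the same item shape the HTML spider yields.
# The "title", "start", and "url" keys are assumptions about the payload;
# inspect the real endpoint's response to confirm them.
def events_from_json(payload, base_url):
    items = []
    for record in json.loads(payload):
        items.append({
            "title": record.get("title"),
            "start_time": record.get("start"),
            "url": urljoin(base_url, record.get("url", "")),
        })
    return items

sample = '[{"title": "Data Summit", "start": "2026-02-10T09:00:00-05:00", "url": "/events/data-summit.html"}]'
items = events_from_json(sample, "http://app.internal.example:8000/")
```

Because the API already returns structured fields, this path skips CSS selectors entirely and tends to break less often than scraping the rendered DOM.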
import scrapy
from datetime import datetime, timezone


class EventsSpider(scrapy.Spider):
    name = "events"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/events/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "AUTOTHROTTLE_ENABLED": True,
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "FEED_EXPORT_ENCODING": "utf-8",
    }

    def parse(self, response):
        # Follow each event card to its detail page.
        for href in response.css("article.event a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)
        # Follow pagination or month navigation, if present.
        next_href = response.css(
            "a[rel='next']::attr(href), a.next::attr(href)"
        ).get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_event(self, response):
        raw_start = self._text(response.css("time::attr(datetime)").get())
        reg_href = response.css(
            "a.register::attr(href), a[href*='register']::attr(href)"
        ).get()
        yield {
            "title": self._text(response.css("h1::text").get()),
            "start_time": raw_start,
            "start_time_utc": self._to_utc_iso(raw_start),
            "location": self._text(response.css(".event-location::text").get()),
            "registration_url": response.urljoin(reg_href) if reg_href else None,
            "url": response.url,
        }

    def _text(self, value):
        if value is None:
            return None
        text = value.strip()
        return text if text else None

    def _to_utc_iso(self, value):
        if value is None:
            return None
        text = value.strip()
        if not text:
            return None
        if text.endswith("Z"):
            text = f"{text[:-1]}+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return None
        if parsed.tzinfo is None:
            return None
        return parsed.astimezone(timezone.utc).isoformat()
Keeping start_time as the original value preserves the site time zone, while start_time_utc normalizes only when an offset is present.
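The normalization behavior can be checked in isolation. This sketch mirrors the spider's _to_utc_iso logic as a standalone function: a trailing "Z" is rewritten to "+00:00" (which older fromisoformat versions cannot parse), offset-aware timestamps are converted to UTC, and naive ones yield None so only the original value is kept:

```python
from datetime import datetime, timezone

# Standalone mirror of the spider's _to_utc_iso: normalize a trailing "Z",
# convert offset-aware timestamps to UTC, return None for naive ones.
def to_utc_iso(value):
    text = value.strip()
    if text.endswith("Z"):
        text = f"{text[:-1]}+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        return None  # naive timestamp: caller keeps only start_time
    return parsed.astimezone(timezone.utc).isoformat()

aware = to_utc_iso("2026-03-05T13:30:00-05:00")  # "2026-03-05T18:30:00+00:00"
naive = to_utc_iso("2026-03-05T13:30:00")        # None
```

Returning None for naive timestamps is deliberate: guessing a time zone would silently shift events, while a missing start_time_utc is easy to flag downstream.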
$ scrapy crawl events -O events.json
2026-01-01 09:45:20 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: events.json
Use -O to overwrite an existing file, or -o to append.
$ head -n 5 events.json
[
{"title": "Crawl Workshop", "start_time": "2026-03-05T13:30:00-05:00", "start_time_utc": "2026-03-05T18:30:00+00:00", "location": "Remote", "registration_url": "http://app.internal.example:8000/events/register/crawl-workshop", "url": "http://app.internal.example:8000/events/crawl-workshop.html"},
{"title": "Data Summit", "start_time": "2026-02-10T09:00:00-05:00", "start_time_utc": "2026-02-10T14:00:00+00:00", "location": "Online", "registration_url": "http://app.internal.example:8000/events/register/data-summit", "url": "http://app.internal.example:8000/events/data-summit.html"}
]
$ python3 -c 'import collections, json; items=json.load(open("events.json", "r", encoding="utf-8")); c=collections.Counter(i.get("url") for i in items if i.get("url")); d=[u for u,n in c.items() if n>1]; print(f"duplicates: {len(d)}"); print("\n".join(d[:10]))'
duplicates: 0