Event calendars publish schedules for meetups, webinars, classes, and conferences, but the key details are often split between a listing view and individual event pages. Extracting those details into structured data enables search, reminders, deduplication, and downstream analytics without manual copying.
Scrapy crawls the calendar listing URL, selects event links with CSS selectors, follows each event detail page, and yields one structured item per event. Pagination or month navigation can be captured by following a next link so multiple pages of events are collected in a single run.
Calendars frequently mix time zones, recurring occurrences, and repeated listings across months, so storing the original timestamp plus the event URL reduces drift and double-counting. Some calendars render events with JavaScript, which can leave the raw HTML response empty even when a browser shows event cards; scraping the underlying JSON endpoint is often more reliable than scraping the rendered DOM. Keep the crawl polite with throttling and narrow selectors so unrelated site sections are not spidered.
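The listing-to-detail flow described above reduces to one callback that follows event links and an optional next link, plus a detail callback. A minimal sketch, assuming the article.event and a[rel='next'] selectors used throughout this guide (the full spider with field extraction is built in the steps below):

import scrapy

class CalendarSketchSpider(scrapy.Spider):
    # Minimal sketch of the listing -> detail -> next-page flow only.
    name = "calendar_sketch"
    start_urls = ["http://app.internal.example:8000/events/"]

    def parse(self, response):
        # Follow each event card to its detail page.
        for href in response.css("article.event a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)
        # Follow pagination or month navigation when present.
        next_href = response.css("a[rel='next']::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_event(self, response):
        yield {"title": response.css("h1::text").get(), "url": response.url}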
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
Steps to scrape event calendars with Scrapy:
- Create a new Scrapy project for the calendar scrape.
$ scrapy startproject event_calendar
New Scrapy project 'event_calendar', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
    /root/sg-work/event_calendar
- Change into the Scrapy project directory.
$ cd event_calendar
- Generate a spider scaffold for the event calendar domain.
$ scrapy genspider events app.internal.example
Created spider 'events' using template 'basic' in module:
  event_calendar.spiders.events
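The generated event_calendar/spiders/events.py is a bare scaffold, roughly like the following depending on the Scrapy version; a later step replaces it with the full spider:

import scrapy

class EventsSpider(scrapy.Spider):
    # Scaffold produced by the 'basic' template; parse() is filled in later.
    name = "events"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example"]

    def parse(self, response):
        pass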
- Use scrapy shell on the calendar listing URL to confirm selectors for event links and navigation.
$ scrapy shell "http://app.internal.example:8000/events/"
>>> response.css("article.event a::attr(href)").getall()
['/events/data-summit.html', '/events/crawl-workshop.html']
>>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()
None
If selectors return empty while a browser shows events, the calendar is likely JavaScript-rendered and scraping the underlying API endpoint is usually required.
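In that case, look for the JSON feed behind the page in the browser's network tab and request it directly. A minimal sketch, assuming a hypothetical /api/events endpoint that returns a JSON array of event objects with title, start, and url keys:

import scrapy

class EventsApiSpider(scrapy.Spider):
    # Sketch for a JavaScript-rendered calendar: scrape the JSON feed that
    # backs the page instead of the rendered DOM. The /api/events path and
    # field names are assumptions; confirm them in the browser's network tab.
    name = "events_api"
    start_urls = ["http://app.internal.example:8000/api/events"]

    def parse(self, response):
        for event in response.json():
            yield {
                "title": event.get("title"),
                "start_time": event.get("start"),
                "url": response.urljoin(event.get("url", "")),
            }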
- Edit event_calendar/spiders/events.py to follow event pages and extract event fields.
import scrapy
from datetime import datetime, timezone


class EventsSpider(scrapy.Spider):
    name = "events"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/events/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "AUTOTHROTTLE_ENABLED": True,
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "FEED_EXPORT_ENCODING": "utf-8",
    }

    def parse(self, response):
        # Follow every event link on the listing page.
        for href in response.css("article.event a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)
        # Follow the next-page or next-month link when present.
        next_href = response.css(
            "a[rel='next']::attr(href), a.next::attr(href)"
        ).get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_event(self, response):
        raw_start = self._text(response.css("time::attr(datetime)").get())
        reg_href = response.css(
            "a.register::attr(href), a[href*='register']::attr(href)"
        ).get()
        yield {
            "title": self._text(response.css("h1::text").get()),
            "start_time": raw_start,
            "start_time_utc": self._to_utc_iso(raw_start),
            "location": self._text(response.css(".event-location::text").get()),
            "registration_url": response.urljoin(reg_href) if reg_href else None,
            "url": response.url,
        }

    def _text(self, value):
        # Strip whitespace and collapse empty strings to None.
        if value is None:
            return None
        text = value.strip()
        return text if text else None

    def _to_utc_iso(self, value):
        # Normalize an ISO 8601 timestamp to UTC; return None when the value
        # is missing, unparseable, or carries no UTC offset.
        if value is None:
            return None
        text = value.strip()
        if not text:
            return None
        if text.endswith("Z"):
            text = f"{text[:-1]}+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return None
        if parsed.tzinfo is None:
            return None
        return parsed.astimezone(timezone.utc).isoformat()
Keeping start_time as the original value preserves the site time zone, while start_time_utc normalizes only when an offset is present.
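A quick standalone check of the same conversion logic, using the offset-aware and naive forms a calendar might publish:

from datetime import datetime, timezone

# Offset-aware timestamps convert cleanly to UTC.
aware = datetime.fromisoformat("2026-03-05T13:30:00-05:00")
print(aware.astimezone(timezone.utc).isoformat())  # 2026-03-05T18:30:00+00:00

# Naive timestamps carry no offset, so _to_utc_iso leaves start_time_utc as None.
naive = datetime.fromisoformat("2026-03-05T13:30:00")
print(naive.tzinfo)  # None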
- Run the spider with JSON feed export enabled.
$ scrapy crawl events -O events.json
2026-01-01 09:45:20 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: events.json
Use -O to overwrite an existing file, or -o to append.
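For scheduled crawls, the same export can be configured once in event_calendar/settings.py through the FEEDS setting instead of passing -O on every run; a minimal sketch:

# event_calendar/settings.py (excerpt): equivalent to "-O events.json",
# with an additional JSON Lines copy that is convenient for streaming tools.
FEEDS = {
    "events.json": {"format": "json", "overwrite": True},
    "events.jsonl": {"format": "jsonlines", "overwrite": True},
}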
- Inspect the exported file to confirm required fields, including datetime format.
$ head -n 5 events.json
[
{"title": "Crawl Workshop", "start_time": "2026-03-05T13:30:00-05:00", "start_time_utc": "2026-03-05T18:30:00+00:00", "location": "Remote", "registration_url": "http://app.internal.example:8000/events/register/crawl-workshop", "url": "http://app.internal.example:8000/events/crawl-workshop.html"},
{"title": "Data Summit", "start_time": "2026-02-10T09:00:00-05:00", "start_time_utc": "2026-02-10T14:00:00+00:00", "location": "Online", "registration_url": "http://app.internal.example:8000/events/register/data-summit", "url": "http://app.internal.example:8000/events/data-summit.html"}
]
- Check for duplicate event URLs in the export to avoid double-counting across months.
$ python3 -c 'import collections, json; items=json.load(open("events.json", "r", encoding="utf-8")); c=collections.Counter(i.get("url") for i in items if i.get("url")); d=[u for u,n in c.items() if n>1]; print(f"duplicates: {len(d)}"); print("\n".join(d[:10]))'
duplicates: 0
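To drop repeated listings during the crawl itself rather than filtering afterwards, a small item pipeline can skip URLs already seen in the run; a minimal sketch for event_calendar/pipelines.py:

# Drop items whose URL was already yielded in this run, so events repeated
# across month pages are counted once.
from scrapy.exceptions import DropItem

class DuplicateEventPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate event URL: {url}")
        if url:
            self.seen_urls.add(url)
        return item

Enable it by adding {"event_calendar.pipelines.DuplicateEventPipeline": 300} to ITEM_PIPELINES in settings.py.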
