Event calendars publish schedules for meetups, webinars, classes, and conferences, but the key details are often split between a listing view and individual event pages. Extracting those details into structured data enables search, reminders, deduplication, and downstream analytics without manual copying.
Scrapy crawls the calendar listing URL, selects event links with CSS selectors, follows each event detail page, and yields one structured item per event. Pagination or month navigation can be captured by following a next link so multiple pages of events are collected in a single run.
Calendars frequently mix time zones, recurring occurrences, and repeated listings across months, so storing the original timestamp plus the event URL reduces drift and double-counting. Some calendars render events with JavaScript, which can leave the raw HTML response empty even when a browser shows event cards; scraping the underlying JSON endpoint is often more reliable than scraping the rendered DOM. Keep the crawl polite with throttling and narrow selectors so unrelated site sections are not spidered.
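The listing-to-detail flow described above reduces to one callback that follows event links and an optional next link, plus a detail callback. A minimal sketch, assuming the article.event and a[rel='next'] selectors used throughout this guide (the full spider with field extraction is built in the steps below):

import scrapy

class CalendarSketchSpider(scrapy.Spider):
    # Minimal sketch of the listing -> detail -> next-page flow only.
    name = "calendar_sketch"
    start_urls = ["http://app.internal.example:8000/events/"]

    def parse(self, response):
        # Follow each event card to its detail page.
        for href in response.css("article.event a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)
        # Follow pagination or month navigation when present.
        next_href = response.css("a[rel='next']::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_event(self, response):
        yield {"title": response.css("h1::text").get(), "url": response.url}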
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
Steps to scrape event calendars with Scrapy:
- Create a new Scrapy project for the calendar scrape.
$ scrapy startproject event_calendar
New Scrapy project 'event_calendar', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
    /root/sg-work/event_calendar
- Change into the Scrapy project directory.
$ cd event_calendar
- Generate a spider scaffold for the event calendar domain.
$ scrapy genspider events app.internal.example
Created spider 'events' using template 'basic' in module:
  event_calendar.spiders.events
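The generated event_calendar/spiders/events.py is a bare scaffold, roughly like the following depending on the Scrapy version; a later step replaces it with the full spider:

import scrapy

class EventsSpider(scrapy.Spider):
    # Scaffold produced by the 'basic' template; parse() is filled in later.
    name = "events"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example"]

    def parse(self, response):
        pass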
- Use scrapy shell on the calendar listing URL to confirm selectors for event links and navigation.
$ scrapy shell "http://app.internal.example:8000/events/"
>>> response.css("article.event a::attr(href)").getall()
['/events/data-summit.html', '/events/crawl-workshop.html']
>>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()
None
If selectors return empty while a browser shows events, the calendar is likely JavaScript-rendered and scraping the underlying API endpoint is usually required.
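In that case, look for the JSON feed behind the page in the browser's network tab and request it directly. A minimal sketch, assuming a hypothetical /api/events endpoint that returns a JSON array of event objects with title, start, and url keys:

import scrapy

class EventsApiSpider(scrapy.Spider):
    # Sketch for a JavaScript-rendered calendar: scrape the JSON feed that
    # backs the page instead of the rendered DOM. The /api/events path and
    # field names are assumptions; confirm them in the browser's network tab.
    name = "events_api"
    start_urls = ["http://app.internal.example:8000/api/events"]

    def parse(self, response):
        for event in response.json():
            yield {
                "title": event.get("title"),
                "start_time": event.get("start"),
                "url": response.urljoin(event.get("url", "")),
            }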
- Edit event_calendar/spiders/events.py to follow event pages and extract event fields.
import scrapy
from datetime import datetime, timezone


class EventsSpider(scrapy.Spider):
    name = "events"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/events/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "AUTOTHROTTLE_ENABLED": True,
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "FEED_EXPORT_ENCODING": "utf-8",
    }

    def parse(self, response):
        # Follow every event link on the listing page.
        for href in response.css("article.event a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)
        # Follow the next-page or next-month link when present.
        next_href = response.css(
            "a[rel='next']::attr(href), a.next::attr(href)"
        ).get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_event(self, response):
        raw_start = self._text(response.css("time::attr(datetime)").get())
        reg_href = response.css(
            "a.register::attr(href), a[href*='register']::attr(href)"
        ).get()
        yield {
            "title": self._text(response.css("h1::text").get()),
            "start_time": raw_start,
            "start_time_utc": self._to_utc_iso(raw_start),
            "location": self._text(response.css(".event-location::text").get()),
            "registration_url": response.urljoin(reg_href) if reg_href else None,
            "url": response.url,
        }

    def _text(self, value):
        # Strip whitespace and collapse empty strings to None.
        if value is None:
            return None
        text = value.strip()
        return text if text else None

    def _to_utc_iso(self, value):
        # Normalize an ISO 8601 timestamp to UTC; return None when the value
        # is missing, unparseable, or carries no UTC offset.
        if value is None:
            return None
        text = value.strip()
        if not text:
            return None
        if text.endswith("Z"):
            text = f"{text[:-1]}+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return None
        if parsed.tzinfo is None:
            return None
        return parsed.astimezone(timezone.utc).isoformat()
Keeping start_time as the original value preserves the site time zone, while start_time_utc normalizes only when an offset is present.
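A quick standalone check of the same conversion logic, using the offset-aware and naive forms a calendar might publish:

from datetime import datetime, timezone

# Offset-aware timestamps convert cleanly to UTC.
aware = datetime.fromisoformat("2026-03-05T13:30:00-05:00")
print(aware.astimezone(timezone.utc).isoformat())  # 2026-03-05T18:30:00+00:00

# Naive timestamps carry no offset, so _to_utc_iso leaves start_time_utc as None.
naive = datetime.fromisoformat("2026-03-05T13:30:00")
print(naive.tzinfo)  # None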
- Run the spider with JSON feed export enabled.
$ scrapy crawl events -O events.json
2026-01-01 09:45:20 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: events.json
Use -O to overwrite an existing file, or -o to append.
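For scheduled crawls, the same export can be configured once in event_calendar/settings.py through the FEEDS setting instead of passing -O on every run; a minimal sketch:

# event_calendar/settings.py (excerpt): equivalent to "-O events.json",
# with an additional JSON Lines copy that is convenient for streaming tools.
FEEDS = {
    "events.json": {"format": "json", "overwrite": True},
    "events.jsonl": {"format": "jsonlines", "overwrite": True},
}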
- Inspect the exported file to confirm required fields, including datetime format.
$ head -n 5 events.json
[
{"title": "Crawl Workshop", "start_time": "2026-03-05T13:30:00-05:00", "start_time_utc": "2026-03-05T18:30:00+00:00", "location": "Remote", "registration_url": "http://app.internal.example:8000/events/register/crawl-workshop", "url": "http://app.internal.example:8000/events/crawl-workshop.html"},
{"title": "Data Summit", "start_time": "2026-02-10T09:00:00-05:00", "start_time_utc": "2026-02-10T14:00:00+00:00", "location": "Online", "registration_url": "http://app.internal.example:8000/events/register/data-summit", "url": "http://app.internal.example:8000/events/data-summit.html"}
]
- Check for duplicate event URLs in the export to avoid double-counting across months.
$ python3 -c 'import collections, json; items=json.load(open("events.json", "r", encoding="utf-8")); c=collections.Counter(i.get("url") for i in items if i.get("url")); d=[u for u,n in c.items() if n>1]; print(f"duplicates: {len(d)}"); print("\n".join(d[:10]))'
duplicates: 0
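To drop repeated listings during the crawl itself rather than filtering afterwards, a small item pipeline can skip URLs already seen in the run; a minimal sketch for event_calendar/pipelines.py:

# Drop items whose URL was already yielded in this run, so events repeated
# across month pages are counted once.
from scrapy.exceptions import DropItem

class DuplicateEventPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate event URL: {url}")
        if url:
            self.seen_urls.add(url)
        return item

Enable it by adding {"event_calendar.pipelines.DuplicateEventPipeline": 300} to ITEM_PIPELINES in settings.py.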
