Event calendars publish schedules for meetups, webinars, classes, and conferences, but the key details are often split between a listing view and individual event pages. Extracting those details into structured data enables search, reminders, deduplication, and downstream analytics without manual copying.

Scrapy crawls the calendar listing URL, selects event links with CSS selectors, follows each event detail page, and yields one structured item per event. Pagination or month navigation can be captured by following a next link so multiple pages of events are collected in a single run.

Calendars frequently mix time zones, recurring occurrences, and repeated listings across months, so storing the original timestamp plus the event URL reduces drift and double-counting. Some calendars render events with JavaScript, which can leave the raw HTML response empty even when a browser shows event cards; scraping the underlying JSON endpoint is often more reliable than scraping the rendered DOM. Keep the crawl polite with throttling and narrow selectors so unrelated site sections are not spidered.

Steps to scrape event calendars with Scrapy:

  1. Create a new Scrapy project for the calendar scrape.
    $ scrapy startproject event_calendar
    New Scrapy project 'event_calendar', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
        /root/sg-work/event_calendar
  2. Change into the Scrapy project directory.
    $ cd event_calendar
  3. Generate a spider scaffold for the event calendar domain.
    $ scrapy genspider events app.internal.example
    Created spider 'events' using template 'basic' in module:
      event_calendar.spiders.events
  4. Use scrapy shell on the calendar listing URL to confirm selectors for event links and navigation.
    $ scrapy shell "http://app.internal.example:8000/events/"
    >>> response.css("article.event a::attr(href)").getall()
    ['/events/data-summit.html', '/events/crawl-workshop.html']
    >>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()
    None

    If these selectors return an empty list while a browser shows event cards, the calendar is likely JavaScript-rendered, and scraping the underlying API endpoint is usually required.
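
    In that case the browser's network tab usually reveals the JSON endpoint the page calls, and that endpoint can be crawled directly. A minimal sketch, assuming a hypothetical /api/events path that returns a JSON list of event objects with title, start, and url keys (the real path and field names must be confirmed in the network tab):

    import scrapy


    class EventsApiSpider(scrapy.Spider):
        name = "events_api"
        allowed_domains = ["app.internal.example"]
        # Hypothetical endpoint; confirm the real path in the browser's network tab.
        start_urls = ["http://app.internal.example:8000/api/events"]

        def parse(self, response):
            # response.json() (Scrapy >= 2.2) parses the JSON body of the API response.
            for event in response.json():
                yield {
                    "title": event.get("title"),
                    "start_time": event.get("start"),
                    "url": response.urljoin(event["url"]) if event.get("url") else None,
                }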

  5. Edit event_calendar/spiders/events.py to follow event pages and extract event fields.
    import scrapy
    from datetime import datetime, timezone
     
     
    class EventsSpider(scrapy.Spider):
        name = "events"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/events/"]
     
        custom_settings = {
            "ROBOTSTXT_OBEY": True,
            "AUTOTHROTTLE_ENABLED": True,
            "DOWNLOAD_DELAY": 1,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
            "FEED_EXPORT_ENCODING": "utf-8",
        }
     
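        # Listing page: follow every event link, then the next page of the calendar if one exists.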
        def parse(self, response):
            for href in response.css("article.event a::attr(href)").getall():
                yield response.follow(href, callback=self.parse_event)
     
            next_href = response.css(
                "a[rel='next']::attr(href), a.next::attr(href)"
            ).get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)
     
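        # Event detail page: extract one structured item per event.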
        def parse_event(self, response):
            raw_start = self._text(response.css("time::attr(datetime)").get())
            reg_href = response.css(
                "a.register::attr(href), a[href*='register']::attr(href)"
            ).get()
     
            yield {
                "title": self._text(response.css("h1::text").get()),
                "start_time": raw_start,
                "start_time_utc": self._to_utc_iso(raw_start),
                "location": self._text(response.css(".event-location::text").get()),
                "registration_url": response.urljoin(reg_href) if reg_href else None,
                "url": response.url,
            }
     
        def _text(self, value):
            if value is None:
                return None
            text = value.strip()
            return text if text else None
     
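        # Normalize an ISO 8601 timestamp to UTC when it carries an explicit offset; otherwise return None.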
        def _to_utc_iso(self, value):
            if value is None:
                return None
     
            text = value.strip()
            if not text:
                return None
     
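            # datetime.fromisoformat() on Python < 3.11 rejects a trailing "Z", so rewrite it as an explicit offset.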
            if text.endswith("Z"):
                text = f"{text[:-1]}+00:00"
     
            try:
                parsed = datetime.fromisoformat(text)
            except ValueError:
                return None
     
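            # A naive timestamp has no known offset, so converting it to UTC would be a guess.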
            if parsed.tzinfo is None:
                return None
     
            return parsed.astimezone(timezone.utc).isoformat()

    Keeping start_time as the original value preserves the site time zone, while start_time_utc normalizes only when an offset is present.
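
    As a quick standalone illustration of that rule (not part of the spider), an offset-bearing timestamp converts cleanly while a naive one exposes no offset to convert from:

    from datetime import datetime, timezone

    aware = datetime.fromisoformat("2026-03-05T13:30:00-05:00")
    print(aware.astimezone(timezone.utc).isoformat())   # 2026-03-05T18:30:00+00:00

    naive = datetime.fromisoformat("2026-03-05T13:30:00")
    print(naive.tzinfo)   # None, so the spider leaves start_time_utc as null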

  6. Run the spider with JSON feed export enabled.
    $ scrapy crawl events -O events.json
    2026-01-01 09:45:20 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: events.json

    Use -O to overwrite an existing file; -o appends instead, but appending to a .json feed leaves the file as invalid JSON, so prefer a JSON Lines feed when appending.
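
    For incremental runs, a JSON Lines feed (one JSON object per line, with the format inferred from the .jl extension) appends cleanly:
    $ scrapy crawl events -o events.jl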

  7. Inspect the exported file to confirm required fields, including datetime format.
    $ head -n 5 events.json
    [
    {"title": "Crawl Workshop", "start_time": "2026-03-05T13:30:00-05:00", "start_time_utc": "2026-03-05T18:30:00+00:00", "location": "Remote", "registration_url": "http://app.internal.example:8000/events/register/crawl-workshop", "url": "http://app.internal.example:8000/events/crawl-workshop.html"},
    {"title": "Data Summit", "start_time": "2026-02-10T09:00:00-05:00", "start_time_utc": "2026-02-10T14:00:00+00:00", "location": "Online", "registration_url": "http://app.internal.example:8000/events/register/data-summit", "url": "http://app.internal.example:8000/events/data-summit.html"}
    ]
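
    A programmatic check on the same file (a sketch that assumes the field names above) confirms every item has a title and a UTC-normalized start_time_utc:
    $ python3 -c 'import json; items=json.load(open("events.json", "r", encoding="utf-8")); bad=[i.get("url") for i in items if not i.get("title") or not (i.get("start_time_utc") or "").endswith("+00:00")]; print(f"items with missing fields: {len(bad)}")'
    items with missing fields: 0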
  8. Check for duplicate event URLs in the export to avoid double-counting across months.
    $ python3 -c 'import collections, json; items=json.load(open("events.json", "r", encoding="utf-8")); c=collections.Counter(i.get("url") for i in items if i.get("url")); d=[u for u,n in c.items() if n>1]; print(f"duplicates: {len(d)}"); print("\n".join(d[:10]))'
    duplicates: 0
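
If a later crawl does surface the same event under several months, duplicates can also be dropped during the crawl rather than after it. A minimal item pipeline sketch, assuming it lives in event_calendar/pipelines.py, uses the url field above as the dedup key, and is enabled by adding the class to ITEM_PIPELINES in settings.py:

    from scrapy.exceptions import DropItem


    class DedupeByUrlPipeline:
        def __init__(self):
            self.seen_urls = set()

        def process_item(self, item, spider):
            # Drop any item whose event URL has already been exported in this run.
            url = item.get("url")
            if url in self.seen_urls:
                raise DropItem(f"duplicate event url: {url}")
            self.seen_urls.add(url)
            return item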