Event calendars publish schedules for meetups, webinars, classes, and conferences, but the key details are often split between a listing view and individual event pages. Extracting those details into structured data enables search, reminders, deduplication, and downstream analytics without manual copying.
Scrapy crawls the calendar listing URL, selects event links with CSS selectors, follows each event detail page, and yields one structured item per event. Pagination or month navigation can be captured by following a next link so multiple pages of events are collected in a single run.
Calendars frequently mix time zones, recurring occurrences, and repeated listings across months, so storing the original timestamp plus the event URL reduces drift and double-counting. Some calendars render events with JavaScript, which can leave the raw HTML response empty even when a browser shows event cards; scraping the underlying JSON endpoint is often more reliable than scraping the rendered DOM. Keep the crawl polite with throttling and narrow selectors so unrelated site sections are not spidered.
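One way to apply that advice is to treat the event URL plus the original timestamp as the event's identity, so a listing repeated across monthly pages is counted once. A minimal sketch, assuming items shaped like the spider output later in this article (the `url` and `start_time` field names are that assumption):

```python
# Treat (url, original start_time) as an event's identity so the same
# event repeated across monthly listing pages is kept only once.
def dedupe(events):
    seen = set()
    unique = []
    for event in events:
        key = (event.get("url"), event.get("start_time"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(event)
    return unique

events = [
    {"url": "/events/a.html", "start_time": "2026-02-10T09:00:00-05:00"},
    {"url": "/events/a.html", "start_time": "2026-02-10T09:00:00-05:00"},
    {"url": "/events/b.html", "start_time": "2026-03-05T13:30:00-05:00"},
]
unique = dedupe(events)  # the repeated listing is dropped
```

Keying on the original timestamp rather than a normalized one avoids merging distinct occurrences of a recurring event that happen to share a URL pattern.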
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
$ scrapy startproject event_calendar
New Scrapy project 'event_calendar', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
/root/sg-work/event_calendar
$ cd event_calendar
$ scrapy genspider events app.internal.example
Created spider 'events' using template 'basic' in module:
  event_calendar.spiders.events
$ scrapy shell "http://app.internal.example:8000/events/"
>>> response.css("article.event a::attr(href)").getall()
['/events/data-summit.html', '/events/crawl-workshop.html']
>>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()
None
If selectors return empty while a browser shows events, the calendar is likely JavaScript-rendered and scraping the underlying API endpoint is usually required.
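In that case the mapping from API records to items can stay identical to the HTML path. A stdlib-only sketch of that mapping; the payload field names ("title", "start", "url") are assumptions about a hypothetical API, and in a real spider this would run inside a parse callback pointed at the JSON endpoint found in the browser's network tab:

```python
import json
from urllib.parse import urljoin

# Map raw API records to the same item shape the HTML spider yields.
# The "title", "start", and "url" keys are assumptions about the payload;
# inspect the real endpoint's response to confirm them.
def events_from_json(payload, base_url):
    items = []
    for record in json.loads(payload):
        items.append({
            "title": record.get("title"),
            "start_time": record.get("start"),
            "url": urljoin(base_url, record.get("url", "")),
        })
    return items

sample = '[{"title": "Data Summit", "start": "2026-02-10T09:00:00-05:00", "url": "/events/data-summit.html"}]'
items = events_from_json(sample, "http://app.internal.example:8000/")
```

Because the API already returns structured fields, this path skips CSS selectors entirely and tends to break less often than scraping the rendered DOM.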
import scrapy
from datetime import datetime, timezone


class EventsSpider(scrapy.Spider):
    name = "events"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/events/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "AUTOTHROTTLE_ENABLED": True,
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "FEED_EXPORT_ENCODING": "utf-8",
    }

    def parse(self, response):
        # Follow each event card to its detail page.
        for href in response.css("article.event a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)
        # Follow pagination or month navigation, if present.
        next_href = response.css(
            "a[rel='next']::attr(href), a.next::attr(href)"
        ).get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_event(self, response):
        raw_start = self._text(response.css("time::attr(datetime)").get())
        reg_href = response.css(
            "a.register::attr(href), a[href*='register']::attr(href)"
        ).get()
        yield {
            "title": self._text(response.css("h1::text").get()),
            "start_time": raw_start,
            "start_time_utc": self._to_utc_iso(raw_start),
            "location": self._text(response.css(".event-location::text").get()),
            "registration_url": response.urljoin(reg_href) if reg_href else None,
            "url": response.url,
        }

    def _text(self, value):
        if value is None:
            return None
        text = value.strip()
        return text if text else None

    def _to_utc_iso(self, value):
        if value is None:
            return None
        text = value.strip()
        if not text:
            return None
        if text.endswith("Z"):
            text = f"{text[:-1]}+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return None
        if parsed.tzinfo is None:
            return None
        return parsed.astimezone(timezone.utc).isoformat()
Keeping start_time as the original value preserves the site time zone, while start_time_utc normalizes only when an offset is present.
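The normalization behavior can be checked in isolation. This sketch mirrors the spider's _to_utc_iso logic as a standalone function: a trailing "Z" is rewritten to "+00:00" (which older fromisoformat versions cannot parse), offset-aware timestamps are converted to UTC, and naive ones yield None so only the original value is kept:

```python
from datetime import datetime, timezone

# Standalone mirror of the spider's _to_utc_iso: normalize a trailing "Z",
# convert offset-aware timestamps to UTC, return None for naive ones.
def to_utc_iso(value):
    text = value.strip()
    if text.endswith("Z"):
        text = f"{text[:-1]}+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        return None  # naive timestamp: caller keeps only start_time
    return parsed.astimezone(timezone.utc).isoformat()

aware = to_utc_iso("2026-03-05T13:30:00-05:00")  # "2026-03-05T18:30:00+00:00"
naive = to_utc_iso("2026-03-05T13:30:00")        # None
```

Returning None for naive timestamps is deliberate: guessing a time zone would silently shift events, while a missing start_time_utc is easy to flag downstream.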
$ scrapy crawl events -O events.json
2026-01-01 09:45:20 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: events.json
Use -O to overwrite an existing file, or -o to append.
$ head -n 5 events.json
[
{"title": "Crawl Workshop", "start_time": "2026-03-05T13:30:00-05:00", "start_time_utc": "2026-03-05T18:30:00+00:00", "location": "Remote", "registration_url": "http://app.internal.example:8000/events/register/crawl-workshop", "url": "http://app.internal.example:8000/events/crawl-workshop.html"},
{"title": "Data Summit", "start_time": "2026-02-10T09:00:00-05:00", "start_time_utc": "2026-02-10T14:00:00+00:00", "location": "Online", "registration_url": "http://app.internal.example:8000/events/register/data-summit", "url": "http://app.internal.example:8000/events/data-summit.html"}
]
$ python3 -c 'import collections, json; items=json.load(open("events.json", "r", encoding="utf-8")); c=collections.Counter(i.get("url") for i in items if i.get("url")); d=[u for u,n in c.items() if n>1]; print(f"duplicates: {len(d)}"); print("\n".join(d[:10]))'
duplicates: 0