datacol.py - Event Participant Data Collector

A Python script that scrapes participant information from event websites and exports it to CSV. Built with Selenium and BeautifulSoup, and ready to drop into n8n automations (integration examples below).

Features

  • Automatic participant extraction from web pages using .card selectors
  • Interactive field selection – choose which data fields to export
  • Selective participant filtering – include specific participants or all
  • CSV export – clean, formatted output ready for spreadsheets or databases
  • Headless browser support – runs without opening a visible browser window
  • n8n ready – can be integrated into automated workflows (see below)

Requirements

System Dependencies

  • Python 3.6+
  • Chrome or Chromium browser
  • ChromeDriver (matching your browser version)

Python Packages

pip install selenium beautifulsoup4

Installation

  1. Clone or download this script to your machine

  2. Install ChromeDriver (if not already installed):

    # Ubuntu/Debian
    sudo apt install chromium-chromedriver
    
    # Or download manually from: https://chromedriver.chromium.org/
  3. Update paths in the script if needed:

    • options.binary_location – path to your Chrome/Chromium executable
    • Service("/usr/local/bin/chromedriver") – path to your ChromeDriver

Usage

Basic Usage

Run the script and follow the interactive prompts:

python datacol.py

You will be guided through:

  1. Enter URL – the event page containing participant cards
  2. Select participants – choose specific numbers or type all
  3. Choose fields – select which data fields to export
  4. CSV generated – results saved to participants_selected.csv

Example Output

Enter the full event participants page URL: https://example.com/event/speakers
Loading https://example.com/event/speakers ...

✅ Found 24 participant cards.

1. Dr. Sarah Chen
2. Prof. Michael Rodriguez
3. Dr. Emily Watson
...

Enter participant numbers to include (comma separated, or 'all'): 1,3,5-7
Selected 5 participants.

Detected available fields:
1. Name
2. Title
3. Company
4. Country

Enter field number to include (press Enter to finish): 1
Added: Name
Enter field number to include (press Enter to finish): 2
Added: Title
Enter field number to include (press Enter to finish): 

✅ Extraction complete: participants_selected.csv
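
The participant prompt above accepts both single numbers and ranges (e.g. 1,3,5-7). A minimal sketch of how such a selection string might be parsed; parse_selection is a hypothetical helper for illustration, not a function from the actual script:

```python
def parse_selection(spec, total):
    """Parse a selection like '1,3,5-7' (or 'all') into sorted 1-based indices."""
    if spec.strip().lower() == "all":
        return list(range(1, total + 1))
    chosen = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            chosen.update(range(int(lo), int(hi) + 1))
        else:
            chosen.add(int(part))
    # Keep only indices that actually exist on the page
    return sorted(i for i in chosen if 1 <= i <= total)

print(parse_selection("1,3,5-7", 24))  # [1, 3, 5, 6, 7]
```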

Customizing for Different Websites

The script is designed for pages where participant cards use the .card CSS class. To adapt it for other websites:

Element          Where to Change         Example
Card selector    soup.select(".card")    Change .card to .participant, .speaker-item, etc.
Name selector    card.find("h3")         Change to h2, .name, or [class*="name"]
Field extraction card.find_all("p")      Change to .details, div.info, etc.
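
For example, adapting the extraction to a page that uses .speaker-item cards with h2 names might look like this. The same BeautifulSoup calls the script relies on are used here, but the sample HTML is invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for a real event page
html = """
<div class="speaker-item"><h2>Dr. Sarah Chen</h2><p>CTO</p><p>Acme</p></div>
<div class="speaker-item"><h2>Prof. Michael Rodriguez</h2><p>Dean</p><p>State U</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
cards = soup.select(".speaker-item")                # was soup.select(".card")
for card in cards:
    name = card.find("h2").get_text(strip=True)    # was card.find("h3")
    details = [p.get_text(strip=True) for p in card.find_all("p")]
    print(name, details)
```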

n8n Integration

This script can be integrated into n8n workflows for fully automated data collection. Here are four approaches:

Approach 1: Execute Command Node (Simple)

Run the script directly from n8n using the Execute Command node:

Workflow:

[Schedule Trigger] → [Execute Command] → [Read File] → [Google Drive/Send Email]

Execute Command Node Configuration:

  • Command: python
  • Arguments: /path/to/datacol.py
  • Note: For non-interactive use, you'll need to modify the script to accept arguments (see Approach 2)

Approach 2: Modified Version with CLI Arguments

Create a modified version of the script that accepts command-line arguments:

# datacol_cli.py - Modified for n8n
import argparse
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
# ... (rest of imports)

parser = argparse.ArgumentParser()
parser.add_argument("--url", required=True, help="Event page URL")
parser.add_argument("--participants", default="all", help="Participant numbers or 'all'")
parser.add_argument("--fields", required=True, help="Comma-separated field names")
parser.add_argument("--output", default="participants_selected.csv", help="Output file")

args = parser.parse_args()
# ... (rest of script logic)

Then in n8n, use:

python datacol_cli.py --url "https://example.com/event" --fields "Name,Title,Company" --output "/tmp/participants.csv"
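
The CLI version still ends by writing the selected fields to CSV. A minimal sketch of that export step using the standard csv module; the field names and rows below are illustrative, not taken from the script:

```python
import csv

# Illustrative scraped rows; the real script builds these from the page
participants = [
    {"Name": "Dr. Sarah Chen", "Title": "CTO", "Company": "Acme"},
    {"Name": "Prof. Michael Rodriguez", "Title": "Dean", "Company": "State U"},
]
fields = ["Name", "Title"]  # e.g. from --fields "Name,Title"

with open("participants_selected.csv", "w", newline="") as f:
    # extrasaction="ignore" drops any fields the user did not select
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(participants)
```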

Approach 3: HTTP API Wrapper (Most Flexible)

Create a simple Flask API wrapper that n8n can call via HTTP Request node:

# datacol_api.py
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)

@app.route('/scrape', methods=['POST'])
def scrape():
    data = request.json
    url = data.get('url')
    fields = data.get('fields', [])

    if not url:
        return jsonify({'status': 'error', 'message': 'url is required'}), 400

    result = subprocess.run(
        ['python', 'datacol_cli.py', '--url', url, '--fields', ','.join(fields)],
        capture_output=True, text=True
    )

    if result.returncode != 0:
        return jsonify({'status': 'error', 'message': result.stderr}), 500

    with open('participants_selected.csv', 'r') as f:
        csv_data = f.read()

    return jsonify({'status': 'success', 'data': csv_data})

if __name__ == '__main__':
    app.run(port=5000)

n8n HTTP Request Node:

  • Method: POST
  • URL: http://localhost:5000/scrape
  • Body (JSON):
    {
      "url": "https://example.com/event",
      "fields": ["Name", "Title", "Company"]
    }
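
Outside n8n, the same call can be exercised from Python's standard library. This sketch builds the request the HTTP Request node would send; the actual send is commented out because it only works with the Flask wrapper running:

```python
import json
import urllib.request

payload = {
    "url": "https://example.com/event",
    "fields": ["Name", "Title", "Company"],
}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:5000/scrape",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# resp = urllib.request.urlopen(req)          # uncomment with the wrapper running
# print(json.loads(resp.read())["status"])
print(req.method, req.full_url)
```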

Approach 4: End-to-End n8n Workflow

Finally, you can wrap the entire pipeline, from scheduled trigger to spreadsheet, in a single n8n workflow. The example below still calls the CLI script from Approach 2 via the Execute Command node, but n8n handles the scheduling, file reading, and delivery, so no manual steps remain.

Example n8n Workflow Structure

[Schedule Trigger] 
  ↓ (runs every Monday at 9 AM)
[Set Node] 
  - Set URL: "https://example.com/event/participants"
  - Set fields: ["Name", "Title", "Company"]
  ↓
[Execute Command] 
  - Run: python /path/to/datacol_cli.py
  - Arguments: --url "{{$json.url}}" --fields "{{$json.fields.join(',')}}"
  ↓
[Read File] 
  - File path: /tmp/participants.csv
  ↓
[Convert to JSON] 
  - Parse CSV to structured data
  ↓
[Google Sheets] 
  - Append data to spreadsheet
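
The "Convert to JSON" step above maps each CSV row to a structured record. In Python terms it amounts to this (stdlib only; the CSV contents are invented sample data):

```python
import csv
import io

# Sample CSV as produced by the scraper (contents invented)
csv_text = "Name,Title\nDr. Sarah Chen,CTO\nProf. Michael Rodriguez,Dean\n"

# DictReader uses the header row as keys, one dict per participant
records = list(csv.DictReader(io.StringIO(csv_text)))
print(records[0])  # {'Name': 'Dr. Sarah Chen', 'Title': 'CTO'}
```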

Troubleshooting

ChromeDriver Issues

If you get errors about ChromeDriver:

# Check ChromeDriver version
chromedriver --version

# Update if needed
sudo apt update && sudo apt upgrade chromium-chromedriver

No Cards Found

If the script finds 0 cards:

  • Check that the page uses .card class for participant elements
  • Try opening the page manually and inspecting element selectors
  • The page may require login or may load content dynamically with JavaScript

Headless Mode Problems

If pages don't load properly in headless mode:

  • Remove options.add_argument("--headless") to see what's happening
  • Add time.sleep(5) to give more time for JavaScript to execute

File Structure

datacol.py                 # Main script
participants_selected.csv  # Generated output file
datacol_cli.py             # CLI version for n8n (optional)
datacol_api.py             # API wrapper for n8n (optional)

License

MIT License

Contributing

Feel free to submit issues or pull requests for:

  • Additional website selectors
  • Better error handling
  • Performance improvements
  • Enhanced n8n integration examples
