Add contract validation project by vanadium-n42 · Pull Request #6 · Kalshi/tools-and-analysis

vanadium-n42 · 2025-03-18T02:16:00Z

Project Summary

This project provides Python scripts for downloading and analyzing PDF rules documents from Kalshi events. It is composed of two main components:

downloader.py (Selenium-based)
- Automates a browser (in headless mode) to navigate the site where the contracts are hosted.
- Finds and clicks the “View full rules” button (or equivalent) to open each contract’s PDF link in a new tab.
- Extracts the direct PDF URL and downloads it to a local directory.
contract_analyzer.py (Text-based Validations)
- Reads each PDF file from the local directory.
- Extracts text and checks for:
  - Suspicious keywords (e.g., “Dear,” “CFTC,” “KalshiEX”).
  - Excessive special characters (above a chosen threshold).
  - Required contract sections (e.g., Underlying, Instructions, Settlement Value, etc.).
  - Explicit language or curse words.
  - Basic contact info (emails, phone numbers).
  - Word count anomalies (too few or too many words).
  - (Optional) Bias or discriminatory language detection (stub or ML-based).
- Computes a “suspicion score” based on these checks and ranks the contracts from highest to lowest suspicion.
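The checks and scoring listed above can be sketched roughly as follows. This is a minimal illustration, not the code in contract_analyzer.py: the function names, the one-point-per-finding scoring, the regular expressions, and the default thresholds are all assumptions made for the example, and the text is assumed to have already been extracted from the PDF.

```python
import re

# Every name, pattern, and threshold here is an illustrative assumption,
# not a value taken from contract_analyzer.py.
SUSPICIOUS_KEYWORDS = ("Dear", "CFTC", "KalshiEX")
REQUIRED_SECTIONS = ("Underlying", "Instructions", "Settlement Value")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def special_char_ratio(text):
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return special / len(text)


def suspicion_score(text, special_threshold=0.15, min_words=50, max_words=20000):
    """Return (score, reasons) for one document; each finding adds one point."""
    reasons = []
    for keyword in SUSPICIOUS_KEYWORDS:
        # Whole-word, case-insensitive match.
        if re.search(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
            reasons.append(f"keyword: {keyword}")
    lowered = text.lower()
    for section in REQUIRED_SECTIONS:
        if section.lower() not in lowered:
            reasons.append(f"missing section: {section}")
    if special_char_ratio(text) > special_threshold:
        reasons.append("special characters above threshold")
    if EMAIL_RE.search(text) or PHONE_RE.search(text):
        reasons.append("contact info present")
    word_count = len(text.split())
    if word_count < min_words or word_count > max_words:
        reasons.append(f"word count anomaly: {word_count}")
    return len(reasons), reasons


def rank_documents(docs):
    """Take a mapping of name -> text and return (name, score), highest first."""
    scored = [(name, suspicion_score(text)[0]) for name, text in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Equal weights keep the sketch easy to follow. A real analyzer would likely weight findings differently, for example counting a missing section more heavily than a single keyword hit.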

Key Features

Parallel Execution: Uses a thread pool to speed up analysis when dealing with numerous PDFs.
Configurable Thresholds: Allows tuning of special character and suspicion-score thresholds.
Lightweight Dependencies: Relies primarily on requests, Selenium, PyPDF2, and tqdm, plus a few standard-library modules.
Professional Workflow: Includes instructions for clean commit history, pulling via subtree, and other best practices for repository organization.

This setup ensures that large batches of PDF contracts can be automatically downloaded, then systematically checked and flagged for further review.
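As a rough sketch of how the thread pool and the configurable suspicion-score threshold might fit together, the example below runs a per-file analysis function over a directory and flags the results. The names `analyze_all`, `flag`, and `analyze_one`, and the default values for `max_workers` and `score_threshold`, are hypothetical and not taken from the PR.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def analyze_all(pdf_dir, analyze_one, max_workers=8):
    """Run analyze_one(path) over every *.pdf in pdf_dir using a thread pool.

    analyze_one is a hypothetical callable that takes a Path and returns a
    numeric suspicion score. Returns (path, score) pairs, highest score first.
    """
    paths = sorted(Path(pdf_dir).glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with paths.
        scores = list(pool.map(analyze_one, paths))
    results = list(zip(paths, scores))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


def flag(results, score_threshold=3):
    """Return the paths whose score meets or exceeds the threshold."""
    return [path for path, score in results if score >= score_threshold]
```

Threads help most when the per-file work is dominated by I/O such as reading files. If text extraction turns out to be CPU-bound, a process pool may be worth comparing.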

Testing and Validation

Local Test Run:
- Cloned this repository and installed the required dependencies (see requirements.txt).
- Ran python downloader.py locally, which used Selenium to navigate and download a batch of PDFs into the pdfs/ directory.
- Confirmed that the PDF files appeared correctly in the pdfs/ folder.
PDF Analysis:
- Executed python contract_analyzer.py to parse the downloaded PDFs.
- Observed the progress bar and final output indicating:
  - Word count checks
  - Suspicious keywords found
  - Missing sections if any
  - Special characters above the configured threshold
- Verified that each PDF's suspicion score made sense (high for documents with many issues, lower for well-formed PDFs).
Manual Validation:
- Randomly opened a few of the downloaded PDFs to confirm they displayed as expected.
- Checked the analyzer’s logs for any warnings or errors, ensuring the script handled edge cases (e.g., empty PDFs or those missing sections).
Results:
- The system flagged suspicious documents (e.g., very short, missing sections, or with explicit language) for further inspection.
- Confirmed that normal or well-structured PDFs did not receive high suspicion scores.

Overall, testing locally demonstrated that both the downloader and analyzer scripts functioned as intended, reliably identifying and ranking potential issues in the PDFs.

Add contract validation project

fa8aaf2

vanadium-n42 wants to merge 1 commit into Kalshi:main from vanadium-n42:feature/contract-validation
