
Add contract validation project #6

Open
vanadium-n42 wants to merge 1 commit into Kalshi:main from vanadium-n42:feature/contract-validation

Project Summary

This project provides Python scripts for downloading and analyzing PDF rules documents from Kalshi events. It has two main components:

  1. downloader.py (Selenium-based)

    • Automates a browser (in headless mode) to navigate the site where the contracts are hosted.
    • Finds and clicks the “View full rules” button (or equivalent) to open each contract’s PDF link in a new tab.
    • Extracts the direct PDF URL and downloads it to a local directory.
  2. contract_analyzer.py (Text-based Validations)

    • Reads each PDF file from the local directory.
    • Extracts text and checks for:
      • Suspicious keywords (e.g., “Dear,” “CFTC,” “KalshiEX”).
      • Excessive special characters (above a chosen threshold).
      • Required contract sections (e.g., Underlying, Instructions, Settlement Value, etc.).
      • Explicit language or curse words.
      • Basic contact info (emails, phone numbers).
      • Word count anomalies (too few or too many words).
      • (Optional) Bias or discriminatory language detection (stub or ML-based).
    • Computes a “suspicion score” based on these checks and ranks the contracts from highest to lowest suspicion.
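
The checks and scoring described above could be sketched roughly as follows. This is not the code from contract_analyzer.py: the keyword list, section names, regular expressions, point weights, default thresholds, and the names analyze_text, Report, and rank are all assumptions chosen for illustration, and it covers only a subset of the listed checks.

```python
import re
from dataclasses import dataclass, field

# Hypothetical values: the PR text gives examples but not the exact lists or weights.
SUSPICIOUS_KEYWORDS = ("dear", "cftc", "kalshiex")
REQUIRED_SECTIONS = ("underlying", "instructions", "settlement value")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


@dataclass
class Report:
    name: str
    score: float = 0.0
    reasons: list = field(default_factory=list)


def analyze_text(name, text, special_char_threshold=0.10,
                 min_words=50, max_words=20000):
    """Score already-extracted PDF text; a higher score means more suspicious."""
    report = Report(name)
    lowered = text.lower()

    def flag(points, reason):
        report.score += points
        report.reasons.append(reason)

    # Suspicious keywords, matched as whole words.
    for kw in SUSPICIOUS_KEYWORDS:
        if re.search(rf"\b{re.escape(kw)}\b", lowered):
            flag(1.0, f"keyword:{kw}")

    # Share of characters that are neither alphanumeric nor whitespace.
    specials = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if text and specials / len(text) > special_char_threshold:
        flag(1.0, "special_chars")

    # Required contract sections.
    for section in REQUIRED_SECTIONS:
        if section not in lowered:
            flag(2.0, f"missing:{section}")

    # Basic contact info.
    if EMAIL_RE.search(text):
        flag(1.0, "email")
    if PHONE_RE.search(text):
        flag(1.0, "phone")

    # Word count anomalies.
    n_words = len(text.split())
    if n_words < min_words:
        flag(1.0, "too_short")
    elif n_words > max_words:
        flag(1.0, "too_long")

    return report


def rank(reports):
    """Order reports from highest to lowest suspicion score."""
    return sorted(reports, key=lambda r: r.score, reverse=True)
```

Keeping a list of reasons next to the numeric score makes it easier for a reviewer to see why a contract was ranked highly, rather than only that it was.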

Key Features

  • Parallel Execution: Uses a thread pool to speed up analysis when dealing with numerous PDFs.
  • Configurable Thresholds: Allows tuning of special character and suspicion-score thresholds.
  • Lightweight Dependencies: Primarily relies on requests, Selenium, PyPDF2, tqdm, and a few other standard libraries.
  • Professional Workflow: Includes instructions for keeping a clean commit history, pulling via subtree, and other repository-organization practices.
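
As an illustration of the Parallel Execution point, here is a minimal sketch of fanning per-file analysis out over a thread pool with the standard library. The function name analyze_all, the analyze_one callback, and the default worker count are hypothetical and not taken from the PR.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyze_all(paths, analyze_one, max_workers=8):
    """Run analyze_one(path) for every path on a thread pool.

    Returns a dict mapping each path to its result, or to the exception
    raised for that path so one bad PDF does not abort the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its path so results can be labeled.
        futures = {pool.submit(analyze_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # broad on purpose: record and continue
                results[path] = exc
    return results
```

The loop over as_completed is also a natural place to advance a progress bar, for example by wrapping it with tqdm and passing total=len(futures). Threads help most when the per-file work is dominated by I/O; CPU-bound text extraction may see less benefit.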

This setup lets large batches of PDF contracts be downloaded automatically, then systematically checked and flagged for further review.

Testing and Validation

  1. Local Test Run:

    • Cloned this repository and installed the required dependencies (see requirements.txt).
    • Ran python downloader.py locally, which used Selenium to navigate and download a batch of PDFs into the pdfs/ directory.
    • Confirmed that the PDF files appeared correctly in the pdfs/ folder.
  2. PDF Analysis:

    • Executed python contract_analyzer.py to parse the downloaded PDFs.
    • Observed the progress bar and final output indicating:
      • Word count checks
      • Suspicious keywords found
      • Missing sections if any
      • Special characters above the configured threshold
    • Verified that suspicion scores tracked document quality (higher for documents with many issues, lower for well-formed PDFs).
  3. Manual Validation:

    • Randomly opened a few of the downloaded PDFs to confirm they displayed as expected.
    • Checked the analyzer’s logs for any warnings or errors, ensuring the script handled edge cases (e.g., empty PDFs or those missing sections).
  4. Results:

    • The system flagged suspicious documents (e.g., very short, missing sections, or with explicit language) for further inspection.
    • Confirmed that normal or well-structured PDFs did not receive high suspicion scores.

Overall, local testing showed that both the downloader and analyzer scripts worked as intended, identifying and ranking potential issues in the PDFs.
