Linkcharsoft/consultores-automation-framework
consultores-automation-framework

Shared framework for AFIP/ARCA scrapers. Provides all the infrastructure needed so that each scraper is an independent repo that only has to implement its own logic.


Why this project exists

The original system was a monorepo with all scrapers together (altas, bajas, suss, iva, sworn_statements, certs_download). That worked while there were few of them, but caused problems at scale:

| Problem | Monorepo | This framework |
|---|---|---|
| Dependencies | All share the same pyproject.toml; if altas needs langchain, everyone installs it | Each repo has only its own deps |
| Deploy | One giant Docker image; changing one line in suss rebuilds everything | Minimal Docker image per scraper |
| Environment variables | .env holds OPENAI_API_KEY, PROFIT_USERNAME, credentials for everyone | Each container only has its own variables |
| Memory | The runner imports all scrapers even if only one runs | Only loads what it needs |
| Onboarding | Adding a scraper requires understanding the whole architecture | Edit 2 files and implement run() |

What the library includes

consultores_automation/
├── automation/
│   ├── automation.py    ← Automation base class (hooks, helpers)
│   ├── registry.py      ← @register — registers and configures the scraper
│   ├── runner.py        ← CLI entrypoint (python -m consultores_automation.runner)
│   └── provider.py      ← input providers: local JSON or S3
│
├── core_system/
│   ├── client.py        ← HTTP client to the central run-tracking system
│   ├── run_context.py   ← context manager wrapping each execution
│   └── schemas.py       ← Pydantic models (AutomationRunSchema, StatusEnum)
│
├── driver/
│   ├── base.py          ← DriverBuilder + BaseDriver
│   ├── mixins.py        ← TempProfile, AutoDownload, BotDetection,
│   │                       VirtualDisplay, CompleteBotDetectionEvasion
│   └── constants.py     ← BaseDriverConstants (base constants class)
│
├── afip/
│   ├── scraper.py       ← AfipScraper: AFIP login, service search
│   └── constants.py     ← AfipScraperConstants, SimplificacionRegistralConstants
│
├── utils/
│   ├── s3.py            ← S3Utils (upload, download, zip+upload)
│   ├── files.py         ← local file helpers
│   ├── misc.py          ← miscellaneous helpers
│   └── dates.py         ← get_month_period() and date utilities
│
├── settings.py          ← unified configuration (Settings base class)
├── logger.py            ← pre-configured logger
├── exceptions.py        ← base exception hierarchy
├── alert.py             ← Discord notifications
└── cli.py               ← uv run commands: execute, setup, automation, help

Compatibility shims (kept for backward compatibility, do not use in new code):

| File | Points to |
|---|---|
| conf.py | settings.py |
| envs.py | settings.py |
| rets.py | utils/dates.py |

Quick start: creating a new scraper

1. Scaffold the repo

From the root directory of the new repo:

uv run setup --name my_scraper --label "My Scraper"
uv run automation --name my_scraper --label "My Scraper"

This generates:

my-scraper/
├── automations/
│   └── my_scraper/
│       ├── __init__.py
│       ├── automation.py   ← implement run() here
│       ├── config.py       ← CODE, NAME and environment variables
│       ├── constants.py    ← selectors and scraper strings (optional)
│       └── tests/
│           └── input.json  ← test input for this automation
├── scripts/
│   └── build_push.sh       ← script to build and push to ECR
├── Dockerfile
├── docker-compose.yml
├── .env.example
├── pyproject.toml          ← ready to use (adjust deps if needed)
├── .gitignore
├── .dockerignore
└── README.md

2. Setup

cp .env.example .env
# Fill in .env with the credentials
uv sync

3. Implement the logic

Only two files need to be edited inside automations/my_scraper/:

config.py — declare the scraper's own variables:

import os
from consultores_automation.settings import Settings

CODE = "my_scraper"
NAME = "My Scraper"

class MyScraperSettings(Settings):
    # Framework variables (AFIP_USERNAME, S3_BUCKET, etc.)
    # are already in the base class — no need to re-declare them.
    # Only add your own:
    PROFIT_USERNAME = os.getenv("PROFIT_USERNAME")
    PROFIT_PASSWORD = os.getenv("PROFIT_PASSWORD")

    # Declare mandatory variables — validated automatically before run():
    required = ["AFIP_USERNAME", "AFIP_PASSWORD", "PROFIT_USERNAME"]

automation.py — implement run():

from consultores_automation import Automation, register
from automations.my_scraper.config import CODE, NAME, MyScraperSettings

@register(code=CODE, name=NAME)
class MyScraperAutomation(Automation):
    Settings = MyScraperSettings

    def run(self, data: dict, **kwargs):
        self.logger.info(f"Processing: {data['cuit']}")

        # MyScraper is your own scraper class, implemented in this repo
        scraper = MyScraper(
            username=self.settings.AFIP_USERNAME,
            password=self.settings.AFIP_PASSWORD,
        )
        result = scraper.process(data)

        url = self.upload_output("files/downloads/result.pdf")
        return {"url": url, "success": True}

4. Test it

uv sync

# Without core system credentials:
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system

# With visible browser (debug):
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system --head

# Without any external infrastructure (no run tracking):
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system --no-context

# Multi-automation repo — specify the automation name:
uv run execute my_scraper --json automations/my_scraper/tests/input.json --fake-core-system

Available commands

| Command | Description |
|---|---|
| uv run execute | Run an automation locally |
| uv run setup | Scaffold repo infra (docker, configs, .env) |
| uv run automation | Add a new automation to the repo |
| uv run help | Show all commands and options |

Run uv run help for the full reference with all flags and examples.


Multiple automations in one repo

90% of repos have a single automation. But when two automations are very similar (shared infrastructure, credentials, dependencies), it can make sense to keep them together.

Add a second automation

uv run automation --name other_automation --label "Other Automation"

The resulting structure:

my-scraper/
├── automations/
│   ├── my_scraper/
│   │   ├── automation.py
│   │   ├── config.py
│   │   ├── constants.py
│   │   └── tests/
│   │       └── input.json
│   └── other_automation/
│       ├── automation.py
│       ├── config.py
│       ├── constants.py
│       └── tests/
│           └── input.json
└── ...

The runner discovers both automatically by scanning the automations/ folder.
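
Conceptually, discovery amounts to scanning automations/ for subfolders that contain an automation.py. This is a hedged sketch of that idea, not the framework's actual code:

```python
from pathlib import Path

def discover_automations(root: str) -> list[str]:
    """Sketch: list subfolders of `root` that look like automations,
    i.e. contain an automation.py. Illustrative only."""
    base = Path(root)
    return sorted(
        p.name
        for p in base.iterdir()
        if p.is_dir() and (p / "automation.py").exists()
    )
```

Importing each discovered module then triggers the @register decorator, which is how the runner learns about the automation classes.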

Running (local and production)

The automation name is passed as the first argument, both in development and production:

# Local
uv run execute my_scraper       --json automations/my_scraper/tests/input.json --fake-core-system
uv run execute other_automation --json automations/other_automation/tests/input.json --fake-core-system

# Production (S3_KEY in env, no --json)
python -m consultores_automation.runner my_scraper
python -m consultores_automation.runner other_automation

In production, the infrastructure maps each S3 path to the corresponding container command:

S3 path /my_scraper/input/       →  CMD python -m consultores_automation.runner my_scraper
S3 path /other_automation/input/ →  CMD python -m consultores_automation.runner other_automation

Both share the same Docker image. The automation name comes from the command, not the input JSON.


Execution flow

              .env / S3 / --json
                    │
                    ▼
              runner.py (CLI)
                    │
                    ├── discovers automations/ by scanning subfolders
                    ├── if more than one: determines which by subcommand
                    ├── reads input (local JSON or S3)
                    ├── instantiates Automation via @register
                    └── opens AutomationRunContext
                                │
                    ┌───────────┘
                    │
                    ▼
          automation.on_init()        ← validates required env vars
                    │
                    ▼
          automation.run(data)
              [scraper logic]
              returns dict / None
                    │
                    ▼
          automation.on_success()  (or on_error() if exception)
                    │
                    ▼
          AutomationRunContext.__exit__()
              ├── sends logs + output + status to the core system
              └── uploads debug/ folder to S3 if it exists

Reference: config.py

import os
from consultores_automation.settings import Settings

CODE = "my_scraper"    # snake_case, unique per scraper
NAME = "My Scraper"    # human-readable name

class MyScraperSettings(Settings):

    # ── Scraper-specific variables ─────────────────────────────────────
    PROFIT_USERNAME = os.getenv("PROFIT_USERNAME")
    PROFIT_PASSWORD = os.getenv("PROFIT_PASSWORD")
    MAX_ITEMS       = int(os.getenv("MAX_ITEMS", "100"))

    # ── Mandatory variables — validated automatically before run() ─────
    required = ["AFIP_USERNAME", "AFIP_PASSWORD", "PROFIT_USERNAME"]

    # ── Framework variables (available without re-declaring) ───────────
    # IS_PRODUCTION  = os.getenv("IS_PRODUCTION", "false").lower() == "true"
    # DEBUG          = os.getenv("DEBUG",         "false").lower() == "true"
    # DEMO           = os.getenv("DEMO",          "false").lower() == "true"
    # DRY_RUN        = os.getenv("DRY_RUN",       "false").lower() == "true"
    # AFIP_USERNAME  = os.getenv("AFIP_USERNAME")
    # AFIP_PASSWORD  = os.getenv("AFIP_PASSWORD")
    # S3_BUCKET      = os.getenv("S3_BUCKET")
    # S3_KEY         = os.getenv("S3_KEY")              # injected by Lambda
    # CORE_SYSTEM_API_URL = os.getenv("CORE_SYSTEM_API_URL")
    # DISCORD_WEBHOOK_URL = os.getenv("DISCORD_WEBHOOK_URL")

Access from the scraper (all variables — own and framework):

self.settings.PROFIT_USERNAME       # → own variable
self.settings.MAX_ITEMS             # → with default if not in .env
self.settings.AFIP_USERNAME         # → from framework, comes from .env
self.settings.DEMO                  # → from framework, always available
self.settings.is_production()       # → bool helpers
self.settings.is_debug()
self.settings.is_demo()
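
The required check runs automatically before run(). A hypothetical sketch of what that validation might look like (the names here are illustrative, not the framework's real API):

```python
class RequiredVarsError(Exception):
    """Raised when a declared-required setting is missing."""

def validate_required(settings) -> None:
    # Sketch: fail fast if any name listed in `required` is unset or empty.
    missing = [
        name for name in getattr(settings, "required", [])
        if not getattr(settings, name, None)
    ]
    if missing:
        raise RequiredVarsError(
            f"Missing required settings: {', '.join(missing)}"
        )
```

Failing before run() starts means a misconfigured container aborts immediately instead of dying mid-scrape.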

The secret key for the core system is resolved automatically by convention:

CORE_SYSTEM_{CODE}_SECRET_KEY   e.g.  CORE_SYSTEM_MY_SCRAPER_SECRET_KEY

This allows multiple automations in the same environment to each have their own key.
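
In other words, the environment variable name is derived from the automation's CODE. A one-line illustration (the helper name is hypothetical):

```python
def secret_key_env_name(code: str) -> str:
    # Convention: CORE_SYSTEM_{CODE}_SECRET_KEY, with CODE upper-cased.
    return f"CORE_SYSTEM_{code.upper()}_SECRET_KEY"
```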


Reference: constants.py (optional)

Centralizes the scraper's CSS/XPath selectors, element IDs and URLs. When something breaks in Selenium, it is immediately clear what was being targeted, without hunting for strings scattered across the code.

# Choose the base class that matches the scraper being extended
from consultores_automation.afip.constants import AfipScraperConstants

class MyScraperConstants(AfipScraperConstants):
    MY_BUTTON_ID    = "btnProcess"
    MY_TABLE_XPATH  = "//table[@id='grid']"
    MY_INPUT_ID     = "txtSearch"

Available hierarchy:

| Class | Inherits from | When to use |
|---|---|---|
| BaseDriverConstants | — | Generic scraper (not AFIP) |
| AfipScraperConstants | BaseDriverConstants | Scraper extending AfipScraper |
| SimplificacionRegistralConstants | AfipScraperConstants | Scraper extending SimplificacionRegistralScraper |

Assign the class in the scraper:

from selenium.webdriver.common.by import By

from automations.my_scraper.constants import MyScraperConstants

class MyScraper(AfipScraper):
    Constants = MyScraperConstants   # inherits all AFIP constants + own ones

    def process(self):
        self.click_element(By.ID, self.Constants.MY_BUTTON_ID)
        # Inherited constants also available:
        self.open_website(self.Constants.LOGIN_URL)

constants.py is not required; scrapers work without it. It is recommended for any selector or string that appears more than once, or that would be hard to identify at a glance when something breaks.


Reference: automation.py

from consultores_automation import Automation, register
from automations.my_scraper.config import CODE, NAME, MyScraperSettings

@register(code=CODE, name=NAME)
class MyAutomation(Automation):

    Settings = MyScraperSettings   # connects the config to the framework

    # ── Helpers available in run() ───────────────────────────────────────

    # self.settings                     → instance of MyScraperSettings (+ framework vars)
    # self.logger                       → logger with format [timestamp | level | message]
    # self.s3                           → S3Utils (lazy — only fails if actually used)
    # self.set_output(x)                → store partial output before finishing
    # self.upload_output(path)          → upload file to S3, returns URL
    # self.make_zip_output(folder)      → zip folder, upload to S3, returns URL

    # ── Lifecycle hooks (all optional) ──────────────────────────────────

    def on_init(self):
        """Called before run(). Validates required vars and logs start by default."""
        self.logger.info("Starting...")

    def on_success(self):
        """Called after run() completes without errors."""
        pass

    def on_error(self, exc: Exception):
        """
        Called when run() raises an unhandled exception.
        Re-raises by default. Override to send alerts.
        """
        self.logger.error(f"Failed: {exc}")
        raise exc

    # ── Extra CLI arguments (optional) ──────────────────────────────────

    @classmethod
    def add_arguments(cls, parser):
        """Add args to the CLI. Accessible in run() via **kwargs."""
        parser.add_argument("--mode", default="normal", choices=["normal", "debug"])

    # ── Factory with DI (optional) ───────────────────────────────────────

    @classmethod
    def create(cls, headless=True, **kwargs):
        """
        Override if you need to build complex dependencies on instantiation.
        If not defined, the class is instantiated without arguments.
        """
        return cls(service=MyService(), headless=headless)

    # ── Main logic ───────────────────────────────────────────────────────

    def run(self, data: dict, mode="normal", **kwargs):
        cuit = data["cuit"]
        self.logger.info(f"Processing CUIT {cuit} in mode {mode}")
        # ... logic
        return {"result": "ok"}
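
As an illustration of what a helper like make_zip_output might do under the hood, here is a standard-library sketch of the zipping step (the real helper also uploads the archive to S3 and returns its URL):

```python
import shutil
from pathlib import Path

def zip_folder(folder: str) -> str:
    """Sketch: zip `folder` next to itself and return the archive path.
    The framework's make_zip_output would then upload this to S3."""
    folder_path = Path(folder)
    return shutil.make_archive(str(folder_path), "zip", root_dir=folder_path)
```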

Reference: runner CLI options

uv run execute [automation] [options]

  automation             Code of the automation to run.
                         Required if there is more than one. Auto-detected if only one.
                         In production it comes from the command that launches the container.

  --json PATH            Load input from a local JSON file (development).
                         In production not used — input comes from the S3_KEY env var.

  --fake-core-system     Do not call the core system (uses DummyClient).
                         Useful locally when CORE_SYSTEM_* is not configured.

  --no-context           Run without AutomationRunContext.
                         Output is logged but not sent anywhere.

  --head                 Open the browser with UI instead of headless.
                         Only works locally with a display.

  [scraper args]         Any args defined by add_arguments() in the class.

Reference: Selenium — BaseDriver, DriverBuilder and Mixins

Logger without imports

Every scraper that inherits from BaseDriver has self.logger available automatically, just like Automation. No imports needed:

class MyScraper(BaseDriver):
    def login(self, username, password):
        self.logger.info(f"Starting login for {username}")
        self.logger.error("Login failed")

DriverBuilder — customising the driver construction

DriverBuilder encapsulates all WebDriver construction logic. Its methods are independently overridable. To change a behaviour, create a subclass and assign it to driver_builder_class in the scraper:

from consultores_automation import DriverBuilder, BaseDriver

class MyDriverBuilder(DriverBuilder):
    def _find_chromium_binary(self):
        return "/opt/chrome/chrome"

    def _apply_stealth_options(self, options):
        super()._apply_stealth_options(options)
        options.add_argument("--disable-extensions")

class MyScraper(BaseDriver):
    driver_builder_class = MyDriverBuilder

Available methods to override in DriverBuilder:

| Method | What it does |
|---|---|
| _is_display_accessible(display) | Checks whether the X display is active via xdpyinfo |
| _ensure_xvfb_running(display) | Starts Xvfb if it is not running (needed on Fargate) |
| _configure_display(options, skip_headless, display) | Chooses between Xvfb and --headless=new |
| _apply_stealth_options(options) | Adds anti-WAF flags (user-agent, excludeSwitches, etc.) |
| _find_chromium_binary() | Resolves the path to the Chrome/Chromium binary |
| _find_chromedriver_binary() | Resolves the path to chromedriver |
| build_driver(options, ...) | Entry point that orchestrates all the steps above |

Available Mixins

Mixins are combined with multiple inheritance before BaseDriver. Order matters so that Python's MRO correctly stacks the get_chrome_options() calls.

# Correct pattern: mixins first, BaseDriver last
class MyScraper(MixinA, MixinB, BaseDriver):
    pass

| Mixin | What it adds |
|---|---|
| TempProfileDriverMixin | Temporary Chrome profile (a fresh --user-data-dir per run); avoids cache carrying over between executions |
| AutoDownloadDriverMixin | Automatic file downloads; configures downloads_dir and exposes wait_for_file_download() |
| BotDetectionMixin | Enables stealth options and STEALTH_SCRIPT via CDP; for sites with Incapsula/Imperva |
| VirtualDisplayMixin | Uses Xvfb (display=":99") instead of --headless=new, so Chrome behaves like on a desktop |
| CompleteBotDetectionEvasionMixin | BotDetectionMixin + VirtualDisplayMixin combined; maximum evasion level |

Common combination examples:

# Automatic downloads only
class MyScraper(AutoDownloadDriverMixin, BaseDriver):
    pass

# Full evasion + automatic downloads
class MyScraper(CompleteBotDetectionEvasionMixin, AutoDownloadDriverMixin, BaseDriver):
    pass

# Clean profile + basic evasion
class MyScraper(TempProfileDriverMixin, BotDetectionMixin, BaseDriver):
    pass
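
To see why order matters, here is a minimal, self-contained illustration of how cooperative super() calls stack options across the MRO. The classes are dummies, not the framework's real ones:

```python
class FakeBaseDriver:
    def get_chrome_options(self) -> list[str]:
        return ["--headless=new"]

class FakeDownloadMixin:
    def get_chrome_options(self) -> list[str]:
        # Cooperative: extend whatever the next class in the MRO returns.
        return super().get_chrome_options() + ["--download-dir=/tmp"]

class FakeStealthMixin:
    def get_chrome_options(self) -> list[str]:
        return super().get_chrome_options() + ["--disable-blink-features"]

# Mixins first, base last: each super() call walks toward FakeBaseDriver,
# so every class in the chain gets to contribute its options.
class FakeScraper(FakeStealthMixin, FakeDownloadMixin, FakeBaseDriver):
    pass
```

With the base class listed first instead, its get_chrome_options() would win the MRO lookup and the mixins' contributions would be silently skipped.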

Class attributes to configure the driver

These can be declared directly on the class without a mixin:

class MyScraper(BaseDriver):
    bot_detection_is_a_concern = True   # enables stealth options
    skip_headless = True                # uses Xvfb instead of --headless=new
    display = ":1"                      # X display to use (default: ":99")
    driver_builder_class = MyDriverBuilder  # custom builder

Selenium class hierarchy

DriverBuilder
    ← override individual methods to customise the driver

BaseDriver                               Constants = BaseDriverConstants
    ├── self.logger                   ← ready-to-use logger, no imports
    ├── self.driver                   ← WebDriver built by driver_builder_class
    ├── self.wait                     ← WebDriverWait with configured timeout
    ├── self.Constants                ← constants class (selectors, URLs)
    └── driver_builder_class          ← extension point for the driver

AfipScraper(BaseDriver)                  Constants = AfipScraperConstants
    ├── login(username, password)
    ├── search_for_service(service_name)
    └── fill_autocomplete_field(...)

SimplificacionRegistralScraper(AfipScraper)   Constants = SimplificacionRegistralConstants

AWS production flow

In production each scraper runs as an ECS Fargate container triggered by an S3 event:

Core system
   uploads JSON to S3
        │
        ▼
s3://bucket/certs_download/input/run_123.json
        │
        ▼
S3 Event → Lambda dispatch
        │
        ├── "certs_download/input/" → ECS task scraper-certs
        ├── "suss/input/"           → ECS task scraper-suss
        └── "altas/input/"          → ECS task scraper-altas
                │
                ▼
        Container starts with:
            S3_KEY                                = "certs_download/input/run_123.json"
            CORE_SYSTEM_CERTS_DOWNLOAD_SECRET_KEY = "..."
            CMD = python -m consultores_automation.runner
                │
                ▼
        runner discovers automations in automations/
        downloads JSON from S3
        runs the automation
        returns output to core system
        container exits (ephemeral)

For repos with multiple automations, each automation has its own infrastructure rule mapping its S3 path to the container command with its name:

S3 path /suss/input/   →  CMD ["python", "-m", "consultores_automation.runner", "suss"]
S3 path /altas/input/  →  CMD ["python", "-m", "consultores_automation.runner", "altas"]

Both share the same Docker image. The automation name comes from the command, not the input JSON. No always-on server — the container only exists while processing.


Proposed repo structure

GitHub/
├── consultores-automation-framework/   ← this library
│
├── scraper-altas/                      ← one repo per scraper (standard case)
├── scraper-bajas/
├── scraper-suss/
├── scraper-iva/
├── scraper-sworn/
└── scraper-certs/

Each scraper repo:

my-scraper/
├── automations/
│   └── my_scraper/          ← one automation per repo in the standard case
│       ├── __init__.py
│       ├── automation.py    ← logic: only implement run()
│       ├── config.py        ← CODE, NAME, Settings subclass with env vars
│       ├── constants.py     ← selectors and strings (optional)
│       └── tests/
│           └── input.json   ← test input
├── scripts/
│   └── build_push.sh
├── Dockerfile
├── docker-compose.yml
├── .env.example
└── pyproject.toml

Differences from the original monorepo

| Aspect | Before | Now |
|---|---|---|
| Register a scraper | @AutomationRegistry.register() in main.py + import in scrapers/__init__.py | @register() directly on the class, one file |
| Configuration | settings.add(X=EnvVar("X")) in config.py | class MySettings(Settings): X = os.getenv("X") |
| Credentials in run | os.getenv() in the factory in main.py | self.settings.X in run() |
| Logger | from scrapers.common.logger import logger | self.logger.info(), no imports |
| S3 | from scrapers.common.utils import s3_utils | self.s3.upload_to_s3(), no imports |
| Upload output | Manual code in each scraper | self.upload_output("file.pdf") |
| Core system secret key | CORE_SYSTEM_SECRET_KEY, one for the whole repo | CORE_SYSTEM_{CODE}_SECRET_KEY, one per automation |
| File structure | scrapers/automation.py inside a package | automations/<name>/automation.py |
| Imports | from scrapers.common.X import Y | from consultores_automation.X import Y |
| Selenium module | selenium/ | driver/ |
| Scaffolding | python -m consultores_automation init --name X | uv run setup + uv run automation |
| Run locally | python -m consultores_automation.runner ... | uv run execute ... |
