Linkcharsoft/consultores-automation-framework
consultores-automation-framework

Shared framework for AFIP/ARCA scrapers. Provides all the infrastructure needed so that each scraper is an independent repo that only has to implement its own logic.


Why this project exists

The original system was a monorepo with all scrapers together (altas, bajas, suss, iva, sworn_statements, certs_download). That worked while there were few of them, but caused problems at scale:

| Problem | Monorepo | This framework |
|---|---|---|
| Dependencies | All share the same pyproject.toml; if altas needs langchain, everyone installs it | Each repo has only its own deps |
| Deploy | One giant Docker image; changing one line in suss rebuilds everything | Minimal Docker image per scraper |
| Environment variables | .env holds OPENAI_API_KEY, PROFIT_USERNAME, credentials for everyone | Each container only has its own variables |
| Memory | The runner imports all scrapers even if only one runs | Only loads what it needs |
| Onboarding | Adding a scraper requires understanding the whole architecture | Edit 2 files and implement run() |

What the library includes

consultores_automation/
├── automation/
│   ├── automation.py    ← Automation base class (hooks, helpers)
│   ├── registry.py      ← @register — registers and configures the scraper
│   ├── runner.py        ← CLI entrypoint (python -m consultores_automation.runner)
│   └── provider.py      ← input providers: local JSON or S3
│
├── core_system/
│   ├── client.py        ← HTTP client to the central run-tracking system
│   ├── run_context.py   ← context manager wrapping each execution
│   └── schemas.py       ← Pydantic models (AutomationRunSchema, StatusEnum)
│
├── driver/
│   ├── base.py          ← DriverBuilder + BaseDriver
│   ├── mixins.py        ← TempProfile, AutoDownload, BotDetection,
│   │                       VirtualDisplay, CompleteBotDetectionEvasion
│   └── constants.py     ← BaseDriverConstants (base constants class)
│
├── afip/
│   ├── scraper.py       ← AfipScraper: AFIP login, service search
│   └── constants.py     ← AfipScraperConstants, SimplificacionRegistralConstants
│
├── utils/
│   ├── s3.py            ← S3Utils (upload, download, zip+upload)
│   ├── files.py         ← local file helpers
│   ├── misc.py          ← miscellaneous helpers
│   └── dates.py         ← get_month_period() and date utilities
│
├── settings.py          ← unified configuration (Settings base class)
├── logger.py            ← pre-configured logger
├── exceptions.py        ← base exception hierarchy
├── alert.py             ← Discord notifications
└── cli.py               ← uv run commands: execute, setup, automation, help

Compatibility shims (kept for backward compatibility, do not use in new code):

| File | Points to |
|---|---|
| conf.py | settings.py |
| envs.py | settings.py |
| rets.py | utils/dates.py |

Quick start: creating a new scraper

1. Scaffold the repo

From the root directory of the new repo:

uv run setup --name my_scraper --label "My Scraper"
uv run automation --name my_scraper --label "My Scraper"

This generates:

my-scraper/
├── automations/
│   └── my_scraper/
│       ├── __init__.py
│       ├── automation.py   ← implement run() here
│       ├── config.py       ← CODE, NAME and environment variables
│       ├── constants.py    ← selectors and scraper strings (optional)
│       └── tests/
│           └── input.json  ← test input for this automation
├── scripts/
│   └── build_push.sh       ← script to build and push to ECR
├── Dockerfile
├── docker-compose.yml
├── .env.example
├── pyproject.toml          ← ready to use (adjust deps if needed)
├── .gitignore
├── .dockerignore
└── README.md

2. Setup

cp .env.example .env
# Fill in .env with the credentials
uv sync

3. Implement the logic

Only two files need to be edited inside automations/my_scraper/:

config.py — declare the scraper's own variables:

import os
from consultores_automation.settings import Settings

CODE = "my_scraper"
NAME = "My Scraper"

class MyScraperSettings(Settings):
    # Framework variables (AFIP_USERNAME, S3_BUCKET, etc.)
    # are already in the base class — no need to re-declare them.
    # Only add your own:
    PROFIT_USERNAME = os.getenv("PROFIT_USERNAME")
    PROFIT_PASSWORD = os.getenv("PROFIT_PASSWORD")

    # Declare mandatory variables — validated automatically before run():
    required = ["AFIP_USERNAME", "AFIP_PASSWORD", "PROFIT_USERNAME"]

automation.py — implement run():

from consultores_automation import Automation, register
from automations.my_scraper.config import CODE, NAME, MyScraperSettings

@register(code=CODE, name=NAME)
class MyScraperAutomation(Automation):
    Settings = MyScraperSettings

    def run(self, data: dict, **kwargs):
        self.logger.info(f"Processing: {data['cuit']}")

        # MyScraper is your own scraper class, implemented in this repo
        scraper = MyScraper(
            username=self.settings.AFIP_USERNAME,
            password=self.settings.AFIP_PASSWORD,
        )
        result = scraper.process(data)

        url = self.upload_output("files/downloads/result.pdf")
        return {"url": url, "success": True}

4. Test it

uv sync

# Without core system credentials:
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system

# With visible browser (debug):
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system --head

# Without any external infrastructure (no run tracking):
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system --no-context

# Multi-automation repo — specify the automation name:
uv run execute my_scraper --json automations/my_scraper/tests/input.json --fake-core-system

Available commands

| Command | Description |
|---|---|
| uv run execute | Run an automation locally |
| uv run setup | Scaffold repo infra (docker, configs, .env) |
| uv run automation | Add a new automation to the repo |
| uv run help | Show all commands and options |

Run uv run help for the full reference with all flags and examples.


Multiple automations in one repo

90% of repos have a single automation. But when two automations are very similar (shared infrastructure, credentials, dependencies), it can make sense to keep them together.

Add a second automation

uv run automation --name other_automation --label "Other Automation"

The resulting structure:

my-scraper/
├── automations/
│   ├── my_scraper/
│   │   ├── automation.py
│   │   ├── config.py
│   │   ├── constants.py
│   │   └── tests/
│   │       └── input.json
│   └── other_automation/
│       ├── automation.py
│       ├── config.py
│       ├── constants.py
│       └── tests/
│           └── input.json
└── ...

The runner discovers both automatically by scanning the automations/ folder.
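
Conceptually, discovery amounts to scanning automations/ for subfolders that contain an automation.py. This is a hedged sketch of that idea, not the framework's actual code:

```python
from pathlib import Path

def discover_automations(root: str) -> list[str]:
    """Sketch: list subfolders of `root` that look like automations,
    i.e. contain an automation.py. Illustrative only."""
    base = Path(root)
    return sorted(
        p.name
        for p in base.iterdir()
        if p.is_dir() and (p / "automation.py").exists()
    )
```

Importing each discovered module then triggers the @register decorator, which is how the runner learns about the automation classes.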

Running (local and production)

The automation name is passed as the first argument, both in development and production:

# Local
uv run execute my_scraper       --json automations/my_scraper/tests/input.json --fake-core-system
uv run execute other_automation --json automations/other_automation/tests/input.json --fake-core-system

# Production (S3_KEY in env, no --json)
python -m consultores_automation.runner my_scraper
python -m consultores_automation.runner other_automation

In production, the infrastructure maps each S3 path to the corresponding container command:

S3 path /my_scraper/input/       →  CMD python -m consultores_automation.runner my_scraper
S3 path /other_automation/input/ →  CMD python -m consultores_automation.runner other_automation

Both share the same Docker image. The automation name comes from the command, not the input JSON.


Execution flow

              .env / S3 / --json
                    │
                    ▼
              runner.py (CLI)
                    │
                    ├── discovers automations/ by scanning subfolders
                    ├── if more than one: determines which by subcommand
                    ├── reads input (local JSON or S3)
                    ├── instantiates Automation via @register
                    └── opens AutomationRunContext
                                │
                    ┌───────────┘
                    │
                    ▼
          automation.on_init()        ← validates required env vars
                    │
                    ▼
          automation.run(data)
              [scraper logic]
              returns dict / None
                    │
                    ▼
          automation.on_success()  (or on_error() if exception)
                    │
                    ▼
          AutomationRunContext.__exit__()
              ├── sends logs + output + status to the core system
              └── uploads debug/ folder to S3 if it exists

Reference: config.py

import os
from consultores_automation.settings import Settings

CODE = "my_scraper"    # snake_case, unique per scraper
NAME = "My Scraper"    # human-readable name

class MyScraperSettings(Settings):

    # ── Scraper-specific variables ─────────────────────────────────────
    PROFIT_USERNAME = os.getenv("PROFIT_USERNAME")
    PROFIT_PASSWORD = os.getenv("PROFIT_PASSWORD")
    MAX_ITEMS       = int(os.getenv("MAX_ITEMS", "100"))

    # ── Mandatory variables — validated automatically before run() ─────
    required = ["AFIP_USERNAME", "AFIP_PASSWORD", "PROFIT_USERNAME"]

    # ── Framework variables (available without re-declaring) ───────────
    # IS_PRODUCTION  = os.getenv("IS_PRODUCTION", "false").lower() == "true"
    # DEBUG          = os.getenv("DEBUG",         "false").lower() == "true"
    # DEMO           = os.getenv("DEMO",          "false").lower() == "true"
    # DRY_RUN        = os.getenv("DRY_RUN",       "false").lower() == "true"
    # AFIP_USERNAME  = os.getenv("AFIP_USERNAME")
    # AFIP_PASSWORD  = os.getenv("AFIP_PASSWORD")
    # S3_BUCKET      = os.getenv("S3_BUCKET")
    # S3_KEY         = os.getenv("S3_KEY")              # injected by Lambda
    # CORE_SYSTEM_API_URL = os.getenv("CORE_SYSTEM_API_URL")
    # DISCORD_WEBHOOK_URL = os.getenv("DISCORD_WEBHOOK_URL")

Access from the scraper (all variables — own and framework):

self.settings.PROFIT_USERNAME       # → own variable
self.settings.MAX_ITEMS             # → with default if not in .env
self.settings.AFIP_USERNAME         # → from framework, comes from .env
self.settings.DEMO                  # → from framework, always available
self.settings.is_production()       # → bool helpers
self.settings.is_debug()
self.settings.is_demo()
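
The required check runs automatically before run(). A hypothetical sketch of what that validation might look like (the names here are illustrative, not the framework's real API):

```python
class RequiredVarsError(Exception):
    """Raised when a declared-required setting is missing."""

def validate_required(settings) -> None:
    # Sketch: fail fast if any name listed in `required` is unset or empty.
    missing = [
        name for name in getattr(settings, "required", [])
        if not getattr(settings, name, None)
    ]
    if missing:
        raise RequiredVarsError(
            f"Missing required settings: {', '.join(missing)}"
        )
```

Failing before run() starts means a misconfigured container aborts immediately instead of dying mid-scrape.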

The secret key for the core system is resolved automatically by convention:

CORE_SYSTEM_{CODE}_SECRET_KEY   e.g.  CORE_SYSTEM_MY_SCRAPER_SECRET_KEY

This allows multiple automations in the same environment to each have their own key.
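
In other words, the environment variable name is derived from the automation's CODE. A one-line illustration (the helper name is hypothetical):

```python
def secret_key_env_name(code: str) -> str:
    # Convention: CORE_SYSTEM_{CODE}_SECRET_KEY, with CODE upper-cased.
    return f"CORE_SYSTEM_{code.upper()}_SECRET_KEY"
```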


Reference: constants.py (optional)

Centralizes the scraper's CSS/XPath selectors, element IDs and URLs. When something breaks in Selenium, it is immediately clear what was being targeted, without hunting for strings scattered across the code.

# Choose the base class that matches the scraper being extended
from consultores_automation.afip.constants import AfipScraperConstants

class MyScraperConstants(AfipScraperConstants):
    MY_BUTTON_ID    = "btnProcess"
    MY_TABLE_XPATH  = "//table[@id='grid']"
    MY_INPUT_ID     = "txtSearch"

Available hierarchy:

| Class | Inherits from | When to use |
|---|---|---|
| BaseDriverConstants | — | Generic scraper (not AFIP) |
| AfipScraperConstants | BaseDriverConstants | Scraper extending AfipScraper |
| SimplificacionRegistralConstants | AfipScraperConstants | Scraper extending SimplificacionRegistralScraper |

Assign the class in the scraper:

from selenium.webdriver.common.by import By

from automations.my_scraper.constants import MyScraperConstants

class MyScraper(AfipScraper):
    Constants = MyScraperConstants   # inherits all AFIP constants + own ones

    def process(self):
        self.click_element(By.ID, self.Constants.MY_BUTTON_ID)
        # Inherited constants also available:
        self.open_website(self.Constants.LOGIN_URL)

constants.py is not required; scrapers work without it. It is recommended for any selector or string that appears more than once, or that would be hard to identify at a glance when something breaks.


Reference: automation.py

from consultores_automation import Automation, register
from automations.my_scraper.config import CODE, NAME, MyScraperSettings

@register(code=CODE, name=NAME)
class MyAutomation(Automation):

    Settings = MyScraperSettings   # connects the config to the framework

    # ── Helpers available in run() ───────────────────────────────────────

    # self.settings                     → instance of MyScraperSettings (+ framework vars)
    # self.logger                       → logger with format [timestamp | level | message]
    # self.s3                           → S3Utils (lazy — only fails if actually used)
    # self.set_output(x)                → store partial output before finishing
    # self.upload_output(path)          → upload file to S3, returns URL
    # self.make_zip_output(folder)      → zip folder, upload to S3, returns URL

    # ── Lifecycle hooks (all optional) ──────────────────────────────────

    def on_init(self):
        """Called before run(). Validates required vars and logs start by default."""
        self.logger.info("Starting...")

    def on_success(self):
        """Called after run() completes without errors."""
        pass

    def on_error(self, exc: Exception):
        """
        Called when run() raises an unhandled exception.
        Re-raises by default. Override to send alerts.
        """
        self.logger.error(f"Failed: {exc}")
        raise exc

    # ── Extra CLI arguments (optional) ──────────────────────────────────

    @classmethod
    def add_arguments(cls, parser):
        """Add args to the CLI. Accessible in run() via **kwargs."""
        parser.add_argument("--mode", default="normal", choices=["normal", "debug"])

    # ── Factory with DI (optional) ───────────────────────────────────────

    @classmethod
    def create(cls, headless=True, **kwargs):
        """
        Override if you need to build complex dependencies on instantiation.
        If not defined, the class is instantiated without arguments.
        """
        return cls(service=MyService(), headless=headless)

    # ── Main logic ───────────────────────────────────────────────────────

    def run(self, data: dict, mode="normal", **kwargs):
        cuit = data["cuit"]
        self.logger.info(f"Processing CUIT {cuit} in mode {mode}")
        # ... logic
        return {"result": "ok"}
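
As an illustration of what a helper like make_zip_output might do under the hood, here is a standard-library sketch of the zipping step (the real helper also uploads the archive to S3 and returns its URL):

```python
import shutil
from pathlib import Path

def zip_folder(folder: str) -> str:
    """Sketch: zip `folder` next to itself and return the archive path.
    The framework's make_zip_output would then upload this to S3."""
    folder_path = Path(folder)
    return shutil.make_archive(str(folder_path), "zip", root_dir=folder_path)
```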

Reference: runner CLI options

uv run execute [automation] [options]

  automation             Code of the automation to run.
                         Required if there is more than one. Auto-detected if only one.
                         In production it comes from the command that launches the container.

  --json PATH            Load input from a local JSON file (development).
                         In production not used — input comes from the S3_KEY env var.

  --fake-core-system     Do not call the core system (uses DummyClient).
                         Useful locally when CORE_SYSTEM_* is not configured.

  --no-context           Run without AutomationRunContext.
                         Output is logged but not sent anywhere.

  --head                 Open the browser with UI instead of headless.
                         Only works locally with a display.

  [scraper args]         Any args defined by add_arguments() in the class.

Reference: Selenium — BaseDriver, DriverBuilder and Mixins

Logger without imports

Every scraper that inherits from BaseDriver has self.logger available automatically, just like Automation. No imports needed:

class MyScraper(BaseDriver):
    def login(self, username, password):
        self.logger.info(f"Starting login for {username}")
        self.logger.error("Login failed")

DriverBuilder — customising the driver construction

DriverBuilder encapsulates all WebDriver construction logic. Its methods are independently overridable. To change a behaviour, create a subclass and assign it to driver_builder_class in the scraper:

from consultores_automation import DriverBuilder, BaseDriver

class MyDriverBuilder(DriverBuilder):
    def _find_chromium_binary(self):
        return "/opt/chrome/chrome"

    def _apply_stealth_options(self, options):
        super()._apply_stealth_options(options)
        options.add_argument("--disable-extensions")

class MyScraper(BaseDriver):
    driver_builder_class = MyDriverBuilder

Available methods to override in DriverBuilder:

| Method | What it does |
|---|---|
| _is_display_accessible(display) | Checks whether the X display is active via xdpyinfo |
| _ensure_xvfb_running(display) | Starts Xvfb if it is not running (needed on Fargate) |
| _configure_display(options, skip_headless, display) | Chooses between Xvfb and --headless=new |
| _apply_stealth_options(options) | Adds anti-WAF flags (user-agent, excludeSwitches, etc.) |
| _find_chromium_binary() | Resolves the path to the Chrome/Chromium binary |
| _find_chromedriver_binary() | Resolves the path to chromedriver |
| build_driver(options, ...) | Entry point that orchestrates all the steps above |

Available Mixins

Mixins are combined with multiple inheritance before BaseDriver. Order matters so that Python's MRO correctly stacks the get_chrome_options() calls.

# Correct pattern: mixins first, BaseDriver last
class MyScraper(MixinA, MixinB, BaseDriver):
    pass

| Mixin | What it adds |
|---|---|
| TempProfileDriverMixin | Temporary Chrome profile (a fresh --user-data-dir per run); avoids cache carrying over between executions |
| AutoDownloadDriverMixin | Automatic file downloads; configures downloads_dir and exposes wait_for_file_download() |
| BotDetectionMixin | Enables stealth options and STEALTH_SCRIPT via CDP; for sites with Incapsula/Imperva |
| VirtualDisplayMixin | Uses Xvfb (display=":99") instead of --headless=new, so Chrome behaves like on a desktop |
| CompleteBotDetectionEvasionMixin | BotDetectionMixin + VirtualDisplayMixin combined; maximum evasion level |

Common combination examples:

# Automatic downloads only
class MyScraper(AutoDownloadDriverMixin, BaseDriver):
    pass

# Full evasion + automatic downloads
class MyScraper(CompleteBotDetectionEvasionMixin, AutoDownloadDriverMixin, BaseDriver):
    pass

# Clean profile + basic evasion
class MyScraper(TempProfileDriverMixin, BotDetectionMixin, BaseDriver):
    pass
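
To see why order matters, here is a minimal, self-contained illustration of how cooperative super() calls stack options across the MRO. The classes are dummies, not the framework's real ones:

```python
class FakeBaseDriver:
    def get_chrome_options(self) -> list[str]:
        return ["--headless=new"]

class FakeDownloadMixin:
    def get_chrome_options(self) -> list[str]:
        # Cooperative: extend whatever the next class in the MRO returns.
        return super().get_chrome_options() + ["--download-dir=/tmp"]

class FakeStealthMixin:
    def get_chrome_options(self) -> list[str]:
        return super().get_chrome_options() + ["--disable-blink-features"]

# Mixins first, base last: each super() call walks toward FakeBaseDriver,
# so every class in the chain gets to contribute its options.
class FakeScraper(FakeStealthMixin, FakeDownloadMixin, FakeBaseDriver):
    pass
```

With the base class listed first instead, its get_chrome_options() would win the MRO lookup and the mixins' contributions would be silently skipped.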

Class attributes to configure the driver

These can be declared directly on the class without a mixin:

class MyScraper(BaseDriver):
    bot_detection_is_a_concern = True   # enables stealth options
    skip_headless = True                # uses Xvfb instead of --headless=new
    display = ":1"                      # X display to use (default: ":99")
    driver_builder_class = MyDriverBuilder  # custom builder

Selenium class hierarchy

DriverBuilder
    ← override individual methods to customise the driver

BaseDriver                               Constants = BaseDriverConstants
    ├── self.logger                   ← ready-to-use logger, no imports
    ├── self.driver                   ← WebDriver built by driver_builder_class
    ├── self.wait                     ← WebDriverWait with configured timeout
    ├── self.Constants                ← constants class (selectors, URLs)
    └── driver_builder_class          ← extension point for the driver

AfipScraper(BaseDriver)                  Constants = AfipScraperConstants
    ├── login(username, password)
    ├── search_for_service(service_name)
    └── fill_autocomplete_field(...)

SimplificacionRegistralScraper(AfipScraper)   Constants = SimplificacionRegistralConstants

AWS production flow

In production each scraper runs as an ECS Fargate container triggered by an S3 event:

Core system
   uploads JSON to S3
        │
        ▼
s3://bucket/certs_download/input/run_123.json
        │
        ▼
S3 Event → Lambda dispatch
        │
        ├── "certs_download/input/" → ECS task scraper-certs
        ├── "suss/input/"           → ECS task scraper-suss
        └── "altas/input/"          → ECS task scraper-altas
                │
                ▼
        Container starts with:
            S3_KEY                                = "certs_download/input/run_123.json"
            CORE_SYSTEM_CERTS_DOWNLOAD_SECRET_KEY = "..."
            CMD = python -m consultores_automation.runner
                │
                ▼
        runner discovers automations in automations/
        downloads JSON from S3
        runs the automation
        returns output to core system
        container exits (ephemeral)

For repos with multiple automations, each automation has its own infrastructure rule mapping its S3 path to the container command with its name:

S3 path /suss/input/   →  CMD ["python", "-m", "consultores_automation.runner", "suss"]
S3 path /altas/input/  →  CMD ["python", "-m", "consultores_automation.runner", "altas"]

Both share the same Docker image. The automation name comes from the command, not the input JSON. No always-on server — the container only exists while processing.


Proposed repo structure

GitHub/
├── consultores-automation-framework/   ← this library
│
├── scraper-altas/                      ← one repo per scraper (standard case)
├── scraper-bajas/
├── scraper-suss/
├── scraper-iva/
├── scraper-sworn/
└── scraper-certs/

Each scraper repo:

my-scraper/
├── automations/
│   └── my_scraper/          ← one automation per repo in the standard case
│       ├── __init__.py
│       ├── automation.py    ← logic: only implement run()
│       ├── config.py        ← CODE, NAME, Settings subclass with env vars
│       ├── constants.py     ← selectors and strings (optional)
│       └── tests/
│           └── input.json   ← test input
├── scripts/
│   └── build_push.sh
├── Dockerfile
├── docker-compose.yml
├── .env.example
└── pyproject.toml

Differences from the original monorepo

| Aspect | Before | Now |
|---|---|---|
| Register a scraper | @AutomationRegistry.register() in main.py + import in scrapers/__init__.py | @register() directly on the class, one file |
| Configuration | settings.add(X=EnvVar("X")) in config.py | class MySettings(Settings): X = os.getenv("X") |
| Credentials in run | os.getenv() in the factory in main.py | self.settings.X in run() |
| Logger | from scrapers.common.logger import logger | self.logger.info(), no imports |
| S3 | from scrapers.common.utils import s3_utils | self.s3.upload_to_s3(), no imports |
| Upload output | Manual code in each scraper | self.upload_output("file.pdf") |
| Core system secret key | CORE_SYSTEM_SECRET_KEY, one for the whole repo | CORE_SYSTEM_{CODE}_SECRET_KEY, one per automation |
| File structure | scrapers/automation.py inside a package | automations/<name>/automation.py |
| Imports | from scrapers.common.X import Y | from consultores_automation.X import Y |
| Selenium module | selenium/ | driver/ |
| Scaffolding | python -m consultores_automation init --name X | uv run setup + uv run automation |
| Run locally | python -m consultores_automation.runner ... | uv run execute ... |
