Shared framework for AFIP/ARCA scrapers. Provides all the infrastructure needed so that each scraper is an independent repo that only has to implement its own logic.
The original system was a monorepo with all scrapers together (altas, bajas, suss, iva, sworn_statements, certs_download). That worked while there were few of them, but caused problems at scale:
| Problem | Monorepo | This framework |
|---|---|---|
| Dependencies | All share the same pyproject.toml. If altas needs langchain, everyone installs it | Each repo has only its own deps |
| Deploy | One giant Docker image. Changing one line in suss rebuilds everything | Minimal Docker image per scraper |
| Environment variables | .env has OPENAI_API_KEY, PROFIT_USERNAME, credentials for everyone | Each container only has its own variables |
| Memory | The runner imports all scrapers even if only one runs | Only loads what it needs |
| Onboarding | Adding a scraper requires understanding the whole architecture | Edit 2 files and implement run() |
consultores_automation/
├── automation/
│ ├── automation.py ← Automation base class (hooks, helpers)
│ ├── registry.py ← @register — registers and configures the scraper
│ ├── runner.py ← CLI entrypoint (python -m consultores_automation.runner)
│ └── provider.py ← input providers: local JSON or S3
│
├── core_system/
│ ├── client.py ← HTTP client to the central run-tracking system
│ ├── run_context.py ← context manager wrapping each execution
│ └── schemas.py ← Pydantic models (AutomationRunSchema, StatusEnum)
│
├── driver/
│ ├── base.py ← DriverBuilder + BaseDriver
│ ├── mixins.py ← TempProfile, AutoDownload, BotDetection,
│ │ VirtualDisplay, CompleteBotDetectionEvasion
│ └── constants.py ← BaseDriverConstants (base constants class)
│
├── afip/
│ ├── scraper.py ← AfipScraper: AFIP login, service search
│ └── constants.py ← AfipScraperConstants, SimplificacionRegistralConstants
│
├── utils/
│ ├── s3.py ← S3Utils (upload, download, zip+upload)
│ ├── files.py ← local file helpers
│ ├── misc.py ← miscellaneous helpers
│ └── dates.py ← get_month_period() and date utilities
│
├── settings.py ← unified configuration (Settings base class)
├── logger.py ← pre-configured logger
├── exceptions.py ← base exception hierarchy
├── alert.py ← Discord notifications
└── cli.py ← uv run commands: execute, setup, automation, help
Compatibility shims (kept so existing imports keep working; do not use them in new code):
| File | Points to |
|---|---|
| conf.py | settings.py |
| envs.py | settings.py |
| rets.py | utils/dates.py |
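For example, an old import that goes through the shim still resolves, but new code should target the new module directly. Whether each shim re-exports exactly the same names is not guaranteed, so treat this as illustrative:

```python
# Old (via the conf.py shim, assuming it re-exports Settings): avoid in new code
from consultores_automation.conf import Settings

# New
from consultores_automation.settings import Settings
```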
From the root directory of the new repo:
uv run setup --name my_scraper --label "My Scraper"
uv run automation --name my_scraper --label "My Scraper"

This generates:
my-scraper/
├── automations/
│ └── my_scraper/
│ ├── __init__.py
│ ├── automation.py ← implement run() here
│ ├── config.py ← CODE, NAME and environment variables
│ ├── constants.py ← selectors and scraper strings (optional)
│ └── tests/
│ └── input.json ← test input for this automation
├── scripts/
│ └── build_push.sh ← script to build and push to ECR
├── Dockerfile
├── docker-compose.yml
├── .env.example
├── pyproject.toml ← ready to use (adjust deps if needed)
├── .gitignore
├── .dockerignore
└── README.md
cp .env.example .env
# Fill in .env with the credentials
uv sync

Only two files need to be edited inside automations/my_scraper/:
config.py — declare the scraper's own variables:
import os
from consultores_automation.settings import Settings
CODE = "my_scraper"
NAME = "My Scraper"
class MyScraperSettings(Settings):
# Framework variables (AFIP_USERNAME, S3_BUCKET, etc.)
# are already in the base class — no need to re-declare them.
# Only add your own:
PROFIT_USERNAME = os.getenv("PROFIT_USERNAME")
PROFIT_PASSWORD = os.getenv("PROFIT_PASSWORD")
# Declare mandatory variables — validated automatically before run():
    required = ["AFIP_USERNAME", "AFIP_PASSWORD", "PROFIT_USERNAME"]

automation.py — implement run():
from consultores_automation import Automation, register
from automations.my_scraper.config import CODE, NAME, MyScraperSettings
from automations.my_scraper.scraper import MyScraper  # your scraper class; the module location is up to you
@register(code=CODE, name=NAME)
class MyScraperAutomation(Automation):
Settings = MyScraperSettings
def run(self, data: dict, **kwargs):
self.logger.info(f"Processing: {data['cuit']}")
scraper = MyScraper(
username=self.settings.AFIP_USERNAME,
password=self.settings.AFIP_PASSWORD,
)
result = scraper.process(data)
url = self.upload_output("files/downloads/result.pdf")
        return {"url": url, "success": True}

Install the dependencies and run it locally:

uv sync
# Without core system credentials:
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system
# With visible browser (debug):
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system --head
# Without any external infrastructure (no run tracking):
uv run execute --json automations/my_scraper/tests/input.json --fake-core-system --no-context
# Multi-automation repo — specify the automation name:
uv run execute my_scraper --json automations/my_scraper/tests/input.json --fake-core-system

| Command | Description |
|---|---|
| uv run execute | Run an automation locally |
| uv run setup | Scaffold repo infra (docker, configs, .env) |
| uv run automation | Add a new automation to the repo |
| uv run help | Show all commands and options |
Run uv run help for the full reference with all flags and examples.
90% of repos have a single automation. But when two automations are very similar (shared infrastructure, credentials, dependencies), it can make sense to keep them together.
uv run automation --name other_automation --label "Other Automation"

The resulting structure:
my-scraper/
├── automations/
│ ├── my_scraper/
│ │ ├── automation.py
│ │ ├── config.py
│ │ ├── constants.py
│ │ └── tests/
│ │ └── input.json
│ └── other_automation/
│ ├── automation.py
│ ├── config.py
│ ├── constants.py
│ └── tests/
│ └── input.json
└── ...
The runner discovers both automatically by scanning the automations/ folder.
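As a rough sketch of what that discovery amounts to (illustrative only, not the framework's actual code), the runner just needs to import each automation module so that @register fires and the class ends up in the registry:

```python
# Illustrative sketch of automation discovery, not the framework's implementation
import importlib
from pathlib import Path

def discover_automations(root: Path = Path("automations")) -> None:
    for pkg in sorted(p for p in root.iterdir() if p.is_dir()):
        if (pkg / "automation.py").exists():
            # Importing the module runs the @register decorator on the Automation subclass
            importlib.import_module(f"automations.{pkg.name}.automation")
```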
The automation name is passed as the first argument, both in development and production:
# Local
uv run execute my_scraper --json automations/my_scraper/tests/input.json --fake-core-system
uv run execute other_automation --json automations/other_automation/tests/input.json --fake-core-system
# Production (S3_KEY in env, no --json)
python -m consultores_automation.runner my_scraper
python -m consultores_automation.runner other_automation

In production, the infrastructure maps each S3 path to the corresponding container command:
S3 path /my_scraper/input/ → CMD python -m consultores_automation.runner my_scraper
S3 path /other_automation/input/ → CMD python -m consultores_automation.runner other_automation
Both share the same Docker image. The automation name comes from the command, not the input JSON.
.env / S3 / --json
│
▼
runner.py (CLI)
│
├── discovers automations/ by scanning subfolders
├── if more than one: determines which by subcommand
├── reads input (local JSON or S3)
├── instantiates Automation via @register
└── opens AutomationRunContext
│
┌───────────┘
│
▼
automation.on_init() ← validates required env vars
│
▼
automation.run(data)
[scraper logic]
returns dict / None
│
▼
automation.on_success() (or on_error() if exception)
│
▼
AutomationRunContext.__exit__()
├── sends logs + output + status to the core system
└── uploads debug/ folder to S3 if it exists
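In code terms, the lifecycle roughly amounts to the following. This is a simplified sketch: the real runner and AutomationRunContext do more, and the constructor arguments shown here are assumptions; only the hooks, set_output() and the re-raise behaviour of on_error() come from the reference below.

```python
# Simplified sketch of the lifecycle above, not the real runner code
from consultores_automation.core_system.run_context import AutomationRunContext  # path per core_system/ above

def run_automation(automation, data: dict, **kwargs):
    with AutomationRunContext(automation):            # constructor arguments are assumed
        automation.on_init()                          # validates required env vars, logs start
        try:
            output = automation.run(data, **kwargs)   # scraper logic; returns dict or None
            automation.set_output(output)             # stored so __exit__ can send it to the core system
            automation.on_success()
        except Exception as exc:
            automation.on_error(exc)                  # re-raises by default, so __exit__ still reports the failure
            raise
```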
import os
from consultores_automation.settings import Settings
CODE = "my_scraper" # snake_case, unique per scraper
NAME = "My Scraper" # human-readable name
class MyScraperSettings(Settings):
# ── Scraper-specific variables ─────────────────────────────────────
PROFIT_USERNAME = os.getenv("PROFIT_USERNAME")
PROFIT_PASSWORD = os.getenv("PROFIT_PASSWORD")
MAX_ITEMS = int(os.getenv("MAX_ITEMS", "100"))
# ── Mandatory variables — validated automatically before run() ─────
required = ["AFIP_USERNAME", "AFIP_PASSWORD", "PROFIT_USERNAME"]
# ── Framework variables (available without re-declaring) ───────────
# IS_PRODUCTION = os.getenv("IS_PRODUCTION", "false").lower() == "true"
# DEBUG = os.getenv("DEBUG", "false").lower() == "true"
# DEMO = os.getenv("DEMO", "false").lower() == "true"
# DRY_RUN = os.getenv("DRY_RUN", "false").lower() == "true"
# AFIP_USERNAME = os.getenv("AFIP_USERNAME")
# AFIP_PASSWORD = os.getenv("AFIP_PASSWORD")
# S3_BUCKET = os.getenv("S3_BUCKET")
# S3_KEY = os.getenv("S3_KEY") # injected by Lambda
# CORE_SYSTEM_API_URL = os.getenv("CORE_SYSTEM_API_URL")
    # DISCORD_WEBHOOK_URL = os.getenv("DISCORD_WEBHOOK_URL")

Access from the scraper (all variables — own and framework):
self.settings.PROFIT_USERNAME # → own variable
self.settings.MAX_ITEMS # → with default if not in .env
self.settings.AFIP_USERNAME # → from framework, comes from .env
self.settings.DEMO # → from framework, always available
self.settings.is_production() # → bool helpers
self.settings.is_debug()
self.settings.is_demo()

The secret key for the core system is resolved automatically by convention:
CORE_SYSTEM_{CODE}_SECRET_KEY e.g. CORE_SYSTEM_MY_SCRAPER_SECRET_KEY
This allows multiple automations in the same environment to each have their own key.
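In other words, the variable name is derived from CODE roughly like this (a sketch of the convention, not the framework's exact code):

```python
CODE = "my_scraper"
secret_key_var = f"CORE_SYSTEM_{CODE.upper()}_SECRET_KEY"
# → "CORE_SYSTEM_MY_SCRAPER_SECRET_KEY", read from the environment by the core system client
```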
Centralizes the scraper's CSS/XPath selectors, element IDs and URLs. When something breaks in Selenium, it is immediately clear what was being targeted, without hunting for strings scattered across the code.
# Choose the base class that matches the scraper being extended
from consultores_automation.afip.constants import AfipScraperConstants
class MyScraperConstants(AfipScraperConstants):
MY_BUTTON_ID = "btnProcess"
MY_TABLE_XPATH = "//table[@id='grid']"
    MY_INPUT_ID = "txtSearch"

Available hierarchy:
| Class | Inherits from | When to use |
|---|---|---|
| BaseDriverConstants | — | Generic scraper (not AFIP) |
| AfipScraperConstants | BaseDriverConstants | Scraper extending AfipScraper |
| SimplificacionRegistralConstants | AfipScraperConstants | Scraper extending SimplificacionRegistralScraper |
Assign the class in the scraper:
from automations.my_scraper.constants import MyScraperConstants
class MyScraper(AfipScraper):
Constants = MyScraperConstants # inherits all AFIP constants + own ones
def process(self):
self.click_element(By.ID, self.Constants.MY_BUTTON_ID)
# Inherited constants also available:
        self.open_website(self.Constants.LOGIN_URL)

Not required. Scrapers work without constants.py. Recommended for any selector or string that appears more than once, or that would be hard to identify at a glance if something breaks.
from consultores_automation import Automation, register
from automations.my_scraper.config import CODE, NAME, MyScraperSettings
@register(code=CODE, name=NAME)
class MyAutomation(Automation):
Settings = MyScraperSettings # connects the config to the framework
# ── Helpers available in run() ───────────────────────────────────────
# self.settings → instance of MyScraperSettings (+ framework vars)
# self.logger → logger with format [timestamp | level | message]
# self.s3 → S3Utils (lazy — only fails if actually used)
# self.set_output(x) → store partial output before finishing
# self.upload_output(path) → upload file to S3, returns URL
# self.make_zip_output(folder) → zip folder, upload to S3, returns URL
# ── Lifecycle hooks (all optional) ──────────────────────────────────
def on_init(self):
"""Called before run(). Validates required vars and logs start by default."""
self.logger.info("Starting...")
def on_success(self):
"""Called after run() completes without errors."""
pass
def on_error(self, exc: Exception):
"""
Called when run() raises an unhandled exception.
Re-raises by default. Override to send alerts.
"""
self.logger.error(f"Failed: {exc}")
raise exc
# ── Extra CLI arguments (optional) ──────────────────────────────────
@classmethod
def add_arguments(cls, parser):
"""Add args to the CLI. Accessible in run() via **kwargs."""
parser.add_argument("--mode", default="normal", choices=["normal", "debug"])
# ── Factory with DI (optional) ───────────────────────────────────────
@classmethod
def create(cls, headless=True, **kwargs):
"""
Override if you need to build complex dependencies on instantiation.
If not defined, the class is instantiated without arguments.
"""
return cls(service=MyService(), headless=headless)
# ── Main logic ───────────────────────────────────────────────────────
def run(self, data: dict, mode="normal", **kwargs):
cuit = data["cuit"]
self.logger.info(f"Processing CUIT {cuit} in mode {mode}")
# ... logic
        return {"result": "ok"}

uv run execute [automation] [options]
automation Code of the automation to run.
Required if there is more than one. Auto-detected if only one.
In production it comes from the command that launches the container.
--json PATH Load input from a local JSON file (development).
In production not used — input comes from the S3_KEY env var.
--fake-core-system Do not call the core system (uses DummyClient).
Useful locally when CORE_SYSTEM_* is not configured.
--no-context Run without AutomationRunContext.
Output is logged but not sent anywhere.
--head Open the browser with UI instead of headless.
Only works locally with a display.
[scraper args] Any args defined by add_arguments() in the class.
Every scraper that inherits from BaseDriver has self.logger available automatically,
just like Automation. No imports needed:
class MyScraper(BaseDriver):
def login(self, username, password):
self.logger.info(f"Starting login for {username}")
        self.logger.error("Login failed")

DriverBuilder encapsulates all WebDriver construction logic. Its methods are
independently overridable. To change a behaviour, create a subclass and assign it
to driver_builder_class in the scraper:
from consultores_automation import DriverBuilder, BaseDriver
class MyDriverBuilder(DriverBuilder):
def _find_chromium_binary(self):
return "/opt/chrome/chrome"
def _apply_stealth_options(self, options):
super()._apply_stealth_options(options)
options.add_argument("--disable-extensions")
class MyScraper(BaseDriver):
    driver_builder_class = MyDriverBuilder

Available methods to override in DriverBuilder:
| Method | What it does |
|---|---|
| _is_display_accessible(display) | Checks if the X display is active via xdpyinfo |
| _ensure_xvfb_running(display) | Starts Xvfb if not running (needed on Fargate) |
| _configure_display(options, skip_headless, display) | Chooses between Xvfb and --headless=new |
| _apply_stealth_options(options) | Adds anti-WAF flags (user-agent, excludeSwitches, etc.) |
| _find_chromium_binary() | Resolves path to the Chrome/Chromium binary |
| _find_chromedriver_binary() | Resolves path to chromedriver |
| build_driver(options, ...) | Entry point — orchestrates all the steps above |
Mixins are combined with multiple inheritance before BaseDriver. Order matters
so that Python's MRO correctly stacks the get_chrome_options() calls.
# Correct pattern: mixins first, BaseDriver last
class MyScraper(MixinA, MixinB, BaseDriver):
    pass

| Mixin | What it adds |
|---|---|
| TempProfileDriverMixin | Temporary Chrome profile (--user-data-dir fresh per run). Avoids cache between executions |
| AutoDownloadDriverMixin | Automatic file downloads. Configures downloads_dir and exposes wait_for_file_download() |
| BotDetectionMixin | Enables stealth options and STEALTH_SCRIPT via CDP. For sites with Incapsula/Imperva |
| VirtualDisplayMixin | Uses Xvfb (display=":99") instead of --headless=new. Chrome behaves like on a desktop |
| CompleteBotDetectionEvasionMixin | BotDetectionMixin + VirtualDisplayMixin combined. Maximum evasion level |
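A custom mixin can hook into the same chain. The sketch below assumes get_chrome_options() takes no extra arguments and returns the Options object, which is the cooperative pattern the MRO note above refers to; the exact signature may differ, and SpanishLocaleMixin is purely hypothetical:

```python
class SpanishLocaleMixin:
    """Hypothetical mixin: adds one flag on top of whatever the rest of the MRO builds."""

    def get_chrome_options(self, *args, **kwargs):
        # Let BaseDriver and any later mixins build the options first, then extend them
        options = super().get_chrome_options(*args, **kwargs)
        options.add_argument("--lang=es-AR")
        return options

# Same rule as always: mixins first, BaseDriver last
# class MyScraper(SpanishLocaleMixin, AutoDownloadDriverMixin, BaseDriver): ...
```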
Common combination examples:
# Automatic downloads only
class MyScraper(AutoDownloadDriverMixin, BaseDriver):
pass
# Full evasion + automatic downloads
class MyScraper(CompleteBotDetectionEvasionMixin, AutoDownloadDriverMixin, BaseDriver):
pass
# Clean profile + basic evasion
class MyScraper(TempProfileDriverMixin, BotDetectionMixin, BaseDriver):
    pass

These can be declared directly on the class without a mixin:
class MyScraper(BaseDriver):
bot_detection_is_a_concern = True # enables stealth options
skip_headless = True # uses Xvfb instead of --headless=new
display = ":1" # X display to use (default: ":99")
    driver_builder_class = MyDriverBuilder  # custom builder

DriverBuilder
    ← override individual methods to customise the driver
BaseDriver Constants = BaseDriverConstants
├── self.logger ← ready-to-use logger, no imports
├── self.driver ← WebDriver built by driver_builder_class
├── self.wait ← WebDriverWait with configured timeout
├── self.Constants ← constants class (selectors, URLs)
└── driver_builder_class ← extension point for the driver
AfipScraper(BaseDriver) Constants = AfipScraperConstants
├── login(username, password)
├── search_for_service(service_name)
└── fill_autocomplete_field(...)
SimplificacionRegistralScraper(AfipScraper) Constants = SimplificacionRegistralConstants
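Putting the hierarchy together, a scraper built on AfipScraper might look like this. It is a sketch: the import paths follow the tree above, the constructor and process() are illustrative, and only login(), search_for_service(), click_element() and the Constants mechanism are taken from what this README documents:

```python
from selenium.webdriver.common.by import By

from consultores_automation.afip.scraper import AfipScraper                 # path per the afip/ folder above
from consultores_automation.driver.mixins import AutoDownloadDriverMixin    # path per the driver/ folder above
from automations.my_scraper.constants import MyScraperConstants


class MyScraper(AutoDownloadDriverMixin, AfipScraper):
    Constants = MyScraperConstants

    def __init__(self, username: str, password: str, **kwargs):
        super().__init__(**kwargs)  # assumes BaseDriver accepts keyword args; adjust to the real signature
        self.username = username
        self.password = password

    def process(self, data: dict) -> dict:
        self.login(self.username, self.password)         # provided by AfipScraper
        self.search_for_service("Mis Comprobantes")      # service name is only an example
        self.click_element(By.ID, self.Constants.MY_BUTTON_ID)
        self.logger.info(f"Processed CUIT {data['cuit']}")
        return {"cuit": data["cuit"]}
```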
In production each scraper runs as an ECS Fargate container triggered by an S3 event:
Core system
uploads JSON to S3
│
▼
s3://bucket/certs_download/input/run_123.json
│
▼
S3 Event → Lambda dispatch
│
├── "certs_download/input/" → ECS task scraper-certs
├── "suss/input/" → ECS task scraper-suss
└── "altas/input/" → ECS task scraper-altas
│
▼
Container starts with:
S3_KEY = "certs_download/input/run_123.json"
CORE_SYSTEM_CERTS_DOWNLOAD_SECRET_KEY = "..."
CMD = python -m consultores_automation.runner
│
▼
runner discovers automations in automations/
downloads JSON from S3
runs the automation
returns output to core system
container exits (ephemeral)
For repos with multiple automations, each automation has its own infrastructure rule mapping its S3 path to the container command with its name:
S3 path /suss/input/ → CMD ["python", "-m", "consultores_automation.runner", "suss"]
S3 path /altas/input/ → CMD ["python", "-m", "consultores_automation.runner", "altas"]
Both share the same Docker image. The automation name comes from the command, not the input JSON. No always-on server — the container only exists while processing.
GitHub/
├── consultores-automation-framework/ ← this library
│
├── scraper-altas/ ← one repo per scraper (standard case)
├── scraper-bajas/
├── scraper-suss/
├── scraper-iva/
├── scraper-sworn/
└── scraper-certs/
Each scraper repo:
my-scraper/
├── automations/
│ └── my_scraper/ ← one automation per repo in the standard case
│ ├── __init__.py
│ ├── automation.py ← logic: only implement run()
│ ├── config.py ← CODE, NAME, Settings subclass with env vars
│ ├── constants.py ← selectors and strings (optional)
│ └── tests/
│ └── input.json ← test input
├── scripts/
│ └── build_push.sh
├── Dockerfile
├── docker-compose.yml
├── .env.example
└── pyproject.toml
| Aspect | Before | Now |
|---|---|---|
| Register a scraper | @AutomationRegistry.register() in main.py + import in scrapers/__init__.py | @register() directly on the class, one file |
| Configuration | settings.add(X=EnvVar("X")) in config.py | class MySettings(Settings): X = os.getenv("X") |
| Credentials in run | os.getenv() in the factory in main.py | self.settings.X in run() |
| Logger | from scrapers.common.logger import logger | self.logger.info(), no imports |
| S3 | from scrapers.common.utils import s3_utils | self.s3.upload_to_s3(), no imports |
| Upload output | Manual code in each scraper | self.upload_output("file.pdf") |
| Core system secret key | CORE_SYSTEM_SECRET_KEY, one for the whole repo | CORE_SYSTEM_{CODE}_SECRET_KEY, one per automation |
| File structure | scrapers/automation.py inside a package | automations/<name>/automation.py |
| Imports | from scrapers.common.X import Y | from consultores_automation.X import Y |
| Selenium module | selenium/ | driver/ |
| Scaffolding | python -m consultores_automation init --name X | uv run setup + uv run automation |
| Run locally | python -m consultores_automation.runner ... | uv run execute ... |