
feat: add destination readback introspection for smoke tests#1000

Merged
Aaron ("AJ") Steers (aaronsteers) merged 49 commits into main from devin/1774225239-destination-readback-testing
Mar 23, 2026

Conversation


@aaronsteers Aaron ("AJ") Steers (aaronsteers) commented Mar 23, 2026

Summary

Adds automatic readback introspection to destination smoke tests. After writing data, the system reads back stats from the destination using existing PyAirbyte cache implementations as query adapters. This covers 5 destinations (~95% of user traffic): BigQuery, DuckDB, MotherDuck, Postgres, Snowflake.

Core changes

Destination.get_sql_cache(schema_name=...) and Destination.is_cache_supported (destinations/base.py):

  • New method following the same pattern as SyncResult.get_sql_cache() in cloud sync results
  • Builds a cache from the destination's hydrated config via destination_to_cache()
  • is_cache_supported property for early guard before attempting cache construction

Table statistics on SqlProcessorBase (shared/sql_processor.py):

  • fetch_row_count(table_name) — row count via SELECT COUNT(*)
  • fetch_column_info(table_name) — column names/types via SQLAlchemy Inspector (shared across calls)
  • _get_column_stats(table_name, columns) — per-column null/non-null counts using positional aliases (nn_0, nn_1, ...) to avoid identifier truncation on PostgreSQL
  • fetch_table_statistics(stream_names) — full orchestration: resolves table names via the processor's normalizer, queries all streams, returns dict[str, TableStatistics]
  • New Pydantic models: ColumnStatistics, TableStatistics
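
The positional-alias aggregate can be sketched as below. The helper name is illustrative; the alias scheme (`nn_0`, `nn_1`, ...) is the one described above, and `COUNT("col")` counts only non-null values, so null counts follow as row count minus each `nn_i`.

```python
def build_column_stats_sql(table_name: str, columns: list[str]) -> str:
    """Build the per-column non-null-count query using positional aliases.

    Aliasing each expression as nn_0, nn_1, ... keeps identifiers short, so
    PostgreSQL's 63-character identifier truncation can never collide two
    aliases derived from long column names.
    """
    count_exprs: list[str] = [
        f'COUNT("{col}") AS nn_{i}' for i, col in enumerate(columns)
    ]
    return f'SELECT {", ".join(count_exprs)} FROM "{table_name}"'
```

Because the aliases are positional, the caller maps results back to columns by index rather than by alias name, which also sidesteps Snowflake's uppercasing of unquoted aliases.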

Smoke test integration (destination_smoke_tests.py):

  • Readback runs automatically when a compatible cache exists — no new CLI args needed
  • Runs even on write failure to support partial-success inspection
  • DestinationSmokeTestResult gains table_statistics: dict[str, TableStatistics] | None and tables_not_found: dict[str, str] | None

Translation files moved (caches/_utils/):

  • destinations/_translate_dest_to_cache.py → caches/_utils/_dest_to_cache.py
  • destinations/_translate_cache_to_dest.py → caches/_utils/_cache_to_dest.py
  • Resolves the circular import cycle (caches.bigquery → destinations.__init__ → destinations.base → caches.bigquery)
  • Inline imports in destinations/base.py promoted to top-level now that the cycle is broken

Other:

  • Fixed paired_destination_name copy-paste bugs in SnowflakeCache, PostgresCache, and MotherDuckCache
  • Added dict-config handling to all *_destination_to_cache() functions (strip destinationType dispatch key before model construction)
  • Docstring style standardized to Markdown; CONTRIBUTING.md updated with style guide

Namespace resolution approach

Investigation confirmed that the catalog namespace (set on each stream by the source) is the primary and sufficient mechanism for directing destinations to write into the test schema. Both Postgres and Snowflake correctly create final tables in the catalog-specified namespace without needing a config-level schema key override.

The config schema key override was removed because:

  • It is redundant when the catalog namespace is set correctly
  • It caused Postgres failures due to 63-char identifier truncation on raw table names (long namespace prefix + raw table name exceeded the limit)

_prepare_destination_config() now only forces disable_type_dedupe=False (required because some GSM configs have it set to True, which causes destinations to skip final typed tables that readback depends on). Uses validate=False to skip redundant spec fetch.

The schema override is instead applied at the cache level via destination_to_cache(schema_name=...), which sets cache.schema_name and calls dispose_engine() to refresh the schema_translate_map.
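
The dispose-then-lazily-recreate pattern can be sketched as follows. This is a stand-in, not PyAirbyte's implementation: a dict substitutes for the SQLAlchemy engine, and a real implementation would pass `execution_options={"schema_translate_map": {None: self.schema_name}}` to `create_engine()`.

```python
class SqlConfigSketch:
    """Stand-in for the SQL config object: the engine is cached, so it must be
    disposed after schema_name changes for schema_translate_map to be rebuilt."""

    def __init__(self, schema_name: str) -> None:
        self.schema_name = schema_name
        self._engine: dict | None = None

    def get_engine(self) -> dict:
        # Lazily (re)create the "engine" so it reflects the current schema_name.
        if self._engine is None:
            self._engine = {"schema_translate_map": {None: self.schema_name}}
        return self._engine

    def dispose_engine(self) -> None:
        self._engine = None
```

Without the `dispose_engine()` call, a cache built before the override would keep routing queries to the original schema.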

Updates since last revision

Credential handling fixes for the Destination.get_sql_cache() path:

The *_destination_to_cache() functions were originally designed for the cloud API path, which returns obfuscated credentials. The new local Destination.get_sql_cache() path passes plaintext credentials from the destination's hydrated config, which required two fixes:

  • Snowflake (snowflake_destination_to_cache): Plaintext passwords were being passed to get_secret() as if they were secret names. Now detects plaintext passwords (no ****) and uses them directly.
  • BigQuery (bigquery_destination_to_cache): Always fell back to BIGQUERY_CREDENTIALS_PATH env var. Now extracts plaintext credentials_json from the destination config, writes it to a temp file via tempfile.mkstemp(), and passes the path to BigQueryCache. Falls back to the env var only when credentials are obfuscated (cloud API path).

validate=False on set_config: _prepare_destination_config() now skips re-validation when toggling disable_type_dedupe, avoiding an unnecessary spec fetch + JSON schema check.

E2E test results

| Destination | Status | Notes |
| --- | --- | --- |
| Postgres | ✅ 14/14 tables | Docker-based, catalog namespace only |
| Snowflake | ✅ 14/14 tables | GSM creds, catalog namespace only |
| BigQuery | ✅ 14/14 tables | GSM creds; fixed backtick quoting + credentials_json handling |
| MotherDuck | ⏳ Blocked | Expired JWT in GSM |
| DuckDB | Skipped | Stale/non-compliant destination |

Review & Testing Checklist for Human

  • BigQuery temp file for credentials: bigquery_destination_to_cache() writes credentials_json to a temp file via mkstemp() that is intentionally never deleted (the cache needs the path to remain valid). Verify this doesn't cause issues in long-running or repeated invocations (temp file accumulation). Consider whether a cleanup hook or atexit handler is warranted.
  • Obfuscation detection via "****": Both snowflake_destination_to_cache() and bigquery_destination_to_cache() detect obfuscated credentials by checking for "****" in the string. Verify this heuristic is reliable — could a legitimate password or credentials JSON contain that substring?
  • dispose_engine() after schema override: In destination_to_cache(), cache.processor.sql_config.dispose_engine() is called after setting schema_name so the engine's schema_translate_map is refreshed. Verify this doesn't cause issues if the cache is used immediately after (engine should lazily recreate). Also confirm no other code paths set schema_name post-construction without disposing the engine.
  • _get_column_stats positional aliases: The nn_0, nn_1 pattern avoids PostgreSQL's 63-char truncation but assumes result columns are returned in declaration order. Verify this holds across BigQuery (backtick quoting), Snowflake (uppercases aliases to NN_0), and Postgres.
  • destinationType key stripping in *_destination_to_cache() functions: All conversion functions now filter out destinationType/DESTINATION_TYPE before Pydantic model construction. Confirm no destination legitimately uses a config field with these exact names.
  • Recommended test plan: Run pyab destination-smoke-test --destination=destination-postgres --config=<config> and verify JSON output includes populated table_statistics with row counts, column types, and null stats for all 14 streams. Also test against an unsupported destination to confirm graceful table_statistics: null.

Notes

  • No unit tests are included for the new readback methods. Adding tests against DuckDBCache with edge-case column names would be valuable as a follow-up.
  • Table name resolution delegates to cache.processor.get_sql_table_name() with fallback to the original stream name, so naming conventions are driven by each cache's existing normalizer.
  • MotherDuck e2e test is blocked on a refreshed JWT token in GSM (SECRET_DESTINATION_DUCKDB__MOTHERDUCK__CREDS).
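
The name-resolution fallback in the second note can be sketched as follows. `get_sql_table_name` is the method named above; the try/except shape and the bare-`Exception` catch are illustrative.

```python
def resolve_table_name(processor, stream_name: str) -> str:
    """Resolve a stream's final table name via the cache processor's
    normalizer, falling back to the raw stream name on any failure."""
    try:
        return processor.get_sql_table_name(stream_name)
    except Exception:
        return stream_name
```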

Summary by CodeRabbit

  • New Features

    • Destination smoke tests now perform automatic readback introspection after successful writes for DuckDB, Postgres, Snowflake, BigQuery, and MotherDuck when supported.
    • Readback returns detailed statistics: table row counts, column names/types, and per-column null/non-null counts. If readback isn't supported or fails, it's skipped and reported without failing the smoke test.
  • Documentation

    • CLI and tool docs updated to describe the new readback behavior and reported statistics.

Link to Devin session: https://app.devin.ai/sessions/f285ca4ab752412581eb228490d47e03
Requested by: Aaron ("AJ") Steers (@aaronsteers)



- Add destination_readback.py module with stats-level readback:
  table row counts, column names/types, per-column null/non-null stats
- Map 5 destinations to cache implementations (BigQuery, DuckDB,
  MotherDuck, Postgres, Snowflake) for ~95% user coverage
- Deterministic table name resolution per destination
- Integrate readback into run_destination_smoke_test() (on by default)
- Fix paired_destination_name bugs in Snowflake, Postgres, MotherDuck
- Update CLI and MCP tool docstrings to reflect readback capability

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1774225239-destination-readback-testing' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1774225239-destination-readback-testing'

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /fix-pr - Fixes most formatting and linting issues
  • /uv-lock - Updates uv.lock file
  • /test-pr - Runs tests with the updated PyAirbyte
  • /prerelease - Builds and publishes a prerelease version to PyPI
📚 Show Repo Guidance

Helpful Resources

Community Support

Questions? Join the #pyairbyte channel in our Slack workspace.


- Fix pyrefly type error: explicitly type count_exprs as list[str]
- Restore ruff-compliant formatting on mcp/local.py line 546

Co-Authored-By: AJ Steers <aj@airbyte.io>
@coderabbitai
Contributor

coderabbitai bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Introduces destination readback introspection for smoke tests: new Pydantic readback models, smoke-test control flow to capture stream names and invoke cache readback after successful writes, cache-level SQL readback helpers, paired-destination name fixes, and a Destination API to obtain normalized SQL caches.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Readback Models & Smoke Test Integration (airbyte/_util/destination_smoke_tests.py) | Adds Pydantic models (ColumnStats, ColumnInfo, TableInfo, TableReadbackReport, DestinationReadbackResult), extends DestinationSmokeTestResult with readback_result, captures stream names pre-write, and calls destination.get_sql_cache(...)._readback_get_results(stream_names) after successful writes; handles unsupported or failed readback gracefully. |
| Cache Readback Query Helpers (airbyte/caches/base.py) | Adds private CacheBase helpers for readback introspection: _readback_query_row_count(), _readback_query_column_info(), _readback_query_column_stats(), _readback_get_table_report(), and _readback_get_results() to assemble table and column statistics from the SQL backend, with robust missing-table and error handling. |
| Cache Paired Destination Names (airbyte/caches/motherduck.py, airbyte/caches/postgres.py, airbyte/caches/snowflake.py) | Updates paired_destination_name constants to their correct canonical identifiers ("destination-motherduck", "destination-postgres", "destination-snowflake"), replacing the prior "destination-bigquery". |
| Destination Cache Access & Normalization (airbyte/destinations/base.py) | Adds _CANONICAL_PREFIX and _normalize_destination_name() to canonicalize destination names, and Destination.get_sql_cache(...) to construct and configure a CacheBase instance (accepts optional schema, destination_name, and destination_config, and enforces a version gate). |
| CLI / MCP Documentation (airbyte/cli/pyab.py, airbyte/mcp/local.py) | Updates user-facing descriptions for the destination_smoke_test command/tool to document automatic post-write readback introspection and the statistics produced (row counts, column names/types, per-column null/non-null counts). |

Sequence Diagram

sequenceDiagram
    participant Test as Smoke Test Runner
    participant Dest as Destination
    participant Cache as SQL Cache
    participant DB as SQL Backend

    Test->>Test: capture configured stream_names
    Test->>Dest: destination.write(source)
    Dest->>DB: write records
    DB-->>Dest: write success
    Test->>Dest: get_sql_cache(...)
    Dest-->>Cache: returns CacheBase
    Test->>Cache: _readback_get_results(stream_names)
    Cache->>DB: SELECT COUNT(*) from table
    DB-->>Cache: row_count
    Cache->>DB: inspect table columns (SQLAlchemy)
    DB-->>Cache: column names/types
    Cache->>DB: SELECT COUNT(col), ... aggregate query for column stats
    DB-->>Cache: column statistics
    Cache-->>Test: DestinationReadbackResult
    Test-->>Test: populate DestinationSmokeTestResult.readback_result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Would you like me to call out specific lines or functions to review first (e.g., cache SQL aggregation query, destination.get_sql_cache validation), wdyt?

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main objective of the changeset: adding destination readback introspection to smoke tests. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required 80.00% threshold. |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Co-Authored-By: AJ Steers <aj@airbyte.io>
@github-actions

github-actions bot commented Mar 23, 2026

PyTest Results (Fast Tests Only, No Creds)

343 tests  ±0   343 ✅ ±0   5m 42s ⏱️ -5s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit e9adb75. ± Comparison against base commit a8666b2.

♻️ This comment has been updated with latest results.


…CacheBase

Co-Authored-By: AJ Steers <aj@airbyte.io>
@github-actions

github-actions bot commented Mar 23, 2026

PyTest Results (Full)

413 tests  ±0   395 ✅ ±0   25m 43s ⏱️ -4s
  1 suites ±0    18 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit e9adb75. ± Comparison against base commit a8666b2.

♻️ This comment has been updated with latest results.

…e it

Co-Authored-By: AJ Steers <aj@airbyte.io>

Co-Authored-By: AJ Steers <aj@airbyte.io>

@aaronsteers Aaron ("AJ") Steers (aaronsteers) marked this pull request as ready for review March 23, 2026 06:29
Copilot AI review requested due to automatic review settings March 23, 2026 06:29
Member Author


Devin, respond inline to all bot feedback.



…t hydration bugs

- Remove destination_readback.py entirely per AJ's review
- Move readback models into destination_smoke_tests.py (only consumer)
- Add _readback_get_table_report() and _readback_get_results() to CacheBase
  to own the full orchestration loop (eliminates SLF001 suppressions)
- Inline readback orchestration in run_destination_smoke_test()
- Fix: case-insensitive row key lookup via casefold map (Snowflake/PG)
- Fix: use 'is not None' instead of truthy check for 0-valued counts
- Fix: use self._hydrated_config instead of self.get_config() to resolve
  secret references in destination config

Co-Authored-By: AJ Steers <aj@airbyte.io>
Co-Authored-By: AJ Steers <aj@airbyte.io>

…upport)

Co-Authored-By: AJ Steers <aj@airbyte.io>
…onfig/version params

Co-Authored-By: AJ Steers <aj@airbyte.io>

devin-ai-integration bot and others added 2 commits March 23, 2026 17:53
… to resolve circular import

Co-Authored-By: AJ Steers <aj@airbyte.io>
Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Contributor

Done — moved both translation files to airbyte/caches/_utils/:

  • destinations/_translate_dest_to_cache.py → caches/_utils/_dest_to_cache.py
  • destinations/_translate_cache_to_dest.py → caches/_utils/_cache_to_dest.py

This resolves the circular import. The dependency is now one-way: destinations → caches. Cache subclasses import _cache_to_dest from within their own package (no more destinations.__init__ trigger). Confirmed locally that from airbyte.destinations.base import Destination succeeds without inline imports.

The inline imports in base.py (is_cache_supported and get_sql_cache) could now be promoted to top-level, but I left them as-is since they're still in a class that doesn't always need them. Happy to change either way.

…s resolved

Co-Authored-By: AJ Steers <aj@airbyte.io>

devin-ai-integration bot and others added 2 commits March 23, 2026 18:06

devin-ai-integration bot and others added 7 commits March 23, 2026 18:26
… result

Co-Authored-By: AJ Steers <aj@airbyte.io>
…anslate_map

Co-Authored-By: AJ Steers <aj@airbyte.io>
Some destinations (e.g. Snowflake) ignore the source namespace and write
to their configured default schema. The readback now tries the namespace
first, then falls back to the destination's default schema if no tables
are found.

Co-Authored-By: AJ Steers <aj@airbyte.io>
The smoke test now overrides the destination's schema/dataset config key
(e.g. 'schema' for Snowflake/Postgres, 'dataset_id' for BigQuery) to
the test namespace before writing. This is the 'belt and suspenders'
approach: the catalog namespace is already set on each stream by the
source, but some destinations (e.g. Snowflake) prioritize their config
schema over the catalog namespace.

Also removes the readback fallback to default schema since the config
override ensures the destination writes to the correct namespace.

Co-Authored-By: AJ Steers <aj@airbyte.io>
Co-Authored-By: AJ Steers <aj@airbyte.io>
…override

The smoke test now also forces disable_type_dedupe=False in the
destination config so that final typed tables are created (not just raw
staging). Readback introspection requires final tables to query.

Co-Authored-By: AJ Steers <aj@airbyte.io>
…ient

Investigation confirmed that the catalog namespace (set on each stream
by the source) is the primary and sufficient mechanism to direct
destinations to write into the correct schema. The config-level schema
key override was unnecessary and counterproductive for Postgres, where
long namespace prefixes on raw table names in airbyte_internal exceed
the 63-char identifier limit, causing index name collisions.

Changes:
- Remove _DESTINATION_SCHEMA_CONFIG_KEYS dict
- Remove schema override logic from _apply_namespace_to_destination_config
- Rename to _prepare_destination_config (only handles disable_type_dedupe)
- Update docstrings and comments to reflect investigation findings

Tested: Postgres 14/14, Snowflake 14/14, BigQuery 14/14 all pass
with catalog namespace alone directing writes to the test schema.

Co-Authored-By: AJ Steers <aj@airbyte.io>

The config was already validated when first set on the destination.
Re-validating just to toggle disable_type_dedupe adds unnecessary
latency (spec fetch + JSON schema check). Pass validate=False.

Co-Authored-By: AJ Steers <aj@airbyte.io>

devin-ai-integration bot and others added 2 commits March 23, 2026 19:46
When called from the local Destination.get_sql_cache() path, the
password in the hydrated config is already a plaintext value — not a
secret name to look up.  The previous code passed it to get_secret(),
which would fail because no env var matches the literal password string.

Now the else branch assigns the password directly instead of routing
through get_secret().

Co-Authored-By: AJ Steers <aj@airbyte.io>
When called from the local Destination.get_sql_cache() path, the
destination config contains a plaintext credentials_json field.  The
previous code always looked up BIGQUERY_CREDENTIALS_PATH, which may
point to GSM credentials that lack bigquery.jobs.create permission.

Now bigquery_destination_to_cache() extracts credentials_json from the
config when available and writes it to a temp file for the cache.  Falls
back to BIGQUERY_CREDENTIALS_PATH for the cloud API path (obfuscated
credentials).

Co-Authored-By: AJ Steers <aj@airbyte.io>
@aaronsteers Aaron ("AJ") Steers (aaronsteers) merged commit 3ab17f9 into main Mar 23, 2026
22 checks passed
@aaronsteers Aaron ("AJ") Steers (aaronsteers) deleted the devin/1774225239-destination-readback-testing branch March 23, 2026 20:31
