feat: add destination readback introspection for smoke tests #1000
Conversation
- Add destination_readback.py module with stats-level readback: table row counts, column names/types, per-column null/non-null stats
- Map 5 destinations to cache implementations (BigQuery, DuckDB, MotherDuck, Postgres, Snowflake) for ~95% user coverage
- Deterministic table name resolution per destination
- Integrate readback into run_destination_smoke_test() (on by default)
- Fix paired_destination_name bugs in Snowflake, Postgres, MotherDuck
- Update CLI and MCP tool docstrings to reflect readback capability

Co-Authored-By: AJ Steers <aj@airbyte.io>
👋 Greetings, Airbyte Team Member! Here are some helpful tips and reminders for your convenience.

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

```bash
# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1774225239-destination-readback-testing' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1774225239-destination-readback-testing'
```
- Fix pyrefly type error: explicitly type count_exprs as list[str]
- Restore ruff-compliant formatting on mcp/local.py line 546

Co-Authored-By: AJ Steers <aj@airbyte.io>
📝 Walkthrough

Introduces destination readback introspection for smoke tests: new Pydantic readback models, smoke-test control flow to capture stream names and invoke cache readback after successful writes, cache-level SQL readback helpers, paired-destination name fixes, and a Destination API to obtain normalized SQL caches.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Smoke Test Runner
    participant Dest as Destination
    participant Cache as SQL Cache
    participant DB as SQL Backend
    Test->>Test: capture configured stream_names
    Test->>Dest: destination.write(source)
    Dest->>DB: write records
    DB-->>Dest: write success
    Test->>Dest: get_sql_cache(...)
    Dest-->>Cache: returns CacheBase
    Test->>Cache: _readback_get_results(stream_names)
    Cache->>DB: SELECT COUNT(*) from table
    DB-->>Cache: row_count
    Cache->>DB: inspect table columns (SQLAlchemy)
    DB-->>Cache: column names/types
    Cache->>DB: SELECT COUNT(col), ... aggregate query for column stats
    DB-->>Cache: column statistics
    Cache-->>Test: DestinationReadbackResult
    Test-->>Test: populate DestinationSmokeTestResult.readback_result
```
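The aggregate column-stats query from the diagram can be sketched as follows. This is a minimal illustration, not the PR's actual `_get_column_stats` on `SqlProcessorBase`: `build_column_stats_sql` is a hypothetical helper, demonstrated against an in-memory SQLite table as a stand-in for the real backends.

```python
import sqlite3

def build_column_stats_sql(table_name: str, columns: list[str]) -> str:
    """Build one aggregate query returning total rows plus per-column non-null counts."""
    # Positional aliases (nn_0, nn_1, ...) keep alias names short, sidestepping
    # PostgreSQL's 63-character identifier truncation on long column names.
    count_exprs: list[str] = [  # explicitly typed, per the pyrefly fix above
        f'COUNT("{col}") AS nn_{i}' for i, col in enumerate(columns)
    ]
    return f'SELECT COUNT(*) AS total, {", ".join(count_exprs)} FROM "{table_name}"'

# Demo: COUNT(col) skips NULLs, so null count per column = total - non_null.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "users" (id INTEGER, email TEXT)')
conn.executemany('INSERT INTO "users" VALUES (?, ?)', [(1, "a@x"), (2, None), (3, "c@x")])
total, nn_id, nn_email = conn.execute(
    build_column_stats_sql("users", ["id", "email"])
).fetchone()
print(total, nn_id, nn_email)  # 3 rows; id non-null 3; email non-null 2
```

Note that this relies on result columns coming back in declaration order, which is one of the review checklist items below.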
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Would you like me to call out specific lines or functions to review first (e.g., cache SQL aggregation query, destination.get_sql_cache validation), wdyt?

🚥 Pre-merge checks: ✅ 3 passed
…CacheBase Co-Authored-By: AJ Steers <aj@airbyte.io>
…e it Co-Authored-By: AJ Steers <aj@airbyte.io>
Aaron ("AJ") Steers (aaronsteers)
left a comment
Devin, respond inline to all bot feedback.
…t hydration bugs

- Remove destination_readback.py entirely per AJ's review
- Move readback models into destination_smoke_tests.py (only consumer)
- Add _readback_get_table_report() and _readback_get_results() to CacheBase to own the full orchestration loop (eliminates SLF001 suppressions)
- Inline readback orchestration in run_destination_smoke_test()
- Fix: case-insensitive row key lookup via casefold map (Snowflake/PG)
- Fix: use 'is not None' instead of truthy check for 0-valued counts
- Fix: use self._hydrated_config instead of self.get_config() to resolve secret references in destination config

Co-Authored-By: AJ Steers <aj@airbyte.io>
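The two row-handling fixes in this commit can be illustrated together. This is a hedged sketch with a hypothetical helper name (`row_value`), not the actual CacheBase code:

```python
def row_value(row: dict, key: str):
    """Case-insensitive column lookup via a casefold map.

    Snowflake returns unquoted identifiers uppercased; Postgres lowercases
    them. Folding both sides lets one lookup serve either backend.
    """
    folded = {k.casefold(): v for k, v in row.items()}
    return folded.get(key.casefold())

snowflake_row = {"NN_0": 0, "TOTAL": 14}
count = row_value(snowflake_row, "nn_0")

# Guard with 'is not None', not truthiness: a legitimate count of 0 is falsy
# and would be silently dropped by `if count:`.
if count is not None:
    print("non-null count:", count)
```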
…upport) Co-Authored-By: AJ Steers <aj@airbyte.io>
…onfig/version params Co-Authored-By: AJ Steers <aj@airbyte.io>
… to resolve circular import Co-Authored-By: AJ Steers <aj@airbyte.io>
Done — moved both translation files to

This resolves the circular import. The dependency is now one-way. The inline imports in
…s resolved Co-Authored-By: AJ Steers <aj@airbyte.io>
… result Co-Authored-By: AJ Steers <aj@airbyte.io>
…anslate_map Co-Authored-By: AJ Steers <aj@airbyte.io>
Some destinations (e.g. Snowflake) ignore the source namespace and write to their configured default schema. The readback now tries the namespace first, then falls back to the destination's default schema if no tables are found. Co-Authored-By: AJ Steers <aj@airbyte.io>
The smoke test now overrides the destination's schema/dataset config key (e.g. 'schema' for Snowflake/Postgres, 'dataset_id' for BigQuery) to the test namespace before writing. This is the 'belt and suspenders' approach: the catalog namespace is already set on each stream by the source, but some destinations (e.g. Snowflake) prioritize their config schema over the catalog namespace. Also removes the readback fallback to default schema since the config override ensures the destination writes to the correct namespace. Co-Authored-By: AJ Steers <aj@airbyte.io>
…override

The smoke test now also forces disable_type_dedupe=False in the destination config so that final typed tables are created (not just raw staging). Readback introspection requires final tables to query.

Co-Authored-By: AJ Steers <aj@airbyte.io>
…ient

Investigation confirmed that the catalog namespace (set on each stream by the source) is the primary and sufficient mechanism to direct destinations to write into the correct schema. The config-level schema key override was unnecessary and counterproductive for Postgres, where long namespace prefixes on raw table names in airbyte_internal exceed the 63-char identifier limit, causing index name collisions.

Changes:
- Remove _DESTINATION_SCHEMA_CONFIG_KEYS dict
- Remove schema override logic from _apply_namespace_to_destination_config
- Rename to _prepare_destination_config (only handles disable_type_dedupe)
- Update docstrings and comments to reflect investigation findings

Tested: Postgres 14/14, Snowflake 14/14, BigQuery 14/14 all pass with catalog namespace alone directing writes to the test schema.

Co-Authored-By: AJ Steers <aj@airbyte.io>
The config was already validated when first set on the destination. Re-validating just to toggle disable_type_dedupe adds unnecessary latency (spec fetch + JSON schema check). Pass validate=False. Co-Authored-By: AJ Steers <aj@airbyte.io>
When called from the local Destination.get_sql_cache() path, the password in the hydrated config is already a plaintext value — not a secret name to look up. The previous code passed it to get_secret(), which would fail because no env var matches the literal password string. Now the else branch assigns the password directly instead of routing through get_secret(). Co-Authored-By: AJ Steers <aj@airbyte.io>
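The two-path behavior described above can be sketched as a small decision helper. `resolve_password` and its `lookup` parameter are illustrative names (a stand-in for `get_secret()`), not the PR's actual signature:

```python
def resolve_password(config_value: str, lookup) -> str:
    """Pick the right password source for the two call paths.

    Cloud API configs return obfuscated secrets (containing '****') that must
    be resolved by name; locally hydrated configs already hold the plaintext,
    which must be used as-is, since looking it up would fail.
    """
    if "****" in config_value:
        return lookup(config_value)  # cloud path: treat as a secret reference
    return config_value              # local path: already plaintext

# Plaintext passes through untouched; obfuscated values go through the lookup.
assert resolve_password("hunter2", lookup=lambda name: "resolved") == "hunter2"
assert resolve_password("ab****cd", lookup=lambda name: "resolved") == "resolved"
```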
When called from the local Destination.get_sql_cache() path, the destination config contains a plaintext credentials_json field. The previous code always looked up BIGQUERY_CREDENTIALS_PATH, which may point to GSM credentials that lack bigquery.jobs.create permission. Now bigquery_destination_to_cache() extracts credentials_json from the config when available and writes it to a temp file for the cache. Falls back to BIGQUERY_CREDENTIALS_PATH for the cloud API path (obfuscated credentials). Co-Authored-By: AJ Steers <aj@airbyte.io>
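The temp-file handoff described above can be sketched as follows. `credentials_json_to_path` is a hypothetical helper, not the PR's code; the optional `atexit` hook is one possible answer to the temp-file accumulation concern raised later in the review checklist, under the assumption that the cache only needs the path for the lifetime of the process.

```python
import atexit
import json
import os
import tempfile

def credentials_json_to_path(credentials_json: str, *, cleanup_at_exit: bool = False) -> str:
    """Write plaintext credentials JSON to a temp file and return its path."""
    json.loads(credentials_json)  # fail fast on malformed credentials
    # mkstemp() deliberately leaves the file on disk; the cache reads it lazily.
    fd, path = tempfile.mkstemp(suffix=".json", prefix="bq_creds_")
    with os.fdopen(fd, "w") as f:
        f.write(credentials_json)
    if cleanup_at_exit:
        # Remove the file when the interpreter exits, if it still exists.
        atexit.register(lambda: os.path.exists(path) and os.remove(path))
    return path

path = credentials_json_to_path('{"type": "service_account"}')
```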
Summary
Adds automatic readback introspection to destination smoke tests. After writing data, the system reads back stats from the destination using existing PyAirbyte cache implementations as query adapters. This covers 5 destinations (~95% of user traffic): BigQuery, DuckDB, MotherDuck, Postgres, Snowflake.
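The readback flow described above can be sketched with plain Python. All names here are hypothetical stand-ins (the real models are Pydantic and the loop lives in `run_destination_smoke_test()`); the point is the shape: resolve each stream's table, collect stats, and record misses separately so one absent table doesn't fail the whole test.

```python
from dataclasses import dataclass

@dataclass
class TableStatistics:
    row_count: int
    non_null_counts: dict[str, int]

def read_back(stream_names, resolve_table_name, fetch_statistics):
    # Collect per-stream stats; misses are recorded, not raised.
    table_statistics: dict[str, TableStatistics] = {}
    tables_not_found: dict[str, str] = {}
    for stream in stream_names:
        table = resolve_table_name(stream)
        try:
            table_statistics[stream] = fetch_statistics(table)
        except LookupError as exc:
            tables_not_found[stream] = str(exc)
    return table_statistics, tables_not_found

def fake_fetch(table: str) -> TableStatistics:
    # Stand-in for the cache's SQL queries.
    if table == "users":
        return TableStatistics(row_count=3, non_null_counts={"id": 3})
    raise LookupError(f"table {table} not found")

stats, missing = read_back(["Users", "Ghost"], str.lower, fake_fetch)
# stats is keyed by stream name; missing maps stream -> reason
```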
Core changes

New Destination.get_sql_cache(schema_name=...) and Destination.is_cache_supported (destinations/base.py):

- get_sql_cache() mirrors SyncResult.get_sql_cache() in cloud sync results and delegates to destination_to_cache()
- is_cache_supported property for early guard before attempting cache construction

Table statistics on SqlProcessorBase (shared/sql_processor.py):

- fetch_row_count(table_name) — row count via SELECT COUNT(*)
- fetch_column_info(table_name) — column names/types via SQLAlchemy Inspector (shared across calls)
- _get_column_stats(table_name, columns) — per-column null/non-null counts using positional aliases (nn_0, nn_1, ...) to avoid identifier truncation on PostgreSQL
- fetch_table_statistics(stream_names) — full orchestration: resolves table names via the processor's normalizer, queries all streams, returns dict[str, TableStatistics]
- New models: ColumnStatistics, TableStatistics

Smoke test integration (destination_smoke_tests.py):

- DestinationSmokeTestResult gains table_statistics: dict[str, TableStatistics] | None and tables_not_found: dict[str, str] | None

Translation files moved (caches/_utils/):

- destinations/_translate_dest_to_cache.py → caches/_utils/_dest_to_cache.py
- destinations/_translate_cache_to_dest.py → caches/_utils/_cache_to_dest.py
- Breaks the import cycle (caches.bigquery → destinations.__init__ → destinations.base → caches.bigquery)
- Imports in destinations/base.py promoted to top-level now that the cycle is broken

Other:

- Fix paired_destination_name copy-paste bugs in SnowflakeCache, PostgresCache, and MotherDuckCache
- *_destination_to_cache() functions strip the destinationType dispatch key before model construction
- CONTRIBUTING.md updated with style guide

Namespace resolution approach
Investigation confirmed that the catalog namespace (set on each stream by the source) is the primary and sufficient mechanism for directing destinations to write into the test schema. Both Postgres and Snowflake correctly create final tables in the catalog-specified namespace without needing a config-level schema key override.
The config schema key override was removed because:

- _prepare_destination_config() now only forces disable_type_dedupe=False (required because some GSM configs have it set to True, which causes destinations to skip final typed tables that readback depends on). Uses validate=False to skip the redundant spec fetch.
- The schema override is instead applied at the cache level via destination_to_cache(schema_name=...), which sets cache.schema_name and calls dispose_engine() to refresh the schema_translate_map.

Updates since last revision
Credential handling fixes for the Destination.get_sql_cache() path:

The *_destination_to_cache() functions were originally designed for the cloud API path, which returns obfuscated credentials. The new local Destination.get_sql_cache() path passes plaintext credentials from the destination's hydrated config, which required two fixes:

- Snowflake (snowflake_destination_to_cache): Plaintext passwords were being passed to get_secret() as if they were secret names. Now detects plaintext passwords (no ****) and uses them directly.
- BigQuery (bigquery_destination_to_cache): Always fell back to the BIGQUERY_CREDENTIALS_PATH env var. Now extracts plaintext credentials_json from the destination config, writes it to a temp file via tempfile.mkstemp(), and passes the path to BigQueryCache. Falls back to the env var only when credentials are obfuscated (cloud API path).
- validate=False on set_config: _prepare_destination_config() now skips re-validation when toggling disable_type_dedupe, avoiding an unnecessary spec fetch + JSON schema check.

E2E test results
Review & Testing Checklist for Human
- bigquery_destination_to_cache() writes credentials_json to a temp file via mkstemp() that is intentionally never deleted (the cache needs the path to remain valid). Verify this doesn't cause issues in long-running or repeated invocations (temp file accumulation). Consider whether a cleanup hook or atexit handler is warranted.
- "****" obfuscation heuristic: Both snowflake_destination_to_cache() and bigquery_destination_to_cache() detect obfuscated credentials by checking for "****" in the string. Verify this heuristic is reliable — could a legitimate password or credentials JSON contain that substring?
- dispose_engine() after schema override: In destination_to_cache(), cache.processor.sql_config.dispose_engine() is called after setting schema_name so the engine's schema_translate_map is refreshed. Verify this doesn't cause issues if the cache is used immediately after (engine should lazily recreate). Also confirm no other code paths set schema_name post-construction without disposing the engine.
- _get_column_stats positional aliases: The nn_0, nn_1 pattern avoids PostgreSQL's 63-char truncation but assumes result columns are returned in declaration order. Verify this holds across BigQuery (backtick quoting), Snowflake (uppercases aliases to NN_0), and Postgres.
- destinationType key stripping in *_destination_to_cache() functions: All conversion functions now filter out destinationType/DESTINATION_TYPE before Pydantic model construction. Confirm no destination legitimately uses a config field with these exact names.
- Run pyab destination-smoke-test --destination=destination-postgres --config=<config> and verify JSON output includes populated table_statistics with row counts, column types, and null stats for all 14 streams. Also test against an unsupported destination to confirm graceful table_statistics: null.

Notes
- Exercising DuckDBCache with edge-case column names would be valuable as a follow-up.
- Table names are resolved via cache.processor.get_sql_table_name() with fallback to the original stream name, so naming conventions are driven by each cache's existing normalizer.
- MotherDuck credentials are sourced from GSM (SECRET_DESTINATION_DUCKDB__MOTHERDUCK__CREDS).
Link to Devin session: https://app.devin.ai/sessions/f285ca4ab752412581eb228490d47e03
Requested by: Aaron ("AJ") Steers (@aaronsteers)