@serprex (Member) commented Oct 16, 2025

Despite our efforts, nulled values still frequently leak through, so this puts the handling behind a setting.

@codecov bot commented Oct 16, 2025

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 1253 | 2 | 1251 | 128 |
View the full list of 2 ❄️ flaky test(s)
github.com/PeerDB-io/peerdb/flow/e2e::TestApiPg

Flake rate in main: 46.67% (Passed 16 times, Failed 14 times)

Stack Traces | 0.01s run time
=== RUN   TestApiPg
=== PAUSE TestApiPg
=== CONT  TestApiPg
--- FAIL: TestApiPg (0.01s)
github.com/PeerDB-io/peerdb/flow/e2e::TestApiPg/TestFlowStatusUpdate

Flake rate in main: 46.67% (Passed 16 times, Failed 14 times)

Stack Traces | 141s run time
=== RUN   TestApiPg/TestFlowStatusUpdate
=== PAUSE TestApiPg/TestFlowStatusUpdate
=== CONT  TestApiPg/TestFlowStatusUpdate
2025/10/16 20:18:14 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mychcl_2hmx7yvo.test_nullengine
    flow_status_test.go:134: WaitFor wait for paused state 2025-10-16 20:18:18.821020432 +0000 UTC m=+265.914762732
    flow_status_test.go:144: WaitFor wait for running state 2025-10-16 20:18:34.843753164 +0000 UTC m=+281.937495474
2025/10/16 20:18:34 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mychcl_2hmx7yvo.test_nullengine
    api_test.go:44: begin tearing down postgres schema api_22pfid5x
2025/10/16 20:18:35 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mychcl_2hmx7yvo.test_nullengine
    api_test.go:44: 
        	Error Trace:	.../flow/e2e/pg.go:149
        	            				.../flow/e2e/api_test.go:44
        	            				.../flow/e2eshared/e2eshared.go:38
        	            				.../hostedtoolcache/go/1.25.2.../src/testing/testing.go:1308
        	            				.../hostedtoolcache/go/1.25.2.../src/testing/testing.go:1572
        	            				.../hostedtoolcache/go/1.25.2.../src/testing/testing.go:1928
        	            				.../hostedtoolcache/go/1.25.2.../src/runtime/asm_amd64.s:1693
        	Error:      	failed to teardown postgres schema
        	Test:       	TestApiPg/TestFlowStatusUpdate
        	Messages:   	api_22pfid5x: failed to drop e2e_test schema: timeout: context already done: context deadline exceeded
--- FAIL: TestApiPg/TestFlowStatusUpdate (141.26s)


@serprex force-pushed the bring-back-null-avro branch from 282364d to 90b57fe, October 16, 2025 20:03
despite our efforts, nulled values still frequently leak through
@serprex force-pushed the bring-back-null-avro branch from 90b57fe to 90b7203, October 16, 2025 20:10
@serprex requested a review from heavycrystal, October 16, 2025 21:58

@ilidemi (Contributor) left a comment

Logic LGTM. Please add tests with and without the flag.

ilidemi added a commit that referenced this pull request Nov 29, 2025
Current situation:
1. `failed to sync records: failed to write records to S3: failed to
write records to OCF writer: failed to write record to OCF:
some_column_100: avro: *avro.null is unsupported for Avro long` occurs
once a month for one customer
2. The column in question is a nullable integer; all values in it are
null, and it is inherited from a parent table
3. The parent table has the same name but a different schema
4. Parent and child attnums differ, so column deletions were possibly
involved, and the child also seems to have become inherited later on.
5. I was unable to reproduce the error with that information. It is
possible the schema has changed since, so it would be nice to capture
the exact data when we have it.
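
For context on the error in item 1: in Avro, a field declared as plain `long` has no encoding for a missing value, and a nullable field has to be declared as the union `["null", "long"]`. The sketch below only builds and inspects field schemas with the standard library; it does not use the Avro library PeerDB depends on, and `avroField`, `longField`, `canHoldNull`, and `fieldJSON` are hypothetical names for illustration.

```go
package main

import "encoding/json"

// avroField mirrors the shape of one field in an Avro record schema.
// Hypothetical helper type, not taken from the PeerDB code base.
type avroField struct {
	Name string `json:"name"`
	Type any    `json:"type"`
}

// longField builds the field schema for a 64-bit integer column.
// A nullable column is declared as the union ["null", "long"];
// a plain "long" cannot represent a missing value.
func longField(name string, nullable bool) avroField {
	if nullable {
		return avroField{Name: name, Type: []string{"null", "long"}}
	}
	return avroField{Name: name, Type: "long"}
}

// canHoldNull reports whether the declared type admits a null value,
// i.e. whether it is a union with a "null" branch.
func canHoldNull(f avroField) bool {
	branches, ok := f.Type.([]string)
	if !ok {
		return false
	}
	for _, b := range branches {
		if b == "null" {
			return true
		}
	}
	return false
}

// fieldJSON renders the field the way it would appear inside a schema.
func fieldJSON(f avroField) (string, error) {
	b, err := json.Marshal(f)
	return string(b), err
}
```

If the schema is generated with `longField(name, false)` for a column that later yields a null, the writer has nowhere to put that value, which would match the shape of the reported failure.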

This change is a spiritual successor of #3613, but defaults to strict
behavior (which has been working fine for everyone else) and puts the
lax one under a config. When lax is enabled, it collects all the inputs
that go into deciding whether a column would be nullable under strict
behavior, then some extra about table inheritance, and logs it later if
any mismatch with strict was detected. Tested that the new logic runs
and logs when the code is adjusted to under-report nullability, but
as-is the test doesn't do much as the issue is not cleanly reproducible
yet.
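
A minimal sketch of the strict/lax split described above, under stated assumptions: lax mode declares every column nullable, strict mode trusts the source's NOT NULL flag, and a "mismatch" is a null observed in a column that strict would have declared non-nullable. None of the names (`NullableInputs`, `NullTracker`, `ObserveNull`, `Report`) come from the actual change, and the real set of captured inputs is likely larger.

```go
package main

import (
	"fmt"
	"sort"
)

// NullableInputs is a hypothetical record of what fed the nullability
// decision for one column, plus inheritance details kept for the log.
type NullableInputs struct {
	Column        string
	SourceNotNull bool // assumption: the source catalog marks the column NOT NULL
	Inherited     bool // column comes from a parent table
	ParentAttnum  int
	ChildAttnum   int
}

// StrictNullable is the decision strict mode would make from these inputs.
func (in NullableInputs) StrictNullable() bool { return !in.SourceNotNull }

// NullTracker decides schema nullability and, in lax mode, remembers every
// column where a null arrived although strict mode would have rejected it.
type NullTracker struct {
	lax        bool
	inputs     map[string]NullableInputs
	mismatches map[string]int
}

func NewNullTracker(lax bool, cols []NullableInputs) *NullTracker {
	t := &NullTracker{
		lax:        lax,
		inputs:     make(map[string]NullableInputs, len(cols)),
		mismatches: make(map[string]int),
	}
	for _, c := range cols {
		t.inputs[c.Column] = c
	}
	return t
}

// Nullable is what the writer should declare for the column.
func (t *NullTracker) Nullable(col string) bool {
	if t.lax {
		return true
	}
	return t.inputs[col].StrictNullable()
}

// ObserveNull records a null value and reports whether it contradicts
// what strict mode would have declared. It only tracks in lax mode.
func (t *NullTracker) ObserveNull(col string) bool {
	in, known := t.inputs[col]
	if !t.lax || !known || in.StrictNullable() {
		return false
	}
	t.mismatches[col]++
	return true
}

// Report returns one log line per mismatched column, sorted by name.
func (t *NullTracker) Report() []string {
	names := make([]string, 0, len(t.mismatches))
	for name := range t.mismatches {
		names = append(names, name)
	}
	sort.Strings(names)
	lines := make([]string, 0, len(names))
	for _, name := range names {
		in := t.inputs[name]
		lines = append(lines, fmt.Sprintf(
			"nullable mismatch column=%s nulls=%d source_not_null=%t inherited=%t parent_attnum=%d child_attnum=%d",
			name, t.mismatches[name], in.SourceNotNull, in.Inherited, in.ParentAttnum, in.ChildAttnum))
	}
	return lines
}
```

A test "with and without the flag" could construct one tracker per mode from the same inputs, feed both the same nulls, and assert that only the lax tracker produces report lines. Deferring the log to `Report` keeps the per-record path cheap, which seems to fit the "logs it later" wording.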

Also adding a generic code notification metric that's easy to emit from
anywhere; I will set up a non-paging alert on it once this goes in.
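
One way to picture a metric that is "easy to emit from anywhere" is a counter keyed by a short code string. The sketch below is a standard-library stand-in, not the metrics API used in the change; `CodeNotifier`, `Notify`, and `Count` are hypothetical names, and a real implementation would presumably export the counter to the monitoring system that the alert reads from.

```go
package main

import "sync"

// CodeNotifier is a hypothetical in-process counter keyed by a code string.
// It stands in for whatever metrics backend the real change uses.
type CodeNotifier struct {
	mu     sync.Mutex
	counts map[string]int64
}

func NewCodeNotifier() *CodeNotifier {
	return &CodeNotifier{counts: make(map[string]int64)}
}

// Notify increments the counter for code. It is safe to call from any
// goroutine, so call sites need no setup beyond holding the notifier.
func (n *CodeNotifier) Notify(code string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.counts[code]++
}

// Count returns how many times code has been notified so far.
func (n *CodeNotifier) Count(code string) int64 {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.counts[code]
}
```

A call site would then be a single line such as `notifier.Notify("nullable_mismatch")`, with a non-paging alert firing when that code's count becomes non-zero.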

The plan is to enable the setting just for one service, leave it to bake
for another month or two, then check back when the notification fires.
After the issue is sorted out, all the null tracking can be removed.