
RFC: Reusable Source Executor #72

Open
fuyufjh wants to merge 4 commits into main from eric/reuse-source

Conversation

@fuyufjh
Contributor

@fuyufjh fuyufjh commented Aug 18, 2023

No description provided.

Contributor

@BugenZhao BugenZhao left a comment


Really love this idea! This makes "Source" in our system much better-defined.

```rust
let source_stream = ...;   // the stream from the upstream SourceExecutor, i.e. new events
let backfill_stream = ...; // read from the message broker with the user-defined scan_startup_mode, i.e. historical events
// In fact, these should be per Kafka partition
```
Contributor


It might be worth noting that split changes should also be applied to Backfill executors, and...

  • we need to persist the backfill offsets in this case,
  • aligning the distribution/scheduling of Backfill and Source could make life much easier.
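As a rough, hypothetical illustration of the first point (the names and layout here are illustrative, not the actual state-table design): persisting the backfill offset per split means that on every barrier the Backfill executor records how far it has read for each split, so a split reassigned after a split change or recovery resumes where it left off.

```rust
use std::collections::HashMap;

// Hypothetical per-split backfill state. On every barrier the executor
// would persist `offsets` to a state table; `resume_from` tells a
// (possibly newly assigned) split where to continue reading.
#[derive(Default)]
struct BackfillState {
    offsets: HashMap<String, u64>, // split id -> last backfilled offset
}

impl BackfillState {
    fn record(&mut self, split: &str, offset: u64) {
        self.offsets.insert(split.to_string(), offset);
    }

    // A split never seen before (e.g. added by a split change) starts at 0.
    fn resume_from(&self, split: &str) -> u64 {
        self.offsets.get(split).copied().unwrap_or(0)
    }
}

fn main() {
    let mut state = BackfillState::default();
    state.record("partition-0", 42);
    assert_eq!(state.resume_from("partition-0"), 42);
    assert_eq!(state.resume_from("partition-1"), 0); // newly added split
    println!("ok");
}
```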

Contributor Author


+1 for both items


So how to combine the backfill data with incoming data without duplicating or missing events?

Since everything happens inside the `BackfillExecutor`, the implementation is not difficult. Here is the pseudo-code.
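The actual pseudo-code lives in the RFC diff; as a rough, hypothetical illustration of the cutoff idea (synchronous iterators standing in for the real async streams, and `merge_backfill` being an invented name), the executor drains the historical stream, remembers the highest offset it reached, and then forwards only live events beyond that offset:

```rust
// Hypothetical, synchronous simulation of the BackfillExecutor cutoff.
// `backfill` replays historical events; `live` is the shared SourceExecutor
// stream, which may overlap with the tail of the backfill range.
// Events are (offset, payload) pairs for a single partition.
fn merge_backfill(
    backfill: impl IntoIterator<Item = (u64, &'static str)>,
    live: impl IntoIterator<Item = (u64, &'static str)>,
) -> Vec<(u64, &'static str)> {
    let mut out = Vec::new();
    let mut backfill_offset = 0u64;

    // Phase 1: drain the historical stream, tracking how far we got.
    for (off, ev) in backfill {
        backfill_offset = off;
        out.push((off, ev));
    }

    // Phase 2: forward live events, dropping anything at or below the
    // backfill offset to avoid duplicates.
    for (off, ev) in live {
        if off > backfill_offset {
            out.push((off, ev));
        }
    }
    out
}

fn main() {
    let backfill = [(1, "a"), (2, "b"), (3, "c")];
    let live = [(3, "c"), (4, "d")]; // overlaps at offset 3
    let merged = merge_backfill(backfill, live);
    assert_eq!(merged, vec![(1, "a"), (2, "b"), (3, "c"), (4, "d")]);
    println!("{merged:?}");
}
```

In the real executor both phases run concurrently and the cutoff is tracked per partition, but the no-duplicate/no-loss invariant is the same.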
Contributor


So this seems to be another implementation besides MV-on-MV backfill and CDC backfill, since each of them seems to have a slightly different algorithm?

Hope that there's some abstraction for this new backfill executor, so that we don't have to introduce a physical executor every time we support a new source connector. 🥵

Contributor Author


Yeah, there is already quite a lot of duplication between MV-on-MV backfill and CDC backfill. I am not sure whether it's possible to DRY it up; either way is acceptable to me.


Here,

- `KafkaSourceExecutor` is created on executing the `CREATE SOURCE` statement, although a trivial optimization is to do nothing when no downstream operator is attached.
Contributor


I guess this "optimization" is necessary, as there are kinds of sources that do not support rewinding historical data, like Nexmark or Datagen. It could be confusing if the source executor runs by itself and records end up missing from downstream materialized views created later.

Contributor

@st1page st1page left a comment


Love this idea. Some supplements:

  1. For watermarks on the source, we need to add the WatermarkFilter to the source fragment during the `CREATE SOURCE` statement. And I think we do not need to support the watermark logic in the backfill part, because
    • we have not supported, and do not have a design for, MV-on-MV watermarks in the backfill phase (risingwavelabs/risingwave#8375), so we should assume the downstream can accept that there may be no watermark for a long time;
    • our watermark filter algorithm is non-deterministic across different splits, so even if we filter out-of-date data during backfill, backfills in different streaming queries would still be inconsistent.
  2. Based on this design, we will have a globally unique instance for each source, and we can easily insert common logic on the source and surface it to users naturally, e.g. an internal table containing the records filtered by the watermark, or an internal table containing the records that fail to parse from the connector.

@BugenZhao
Contributor

I personally prefer putting one WatermarkFilter after each Backfill, as different materialized views indeed filter out different records considering TTL on external sources. This also matches the current behavior that sources instantiated in different streaming jobs have their own watermarks.

```rust
let backfill_offset = 0;
let backfill_completed = false;

next_event = select! {
```
Contributor


Does this require the backfill executor to have the same parallelism as the source executor?

Contributor


I am concerned that backfilling a source may delay the barrier event. If we bias toward source_stream, the backfill progress may never catch up with the upstream, hurting freshness; if we bias toward backfill_stream, the barrier will be queued in the source. So maybe we need a customized round-robin select! here?

Contributor Author


Here the barrier should be passed through directly (aka "stealing").

This is the same as our design for any other source: the barrier is always injected whenever it arrives.
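The policy being discussed, barriers preempting everything while the two data streams take turns, can be sketched as a pure selection function (hypothetical names; this is not the actual `select!` implementation, just the priority rule it would encode):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Pick {
    Barrier,
    Source,
    Backfill,
    Idle,
}

// Selection policy: barriers always win ("stealing"); otherwise alternate
// between the live source stream and the backfill stream so that neither
// backfill progress nor freshness starves the other.
fn pick(
    barrier_ready: bool,
    source_ready: bool,
    backfill_ready: bool,
    last_was_source: bool,
) -> Pick {
    if barrier_ready {
        return Pick::Barrier;
    }
    match (source_ready, backfill_ready) {
        // Round-robin when both data streams have events pending.
        (true, true) => {
            if last_was_source {
                Pick::Backfill
            } else {
                Pick::Source
            }
        }
        (true, false) => Pick::Source,
        (false, true) => Pick::Backfill,
        (false, false) => Pick::Idle,
    }
}

fn main() {
    // Barrier preempts both data streams.
    assert_eq!(pick(true, true, true, false), Pick::Barrier);
    // Round-robin when both data streams are ready.
    assert_eq!(pick(false, true, true, true), Pick::Backfill);
    assert_eq!(pick(false, true, true, false), Pick::Source);
    // Fall back to whichever stream has data.
    assert_eq!(pick(false, false, true, true), Pick::Backfill);
    println!("ok");
}
```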

Conceptually, a source is very different from a table or a sink in terms of whether it is instantiated. A source defined by a `CREATE SOURCE` statement is not an instance, it's just a set of metadata stored in the catalog until it's referenced by a materialized view, at which point a `SourceExecutor` is instantiated.

As a result, when creating multiple materialized views on top of one source, the `SourceExecutor` is not reused between them. This leads to multiplied resource usage, as well as separate consumer offsets, metrics, etc. Furthermore, it becomes a bug if the source definition uses some exclusive property, such as the consumer group of a Kafka source.

Contributor


We should include @xxchan's comment in this RFC: https://github.com/risingwavelabs/rfcs/pull/72/files#r1388947714

I think currently for N stream jobs on a source, we have N consumers. With this design, we have 1+N consumers during backfilling, and only 1 in steady-state after backfill is finished.

It makes it a lot clearer.

Suggested change
Consider that now, for N stream jobs on a source, we have N consumers. With this design, we have 1+N consumers during backfilling (1 source, N backfill), and only source executor in steady-state after backfill is finished.

Contributor


I will update it to elaborate more on the possible benefits.

7 participants