RFC: Refined S3 Source #76

Merged
tabVersion merged 3 commits into main from tab/refined-s3-source
Oct 18, 2023

Conversation

Contributor

@tabVersion tabVersion commented Sep 8, 2023

a combination of #22 and #74

risingwavelabs/risingwave#11391

Comment on lines +65 to +68
* Task queue is used to store the filenames that are not fetched yet.
* Task queue is initialized by `select * from state_table where partition in local_vnode`. It simply reads everything in the local vnode.
* When receiving new filenames from the message stream, it will push the filenames to the task queue.
* State cache is used to store the filenames that are fetched but not written to hummock yet.

I have two questions here:

  1. How many state tables are maintained in the Fetch executor? What are the schemas?
  2. What does the state transition look like for each file?

Here are my thoughts:

  • Two state tables, "Task Queue" and "File State", are maintained. The former is a partial index of the latter, meaning that files in "Task Queue" must also be present in "File State". Schemas:
"Task Queue"
| file_name | read_offset | enqueue_epoch? (can be used to sort tasks on recovery) |

"File State"
| file_name (bloom_filter_key) | metadata (size/enqueue_epoch/finish_epoch/...) | 
  • State transitions for a file (these are logical states and may not all exist in the implementation)
    • INIT: file name is received from the list executor, or is seen after reading from the state store on recovery
    • INIT -> FINISH: file name is present in "File State" and absent in "Task Queue"
    • INIT -> IN_PROGRESS: file name is absent in "File State". Insert into "File State" and "Task Queue".
    • IN_PROGRESS -> IN_PROGRESS: read_offset updated. Update read_offset in "Task Queue".
    • IN_PROGRESS -> FINISH: finished reading the file. Delete from "Task Queue" and update finish_epoch in "File State".
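To make the proposed transitions concrete, here is a minimal sketch in Python. All names are hypothetical and the two dicts are in-memory stand-ins for the proposed "Task Queue" and "File State" state tables; the real implementation would use RisingWave state tables.

```python
# Hypothetical in-memory stand-ins for the two proposed state tables.
file_state = {}   # file_name -> {"enqueue_epoch": int, "finish_epoch": int | None}
task_queue = {}   # file_name -> {"read_offset": int}  (partial index of file_state)

def on_file_received(name, epoch):
    """INIT: a file name arrives from the list executor (or is replayed on recovery)."""
    if name in file_state and name not in task_queue:
        return "FINISH"  # INIT -> FINISH: already fully read, ignore the duplicate
    if name not in file_state:
        # INIT -> IN_PROGRESS: insert into both "File State" and "Task Queue"
        file_state[name] = {"enqueue_epoch": epoch, "finish_epoch": None}
        task_queue[name] = {"read_offset": 0}
    return "IN_PROGRESS"

def on_read_progress(name, offset):
    """IN_PROGRESS -> IN_PROGRESS: persist the new read offset in "Task Queue"."""
    task_queue[name]["read_offset"] = offset

def on_file_finished(name, epoch):
    """IN_PROGRESS -> FINISH: delete from "Task Queue", record finish_epoch."""
    del task_queue[name]
    file_state[name]["finish_epoch"] = epoch
```

On recovery, only `task_queue` needs to be scanned, which is the point of keeping it as a separate index table.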

Contributor Author


For fetch, we just need one state table. Here is the schema.

| filename | read offset | status | 

In my design, the task queue is in memory.
Every time we insert a new file, the row will be (filename, NULL, INIT); while reading a file, the row becomes (filename, offset, GOING | DONE).

I think we don't need to maintain the order of each file in the task queue.
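The single-table alternative described above can be sketched as follows (a minimal sketch with a dict standing in for the state table; the function names are illustrative, not the actual executor API):

```python
# Sketch of the single state table: | filename | read offset | status |
state_table = {}  # filename -> (read_offset, status)

def enqueue(filename):
    """A newly listed file is persisted as (filename, NULL, INIT)."""
    if filename not in state_table:            # skip duplicates from `list`
        state_table[filename] = (None, "INIT")

def advance(filename, offset, done=False):
    """While reading, the row becomes (filename, offset, GOING | DONE)."""
    state_table[filename] = (offset, "DONE" if done else "GOING")
```

The in-memory task queue holds only the filenames still in INIT/GOING state, so no ordering needs to be persisted.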


The main benefit of keeping a persistent state of received-but-unfinished files is that we don't need to do a full table scan on the state table on recovery. In other words, this extra state table is an index. More files will keep being inserted into the file state table and their states are never cleaned up, so I expect that recovery will become slower over time without the extra index.

The order is just a side benefit of persisting the task queue, and I agree that the order doesn't affect correctness.

Contributor Author


Oh, I underestimated the overhead of doing a table scan. Then your solution fits this scenario best. We can keep the task queue at the thousand-row level with a pagination mechanism.


Oh, I underestimated the overhead of doing a table scan. Then your solution fits this scenario best. We can keep the task queue at the thousand-row level with a pagination mechanism.

SGTM.

Comment on lines +71 to +72
* When receiving a barrier, it will write the state cache to hummock and clear the state cache. (Please note: here we require hummock not to perform an update when inserting the same key.)

please note: here we require hummock not to perform an update when inserting the same key

Why is this the case? Can you explain more?

Contributor Author


The list executor has no guarantee that filenames never duplicate.
For example, file_1 has been done and has been materialized to the state table. But somehow list passed file_1 again, so the fetch executor will enqueue file_1 to the task queue and write it to the state table as a new file when a barrier comes. The default behavior of the state table is to overwrite on the same pk, which would reset file_1 to the init stage.
Or we may do a point get every time we receive a new file to check whether it is already in the state table.
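A minimal sketch of the point-get dedup check discussed here (the dict is a stand-in for the state table's storage; names are illustrative): before enqueueing a file from `list`, look it up by primary key so a finished file is not reset to INIT by an overwriting insert.

```python
# State table stand-in: file_1 has already been fully read.
state_table = {"file_1": (1024, "DONE")}  # filename -> (offset, status)
task_queue = []

def on_listed(filename):
    """Dedup check: point get on the pk before enqueueing a listed file."""
    if state_table.get(filename) is not None:
        return False                        # known file: skip, don't reset it
    state_table[filename] = (None, "INIT")  # new file: persist and enqueue
    task_queue.append(filename)
    return True
```

The cost is one point get per listed filename, which trades a small read amplification for correctness against duplicate listings.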


Or we may do a point get every time we receive a new file to check whether it is already in the state table.

+1 for this approach since it is more straightforward. Also, I think fetch needs to do a point get anyway when it receives a file from list, to determine whether it should be put into the task queue or ignored.

Let's say they are `list` and `fetch`, aligning with the names in [RFC: S3 Source with SQS](https://github.com/risingwavelabs/rfcs/pull/22),
and they do the same thing.

* `List` is unique in one source plan. It is responsible for continuously listing all files and pushing the filenames downstream.
Contributor


Which layer does List reside in?

  • A new kind of connector (split) in the source executor, which persists its state in the offset.
  • A new executor.

## List and Fetch

So I am proposing a new solution, which is to use two separate executors to complete the job.
Let's say they are `list` and `fetch`, aligning with the names in [RFC: S3 Source with SQS](https://github.com/risingwavelabs/rfcs/pull/22),
Contributor


Will we expose this two-phase implementation to users? In other words, how will users create an S3 source?

  • By `create source file_names (..)` and then `create materialized view s3_files as select fetch(file_name)`,
  • or `create source s3_files (..)` directly? And in that case, how many streaming jobs will we have, 1 or 2?

Contributor Author


For `create source`, we will only initialize the `list` part; `fetch` is created along with the materialized views built on the source.

Comment on lines +48 to +49
The `list` executor will keep listing the files in S3 bucket and try its best to de-duplicate the files (but no guarantee).
It will record the timestamp when doing the last listing and filter out the files that are created before the timestamp.
Contributor


May I ask where the duplication comes from?

Contributor Author

@tabVersion tabVersion Sep 15, 2023


list is going to keep querying, so the same file will be listed repeatedly.

Contributor


Yeah, but we'll filter the files with the created/modified date. Is it possible that the same file is included in multiple list responses?

And we can utilize the idea of [RFC: Refine S3 Source: List Objects in A Lazy Way](https://github.com/risingwavelabs/rfcs/pull/74),
not to list all pages at once, but to list the pages one by one, preventing overload downstream.

`list` executor will keep the max page `p` number as well as the last list timestamp `ts` in its state store.
Contributor


May I ask how we can find the files that have been deleted?

Contributor Author


No. `list` only handles newly added files. If `fetch` meets an invalid file path, the file is discarded.
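The lazy, paged listing described in the quoted RFC text above can be sketched as follows. The fake S3 client and all names are illustrative; only the bookkeeping of the max page `p` and last listing timestamp `ts` follows the RFC.

```python
# Sketch of the lazy listing loop: list one page per tick, filter out files
# modified before the last completed sweep, and keep (p, ts) as executor state.
class FakeS3:
    """Hypothetical stand-in for the S3 client; pages of (name, mtime) tuples."""
    def __init__(self, pages):
        self.pages = pages

    def list_page(self, p):
        return self.pages[p] if p < len(self.pages) else []

def list_tick(client, state, now):
    """One iteration of the `list` executor."""
    page = client.list_page(state["p"])
    if not page:                          # full sweep finished: reset for the next one
        state["ts"], state["p"] = now, 0
        return []
    state["p"] += 1                       # persist progress so recovery resumes here
    # best-effort dedup: drop files already seen before the last sweep's timestamp
    return [name for name, mtime in page if mtime > state["ts"]]
```

Because the `ts` filter only catches files older than the last completed sweep, the same file can still appear in two overlapping sweeps, which is why the RFC calls the de-duplication best effort with no guarantee.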

@fuyufjh
Contributor

fuyufjh commented Sep 15, 2023

Some thoughts about reusing the source (#72)

  • Need to read a snapshot from the filename -> read offset table, as the workload for FileSourceBackfillExecutor
  • Need to support hidden system columns: file name and offset

@tabVersion tabVersion merged commit 8cd318c into main Oct 18, 2023