Conversation

@tritone tritone commented Dec 30, 2025

Complete rewrite of storage.MultiRangeDownloader. The new design should be more resilient to concurrency issues, deadlocks, retries, etc.
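
For context, a rough sketch of how the public MultiRangeDownloader surface is typically exercised follows. It assumes the existing API shape (NewGRPCClient, ObjectHandle.NewMultiRangeDownloader, Add taking an io.Writer, offset, length, and a completion callback, plus Wait and Close); exact signatures may differ from what is shown here, so treat this as illustrative rather than authoritative.

package main

import (
	"bytes"
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	// The MultiRangeDownloader requires the gRPC transport.
	client, err := storage.NewGRPCClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	obj := client.Bucket("my-bucket").Object("my-object")

	// Open a multi-range downloader for the object.
	mrd, err := obj.NewMultiRangeDownloader(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Request two ranges; each writes into its own in-memory buffer.
	// The callback signature shown here is assumed, not verified against this PR.
	var first, second bytes.Buffer
	mrd.Add(&first, 0, 1024, func(offset, length int64, err error) {
		if err != nil {
			log.Printf("range at %d failed: %v", offset, err)
		}
	})
	mrd.Add(&second, 4096, 1024, func(offset, length int64, err error) {
		if err != nil {
			log.Printf("range at %d failed: %v", offset, err)
		}
	})

	// Wait for all outstanding ranges, then close the downloader.
	mrd.Wait()
	if err := mrd.Close(); err != nil {
		log.Fatal(err)
	}
	fmt.Println(first.Len(), second.Len())
}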

@tritone tritone requested review from a team as code owners December 30, 2025 06:45
@product-auto-label bot added the api: storage label (Issues related to the Cloud Storage API) on Dec 30, 2025
@gemini-code-assist

Summary of Changes

Hello @tritone, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a substantial architectural overhaul to the MultiRangeDownloader component within the storage client. The changes aim to create a more robust and maintainable system by moving from a distributed goroutine-based implementation to a centralized, event-loop-driven manager. This new design is intended to make the multi-range download functionality more resilient to transient network issues, concurrency challenges, and potential deadlocks, ultimately leading to a more reliable and performant client.

Highlights

  • Core Refactor of MultiRangeDownloader: The storage.MultiRangeDownloader has been completely rewritten to improve resilience against concurrency issues, deadlocks, and retries. The new design centralizes state management and uses an event-loop architecture.
  • New Manager-Based Architecture: A new multiRangeDownloaderManager struct and associated command pattern (mrdAddCmd, mrdCloseCmd, etc.) have been introduced to orchestrate the download process, handling requests, responses, and stream lifecycle events; a generic, illustrative sketch of this event-loop pattern appears after this list.
  • Enhanced Stream Session Management: Individual gRPC bidirectional streams are now managed by bidiReadStreamSession instances, which include robust mechanisms for initial session establishment, sending requests, receiving responses, and handling stream redirections and retries.
  • Improved Error Handling and Retries: The refactor includes more sophisticated error handling, particularly for gRPC status codes and redirect errors, with automatic stream re-establishment and re-sending of pending ranges. Test cases were updated to reflect these new error behaviors.
  • Dependency Updates: Several Go module dependencies, including various AWS SDK components, gRPC, Prometheus, and OpenTelemetry, have been updated in go.work.sum.
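
To make the manager/command description above concrete (referenced in the second bullet), here is a minimal, generic sketch of an event-loop manager driven by a command channel. All names are illustrative and do not correspond to the PR's actual mrdAddCmd/mrdCloseCmd types; it only demonstrates the pattern of serializing every state change through a single goroutine, which is what makes lock-free shutdown and retry handling tractable.

package main

import (
	"context"
	"fmt"
)

// rangeReq describes one requested byte range (illustrative only).
type rangeReq struct {
	offset, length int64
	done           chan error
}

// command is anything the manager's event loop can process.
type command interface{ isCommand() }

type addCmd struct{ req *rangeReq }
type closeCmd struct{}

func (addCmd) isCommand()   {}
func (closeCmd) isCommand() {}

// manager owns all mutable state; only its run loop touches it, so no
// locks are needed and there is a single shutdown code path.
type manager struct {
	cmds    chan command
	pending []*rangeReq
}

func (m *manager) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			m.failAll(ctx.Err())
			return
		case c := <-m.cmds:
			switch cmd := c.(type) {
			case addCmd:
				m.pending = append(m.pending, cmd.req)
				// A real implementation would send the range on the gRPC stream here.
			case closeCmd:
				m.failAll(fmt.Errorf("downloader closed"))
				return
			}
		}
	}
}

func (m *manager) failAll(err error) {
	for _, r := range m.pending {
		r.done <- err
	}
	m.pending = nil
}

func main() {
	m := &manager{cmds: make(chan command)}
	go m.run(context.Background())

	req := &rangeReq{offset: 0, length: 1024, done: make(chan error, 1)}
	m.cmds <- addCmd{req: req}
	m.cmds <- closeCmd{}
	fmt.Println(<-req.done)
}

In the actual PR, the same loop also multiplexes stream responses and lifecycle events (redirects, retries) alongside these commands, per the highlights above.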




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request presents a significant and well-structured refactoring of the MultiRangeDownloader. The new design, which utilizes an event loop (actor model), is an excellent approach for managing concurrency and state, resulting in more robust and comprehensible logic. The clear separation of concerns among the manager, sessions, and commands is commendable.

However, I have identified a critical issue where a blocking io.Writer could stall the entire downloader. Additionally, there are a couple of medium-severity issues concerning API design and code duplication that, if addressed, would further enhance the resilience and maintainability of this new implementation.


content := dataRange.GetChecksummedData().GetContent()
req.bytesWritten += int64(len(content))
_, err := req.output.Write(content)
Contributor

critical

The req.output.Write(content) call is executed within the manager's single-threaded event loop. If a user provides an io.Writer that blocks (e.g., writing to a slow network connection or a full pipe), it will stall the entire event loop. This prevents the processing of any other events, such as receiving more data from the gRPC stream, handling new Add requests, or closing the downloader. This can lead to performance degradation, deadlocks, and may cause the stream's receive buffer to fill up, creating backpressure on the server.

To resolve this, the Write operation should be performed asynchronously, outside of the event loop. A potential solution is to use a dedicated writer goroutine for each range request that consumes data from a channel. The event loop would then push data chunks to the appropriate channel without blocking.
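
A rough sketch of the kind of asynchronous writer being suggested here, assuming each range gets its own goroutine and a buffered chunk channel; the names are hypothetical and this is not code from the PR.

package main

import (
	"bytes"
	"fmt"
	"io"
)

// rangeWriter decouples a potentially blocking io.Writer from the event loop
// by draining a channel of chunks in its own goroutine.
type rangeWriter struct {
	chunks chan []byte
	done   chan error
}

func newRangeWriter(w io.Writer) *rangeWriter {
	rw := &rangeWriter{
		chunks: make(chan []byte, 16), // buffered so the event loop rarely blocks
		done:   make(chan error, 1),
	}
	go func() {
		for chunk := range rw.chunks {
			if _, err := w.Write(chunk); err != nil {
				rw.done <- err
				return
			}
		}
		rw.done <- nil
	}()
	return rw
}

// push is what the event loop would call instead of writing directly. A real
// implementation would also need to copy chunks it does not own, and decide
// what to do (fail the range, apply deliberate backpressure) if the channel fills.
func (rw *rangeWriter) push(chunk []byte) {
	rw.chunks <- chunk
}

// finish closes the channel and waits for the writer goroutine to drain it.
func (rw *rangeWriter) finish() error {
	close(rw.chunks)
	return <-rw.done
}

func main() {
	var buf bytes.Buffer
	rw := newRangeWriter(&buf)
	rw.push([]byte("hello "))
	rw.push([]byte("world"))
	fmt.Println(rw.finish(), buf.String())
}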

Contributor Author

Discussed this offline; it is the caller's responsibility to ensure a non-blocking writer. We can consider this as an option later.

Contributor

Even a non-blocking writer will take some time to write, though - if this blocks Add() calls in the meantime, it may cause noticeable performance degradation for certain workloads.

Contributor Author

Definitely worth investigating, but if we are just writing messages to in-memory buffers it will be trivial. For file I/O I agree it will perhaps be a bigger deal.
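
To illustrate the distinction being discussed: an in-memory destination returns from Write almost immediately, while something like an io.Pipe with no reader blocks until it is drained, which is exactly the situation that would stall a single-threaded event loop. A tiny self-contained demonstration (not MRD code):

package main

import (
	"bytes"
	"fmt"
	"io"
	"time"
)

// writeWithTimeout reports whether a single Write to w completes within d.
func writeWithTimeout(w io.Writer, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		w.Write(make([]byte, 1<<16))
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

func main() {
	// An in-memory buffer: Write returns immediately.
	fmt.Println("bytes.Buffer completed:", writeWithTimeout(&bytes.Buffer{}, 100*time.Millisecond))

	// An io.Pipe with no reader: Write blocks until someone reads from the
	// other end, so it times out. (The blocked goroutine is leaked here; a
	// real program would close the pipe.)
	_, pw := io.Pipe()
	fmt.Println("io.Pipe completed:", writeWithTimeout(pw, 100*time.Millisecond))
}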

Comment on lines 672 to 707
func (m *multiRangeDownloaderManager) handleStreamEnd(result mrdSessionResult) {
	m.currentSession = nil
	err := result.err

	if result.redirect != nil {
		m.readSpec.RoutingToken = result.redirect.RoutingToken
		m.readSpec.ReadHandle = result.redirect.ReadHandle
		if ensureErr := m.ensureSession(m.ctx); ensureErr != nil {
			if !m.isRetryable(ensureErr) {
				m.permanentErr = ensureErr
				m.attrsOnce.Do(func() { close(m.attrsReady) })
				m.failAllPending(m.permanentErr)
			}
		}
	} else if m.isRetryable(err) {
		if len(m.pendingRanges) > 0 {
			if ensureErr := m.ensureSession(m.ctx); ensureErr != nil {
				if !m.isRetryable(ensureErr) {
					m.permanentErr = ensureErr
					m.attrsOnce.Do(func() { close(m.attrsReady) })
					m.failAllPending(m.permanentErr)
				}
			}
		}
	} else {
		if !errors.Is(err, context.Canceled) && !errors.Is(err, errClosed) {
			if m.permanentErr == nil {
				m.permanentErr = err
			}
		} else if m.permanentErr == nil {
			m.permanentErr = errClosed
		}
		m.failAllPending(m.permanentErr)
		m.attrsOnce.Do(func() { close(m.attrsReady) })
	}
}
Contributor

medium

The logic for handling a non-retryable error from ensureSession is duplicated within both the if result.redirect != nil block and the else if m.isRetryable(err) block. This repetition makes the code more difficult to read and maintain.

This complex error-handling logic could be simplified by refactoring the duplicated code into a common path that is executed after the if/else if chain. This would improve code clarity and reduce the chance of introducing bugs in the future.

func (m *multiRangeDownloaderManager) handleStreamEnd(result mrdSessionResult) {
	m.currentSession = nil
	err := result.err

	var ensureErr error
	if result.redirect != nil {
		m.readSpec.RoutingToken = result.redirect.RoutingToken
		m.readSpec.ReadHandle = result.redirect.ReadHandle
		ensureErr = m.ensureSession(m.ctx)
	} else if m.isRetryable(err) {
		if len(m.pendingRanges) > 0 {
			ensureErr = m.ensureSession(m.ctx)
		}
	} else {
		// Non-retryable error from the stream itself.
		if !errors.Is(err, context.Canceled) && !errors.Is(err, errClosed) {
			if m.permanentErr == nil {
				m.permanentErr = err
			}
		} else if m.permanentErr == nil {
			m.permanentErr = errClosed
		}
		m.failAllPending(m.permanentErr)
		m.attrsOnce.Do(func() { close(m.attrsReady) })
		return
	}

	// Handle error from ensureSession.
	if ensureErr != nil && !m.isRetryable(ensureErr) {
		m.permanentErr = ensureErr
		m.attrsOnce.Do(func() { close(m.attrsReady) })
		m.failAllPending(m.permanentErr)
	}
}

Contributor

I agree. I think there is a bit of code duplication here. Can be simplified a bit I think.

@BrennaEpp BrennaEpp left a comment

Some initial comments/questions

}
m.readIDCounter++

// Attributes should be ready if we are processing Add commands
Contributor

We should also handle the case where the offset is greater than the object size: the range should be failed immediately and not added to the stream. Otherwise, a permanent error will be set on the MRD when the server returns an Out of Range error.
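
A minimal sketch of the guard being proposed, assuming the manager caches the object size learned from the initial stream response metadata; the type and field names here are hypothetical and not from the PR. As the replies below note, this check is debatable for appendable objects that can grow after the size was read.

package main

import (
	"errors"
	"fmt"
)

var errOutOfRange = errors.New("storage: requested offset is beyond object size")

// mrdState stands in for whatever manager state holds the object size
// learned from the initial stream metadata (hypothetical).
type mrdState struct {
	objectSize int64
}

// validateRange fails an out-of-range Add up front instead of letting the
// server return OutOfRange and setting a permanent error on the downloader.
func (m *mrdState) validateRange(offset, length int64) error {
	if offset < 0 || length < 0 {
		return errors.New("storage: offset and length must be non-negative")
	}
	if offset >= m.objectSize {
		return errOutOfRange
	}
	return nil
}

func main() {
	m := &mrdState{objectSize: 4096}
	fmt.Println(m.validateRange(0, 1024))    // <nil>
	fmt.Println(m.validateRange(8192, 1024)) // out-of-range error
}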

Contributor Author

This is a bit of a tricky case because, if there is another concurrent writer to the object, the Size will be out of date and these calls will in fact succeed. I don't think we validate this in the existing MRD code, and it's on the caller to decide how to handle this.

Contributor

Yes, but the MRD works with a single version of the object, which is what we should stick to, no? Without this validation we would set a permanent error if even one invalid range is provided (and get no data for any valid range provided after it).

Contributor Author

The MRD is supposed to support an object that grows; see the tailing reads example: https://github.com/GoogleCloudPlatform/golang-samples/blob/main/storage/rapid/read_appendable_object_tail.go

Can you check with the GCSFuse team on the expected behavior here? I know they have logic in their code to recover from these types of permanent errors.

Contributor

Sounds good, I will confirm with them; if required, this can be fixed in a subsequent PR.

cpriti-os previously approved these changes Jan 7, 2026
@tritone tritone merged commit 1cfd100 into googleapis:main Jan 7, 2026
10 checks passed