Skip to content

Conversation

@heyihong
Copy link
Contributor

@heyihong heyihong commented Nov 14, 2025

What changes were proposed in this pull request?

This PR makes CollectMetricsExec.collectedMetrics thread-safe by introducing synchronization and caching at the accumulator level.

Key changes:

  1. Added a collectedMetricsRow field to cache the converted metrics row
  2. Modified collectedMetrics method to use accumulator.synchronized block with lazy initialization and caching
  3. Updated resetMetrics() to clear the cache and reset the accumulator within a synchronized block
  4. Removed the observedMetricsLock from QueryExecution.observedMetrics since thread-safety is now handled at the CollectMetricsExec level

The synchronization ensures that:

  • Multiple threads can safely call collectedMetrics concurrently
  • The row conversion (toRowConverter) is performed only once and cached
  • The cache is properly invalidated when metrics are reset

Why are the changes needed?

The CollectMetricsExec.collectedMetrics method can be called concurrently from multiple threads when accessing QueryExecution.observedMetrics in connect mode. Without proper synchronization, this can lead to:

  1. Race conditions during the row conversion process
  2. Potential inconsistent or corrupted metrics results
  3. Possible thread-safety violations in the underlying conversion logic

Does this PR introduce any user-facing change?

No. This is an internal bug fix that improves thread-safety. Users should not observe any behavioral changes except for the elimination of potential race conditions and incorrect results when accessing metrics concurrently.

How was this patch tested?

build/sbt "sql/testOnly *DatasetSuite -- -z SPARK-54353"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.7.54

@github-actions github-actions bot added the SQL label Nov 14, 2025
@hvanhovell
Copy link
Contributor

@heyihong please fix the locking in CollectMetricsExec. Get the value out of the accumulator and move into a a synchronized field. This change as is will cause performance regression throughout the engine.

@heyihong heyihong changed the title [SPARK-54353][CORE][SQL] Make ArrayBasedMapBuilder thread-safe [SPARK-54353][SQL] Make CollectMetricsExec.collectedMetrics thread-safe Nov 14, 2025
@heyihong heyihong requested a review from hvanhovell November 14, 2025 15:47
@heyihong
Copy link
Contributor Author

heyihong commented Nov 17, 2025

@heyihong please fix the locking in CollectMetricsExec. Get the value out of the accumulator and move into a a synchronized field. This change as is will cause performance regression throughout the engine.

@hvanhovell accumulator.value is not thread-safe (e.g., when using ArrayBasedMapBuilder) and needs synchronization. Also, repeatedly computing the observed metrics is not very performant. One solution I came up with is to cache the observed metrics and invalidate the cache properly. Please take another look at this change when you have time.

cc: @dongjoon-hyun

@dongjoon-hyun
Copy link
Member

Thank you for pinging me, @heyihong . Given that this is not a new code part, I moved your JIRA issue from Spark 4.1 to 4.2 because it doesn't look like a regression. I hope we can get enough discussion first and backport to all applicable branches (if needed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants