add sparkapp collection support #224
Conversation
Walkthrough
Adds SparkApplication and ScheduledSparkApplication collection support plus new exclusion types for Kubeflow Notebooks, Volcano Jobs, Spark Applications, and Scheduled Spark Applications; updates collectors, controller wiring, CRD, deepcopy, and RBAC to permit observing these resources.
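For orientation, the two Spark Operator resources involved can be identified by these GroupVersionResources (group, version, and resource names taken from the sequence diagram below); a minimal Go sketch, with the variable names being illustrative:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// GroupVersionResources for the two watched Spark Operator CRDs, as named
// in the sequence diagram below; variable names are illustrative.
var (
	sparkAppGVR = schema.GroupVersionResource{
		Group:    "sparkoperator.k8s.io",
		Version:  "v1beta2",
		Resource: "sparkapplications",
	}
	scheduledSparkAppGVR = schema.GroupVersionResource{
		Group:    "sparkoperator.k8s.io",
		Version:  "v1beta2",
		Resource: "scheduledsparkapplications",
	}
)

func main() {
	fmt.Println(sparkAppGVR, scheduledSparkAppGVR)
}
```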
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Ctrl as Controller
    participant Dyn as Dynamic Client
    participant Factory as Informer Factory
    participant Informer as Informer
    participant Handler as Event Handler
    participant Batcher as Batcher
    Ctrl->>Dyn: NewSparkCollector(namespaces, exclusions)
    Dyn->>Factory: Build informer for sparkoperator.k8s.io/v1beta2
    Factory->>Informer: Create informer (sparkapplications / scheduledsparkapplications)
    Ctrl->>Informer: Attach Add/Update/Delete handlers
    Informer->>Informer: Start & sync cache
    loop Resource Events
        Informer->>Handler: Event (Add/Update/Delete, Unstructured)
        Handler->>Handler: Validate & extract namespace/name
        Handler->>Handler: Check exclusions map
        alt Not excluded
            Handler->>Batcher: Enqueue CollectedResource (type, key, timestamp, raw)
            Batcher->>Ctrl: Flush batch to resource channel
        else Excluded
            Handler->>Handler: Skip
        end
    end
    Ctrl->>Informer: Stop collector
    Informer->>Factory: Stop informer
    Batcher->>Batcher: Stop
```
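To make the "Check exclusions map" step concrete, here is a minimal sketch of an exclusion check over the unstructured objects the informer delivers; the `namespace/name` key format and the function name are assumptions, not the PR's actual code:

```go
package collector

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// isExcluded reports whether a watched object appears in the collector's
// exclusion map. The "namespace/name" key format is assumed for illustration.
func isExcluded(excluded map[string]bool, obj *unstructured.Unstructured) bool {
	key := fmt.Sprintf("%s/%s", obj.GetNamespace(), obj.GetName())
	return excluded[key]
}
```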
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
📒 Files selected for processing (3)
🧰 Additional context used
🧬 Code graph analysis (1)
api/v1/zz_generated.deepcopy.go (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (3)
Branch updated: dfc484e to 99e476d (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (2)
internal/collector/scheduled_spark_application_collector.go (1)
258-259: Consider reducing log verbosity for collected resources.
Logging at `Info` level for every collected resource may generate excessive log output in clusters with many Spark applications. Consider using `V(1).Info()` or `Debug` level for per-resource logging.
🔎 Suggested change

```diff
 // Send the processed resource to the batch channel
-c.logger.Info("Collected Scheduled Spark Application resource", "key", key, "eventType", eventType, "resource", processedObj)
+c.logger.V(1).Info("Collected Scheduled Spark Application resource", "key", key, "eventType", eventType)
```

internal/collector/spark_application_collector.go (1)
258-259: Consider reducing log verbosity for collected resources.
Same as for `ScheduledSparkApplicationCollector` - logging at `Info` level for every collected resource may be excessive.
🔎 Suggested change

```diff
 // Send the processed resource to the batch channel
-c.logger.Info("Collected Spark Application resource", "key", key, "eventType", eventType, "resource", processedObj)
+c.logger.V(1).Info("Collected Spark Application resource", "key", key, "eventType", eventType)
```
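For context: with logr (the logging interface used by controller-runtime), `V(1)` messages are suppressed at the default verbosity level, so the suggested change keeps per-resource logs available when verbosity is raised for debugging, without flooding output in normal operation.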
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (8)
dist/backend-install.yamlis excluded by!**/dist/**dist/install.yamlis excluded by!**/dist/**dist/installer_updater.yamlis excluded by!**/dist/**dist/zxporter.yamlis excluded by!**/dist/**gen/api/v1/common.pb.gois excluded by!**/*.pb.go,!**/gen/**gen/api/v1/k8s.pb.gois excluded by!**/*.pb.go,!**/gen/**gen/api/v1/metrics_collector.pb.gois excluded by!**/*.pb.go,!**/gen/**proto/dakr_proto_descriptor.binis excluded by!**/*.bin
📒 Files selected for processing (8)
api/v1/collectionpolicy_types.goconfig/rbac/role.yamlhelm-chart/zxporter/templates/zxporter-rbac.yamlinternal/collector/interface.gointernal/collector/scheduled_spark_application_collector.gointernal/collector/spark_application_collector.gointernal/collector/types.gointernal/controller/collectionpolicy_controller.go
🧰 Additional context used
🧬 Code graph analysis (4)
internal/collector/interface.go (1)
- gen/api/v1/metrics_collector.pb.go (5)
  - ResourceType (98-98)
  - ResourceType (305-307)
  - ResourceType (309-311)
  - ResourceType (318-320)
  - ResourceType_RESOURCE_TYPE_SPARK_APPLICATION (176-176)
internal/collector/types.go (1)
- internal/collector/interface.go (5)
  - VolumeAttachment (140-140)
  - KubeflowNotebook (141-141)
  - VolcanoJob (142-142)
  - SparkApplication (143-143)
  - ScheduledSparkApplication (144-144)
api/v1/collectionpolicy_types.go (1)
- internal/collector/types.go (2)
  - ExcludedSparkApplication (180-186)
  - ExcludedScheduledSparkApplication (189-195)
internal/controller/collectionpolicy_controller.go (3)
- internal/collector/types.go (2)
  - ExcludedSparkApplication (180-186)
  - ExcludedScheduledSparkApplication (189-195)
- internal/collector/interface.go (3)
  - Namespace (95-95)
  - SparkApplication (143-143)
  - ScheduledSparkApplication (144-144)
- internal/collector/batcher.go (2)
  - DefaultMaxBatchSize (16-16)
  - DefaultMaxBatchTime (19-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test Metrics Server Lifecycle on K8s v1.32.3
- GitHub Check: Build Docker Image
- GitHub Check: Analyze (go)
🔇 Additional comments (25)
config/rbac/role.yaml (1)
308-316: LGTM!
The RBAC rule correctly grants read-only access (`get`, `list`, `watch`) to Spark Operator custom resources, which aligns with the monitoring requirements of the new collectors.
helm-chart/zxporter/templates/zxporter-rbac.yaml (1)
353-361: LGTM!
The Helm chart RBAC rule mirrors the config/rbac/role.yaml addition and is correctly placed alongside other optional third-party integrations.
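For reference, a read-only rule for these CRDs would take roughly the following form; this is a sketch assembled from the group, resource, and verb names cited in this review, not the PR's exact YAML:

```yaml
- apiGroups:
    - sparkoperator.k8s.io
  resources:
    - sparkapplications
    - scheduledsparkapplications
  verbs:
    - get
    - list
    - watch
```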
internal/collector/interface.go (2)
143-144: LGTM!
The new `SparkApplication` and `ScheduledSparkApplication` resource types are correctly added to the enum, maintaining proper ordering after `VolcanoJob`.
199-200: LGTM!
String representations follow the established snake_case convention used throughout the codebase.
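A minimal sketch of what the snake_case string mapping looks like for the new types; the enum values and surrounding code here are stand-ins, with only the `spark_application` and `scheduled_spark_application` names taken from this review:

```go
package collector

type ResourceType int

// Illustrative positions only; the real enum lives in
// internal/collector/interface.go.
const (
	SparkApplication ResourceType = iota
	ScheduledSparkApplication
)

// String follows the codebase's snake_case convention for the new types.
func (r ResourceType) String() string {
	switch r {
	case SparkApplication:
		return "spark_application"
	case ScheduledSparkApplication:
		return "scheduled_spark_application"
	default:
		return "unknown"
	}
}
```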
internal/collector/types.go (2)
15-15: LGTM!
The new resource types are correctly added to `AllResourceTypes()`, enabling proper enumeration across the collection framework.
178-195: LGTM!
The exclusion structs follow the established pattern with consistent field naming and JSON tags matching other exclusion types in this file.
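The shape those exclusion structs presumably take, inferred from the established pattern the review cites (a namespace plus a name, with JSON tags); the field names here are assumptions:

```go
package collector

// ExcludedSparkApplication identifies a single SparkApplication to skip
// during collection. Field names are inferred, not copied from the PR.
type ExcludedSparkApplication struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}

// ExcludedScheduledSparkApplication is the analogous exclusion for
// ScheduledSparkApplication resources.
type ExcludedScheduledSparkApplication struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}
```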
api/v1/collectionpolicy_types.go (2)
96-105: LGTM!
The new exclusion fields are properly added to the `Exclusions` struct with consistent JSON tags and `omitempty` options, following the established pattern for other exclusion categories.
389-423: LGTM!
The new exclusion types (`ExcludedKubeflowNotebook`, `ExcludedVolcanoJob`, `ExcludedSparkApplication`, `ExcludedScheduledSparkApplication`) follow the established pattern with proper documentation comments and JSON tags.
internal/controller/collectionpolicy_controller.go (7)
124-127: LGTM!
The `PolicyConfig` struct correctly includes the new exclusion fields for Spark-related resources, maintaining consistency with the existing exclusion pattern.
257-257: LGTM!
The kubebuilder RBAC annotation correctly declares the required permissions for Spark Operator resources, which will be used for manifest generation.
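Such a kubebuilder marker typically looks like the following; a sketch assembled from the group, resources, and verbs named in this review, from which controller-gen generates rules like the YAML shown earlier:

```go
//+kubebuilder:rbac:groups=sparkoperator.k8s.io,resources=sparkapplications;scheduledsparkapplications,verbs=get;list;watch
```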
595-625: LGTM!
The exclusion conversion logic correctly transforms API spec exclusions to internal collector exclusion types for Kubeflow Notebooks, Volcano Jobs, and Spark Applications, following the established pattern used for other resource types.
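That conversion plausibly reduces to a copy loop like this sketch; the type and function names are stand-ins for the repo's `api/v1` and `internal/collector` types:

```go
package main

// Stand-ins for the API spec and collector exclusion types.
type apiExclusion struct{ Namespace, Name string }
type collectorExclusion struct{ Namespace, Name string }

// toCollectorExclusions mirrors the spec-to-collector transformation the
// review describes: a field-for-field copy per excluded resource.
func toCollectorExclusions(spec []apiExclusion) []collectorExclusion {
	out := make([]collectorExclusion, 0, len(spec))
	for _, e := range spec {
		out = append(out, collectorExclusion{Namespace: e.Namespace, Name: e.Name})
	}
	return out
}

func main() {
	_ = toCollectorExclusions([]apiExclusion{{Namespace: "spark-jobs", Name: "etl-daily"}})
}
```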
822-828: LGTM!
The `identifyAffectedCollectors` method correctly identifies when Spark-related exclusion changes require collector restarts.
1515-1534: LGTM!
The selective restart logic correctly handles the new `spark_application` and `scheduled_spark_application` collector types with proper configuration passthrough.
2668-2691: LGTM!
The collector registration correctly includes both `SparkApplicationCollector` and `ScheduledSparkApplicationCollector` with appropriate configuration.
3329-3348: LGTM!
The disabled collectors change handler correctly supports re-enabling Spark-related collectors with proper configuration.
internal/collector/scheduled_spark_application_collector.go (5)
22-36: LGTM!
The collector struct is well-designed with proper synchronization (mutex for exclusion map access) and follows the established collector pattern.
38-81: LGTM!
The constructor properly initializes all fields, creates the exclusion lookup map for O(1) access, and sets up the batching infrastructure.
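The O(1) lookup map is presumably built once in the constructor, along these lines (the key format is assumed, matching the handler sketch earlier):

```go
package collector

// ExcludedScheduledSparkApplication as sketched earlier in this review.
type ExcludedScheduledSparkApplication struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}

// buildExclusionMap converts the exclusion slice into a set keyed by
// "namespace/name" so each event-handler check is a single map lookup.
func buildExclusionMap(excluded []ExcludedScheduledSparkApplication) map[string]bool {
	m := make(map[string]bool, len(excluded))
	for _, e := range excluded {
		m[e.Namespace+"/"+e.Name] = true
	}
	return m
}
```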
83-242: LGTM!
The `Start` method correctly:
- Sets up the dynamic informer with proper namespace filtering
- Handles Add/Update/Delete events with tombstone support for deletions (see the sketch after this list)
- Waits for cache sync with a timeout
- Starts the batcher and context cancellation handler
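Tombstone support here refers to the standard client-go idiom for delete events; a minimal sketch of that idiom, not the PR's literal code:

```go
package collector

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
)

// asUnstructured unwraps informer delete events, including the
// DeletedFinalStateUnknown tombstones client-go delivers when a watch
// missed the actual delete.
func asUnstructured(obj interface{}) (*unstructured.Unstructured, bool) {
	if u, ok := obj.(*unstructured.Unstructured); ok {
		return u, true
	}
	tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
	if !ok {
		return nil, false
	}
	u, ok := tombstone.Obj.(*unstructured.Unstructured)
	return u, ok
}
```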
309-346: LGTM!
The `Stop` method properly handles cleanup:
- Stops all informers by closing their stop channels
- Safely closes the main stop channel with a select guard to prevent double-close panics (sketched below)
- Closes the batch channel and stops the batcher
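The select guard is a common Go idiom for closing a channel at most once; a minimal sketch (note that fully concurrent callers would still need external serialization or `sync.Once`):

```go
package collector

// safeClose closes stopCh only if it is not already closed, avoiding the
// panic a second close would cause.
func safeClose(stopCh chan struct{}) {
	select {
	case <-stopCh:
		// Already closed: receiving succeeds immediately on a closed channel.
	default:
		close(stopCh)
	}
}
```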
358-381: LGTM!
The `IsAvailable` method correctly checks cluster accessibility for the ScheduledSparkApplication CRD using a minimal list request.
internal/collector/spark_application_collector.go (5)
22-36: LGTM!
The collector struct follows the established pattern with proper field organization and synchronization.
38-81: LGTM!
Constructor is correctly implemented with proper initialization of all fields and exclusion map setup.
83-242: LGTM!
The `Start` method correctly sets up the dynamic informer for `sparkapplications` resources with proper event handling, cache synchronization, and lifecycle management.
309-346: LGTM!
The `Stop` method correctly handles graceful shutdown with proper channel cleanup and panic prevention.
358-381: LGTM!
The `IsAvailable` method correctly validates CRD accessibility with minimal overhead.
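The minimal-list availability probe that both collectors' `IsAvailable` methods perform can be sketched as follows; the function name and the `Limit: 1` choice are assumptions consistent with "a minimal list request":

```go
package collector

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// isCRDAvailable probes a CRD with a single-item list; any error (for
// example, the CRD not being installed) marks the collector unavailable.
func isCRDAvailable(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource) bool {
	_, err := client.Resource(gvr).List(ctx, metav1.ListOptions{Limit: 1})
	return err == nil
}
```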
```go
case SparkApplication:
	return gen.ResourceType_RESOURCE_TYPE_SPARK_APPLICATION
// case ScheduledSparkApplication:
// 	return gen.ResourceType_RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION
```
🧩 Analysis chain
🏁 Script executed:
# Check if RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION exists in generated proto files
rg -n "RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION" --type go

Repository: devzero-inc/zxporter
Length of output: 170

🏁 Script executed:
# Also check for any SCHEDULED_SPARK_APPLICATION references
rg -n "SCHEDULED_SPARK_APPLICATION" --type go

Repository: devzero-inc/zxporter
Length of output: 170

🏁 Script executed:
# Look at the interface.go file around lines 310-313
sed -n '300,320p' internal/collector/interface.go

Repository: devzero-inc/zxporter
Length of output: 818

🏁 Script executed:
# Check for ScheduledSparkApplication usage in the codebase
rg -n "ScheduledSparkApplication" --type go -B 2 -A 2

Repository: devzero-inc/zxporter
Length of output: 19126

🏁 Script executed:
# Check where proto definitions are located and what's in them
find . -name "*.proto" -type f | head -20

Repository: devzero-inc/zxporter
Length of output: 115

🏁 Script executed:
# Look for ResourceType enum in proto files
rg -n "enum ResourceType" --type proto -A 50

Repository: devzero-inc/zxporter
Length of output: 93
🏁 Script executed:
# Search for ResourceType enum in proto files
rg -n "enum ResourceType" --type-add 'proto:*.proto' -t proto -A 50

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# Or just search in the proto files directly
cat ./internal/proto/mpa.proto | head -100

Repository: devzero-inc/zxporter
Length of output: 1714

🏁 Script executed:
# Also check the generated proto file
cat ./proto/api/v1/mpa.proto | head -150

Repository: devzero-inc/zxporter
Length of output: 1714

🏁 Script executed:
# Search for ResourceType in all files (not just proto)
rg -n "ResourceType_RESOURCE_TYPE_SPARK_APPLICATION" --type go -B 5 | head -50

Repository: devzero-inc/zxporter
Length of output: 999

🏁 Script executed:
# Look for where gen.ResourceType is imported from
rg -n "package gen" --type go | head -20

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# Find all generated proto files
find . -name "*.pb.go" | head -10

Repository: devzero-inc/zxporter
Length of output: 353

🏁 Script executed:
# Look at the full ResourceType enum definition in metrics_collector.pb.go
rg -n "type ResourceType" -A 60 gen/api/v1/metrics_collector.pb.go

Repository: devzero-inc/zxporter
Length of output: 3464
🏁 Script executed:
# Also check the metrics_collector.proto file if it exists
find . -name "*metrics_collector.proto*" -type f

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# Search for "SCHEDULED_SPARK" in the generated pb.go file
rg -n "SCHEDULED_SPARK" gen/api/v1/metrics_collector.pb.go

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# See more of the ResourceType enum, specifically around line 176 and beyond
sed -n '98,200p' gen/api/v1/metrics_collector.pb.go

Repository: devzero-inc/zxporter
Length of output: 4800

🏁 Script executed:
# Search for the proto source file
find . -name "*metrics_collector.proto*" -o -name "*metrics.proto*" | grep -v pb.go

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# Search for the source proto file for metrics_collector
find . -path "*/proto/*" -name "*metrics*" -type f

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# Also check if there's a proto source directory
ls -la proto/

Repository: devzero-inc/zxporter
Length of output: 316

🏁 Script executed:
# Look at the proto/api directory structure
find proto/api -type f

Repository: devzero-inc/zxporter
Length of output: 86
🏁 Script executed:
# Check if metrics_collector proto is in internal/proto
ls -la internal/proto/

Repository: devzero-inc/zxporter
Length of output: 236

🏁 Script executed:
# Search for the metrics_collector proto definition in the entire repository
rg -n "SPARK_APPLICATION" . --type plaintext | head -20

Repository: devzero-inc/zxporter
Length of output: 97

🏁 Script executed:
# Search for SPARK_APPLICATION in all files
rg -n "SPARK_APPLICATION" --max-count=50

Repository: devzero-inc/zxporter
Length of output: 789

🏁 Script executed:
# Check if there's a dakr_proto_descriptor.bin that might contain the proto definitions
file proto/dakr_proto_descriptor.bin

Repository: devzero-inc/zxporter
Length of output: 108

🏁 Script executed:
# Look at git to see if there's proto source
git ls-files | grep -i "proto" | grep -v "\.pb\|\.bin"

Repository: devzero-inc/zxporter
Length of output: 111

🏁 Script executed:
# Check if there's a metrics_collector.proto file anywhere
find . -name "*.proto" -type f | xargs grep -l "SPARK_APPLICATION" 2>/dev/null

Repository: devzero-inc/zxporter
Length of output: 48

🏁 Script executed:
# Look for any references to ScheduledSparkApplication in proto files
find . -name "*.proto" -type f -exec grep -l "ScheduledSparkApplication" {} \;

Repository: devzero-inc/zxporter
Length of output: 46
🏁 Script executed:
# Check if the proto descriptor contains the definition
strings proto/dakr_proto_descriptor.bin 2>/dev/null | grep -i "scheduled_spark" | head -10

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# Let's look at the common.pb.go to see if there's a K8SObjectKind for ScheduledSparkApplication
rg -n "K8S_OBJECT_KIND.*SCHEDULED_SPARK" gen/api/v1/common.pb.go

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# Let's check what K8SObjectKind values exist for Spark-related types
rg -n "K8S_OBJECT_KIND.*SPARK" gen/api/v1/common.pb.go

Repository: devzero-inc/zxporter
Length of output: 244

🏁 Script executed:
# Get the full K8SObjectKind enum
sed -n '50,120p' gen/api/v1/common.pb.go

Repository: devzero-inc/zxporter
Length of output: 3389

🏁 Script executed:
# Let's check if there's a way to see what ResourceType value might be reserved or next available
sed -n '98,160p' gen/api/v1/metrics_collector.pb.go | grep "ResourceType_RESOURCE_TYPE" | tail -5

Repository: devzero-inc/zxporter
Length of output: 339

🏁 Script executed:
# Check if there are any comments or TODOs about ScheduledSparkApplication in the Go code
rg -n "TODO|FIXME|TODO.*Scheduled" internal/collector/

Repository: devzero-inc/zxporter
Length of output: 330

🏁 Script executed:
# Verify that ScheduledSparkApplication is indeed used and that collectors are instantiated
rg -n "NewScheduledSparkApplicationCollector\|ScheduledSparkApplication" internal/controller/collectionpolicy_controller.go | head -10

Repository: devzero-inc/zxporter
Length of output: 46

🏁 Script executed:
# Check the test files to see if there are any tests for ScheduledSparkApplication
find . -name "*test*.go" -type f -exec grep -l "ScheduledSparkApplication" {} \;

Repository: devzero-inc/zxporter
Length of output: 46
Add RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION to proto definitions and implement the mapping.
ScheduledSparkApplication is actively used throughout the codebase—there's a full collector implementation, controller integration, and exclusion types—but the corresponding proto constant doesn't exist. Currently, ScheduledSparkApplication resources fall through to the default case and are serialized as RESOURCE_TYPE_UNSPECIFIED, preventing proper resource type identification.
The fix requires two steps:
- Add `RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION` to the `ResourceType` enum in the proto definitions (with an appropriate numeric value, e.g., 52)
- Uncomment the case in `interface.go` lines 312-313 once the proto constant is available
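The proto-side change would look roughly like this; the value 52 is the reviewer's example, and the real next-free number must be confirmed against the actual enum before regenerating the Go bindings:

```proto
enum ResourceType {
  // ... existing values unchanged ...
  RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION = 52; // example value from this review
}
```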
🤖 Prompt for AI Agents
In internal/collector/interface.go around lines 310 to 313,
ScheduledSparkApplication currently falls through to the default and is
serialized as RESOURCE_TYPE_UNSPECIFIED because the proto enum lacks
RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION; add
RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION to the ResourceType enum in the proto
(pick the next appropriate numeric value, e.g., 52) and regenerate the Go proto
bindings, then uncomment and restore the case for ScheduledSparkApplication in
interface.go so it returns
gen.ResourceType_RESOURCE_TYPE_SCHEDULED_SPARK_APPLICATION.
Summary by CodeRabbit
New Features
- Collection support for SparkApplication and ScheduledSparkApplication resources.
- New exclusion types for Kubeflow Notebooks, Volcano Jobs, Spark Applications, and Scheduled Spark Applications.
Chores
- Updated CRD, deepcopy code, and RBAC to permit observing these resources.