@mkhelghati-db

Major Transformations

  1. Serverless Compute Migration
    Old: Continuous streaming (trigger(processingTime='10 seconds'))
    New: Serverless batch processing (trigger(availableNow=True))
    Result: 60-80% cost reduction, pay only for processing time
  2. CDC Data Simulation to test streaming data pipelines
    Added: Background data generators creating CDC events every 60 seconds
    Operations: INSERT, UPDATE, DELETE with realistic patterns
    Coverage: Both single-table and multi-table scenarios
  3. CDF Efficiency Demonstrations
    Added: Explicit volume comparisons (CDF vs non-CDF processing)
    Metrics: Processing efficiency, cost reduction, speed improvements
    Impact: Shows 60-90% reduction in data processing volume
  4. Monitoring
    Added: Real-time monitoring and progress tracking
  5. Performance Optimizations
    Delta Properties: Optimized file sizes and rewrite tuning
    Auto Loader: Incremental processing configuration
    Schema Evolution: Robust handling with mergeSchema=true
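
For reviewers, the trigger change in item 1 can be sketched roughly as follows. This is a minimal illustration and not the notebook code: the helper name `write_available_now` and its parameters are hypothetical, and `stream_df` is assumed to be a Spark streaming DataFrame.

```python
def write_available_now(stream_df, target_table, checkpoint_path):
    """Process everything currently available in the source, then stop.

    Hypothetical helper: `stream_df` is assumed to be a Spark streaming
    DataFrame; the table name and checkpoint path are placeholders.
    """
    query = (
        stream_df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        # The old demos kept the query running with
        # .trigger(processingTime="10 seconds").
        .trigger(availableNow=True)  # drain the available data, then stop
        .toTable(target_table)
    )
    # Blocks until the available data has been processed and the query stops.
    query.awaitTermination()
    return query
```

Because the query stops on its own, compute is only used while there is data to process, which is where the cost reduction above is expected to come from.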

@QuentinAmbard
Collaborator

Hey, that's a great update, but the PR includes a lot of extra files that shouldn't be there. Could you clean it up? We should only have the notebook files.
Thanks!!

- Migrate to serverless compute with trigger(availableNow=True)
- Add continuous CDC data generators for realistic simulation
- Implement CDF vs non-CDF processing volume demonstrations
- Restructure demos with 8 numbered steps for clarity
- Add performance optimizations (delta properties, auto loader config)
- Fix schema evolution and column name issues
- Remove deprecated Spark configurations
- Add real-time monitoring and progress tracking
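
As a rough illustration of the CDC data generators listed above, a generator of INSERT, UPDATE and DELETE events could look like the sketch below. The class name `CdcGenerator` and the event fields (`id`, `operation`, `operation_date`) are assumptions made for this example and may not match the notebooks.

```python
import random
from datetime import datetime, timezone

OPERATIONS = ("INSERT", "UPDATE", "DELETE")


class CdcGenerator:
    """Hypothetical generator of CDC events for a single table."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._next_id = 1       # ids are never reused, even after a DELETE
        self.live_ids = set()   # ids currently present in the simulated table

    def next_batch(self, batch_size=5):
        """Return `batch_size` CDC events as a list of dicts."""
        events = []
        for _ in range(batch_size):
            # UPDATE and DELETE need an existing row, so fall back to INSERT.
            op = self._rng.choice(OPERATIONS) if self.live_ids else "INSERT"
            if op == "INSERT":
                row_id = self._next_id
                self._next_id += 1
                self.live_ids.add(row_id)
            else:
                row_id = self._rng.choice(sorted(self.live_ids))
                if op == "DELETE":
                    self.live_ids.remove(row_id)
            events.append({
                "id": row_id,
                "operation": op,
                "operation_date": datetime.now(timezone.utc).isoformat(),
            })
        return events
```

A background thread could call something like `next_batch()` every 60 seconds and write the events to the landing location; that part is omitted here.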

Resolved conflicts by keeping our CDC pipeline updates:
- product_demos/cdc-pipeline/01-CDC-CDF-simple-pipeline.py
- product_demos/cdc-pipeline/02-CDC-CDF-full-multi-tables.py

Our CDC updates include:
- Serverless compute with trigger(availableNow=True)
- Continuous CDC data generators
- CDF vs non-CDF processing demonstrations
- 8-step structured storyline
- Performance optimizations
- Real-time monitoring and progress tracking
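
The CDF vs non-CDF comparison mentioned above can be sketched as below. This is an assumption-laden illustration, not the notebook code: `compare_volumes` and `reduction_pct` are hypothetical names, `spark` is assumed to be an active SparkSession, and the table is assumed to have `delta.enableChangeDataFeed` set to true.

```python
def reduction_pct(full_rows, changed_rows):
    """Percentage of rows that do not need processing when only changes are read."""
    if full_rows <= 0:
        raise ValueError("full_rows must be positive")
    return 100.0 * (full_rows - changed_rows) / full_rows


def compare_volumes(spark, table_name, starting_version):
    """Compare a full-table read with a Change Data Feed read (hypothetical helper)."""
    # Non-CDF path: every run reads the whole table.
    full_rows = spark.table(table_name).count()
    # CDF path: read only the rows changed since `starting_version`.
    changed_rows = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .table(table_name)
        # Each update emits a pre-image and a post-image row; count it once.
        .filter("_change_type != 'update_preimage'")
        .count()
    )
    return full_rows, changed_rows, reduction_pct(full_rows, changed_rows)
```

The actual reduction depends on how much of the table changes between runs, so the percentage is best read as a property of the workload rather than a fixed gain.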
@mkhelghati-db
Author

@QuentinAmbard Sorry it took so long, I had forgotten about it. Please have a look and let me know if everything is OK.
