CC-Store is a high-performance storage system designed for efficiently managing and querying billions of Common Crawl web page records. This document outlines the overall architecture, data flow, storage structure, and key components of the system.
CC-Store is built around several key design principles:
- Domain-centric storage organization for efficient querying
- Multi-level metadata management
- Support for incremental data updates
- Apache Spark optimization
The system is organized into several logical layers:
+---------------------+
|  Application Layer  |
|   +-------------+   |
|   | CCStore API |   |
|   +-------------+   |
+----------+----------+
           |
+----------v----------+
|      Core Layer     |
| +------------------+|
| |StorageBackend    ||
| +------------------+|
| |MetadataManager   ||
| +------------------+|
+----------+----------+
           |
+----------v----------+
|    Storage Layer    |
| +------------------+|
| |ParquetStorage    ||
| +------------------+|
| |MetadataManager   ||
| +------------------+|
+----------+----------+
           |
+----------v----------+
|   Physical Storage  |
| +------------------+|
| |S3/HDFS/Local FS  ||
| +------------------+|
+---------------------+
CCStore API (Application Layer)
- Main entry point for users
- Handles high-level operations like reading and writing documents
- Manages coordination between components
Core Layer
- StorageBackend: Manages storage of document data in Parquet format
- MetadataManager: Tracks domain and file metadata
Storage Layer
- Implements specific storage strategies for different data types
- Manages partitioning and optimization for performance
Physical Storage
- Actual storage backend (S3, HDFS, local filesystem)
- Raw binary data storage
Read flow:
1. The user requests data for a specific domain through CCStoreAPI
2. The system queries MetadataManager to locate the relevant files
3. StorageBackend reads the appropriate Parquet files based on the metadata
4. Results are returned to the user as a Spark DataFrame

Write flow:
1. The user submits data to be written through CCStoreAPI
2. StorageBackend groups the data by domain and date
3. Data is partitioned and written as Parquet files with optimized compression
4. MetadataManager updates the domain and file metadata
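The grouping step in the write flow can be sketched in plain Python. The record fields and the grouping helper below are illustrative assumptions, not CC-Store's actual implementation:

```python
from collections import defaultdict

def group_by_domain_and_date(records):
    """Group raw records into (domain, date) buckets, mirroring the
    grouping StorageBackend performs before writing Parquet files.
    Each record is assumed to be a dict with 'domain' and 'date' keys."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["domain"], record["date"])].append(record)
    return dict(groups)

records = [
    {"domain": "example.com", "date": "20230101", "url": "https://example.com/a"},
    {"domain": "example.com", "date": "20230102", "url": "https://example.com/b"},
    {"domain": "another.com", "date": "20230101", "url": "https://another.com/"},
]
groups = group_by_domain_and_date(records)
# Each (domain, date) group would then be partitioned and written
# as one or more Parquet part files.
```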
CC-Store organizes data in a hierarchical structure optimized for domain-based queries:
storage_path/
├── data/ # Main data storage
│ ├── domain_bucket=000/ # Domain buckets (hash-based sharding)
│ │ ├── domain=example.com/
│ │ │ ├── date=20230101/
│ │ │ │ ├── part-0000.parquet # Partition files with all content
│ │ │ │ ├── part-0001.parquet
│ │ │ │ └── part-0002.parquet
│ │ │ └── date=20230102/
│ │ │ ├── part-0000.parquet
│ │ │ └── part-0001.parquet
│ │ └── domain=another.com/
│ │ └── ...
│ └── domain_bucket=001/
│ └── ...
└── metadata/ # Metadata storage
├── domains/ # Domain metadata
│ └── ...
└── files/ # File metadata
└── ...
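A part-file path in this layout can be derived from the domain, date, and part ID. The bucket function (CRC32 modulo a fixed bucket count) and the bucket count of 1000 below are assumptions for illustration; the document only states that bucketing is hash-based:

```python
import zlib

def partition_path(storage_path, domain, date, part_id, num_buckets=1000):
    """Build the hive-style partition path for a part file.
    The CRC32-mod-N bucket function is an assumed stand-in for
    CC-Store's actual hash-based sharding scheme."""
    bucket = zlib.crc32(domain.encode("utf-8")) % num_buckets
    return (f"{storage_path}/data/domain_bucket={bucket:03d}/"
            f"domain={domain}/date={date}/part-{part_id:04d}.parquet")

path = partition_path("/cc-store", "example.com", "20230101", 0)
```

Because the bucket is derived deterministically from the domain name, all dates and part files for one domain land under a single bucket directory, which keeps domain-scoped scans local.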
The system automatically partitions data based on several factors:
- Domain-based partitioning: Data is first organized by domain name
- Date-based partitioning: Within each domain, data is organized by date
- Size-based partitioning: For each domain and date, data is further partitioned into multiple part files
The size-based partitioning is determined by:
- Target part file size (default: 128MB)
- Minimum records per partition
- Estimated DataFrame size
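One plausible way to combine those three factors into a part count (the exact formula and the minimum-records default are assumptions, not CC-Store's documented logic):

```python
import math

TARGET_PART_SIZE = 128 * 1024 * 1024  # 128 MB default, per the text

def num_partitions(estimated_size_bytes, record_count,
                   min_records_per_partition=1000,
                   target_part_size=TARGET_PART_SIZE):
    """Pick a part-file count so each part is ~target_part_size,
    while never dropping below min_records_per_partition per part."""
    by_size = math.ceil(estimated_size_bytes / target_part_size)
    by_records = record_count // min_records_per_partition
    return max(1, min(by_size, by_records or 1))

# A 1 GiB DataFrame with plenty of records splits into 8 parts:
n = num_partitions(1024 * 1024 * 1024, 5_000_000)
```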
This approach allows for:
- Efficient querying of specific domains and date ranges
- Parallel processing of large datasets
- Optimization of file sizes for storage and processing efficiency
CC-Store employs several techniques to optimize storage:
Advanced Parquet Format Configuration
- Column-level compression settings
- Optimized page and row group sizes
- Dictionary encoding for repeated values
- Bloom filters for key columns like URLs
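These settings map onto standard parquet-hadoop writer properties. The property names below are real Parquet option names, but the specific values are assumptions based on the defaults described in this document, not CC-Store's actual configuration:

```python
# Illustrative Parquet writer settings for the optimizations listed above.
# Values (codec, sizes, bloom-filter column) are assumed for this sketch.
parquet_options = {
    "compression": "zstd",                          # column-level compression
    "parquet.block.size": str(128 * 1024 * 1024),   # row group size ~ part size
    "parquet.page.size": str(1024 * 1024),          # 1 MB data pages
    "parquet.enable.dictionary": "true",            # dictionary-encode repeats
    "parquet.bloom.filter.enabled#url": "true",     # bloom filter on url column
}
```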
Query Optimization
- Partition pruning to read only relevant files
- Column projection to read only required columns
- Predicate pushdown for filtering at storage level
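Partition pruning falls directly out of the directory layout: because domain and date are encoded in the path, a query planner can discard files without opening them. A minimal pure-Python sketch of that effect (the helper and file list are illustrative, not CC-Store code):

```python
def prune_files(file_paths, domain, start_date, end_date):
    """Keep only part files whose hive-style partition values match the
    query predicates -- the effect partition pruning has when Spark plans
    a query against this layout. Dates compare correctly as strings
    because they use the YYYYMMDD format."""
    selected = []
    for path in file_paths:
        parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if (parts.get("domain") == domain
                and start_date <= parts.get("date", "") <= end_date):
            selected.append(path)
    return selected

files = [
    "data/domain_bucket=000/domain=example.com/date=20230101/part-0000.parquet",
    "data/domain_bucket=000/domain=example.com/date=20230102/part-0000.parquet",
    "data/domain_bucket=001/domain=another.com/date=20230101/part-0000.parquet",
]
hits = prune_files(files, "example.com", "20230101", "20230101")
```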
CC-Store maintains several types of metadata to track the state of the system:
Domain metadata stores information about each domain in the system:
- Domain ID
- Bucket ID
- Total files, records, and size
- Date range (min/max dates)
- Other statistics
File metadata tracks information about each Parquet file:
- Domain ID
- Date
- Part ID
- File path and size
- Record count
- Timestamp range
- Checksum and other file attributes
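The two metadata records above can be modeled as simple dataclasses. The field names and types here are assumptions derived from the bullet lists, not CC-Store's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DomainMetadata:
    """Per-domain statistics (field names are illustrative)."""
    domain_id: str
    bucket_id: int
    total_files: int
    total_records: int
    total_size_bytes: int
    min_date: str  # YYYYMMDD
    max_date: str  # YYYYMMDD

@dataclass
class FileMetadata:
    """Per-file record for each Parquet part file (fields illustrative)."""
    domain_id: str
    date: str
    part_id: int
    file_path: str
    file_size_bytes: int
    record_count: int
    min_timestamp: int
    max_timestamp: int
    checksum: str

meta = FileMetadata(
    domain_id="example.com", date="20230101", part_id=0,
    file_path="data/domain_bucket=000/domain=example.com/date=20230101/part-0000.parquet",
    file_size_bytes=1024, record_count=10,
    min_timestamp=0, max_timestamp=0, checksum="abc123",
)
```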
Query performance
- Column-based storage allows selective reading of required fields
- Predefined partition structure enables efficient date range and domain queries
- Bloom filters accelerate point lookups

Write and storage performance
- Size-based partitioning automatically scales with data volume
- Domain and date isolation prevents large batch writes from impacting unrelated domains
- Advanced compression reduces storage costs with minimal CPU overhead
- Multi-level architecture enables parallel writing for different domains
The CC-Store architecture is designed for high performance and scalability when working with large web crawl datasets. By organizing data around domains and dates, and employing advanced storage optimization techniques, it provides efficient storage and fast query capabilities for Common Crawl data.