CC-Store is a high-performance storage system designed for efficiently managing and querying billions of Common Crawl web page records. This document outlines the overall architecture, data flow, storage structure, and key components of the system.
CC-Store is built around several key design principles:
- Domain-centric storage organization for efficient querying
- Multi-level metadata management
- Support for incremental data updates
- Apache Spark optimization
The system is organized into several logical layers:
+---------------------+
|  Application Layer  |
|   +-------------+   |
|   | CCStore API |   |
|   +-------------+   |
+----------+----------+
           |
+----------v----------+
|      Core Layer     |
| +------------------+|
| |StorageBackend    ||
| +------------------+|
| |MetadataManager   ||
| +------------------+|
+----------+----------+
           |
+----------v----------+
|    Storage Layer    |
| +------------------+|
| |ParquetStorage    ||
| +------------------+|
| |MetadataManager   ||
| +------------------+|
+----------+----------+
           |
+----------v----------+
|   Physical Storage  |
| +------------------+|
| |S3/HDFS/Local FS  ||
| +------------------+|
+---------------------+
CCStore API (Application Layer)
- Main entry point for users
- Handles high-level operations like reading and writing documents
- Manages coordination between components
Core Layer
- StorageBackend: Manages storage of document data in Parquet format
- MetadataManager: Tracks domain and file metadata
Storage Layer
- Implements specific storage strategies for different data types
- Manages partitioning and optimization for performance
Physical Storage
- Actual storage backend (S3, HDFS, local filesystem)
- Raw binary data storage
Read flow:
1. The user requests data for a specific domain through CCStoreAPI
2. The system queries MetadataManager to locate the relevant files
3. StorageBackend reads the appropriate Parquet files based on the metadata
4. Results are returned to the user as a Spark DataFrame

Write flow:
1. The user submits data to be written through CCStoreAPI
2. StorageBackend groups the data by domain and date
3. Data is partitioned and written as Parquet files with optimized compression
4. MetadataManager updates the domain and file metadata
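The grouping step in the write flow can be sketched in plain Python. The record fields and the grouping helper below are illustrative assumptions, not CC-Store's actual implementation:

```python
from collections import defaultdict

def group_by_domain_and_date(records):
    """Group raw records into (domain, date) buckets, mirroring the
    grouping StorageBackend performs before writing Parquet files.
    Each record is assumed to be a dict with 'domain' and 'date' keys."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["domain"], record["date"])].append(record)
    return dict(groups)

records = [
    {"domain": "example.com", "date": "20230101", "url": "https://example.com/a"},
    {"domain": "example.com", "date": "20230102", "url": "https://example.com/b"},
    {"domain": "another.com", "date": "20230101", "url": "https://another.com/"},
]
groups = group_by_domain_and_date(records)
# Each (domain, date) group would then be partitioned and written
# as one or more Parquet part files.
```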
CC-Store organizes data in a hierarchical structure optimized for domain-based queries:
storage_path/
├── data/ # Main data storage
│ ├── domain_bucket=000/ # Domain buckets (hash-based sharding)
│ │ ├── domain=example.com/
│ │ │ ├── date=20230101/
│ │ │ │ ├── part-0000.parquet # Partition files with all content
│ │ │ │ ├── part-0001.parquet
│ │ │ │ └── part-0002.parquet
│ │ │ └── date=20230102/
│ │ │ ├── part-0000.parquet
│ │ │ └── part-0001.parquet
│ │ └── domain=another.com/
│ │ └── ...
│ └── domain_bucket=001/
│ └── ...
└── metadata/ # Metadata storage
├── domains/ # Domain metadata
│ └── ...
└── files/ # File metadata
└── ...
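A part-file path in this layout can be derived from the domain, date, and part ID. The bucket function (CRC32 modulo a fixed bucket count) and the bucket count of 1000 below are assumptions for illustration; the document only states that bucketing is hash-based:

```python
import zlib

def partition_path(storage_path, domain, date, part_id, num_buckets=1000):
    """Build the hive-style partition path for a part file.
    The CRC32-mod-N bucket function is an assumed stand-in for
    CC-Store's actual hash-based sharding scheme."""
    bucket = zlib.crc32(domain.encode("utf-8")) % num_buckets
    return (f"{storage_path}/data/domain_bucket={bucket:03d}/"
            f"domain={domain}/date={date}/part-{part_id:04d}.parquet")

path = partition_path("/cc-store", "example.com", "20230101", 0)
```

Because the bucket is derived deterministically from the domain name, all dates and part files for one domain land under a single bucket directory, which keeps domain-scoped scans local.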
The system automatically partitions data based on several factors:
- Domain-based partitioning: Data is first organized by domain name
- Date-based partitioning: Within each domain, data is organized by date
- Size-based partitioning: For each domain and date, data is further partitioned into multiple part files
The size-based partitioning is determined by:
- Target part file size (default: 128MB)
- Minimum records per partition
- Estimated DataFrame size
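One plausible way to combine those three factors into a part count (the exact formula and the minimum-records default are assumptions, not CC-Store's documented logic):

```python
import math

TARGET_PART_SIZE = 128 * 1024 * 1024  # 128 MB default, per the text

def num_partitions(estimated_size_bytes, record_count,
                   min_records_per_partition=1000,
                   target_part_size=TARGET_PART_SIZE):
    """Pick a part-file count so each part is ~target_part_size,
    while never dropping below min_records_per_partition per part."""
    by_size = math.ceil(estimated_size_bytes / target_part_size)
    by_records = record_count // min_records_per_partition
    return max(1, min(by_size, by_records or 1))

# A 1 GiB DataFrame with plenty of records splits into 8 parts:
n = num_partitions(1024 * 1024 * 1024, 5_000_000)
```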
This approach allows for:
- Efficient querying of specific domains and date ranges
- Parallel processing of large datasets
- Optimization of file sizes for storage and processing efficiency
CC-Store employs several techniques to optimize storage:
Advanced Parquet Format Configuration
- Column-level compression settings
- Optimized page and row group sizes
- Dictionary encoding for repeated values
- Bloom filters for key columns like URLs
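These settings map onto standard parquet-hadoop writer properties. The property names below are real Parquet option names, but the specific values are assumptions based on the defaults described in this document, not CC-Store's actual configuration:

```python
# Illustrative Parquet writer settings for the optimizations listed above.
# Values (codec, sizes, bloom-filter column) are assumed for this sketch.
parquet_options = {
    "compression": "zstd",                          # column-level compression
    "parquet.block.size": str(128 * 1024 * 1024),   # row group size ~ part size
    "parquet.page.size": str(1024 * 1024),          # 1 MB data pages
    "parquet.enable.dictionary": "true",            # dictionary-encode repeats
    "parquet.bloom.filter.enabled#url": "true",     # bloom filter on url column
}
```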
Query Optimization
- Partition pruning to read only relevant files
- Column projection to read only required columns
- Predicate pushdown for filtering at storage level
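Partition pruning falls directly out of the directory layout: because domain and date are encoded in the path, a query planner can discard files without opening them. A minimal pure-Python sketch of that effect (the helper and file list are illustrative, not CC-Store code):

```python
def prune_files(file_paths, domain, start_date, end_date):
    """Keep only part files whose hive-style partition values match the
    query predicates -- the effect partition pruning has when Spark plans
    a query against this layout. Dates compare correctly as strings
    because they use the YYYYMMDD format."""
    selected = []
    for path in file_paths:
        parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if (parts.get("domain") == domain
                and start_date <= parts.get("date", "") <= end_date):
            selected.append(path)
    return selected

files = [
    "data/domain_bucket=000/domain=example.com/date=20230101/part-0000.parquet",
    "data/domain_bucket=000/domain=example.com/date=20230102/part-0000.parquet",
    "data/domain_bucket=001/domain=another.com/date=20230101/part-0000.parquet",
]
hits = prune_files(files, "example.com", "20230101", "20230101")
```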
CC-Store maintains several types of metadata to track the state of the system:
Domain metadata stores information about each domain in the system:
- Domain ID
- Bucket ID
- Total files, records, and size
- Date range (min/max dates)
- Other statistics
File metadata tracks information about each Parquet file:
- Domain ID
- Date
- Part ID
- File path and size
- Record count
- Timestamp range
- Checksum and other file attributes
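The two metadata records above can be modeled as simple dataclasses. The field names and types here are assumptions derived from the bullet lists, not CC-Store's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DomainMetadata:
    """Per-domain statistics (field names are illustrative)."""
    domain_id: str
    bucket_id: int
    total_files: int
    total_records: int
    total_size_bytes: int
    min_date: str  # YYYYMMDD
    max_date: str  # YYYYMMDD

@dataclass
class FileMetadata:
    """Per-file record for each Parquet part file (fields illustrative)."""
    domain_id: str
    date: str
    part_id: int
    file_path: str
    file_size_bytes: int
    record_count: int
    min_timestamp: int
    max_timestamp: int
    checksum: str

meta = FileMetadata(
    domain_id="example.com", date="20230101", part_id=0,
    file_path="data/domain_bucket=000/domain=example.com/date=20230101/part-0000.parquet",
    file_size_bytes=1024, record_count=10,
    min_timestamp=0, max_timestamp=0, checksum="abc123",
)
```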
Query performance
- Column-based storage allows selective reading of required fields
- Predefined partition structure enables efficient date range and domain queries
- Bloom filters accelerate point lookups

Write and storage performance
- Size-based partitioning automatically scales with data volume
- Domain and date isolation prevents large batch writes from impacting unrelated domains
- Advanced compression reduces storage costs with minimal CPU overhead
- Multi-level architecture enables parallel writing for different domains
The CC-Store architecture is designed for high performance and scalability when working with large web crawl datasets. By organizing data around domains and dates, and employing advanced storage optimization techniques, it provides efficient storage and fast query capabilities for Common Crawl data.