From 75cce6706cc2fcfff65a4eb308d7bfc3b0ee39b3 Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Mon, 16 Mar 2026 13:08:48 -0700 Subject: [PATCH 01/19] First Cassandra with QAT and zlib-accel --- software/cassandra/QAT/README.md | 599 +++++++++++++++++++++++++++++++ 1 file changed, 599 insertions(+) create mode 100644 software/cassandra/QAT/README.md diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md new file mode 100644 index 0000000..2b08829 --- /dev/null +++ b/software/cassandra/QAT/README.md @@ -0,0 +1,599 @@ +This workload tuning guide describes the best known practices to optimize performance on Intel Xeon CPUs +when running Apache Cassandra. Default configurations may vary across hardware vendors, thus this guide +helps provide a set of recommended settings for getting the best performance throughput/latency. + +# This document is organized with the following topics: + +- [Document Nomenclature](#Document-Nomenclature) +- [Hardware Configuration Recommendations](#Hardware-Configuration-Recommendations)  +- [BIOS Setting Recommendations](#BIOS-Configuration-Recommendations) +- [DataStax’s Kernel/Storage/Network Settings](DataStax’s-Kernel/Storage/Network-Settings) +- [Cassandra Settings](#Cassandra-settings)  +- [Cassandra-Stress Testing](#Cassandra-Stress-Testing) +- [Cassandra-Stress Performance Comparisons](#Cassandra-Stress-Performance-Comparisons) +- [Example System Startup Script](#Example-System-Startup-Script)  +- [FAQ](#FAQ) + +# Document Nomenclature + +This document uses the following distinctions: + +- **Client**: Applies only to client systems running the load generator like `cassandra-stress`, NoSQLBench, or other benchmarks +- **Server**: Applies only to server systems running the Cassandra instances +- **Cloud**: Where applicable, notes differences between bare-metal and cloud instances +- **Cass3**: Applies only to Cassandra version 3.x +- **Cass4**: Applies only to Cassandra version 4.x +- **Cass5**: 
Applies only to Cassandra version 5.x +- **CassStress**: Applies only to the `cassandra-stress` benchmark (not required if using other benchmarks) +- **PerfTip X%**: Estimated performance throughput improvement expected with this change + +# Hardware Configuration Recommendations + +## Typical Cassandra Configuration: + +| Client system | < ---- | Network | ---- > | Server System |<---------->| Database | +|-----------------|--------|-----------------------|--------|----------------------|------------|------------------------| +|(Load generator) | | | | Cassandra Instance(s)| | Data Location (Storage)| + + + +The Apache Cassandra Database or Cassandra instance(s) run on the server system. +There are minimum hardware requirements for a Cassandra instance. +The details can be found [here](https://cassandra.apache.org/doc/4.0/cassandra/operating/hardware.html) + +In summary, the minimum CPU/DRAM hardware required for a small production Cassandra instance is 8 logical CPU and at least 32GB of DRAM. + +For a larger Cassandra instance, we have found there is an upper limit where Cassandra performance does not scale linearly any more by adding more CPU/Memory resources. +This limit is approximately 48 logical CPU cores per Cassandra instance. +To avoid this inefficiency, customers typically run multiple Cassandra instances on larger hardware systems to provide the best Cassandra performance on these larger systems. + +Below are some typical hardware sizing recommendations. +Note, your specific schema, request types, replication and other performance requirements may differ and may need to alter these recommendations. 
+ +**Table 1: Summary of Hardware Resources for One Cassandra Instance** + +| System Size | Number Logical CPUs | Memory | Storage | Networking | +|-------------|----------------------|---------|------------------------------|------------| +| Small | 8 | 32 GB | 1 SATA/SAS SSD | 10 Gbit | +| Medium | 16 | 64 GB | 2 SATA/SAS SSD or 1 NVME | 10 Gbit | +| Large | 32 | 128 GB | 1 NVME | 10-25 Gbit | +| X-Large | 48 | 192 GB | 1-2 NVME | 25 Gbit | + + +# BIOS Configuration Recommendations + +Table 2 describes the BIOS options that impact Cassandra performance. + +**Table 2: Summary of BIOS Options for Optimizing Cassandra** + +| Parameter Name | Typical BIOS Default | BIOS Setting Recommended | Description | PerfTip | +|------------------------|----------------------|--------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|--------------------------------| +| Hyperthreading/SMT | Enabled | Enabled | Enabling hyperthreading or simultaneous multithreading allows for two hardware threads per core on supported Xeon CPUs. Note, Xeons E core CPUs do not support this feature. | up to 22% | +| SNC | Disabled or SNC1 | Enabled, SNC 2-4 depending on number of Cassandra instances running | When running more than 1 Cassandra instance, enabling Sub-Numa-Clustering gives you a way to isolate compute/memory resources to improve Cassandra running efficiency. See "NUMA Performance Considerations" in this document for more details. | in combination with numactl changes, up to 15% | +| Latency Optimized Mode | Disabled | Enabled | Some Xeon system BIOS expose this parameter. This setting optimizes for latency vs. power of the memory subsystem, which helps latency-sensitive workloads, like Cassandra. 
| 2–4% | + + +# DataStax’s Kernel/Storage/Network Settings + +[DataStax Link](https://docs.datastax.com/en/dse/6.8/managing/configure/recommended-settings.html) + +- For the Linux OS kernel, we follow the kernel settings and disable kernel features settings that impact +performance specified in the DataStax Link. +- For storage, we follow the "Optimize disk settings" and "Optimize SSDs" recommendations. + If your schema and access requests are small (~1KB), random, and mostly read requests on flash media, changing the default `read_ahead_kb` from 128 to 8 can double throughput. + > !Performance tip: can give you up to 100% speedup + +- For networking, we follow the "Networking TCP settings" in the DataStax link. + +All these settings (kernel, storage, and networking) are captured in the **“Example System Startup Script”** towards the end of this document. + +## Adding Multiple Network IP Addresses to a Network Interface + +Each Cassandra instance requires a unique IP address. +If you plan to support multiple Cassandra instances on one system with one network interface, you can add static network IP addresses to an existing one: + +```bash +# Network Alias (one IP address per instance) +ifconfig eth0:0 up +ifconfig eth0:1 up +... +ifconfig eth0:n up +``` + +# Cassandra settings + +## Cassandra Directories and Files of Interest: + +Server side will have the entire Cassandra directory and files below. If multiple Cassandra instances are +running on the same server, there will be one CASSANDRA_SERVER_HOME# directory for each Cassandra +instance. 
Below are the files of interest that we either run or modify: +``` + +| +|__bin +| |__cassandra (Cassandra startup script) +| |__nodetool (run to get status and compaction info) +| +|__ conf +| |__cassandra.yaml # (tuning parameters for Cass3/4/5) +| |__cassandra_latest.yaml (new Cass5 features) +| |__cassandra-env.sh (JMX_PORT setting for multiple instance) +| |__jvm.options (Cass3 all java settings) +| |__jvm-server.options (Cass4/5 java general settings) +| |__jvm8-server.options (Cass4 java 8 specific settings) +| |__jvm11-server.options (Cass4/5 java 11 specific settings) +| |__jvm17-server.options (Cass5 java 17 specific settings) +| |__jvm21-server.options (Cass5 java 21 specific settings) +| +|__tools +| |__cqlstress-insanity-example.yaml (CassStress schema files) +| | +| |__bin +| |__cassandra-stress (CassStress load generator) +… +``` +## Modification to General settings on the Server Side: +- `JAVA_HOME` must be set before launching Cassandra. These 2 lines are required for the Java library you are using: + +```bash +export JAVA_HOME= + +# For example, with OpenJDK version 17, this would be: +export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 +export PATH=$JAVA_HOME/bin:$PATH +``` +## Modification to the Cassandra YAML Configuration File on Server Side: +Most Cassandra parameters/features and settings are in this file: +/conf/cassandra.yaml. +All Cassandra versions require the following parameter modifications: +- seeds: +- listen_address: +- rpc_address: +- data_file_directories: (example /mnt/nvme1/cass_db) +- commitlog_directory: (example /mnt//nvme1/cass_db/commitlog) +- cdc_raw_directory: (example /mnt/nvme1/cass_db/cdc_raw) +- saved_caches_directory: (example /mnt/nvme1/cass_db/saved_caches) +- concurrent_reads: <#number of threads for reads>, optimal values of this is 3X the number +of logical CPUs allocated for this Cassandra instance. 
+ > !Performance tip: can give you up to 15% speedup +- concurrent_writes: there are negative effects in increasing this value beyond the number of +CPU (virtual CPU number). The root cause comes from [a known contention software issue]( + https://issues.apache.org/jira/browse/CASSANDRA-13896), a workaround to this +problem, use the same value as logical CPU for this instance. + +**Cass5**, new to this Cassandra version, there are two cassandra*.yaml files included cassandra.yaml and +cassandra_latest.yaml. +- The cassandra.yaml file is backward compatible with **Cass4** data format and settings, thus if you +already have a database created with **Cass4** you can use this file. +- The cassandra_latest.yaml file has multiple new features added on **Cass5** like Unified Compaction, +Trie Memtable and BTI SStables. To use all these features, the datasets creation must be done +with these features enabled. To use this yaml file with **Cass5**, the file cassandra_latest.yaml must +be renamed to cassandra.yaml before initiating Cassandra. + +There is a performance benefit by using all the new features in Cass5, over not using these +features. +> !Performance tip: can give you up to 27% speedup + +## Modification of Java Files + +Apply the following to `/conf/jvm*` file changes according to the Java version supported. +We recommend using the latest JDK supported by the Cassandra version you are using, as this can improve throughput with JDK change alone. + +- **Cass3** only supports **JDK version 8** +- **Cass4** supports **JDK versions 8 through 11** +- **Cass5** supports **JDK versions 11 through 17** +- **Future Cassandra version 5.1** is planned to support **JDK version 21** + [Reference](https://issues.apache.org/jira/browse/CASSANDRA-18831) + +Add the following changes: + +**jvm-server.options** + > !Performance tip: can give you 3-5% of speedup +```bash +# Throughput advantage when using 2MB vs. the default 4KB memory pages (less compute overhead). 
+# We also use TransparentHugePage to not have to guess on memory allocation sizes upfront +-XX:+UseLargePages +-XX:+UseTransparentHugePages +``` + + +**jvm-server.options file** + > !Performance tip: can give you up to 10% speedup +```bash +# comment out NUMA +# -XX:+UseNUMA +# In Cass3 and Cass4 CMS was the default garbage collector, +# using G1GC gives a throughput improvement and lower +# latency vs. CMS. Below we comment out the CMS garbage +# collector +### CMS Settings +#-XX:+UseParNewGC +#-XX:+UseConcMarkSweepGC +#-XX:+CMSParallelRemarkEnabled +#-XX:SurvivorRatio=8 +#-XX:MaxTenuringThreshold=1 +#-XX:CMSInitiatingOccupancyFraction=75 +#-XX:+UseCMSInitiatingOccupancyOnly +#-XX:CMSWaitDuration=10000 +#-XX:+CMSParallelInitialMarkEnabled +#-XX:+CMSEdenChunksRecordAlways +#-XX:+CMSClassUnloadingEnabled +# Cass3 and Cass4 require G1GC to be enabled explicitly +# In Cass5, G1GC is the default +-XX:+UseG1GC +# +# This is the Cassandra instance heap size +# NOTE: the total number of CassandraInstances*HeapSize +# should take between 25-50% of total system memory, we have +# not seen any performance benefits when setting this higher than +# 64GB heaps with G1GC +# +# PERFORMANCE NOTE: avoid using heaps between 32GB-38GB, details +# explained here: +# https://blog.codecentric.de/35gb-heap-less-32gb-java-jvm-memory-oddities +# +-Xms31G +-Xmx31G +``` +## Modification to the Cassandra-env file on the Server Side: +No modifications are needed to this file if running only one Cassandra instance on the server system or +running Cassandra with a hypervisor. Modifications are required when running more than one Cassandra +instance on bare metal: each Cassandra instance must have a unique JMX_PORT number, hence the +following file should be modified: `/conf/cassandra-env.sh`: +```bash +JMX_PORT="7199" # 7199 for instance 1, +JMX_PORT="7299" # 7299 for instance 2, etc.
+… +``` +## Performance Option: +As noted in the hardware recommendations, one Cassandra instance scales well up to around 48 logical CPUs, +hence for large systems with more than 48 logical CPUs and/or multiple CPU sockets, multiple Cassandra instances are +recommended to use compute and memory resources efficiently. The example below, Configuration +Diagram 1, shows a dual-socket system with 128 logical CPUs. In this case each Cassandra instance takes 32 CPUs +and its own NVME device. + +![Configuration Diagram 1](images/cassandra-diagram1.jpg) + +## NUMA Performance Considerations: + +> !Performance tip: can give you up to 15% speedup + +Each Cassandra instance above can be pinned to a distinct CPU and/or NUMA memory region +for best performance. For example, in Configuration Diagram 1, if the system supports Sub +NUMA clustering 2, or 4 NUMA nodes total for the system, each Cassandra instance can be +pinned to its own compute and memory with the numactl command. Below are the changes to +the cassandra startup script for each Cassandra instance: +```bash +# /bin/cassandra file, compute/memory bound to NUMA node 0: +… +NUMACTL_ARGS="" #${NUMACTL_ARGS:-"--interleave=all"} +if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null 2>/dev/null +then + NUMACTL="numactl -m 0 -N 0 $NUMACTL_ARGS" +else + NUMACTL="" +fi + +# /bin/cassandra file, compute/memory bound to NUMA node 1: +… +NUMACTL_ARGS="" #${NUMACTL_ARGS:-"--interleave=all"} +if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null 2>/dev/null +then + NUMACTL="numactl -m 1 -N 1 $NUMACTL_ARGS" +else + NUMACTL="" +fi + +# /bin/cassandra file, compute/memory bound to NUMA node 2: +… +NUMACTL_ARGS="" #${NUMACTL_ARGS:-"--interleave=all"} +if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null 2>/dev/null +then + NUMACTL="numactl -m 2 -N 2 $NUMACTL_ARGS" +else + NUMACTL="" +fi + +# /bin/cassandra file, compute/memory bound to NUMA node 3: +… +NUMACTL_ARGS=""
#${NUMACTL_ARGS:-"--interleave=all"} +if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null 2>/dev/null +then + NUMACTL="numactl -m 3 -N 3 $NUMACTL_ARGS" +else + NUMACTL="" +fi +``` +To find the specific NUMA details of your system, run the following command. You may need to change the +number of NUMA nodes in the system BIOS settings. +```bash +numactl -H +``` +**Cloud**: Cloud providers typically do not let you modify NUMA and may only have one node for the entire +system, hence this NUMA section can be skipped for the cloud. + +**Starting a Cassandra Instance on the Server Side**: +```bash +/bin/cassandra -R +``` + +**Stopping the Cassandra Instance “gracefully” on the Server Side**: +```bash +/bin/nodetool flush +/bin/nodetool drain +/bin/nodetool stopdaemon +``` +**Forcefully Stopping the Cassandra Instance on the Server Side**: +```bash +killall -9 +``` + +**Forcefully stop all the Cassandra Instances**: +```bash +killall -9 java +``` +# Cassandra-Stress Testing + +## Modification to Cassandra Schema File on the Client Side (CassStress): +Cassandra comes with multiple example data layout format files, i.e. schema files. The file that best +represents our customer’s data is typically the insanity schema. This file can be found in +```/tools/cqlstress-insanity-example.yaml```. The following modifications are applied to +this file for the best mixed workload performance on Cassandra-Stress. + +The compaction parameter is changed on the schema from default +‘LeveledCompactionStrategy’ to ‘SizeTieredCompactionStrategy’ for best overall performance +on a mixed 80:20 Read:Write workload. Note, the line with the compression definition +does not appear by default in the cqlstress-insanity-example.yaml file; it is here to explicitly show the +settings for each Cassandra version.
+**Cass3**, change the compaction strategy to SizeTieredCompactionStrategy +```bash +… +) WITH compaction = {'class':'SizeTieredCompactionStrategy'} +AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':64} +# defaults on Cass3 +AND comment='A table of many types to test wide rows and collections' +``` +**Cass4**, change compaction strategy to SizeTieredCompactionStrategy, note that the compressor default is +different than **Cass3**, as the default changed to 16KB chunk size. +```bash +… +) WITH compaction = {'class':'SizeTieredCompactionStrategy'} +AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':16} +# default on Cass4 +AND comment='A table of many types to test wide rows and collections' +``` +**Cass5**, change compaction strategy to the new UnifiedCompactionStrategy which adjusts automatically +between Leveled and SizeTiered for best overall performance. +```bash +… +) WITH compaction = { 'class':'UnifiedCompactionStrategy' } +AND compression = { 'class':'LZ4Compressor','chunk_length_in_kb':16} +# default on Cass5 +AND comment='A table of many types to test wide rows and collections' +``` + +## Cassandra-stress initial table creation (CassStress) +Cassandra requires a dataset on the server to serve read and write commands and test for performance, hence +building or copying a dataset is a required step before any performance testing. +To build your own dataset on one Cassandra instance: +```bash +/tools/bin/cassandra-stress \ + user profile=/tools/cqlstress-insanity-example.yaml \ + ops\(insert=1\) no-warmup \ + cl=ONE \ + n= \ + -mode native cql3 \ + -pop seq=1.. \ + -node \ + -rate threads= +``` +**user profile** Designate the schema YAML file to use with cassandra-stress. +**no-warmup** Do not warmup the instance, do a cold start. +**cl=ONE – consistency level**. A write must be written to the commit log and MemTable of at least one replica +node. This is for one node cluster case.
If multiple nodes are used in a cluster for example you can make +multiple copies of the data. +**n=** number of entries in database, typically 670 Bytes per entry +**pop seq=1..** – sequentially distributed entry +**node** - specify ip address for connecting Cassandra node +**rate threads** – # of outstanding write commands on the server, note having too many creates timeouts and +hence missing entries in database, use a value that is ¼ logical CPUs for instance or 32 whichever is less + +Additional tips to have a successful database created. To avoid missing key-value entries while creating +database, do the following: +- Temporarily set the write timeout to 8 seconds:```/bin/nodetool setwriterequesttimeout 8000 ``` +- Temporarily set concurrent compactors to 8 to accelerate the compaction activity: ```/bin/nodetool setconcurrentcompactors 8``` +- Temporarily increase the compaction throughput to an NVME devices to 128 MB/sec: ```/bin/nodetool setcompactionthroughput 128``` + +After the initial table has been created, flush and monitor until all compaction is done before stopping the +Cassandra instance: +- Flush Cassandra memory buffers to disk: ```/bin/nodetool flush``` +- Wait until all compaction jobs are done: ```watch /bin/nodetool compactionstats``` +- Once compaction is done (may take hours for large datasets), we want to safely drain and turn off +Cassandra: +```bash +/bin/nodetool drain +/bin/nodetool stopdaemon +``` +At this point you will want to keep a backup copy of the original database. See Cassandra Test Methodology +below for details. + +## Cassandra-Stress Test Methodology: +Given Cassandra is an immutable database, when overwriting entries on the database, this will add an entry +and not overwrite the original entry until compaction/merge tasks are performed at a later time. These writes +and compaction tasks change the performance characteristics of the database. 
Thus, to have consistent and +reproducible performance results, one must always start with a known state of a database (i.e. use the original +copy) and run the same test load and duration. This minimizes performance variability. Below are best +practices for performance testing Cassandra. +- Initialize system environment (see Example System Startup Script below) +- Before each test run: + - Stop all running Cassandra instances + - Clear cache on the system + - Rebase database on all Cassandra instances (i.e. restart with original copy) + - Start the Cassandra instances on all nodes + - Run nodetool status to make sure all databases are up and running with the correct database +capacity and no background compaction occurring +- During each test run: + - Set the Cassandra load generator to a fixed load (number of clients and/or threads) + - Run tests for a fixed amount of time and/or cycles + - Reach steady state, typically 3 minutes, before collecting telemetry data + - For lower performance throughput variability run at least another 10 minutes in steady state +- After the test runs: + - Save the test results and telemetry data + +**Cassandra-Stress Benchmark parameters for mix 80/20 (80% read, 20% write) on one node (CassStress +Only):** +```bash +/tools/bin/cassandra-stress user \ + profile=/tools/cqlstress-insanity-example.yaml \ + ops\(insert=20,simple1=80\) no-warmup cl=ONE duration=#s -mode native cql3 -pop dist=uniform\(1..\) -node -rate threads= number_of_client_threads +``` +To change the workload type, for example if only read requests are required, you can remove insert and set +simple1=1. If only write requests are required, you can remove simple1 and set insert=1.
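The read/write ratio change described above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not part of cassandra-stress: the function name `ops_arg` is hypothetical, and it only builds the `ops(...)` argument from the `insert` and `simple1` operation names used in the command above.

```bash
# Hypothetical helper (not shipped with Cassandra): build the cassandra-stress
# ops(...) argument from a read weight and a write weight.
ops_arg() {
  local read_weight=$1 write_weight=$2
  if [ "$write_weight" -eq 0 ]; then
    # Read-only workload: drop insert and keep simple1 only
    echo "ops(simple1=1)"
  elif [ "$read_weight" -eq 0 ]; then
    # Write-only workload: drop simple1 and keep insert only
    echo "ops(insert=1)"
  else
    # Mixed workload, e.g. 80 reads to 20 writes
    echo "ops(insert=${write_weight},simple1=${read_weight})"
  fi
}

ops_arg 80 20   # prints ops(insert=20,simple1=80)
```

When the result is passed as a quoted argument, for example `"$(ops_arg 80 20)"`, the parentheses no longer need the backslash escapes used in the command above.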
+ +**Interpreting Cassandra-Stress Results(CassStress Only):** +A good explanation of the Cassandra-Stress output is [here](https://docs.datastax.com/en/dse/5.1/tooling/cassandra-stress-output.html) +If multiple instances are tested together, you want to add the throughputs and average the latencies. + +**Optional Tools for Debugging Performance issues:** +- [PerfSpect](https://github.com/intel/PerfSpect) captures configuration details from the system +- [PAT](https://github.com/intel-hadoop/PAT) captures CPU/Memory/Disk/Network performance details using Linux’s performance monitor +tools +- [Async-profiler](https://github.com/async-profiler/async-profiler) captures flame graphs of where the CPU cycles are being spent + +# Cassandra-Stress Performance Comparisons +**Sample Performance Comparisons on Xeons:** +![Performance Comparison](images/SamplePerformanceOnCassandra.jpg) + +## Details +Testing Date: Performance results are based on testing by Intel as of the date specified for each configuration below (between 2024-12-17 and 2025-01-22) and may not reflect all publicly available security updates. + +Cassandra on EMR 64c (Intel Xeon 8592+): 1-node, 2x INTEL(R) XEON(R) PLATINUM 8592+, 64 cores, 350W TDP, HT On, Turbo On, NUMA 4, Total Memory 1024GB (16x64GB DDR5 5600 MT/s [5600 MT/s]), BIOS 2.3, microcode 0x21000283, 2x Ethernet Controller X710 for 10GBASE-T, 1x 447.1G MTFDDAV480TDS, 4x 3.5T KIOXIA KCD8XPUG3T84, Ubuntu 24.04.1 LTS, 6.8.0-49-generic. 
Test by Intel as of Fri Jan 24 09:10:25 PM UTC 2025, Apache Cassandra 4.1.5, Cassandra-Stress 4.1.5, openjdk version ""11.0.24"" 2024-07-16, OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1), OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1, mixed mode, sharin + +Cassandra on GNR 64c (Intel Xeon 6767P): 1-node, 2x Intel(R) Xeon(R) 6767P, 64 cores, 350W TDP, HT On, Turbo On, NUMA 4, Total Memory 1024GB (16x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS BHSDCRB1.IPC.3544.P22.2411120403, microcode 0x1000341, 1x I210 Gigabit Network Connection, 2x Ethernet Controller X710 for 10GBASE-T, 8x 3.5T KIOXIA KCD8XPUG3T84, 1x 894.3G Micron_7450_MTFDKBG960TFR, Ubuntu 24.04.1 LTS, 6.8.0-49-generic. Test by Intel as of Wed Jan 22 09:47:19 PM UTC 2025, Apache Cassandra 4.1.5, Cassandra-Stress 4.1.5, openjdk version ""11.0.24"" 2024-07-16, OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1), OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1, mixed mode, sharin + +Cassandra on GNR 96c (Intel Xeon 6952P): 1-node, 2x Intel(R) Xeon(R) 6952P, 96 cores, 400W TDP, HT On, Turbo On, NUMA 6, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS BHSDCRB1.IPC.3544.P15.2410232346, microcode 0x1000341, 1x I210 Gigabit Network Connection, 2x Ethernet Controller 10-Gigabit X540-AT2, 1x 894.3G SAMSUNG MZ1L2960HCJR-00A07, 8x 3.5T KIOXIA KCD8XPUG3T84, Ubuntu 24.04.1 LTS, 6.8.0-49-generic. 
Test by Intel as of 01/14/25, Apache Cassandra 4.1.5, Cassandra-Stress 4.1.5, openjdk version ""11.0.24"" 2024-07-16, OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1), OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1, mixed mode, sharin + +Cassandra on GNR 128c (Intel Xeon 6980P): 1-node, 2x Intel(R) Xeon(R) 6980P, 128 cores, 500W TDP, HT On, Turbo On, NUMA 6, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS BHSDCRB1.IPC.3544.P15.2410232346, microcode 0x1000341, 2x Ethernet Controller 10-Gigabit X540-AT2, 1x I210 Gigabit Network Connection, 1x 894.3G SAMSUNG MZ1L2960HCJR-00A07, 8x 3.5T KIOXIA KCD8XPUG3T84, Ubuntu 24.04.1 LTS, 6.8.0-49-generic. Test by Intel as of 12/17/24, Apache Cassandra 4.1.5, Cassandra-Stress 4.1.5, openjdk version ""11.0.24"" 2024-07-16, OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1), OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1, mixed mode, sharin + +Results may vary. 
+ +# Example System Startup Script + +```bash +############################################ +#DataStax recommended kernel settings # +############################################ +ulimit -n 1048576 +ulimit -l unlimited +ulimit -u 32768 +# Disable reclaim mode, disable swap, disable defrag for Transparent +# Hugepages in accordance with DataStax +echo 0 > /proc/sys/vm/zone_reclaim_mode +swapoff --all +echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag +############################## +# Network production settings# +############################## +sysctl -w \ +net.ipv4.tcp_keepalive_time=60 \ +net.ipv4.tcp_keepalive_probes=3 \ +net.ipv4.tcp_keepalive_intvl=10 +sysctl -w \ +net.core.rmem_max=16777216 \ +net.core.wmem_max=16777216 \ +net.core.rmem_default=16777216 \ +net.core.wmem_default=16777216 \ +net.core.optmem_max=40960 \ +net.ipv4.tcp_rmem='4096 87380 16777216' \ +net.ipv4.tcp_wmem='4096 65536 16777216' +############################################################### +# Networking: adding 3 additional static IP addresses on the # +# same network interface for my Cassandra instances # +############################################################### +ifconfig eno1:1 134.134.101.218 up +ifconfig eno1:2 134.134.101.219 up +ifconfig eno1:3 134.134.101.220 up +################################################################ +#setting the system to performance mode for best possible perf # +################################################################ +for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +do + [ -f "$CPUFREQ" ] || continue + echo -n performance > "$CPUFREQ" +done +for CPUFREQ in /sys/devices/system/cpu/cpu*/power/energy_perf_bias +do + [ -f "$CPUFREQ" ] || continue + echo -n performance > "$CPUFREQ" +done +################################################################## +# Disk Optimizations for storage devices in Database # +# - changing scheduler to none # +# - rotational to zero # +# - changing the read ahead buffer #
+################################################################## +touch /var/lock/subsys/local +echo none > /sys/block/nvme1n1/queue/scheduler +echo none > /sys/block/nvme2n1/queue/scheduler +echo none > /sys/block/nvme3n1/queue/scheduler +echo none > /sys/block/nvme4n1/queue/scheduler +echo 0 > /sys/class/block/nvme1n1/queue/rotational +echo 0 > /sys/class/block/nvme2n1/queue/rotational +echo 0 > /sys/class/block/nvme3n1/queue/rotational +echo 0 > /sys/class/block/nvme4n1/queue/rotational +###################################################################### +# Note this change alone will double Cassandra throughput # +# as the Linux default is 128 read_ahead_kb, this can bottleneck the # +# NVME device bandwidth when you have small random requests, like # +# those on cassandra-stress # +###################################################################### +echo 8 > /sys/class/block/nvme1n1/queue/read_ahead_kb +echo 8 > /sys/class/block/nvme2n1/queue/read_ahead_kb +echo 8 > /sys/class/block/nvme3n1/queue/read_ahead_kb +echo 8 > /sys/class/block/nvme4n1/queue/read_ahead_kb + +``` +# FAQ + +### Where can I download Cassandra? +As of mid-2025, the latest official release version is 5.0.4. Download binaries and source code from [here](http://archive.apache.org/dist/cassandra/ ) +[GitHub](https://github.com/apache/cassandra) is also a good resource, but you will need to build from source. + +### How often do I need to change/update Cassandra? +Stick with the latest that works for you and/or your customers. Recommend changing to a new major release when available for development as each newer version has better performance. +Most customers wait 1 year before using a new major version in production. + +### What should I use as heap size? +We are following DataStax (key contributor to Cassandra) recommendations. Best performance has been seen with total system heap of 25–50% DRAM. + +### What are storage requirements? 
+In general, the faster the storage, the better the throughput/latency. +The storage size needs to be appropriate for your datasets. +In our setup, we are using NVME devices for the dataset. +Cassandra is IO intensive and a Cassandra instance will be IO bound when multiple Cassandra instances are using the same NVME device. + +### Do I need to rebuild the dataset when I change configuration? +No. After creating the original working dataset, save a copy as reference. +Note: major Cassandra version datasets are not compatible with each other. +For example, a dataset created in Cass3 is not compatible with Cass4 or Cass5. +Cass5 dataset is backward compatible with Cass4 dataset if you use the default `cassandra.yaml` file. + +### Can I have clients and servers on the same machine? +Yes. Your data packages will not be transferred over the network—just across the memory bus between the CPUs. + +### What should my database size be? +Our customers typically run servers where the database storage capacity is at least 2 times larger than the server DRAM. +This can be difficult to simulate in efficient performance testing, as building/compacting/moving/copying large amounts of data takes lots of time and hardware resources. +As a rule of thumb, we typically have larger total database capacity for all instances in the system to be larger than the system DRAM. +This ensures we exercise both DRAM and storage. + +### How big will my database be for a given amount of entries? +Using CassStress and the recommended `cqlstress-insanity-example.yaml` file to create your dataset will result in files of approximately 670 bytes per entry on compressed partition on disk. +For example: 600 million partitions will create ~400GB compressed dataset on disk. 
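The sizing rule of thumb in the last FAQ answer can be worked through with a short shell calculation. This is a minimal sketch: the function name `estimate_dataset_gb` is hypothetical, and the 670 bytes per entry figure is the one quoted above for the `cqlstress-insanity-example.yaml` schema, so other schemas or compression settings will give different results.

```bash
# Hypothetical helper: estimate the compressed on-disk dataset size in GB
# (decimal, 10^9 bytes) from the number of entries.
# Assumes ~670 bytes per entry, as quoted in the FAQ for the
# cqlstress-insanity-example.yaml schema.
estimate_dataset_gb() {
  local entries=$1
  local bytes_per_entry=${2:-670}
  echo $(( entries * bytes_per_entry / 1000000000 ))
}

estimate_dataset_gb 600000000   # prints 402, i.e. the ~400GB quoted above
```

Comparing this estimate with the server DRAM size gives a quick check that the dataset is large enough to exercise both DRAM and storage.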
From 9cc608a55f7a7278f6eff06797f1368d1bdb4c78 Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Mon, 16 Mar 2026 13:19:39 -0700 Subject: [PATCH 02/19] Updated correct README --- software/cassandra/QAT/README.md | 633 ++++--------------------------- 1 file changed, 74 insertions(+), 559 deletions(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 2b08829..de51fae 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -1,599 +1,114 @@ -This workload tuning guide describes the best known practices to optimize performance on Intel Xeon CPUs -when running Apache Cassandra. Default configurations may vary across hardware vendors, thus this guide -helps provide a set of recommended settings for getting the best performance throughput/latency. +# Cassandra with Intel® QuickAssist Technology (Intel® QAT) Optimization Guide +## Table of Contents -# This document is organized with the following topics: +- [Overview](#overview) +- [QAT Hardware Requirement](#qat-hardware-requirement) +- [QAT Software Requirement](#qat-software-requirement) +- [Cassandra Configuration](#cassandra-configuration) +- [Building and configuring zlib-accel](#building-zlib-accel) +- [Using Cassandra with zlib-accel](#cassandra-with-zlib-accel) +- [Future Enhancements](#future-enhancements) +- [References](#references) -- [Document Nomenclature](#Document-Nomenclature) -- [Hardware Configuration Recommendations](#Hardware-Configuration-Recommendations)  -- [BIOS Setting Recommendations](#BIOS-Configuration-Recommendations) -- [DataStax’s Kernel/Storage/Network Settings](DataStax’s-Kernel/Storage/Network-Settings) -- [Cassandra Settings](#Cassandra-settings)  -- [Cassandra-Stress Testing](#Cassandra-Stress-Testing) -- [Cassandra-Stress Performance Comparisons](#Cassandra-Stress-Performance-Comparisons) -- [Example System Startup Script](#Example-System-Startup-Script)  -- [FAQ](#FAQ) +## Overview -# Document Nomenclature 
+Intel® QuickAssist Technology (Intel® QAT) zlib-accel library. -This document uses the following distinctions: +Without sacrificing compression ratios, zlib-accel with QAT offers higher throughput using a workload of NoSQLBench , 18% higher than +zstd, 98% higher than zlib, and 36% higher than zlib-ng. CPU cycles per Cassandra operation is also better; compared to zlib, using QAT with zlib-accel uses only 43% of the CPU cycles per Cassandra operation. -- **Client**: Applies only to client systems running the load generator like `cassandra-stress`, NoSQLBench, or other benchmarks -- **Server**: Applies only to server systems running the Cassandra instances -- **Cloud**: Where applicable, notes differences between bare-metal and cloud instances -- **Cass3**: Applies only to Cassandra version 3.x -- **Cass4**: Applies only to Cassandra version 4.x -- **Cass5**: Applies only to Cassandra version 5.x -- **CassStress**: Applies only to the `cassandra-stress` benchmark (not required if using other benchmarks) -- **PerfTip X%**: Estimated performance throughput improvement expected with this change -# Hardware Configuration Recommendations +## QAT Hardware Requirement -## Typical Cassandra Configuration: -| Client system | < ---- | Network | ---- > | Server System |<---------->| Database | -|-----------------|--------|-----------------------|--------|----------------------|------------|------------------------| -|(Load generator) | | | | Cassandra Instance(s)| | Data Location (Storage)| - - - -The Apache Cassandra Database or Cassandra instance(s) run on the server system. -There are minimum hardware requirements for a Cassandra instance. -The details can be found [here](https://cassandra.apache.org/doc/4.0/cassandra/operating/hardware.html) - -In summary, the minimum CPU/DRAM hardware required for a small production Cassandra instance is 8 logical CPU and at least 32GB of DRAM. 
- -For a larger Cassandra instance, we have found there is an upper limit where Cassandra performance does not scale linearly any more by adding more CPU/Memory resources. -This limit is approximately 48 logical CPU cores per Cassandra instance. -To avoid this inefficiency, customers typically run multiple Cassandra instances on larger hardware systems to provide the best Cassandra performance on these larger systems. - -Below are some typical hardware sizing recommendations. -Note, your specific schema, request types, replication and other performance requirements may differ and may need to alter these recommendations. - -**Table 1: Summary of Hardware Resources for One Cassandra Instance** - -| System Size | Number Logical CPUs | Memory | Storage | Networking | -|-------------|----------------------|---------|------------------------------|------------| -| Small | 8 | 32 GB | 1 SATA/SAS SSD | 10 Gbit | -| Medium | 16 | 64 GB | 2 SATA/SAS SSD or 1 NVME | 10 Gbit | -| Large | 32 | 128 GB | 1 NVME | 10-25 Gbit | -| X-Large | 48 | 192 GB | 1-2 NVME | 25 Gbit | +At least one Intel® QAT engine is required. This can be verified by running the following command: +``` +echo `(lspci -d 8086:4940 && lspci -d 8086:4941 && lspci -d 8086:4942 && lspci -d 8086:4943 && lspci -d 8086:4944 && lspci -d 8086:4945 && lspci -d 8086:4946 && lspci -d 8086:4947) | wc -l` supported devices found. +``` -# BIOS Configuration Recommendations +If a device is found, the output of the command with be: -Table 2 describes the BIOS options that impact Cassandra performance. +``` +8 supported devices found. 
+``` -**Table 2: Summary of BIOS Options for Optimizing Cassandra** +Verify that the QAT firmware is already loaded by using the following command: -| Parameter Name | Typical BIOS Default | BIOS Setting Recommended | Description | PerfTip | -|------------------------|----------------------|--------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|--------------------------------| -| Hyperthreading/SMT | Enabled | Enabled | Enabling hyperthreading or simultaneous multithreading allows for two hardware threads per core on supported Xeon CPUs. Note, Xeons E core CPUs do not support this feature. | up to 22% | -| SNC | Disabled or SNC1 | Enabled, SNC 2-4 depending on number of Cassandra instances running | When running more than 1 Cassandra instance, enabling Sub-Numa-Clustering gives you a way to isolate compute/memory resources to improve Cassandra running efficiency. See "NUMA Performance Considerations" in this document for more details. | in combination with numactl changes, up to 15% | -| Latency Optimized Mode | Disabled | Enabled | Some Xeon system BIOS expose this parameter. This setting optimizes for latency vs. power of the memory subsystem, which helps latency-sensitive workloads, like Cassandra. | 2–4% | +``` +ls /lib/firmware/{qat_4xxx,qat_402xx,qat_420xx}.bin* 2>/dev/null +ls /lib/firmware/{qat_4xxx,qat_402xx,qat_420xx}_mmp.bin* 2>/dev/null +``` +The output of the above command should include 2 firmware files. Note that this can vary depending on the exact QAT device on your hardware. -# DataStax’s Kernel/Storage/Network Settings +``` + /lib/firmware/qat_402xx.bin + /lib/firmware/qat_402xx_mmp.bin +``` -[DataStax Link](https://docs.datastax.com/en/dse/6.8/managing/configure/recommended-settings.html) +If the firmware is not already available. 
It can be downloaded from the Linux kernel repository: +https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat -- For the Linux OS kernel, we follow the kernel settings and disable kernel features settings that impact -performance specified in the DataStax Link. -- For storage, we follow the "Optimize disk settings" and "Optimize SSDs" recommendations. - If your schema and access requests are small (~1KB), random, and mostly read requests on flash media, changing the default `read_ahead_kb` from 128 to 8 can double throughput. - > !Performance tip: can give you up to 100% speedup +``` +cd ~ +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat/qat_4xxx.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat/qat_4xxx_mmp.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat/qat_402xx.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat/qat_402xx_mmp.bin +sudo cp qat_4xxx*.bin qat_402xx*.bin /lib/firmware +rm qat_4xxx*.bin qat_402xx*.bin +``` -- For networking, we follow the "Networking TCP settings" in the DataStax link. +## QAT Software Requirement -All these settings (kernel, storage, and networking) are captured in the **“Example System Startup Script”** towards the end of this document. +QAT drivers, available in-tree in Linux kernel +QATlib library +QATzip library (v1.3.0 and above) -## Adding Multiple Network IP Addresses to a Network Interface +## Cassandra Configuration -Each Cassandra instance requires a unique IP address. -If you plan to support multiple Cassandra instances on one system with one network interface, you can add static network IP addresses to an existing one: +OpenJDK 17 +Cassandra 5.0.6 -```bash -# Network Alias (one IP address per instance) -ifconfig eth0:0 up -ifconfig eth0:1 up -... 
-ifconfig eth0:n up -``` +The Cassandra configuration mentioned in the base optimization-zone article. -# Cassandra settings +https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat -## Cassandra Directories and Files of Interest: +## Building and configuring zlib-accel -Server side will have the entire Cassandra directory and files below. If multiple Cassandra instances are -running on the same server, there will be one CASSANDRA_SERVER_HOME# directory for each Cassandra -instance. Below are the files of interest that we either run or modify: ``` - -| -|__bin -| |__cassandra (Cassandra startup script) -| |__nodetool (run to get status and compaction info) -| -|__ conf -| |__cassandra.yaml # (tuning parameters for Cass3/4/5) -| |__cassandra_latest.yaml (new Cass5 features) -| |__cassandra-env.sh (JMX_PORT setting for multiple instance) -| |__jvm.options (Cass3 all java settings) -| |__jvm-server.options (Cass4/5 java general settings) -| |__jvm8-server.options (Cass4 java 8 specific settings) -| |__jvm11-server.options (Cass4/5 java 11 specific settings) -| |__jvm17-server.options (Cass5 java 17 specific settings) -| |__jvm21-server.options (Cass5 java 21 specific settings) -| -|__tools -| |__cqlstress-insanity-example.yaml (CassStress schema files) -| | -| |__bin -| |__cassandra-stress (CassStress load generator) -… +mkdir build +cd build +cmake -DDEBUG_LOG -DCOVERAGE=OFF -CMAKE_BUILD_TYPE=Release .. +make ``` -## Modification to General settings on the Server Side: -- `JAVA_HOME` must be set before launching Cassandra. 
These 2 lines are required for the Java library you are using: -```bash -export JAVA_HOME= +Edit /etc/zlib-accel.conf and add the following lines -# For example, with OpenJDK version 17, this would be: -export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 -export PATH=$JAVA_HOME/bin:$PATH ``` -## Modification to the Cassandra YAML Configuration File on Server Side: -Most Cassandra parameters/features and settings are in this file: -/conf/cassandra.yaml. -All Cassandra versions require the following parameter modifications: -- seeds: -- listen_address: -- rpc_address: -- data_file_directories: (example /mnt/nvme1/cass_db) -- commitlog_directory: (example /mnt//nvme1/cass_db/commitlog) -- cdc_raw_directory: (example /mnt/nvme1/cass_db/cdc_raw) -- saved_caches_directory: (example /mnt/nvme1/cass_db/saved_caches) -- concurrent_reads: <#number of threads for reads>, optimal values of this is 3X the number -of logical CPUs allocated for this Cassandra instance. - > !Performance tip: can give you up to 15% speedup -- concurrent_writes: there are negative effects in increasing this value beyond the number of -CPU (virtual CPU number). The root cause comes from [a known contention software issue]( - https://issues.apache.org/jira/browse/CASSANDRA-13896), a workaround to this -problem, use the same value as logical CPU for this instance. - -**Cass5**, new to this Cassandra version, there are two cassandra*.yaml files included cassandra.yaml and -cassandra_latest.yaml. -- The cassandra.yaml file is backward compatible with **Cass4** data format and settings, thus if you -already have a database created with **Cass4** you can use this file. -- The cassandra_latest.yaml file has multiple new features added on **Cass5** like Unified Compaction, -Trie Memtable and BTI SStables. To use all these features, the datasets creation must be done -with these features enabled. 
To use this yaml file with **Cass5**, the file cassandra_latest.yaml must -be renamed to cassandra.yaml before initiating Cassandra. - -There is a performance benefit by using all the new features in Cass5, over not using these -features. -> !Performance tip: can give you up to 27% speedup - -## Modification of Java Files - -Apply the following to `/conf/jvm*` file changes according to the Java version supported. -We recommend using the latest JDK supported by the Cassandra version you are using, as this can improve throughput with JDK change alone. - -- **Cass3** only supports **JDK version 8** -- **Cass4** supports **JDK versions 8 through 11** -- **Cass5** supports **JDK versions 11 through 17** -- **Future Cassandra version 5.1** is planned to support **JDK version 21** - [Reference](https://issues.apache.org/jira/browse/CASSANDRA-18831) - -Add the following changes: - -**jvm-server.options** - > !Performance tip: can give you 3-5% of speedup -```bash -# Throughput advantage when using 2MB vs. the default 4KB memory pages (less compute overhead). -# We also use TransparentHugePage to not have to guess on memory allocation sizes upfront --XX:+UseLargePages --XX:+UseTransparentHugePages +use_qat_compress=1 +use_qat_uncompress=1 +use_iaa_compress=0 +use_iaa_uncompress=0 +use_zlib_compress=1 +use_zlib_uncompress=1 ``` +## Using Cassandra with zlib-accel + +Once the zlib-accel library has been built, It is simple to use Cassandra to build the -**jvm-server.options file** - > !Performance tip: can give you up to 10% speedup -```bash -# comment out NUMA -# -XX:+UseNUMA -# In Cass3 and Cass4 CMS was the default garbage collector, -# using G1GC gives up to throughput improvement and lower -# latency vs. 
CMS, below we are commenting out the CMS garbage - -# collector -### CMS Settings -#-XX:+UseParNewGC -#-XX:+UseConcMarkSweepGC -#-XX:+CMSParallelRemarkEnabled -#-XX:SurvivorRatio=8 -#-XX:MaxTenuringThreshold=1 -#-XX:CMSInitiatingOccupancyFraction=75 -#-XX:+UseCMSInitiatingOccupancyOnly -#-XX:CMSWaitDuration=10000 -#-XX:+CMSParallelInitialMarkEnabled -#-XX:+CMSEdenChunksRecordAlways -#-XX:+CMSClassUnloadingEnabled -# Cass3 and Cass4 requires to explicitly enable G1GC -# Cass5 G1GC is the default --XX:+UseG1GC -# -# This is the Cassandra instance heap size -# NOTE: the total number of CassandraInstances*HeapSize -# should take between 25-50% of total system memory, we have -# not seen any performance benefits when setting this higher than -# 64GB heaps with G1GC - # -# PERFORMANCE NOTE: avoid using heaps between 32GB-38GB, details -# explained here: -# https://blog.codecentric.de/35gb-heap-less-32gb-java-jvm-memory-oddities -# --Xms31G --Xmx31G -``` -## Modification to the Cassandra-env file on the Server Side: -No modifications are needed to this file if running only one Cassandra instance on the server system or -running Cassandra with a hypervisor. Modifications are required when running more than one Cassandra -instance on bare metal, each Cassandra instance must have a unique JMX_PORT number, hence the -following file should be modified: `/conf/cassandra-env.sh`: -```bash -JMX_PORT=”7199” # 7199 for instance 1, -JMX_PORT=”7299” # 7299 for instance 2, etc. -… -``` -## Performance Option: -As previously stated in suggest hardware, one Cassandra instance scales well till around 48 logical CPUs, -hence for large system with >48 CPU and/or multiple CPU sockets, multiple Cassandra instances are -recommended to use compute and memory resources efficiently. Example below shows Configuration -Diagram 1 with a dual socket and 128 logical CPUs. In this case each Cassandra Instance will take 32 CPUs -and its own NVME device. 
- -![Configuration Diagram 1](images/cassandra-diagram1.jpg) - -## NUMA Performance Considerations: - -> !Performance tip: can give you up to 15% speedup - -Each Cassandra instance above can be pinned to distinct CPU and/or NUMA memory region -for best performance. For example, in Configuration Diagram 1, if the system supports Sub -NUMA clustering 2, or 4 NUMA nodes total for the system, each Cassandra instance can be -pinned to its own compute and memory with the numactl command. Below are the changes to -the cassandra startup script for each Cassandra instance, highlighted are the changes: -```bash -/bin/cassandra file compute/memory bind NUMA 0: -… -NUMACTL_ARGS=”” #${NUMACTL_ARGS:-"--interleave=all"} -if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null -2>/dev/null -then - NUMACTL="numactl -m 0 -N 0 $NUMACTL_ARGS" -else - NUMACTL="" -fi - -/bin/cassandra file compute/memory bind NUMA 1: -… -NUMACTL_ARGS==”” #${NUMACTL_ARGS:-"--interleave=all"} -if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null -2>/dev/null -then - NUMACTL="numactl -m 1 -N 1 $NUMACTL_ARGS" -else - NUMACTL="" -fi - -/bin/cassandra file compute/memory bind NUMA 2: -… -NUMACTL_ARGS==”” #${NUMACTL_ARGS:-"--interleave=all"} -if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null -2>/dev/null -then - NUMACTL="numactl -m 2 -N 2 $NUMACTL_ARGS" -else - NUMACTL="" -fi - -/bin/cassandra file compute/memory bind NUMA 3: -… -NUMACTL_ARGS==”” #${NUMACTL_ARGS:-"--interleave=all"} -if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null -2>/dev/null -then - NUMACTL="numactl -m 3 -N 3 $NUMACTL_ARGS" -else - NUMACTL="" -fi ``` -To find the specific NUMA details of your system run the following command, you may need to change the -number of nodes supported in the system BIOS settings. 
- ```bash - numactl -H +LD_PRELOAD=/root/zlib-accel/build/libzlib-accel.so bin/cassandra -R ``` -**Cloud**: Cloud providers typically do not let you modify NUMA and may only have one node for the entire -system hence this NUMA section can be skipped for the cloud. -**Starting a Cassandra Instance on the Server Side**: -```bash -/bin/cassandra -R -``` +## Future Enhancements -**Stopping the Cassandra Instance “gracefully” on the Server Side**: -```bash -/bin/nodetool flush -/bin/nodetool drain -/bin/nodetool stopdaemon -``` -**Forcefully Stopping the Cassandra Instance on the Server Side**: -```bash -killall –9 -``` +Support for QAT plugin into Cassandra is in progress and waiting to be upstreamed. This includes support for ZSTD and Deflate. -**Forcefully stop all the Cassandra Instances**: -```bash -killall –9 java -``` -# Cassandra-Stress Testing - -## Modification to Cassandra Schema File on the Client Side (CassStress): -Cassandra comes with multiple example data layout format files, i.e. schema files. The file that best -represents our customer’s data is typically the insanity schema. This file can be found in -```/tools/cql-insanity-example.yaml```. The following modifications are applied to -this file for the best mixed workload performance on Cassandra-Stress. - -The compaction parameter is changed on the schema from default -‘LeveledCompactionStrategy’ to ‘SizeTieredCompactionStrategy’ for best overall performance -on a mix 80:20 Read:Write workload. Note, the line with compression definition -does not appear by default on the cql-insanity-example.yaml file, it is here to explicitly show the -settings for each Cassandra version. 
-**Cass3**, change the compaction strategy to SizeTieredCompactionStrategy -```bash -… -) WITH compaction = {'class':'SizeTieredCompactionStrategy'} -AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':64} -# defaults on Cass3 -AND comment='A table of many types to test wide rows and collections' -``` -**Cass4**, change compaction strategy to SizeTieredCompactionStrategy, note that the compressor default is -different than **Cass3**, as the default changed to 16KB chunk size. -```bash -… -) WITH compaction = {'class':'SizeTieredCompactionStrategy'} -AND compression = {'class':'LZ4Compressor','chunk_length_in_kb':16} -# default on Cass4 -AND comment='A table of many types to test wide rows and collections' -``` -**Cass5**, change compaction strategy to the new unifiedCompactionStrategy which adjusts automatically -between Leveled and SizeTiered for best overall performance. -```bash -… -) WITH compaction = { 'class':'unifiedCompactionStrategy' } -AND compression = { 'class':'LZ4Compressor','chunk_length_in_kb':16} -# default on Cass5 -AND comment='A table of many types to test wide rows and collections' -``` +## References -## Cassandra-stress initial table creation (CassStress) -Cassandra requires a dataset on the server to read and write commands and test for performance, hence -building or copying a dataset is a required step before any performance testing: -To build your own dataset on one Cassandra instance: -```bash -/tools/bin/cassandra-stress \ - user profile=/tools/cql-insanity-example.yaml \ - ops\(insert=1\) no-warmup \ - cl=ONE \ - n= \ - -mode native cql3 \ - –pop seq=1.. \ - -node \ - -rate threads= -``` -**user profile** Designate the schema YAML file to use with cassandra-stress. -**no-warmup** Do not warmup the instance, do a cold start. -**cl=ONE – consistency level**. A write must be written to the commit log and MemTable of at least one replica -node. This is for one node cluster case. 
If multiple nodes are used in a cluster for example you can make -multiple copies of the data. -**n=** number of entries in database, typically 670 Bytes per entry -**pop seq=1..** – sequentially distributed entry -**node** - specify ip address for connecting Cassandra node -**rate threads** – # of outstanding write commands on the server, note having too many creates timeouts and -hence missing entries in database, use a value that is ¼ logical CPUs for instance or 32 whichever is less - -Additional tips to have a successful database created. To avoid missing key-value entries while creating -database, do the following: -- Temporarily set the write timeout to 8 seconds:```/bin/nodetool setwriterequesttimeout 8000 ``` -- Temporarily set concurrent compactors to 8 to accelerate the compaction activity: ```/bin/nodetool setconcurrentcompactors 8``` -- Temporarily increase the compaction throughput to an NVME devices to 128 MB/sec: ```/bin/nodetool setcompactionthroughput 128``` - -After the initial table has been created, flush and monitor until all compaction is done before stopping the -Cassandra instance: -- Flush Cassandra memory buffers to disk: ```/bin/nodetool flush``` -- Wait until all compaction jobs are done: ```watch /bin/nodetool compactionstats``` -- Once compaction is done (may take hours for large datasets), we want to safely drain and turn off -Cassandra: -```bash -/bin/nodetool drain -/bin/nodetool stopdaemon -``` -At this point you will want to keep a backup copy of the original database. See Cassandra Test Methodology -below for details. - -## Cassandra-Stress Test Methodology: -Given Cassandra is an immutable database, when overwriting entries on the database, this will add an entry -and not overwrite the original entry until compaction/merge tasks are performed at a later time. These writes -and compaction tasks change the performance characteristics of the database. 
Thus, to have consistent and -reproducible performance results, one must always start with a known state of a database (i.e. use the original -copy) and run the same test load and duration. This minimizes performance variability. Below are best -practices for performance testing Cassandra. -- Initialize system environment (see Sample Cassandra settings script below) -- Before each test run: - - Stop all running Cassandra instances - - Clear cache on the system - - Rebase database on all Cassandra instances (i.e. restart with original copy) - - Start the Cassandra instances on all nodes - - Run nodetool status to make sure all databases are up and running with the correct database -capacity and no background compaction occurring -- During each test run: - - Set the Cassandra load generator to a fixed load (number of clients and/or threads) - - Run tests for a fixed amount of time and/or cycles - - Reach steady state, typically 3 minutes, before collecting telemetry data - - For lower performance throughput variability run at least another 10 minutes in steady state -- After the test runs: - - Save the test results and telemetry data - -**Cassandra-Stress Benchmark parameters for mix 80/20 (80% read, 20% write) on one node (CassStress -Only):** -```bash -/tools/bin/cassandra-stress user -profile=/tools/cql-insanity-example.yaml -ops\(insert=20,simple1=80\) no-warmup cl=ONE duration=#s -mode native cql3 –pop dist=uniform\(1..\) -node -rate threads= number_of_client_threads -```` -To change the workload type, for example if only read requests are required, you can remove insert and set -simple1=1. If only write requests are required, you can remove simple1 and set insert=1. 
- -**Interpreting Cassandra-Stress Results(CassStress Only):** -A good explanation of the Cassandra-Stress output is [here](https://docs.datastax.com/en/dse/5.1/tooling/cassandra-stress-output.html) -If multiple instances are tested together, you want to add the throughputs and average the latencies. - -**Optional Tools for Debugging Performance issues:** -- [PerfSpect](https://github.com/intel/PerfSpect) captures configuration details from the system -- [PAT](https://github.com/intel-hadoop/PAT) captures CPU/Memory/Disk/Network performance details using Linux’s performance monitor -tools -- [Async-profiler](https://github.com/async-profiler/async-profiler) captures flame graphs of where the CPU cycles are being spent - -# Cassandra-Stress Performance Comparisons -**Sample Performance Comparisons on Xeons:** -![Performance Comparison](images/SamplePerformanceOnCassandra.jpg) - -## Details -Testing Date: Performance results are based on testing by Intel as of the date specified for each configuration below (between 2024-12-17 and 2025-01-22) and may not reflect all publicly available security updates. - -Cassandra on EMR 64c (Intel Xeon 8592+): 1-node, 2x INTEL(R) XEON(R) PLATINUM 8592+, 64 cores, 350W TDP, HT On, Turbo On, NUMA 4, Total Memory 1024GB (16x64GB DDR5 5600 MT/s [5600 MT/s]), BIOS 2.3, microcode 0x21000283, 2x Ethernet Controller X710 for 10GBASE-T, 1x 447.1G MTFDDAV480TDS, 4x 3.5T KIOXIA KCD8XPUG3T84, Ubuntu 24.04.1 LTS, 6.8.0-49-generic. 
Test by Intel as of Fri Jan 24 09:10:25 PM UTC 2025, Apache Cassandra 4.1.5, Cassandra-Stress 4.1.5, openjdk version ""11.0.24"" 2024-07-16, OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1), OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1, mixed mode, sharin - -Cassandra on GNR 64c (Intel Xeon 6767P): 1-node, 2x Intel(R) Xeon(R) 6767P, 64 cores, 350W TDP, HT On, Turbo On, NUMA 4, Total Memory 1024GB (16x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS BHSDCRB1.IPC.3544.P22.2411120403, microcode 0x1000341, 1x I210 Gigabit Network Connection, 2x Ethernet Controller X710 for 10GBASE-T, 8x 3.5T KIOXIA KCD8XPUG3T84, 1x 894.3G Micron_7450_MTFDKBG960TFR, Ubuntu 24.04.1 LTS, 6.8.0-49-generic. Test by Intel as of Wed Jan 22 09:47:19 PM UTC 2025, Apache Cassandra 4.1.5, Cassandra-Stress 4.1.5, openjdk version ""11.0.24"" 2024-07-16, OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1), OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1, mixed mode, sharin - -Cassandra on GNR 96c (Intel Xeon 6952P): 1-node, 2x Intel(R) Xeon(R) 6952P, 96 cores, 400W TDP, HT On, Turbo On, NUMA 6, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS BHSDCRB1.IPC.3544.P15.2410232346, microcode 0x1000341, 1x I210 Gigabit Network Connection, 2x Ethernet Controller 10-Gigabit X540-AT2, 1x 894.3G SAMSUNG MZ1L2960HCJR-00A07, 8x 3.5T KIOXIA KCD8XPUG3T84, Ubuntu 24.04.1 LTS, 6.8.0-49-generic. 
Test by Intel as of 01/14/25, Apache Cassandra 4.1.5, Cassandra-Stress 4.1.5, openjdk version ""11.0.24"" 2024-07-16, OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1), OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1, mixed mode, sharin - -Cassandra on GNR 128c (Intel Xeon 6980P): 1-node, 2x Intel(R) Xeon(R) 6980P, 128 cores, 500W TDP, HT On, Turbo On, NUMA 6, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS BHSDCRB1.IPC.3544.P15.2410232346, microcode 0x1000341, 2x Ethernet Controller 10-Gigabit X540-AT2, 1x I210 Gigabit Network Connection, 1x 894.3G SAMSUNG MZ1L2960HCJR-00A07, 8x 3.5T KIOXIA KCD8XPUG3T84, Ubuntu 24.04.1 LTS, 6.8.0-49-generic. Test by Intel as of 12/17/24, Apache Cassandra 4.1.5, Cassandra-Stress 4.1.5, openjdk version ""11.0.24"" 2024-07-16, OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1), OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1, mixed mode, sharin - -Results may vary. 
- -# Example System Startup Script - -```bash -############################################ -#DataStax recommended kernel settings # -############################################ -ulimit -n 1048576 -ulimit -l unlimited -ulimit -u 32768 -# Disable reclaim mode, disable swap, disable defrag for Transparent -# Hugepages in accordance with DataStax -echo 0 > /proc/sys/vm/zone_reclaim_mode -swapoff –all -echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag -############################## -# Network production settings# -############################## -sysctl -w \ -net.ipv4.tcp_keepalive_time=60 \ -net.ipv4.tcp_keepalive_probes=3 \ -net.ipv4.tcp_keepalive_intvl=10 -sysctl -w \ -net.core.rmem_max=16777216 \ -net.core.wmem_max=16777216 \ -net.core.rmem_default=16777216 \ -net.core.wmem_default=16777216 \ -net.core.optmem_max=40960 \ -net.ipv4.tcp_rmem='4096 87380 16777216' \ -net.ipv4.tcp_wmem='4096 65536 16777216' -############################################################### -# Neworking adding 3 additional static IP address on the # -# same network interface for my Cassandra instances # -############################################################### -ifconfig eno1:1 134.134.101.218 up -ifconfig eno1:2 134.134.101.219 up -ifconfig eno1:3 134.134.101.220 up -################################################################ -#setting the system to performance mode for best possible perf # -################################################################ -for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor -do - [ -f $CPUFREQ ] || continue - echo -n performance > $CPUFREQ -done -for CPUFREQ in /sys/devices/system/cpu/cpu*/power/energy_perf_bias -do - [ -f $CPUFREQ ] || continue - echo -n performance > $CPUFREQ -done -################################################################## -# Disk Optimizations for storage devices in Database # -# - changing scheduler to none # -# - rotational to zero # -# - changing the read ahead buffer # 
-################################################################## -touch /var/lock/subsys/local -echo none > /sys/block/nvme1n1/queue/scheduler -echo none > /sys/block/nvme2n1/queue/scheduler -echo none > /sys/block/nvme3n1/queue/scheduler -echo none > /sys/block/nvme4n1/queue/scheduler -echo 0 > /sys/class/block/nvme1n1/queue/rotational -echo 0 > /sys/class/block/nvme2n1/queue/rotational -echo 0 > /sys/class/block/nvme3n1/queue/rotational -echo 0 > /sys/class/block/nvme4n1/queue/rotational -###################################################################### -# Note this change alone will double Cassandra throughput # -# as the Linux default is 128 read_ahead_kb, this can bottleneck the # -# NVME device bandwidth when you have small random requests, like # -# those on cassandra-stress # -###################################################################### -echo 8 > /sys/class/block/nvme1n1/queue/read_ahead_kb -echo 8 > /sys/class/block/nvme2n1/queue/read_ahead_kb -echo 8 > /sys/class/block/nvme3n1/queue/read_ahead_kb -echo 8 > /sys/class/block/nvme4n1/queue/read_ahead_kb -``` -# FAQ - -### Where can I download Cassandra? -As of mid-2025, the latest official release version is 5.0.4. Download binaries and source code from [here](http://archive.apache.org/dist/cassandra/ ) -[GitHub](https://github.com/apache/cassandra) is also a good resource, but you will need to build from source. - -### How often do I need to change/update Cassandra? -Stick with the latest that works for you and/or your customers. Recommend changing to a new major release when available for development as each newer version has better performance. -Most customers wait 1 year before using a new major version in production. - -### What should I use as heap size? -We are following DataStax (key contributor to Cassandra) recommendations. Best performance has been seen with total system heap of 25–50% DRAM. - -### What are storage requirements? 
-In general, the faster the storage, the better the throughput/latency.
-The storage size needs to be appropriate for your datasets.
-In our setup, we are using NVME devices for the dataset.
-Cassandra is IO intensive and a Cassandra instance will be IO bound when multiple Cassandra instances are using the same NVME device.
-
-### Do I need to rebuild the dataset when I change configuration?
-No. After creating the original working dataset, save a copy as reference.
-Note: major Cassandra version datasets are not compatible with each other.
-For example, a dataset created in Cass3 is not compatible with Cass4 or Cass5.
-Cass5 dataset is backward compatible with Cass4 dataset if you use the default `cassandra.yaml` file.
-
-### Can I have clients and servers on the same machine?
-Yes. Your data packages will not be transferred over the network—just across the memory bus between the CPUs.
-
-### What should my database size be?
-Our customers typically run servers where the database storage capacity is at least 2 times larger than the server DRAM.
-This can be difficult to simulate in efficient performance testing, as building/compacting/moving/copying large amounts of data takes lots of time and hardware resources.
-As a rule of thumb, we typically have larger total database capacity for all instances in the system to be larger than the system DRAM.
-This ensures we exercise both DRAM and storage.
-
-### How big will my database be for a given amount of entries?
-Using CassStress and the recommended `cqlstress-insanity-example.yaml` file to create your dataset will result in files of approximately 670 bytes per entry on compressed partition on disk.
-For example: 600 million partitions will create ~400GB compressed dataset on disk.
+zib-accel: https://github.com/intel/zlib-accel +NoSQLBench: https://github.com/nosqlbench/nosqlbench From df5b4da252a4dd0e363bf094485136a85d22bed4 Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Mon, 16 Mar 2026 13:50:14 -0700 Subject: [PATCH 03/19] Initial QAT checkin --- software/cassandra/QAT/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index de51fae..0e2864e 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -100,7 +100,7 @@ use_zlib_uncompress=1 Once the zlib-accel library has been built, It is simple to use Cassandra to build the ``` -LD_PRELOAD=/root/zlib-accel/build/libzlib-accel.so bin/cassandra -R +LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R ``` ## Future Enhancements From 0663bfdddad31f8a806b8329edcc96f75ecb46a7 Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Tue, 17 Mar 2026 08:22:35 -0700 Subject: [PATCH 04/19] More updates to QAT Cassandra --- software/cassandra/QAT/README.md | 49 ++++++++++++++++++++------------ 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 0e2864e..8287219 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -3,7 +3,7 @@ - [Overview](#overview) - [QAT Hardware Requirement](#qat-hardware-requirement) -- [QAT Software Requirement](#qat-software-requirement) +- [QAT Software Requirement and Prequisites](#qat-software-requirement-and-prerequisite) - [Cassandra Configuration](#cassandra-configuration) - [Building and configuring zlib-accel](#building-zlib-accel) - [Using Cassandra with zlib-accel](#cassandra-with-zlib-accel) @@ -12,16 +12,13 @@ ## Overview -Intel® QuickAssist Technology (Intel® QAT) zlib-accel library. 
- -Without sacrificing compression ratios, zlib-accel with QAT offers higher throughput using a workload of NoSQLBench , 18% higher than -zstd, 98% higher than zlib, and 36% higher than zlib-ng. CPU cycles per Cassandra operation is also better; compared to zlib, using QAT with zlib-accel uses only 43% of the CPU cycles per Cassandra operation. +Compression takes up a significant portion of resources in the data center. Hardware acceleration like Intel® QuickAssist Technology (Intel® QAT) can be used to offload the compression portion of a workload to provide higher throughput and lower latency than using the CPU alone. The zlib-accel library uses a shim approach to seamless integrate Intel® QAT for compression operations. Using zlib-accel allows the user to take advantage of hardware compression with QAT without having to make code changes to the underlying Cassandra codebase. +Without sacrificing compression ratios, zlib-accel with QAT offers higher throughput using a workload of NoSQLBench , 18% higher than zstd, 98% higher than zlib, and 36% higher than zlib-ng. CPU cycles per Cassandra operation is also better; compared to zlib, using QAT with zlib-accel uses only 43% of the CPU cycles per Cassandra operation. ## QAT Hardware Requirement - -At least one Intel® QAT engine is required. This can be verified by running the following command: +At least one Intel® QAT engine is required and the individual engine might need to be updated in the BIOS. This can be verified by running the following command: ``` echo `(lspci -d 8086:4940 && lspci -d 8086:4941 && lspci -d 8086:4942 && lspci -d 8086:4943 && lspci -d 8086:4944 && lspci -d 8086:4945 && lspci -d 8086:4946 && lspci -d 8086:4947) | wc -l` supported devices found. @@ -60,21 +57,33 @@ sudo cp qat_4xxx*.bin qat_402xx*.bin /lib/firmware rm qat_4xxx*.bin qat_402xx*.bin ``` -## QAT Software Requirement +After firmware is updated, the initramfs must be updated. This differs based on the Linux distribution. 
+ +## QAT Software Requirements and Prerequisites -QAT drivers, available in-tree in Linux kernel -QATlib library -QATzip library (v1.3.0 and above) +The QAT driver is available either "in-tree" as part of a release kernel or can be built outside of the release. This document assumes the use of the in-tree driver that is already available with kernsl after version 5.19. + +QATLib provides user space libraries that allows QAT device access and expose APIs for use by higher level applications. The QATLib driver can be installed using your distributions package manager. For Ubuntu 24.04: + +``` +sudo -E apt install -y libqat4 libqat-dev qatlib-service qatlib-examples libusdm-dev +``` + +QATzip is a user-space library built on top of the Intel® QuickAssist Technology (QAT) user-space library. It provides extended compression and decompression capabilities by offloading these operations to Intel® QAT Accelerators. + +``` +sudo -E apt install -y qatzip libqatzip3 +``` + +Please note that "intel_iommu=on" will be required as a kernel parameter. ## Cassandra Configuration +The Cassandra configuration mentioned in the base optimization-zone repository can still be used with zlib-accel. zlib-accel requires the following software versions: + OpenJDK 17 Cassandra 5.0.6 -The Cassandra configuration mentioned in the base optimization-zone article. - -https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat - ## Building and configuring zlib-accel ``` @@ -97,7 +106,7 @@ use_zlib_uncompress=1 ## Using Cassandra with zlib-accel -Once the zlib-accel library has been built, It is simple to use Cassandra to build the +Once the zlib-accel library has been built, It is simple to use Cassandra to enable hardware compression. 
``` LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R @@ -105,10 +114,14 @@ LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R ## Future Enhancements -Support for QAT plugin into Cassandra is in progress and waiting to be upstreamed. This includes support for ZSTD and Deflate. +Support for QAT plugin into Cassandra is in progress and waiting to be upstreamed. This includes support for ZSTD. ## References - zib-accel: https://github.com/intel/zlib-accel + NoSQLBench: https://github.com/nosqlbench/nosqlbench + +QATLib: https://intel.github.io/quickassist/qatlib/index.html + + From 0a06b7d6e33244ec3e1a211dd07de23dfb5c31b0 Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Tue, 17 Mar 2026 08:24:10 -0700 Subject: [PATCH 05/19] Update root markdown to have Cassandra QAT reference. --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index a409232..8f5e989 100644 --- a/README.md +++ b/README.md @@ -32,6 +32,7 @@ We aim to provide a dynamic resource where users can find the latest optimizatio - Software - [Cassandra](software/cassandra/README.md) + - [Cassandra QAT](software/cassandra/QAT/README.md) - [Gluten](software/gluten/README.md) - [Java](software/java/README.md) - [Similarity Search](software/similarity-search/README.md) From 9ac595af46f6f538a7239439c675ff88ec9a447e Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Tue, 17 Mar 2026 15:51:27 -0700 Subject: [PATCH 06/19] Further updates after review from QAT team. 
--- software/cassandra/QAT/README.md | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 8287219..95afa30 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -7,6 +7,7 @@ - [Cassandra Configuration](#cassandra-configuration) - [Building and configuring zlib-accel](#building-zlib-accel) - [Using Cassandra with zlib-accel](#cassandra-with-zlib-accel) +- [Benchmarking Cassandra with QAT](#benchmark-cassandra-with-qat) - [Future Enhancements](#future-enhancements) - [References](#references) @@ -49,19 +50,21 @@ https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree ``` cd ~ -wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat/qat_4xxx.bin -wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat/qat_4xxx_mmp.bin -wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat/qat_402xx.bin -wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/intel/qat/qat_402xx_mmp.bin -sudo cp qat_4xxx*.bin qat_402xx*.bin /lib/firmware -rm qat_4xxx*.bin qat_402xx*.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_4xxx.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_4xxx.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_402xx.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_402xx.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_420xx.bin +wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_420xx.bin +sudo cp qat_4xxx*.bin qat_402xx*.bin 
qat_420xx*.bin /lib/firmware +rm qat_4xxx*.bin qat_402xx*.bin qat_420xx*.bin ``` After firmware is updated, the initramfs must be updated. This differs based on the Linux distribution. ## QAT Software Requirements and Prerequisites -The QAT driver is available either "in-tree" as part of a release kernel or can be built outside of the release. This document assumes the use of the in-tree driver that is already available with kernsl after version 5.19. +The QAT driver is available either "in-tree" as part of a release kernel or can be built outside of the release. This document assumes the use of the in-tree driver that is already available with kernal after version 5.19. The distribution used for this benchmarking was Ubuntu 24.04 with the in-tree driver. QATLib provides user space libraries that allows QAT device access and expose APIs for use by higher level applications. The QATLib driver can be installed using your distributions package manager. For Ubuntu 24.04: @@ -75,6 +78,8 @@ QATzip is a user-space library built on top of the Intel® QuickAssist Technolog sudo -E apt install -y qatzip libqatzip3 ``` +Depending on the use case, the user can configure the number of QAT engines to use with the workload. In "Managed Mode", the QATLib can be used to restrict the workload to a specific number of engines. + Please note that "intel_iommu=on" will be required as a kernel parameter. ## Cassandra Configuration @@ -112,6 +117,10 @@ Once the zlib-accel library has been built, It is simple to use Cassandra to ena LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R ``` +## Benchmarking Cassandra with QAT + +NoSQLBench is used for benchmarking Cassandra. The results mentioned in the Overview section were generated by using 6 independent Cassandra servers and servers. The benchmark used a mix of 80% reads and 20% writes using the default CQL timeseries schema. 
+ ## Future Enhancements Support for QAT plugin into Cassandra is in progress and waiting to be upstreamed. This includes support for ZSTD. @@ -122,6 +131,4 @@ zib-accel: https://github.com/intel/zlib-accel NoSQLBench: https://github.com/nosqlbench/nosqlbench -QATLib: https://intel.github.io/quickassist/qatlib/index.html - - +QATLib Users Guide: https://intel.github.io/quickassist/qatlib/index.html From 5aa09c03ab50323f6c80c89ab4db78ffe3dd7bd2 Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Tue, 24 Mar 2026 08:10:51 -0700 Subject: [PATCH 07/19] More changes based on reviews by Java team and QAT team. Added test configuration info. --- software/cassandra/QAT/README.md | 46 +++++++++++++++++++++++++++----- 1 file changed, 39 insertions(+), 7 deletions(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 95afa30..814e5b2 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -13,13 +13,16 @@ ## Overview -Compression takes up a significant portion of resources in the data center. Hardware acceleration like Intel® QuickAssist Technology (Intel® QAT) can be used to offload the compression portion of a workload to provide higher throughput and lower latency than using the CPU alone. The zlib-accel library uses a shim approach to seamless integrate Intel® QAT for compression operations. Using zlib-accel allows the user to take advantage of hardware compression with QAT without having to make code changes to the underlying Cassandra codebase. +Compression takes up a significant portion of resources in the data center. Hardware acceleration like Intel® QuickAssist Technology (Intel® QAT) can be used to offload the compression portion of a workload. Offloading these operations will free up CPU cores to do other work and will improve compress/decompress performance. The zlib-accel library uses a shim approach to seamless integrate Intel® QAT for compression operations using the Deflate algorithm. 
Using zlib-accel allows the user to take advantage of hardware compression with QAT without having to make code changes to the underlying Cassandra codebase. + +Without sacrificing compression ratios, zlib-accel with QAT offers higher throughput using a workload of NoSQLBench. The compression throughput of zlib-accel with QAT is 18% higher than zstd, 98% higher than zlib, and 36% higher than zlib-ng. CPU cycles per Cassandra operation is also better; compared to zlib, using QAT with zlib-accel uses only 43% of the CPU cycles per Cassandra operation. -Without sacrificing compression ratios, zlib-accel with QAT offers higher throughput using a workload of NoSQLBench , 18% higher than zstd, 98% higher than zlib, and 36% higher than zlib-ng. CPU cycles per Cassandra operation is also better; compared to zlib, using QAT with zlib-accel uses only 43% of the CPU cycles per Cassandra operation. ## QAT Hardware Requirement -At least one Intel® QAT engine is required and the individual engine might need to be updated in the BIOS. This can be verified by running the following command: +At least one Intel® QAT engine is required and the individual engine might need to be updated in the BIOS. The following steps should be performed to be reading to use the QAT device(s). + +1. Check for QAT device availability. This can be verified by running the following command: ``` echo `(lspci -d 8086:4940 && lspci -d 8086:4941 && lspci -d 8086:4942 && lspci -d 8086:4943 && lspci -d 8086:4944 && lspci -d 8086:4945 && lspci -d 8086:4946 && lspci -d 8086:4947) | wc -l` supported devices found. @@ -31,7 +34,7 @@ If a device is found, the output of the command with be: 8 supported devices found. ``` -Verify that the QAT firmware is already loaded by using the following command: +2. 
Verify that the QAT firmware is already loaded by using the following command: ``` ls /lib/firmware/{qat_4xxx,qat_402xx,qat_420xx}.bin* 2>/dev/null @@ -62,9 +65,31 @@ rm qat_4xxx*.bin qat_402xx*.bin qat_420xx*.bin After firmware is updated, the initramfs must be updated. This differs based on the Linux distribution. +3. Verify that the kernel drivers are loaded using the following command. + +``` +lsmod | grep qat +``` + +The output should be similar to the following: + +``` +qat_4xxx 16384 0 +intel_qat 172032 1 qat_4xxx +``` + +If the kernel modules are not found, they can be installed using: + +``` +sudo modprobe intel_qat +sudo modprobe qat_4xxx +``` + +If the kernel modules could not be installed, it might be needed to either install them through a kernel configuration or to install that with the distribution's package manger. + ## QAT Software Requirements and Prerequisites -The QAT driver is available either "in-tree" as part of a release kernel or can be built outside of the release. This document assumes the use of the in-tree driver that is already available with kernal after version 5.19. The distribution used for this benchmarking was Ubuntu 24.04 with the in-tree driver. +The QAT driver is available either "in-tree" as part of a release kernel or can be built outside of the release. This document assumes the use of the in-tree driver that is already available with kernel after version 5.19. The distribution used for this benchmarking was Ubuntu 24.04 with the in-tree driver. QATLib provides user space libraries that allows QAT device access and expose APIs for use by higher level applications. The QATLib driver can be installed using your distributions package manager. For Ubuntu 24.04: @@ -84,7 +109,7 @@ Please note that "intel_iommu=on" will be required as a kernel parameter. ## Cassandra Configuration -The Cassandra configuration mentioned in the base optimization-zone repository can still be used with zlib-accel. 
zlib-accel requires the following software versions: +The Cassandra configuration mentioned in the base [optimization-zone] (https://github.com/intel/optimization-zone/tree/main/software/cassandra) repository can still be used with zlib-accel. This Cassandra with QAT/zlib-accel optimization was tested the following software versions: OpenJDK 17 Cassandra 5.0.6 @@ -123,7 +148,14 @@ NoSQLBench is used for benchmarking Cassandra. The results mentioned in the Ove ## Future Enhancements -Support for QAT plugin into Cassandra is in progress and waiting to be upstreamed. This includes support for ZSTD. +Support for QAT plugin into Cassandra is in progress and waiting to be upstreamed. This includes support for ZSTD. Please refer to the [enhancement proposal] (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-49%3A+Hardware-accelerated+compression) for more info and the latest status and on the QAT plugin. + + +## Details + +Cassandra on GNR 128c (Intel Xeon 6980P): 1-node, 2x Intel(R) Xeon(R) 6980P, 128 cores, 500W TDP, HT On, Turbo On, NUMA 6, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS F23, microcode 0x10003f3, 2x 1350 Gigabit Network Connection, 1x14.3G SanDisk 3.2Gen1, 8x3.5T Samsung MZQL23T8HCL5-00A07, 1x7T Micron_7450_MTFDK8G1T9TFR, Ubuntu 24.04.3 LTS, 6.8.0-86-generic. Test by Intel as of Nov 18, 2025, Apache Cassandra 5.0.5, OpenJDK 64-Bit Server VM 17.0.16, NoSQLBench version 4.15.104 + +Results may vary. 
## References From 4575b8b9ce6212782ffd670977be53dba1ef9964 Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Wed, 25 Mar 2026 10:45:18 -0700 Subject: [PATCH 08/19] Updated some broken links --- software/cassandra/QAT/README.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 814e5b2..806e2fa 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -3,12 +3,13 @@ - [Overview](#overview) - [QAT Hardware Requirement](#qat-hardware-requirement) -- [QAT Software Requirement and Prequisites](#qat-software-requirement-and-prerequisite) +- [QAT Software Requirement and Prequisites](#qat-software-requirement-and-prerequisites) - [Cassandra Configuration](#cassandra-configuration) -- [Building and configuring zlib-accel](#building-zlib-accel) -- [Using Cassandra with zlib-accel](#cassandra-with-zlib-accel) -- [Benchmarking Cassandra with QAT](#benchmark-cassandra-with-qat) +- [Building and configuring zlib-accel](#building-and-configuring-zlib-accel) +- [Using Cassandra with zlib-accel](#using-cassandra-with-zlib-accel) +- [Benchmarking Cassandra with QAT](#benchmarking-cassandra-with-qat) - [Future Enhancements](#future-enhancements) +- [Details](#Details) - [References](#references) ## Overview From fd418fa064d6666c1bef57fd4df77ada417c32af Mon Sep 17 00:00:00 2001 From: ssherman8 <102256180+ssherman8@users.noreply.github.com> Date: Wed, 25 Mar 2026 10:50:08 -0700 Subject: [PATCH 09/19] Update README.md Fixed another link. 
--- software/cassandra/QAT/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 806e2fa..957a5a3 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -3,7 +3,7 @@ - [Overview](#overview) - [QAT Hardware Requirement](#qat-hardware-requirement) -- [QAT Software Requirement and Prequisites](#qat-software-requirement-and-prerequisites) +- [QAT Software Requirement and Prerequisites](#qat-software-requirement-and-prerequisites) - [Cassandra Configuration](#cassandra-configuration) - [Building and configuring zlib-accel](#building-and-configuring-zlib-accel) - [Using Cassandra with zlib-accel](#using-cassandra-with-zlib-accel) From e05c7b8f18f1de84237879d64a78fc06127b4bef Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Wed, 25 Mar 2026 10:56:35 -0700 Subject: [PATCH 10/19] Updated one last broken link --- software/cassandra/QAT/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 806e2fa..0b3b1d6 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -3,7 +3,7 @@ - [Overview](#overview) - [QAT Hardware Requirement](#qat-hardware-requirement) -- [QAT Software Requirement and Prequisites](#qat-software-requirement-and-prerequisites) +- [QAT Software Requirement and Prerequisites](#qat-software-requirement-and-prerequisites) - [Cassandra Configuration](#cassandra-configuration) - [Building and configuring zlib-accel](#building-and-configuring-zlib-accel) - [Using Cassandra with zlib-accel](#using-cassandra-with-zlib-accel) @@ -88,7 +88,7 @@ sudo modprobe qat_4xxx If the kernel modules could not be installed, it might be needed to either install them through a kernel configuration or to install that with the distribution's package manger. 
-## QAT Software Requirements and Prerequisites +## QAT Software Requirement and Prerequisites The QAT driver is available either "in-tree" as part of a release kernel or can be built outside of the release. This document assumes the use of the in-tree driver that is already available with kernel after version 5.19. The distribution used for this benchmarking was Ubuntu 24.04 with the in-tree driver. From 7bca0a2299ae71964151b0d4c9dbe470ea68f1e1 Mon Sep 17 00:00:00 2001 From: ssherman8 <102256180+ssherman8@users.noreply.github.com> Date: Fri, 27 Mar 2026 18:19:49 -0700 Subject: [PATCH 11/19] Update software/cassandra/QAT/README.md Typo Co-authored-by: rsiyer-intel --- software/cassandra/QAT/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 0b3b1d6..ba9ec45 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -86,7 +86,7 @@ sudo modprobe intel_qat sudo modprobe qat_4xxx ``` -If the kernel modules could not be installed, it might be needed to either install them through a kernel configuration or to install that with the distribution's package manger. +If the kernel modules could not be installed, it might be needed to either install them through a kernel configuration or to install that with the distribution's package manager. 
## QAT Software Requirement and Prerequisites From 97440f70a4481826fb75b71bf680b9589e59a4f4 Mon Sep 17 00:00:00 2001 From: ssherman8 <102256180+ssherman8@users.noreply.github.com> Date: Fri, 27 Mar 2026 18:20:36 -0700 Subject: [PATCH 12/19] Update software/cassandra/QAT/README.md Typo Co-authored-by: rsiyer-intel --- software/cassandra/QAT/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index ba9ec45..5c83b9a 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -160,7 +160,7 @@ Results may vary. ## References -zib-accel: https://github.com/intel/zlib-accel +zlib-accel: https://github.com/intel/zlib-accel NoSQLBench: https://github.com/nosqlbench/nosqlbench From ca5ea61ce151a65c7bcc373e579d01e4e3e2cb96 Mon Sep 17 00:00:00 2001 From: ssherman8 <102256180+ssherman8@users.noreply.github.com> Date: Fri, 27 Mar 2026 18:21:29 -0700 Subject: [PATCH 13/19] Update software/cassandra/QAT/README.md Typo Co-authored-by: rsiyer-intel --- software/cassandra/QAT/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 5c83b9a..6a33640 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -21,7 +21,7 @@ Without sacrificing compression ratios, zlib-accel with QAT offers higher throug ## QAT Hardware Requirement -At least one Intel® QAT engine is required and the individual engine might need to be updated in the BIOS. The following steps should be performed to be reading to use the QAT device(s). +At least one Intel® QAT engine is required and the individual engine might need to be updated in the BIOS. The following steps should be performed to be ready to use the QAT device(s). 1. Check for QAT device availability. 
This can be verified by running the following command: From 7bef83d53a041f9fe101a4f6cef6892a2ffcba32 Mon Sep 17 00:00:00 2001 From: ssherman8 <102256180+ssherman8@users.noreply.github.com> Date: Fri, 27 Mar 2026 18:24:10 -0700 Subject: [PATCH 14/19] Update software/cassandra/QAT/README.md Co-authored-by: rsiyer-intel --- software/cassandra/QAT/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 6a33640..0167622 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -110,7 +110,7 @@ Please note that "intel_iommu=on" will be required as a kernel parameter. ## Cassandra Configuration -The Cassandra configuration mentioned in the base [optimization-zone] (https://github.com/intel/optimization-zone/tree/main/software/cassandra) repository can still be used with zlib-accel. This Cassandra with QAT/zlib-accel optimization was tested the following software versions: +The Cassandra configuration mentioned in the base [cassandra](https://github.com/intel/optimization-zone/blob/main/software/cassandra/README.md) readme can still be used with zlib-accel. This Cassandra with QAT/zlib-accel optimization was tested with the following software versions: OpenJDK 17 Cassandra 5.0.6 From f43a91059b8bcd1ece4ffd185ad5ab90a6a6c09b Mon Sep 17 00:00:00 2001 From: "Sherman, Srikanth" Date: Mon, 30 Mar 2026 09:35:52 -0700 Subject: [PATCH 15/19] PR suggested changes to text --- software/cassandra/QAT/README.md | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md index 0167622..17730a6 100644 --- a/software/cassandra/QAT/README.md +++ b/software/cassandra/QAT/README.md @@ -16,7 +16,7 @@ Compression takes up a significant portion of resources in the data center. 
Hardware acceleration like Intel® QuickAssist Technology (Intel® QAT) can be used to offload the compression portion of a workload. Offloading these operations will free up CPU cores to do other work and will improve compress/decompress performance. The zlib-accel library uses a shim approach to seamless integrate Intel® QAT for compression operations using the Deflate algorithm. Using zlib-accel allows the user to take advantage of hardware compression with QAT without having to make code changes to the underlying Cassandra codebase. -Without sacrificing compression ratios, zlib-accel with QAT offers higher throughput using a workload of NoSQLBench. The compression throughput of zlib-accel with QAT is 18% higher than zstd, 98% higher than zlib, and 36% higher than zlib-ng. CPU cycles per Cassandra operation is also better; compared to zlib, using QAT with zlib-accel uses only 43% of the CPU cycles per Cassandra operation. +Without sacrificing compression ratios, zlib-accel with QAT offers higher throughput using a workload of [NoSQLBench](https://github.com/nosqlbench/nosqlbench). The compression throughput of zlib-accel with QAT is 18% higher than zstd, 98% higher than zlib, and 36% higher than zlib-ng. CPU cycles per Cassandra operation is also better; compared to zlib, using QAT with zlib-accel uses only 43% of the CPU cycles per Cassandra operation. ## QAT Hardware Requirement @@ -29,7 +29,7 @@ At least one Intel® QAT engine is required and the individual engine might need echo `(lspci -d 8086:4940 && lspci -d 8086:4941 && lspci -d 8086:4942 && lspci -d 8086:4943 && lspci -d 8086:4944 && lspci -d 8086:4945 && lspci -d 8086:4946 && lspci -d 8086:4947) | wc -l` supported devices found. ``` -If a device is found, the output of the command with be: +If at least one device is found, the output of the command will be: ``` 8 supported devices found. 
@@ -55,11 +55,11 @@ https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree
 ```
 cd ~
 wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_4xxx.bin
-wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_4xxx.bin
-wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_402xx.bin
+wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_4xxx_mmp.bin
 wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_402xx.bin
+wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_402xx_mmp.bin
 wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_420xx.bin
-wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_420xx.bin
+wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/intel/qat/qat_420xx_mmp.bin
 sudo cp qat_4xxx*.bin qat_402xx*.bin qat_420xx*.bin /lib/firmware
 rm qat_4xxx*.bin qat_402xx*.bin qat_420xx*.bin
 ```
@@ -86,7 +86,7 @@ sudo modprobe intel_qat
 sudo modprobe qat_4xxx
 ```
 
-If the kernel modules could not be installed, it might be needed to either install them through a kernel configuration or to install that with the distribution's package manager.
+If the kernel modules could not be installed, it might be needed to either install them through a kernel configuration or to install them with the distribution's package manager.
 
 ## QAT Software Requirement and Prerequisites
 
@@ -104,7 +104,7 @@ QATzip is a user-space library built on top of the Intel® QuickAssist Technolog
 sudo -E apt install -y qatzip libqatzip3
 ```
 
-Depending on the use case, the user can configure the number of QAT engines to use with the workload. In "Managed Mode", the QATLib can be used to restrict the workload to a specific number of engines.
+Depending on the use case, the user can configure the number of QAT engines to use with the workload. In "Managed Mode", the [QATLib](https://intel.github.io/quickassist/qatlib/index.html) library can be used to restrict the workload to a specific number of engines.
 
 Please note that "intel_iommu=on" will be required as a kernel parameter.
 
@@ -114,6 +114,7 @@ The Cassandra configuration mentioned in the base [cassandra](https://github.com
 
 OpenJDK 17
 Cassandra 5.0.6
+zlib-accel 1.0.0
 
 ## Building and configuring zlib-accel
 
@@ -137,7 +138,16 @@ use_zlib_uncompress=1
 
 ## Using Cassandra with zlib-accel
 
-Once the zlib-accel library has been built, It is simple to use Cassandra to enable hardware compression.
+[zlib-accel] (https://github.com/intel/zlib-accel) can be built with:
+
+```
+mkdir build
+cd build
+cmake -DUSE_QAT=ON -DUSE_IAA=OFF -DDEBUG_LOG=OFF -DCOVERAGE=OFF -DCMAKE_BUILD_TYPE=Release
+make
+```
+
+Once the zlib-accel library has been built, It is simple to use Cassandra to enable hardware compression. zlib-accel is usually installed in the /opt/zlib-accel
 
 ```
 LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R
@@ -145,16 +155,16 @@ LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R
 
 ## Benchmarking Cassandra with QAT
 
-NoSQLBench is used for benchmarking Cassandra. The results mentioned in the Overview section were generated by using 6 independent Cassandra servers and servers. The benchmark used a mix of 80% reads and 20% writes using the default CQL timeseries schema.
+NoSQLBench is used for benchmarking Cassandra. The results mentioned in the Overview section were generated by using 6 independent Cassandra clients and servers. The benchmark used a mix of 80% reads and 20% writes using the default CQL timeseries schema.
 
 ## Future Enhancements
 
-Support for QAT plugin into Cassandra is in progress and waiting to be upstreamed. This includes support for ZSTD. Please refer to the [enhancement proposal] (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-49%3A+Hardware-accelerated+compression) for more info and the latest status and on the QAT plugin.
+Support for QAT plugin into Cassandra is in progress and waiting to be upstreamed. This includes support for ZSTD. Please refer to the [enhancement proposal](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-49%3A+Hardware-accelerated+compression) for more info and the latest status and on the QAT plugin.
 
 ## Details
 
-Cassandra on GNR 128c (Intel Xeon 6980P): 1-node, 2x Intel(R) Xeon(R) 6980P, 128 cores, 500W TDP, HT On, Turbo On, NUMA 6, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS F23, microcode 0x10003f3, 2x 1350 Gigabit Network Connection, 1x14.3G SanDisk 3.2Gen1, 8x3.5T Samsung MZQL23T8HCL5-00A07, 1x7T Micron_7450_MTFDK8G1T9TFR, Ubuntu 24.04.3 LTS, 6.8.0-86-generic. Test by Intel as of Nov 18, 2025, Apache Cassandra 5.0.5, OpenJDK 64-Bit Server VM 17.0.16, NoSQLBench version 4.15.104
+Cassandra on GNR 128c (Intel Xeon 6980P): 1-node, 2x Intel(R) Xeon(R) 6980P, 128 cores, 500W TDP, HT On, Turbo On, NUMA 6, Total Memory 1536GB (24x64GB DDR5 6400 MT/s [6400 MT/s]), BIOS F23, microcode 0x10003f3, 2x 1350 Gigabit Network Connection, 4 QAT engines, 1x14.3G SanDisk 3.2Gen1, 8x3.5T Samsung MZQL23T8HCL5-00A07, 1x7T Micron_7450_MTFDK8G1T9TFR, Ubuntu 24.04.3 LTS, 6.8.0-86-generic. Test by Intel as of Nov 18, 2025, Apache Cassandra 5.0.5, OpenJDK 64-Bit Server VM 17.0.16, NoSQLBench version 4.15.104, zlib-accel version 1.0.0
 
 Results may vary.

From f29cb8647c830180576b45d3a99b58be53d025e7 Mon Sep 17 00:00:00 2001
From: "Sherman, Srikanth"
Date: Wed, 1 Apr 2026 16:31:30 -0700
Subject: [PATCH 16/19] More edits based on PR feedback. Added NoSQLBench commands to reproduce dataset.
---
 software/cassandra/QAT/README.md | 44 ++++++++++++++++++++++++--------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md
index 17730a6..dee89c0 100644
--- a/software/cassandra/QAT/README.md
+++ b/software/cassandra/QAT/README.md
@@ -118,10 +118,12 @@ zlib-accel 1.0.0
 
 ## Building and configuring zlib-accel
 
+[zlib-accel](https://github.com/intel/zlib-accel) can be built with:
+
 ```
 mkdir build
 cd build
-cmake -DDEBUG_LOG -DCOVERAGE=OFF -CMAKE_BUILD_TYPE=Release ..
+cmake -DUSE_QAT=ON -DUSE_IAA=OFF -DDEBUG_LOG=OFF -DCOVERAGE=OFF -DCMAKE_BUILD_TYPE=Release
 make
 ```
 
@@ -138,24 +140,46 @@ use_zlib_uncompress=1
 
 ## Using Cassandra with zlib-accel
 
-[zlib-accel] (https://github.com/intel/zlib-accel) can be built with:
+Once the zlib-accel library has been built, It is simple to use Cassandra to enable hardware compression. zlib-accel is usually installed in the /opt/zlib-accel. Please the LD_PRELOAD below to point to the shared object if it was not installed in the default directory.
 
 ```
-mkdir build
-cd build
-cmake -DUSE_QAT=ON -DUSE_IAA=OFF -DDEBUG_LOG=OFF -DCOVERAGE=OFF -DCMAKE_BUILD_TYPE=Release
-make
+LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R
 ```
 
-Once the zlib-accel library has been built, It is simple to use Cassandra to enable hardware compression. zlib-accel is usually installed in the /opt/zlib-accel
+## Benchmarking Cassandra with QAT
+
+NoSQLBench is used for benchmarking Cassandra. The results mentioned in the Overview section were generated by using 6 independent Cassandra servers. The benchmark used a mix of 80% reads and 20% writes using the default CQL timeseries schema.
+
+1. Download the CQL timeseries schema
 
 ```
-LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R
+java -jar nb.jar --copy cql-timeseries2
 ```
 
-## Benchmarking Cassandra with QAT
+2. Change the compressor to use "Deflate".
 
-NoSQLBench is used for benchmarking Cassandra. The results mentioned in the Overview section were generated by using 6 independent Cassandra clients and servers. The benchmark used a mix of 80% reads and 20% writes using the default CQL timeseries schema.
+```
+< AND compression = { 'sstable_compression' : '<>' }
+---
+> AND compression = { 'class' : 'DeflateCompressor' }
+```
+
+3. Create keyspace & table
+
+```
+java -jar nb.jar run driver=cql yaml=cql-timeseries2.yaml tags=phase:schema host=
+```
+
+4. Pre-populate dataset with progress reported every 4s
+
+```
+java -Xmx31G -Xms31G -XX:+UseG1GC -jar nb.jar run driver=cql yaml=cql-timeseries2.yaml tags=phase:rampup host= cycles=<# of rows to enter> threads= rampup-cycles=1000000000 main-cycles=1000000000 --progress console:4s
+```
+4. Run the workload (mixed 80R/20W)
+
+```
+java -Xmx31G -Xms31G -XX:+UseG1GC -jar nb.jar run driver=cql yaml=cql-timeseries2.yaml tags=phase:main read_ratio=8 write_ratio=2 host= threads= pooling=8:8:2048 cycles=<# of iterations to run the workload> limit=1 rampup-cycles=1000000000 main-cycles=1000000000 --progress console:3s --report-csv-to
+```
 
 ## Future Enhancements
 

From 1f6f3c6774f397af3d0a52d23830a5f569ee427c Mon Sep 17 00:00:00 2001
From: "Sherman, Srikanth"
Date: Wed, 1 Apr 2026 17:18:50 -0700
Subject: [PATCH 17/19] Clarified NoSQLBench commands to reproduce results.
---
 software/cassandra/QAT/README.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md
index dee89c0..1a550db 100644
--- a/software/cassandra/QAT/README.md
+++ b/software/cassandra/QAT/README.md
@@ -150,13 +150,13 @@ LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R
 
 NoSQLBench is used for benchmarking Cassandra. The results mentioned in the Overview section were generated by using 6 independent Cassandra servers. The benchmark used a mix of 80% reads and 20% writes using the default CQL timeseries schema.
 
-1. Download the CQL timeseries schema
+1. Download the CQL timeseries schema. This will generate a cql-timeseries2.yaml file.
 
 ```
 java -jar nb.jar --copy cql-timeseries2
 ```
 
-2. Change the compressor to use "Deflate".
+2. Change the compressor to use "Deflate" in the "create-table" blocks statemement (approximately line 46).
 
 ```
 < AND compression = { 'sstable_compression' : '<>' }
@@ -164,18 +164,22 @@ java -jar nb.jar --copy cql-timeseries2
 > AND compression = { 'class' : 'DeflateCompressor' }
 ```
 
-3. Create keyspace & table
+3. Create keyspace & table by running nb.jar with the cql driver. Host IP of the Cassandra server has to be specified in this statement (if running on the same system, "127.0.0.1").
 
 ```
 java -jar nb.jar run driver=cql yaml=cql-timeseries2.yaml tags=phase:schema host=
 ```
 
-4. Pre-populate dataset with progress reported every 4s
+4. Pre-populate dataset with progress reported every 4s. Along with the Host IP of the Cassandra server (same as previous step), the number of rows to enter and the number of client threads has to be specified. The results mentioned in the Overview section used "100M" for the number of rows and "400" client threads:
+
+host=127.0.0.1
+cycles=100M
+threads=400
 
 ```
 java -Xmx31G -Xms31G -XX:+UseG1GC -jar nb.jar run driver=cql yaml=cql-timeseries2.yaml tags=phase:rampup host= cycles=<# of rows to enter> threads= rampup-cycles=1000000000 main-cycles=1000000000 --progress console:4s
 ```
-4. Run the workload (mixed 80R/20W)
+4. Run the workload (mixed 80R/20W). In addition to the values of mentioned in the previous steps, the directory name where the CSV results are stored should be specified.
 
 ```
 java -Xmx31G -Xms31G -XX:+UseG1GC -jar nb.jar run driver=cql yaml=cql-timeseries2.yaml tags=phase:main read_ratio=8 write_ratio=2 host= threads= pooling=8:8:2048 cycles=<# of iterations to run the workload> limit=1 rampup-cycles=1000000000 main-cycles=1000000000 --progress console:3s --report-csv-to

From b9d471fda7375a1fd9c319c4c48945e45e130f78 Mon Sep 17 00:00:00 2001
From: ssherman8 <102256180+ssherman8@users.noreply.github.com>
Date: Wed, 1 Apr 2026 17:21:58 -0700
Subject: [PATCH 18/19] Update README.md Formatting change and minor typo.
---
 software/cassandra/QAT/README.md | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md
index 1a550db..f3fb8f0 100644
--- a/software/cassandra/QAT/README.md
+++ b/software/cassandra/QAT/README.md
@@ -156,7 +156,7 @@ NoSQLBench is used for benchmarking Cassandra. The results mentioned in the Ove
 java -jar nb.jar --copy cql-timeseries2
 ```
 
-2. Change the compressor to use "Deflate" in the "create-table" blocks statemement (approximately line 46).
+2. Change the compression to use "DeflateCompressor" in the "create-table" blocks statemement (approximately line 46).
 
 ```
 < AND compression = { 'sstable_compression' : '<>' }
@@ -170,11 +170,7 @@ java -jar nb.jar --copy cql-timeseries2
 java -jar nb.jar run driver=cql yaml=cql-timeseries2.yaml tags=phase:schema host=
 ```
 
-4. Pre-populate dataset with progress reported every 4s. Along with the Host IP of the Cassandra server (same as previous step), the number of rows to enter and the number of client threads has to be specified. The results mentioned in the Overview section used "100M" for the number of rows and "400" client threads:
-
-host=127.0.0.1
-cycles=100M
-threads=400
+4. Pre-populate dataset with progress reported every 4s. Along with the Host IP of the Cassandra server (same as previous step), the number of rows to enter and the number of client threads has to be specified. The results mentioned in the Overview section used "100M" for the number of rows and "400" client threads (host=127.0.0.1 cycles=100M threads=400)
 
 ```
 java -Xmx31G -Xms31G -XX:+UseG1GC -jar nb.jar run driver=cql yaml=cql-timeseries2.yaml tags=phase:rampup host= cycles=<# of rows to enter> threads= rampup-cycles=1000000000 main-cycles=1000000000 --progress console:4s

From 76d6e7a227bf7d48da19114838610e2620f7535e Mon Sep 17 00:00:00 2001
From: ssherman8 <102256180+ssherman8@users.noreply.github.com>
Date: Wed, 1 Apr 2026 17:25:08 -0700
Subject: [PATCH 19/19] Update README.md with formatting
---
 software/cassandra/QAT/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/software/cassandra/QAT/README.md b/software/cassandra/QAT/README.md
index f3fb8f0..27fcf39 100644
--- a/software/cassandra/QAT/README.md
+++ b/software/cassandra/QAT/README.md
@@ -140,7 +140,7 @@ use_zlib_uncompress=1
 
 ## Using Cassandra with zlib-accel
 
-Once the zlib-accel library has been built, It is simple to use Cassandra to enable hardware compression. zlib-accel is usually installed in the /opt/zlib-accel. Please the LD_PRELOAD below to point to the shared object if it was not installed in the default directory.
+Once the [zlib-accel](https://github.com/intel/zlib-accel) library has been built, It is simple to use Cassandra to enable hardware compression. zlib-accel is usually installed in the /opt/zlib-accel. Please update the LD_PRELOAD below to point to the shared object if it was not installed in the default directory.
 
 ```
 LD_PRELOAD=/opt/zlib-accel/build/libzlib-accel.so bin/cassandra -R