BAND (Bandwidth Assessment for Native DDR) is a Python-based memory bandwidth measurement tool designed to provide results comparable to the industry-standard STREAM benchmark. It offers optimized implementations for various memory operations with a focus on performance, allowing Python developers to evaluate memory bandwidth in their environments.
Traditional memory bandwidth benchmarks like STREAM are written in C and require compilation for each platform. BAND was created to provide an easy-to-use Python alternative that:
- Requires no compilation step (just Python + NumPy)
- Produces results comparable to the C-based STREAM benchmark
- Provides multiple optimized implementations to explore memory bandwidth characteristics
- Offers a simple, cross-platform way to estimate DDR memory bandwidth
- Helps Python developers understand memory performance constraints in data-intensive applications
BAND is particularly useful for:
- Data scientists working with large NumPy arrays
- Python developers optimizing memory-bound applications
- Performance engineers comparing memory subsystems across platforms
- System administrators evaluating Python performance on different hardware
- Python 3.6 or higher
- Dependencies listed in requirements.txt:
  - NumPy: For efficient array operations
  - psutil: For system information collection
```bash
# Clone the repository
git clone https://github.com/kylefoxaustin/band.git
cd band

# Install required packages using requirements.txt
pip install -r requirements.txt

# Make the script executable
chmod +x band.py
```

Basic usage:

```bash
./band.py
```

This will run all tests with default settings (4GB total memory, using up to 4 threads).

```bash
./band.py --size 2.0 --threads 8 --iterations 5
```

You can customize the execution with the following options:
```
--size FLOAT            Size in GB for each test (default: 4 GB)
--iterations INT        Number of iterations per test (default: 3)
--threads INT           Number of threads (default: min(CPU count, 4))
--chunk-size INT        Chunk size in KB for operations (default: varies by test)
--triad-only            Run only the triad tests for optimization experiments
--best                  Run only the best implementation for each operation
--compare               Compare to C STREAM benchmark results
--c-stream-triad FLOAT  C STREAM Triad result for comparison (default: 19.98 GB/s)
--stream-file PATH      Path to a file containing the output from a STREAM-C benchmark run
--enable-chunking       Enable chunked implementations that may benefit from CPU cache
```
BAND allocates memory based on the `--size` parameter, which defaults to 4GB. This can cause out-of-memory errors on systems with limited RAM. Below are recommended settings for different system configurations:
| System Memory | Recommended Size | Command Example |
|---|---|---|
| 8GB or less | 0.5GB or less | `./band.py --size 0.5 --stream-file stream_results.txt` |
| 16GB | 1-2GB | `./band.py --size 1.5 --stream-file stream_results.txt` |
| 32GB | 2-4GB | `./band.py --size 3.0 --stream-file stream_results.txt` |
| 64GB+ | 4-8GB | `./band.py --size 6.0 --stream-file stream_results.txt` |
- As a general rule, set the `--size` parameter to no more than 25-30% of your total available system memory
- For embedded systems or SBCs (like Raspberry Pi, Jetson boards), use smaller values (0.1-0.5GB)
- Larger test sizes can provide more accurate results for high-end systems, but aren't necessary for basic comparisons
- Monitor memory usage with tools like `htop` or `free -h` while running tests
- If you experience out-of-memory errors, reduce the size parameter
Bandwidth results should be consistent regardless of test size, as long as the arrays are significantly larger than your CPU cache size. For most modern systems, even a 0.5GB test size is more than sufficient to exceed cache limits and measure true memory bandwidth.
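As a rough illustration of the 25-30% sizing guideline, a small helper (hypothetical, not part of BAND) can turn total RAM into a suggested `--size` value; with psutil installed, the total could come from `psutil.virtual_memory().total`:

```python
def suggested_test_size_gb(total_bytes, fraction=0.25):
    """Suggest a --size value (in GB) as a fraction of total system RAM.

    Capped at 8 GB, since larger sizes rarely improve accuracy once the
    arrays already dwarf the CPU caches.
    """
    total_gb = total_bytes / (1024 ** 3)
    return round(min(total_gb * fraction, 8.0), 1)

# Example: a 16 GB machine -> 4.0, a 64 GB machine hits the 8 GB cap
print(suggested_test_size_gb(16 * 1024 ** 3))  # -> 4.0
print(suggested_test_size_gb(64 * 1024 ** 3))  # -> 8.0
```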
- `--size`: Total memory size to use for testing. Larger values provide more accurate results but require more RAM. Recommended to use at least 2-4 GB for meaningful results.
- `--iterations`: Number of times each test is run. The first iteration is considered a warm-up and excluded from final results.
- `--threads`: Number of threads to use. Defaults to the minimum of available CPU cores or 4. More threads can help utilize multi-channel memory systems.
- `--chunk-size`: Size of data chunks processed in each iteration, measured in KB. Different chunk sizes can significantly impact performance due to cache effects.
- `--triad-only`: Run only the Triad tests, which combine read, write, and arithmetic operations. Useful for focusing on the most comprehensive memory test.
- `--best`: Run only the best implementation for each operation type, which is useful for maximum performance measurement with minimal testing time.
- `--compare`: Enable comparison with C STREAM benchmark results using default reference values.
- `--c-stream-triad`: Specify your C STREAM Triad result in GB/s for direct comparison.
- `--stream-file`: Path to a file containing the output from a STREAM-C benchmark run. BAND will parse this file to extract all benchmark values for comprehensive comparison.
- `--enable-chunking`: Enable cache-optimized implementations that use smaller chunk sizes for better cache utilization. By default, BAND uses standard implementations that match STREAM's approach of measuring pure memory bandwidth.
BAND offers two approaches to measuring memory bandwidth:
By default, BAND uses implementations that closely follow the STREAM benchmark philosophy:
- Focus on measuring sustained memory bandwidth
- Use large array sizes that exceed cache capacity
- Minimize cache effects to get a true measure of memory subsystem performance
This mode is most useful for:
- Comparing Python performance to STREAM.C benchmark results
- Evaluating true memory bandwidth limitations
- Hardware performance comparisons
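For context, the four STREAM kernels are only a few lines of NumPy each. Here is a simplified single-threaded sketch (not BAND's actual implementation) that also shows how a bandwidth figure is derived from bytes moved:

```python
import time
import numpy as np

N = 10_000_000                     # ~80 MB per float64 array, well beyond cache
scalar = 3.0
a, b, c = np.zeros(N), np.full(N, 2.0), np.full(N, 1.0)

np.copyto(a, b)                    # Copy:  a = b          (1 read + 1 write)
np.multiply(c, scalar, out=a)      # Scale: a = scalar*c   (1 read + 1 write)
np.add(b, c, out=a)                # Add:   a = b + c      (2 reads + 1 write)

t0 = time.perf_counter()
a[:] = b + scalar * c              # Triad: a = b + scalar*c (2 reads + 1 write)
elapsed = time.perf_counter() - t0

gb_moved = 3 * N * 8 / 1e9         # Triad touches three arrays of 8-byte elements
print(f"Triad: {gb_moved / elapsed:.2f} GB/s")
```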
When the `--enable-chunking` flag is used, BAND includes additional implementations that are optimized for cache utilization:
- Py-Chunked Triad: Uses smaller chunk sizes (512KB by default) with reused temporary arrays
- Py-Combined Triad: Uses NumPy's expression optimization with moderate chunk sizes
This mode is useful for:
- Understanding potential performance with optimized code
- Exploring cache effects on performance
- Developing cache-friendly NumPy code
The cache-optimized implementations often outperform the standard STREAM implementations by significant margins (typically 20-50%), showing the importance of cache-friendly coding in Python.
BAND offers an optional cache optimization mode via the `--enable-chunking` flag. Understanding how this works can help you decide when to use it and interpret the results.
The ChunkedTriad implementation (enabled with `--enable-chunking`) processes data in small chunks (512KB by default) using a reusable temporary array. Here's how it works:
- Data Segmentation: Instead of processing the entire large arrays at once, the algorithm divides them into manageable chunks
- Temporary Array Reuse: It creates a single small temporary array that gets reused for each chunk
- Sequential Processing: Each chunk is processed completely before moving to the next
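The three steps above can be sketched as follows; this is an illustrative simplification of the chunking idea, not BAND's exact code:

```python
import numpy as np

def chunked_triad(a, b, c, scalar, chunk_kb=512):
    """Triad (a = b + scalar*c) processed in cache-sized chunks."""
    chunk_elems = chunk_kb * 1024 // a.itemsize        # elements per chunk
    tmp = np.empty(chunk_elems, dtype=a.dtype)         # one reusable temporary
    for start in range(0, a.size, chunk_elems):        # sequential chunks
        end = min(start + chunk_elems, a.size)
        n = end - start
        np.multiply(c[start:end], scalar, out=tmp[:n])  # tmp = scalar * c
        np.add(b[start:end], tmp[:n], out=a[start:end]) # a = b + tmp
    return a

n = 1_000_000
b, c, a = np.full(n, 2.0), np.full(n, 1.0), np.empty(n)
chunked_triad(a, b, c, 3.0)
print(a[0])  # -> 5.0
```

Because `tmp` is small and reused on every pass, it tends to stay resident in cache rather than being allocated fresh for each operation.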
Standard STREAM implementations intentionally use very large arrays to measure sustained memory bandwidth without cache benefits. In contrast, chunking takes advantage of cache behavior to optimize performance.
When using 512KB chunks on a system with, for example, a 1MB L2 cache:
- The working set (portions of arrays a, b, c, and the temporary array) is approximately 2MB total
- This exceeds the L2 cache size, so not all data remains cache-resident
- However, the temporary array can remain entirely in cache
- Between 25-50% of the data from the main arrays might remain in cache between operations
- The rest will be evicted and reloaded as needed
This improved cache utilization explains why the chunked implementation typically outperforms the standard implementation by 30-50% or more.
Use standard mode (default):
- When you want to measure true memory subsystem bandwidth
- For direct comparison with STREAM benchmark results
- To evaluate hardware memory performance
- For comparing different systems' memory bandwidth
Use chunking mode (`--enable-chunking`):
- When you want to see the potential performance of cache-optimized code
- To understand how much performance is left "on the table" with naive implementations
- For developing algorithms that will work with large datasets
- To experiment with different chunk sizes for your specific CPU architecture
You can customize the chunk size with the `--chunk-size` parameter (in KB):

```bash
./band.py --enable-chunking --chunk-size 256
```

Different CPUs have different cache sizes and behaviors. Experimenting with chunk sizes can reveal the optimal size for your specific hardware:
- Values smaller than your L1 cache (typically 32-64KB per core) might show best performance
- Values that fit in L2 cache (typically 256KB-1MB per core) often perform well
- Values that exceed L3 cache (typically 2-32MB shared) will approach standard STREAM performance
Observing how performance changes with different chunk sizes can provide insights into your CPU's cache hierarchy and help you optimize real-world NumPy code.
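One way to run such an experiment outside BAND is to time the same chunked Triad at several chunk sizes; the following is a standalone sketch (a hypothetical helper, not part of BAND):

```python
import time
import numpy as np

def timed_chunked_triad(n, scalar, chunk_kb):
    """Return apparent bandwidth (GB/s) of a chunked Triad at one chunk size."""
    b, c, a = np.full(n, 2.0), np.full(n, 1.0), np.empty(n)
    step = chunk_kb * 1024 // 8            # float64 elements per chunk
    tmp = np.empty(step)                   # reusable temporary
    t0 = time.perf_counter()
    for s in range(0, n, step):
        e = min(s + step, n)
        np.multiply(c[s:e], scalar, out=tmp[:e - s])
        np.add(b[s:e], tmp[:e - s], out=a[s:e])
    elapsed = time.perf_counter() - t0
    return 3 * n * 8 / 1e9 / elapsed       # 2 reads + 1 write per element

# Sweep chunk sizes spanning typical L1, L2, and L3 capacities
for kb in (32, 256, 1024, 8192):
    print(f"{kb:>5} KB chunks: {timed_chunked_triad(10_000_000, 3.0, kb):.2f} GB/s")
```

Where the curve peaks and falls off hints at your CPU's cache boundaries.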
BAND offers built-in functionality to compare its results with the industry-standard STREAM benchmark written in C. This allows you to evaluate how the Python implementation performs relative to native code.
You can use the included setup_stream.sh script to download, compile, and run the STREAM benchmark:
```bash
# Download, compile and run STREAM-C
./setup_stream.sh
```

The script will:
- Download the STREAM benchmark source code
- Compile it with appropriate optimizations
- Run the benchmark
- Save the results to `stream_results.txt`
Alternatively, you can manually compile and run STREAM:
```bash
# Download and compile STREAM
git clone https://github.com/jeffhammond/STREAM.git
cd STREAM
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=10 stream.c -o stream_omp

# Run STREAM-C with multiple threads
export OMP_NUM_THREADS=4  # Adjust based on your system
./stream_omp > stream_results.txt
```

For more accurate and convenient comparison, BAND can automatically read STREAM-C results from an output file:
- Save STREAM-C results to a file:

  ```bash
  # Run STREAM-C and save output to a file
  export OMP_NUM_THREADS=4 && ./stream_omp > stream_results.txt
  ```

- Run BAND with the results file:

  ```bash
  ./band.py --stream-file stream_results.txt
  ```
This approach provides:
- Comprehensive comparison using all measured operations (Copy, Scale, Add, Triad)
- Automatic unit conversion (BAND reports in GB/s, STREAM-C in MB/s)
- Fair comparison using standard STREAM mode by default
To include cache-optimized implementations in the comparison:
```bash
./band.py --stream-file stream_results.txt --enable-chunking
```

BAND provides composite bandwidth metrics that estimate real-world performance for different application types by weighting individual test results:
This score estimates memory bandwidth for typical applications based on analysis of instruction mixes across various workloads:
General Score = (0.40 × Copy) + (0.25 × Scale) + (0.15 × Add) + (0.20 × Triad)
The weightings are derived from research on instruction frequencies in common applications, where:
- Copy operations represent about 40% of memory accesses
- Scale operations (multiply by scalar) represent about 25%
- Add operations represent about 15%
- Complex operations like Triad represent about 20%
This specialized metric estimates memory bandwidth for Large Language Model inference workloads, which have a distinct access pattern dominated by reads:
LLM Score = (0.90 × Copy) + (0.05 × Scale) + (0.025 × Add) + (0.025 × Triad)
LLMs are extremely read-heavy during inference, as they must retrieve vast numbers of parameters from memory while performing relatively fewer complex operations.
BAND also provides "adjusted" versions of both metrics where the Triad value is doubled to approximate the performance gap between Python and C implementations. This adjustment accounts for the observation that Python Triad implementations typically achieve around 50% of the performance of equivalent C implementations due to interpreter overhead.
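The weightings translate directly to code. Here is a short sketch evaluating both formulas, using the Py-STREAM values from the sample output in this document as illustrative inputs:

```python
def general_score(copy, scale, add, triad):
    """General application bandwidth score (weights from the formula above)."""
    return 0.40 * copy + 0.25 * scale + 0.15 * add + 0.20 * triad

def llm_score(copy, scale, add, triad):
    """LLM inference bandwidth score: heavily weighted toward reads (Copy)."""
    return 0.90 * copy + 0.05 * scale + 0.025 * add + 0.025 * triad

# Illustrative per-test results in GB/s
copy, scale, add, triad = 26.60, 17.42, 19.14, 10.22

print(f"General: {general_score(copy, scale, add, triad):.2f} GB/s")
print(f"LLM:     {llm_score(copy, scale, add, triad):.2f} GB/s")
# Adjusted variant: double the Triad term to approximate STREAM.C
print(f"General (adjusted): {general_score(copy, scale, add, 2 * triad):.2f} GB/s")
```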
```
BAND Effective Bandwidth Metrics:
-----------------------------------
Py-STREAM results:
- General application bandwidth score: 19.91 GB/s
- LLM bandwidth score: 24.31 GB/s

Py-STREAM with doubled Triad (to match STREAM.C):
- General application bandwidth score: 21.95 GB/s
- LLM bandwidth score: 24.56 GB/s

Calculation Explanation:
General score = (0.40 × Copy) + (0.25 × Scale) + (0.15 × Add) + (0.20 × Triad)
LLM score = (0.90 × Copy) + (0.05 × Scale) + (0.025 × Add) + (0.025 × Triad)

* Adjusted scores use doubled Triad values to approximate STREAM.C performance
* Weightings based on instruction mix analysis of typical applications
```
These metrics provide more meaningful estimations of how memory bandwidth will affect real application performance compared to looking at individual test results in isolation.
Below is an example of running BAND with STREAM-C comparison:
```
Reading STREAM benchmark results from stream_results.txt
STREAM-C results detected:
  Copy: 26224.70 MB/s
  Scale: 18053.20 MB/s
  Add: 20061.90 MB/s
  Triad: 20014.90 MB/s

BAND: Bandwidth Assessment for Native DDR
----------------------------------------
System: Linux x86_64
Processor: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
CPU Cores: 16
Memory: 62.7 GB
Test Size: 4.0 GB per test
Threads: 4
Iterations: 3

Running Py-STREAM Copy test with 4 threads, 4.0 GB total memory...
  Iteration 1/3... 26.45 GB/s
  Iteration 2/3... 26.63 GB/s
  Iteration 3/3... 26.56 GB/s
Py-STREAM Copy result: 26.60 GB/s (min: 26.56, max: 26.63)

[Additional test results omitted for brevity]

Results Summary
--------------
Py-STREAM Copy: 26.60 GB/s
Py-STREAM Scale: 17.42 GB/s
Py-STREAM Add: 19.14 GB/s
Py-STREAM Triad: 10.22 GB/s
Py-MEMCPY: 26.46 GB/s

Python vs C Comparison:
Py-STREAM Copy: 27238.40 MB/s (103.9% of STREAM.C Copy @ 26224.70 MB/s)
Py-STREAM Scale: 17837.08 MB/s (98.8% of STREAM.C Scale @ 18053.20 MB/s)
Py-STREAM Add: 19599.36 MB/s (97.7% of STREAM.C Add @ 20061.90 MB/s)
Py-STREAM Triad: 10465.28 MB/s (52.3% of STREAM.C Triad @ 20014.90 MB/s)
Py-MEMCPY: 27093.74 MB/s (103.3% of STREAM.C Copy @ 26224.70 MB/s)
```
When running with `--enable-chunking`, you'll see additional results:

```
Results Summary
--------------
Py-STREAM Copy: 26.60 GB/s
Py-STREAM Scale: 17.42 GB/s
Py-STREAM Add: 19.14 GB/s
Py-STREAM Triad: 10.22 GB/s
Py-Chunked Triad: 14.64 GB/s
Py-Combined Triad: 10.78 GB/s
Py-MEMCPY: 26.46 GB/s

Triad Implementation Comparison (vs Py-STREAM Triad):
Py-Chunked Triad: 14.64 GB/s (+43.3%)
Py-Combined Triad: 10.78 GB/s (+5.5%)

Python vs C Comparison:
[Same output as above plus the chunked implementations]
Py-Chunked Triad: 14991.36 MB/s (74.9% of STREAM.C Triad @ 20014.90 MB/s)
Py-Combined Triad: 11034.72 MB/s (55.1% of STREAM.C Triad @ 20014.90 MB/s)
```
Users may notice that the standard Py-STREAM Triad implementation typically achieves only 50-60% of the performance measured by STREAM.C, while simpler operations like Copy and Scale achieve 90-105% of STREAM.C performance. This significant difference in the Triad operation is worth explaining:
- Operation Complexity: The Triad operation (`a = b + scalar × c`) is more complex than Copy or Scale, involving multiple arrays, a multiplication, and an addition.
- NumPy Overhead: For complex operations, Python's NumPy introduces several layers of overhead:
  - Type checking and array shape validation
  - Temporary array creation for intermediate results
  - Python interpreter overhead for managing multiple operations
  - Memory allocation and garbage collection costs
- Memory Access Patterns: The Triad operation requires reading from two arrays (`b` and `c`) and writing to a third (`a`), creating more complex memory access patterns that expose inefficiencies in Python's memory management.
- Optimization Limitations: C compilers can perform low-level optimizations that aren't available to NumPy, such as:
  - Instruction-level parallelism
  - Register allocation optimization
  - Loop unrolling and vectorization tuned to the specific CPU architecture
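The temporary-array overhead described above can be seen directly: the natural NumPy Triad expression materializes intermediate arrays, while explicit output buffers avoid them (a hedged sketch for illustration; BAND's actual implementations may differ):

```python
import numpy as np

n = 1_000_000
scalar = 3.0
a, b, c = np.empty(n), np.full(n, 2.0), np.full(n, 1.0)

# Naive Triad: `scalar * c` allocates a full-size temporary array, and
# `b + tmp` allocates another before the result is copied into `a`
a[:] = b + scalar * c

# Temporary-free variant: reuse `a` itself as the output buffer
np.multiply(c, scalar, out=a)   # a = scalar * c
np.add(a, b, out=a)             # a = a + b  ->  a = b + scalar * c
```

Both variants compute the same result; the second simply avoids the extra allocations and the memory traffic they imply.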
This performance gap is an intrinsic characteristic when comparing Python to compiled C code for memory-intensive operations. It's not a flaw in the benchmark but rather an accurate reflection of the trade-offs between Python's ease of use and C's performance.
The Chunked Triad implementation (available with `--enable-chunking`) partially mitigates these issues through better cache utilization, achieving 70-85% of STREAM.C performance, demonstrating that algorithmic improvements can significantly narrow the gap.
This behavior is precisely why BAND is valuable: it helps Python developers understand the memory bandwidth limitations they might encounter in real-world applications and the potential gains from cache-friendly programming techniques.
To achieve the best memory bandwidth results, consider trying:
- Experiment with thread count
  - Match the number of threads to your CPU's memory channels for optimal results
  - Try powers of 2: `--threads 1`, `--threads 2`, `--threads 4`, `--threads 8`
- Try different chunk sizes
  - Smaller chunks (16-128KB) may work better on systems with small caches (with `--enable-chunking`)
  - Larger chunks (1-8MB) often work better on server-class hardware
  - Example: `--chunk-size 512` or `--chunk-size 4096`
- Optimize for your workload
  - Use `--triad-only` to focus on the most comprehensive test
  - Compare standard vs cache-optimized implementations with `--enable-chunking`
- System-level optimizations
  - Run with elevated process priority
  - Disable CPU frequency scaling
  - Close other memory-intensive applications
  - Try setting process affinity to specific NUMA nodes if applicable
- Memory configurations
  - Test with various memory configurations (dual vs. single channel)
  - Compare DIMM speeds and configurations if possible
For convenience, BAND includes a shell script to download, compile, and run the STREAM benchmark:
```bash
./setup_stream.sh
```

The script will:
- Check for required dependencies (gcc, git)
- Download the STREAM source code
- Compile it with appropriate optimizations
- Run the benchmark with your system's CPU core count
- Save the results to `stream_results.txt`
After running this script, you can use the STREAM results with BAND:

```bash
./band.py --stream-file stream_results.txt
```

Maintained by Kyle Fox (@kylefoxaustin).
This project is intended for educational and performance measurement purposes. Contributions, bug reports, and feature requests are welcome.
This project is licensed under the MIT License - see the LICENSE file for details.