69 changes: 69 additions & 0 deletions sql-cli/analyze_memory.md
@@ -0,0 +1,69 @@
# Memory Usage Analysis for 100k Row CSV

## Current Data Duplication Issue

When loading a 100k-row CSV file, the same data ends up stored in up to four places (sketched after the list below):

### 1. CsvDataSource (src/data/csv_datasource.rs)
- Stores data as `Vec<serde_json::Value>`
- Each row is a JSON object, so every field name is repeated in every row

### 2. QueryResponse (src/api_client.rs)
- Contains `data: Vec<Value>` - another copy of the JSON data
- Stored in Buffer.results

### 3. Buffer.filtered_data (optional)
- When filtering: `Vec<Vec<String>>` - string representation of filtered rows

### 4. Buffer.cached_data (optional)
- Another `Vec<serde_json::Value>` for caching

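Taken together, the duplication looks roughly like this (a minimal sketch; the struct shapes are paraphrased from the descriptions above, not the exact definitions):

```rust
use serde_json::Value;

// Paraphrased shapes only: each layer holds its own full copy of the rows.
struct CsvDataSource {
    rows: Vec<Value>, // copy 1: parsed CSV, field names repeated per row
}

struct QueryResponse {
    data: Vec<Value>, // copy 2: query results
}

struct Buffer {
    results: QueryResponse,                  // owns copy 2
    filtered_data: Option<Vec<Vec<String>>>, // copy 3: stringified filtered rows
    cached_data: Option<Vec<Value>>,         // copy 4: cache
}
```
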
## Memory Overhead Calculation

For a typical trade record with 7 fields:
```
{
"id": 12345,
"symbol": "AAPL",
"price": 150.25,
"quantity": 100,
"timestamp": "2024-01-15T10:30:00Z",
"side": "BUY",
"exchange": "NASDAQ"
}
```

### JSON Object Overhead:
- Field names: ~50 bytes × 100k rows = 5MB
- serde_json::Value enum tags: 8 bytes × 7 fields × 100k = 5.6MB
- HashMap overhead: ~40 bytes × 100k = 4MB
- String allocations: Each string value has its own allocation
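
Spelling the arithmetic out (the per-item byte counts are the rough assumptions above, not measurements):

```rust
fn main() {
    const ROWS: usize = 100_000;
    const FIELDS: usize = 7;

    let field_names = 50 * ROWS;       // repeated JSON keys: ~5.0 MB
    let enum_tags = 8 * FIELDS * ROWS; // Value discriminants: ~5.6 MB
    let map_overhead = 40 * ROWS;      // per-row map bookkeeping: ~4.0 MB

    let total = field_names + enum_tags + map_overhead;
    println!("structural overhead ≈ {:.1} MB per copy", total as f64 / 1e6);
    // ≈ 14.6 MB per copy, before counting the string payloads themselves
}
```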

### Total Memory Usage:
- Raw data: ~100 bytes × 100k = 10MB
- JSON representation: ~300 bytes × 100k = 30MB
- Multiple copies: 30MB × 2-3 = 60-90MB minimum
- Plus heap fragmentation and allocator overhead

**Result: 10MB of actual data becomes 100MB+ in memory**

## Solution Options

### Short-term Fix (V46)
1. Remove duplicate storage of cached_data when not needed
2. Use indices instead of copying filtered data
3. Clear unused data after loading

### Long-term Fix (V50+)
1. Migrate to DataTable with columnar storage (see the sketch below)
2. Store data only once in efficient format
3. Use views/indices for filtering and sorting
4. Lazy loading for large datasets
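
A minimal sketch of what columnar storage buys; all type and field names here are hypothetical, not the actual V46 definitions:

```rust
// Field names live once in the schema; each column is one contiguous,
// typed allocation with no per-value enum tag and no per-row map.
enum Column {
    Integer(Vec<i64>),
    Float(Vec<f64>),
    Text(Vec<String>),
    Boolean(Vec<bool>),
}

struct DataTable {
    column_names: Vec<String>, // "id", "symbol", ... stored once, not per row
    columns: Vec<Column>,      // one typed vector per column
}

// A sort or filter becomes a view: a list of row indices into the
// table, rather than another copy of the data.
struct DataView {
    visible_rows: Vec<usize>,
}
```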

## Immediate Recommendation

For V46, we should:
1. Avoid storing `cached_data` unless actually caching
2. Use filter indices instead of `filtered_data` copies (sketched below)
3. Implement streaming for large CSV files
4. Consider compression for string columns
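
A minimal sketch of index-based filtering (the helper name is hypothetical): the original rows stay untouched, and only the indices of matching rows are stored.

```rust
use serde_json::{json, Value};

// Keep one copy of the rows; a filter result is just Vec<usize>.
fn filter_indices(rows: &[Value], predicate: impl Fn(&Value) -> bool) -> Vec<usize> {
    rows.iter()
        .enumerate()
        .filter(|(_, row)| predicate(row))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let rows = vec![
        json!({"symbol": "AAPL", "quantity": 100}),
        json!({"symbol": "MSFT", "quantity": 50}),
    ];
    // Costs ~8 bytes per matching row instead of a full stringified copy.
    let visible = filter_indices(&rows, |r| r["quantity"].as_i64().unwrap_or(0) > 50);
    assert_eq!(visible, vec![0]);
}
```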
43 changes: 37 additions & 6 deletions sql-cli/integration_tests/README.md
@@ -2,9 +2,18 @@

This directory contains all integration and test files for the SQL CLI project.

## Directory Structure

```
integration_tests/
├── test_scripts/ # Shell scripts for testing features
├── test_data/ # CSV and other data files for tests
└── *.rs # Rust integration test files
```

## Organization

### Shell Scripts (*.sh)
### Shell Scripts (`test_scripts/`)
- `test_all_fixes.sh` - Comprehensive test suite for all fixes
- `test_buffer_switch.sh` - Tests buffer switching functionality
- `test_column_search.sh` - Tests column search feature
@@ -61,20 +70,42 @@ These are standalone test programs that can be compiled and run individually:
- `test_history_debug.rs`, `test_history_unit.rs` - History system tests
- `test_state_init.rs` - State initialization

### Test Data (`test_data/`)
- Sample CSV files with various data types and structures
- Query result exports for regression testing
- Test fixtures for specific scenarios

## Running Tests

### Shell Scripts
```bash
cd integration_tests
./test_history_search.sh # or any other .sh file
# From project root
./integration_tests/test_scripts/test_history_search.sh

# Or for version-specific tests
./integration_tests/test_scripts/test_v46_datatable.sh
```

### Rust Test Files
```bash
cd integration_tests
rustc test_csv.rs && ./test_csv # Compile and run individual test
# Run all integration tests
cargo test --test '*'

# Run specific test
cargo test --test test_csv

# With debug output
RUST_LOG=debug cargo test --test test_name -- --nocapture
```

## Version Tests

Tests are versioned to match our DataTable migration strategy:
- **V40-V45**: Trait-based migration (✅ complete)
- **V46-V50**: DataTable introduction (🚧 in progress)
- **V51-V60**: DataView implementation (📋 planned)
- **V61-V70**: Full migration completion (📋 planned)

## Note
These tests were moved from the main project directory to keep it clean and organized.
All paths in the test files assume they're run from the integration_tests directory.
Test scripts may need path adjustments if test data locations have changed.
6 changes: 6 additions & 0 deletions sql-cli/integration_tests/test_data/test_datatable.csv
@@ -0,0 +1,6 @@
id,name,age,salary,active,joined_date
1,Alice,30,75000.50,true,2020-01-15
2,Bob,25,60000.00,false,2021-03-20
3,Charlie,35,85000.75,true,2019-06-01
4,Diana,28,70000.25,true,2022-02-10
5,Eve,32,,false,2020-11-30
36 changes: 36 additions & 0 deletions sql-cli/integration_tests/test_memory.rs
@@ -0,0 +1,36 @@
use serde_json::json;

fn main() {
    // Test memory usage of serde_json::Value
    let json_val = json!({
        "id": 12345,
        "symbol": "AAPL",
        "price": 150.25,
        "quantity": 100,
        "timestamp": "2024-01-15T10:30:00Z",
        "side": "BUY",
        "exchange": "NASDAQ"
    });

    // Note: size_of_val reports only the stack size of the Value enum
    // itself, not the heap-allocated map, keys, and strings behind it.
    println!(
        "Size of serde_json::Value: {} bytes",
        std::mem::size_of_val(&json_val)
    );

    // String version
    let str_vec = vec![
        "12345".to_string(),
        "AAPL".to_string(),
        "150.25".to_string(),
        "100".to_string(),
        "2024-01-15T10:30:00Z".to_string(),
        "BUY".to_string(),
        "NASDAQ".to_string(),
    ];

    // Likewise, this is the Vec header only, not the strings it owns.
    println!("Size of Vec<String>: {} bytes", std::mem::size_of_val(&str_vec));

    // Actual string content size
    let json_str = serde_json::to_string(&json_val).unwrap();
    println!("JSON string length: {} bytes", json_str.len());

    let total_str_len: usize = str_vec.iter().map(|s| s.len()).sum();
    println!("Total string content: {} bytes", total_str_len);
}
56 changes: 56 additions & 0 deletions sql-cli/integration_tests/test_scripts/test_v46_datatable.sh
@@ -0,0 +1,56 @@
#!/bin/bash

# Test script for V46: DataTable Introduction

echo "Testing V46: DataTable Introduction"
echo "===================================="

# Build the project
echo "Building project..."
# Filter the build output, but check cargo's own exit code: $? here would
# report grep's status, not the build's.
cargo build --release 2>&1 | grep -E "error|warning|Finished"
build_status=${PIPESTATUS[0]}

if [ "$build_status" -ne 0 ]; then
    echo "Build failed!"
    exit 1
fi

echo ""
echo "Test: DataTable Conversion Demo"
echo "--------------------------------"
echo "1. Load a CSV file"
echo "2. Press F6 to convert current results to DataTable"
echo "3. Check status message for memory comparison"
echo "4. Check debug logs for column type information"
echo ""

# Create test data
cat > test_datatable.csv << EOF
id,name,age,salary,active,joined_date
1,Alice,30,75000.50,true,2020-01-15
2,Bob,25,60000.00,false,2021-03-20
3,Charlie,35,85000.75,true,2019-06-01
4,Diana,28,70000.25,true,2022-02-10
5,Eve,32,,false,2020-11-30
EOF

echo "Test data created: test_datatable.csv"
echo ""
echo "Running tests..."

# Test DataTable conversion
cargo test --lib data::datatable::tests::test_from_query_response -- --nocapture 2>&1 | grep -E "test result|V46"

echo ""
echo "Instructions for manual testing:"
echo "1. Run: RUST_LOG=debug ./target/release/sql-cli test_datatable.csv"
echo "2. After data loads, press F6"
echo "3. Look for 'V46: DataTable created!' in status bar"
echo "4. Check debug logs (F5) for detailed column information"
echo ""
echo "Expected behavior:"
echo "- Status shows memory comparison (JSON vs DataTable)"
echo "- Debug logs show column types (Integer, String, Float, Boolean, DateTime)"
echo "- Memory usage should be lower for DataTable"
echo ""
echo "===================================="
echo "V46 DataTable Introduction Test Complete!"