From 8d59f13723bc8c67319b9a4bd789a509cf5bcd51 Mon Sep 17 00:00:00 2001
From: Claude
Date: Fri, 31 Oct 2025 17:04:26 +0000
Subject: [PATCH] Add comprehensive integration guide for LLM applications

Create INTEGRATION_GUIDE.md to help developers integrate the SSR
methodology into applications that generate LLM responses. The guide
covers:

- Core SSR concepts and mathematical foundations
- Implementation components (compute functions and the ResponseRater class)
- Multiple integration patterns (minimal math-only vs. full text-to-PMF)
- An example application with survey processing
- Performance optimization and parameter tuning
- Common pitfalls and a quick reference card

This guide is designed for LLMs or developers who want to borrow the
underlying logic for converting text responses into Likert scale
probability distributions without needing to explore the codebase in
depth.
---
 INTEGRATION_GUIDE.md | 527 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 527 insertions(+)
 create mode 100644 INTEGRATION_GUIDE.md

diff --git a/INTEGRATION_GUIDE.md b/INTEGRATION_GUIDE.md
new file mode 100644
index 0000000..93df8f2
--- /dev/null
+++ b/INTEGRATION_GUIDE.md
@@ -0,0 +1,527 @@
+# Semantic Similarity Rating (SSR) - Integration Guide for LLM Applications
+
+## Overview
+
+This guide explains how to integrate Semantic Similarity Rating (SSR) logic into applications that generate LLM responses and need to convert them into Likert scale probability distributions.
+
+**What SSR Does**: Converts free-text LLM responses (e.g., "I somewhat agree") into probability distributions across Likert scale points (e.g., `[0.05, 0.15, 0.35, 0.35, 0.10]` for a 5-point scale from "Strongly Disagree" to "Strongly Agree").
+
+**Why This Matters**: Instead of forcing LLMs to output single numeric ratings (1-5), SSR preserves uncertainty and nuance by quantifying the semantic similarity between the response and each scale point.
+
+---
+
+## Core Concepts
+
+### 1. The SSR Equation
+
+SSR computes a probability mass function (PMF) using cosine similarity between embeddings:
+
+```
+p[r] = (similarity[r] - min_similarity + ε * δ[r]) /
+       (sum_all_similarities - n_points * min_similarity + ε)
+```
+
+Where:
+- `similarity[r]`: Cosine similarity between the response embedding and reference statement `r`
+- `min_similarity`: Minimum similarity across all scale points (baseline subtraction)
+- `n_points`: Number of Likert scale points (typically 5)
+- `ε` (epsilon): Optional regularization parameter (default: 0.0)
+- `δ[r]`: Kronecker delta that is 1 for a single scale point and 0 elsewhere (see `compute.py`), so the numerators sum to the denominator and `p` remains a valid PMF
+
+**Key Insight**: Subtracting the minimum similarity creates a "relative similarity" that emphasizes distinctions between scale points.
+
+### 2. Temperature Scaling (Optional)
+
+After computing the base PMF, you can apply temperature scaling to control distribution sharpness:
+
+```
+p_scaled[i] = p[i]^(1/T) / sum(p[j]^(1/T) for all j)
+```
+
+- **T = 0**: One-hot encoding (argmax of probabilities)
+- **T = 1**: No scaling (identity)
+- **T > 1**: Softer distribution (more uniform)
+- **T < 1**: Sharper distribution (more peaked)
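+
+To make both formulas concrete, here is a minimal standalone sketch with made-up similarity values (the numbers are hypothetical, not produced by the library):
+
+```python
+import numpy as np
+
+# Hypothetical cosine similarities between one response and the 5 scale
+# points, already rescaled to [0, 1].
+sims = np.array([0.55, 0.60, 0.70, 0.78, 0.74])
+
+# SSR with epsilon = 0: subtract the per-response minimum, then normalize.
+p = (sims - sims.min()) / (sims.sum() - len(sims) * sims.min())
+print(p.round(3), p.sum())  # peaked around "Agree"; sums to 1.0
+
+# Temperature scaling: T < 1 sharpens the distribution, T > 1 flattens it.
+T = 0.5
+p_scaled = p ** (1 / T) / np.sum(p ** (1 / T))
+print(p_scaled.round(3))
+```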
+
+---
+
+## Implementation Components to Borrow
+
+### 1. **Core Math Functions** (`compute.py`)
+
+#### `response_embeddings_to_pmf(matrix_responses, matrix_likert_sentences, epsilon=0.0)`
+
+**Purpose**: Converts embeddings to probability distributions using the SSR equation.
+
+**Input**:
+- `matrix_responses`: numpy array of shape `(n_responses, embedding_dim)` - LLM response embeddings
+- `matrix_likert_sentences`: numpy array of shape `(embedding_dim, n_scale_points)` - Reference embeddings (transposed)
+- `epsilon`: Regularization parameter (default: 0.0)
+
+**Output**:
+- numpy array of shape `(n_responses, n_scale_points)` - Probability distributions
+
+**Algorithm** (a sketch of the implementation in `compute.py`):
+```python
+# 1. L2-normalize embeddings (rows of responses, columns of references)
+M_left = matrix_responses / np.linalg.norm(matrix_responses, axis=1, keepdims=True)
+M_right = matrix_likert_sentences / np.linalg.norm(matrix_likert_sentences, axis=0, keepdims=True)
+
+# 2. Compute cosine similarities, rescaled from [-1, 1] to [0, 1]
+cos = (1 + M_left @ M_right) / 2
+
+# 3. Subtract the per-response minimum similarity; epsilon enters through
+#    a Kronecker delta (1 for a single scale point, 0 elsewhere)
+cos_min = cos.min(axis=1, keepdims=True)
+numerator = cos - cos_min + epsilon * kronecker_delta
+
+# 4. Normalize so each row sums to 1
+denominator = cos.sum(axis=1, keepdims=True) - n_points * cos_min + epsilon
+pmf = numerator / denominator
+```
+
+#### `scale_pmf(pmf, temperature, max_temp=inf)`
+
+**Purpose**: Apply temperature scaling to control distribution sharpness.
+
+**Input**:
+- `pmf`: 1D array of probabilities
+- `temperature`: Scaling parameter (0 to inf)
+- `max_temp`: Optional ceiling on the temperature
+
+**Output**: Scaled PMF (still sums to 1)
+
+---
+
+### 2. **Orchestration Class** (`response_rater.py`)
+
+The `ResponseRater` class provides a higher-level interface that manages:
+- Multiple reference sets (different phrasings of Likert scales)
+- Automatic embedding computation
+- Reference set selection and averaging
+
+**Key Features for Integration**:
+
+#### Dual Operating Modes
+
+**Text Mode** (recommended for most applications):
+```python
+# Automatically computes embeddings using sentence-transformers
+rater = ResponseRater(df_references)  # No embedding column
+pmfs = rater.get_response_pmfs('set1', ["I agree", "Not sure"])
+```
+
+**Embedding Mode** (for custom embedding pipelines):
+```python
+# Uses pre-computed embeddings
+rater = ResponseRater(df_references_with_embeddings)
+pmfs = rater.get_response_pmfs('set1', embedding_matrix)
+```
+
+#### Reference Set Management
+
+```python
+# Use a specific reference set
+pmfs = rater.get_response_pmfs('set1', responses)
+
+# Average across all reference sets (more robust)
+pmfs = rater.get_response_pmfs('mean', responses)
+
+# Get a survey-level aggregate (average of the PMFs)
+survey_pmf = rater.get_survey_response_pmf(pmfs)
+```
+
+---
+
+## Data Structures
+
+### Reference Sentences DataFrame
+
+**Required Structure** (Polars DataFrame, or convert from pandas):
+
+```python
+import polars as po
+
+df_references = po.DataFrame({
+    'id': ['set1', 'set1', 'set1', 'set1', 'set1',   # Reference set ID
+           'set2', 'set2', 'set2', 'set2', 'set2'],
+    'int_response': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],  # Must be 1-5 for each set
+    'sentence': [
+        # Set 1 (formal phrasing)
+        'Strongly disagree', 'Disagree', 'Neutral',
+        'Agree', 'Strongly agree',
+        # Set 2 (informal phrasing)
+        'Disagree a lot', 'Kinda disagree', 'Don\'t know',
+        'Kinda agree', 'Agree a lot'
+    ]
+})
+```
+
+**Validation Requirements**:
+- Each reference set (`id`) must have exactly 5 sentences
+- `int_response` must be `[1, 2, 3, 4, 5]` for each set
+- The ID `'mean'` cannot be used (reserved for averaging)
+- Optional `embedding` column for pre-computed embeddings
+
+---
+
+## Integration Workflow
+
+### Minimal Integration (Just the Math)
+
+If your app already has embeddings:
+
+```python
+import numpy as np
+from semantic_similarity_rating.compute import response_embeddings_to_pmf, scale_pmf
+
+# Your app generates these
+llm_response_embeddings = np.array([[...], [...]])       # Shape: (n_responses, 384)
+reference_embeddings = np.array([[...], [...], ...]).T   # Shape: (384, 5) - TRANSPOSED!
+
+# Convert to PMFs
+pmfs = response_embeddings_to_pmf(llm_response_embeddings, reference_embeddings)
+
+# Optional: Apply temperature scaling
+pmfs_scaled = np.array([scale_pmf(pmf, temperature=0.8) for pmf in pmfs])
+
+# Get the survey aggregate
+survey_pmf = pmfs_scaled.mean(axis=0)
+```
+
+**Key Detail**: Reference embeddings must be **transposed** (shape: `embedding_dim x n_scale_points`).
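+
+Because the transposed shape is easy to get wrong, a small guard can fail fast. This is a hypothetical helper, not part of the library:
+
+```python
+import numpy as np
+
+def check_ssr_shapes(responses: np.ndarray, references: np.ndarray,
+                     n_points: int = 5) -> None:
+    """Sanity-check the shapes expected by response_embeddings_to_pmf."""
+    if references.shape == (n_points, responses.shape[1]):
+        raise ValueError("reference matrix is (n_points, dim); pass its transpose")
+    if references.shape[1] != n_points:
+        raise ValueError(f"expected {n_points} scale points, got {references.shape[1]}")
+    if responses.shape[1] != references.shape[0]:
+        raise ValueError(f"embedding dims differ: responses have "
+                         f"{responses.shape[1]}, references have {references.shape[0]}")
+```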
+
+---
+
+### Full Integration (Text to PMF)
+
+If your app generates text responses:
+
+```python
+import polars as po
+from semantic_similarity_rating import ResponseRater
+
+# 1. Set up reference sentences (one-time setup)
+df_references = po.DataFrame({
+    'id': ['likert_v1'] * 5,
+    'int_response': [1, 2, 3, 4, 5],
+    'sentence': [
+        'Strongly disagree',
+        'Disagree',
+        'Neutral',
+        'Agree',
+        'Strongly agree'
+    ]
+})
+
+# 2. Initialize the rater (loads the sentence-transformer model)
+rater = ResponseRater(df_references, model_name='all-MiniLM-L6-v2')
+
+# 3. Your app generates LLM responses
+llm_responses = [
+    "I completely agree with this",
+    "I'm not really sure about this",
+    "I strongly disagree"
+]
+
+# 4. Convert to PMFs
+pmfs = rater.get_response_pmfs(
+    reference_set_id='likert_v1',
+    llm_responses=llm_responses,
+    temperature=1.0,
+    epsilon=0.0
+)
+
+# 5. Get the aggregate survey response
+survey_pmf = rater.get_survey_response_pmf(pmfs)
+
+print("Individual PMFs:")
+print(pmfs)        # Shape: (3, 5)
+print("\nSurvey-level PMF:")
+print(survey_pmf)  # Shape: (5,)
+```
+
+---
+
+## Advanced: Multiple Reference Sets
+
+Using multiple phrasings improves robustness:
+
+```python
+df_references = po.DataFrame({
+    'id': ['formal'] * 5 + ['casual'] * 5 + ['academic'] * 5,
+    'int_response': [1, 2, 3, 4, 5] * 3,
+    'sentence': [
+        # Formal
+        'Strongly disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly agree',
+        # Casual
+        'Totally disagree', 'Kinda disagree', 'Meh', 'Kinda agree', 'Totally agree',
+        # Academic
+        'Reject entirely', 'Reject partially', 'Withhold judgment',
+        'Accept partially', 'Accept entirely'
+    ]
+})
+
+rater = ResponseRater(df_references)
+
+# Use a specific set
+pmfs_formal = rater.get_response_pmfs('formal', responses)
+
+# Average across all sets (recommended for robustness)
+pmfs_averaged = rater.get_response_pmfs('mean', responses)
+```
+
+---
+
+## Technical Considerations
+
+### 1. **Embedding Model Selection**
+
+**Default**: `all-MiniLM-L6-v2` (sentence-transformers)
+- **Pros**: Fast, lightweight (80MB), good general performance
+- **Embedding dim**: 384
+- **Max tokens**: 256
+
+**Alternatives**:
+- `all-mpnet-base-v2`: Better quality, slower (420MB, 768-dim)
+- `paraphrase-multilingual-mpnet-base-v2`: Multilingual support
+- Custom models via sentence-transformers or the OpenAI/Cohere APIs
+
+**Integration Tip**: If using custom embeddings (OpenAI, Cohere, etc.), use **embedding mode**. For example, with the OpenAI Python SDK (the model name is illustrative):
+
+```python
+from openai import OpenAI
+
+client = OpenAI()
+sentences = ['Strongly disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly agree']
+
+# Pre-compute reference embeddings with your provider
+result = client.embeddings.create(model='text-embedding-3-small', input=sentences)
+reference_embeddings = [item.embedding for item in result.data]
+
+df_with_embeddings = po.DataFrame({
+    'id': ['set1'] * 5,
+    'int_response': [1, 2, 3, 4, 5],
+    'sentence': sentences,
+    'embedding': reference_embeddings  # List of lists or arrays
+})
+
+rater = ResponseRater(df_with_embeddings)  # Auto-detects embedding mode
+```
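+
+In embedding mode, the response embeddings must come from the same model as the reference embeddings so the vectors share one space. Continuing the hypothetical sketch above:
+
+```python
+import numpy as np
+
+llm_responses = ["I completely agree with this", "I'm not really sure about this"]
+
+# Embed the responses with the same provider and model as the references
+result = client.embeddings.create(model='text-embedding-3-small', input=llm_responses)
+response_matrix = np.array([item.embedding for item in result.data])
+
+pmfs = rater.get_response_pmfs('set1', response_matrix)  # embedding mode
+```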
+
+### 2. **Performance Optimization**
+
+- **Batch Processing**: `model.encode()` is vectorized - pass lists, don't loop
+- **GPU Acceleration**: Set `device='cuda'` in the ResponseRater constructor
+- **Caching**: Pre-compute reference embeddings once and reuse them across requests
+- **Embedding Size**: Smaller models (384-dim) are roughly 2-3x faster than larger ones (768-dim)
+
+### 3. **Parameter Tuning**
+
+**Epsilon (ε)**:
+- **Default**: 0.0 (no regularization)
+- **Use case**: Prevent numerical instability when all similarities are equal
+- **Typical range**: 0.0 to 0.01
+- **Effect**: Adds a small probability offset (via the Kronecker delta in `compute.py`) so the numerator cannot collapse to all zeros
+
+**Temperature (T)**:
+- **Default**: 1.0 (no scaling)
+- **T < 1**: Sharper distributions (more confident)
+- **T > 1**: Softer distributions (less confident)
+- **T = 0**: One-hot encoding (forces a single choice)
+- **Typical range**: 0.5 to 2.0
+
+### 4. **Edge Cases**
+
+**Empty responses**:
+```python
+# Returns an empty array
+pmfs = response_embeddings_to_pmf(np.empty((0, 384)), reference_matrix)
+# Shape: (0, 5)
+```
+
+**Identical embeddings**:
+- All similarities equal → the PMF falls back to the uniform distribution `[0.2, 0.2, 0.2, 0.2, 0.2]`
+- A small ε keeps the division well-defined in this otherwise degenerate (0/0) case
+
+**Temperature = 0 with ties**:
+- If multiple scale points share the maximum probability → returns the original PMF
+- Otherwise → one-hot at the argmax
+
+---
+
+## Example Application Integration
+
+### Survey Application Flow
+
+```python
+import numpy as np
+
+from semantic_similarity_rating import ResponseRater
+
+
+class SurveyProcessor:
+    def __init__(self, reference_sentences_df):
+        self.rater = ResponseRater(
+            reference_sentences_df,
+            model_name='all-MiniLM-L6-v2',
+            device='cpu'  # or 'cuda'
+        )
+
+    def process_survey_question(
+        self,
+        question: str,
+        llm_responses: list[str],
+        temperature: float = 1.0
+    ) -> dict:
+        """
+        Process LLM responses for a single survey question.
+
+        Returns:
+            {
+                'individual_pmfs': array of shape (n_responses, 5),
+                'survey_pmf': array of shape (5,),
+                'expected_value': float (1-5 scale),
+                'distribution_entropy': float
+            }
+        """
+        # Get PMFs (averaged across reference sets for robustness)
+        pmfs = self.rater.get_response_pmfs(
+            reference_set_id='mean',
+            llm_responses=llm_responses,
+            temperature=temperature
+        )
+
+        # Aggregate to the survey level
+        survey_pmf = self.rater.get_survey_response_pmf(pmfs)
+
+        # Compute summary statistics
+        scale_points = np.array([1, 2, 3, 4, 5])
+        expected_value = np.dot(survey_pmf, scale_points)
+        entropy = -np.sum(survey_pmf * np.log(survey_pmf + 1e-10))
+
+        return {
+            'individual_pmfs': pmfs,
+            'survey_pmf': survey_pmf,
+            'expected_value': expected_value,
+            'distribution_entropy': entropy
+        }
+
+    def process_full_survey(
+        self,
+        questions: list[str],
+        responses_per_question: list[list[str]]
+    ) -> list[dict]:
+        """Process all questions in a survey."""
+        return [
+            self.process_survey_question(q, responses)
+            for q, responses in zip(questions, responses_per_question)
+        ]
+```
+
+**Usage**:
+```python
+# Setup
+processor = SurveyProcessor(df_references)
+
+# Your app generates these
+llm_responses = [
+    "I think this is pretty good",
+    "Not convinced about this",
+    "Absolutely love it!"
+]
+
+# Process
+results = processor.process_survey_question(
+    question="How satisfied are you with the product?",
+    llm_responses=llm_responses,
+    temperature=1.0
+)
+
+print(f"Expected rating: {results['expected_value']:.2f}/5.0")
+print(f"Distribution: {results['survey_pmf']}")
+print(f"Uncertainty (entropy): {results['distribution_entropy']:.3f}")
+```
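+
+If you need a single discrete rating per response instead of a distribution, one option is to reuse the processor's rater with `temperature=0.0`, which forces one-hot PMFs as described under Parameter Tuning (a hypothetical sketch; the printed values depend on the embedding model):
+
+```python
+# Hard 1-5 classifications via temperature = 0 (one-hot PMFs)
+hard_pmfs = processor.rater.get_response_pmfs(
+    reference_set_id='mean',
+    llm_responses=llm_responses,
+    temperature=0.0
+)
+ratings = hard_pmfs.argmax(axis=1) + 1  # map column indices back to the 1-5 scale
+print(ratings)  # e.g., [4, 2, 5]
+```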
+] + +# Process +results = processor.process_survey_question( + question="How satisfied are you with the product?", + llm_responses=llm_responses, + temperature=1.0 +) + +print(f"Expected rating: {results['expected_value']:.2f}/5.0") +print(f"Distribution: {results['survey_pmf']}") +print(f"Uncertainty (entropy): {results['distribution_entropy']:.3f}") +``` + +--- + +## Dependencies + +**Minimal** (just the math): +``` +numpy>=1.24.0 +scipy>=1.10.0 +``` + +**Full** (with text embedding): +``` +numpy>=1.24.0 +scipy>=1.10.0 +polars>=0.20.0 +sentence-transformers>=2.2.0 +beartype>=0.15.0 +``` + +**Installation**: +```bash +pip install numpy scipy polars sentence-transformers beartype +``` + +Or install from this repository: +```bash +pip install git+https://github.com/pymc-labs/semantic-similarity-rating.git +``` + +--- + +## Key Files to Reference + +- **`semantic_similarity_rating/compute.py`** (123 lines) + Core mathematical functions - can be extracted as standalone module + +- **`semantic_similarity_rating/response_rater.py`** (368 lines) + Orchestration layer - adapt for your application's needs + +- **`tests/test_compute.py`** (8KB) + Test cases showing expected behavior and edge cases + +- **`tests/test_response_rater.py`** (9KB) + Integration test examples + +--- + +## Citation + +If you use this methodology, please cite: + +``` +Maier, B. F., Aslak, U., Fiaschi, L., Pappas, K., Wiecki, T. (2025). +Measuring Synthetic Consumer Purchase Intent Using Semantic-Similarity Ratings. +``` + +--- + +## License + +MIT License - Free to use and modify for commercial and non-commercial applications. + +--- + +## Quick Reference Card + +| **Task** | **Code** | +|----------|----------| +| Convert text responses to PMFs | `rater.get_response_pmfs('set1', ["response1", "response2"])` | +| Use averaged reference sets | `rater.get_response_pmfs('mean', responses)` | +| Apply temperature scaling | `rater.get_response_pmfs('set1', responses, temperature=0.8)` | +| Get survey aggregate | `rater.get_survey_response_pmf(pmfs)` | +| Just the math (embeddings → PMF) | `compute.response_embeddings_to_pmf(resp_emb, ref_emb)` | +| Scale existing PMF | `compute.scale_pmf(pmf, temperature=0.5)` | +| Use custom embedding model | Include `'embedding'` column in DataFrame | +| Check available reference sets | `rater.available_reference_sets` | +| Get model info | `rater.model_info` | + +--- + +## Common Pitfalls + +1. **Reference embedding shape**: Must be `(embedding_dim, 5)` not `(5, embedding_dim)` - transpose if needed! +2. **Reference dataframe validation**: Each set needs exactly 5 sentences with `int_response` = [1, 2, 3, 4, 5] +3. **Mode confusion**: Text mode expects `list[str]`, embedding mode expects `np.ndarray` +4. **Temperature = 0**: Only use if you want hard classification (one-hot output) +5. **Polars vs Pandas**: Use `polars.DataFrame` or convert: `po.from_pandas(df)` + +--- + +## Support + +- **Original Repository**: https://github.com/pymc-labs/semantic-similarity-rating +- **Paper**: Maier et al. (2025) +- **License**: MIT