feat(contrib): Add SemanticCacheProcessor for semantic similarity caching #46
Karanjot786 wants to merge 6 commits into google-gemini:main from
Conversation
Summary of Changes

Hello @Karanjot786, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new SemanticCacheProcessor to the genai-processors library, changing how LLM responses can be cached. Instead of relying on brittle exact-match caching, this processor leverages vector embeddings and cosine similarity to identify and serve responses for semantically similar queries. This caching mechanism is designed to cut down on redundant API calls, reduce operational costs, and improve response times for users of LLM-powered applications.

Highlights
Code Review
This pull request introduces a SemanticCacheProcessor to cache LLM responses based on semantic similarity, which is a great feature for reducing API calls and latency. The implementation is well-structured with a clear separation of concerns between the embedding client, cache storage, and the processor logic. The addition of comprehensive documentation and a thorough test suite is commendable.
My review includes a few suggestions for minor improvements, such as removing unused code, simplifying part creation for consistency, and correcting a potential typo in the documentation.
… SemanticCacheProcessor
Hi @kibergus, when you have a moment, could you please review this PR? Thanks!
hit_count: int = 0
metadata: dict[str, Any] = dataclasses.field(default_factory=dict)

def get_response_parts(self) -> list[content_api.ProcessorPart]:
You now have the to_dict and from_dict methods in ProcessorPart; you could use them directly. Not sure this is in the latest packaged version, but we plan to release the new one very soon, so best to use those.
Thanks for pointing this out. I'll switch to ProcessorPart.to_dict() and ProcessorPart.from_dict() directly. Will update once the new release lands.
a = np.array(vec1, dtype=np.float32)
b = np.array(vec2, dtype=np.float32)

dot_product = np.dot(a, b)
nit: compute it directly in the return statement; no need to compute it if the norm is zero.
Good catch. I'll move the computation into the return statement and skip it entirely when norm is zero.
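The agreed change can be sketched as follows. This is a self-contained stand-in for the helper in the diff, assuming the surrounding function matches the snippet above:

```python
import numpy as np

def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
    """Cosine similarity with the dot product moved into the return."""
    a = np.array(vec1, dtype=np.float32)
    b = np.array(vec2, dtype=np.float32)
    norm_a = float(np.linalg.norm(a))
    norm_b = float(np.linalg.norm(b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Zero-norm vectors carry no direction; skip the dot product entirely.
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))
```

With this shape the dot product is only evaluated on the non-degenerate path, which is exactly the reviewer's nit.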
return float(dot_product / (norm_a * norm_b))
def _serialize_part(part: content_api.ProcessorPart) -> dict[str, Any]:
no need for it: ProcessorPart.to_dict should work here (works for multi-modal data, btw).
I'll remove _serialize_part and use ProcessorPart.to_dict() instead. Cleaner and handles multi-modal data too.
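The round-trip the review converges on can be sketched with a hypothetical stand-in for content_api.ProcessorPart (the real class already provides to_dict/from_dict, per the reviewer; the `Part` class below only illustrates the pattern):

```python
import dataclasses
from typing import Any

# Hypothetical stand-in for content_api.ProcessorPart, used only to show
# serializing with to_dict() and restoring with from_dict() instead of a
# bespoke _serialize_part helper.
@dataclasses.dataclass
class Part:
    text: str
    mimetype: str = "text/plain"

    def to_dict(self) -> dict[str, Any]:
        return dataclasses.asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Part":
        return cls(**data)

def serialize_parts(parts: list[Part]) -> list[dict[str, Any]]:
    # Cache entries store plain dicts, so they can be JSON-encoded later.
    return [p.to_dict() for p in parts]

def deserialize_parts(records: list[dict[str, Any]]) -> list[Part]:
    return [Part.from_dict(r) for r in records]
```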
| """ | ||
| # Extract text from all parts | ||
| text_parts = [] | ||
| for part in content.all_parts: |
raise an exception if you have non-text content?
We could generate a cache miss whenever the prompt contains non-textual data.
Makes sense. I'll add a check: if the prompt contains non-textual parts, we generate a cache miss. No silent failures.
embedding: list[float],
threshold: float,
limit: int = 1,
) -> SimilaritySearchResult | None:
needs to be a list if limit > 1
I'll change the return type to list[SimilaritySearchResult] so it properly supports limit > 1.
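A minimal sketch of the revised signature; the similarity scan over stored embeddings is simplified here to a precomputed key-to-score map, which is an assumption for illustration:

```python
import dataclasses

@dataclasses.dataclass
class SimilaritySearchResult:
    key: str
    score: float

def find_similar(
    scores: dict[str, float],
    threshold: float,
    limit: int = 1,
) -> list[SimilaritySearchResult]:
    """Return up to `limit` matches at or above `threshold`, best first.

    Returning a (possibly empty) list instead of
    `SimilaritySearchResult | None` makes limit > 1 work naturally.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [
        SimilaritySearchResult(key, score)
        for key, score in ranked[:limit]
        if score >= threshold
    ]
```

A cache miss is then simply an empty list, and `limit=1` callers read `results[0]` after a truthiness check.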
streams.stream_content(input_parts)
):
yield part
return
not sure you'd need this - you could handle empty input_content inside the embed method, raising an exception, and the following block would apply.
I'll move the empty input handling into the embed method itself and raise an exception there. Removes the need for the early return block.
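The refactor agreed above can be sketched as follows; the embedding call itself is faked, since only the placement of the validation matters here:

```python
def embed(text: str) -> list[float]:
    """Hypothetical embedding client entry point.

    Per the review, empty input is rejected here rather than with an
    early `return` in the processor's call path.
    """
    if not text.strip():
        raise ValueError("cannot embed empty content")
    # Stand-in for the real embedding request.
    return [float(ord(c)) for c in text[:4]]
```

The processor can then drop its early-return block and let the exception propagate (or catch it and fall through to the wrapped processor).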
current_time = time.time()

for entry in self._entries.values():
you could put this into a separate thread with asyncio.to_thread(), as it might take a while; then the asyncio loop does not block on it.
I'll wrap the cleanup loop in asyncio.to_thread() so the event loop stays unblocked during eviction.
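The suggested change, sketched with a plain dict as the cache store (the entry layout and function names are assumptions for illustration):

```python
import asyncio
import time

def _find_expired(entries: dict[str, dict], ttl_seconds: float) -> list[str]:
    # The synchronous scan flagged in review: harmless when the cache is
    # small, but it can block the event loop when it grows large.
    current_time = time.time()
    return [
        key
        for key, entry in entries.items()
        if current_time - entry["created_at"] > ttl_seconds
    ]

async def cleanup_expired(entries: dict[str, dict], ttl_seconds: float) -> None:
    # Per review: run the scan in a worker thread via asyncio.to_thread()
    # so the asyncio loop stays responsive during eviction.
    expired = await asyncio.to_thread(_find_expired, entries, ttl_seconds)
    for key in expired:
        entries.pop(key, None)
```

Note the mutation still happens on the event loop; only the read-heavy scan is offloaded, which avoids locking concerns around the dict.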
If a match is found above the similarity threshold, returns the cached
response instead of calling the wrapped processor.

Reduces API costs and latency when similar queries are frequently repeated.
note: only works for turn-based processors, not for realtime ones.
Will add a clear note in the docstring: this processor works for turn-based use cases only, not realtime ones.
aelissee left a comment
Hi, thanks for the proposal. I'd let Kibergus check it as well, but here's a first quick pass at it.
@aelissee @kibergus Applied all the review feedback. Here is what changed:

Serialization:
Error handling:
API changes:
Performance:
Documentation:
Tests:
Karanjot, thank you for the contribution, but I think it would be better, and more maintainable, if it lives in a repository that you own. We can add a link to that repository to the list in contrib/README.md https://github.com/google-gemini/genai-processors/tree/main/genai_processors/contrib#readme, similar to the links to mbeacom's contributions. If it lives in our repository, we ourselves won't be able to provide much maintenance/support for it, and you will need to go through code review from us to make any change. This won't be convenient for either of us.

SemanticCacheProcessor explores an interesting idea, but is application-specific. It needs careful tuning to avoid quality degradation due to false positives. E.g. "Tell me the 1682 France's capital city" has a different answer from "Tell me France's capital city". It is hard to find a universal threshold. Another problem is that computing embeddings is not free. If users of the system don't ask similar questions really often, it can easily increase the cost of running.

The cache we currently have targets a somewhat different problem: allowing a previously interrupted operation to be resumed. It is handy when developing a multi-stage pipeline or if the pipeline takes a long time to complete. And in this case exact match can easily be achieved.
Summary

Adds SemanticCacheProcessor, a new contrib processor that caches LLM responses based on semantic similarity using vector embeddings. Unlike exact-match caching, this approach matches queries like "What is the capital of France?" and "Tell me France's capital city" to the same cached response.

Motivation

Current caching in genai-processors uses exact hash matching, which misses cache hits for semantically equivalent queries. This causes:

Changes

New Files

genai_processors/contrib/semantic_cache.py - Main implementation
genai_processors/contrib/semantic_cache.md - Documentation
genai_processors/contrib/tests/semantic_cache_test.py - Test suite (38 tests)

Modified Files

genai_processors/contrib/README.md - Added to processor list

Features

VectorCacheBase ABC for custom implementations

Usage