4 changes: 4 additions & 0 deletions .jules/bolt.md
@@ -57,3 +57,7 @@
## 2025-02-13 - [Substring pre-filtering for regex optimization]
**Learning:** In hot paths (like `PriorityEngine._calculate_urgency`), executing pre-compiled regular expressions (`re.search`) for simple keyword extraction or grouping (e.g., `\b(word1|word2)\b`) is significantly slower than plain Python substring checks (`in text`). The regex engine's per-call overhead in Python adds up in high-iteration loops like priority scoring.
**Action:** Always consider pre-extracting literal keywords from simple regex patterns and executing a quick `any(k in text for k in keywords)` pre-filter. Only invoke `regex.search` if the pre-filter passes, avoiding the expensive regex operation on texts that obviously do not match.
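
The pre-filter described above can be sketched as follows. The pattern and keywords here are made-up stand-ins; the actual `PriorityEngine` patterns are not shown in this diff:

```python
import re

# Hypothetical urgency pattern; the real patterns live in PriorityEngine.
URGENT_RE = re.compile(r"\b(urgent|hazard|danger)\b", re.IGNORECASE)

# Literal keywords pre-extracted from the pattern above, lowercased
# so the cheap substring check is case-insensitive too.
URGENT_KEYWORDS = ("urgent", "hazard", "danger")

def matches_urgency(text: str) -> bool:
    lowered = text.lower()
    # Cheap substring pre-filter: only fall through to the regex
    # engine when at least one literal actually appears in the text.
    if not any(k in lowered for k in URGENT_KEYWORDS):
        return False
    # The regex is still needed for word-boundary semantics:
    # "urgently" passes the substring check but fails \burgent\b.
    return URGENT_RE.search(text) is not None
```

The two-step check keeps the common case (no keyword present) on the fast path while preserving the exact match semantics of the original pattern.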

## 2026-02-14 - Stable Cryptographic Cache Keys
**Learning:** Python's built-in `hash()` is salted and non-deterministic across process restarts or different worker processes. Using `hash(image_bytes)` as a cache key in a multi-worker production environment (like Gunicorn/Uvicorn) results in a 0% hit rate across workers and process restarts.
**Action:** Always use stable cryptographic hashes like `hashlib.md5(data).hexdigest()` for cache keys involving binary data to ensure consistency across the entire application cluster.
Comment on lines +61 to +63

Copilot AI Mar 22, 2026
This note describes MD5 as a “stable cryptographic” hash and claims hash() causes a 0% hit rate “across workers”. In this codebase the detection cache is an in-memory, per-process ThreadSafeCache, so cache hits are not shared across workers regardless of key stability, and MD5 should not be described as cryptographically secure. Please reword this learning/action to focus on determinism/stability (not cryptographic strength) and avoid implying cross-worker cache sharing unless the cache is actually shared (e.g., Redis).

Suggested change
## 2026-02-14 - Stable Cryptographic Cache Keys
**Learning:** Python's built-in `hash()` is salted and non-deterministic across process restarts or different worker processes. Using `hash(image_bytes)` as a cache key in a multi-worker production environment (like Gunicorn/Uvicorn) results in a 0% hit rate across workers and process restarts.
**Action:** Always use stable cryptographic hashes like `hashlib.md5(data).hexdigest()` for cache keys involving binary data to ensure consistency across the entire application cluster.
## 2026-02-14 - Deterministic Cache Keys
**Learning:** Python's built-in `hash()` is salted and non-deterministic across process restarts and worker processes. Using `hash(image_bytes)` as a cache key means the same logical key can map to different values between processes or deployments, preventing effective reuse of cached results beyond a single process lifetime.
**Action:** Use a stable, deterministic hash function from `hashlib` (for example, `hashlib.md5(data).hexdigest()` or a stronger variant) when you need cache keys that remain consistent across restarts or processes. This is for key stability only and should not be relied on for cryptographic security.
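
As a minimal sketch of the determinism point above (the image bytes here are a made-up placeholder):

```python
import hashlib

# Stand-in for uploaded image bytes; any bytes work for the demonstration.
data = b"\x89PNG fake image payload"

# hashlib digests depend only on the input bytes, so the same image
# yields the same key in this process, in another worker, and after
# a restart.
k1 = hashlib.md5(data).hexdigest()
k2 = hashlib.md5(data).hexdigest()
assert k1 == k2

# By contrast, the built-in hash() is salted per interpreter (see
# PYTHONHASHSEED): stable within one process, but generally different
# between two worker processes, so it cannot key a shared cache.
```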

26 changes: 18 additions & 8 deletions backend/routers/detection.py
@@ -3,6 +3,7 @@
from PIL import Image
import logging
import time
import hashlib

from backend.utils import process_and_detect, validate_uploaded_file, process_uploaded_image
from backend.schemas import DetectionResponse, UrgencyAnalysisRequest, UrgencyAnalysisResponse
@@ -68,35 +69,44 @@ async def _get_cached_result(key: str, func, *args, **kwargs):
return result

async def _cached_detect_severity(image_bytes: bytes):
key = f"severity_{hash(image_bytes)}"
# Stable cache key using MD5 (hash() is unstable across processes)
image_hash = hashlib.md5(image_bytes).hexdigest()
key = f"severity_{image_hash}"
return await _get_cached_result(key, detect_severity_clip, image_bytes)

async def _cached_detect_smart_scan(image_bytes: bytes):
key = f"smart_scan_{hash(image_bytes)}"
image_hash = hashlib.md5(image_bytes).hexdigest()
key = f"smart_scan_{image_hash}"
return await _get_cached_result(key, detect_smart_scan_clip, image_bytes)
Comment on lines 71 to 80

Copilot AI Mar 22, 2026

The MD5-based cache-key generation is duplicated across each _cached_* helper. This repetition makes it easy for the prefixes and formatting to drift, and makes the hashing strategy harder to change later. Consider extracting a small helper (e.g., _image_cache_key(prefix, image_bytes)) and using it in all of these functions.

async def _cached_generate_caption(image_bytes: bytes):
key = f"caption_{hash(image_bytes)}"
image_hash = hashlib.md5(image_bytes).hexdigest()
key = f"caption_{image_hash}"
return await _get_cached_result(key, generate_image_caption, image_bytes)

async def _cached_detect_waste(image_bytes: bytes):
key = f"waste_{hash(image_bytes)}"
image_hash = hashlib.md5(image_bytes).hexdigest()
key = f"waste_{image_hash}"
return await _get_cached_result(key, detect_waste_clip, image_bytes)

async def _cached_detect_civic_eye(image_bytes: bytes):
key = f"civic_eye_{hash(image_bytes)}"
image_hash = hashlib.md5(image_bytes).hexdigest()
key = f"civic_eye_{image_hash}"
return await _get_cached_result(key, detect_civic_eye_clip, image_bytes)

async def _cached_detect_graffiti(image_bytes: bytes):
key = f"graffiti_{hash(image_bytes)}"
image_hash = hashlib.md5(image_bytes).hexdigest()
key = f"graffiti_{image_hash}"
return await _get_cached_result(key, detect_graffiti_art_clip, image_bytes)

async def _cached_detect_traffic_sign(image_bytes: bytes):
key = f"traffic_sign_{hash(image_bytes)}"
image_hash = hashlib.md5(image_bytes).hexdigest()
key = f"traffic_sign_{image_hash}"
return await _get_cached_result(key, detect_traffic_sign_clip, image_bytes)

async def _cached_detect_abandoned_vehicle(image_bytes: bytes):
key = f"abandoned_vehicle_{hash(image_bytes)}"
image_hash = hashlib.md5(image_bytes).hexdigest()
key = f"abandoned_vehicle_{image_hash}"
return await _get_cached_result(key, detect_abandoned_vehicle_clip, image_bytes)
Comment on lines 71 to 110
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for hash() used in cache key generation patterns
rg -n --type=py 'cache.*hash\(' -A2 -B2
rg -n --type=py 'key.*=.*hash\(' -A2 -B2

Repository: RohanExploit/VishwaGuru

Length of output: 897


🏁 Script executed:

cat -n backend/ai_service.py | sed -n '180,200p'


Length of output: 797


🏁 Script executed:

# Search more comprehensively for hash() used in cache contexts
rg -n --type=py 'hash\(' --context=5 | rg -A5 -B5 'cache|key'


Length of output: 1976


🏁 Script executed:

# Check for any other cache key generation patterns
rg -n --type=py '(cache_key|cache.*key)\s*=' -A1 -B1


Length of output: 2928


Inconsistent cache key generation: backend/ai_service.py:190 still uses hash() instead of a deterministic hash.

The line cache_key = f"chat_{hash(query)}" uses Python's built-in hash(), which is non-deterministic across processes. The PR fixes this in backend/routers/detection.py by switching to hashlib.md5(). Apply the same fix to ai_service.py for consistency and to ensure stable cache behavior in multi-worker deployments.

🧰 Tools
🪛 Ruff (0.15.6)

[error] S324: Probable use of insecure hash functions in hashlib: md5 (lines 73, 78, 83, 88, 93, 98, 103, 108)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routers/detection.py` around lines 71 - 110: the cache key in ai_service.py uses Python's non-deterministic hash() (cache_key = f"chat_{hash(query)}"), causing unstable keys across processes. Replace it with a deterministic hash from hashlib: compute the MD5 of the UTF-8-encoded query and use hexdigest() when building cache_key in the code path that forms "chat_{...}" keys, so the key becomes f"chat_{hashlib.md5(query.encode('utf-8')).hexdigest()}" and matches the deterministic approach used by the _cached_* functions in detection.py.


# Endpoints
36 changes: 19 additions & 17 deletions backend/routers/issues.py
@@ -236,8 +236,7 @@ async def create_issue(
# Invalidate cache so new issue appears
try:
recent_issues_cache.clear()
recent_issues_cache.clear()
user_issues_cache.clear()
user_issues_cache.clear()
except Exception as e:
logger.error(f"Error clearing cache: {e}")

@@ -347,24 +346,27 @@ def get_nearby_issues(
)

# Convert to response format and limit results
nearby_responses = [
NearbyIssueResponse(
id=issue.id,
description=issue.description[:100] + "..." if len(issue.description) > 100 else issue.description,
category=issue.category,
latitude=issue.latitude,
longitude=issue.longitude,
distance_meters=distance,
upvotes=issue.upvotes or 0,
created_at=issue.created_at,
status=issue.status
)
for issue, distance in nearby_issues_with_distance[:limit]
]
# Performance Boost: Map directly to dictionaries to avoid Pydantic overhead
nearby_data = []
for issue, distance in nearby_issues_with_distance[:limit]:
desc = issue.description or ""
short_desc = desc[:100] + "..." if len(desc) > 100 else desc

nearby_data.append({
"id": issue.id,
"description": short_desc,
"category": issue.category,
"latitude": issue.latitude,
"longitude": issue.longitude,
"distance_meters": distance,
"upvotes": issue.upvotes or 0,
"created_at": issue.created_at.isoformat() if issue.created_at else None,
"status": issue.status
})

# Performance Boost: Cache serialized JSON to bypass redundant Pydantic validation
# and serialization on cache hits.
json_data = json.dumps([r.model_dump(mode='json') for r in nearby_responses])
json_data = json.dumps(nearby_data)
nearby_issues_cache.set(json_data, cache_key)

return Response(content=json_data, media_type="application/json")
Comment on lines +369 to 372
Copilot AI Mar 22, 2026

get_nearby_issues is still declared with response_model=List[NearbyIssueResponse], but the implementation now returns a pre-serialized Response. Returning a Response bypasses FastAPI response-model validation/serialization, so the API contract (e.g., field types/required fields) is no longer enforced and can silently drift from the OpenAPI schema. Consider either returning nearby_data as a Python list with a fast response_class (e.g., ORJSONResponse) or removing/adjusting response_model to reflect that this endpoint returns raw JSON without validation.
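
The "cache the serialized JSON" idea at issue here can be sketched in isolation. The dict-based cache and function name below are illustrative only; the app's actual ThreadSafeCache API is not shown in this diff:

```python
import json

# Illustrative in-memory cache keyed by the request's cache key.
_cache: dict[str, str] = {}

def get_nearby_json(cache_key: str, rows: list) -> str:
    cached = _cache.get(cache_key)
    if cached is not None:
        # Cache hit returns the pre-serialized string, so no model
        # validation or re-serialization work is repeated.
        return cached
    payload = json.dumps(rows)
    _cache[cache_key] = payload
    return payload
```

Because the cached string is returned verbatim, nothing re-validates it against a response model on a hit, which is exactly the contract-drift risk the comment describes.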
