Build advanced movie recommendation system#3
Conversation
…oach Co-authored-by: deepshekhardas1234 <deepshekhardas1234@gmail.com>
5 issues found across 25 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="recommender/hybrid.py">
<violation number="1" location="recommender/hybrid.py:75">
P1: The `by_likes` branch does not filter out liked/disliked movies from candidates, so the user's own selections are likely to appear (often at the top) of the recommendations. The `by_user` branch correctly excludes seen movies — the same pattern should be applied here.</violation>
</file>
<file name="recommender/cf.py">
<violation number="1" location="recommender/cf.py:101">
P1: The sklearn fallback computes RMSE on training data (reconstruction error), while the Surprise path computes it on a held-out test set. These are not comparable: training RMSE is always lower, so the sklearn backend will look artificially better. Split the data the same way (e.g., hold out 20% of observed entries) before computing RMSE, so the metric is meaningful regardless of backend.</violation>
</file>
<file name="recommender/posters.py">
<violation number="1" location="recommender/posters.py:10">
P2: `lru_cache` permanently caches transient HTTP failures. If a request times out or the server returns a 5xx, `None` is cached and the poster will never be fetched for that movie until the process restarts. Consider caching only successful results (e.g., use a manual dict cache that stores only non-`None` values).</violation>
</file>
<file name="recommender/data.py">
<violation number="1" location="recommender/data.py:26">
P1: Calling `zf.extractall(base)` without validating member paths is vulnerable to zip-slip attacks — a crafted archive (from `MOVIELENS_LOCAL_ZIP` or a compromised download) could write files outside `base`. Validate that each member's resolved path starts with the target directory, or on Python 3.12+ use the `filter='data'` parameter.</violation>
<violation number="2" location="recommender/data.py:47">
P1: Falling back to `verify=False` on `SSLError` silently disables TLS certificate verification, making the download vulnerable to man-in-the-middle attacks. A network attacker could serve a malicious zip that then gets extracted to disk. Remove this fallback — if TLS fails, the error should propagate so users can fix their environment (e.g., install `certifi` or update CA certs) rather than silently downgrade security.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
cf_scores = _minmax(cf_scores)
content_scores = _minmax(content_scores)
hybrid = weight_cf * cf_scores + (1.0 - weight_cf) * content_scores
candidates = index_to_movie
P1: The by_likes branch does not filter out liked/disliked movies from candidates, so the user's own selections are likely to appear (often at the top) of the recommendations. The by_user branch correctly excludes seen movies — the same pattern should be applied here.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/hybrid.py, line 75:
<comment>The `by_likes` branch does not filter out liked/disliked movies from candidates, so the user's own selections are likely to appear (often at the top) of the recommendations. The `by_user` branch correctly excludes seen movies — the same pattern should be applied here.</comment>
<file context>
@@ -0,0 +1,84 @@
+ cf_scores = _minmax(cf_scores)
+ content_scores = _minmax(content_scores)
+ hybrid = weight_cf * cf_scores + (1.0 - weight_cf) * content_scores
+ candidates = index_to_movie
+
+ # Prepare result DataFrame
</file context>
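One way to apply the `by_user` exclusion pattern to this branch is sketched below. The names `filter_seen`, `liked_ids`, and `disliked_ids` are hypothetical stand-ins for the branch's actual variables, assuming `hybrid` is a score vector aligned with `index_to_movie`:

```python
import numpy as np
import pandas as pd

def filter_seen(hybrid: np.ndarray, index_to_movie: pd.Index,
                liked_ids: set, disliked_ids: set) -> pd.DataFrame:
    """Drop the user's own selections before ranking, mirroring by_user."""
    seen = liked_ids | disliked_ids
    mask = np.array([mid not in seen for mid in index_to_movie])
    return pd.DataFrame({
        "movieId": np.asarray(index_to_movie)[mask],
        "score": hybrid[mask],
    }).sort_values("score", ascending=False)
```

With the mask applied before sorting, a liked movie can no longer surface at the top of its own recommendation list.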
    "item_means": item_means.astype(np.float32),
}

recon = (user_factors @ item_factors.T) + item_means
P1: The sklearn fallback computes RMSE on training data (reconstruction error), while the Surprise path computes it on a held-out test set. These are not comparable: training RMSE is always lower, so the sklearn backend will look artificially better. Split the data the same way (e.g., hold out 20% of observed entries) before computing RMSE, so the metric is meaningful regardless of backend.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/cf.py, line 101:
<comment>The sklearn fallback computes RMSE on training data (reconstruction error), while the Surprise path computes it on a held-out test set. These are not comparable: training RMSE is always lower, so the sklearn backend will look artificially better. Split the data the same way (e.g., hold out 20% of observed entries) before computing RMSE, so the metric is meaningful regardless of backend.</comment>
<file context>
@@ -0,0 +1,152 @@
+ "item_means": item_means.astype(np.float32),
+ }
+
+ recon = (user_factors @ item_factors.T) + item_means
+ obs_pred = recon[nonzero]
+ obs_true = mat[nonzero]
</file context>
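A minimal sketch of the suggested split: mask out a random fraction of the *observed* entries before fitting, then score predictions only on those held-out cells (function names here are illustrative, not the PR's):

```python
import numpy as np

def split_observed(mat: np.ndarray, test_frac: float = 0.2, seed: int = 0):
    """Hold out a random test_frac of observed (nonzero) entries.

    Returns (train_mat, test_idx): train_mat has the held-out entries
    zeroed so the factorization never sees them; test_idx addresses the
    held-out (row, col) positions in the original matrix.
    """
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(mat)
    n_test = max(1, int(len(rows) * test_frac))
    pick = rng.choice(len(rows), size=n_test, replace=False)
    test_idx = (rows[pick], cols[pick])
    train = mat.copy()
    train[test_idx] = 0.0
    return train, test_idx

def heldout_rmse(recon: np.ndarray, mat: np.ndarray, test_idx) -> float:
    """RMSE on held-out entries only, comparable to Surprise's test RMSE."""
    err = recon[test_idx] - mat[test_idx]
    return float(np.sqrt(np.mean(err ** 2)))
```

Fitting `user_factors`/`item_factors` on `train_mat` and evaluating `recon` via `heldout_rmse` makes the metric a generalization error on both backends, rather than a reconstruction error on one.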
def _extract(zip_file: Path):
    with zipfile.ZipFile(zip_file, "r") as zf:
        zf.extractall(base)
P1: Calling zf.extractall(base) without validating member paths is vulnerable to zip-slip attacks — a crafted archive (from MOVIELENS_LOCAL_ZIP or a compromised download) could write files outside base. Validate that each member's resolved path starts with the target directory, or on Python 3.12+ use the filter='data' parameter.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/data.py, line 26:
<comment>Calling `zf.extractall(base)` without validating member paths is vulnerable to zip-slip attacks — a crafted archive (from `MOVIELENS_LOCAL_ZIP` or a compromised download) could write files outside `base`. Validate that each member's resolved path starts with the target directory, or on Python 3.12+ use the `filter='data'` parameter.</comment>
<file context>
@@ -0,0 +1,97 @@
+
+ def _extract(zip_file: Path):
+ with zipfile.ZipFile(zip_file, "r") as zf:
+ zf.extractall(base)
+
+ if local_zip and Path(local_zip).exists():
</file context>
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
except Exception:
    pass
with requests.get(MOVIELENS_URL, stream=True, timeout=60, verify=False) as r:  # type: ignore[arg-type]
P1: Falling back to verify=False on SSLError silently disables TLS certificate verification, making the download vulnerable to man-in-the-middle attacks. A network attacker could serve a malicious zip that then gets extracted to disk. Remove this fallback — if TLS fails, the error should propagate so users can fix their environment (e.g., install certifi or update CA certs) rather than silently downgrade security.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/data.py, line 47:
<comment>Falling back to `verify=False` on `SSLError` silently disables TLS certificate verification, making the download vulnerable to man-in-the-middle attacks. A network attacker could serve a malicious zip that then gets extracted to disk. Remove this fallback — if TLS fails, the error should propagate so users can fix their environment (e.g., install `certifi` or update CA certs) rather than silently downgrade security.</comment>
<file context>
@@ -0,0 +1,97 @@
+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
+ except Exception:
+ pass
+ with requests.get(MOVIELENS_URL, stream=True, timeout=60, verify=False) as r: # type: ignore[arg-type]
+ r.raise_for_status()
+ with open(zip_path, "wb") as f:
</file context>
PLACEHOLDER = "https://via.placeholder.com/500x750?text=No+Image"

@lru_cache(maxsize=4096)
P2: lru_cache permanently caches transient HTTP failures. If a request times out or the server returns a 5xx, None is cached and the poster will never be fetched for that movie until the process restarts. Consider caching only successful results (e.g., use a manual dict cache that stores only non-None values).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/posters.py, line 10:
<comment>`lru_cache` permanently caches transient HTTP failures. If a request times out or the server returns a 5xx, `None` is cached and the poster will never be fetched for that movie until the process restarts. Consider caching only successful results (e.g., use a manual dict cache that stores only non-`None` values).</comment>
<file context>
@@ -0,0 +1,61 @@
+PLACEHOLDER = "https://via.placeholder.com/500x750?text=No+Image"
+
+
+@lru_cache(maxsize=4096)
+def _omdb_poster(title: str, year: Optional[int]) -> Optional[str]:
+ api_key = os.getenv("OMDB_API_KEY")
</file context>
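A manual cache along the suggested lines might look like this (a sketch; the `fetch` callable stands in for the real OMDb HTTP request, which the snippet does not reproduce):

```python
from typing import Callable, Dict, Optional, Tuple

_poster_cache: Dict[Tuple[str, Optional[int]], str] = {}

def omdb_poster(title: str, year: Optional[int],
                fetch: Callable[[str, Optional[int]], Optional[str]]) -> Optional[str]:
    """Cache only successful lookups so transient failures can be retried."""
    key = (title, year)
    if key in _poster_cache:      # hit: only non-None results are ever stored
        return _poster_cache[key]
    url = fetch(title, year)
    if url is not None:           # failures (None) are NOT cached
        _poster_cache[key] = url
    return url
```

Unlike `lru_cache`, a timeout or 5xx returns `None` once and the poster is fetched again on the next request instead of being missing until restart.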
This pull request contains changes generated by Cursor background composer.