
Build advanced movie recommendation system#3

Draft
deepshekhardas wants to merge 1 commit into main from
cursor/build-advanced-movie-recommendation-system-5b5d

Conversation

@deepshekhardas
Owner

This pull request contains changes generated by Cursor background composer.

…oach

Co-authored-by: deepshekhardas1234 <deepshekhardas1234@gmail.com>
@cursor

cursor bot commented Aug 27, 2025

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents


@cubic-dev-ai bot left a comment


5 issues found across 25 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="recommender/hybrid.py">

<violation number="1" location="recommender/hybrid.py:75">
P1: The `by_likes` branch does not filter out liked/disliked movies from candidates, so the user's own selections are likely to appear (often at the top) of the recommendations. The `by_user` branch correctly excludes seen movies — the same pattern should be applied here.</violation>
</file>

<file name="recommender/cf.py">

<violation number="1" location="recommender/cf.py:101">
P1: The sklearn fallback computes RMSE on training data (reconstruction error), while the Surprise path computes it on a held-out test set. These are not comparable: training RMSE is always lower, so the sklearn backend will look artificially better. Split the data the same way (e.g., hold out 20% of observed entries) before computing RMSE, so the metric is meaningful regardless of backend.</violation>
</file>

<file name="recommender/posters.py">

<violation number="1" location="recommender/posters.py:10">
P2: `lru_cache` permanently caches transient HTTP failures. If a request times out or the server returns a 5xx, `None` is cached and the poster will never be fetched for that movie until the process restarts. Consider caching only successful results (e.g., use a manual dict cache that stores only non-`None` values).</violation>
</file>

<file name="recommender/data.py">

<violation number="1" location="recommender/data.py:26">
P1: Calling `zf.extractall(base)` without validating member paths is vulnerable to zip-slip attacks — a crafted archive (from `MOVIELENS_LOCAL_ZIP` or a compromised download) could write files outside `base`. Validate that each member's resolved path stays inside the target directory before extracting (note: `zipfile.extractall` has no `filter='data'` parameter — that safeguard exists only in `tarfile`).</violation>

<violation number="2" location="recommender/data.py:47">
P1: Falling back to `verify=False` on `SSLError` silently disables TLS certificate verification, making the download vulnerable to man-in-the-middle attacks. A network attacker could serve a malicious zip that then gets extracted to disk. Remove this fallback — if TLS fails, the error should propagate so users can fix their environment (e.g., install `certifi` or update CA certs) rather than silently downgrade security.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread recommender/hybrid.py
        cf_scores = _minmax(cf_scores)
        content_scores = _minmax(content_scores)
        hybrid = weight_cf * cf_scores + (1.0 - weight_cf) * content_scores
        candidates = index_to_movie

@cubic-dev-ai bot Mar 8, 2026


P1: The by_likes branch does not filter out liked/disliked movies from candidates, so the user's own selections are likely to appear (often at the top) of the recommendations. The by_user branch correctly excludes seen movies — the same pattern should be applied here.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/hybrid.py, line 75:

<comment>The `by_likes` branch does not filter out liked/disliked movies from candidates, so the user's own selections are likely to appear (often at the top) of the recommendations. The `by_user` branch correctly excludes seen movies — the same pattern should be applied here.</comment>

<file context>
@@ -0,0 +1,84 @@
+        cf_scores = _minmax(cf_scores)
+        content_scores = _minmax(content_scores)
+        hybrid = weight_cf * cf_scores + (1.0 - weight_cf) * content_scores
+        candidates = index_to_movie
+
+    # Prepare result DataFrame
</file context>
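The fix the review asks for can be sketched as a masking step applied to the `by_likes` branch before ranking. This is an illustrative helper, not code from the PR: `filter_seen` and `seen_movie_ids` are hypothetical names, assuming `index_to_movie` maps score indices to movie IDs as the quoted context suggests.

```python
import numpy as np

def filter_seen(hybrid_scores: np.ndarray, index_to_movie: np.ndarray,
                seen_movie_ids: set) -> tuple[np.ndarray, np.ndarray]:
    """Drop movies the user already liked/disliked before ranking candidates."""
    mask = np.array([mid not in seen_movie_ids for mid in index_to_movie])
    return hybrid_scores[mask], index_to_movie[mask]
```

Applying this to the `by_likes` branch mirrors the exclusion the `by_user` branch already performs, so the user's own selections can no longer surface in the top-N list.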

Comment thread recommender/cf.py
        "item_means": item_means.astype(np.float32),
    }

    recon = (user_factors @ item_factors.T) + item_means

@cubic-dev-ai bot Mar 8, 2026


P1: The sklearn fallback computes RMSE on training data (reconstruction error), while the Surprise path computes it on a held-out test set. These are not comparable: training RMSE is always lower, so the sklearn backend will look artificially better. Split the data the same way (e.g., hold out 20% of observed entries) before computing RMSE, so the metric is meaningful regardless of backend.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/cf.py, line 101:

<comment>The sklearn fallback computes RMSE on training data (reconstruction error), while the Surprise path computes it on a held-out test set. These are not comparable: training RMSE is always lower, so the sklearn backend will look artificially better. Split the data the same way (e.g., hold out 20% of observed entries) before computing RMSE, so the metric is meaningful regardless of backend.</comment>

<file context>
@@ -0,0 +1,152 @@
+        "item_means": item_means.astype(np.float32),
+    }
+
+    recon = (user_factors @ item_factors.T) + item_means
+    obs_pred = recon[nonzero]
+    obs_true = mat[nonzero]
</file context>
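The suggested split can be sketched as follows: mask a random 20% of observed entries, fit on the remainder, and score only the held-out cells. This is a hypothetical `holdout_rmse` helper (the `factorize` callable stands in for the sklearn factorization path and is assumed to return a full reconstruction matrix):

```python
import numpy as np

def holdout_rmse(mat: np.ndarray, factorize, test_frac: float = 0.2,
                 seed: int = 42) -> float:
    """RMSE on held-out observed entries, so both backends are comparable."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(mat)
    n_test = max(1, int(len(rows) * test_frac))
    test_idx = rng.choice(len(rows), size=n_test, replace=False)
    train = mat.copy()
    train[rows[test_idx], cols[test_idx]] = 0.0   # mask held-out ratings
    recon = factorize(train)                       # fit on training entries only
    err = recon[rows[test_idx], cols[test_idx]] - mat[rows[test_idx], cols[test_idx]]
    return float(np.sqrt(np.mean(err ** 2)))
```

Because the metric is now computed on entries the model never saw, the sklearn backend's number becomes comparable to the Surprise path's test-set RMSE.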

Comment thread recommender/data.py

    def _extract(zip_file: Path):
        with zipfile.ZipFile(zip_file, "r") as zf:
            zf.extractall(base)

@cubic-dev-ai bot Mar 8, 2026


P1: Calling zf.extractall(base) without validating member paths is vulnerable to zip-slip attacks — a crafted archive (from MOVIELENS_LOCAL_ZIP or a compromised download) could write files outside base. Validate that each member's resolved path stays inside the target directory before extracting (note: zipfile.extractall has no filter='data' parameter — that safeguard exists only in tarfile).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/data.py, line 26:

<comment>Calling `zf.extractall(base)` without validating member paths is vulnerable to zip-slip attacks — a crafted archive (from `MOVIELENS_LOCAL_ZIP` or a compromised download) could write files outside `base`. Validate that each member's resolved path stays inside the target directory before extracting (note: `zipfile.extractall` has no `filter='data'` parameter — that safeguard exists only in `tarfile`).</comment>

<file context>
@@ -0,0 +1,97 @@
+
+    def _extract(zip_file: Path):
+        with zipfile.ZipFile(zip_file, "r") as zf:
+            zf.extractall(base)
+
+    if local_zip and Path(local_zip).exists():
</file context>
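The path validation the review asks for can be sketched like this — a hypothetical `safe_extract` that resolves each member path and refuses to extract if any member would escape `base`:

```python
import zipfile
from pathlib import Path

def safe_extract(zip_file: Path, base: Path) -> None:
    """Extract only if every member's resolved path stays inside `base`."""
    base = base.resolve()
    with zipfile.ZipFile(zip_file, "r") as zf:
        for member in zf.infolist():
            target = (base / member.filename).resolve()
            if not target.is_relative_to(base):   # Path.is_relative_to: Python 3.9+
                raise ValueError(f"unsafe path in archive: {member.filename}")
        zf.extractall(base)
```

Rejecting the whole archive on the first unsafe member is deliberate: a crafted zip is hostile input, so extracting the "safe" subset would still mean trusting attacker-controlled content.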

Comment thread recommender/data.py
            urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
        except Exception:
            pass
        with requests.get(MOVIELENS_URL, stream=True, timeout=60, verify=False) as r:  # type: ignore[arg-type]

@cubic-dev-ai bot Mar 8, 2026


P1: Falling back to verify=False on SSLError silently disables TLS certificate verification, making the download vulnerable to man-in-the-middle attacks. A network attacker could serve a malicious zip that then gets extracted to disk. Remove this fallback — if TLS fails, the error should propagate so users can fix their environment (e.g., install certifi or update CA certs) rather than silently downgrade security.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/data.py, line 47:

<comment>Falling back to `verify=False` on `SSLError` silently disables TLS certificate verification, making the download vulnerable to man-in-the-middle attacks. A network attacker could serve a malicious zip that then gets extracted to disk. Remove this fallback — if TLS fails, the error should propagate so users can fix their environment (e.g., install `certifi` or update CA certs) rather than silently downgrade security.</comment>

<file context>
@@ -0,0 +1,97 @@
+            urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
+        except Exception:
+            pass
+        with requests.get(MOVIELENS_URL, stream=True, timeout=60, verify=False) as r:  # type: ignore[arg-type]
+            r.raise_for_status()
+            with open(zip_path, "wb") as f:
</file context>

Comment thread recommender/posters.py
PLACEHOLDER = "https://via.placeholder.com/500x750?text=No+Image"


@lru_cache(maxsize=4096)

@cubic-dev-ai bot Mar 8, 2026


P2: lru_cache permanently caches transient HTTP failures. If a request times out or the server returns a 5xx, None is cached and the poster will never be fetched for that movie until the process restarts. Consider caching only successful results (e.g., use a manual dict cache that stores only non-None values).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At recommender/posters.py, line 10:

<comment>`lru_cache` permanently caches transient HTTP failures. If a request times out or the server returns a 5xx, `None` is cached and the poster will never be fetched for that movie until the process restarts. Consider caching only successful results (e.g., use a manual dict cache that stores only non-`None` values).</comment>

<file context>
@@ -0,0 +1,61 @@
+PLACEHOLDER = "https://via.placeholder.com/500x750?text=No+Image"
+
+
+@lru_cache(maxsize=4096)
+def _omdb_poster(title: str, year: Optional[int]) -> Optional[str]:
+    api_key = os.getenv("OMDB_API_KEY")
</file context>
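The manual-dict approach the review suggests can be sketched as below. `cached_poster` and the injected `fetch` callable are illustrative names; the idea is simply that `None` results (timeouts, 5xx) are returned but never stored, so the next call retries:

```python
from typing import Callable, Optional

_poster_cache: dict[tuple[str, Optional[int]], str] = {}

def cached_poster(title: str, year: Optional[int],
                  fetch: Callable[[str, Optional[int]], Optional[str]]) -> Optional[str]:
    """Cache only successful lookups so transient failures are retried."""
    key = (title, year)
    if key in _poster_cache:
        return _poster_cache[key]
    url = fetch(title, year)        # may return None on timeout or HTTP error
    if url is not None:
        _poster_cache[key] = url    # negative results are deliberately not cached
    return url
```

Unlike `lru_cache`, this keeps the fast path for known posters while leaving failed lookups eligible for retry on the next request.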
