-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Bug Description
When running kurt content map url with a domain that doesn't include the protocol (e.g., juhache.substack.com instead of https://juhache.substack.com), the command incorrectly treats it as a single page instead of discovering all subpages via sitemap or crawling.
Steps to Reproduce
- Run:
uv run kurt content map url "juhache.substack.com" - Observe: Only 1 page discovered with method
single_page - Expected: Should discover 90+ pages from the sitemap
Root Cause
The is_single_page_url() function in src/kurt/utils/url_utils.py:19 uses urlparse() which treats URLs without protocols as paths, not domains.
Example:
from urllib.parse import urlparse
# Without protocol - WRONG
urlparse("juhache.substack.com").path # Returns 'juhache.substack.com'
# With protocol - CORRECT
urlparse("https://juhache.substack.com").path # Returns ''This causes the function to incorrectly identify juhache.substack.com as a single page instead of a multi-page site.
Workaround
Always include the protocol:
uv run kurt content map url "https://juhache.substack.com"Proposed Fix
Add URL normalization to ensure protocol is present. This should be done in one of these locations:
-
Option 1: CLI command level (
src/kurt/commands/content/map.py:99)- Normalize URL before passing to
map_url_content()
- Normalize URL before passing to
-
Option 2: Utility function (
src/kurt/utils/url_utils.py)- Add
ensure_protocol(url: str) -> strfunction - Call it in
is_single_page_url()before parsing
- Add
-
Option 3: Content mapping layer (
src/kurt/content/map/__init__.py:41)- Normalize in
map_url_content()function
- Normalize in
Suggested implementation:
def ensure_protocol(url: str) -> str:
"""Add https:// protocol if missing."""
if not url.startswith(('http://', 'https://')):
return f'https://{url}'
return urlImpact
- Users must remember to always include
https://when usingmap urlcommand - Inconsistent with user expectations (most tools auto-add protocol)
- Affects all URL-based commands that use
is_single_page_url()
Related Files
src/kurt/utils/url_utils.py:19-is_single_page_url()functionsrc/kurt/commands/content/map.py:99- CLI command entry pointsrc/kurt/content/map/__init__.py:41-map_url_content()function