Bug: map url command treats URLs without protocol as single pages #35

@boringdata

Bug Description

When running kurt content map url with a domain that doesn't include the protocol (e.g., juhache.substack.com instead of https://juhache.substack.com), the command incorrectly treats it as a single page instead of discovering all subpages via sitemap or crawling.

Steps to Reproduce

  1. Run: uv run kurt content map url "juhache.substack.com"
  2. Observe: Only 1 page discovered with method single_page
  3. Expected: Should discover 90+ pages from the sitemap

Root Cause

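The is_single_page_url() function in src/kurt/utils/url_utils.py:19 uses urlparse(). When the input has no protocol, urlparse() does not recognize a host: it leaves netloc empty and puts the entire string in path, so the domain is treated as a path rather than a domain.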

Example:

from urllib.parse import urlparse

# Without protocol - WRONG: the whole input is parsed as a path
urlparse("juhache.substack.com").path  # Returns 'juhache.substack.com'

# With protocol - CORRECT: the domain goes to netloc, the path is empty
urlparse("https://juhache.substack.com").path  # Returns ''

This causes the function to incorrectly identify juhache.substack.com as a single page instead of a multi-page site.
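
To make the failure mode concrete, here is a minimal sketch of how a path-based check can go wrong. The function looks_like_single_page() is hypothetical and is not the actual body of is_single_page_url(); it only assumes that the real function decides based on whether the parsed URL has a non-root path.

```python
from urllib.parse import urlparse


def looks_like_single_page(url: str) -> bool:
    """Hypothetical stand-in for is_single_page_url().

    Assumption: a URL with a non-root path is taken to be one page.
    """
    path = urlparse(url).path
    return path not in ("", "/")


# Without a protocol, netloc is empty and the domain ends up in path.
bare = urlparse("juhache.substack.com")
print(repr(bare.scheme), repr(bare.netloc), repr(bare.path))

# The bare domain is misclassified; the full URL is not.
print(looks_like_single_page("juhache.substack.com"))
print(looks_like_single_page("https://juhache.substack.com"))
```

Under that assumption, the bare domain is classified as a single page only because its host name is sitting in the path component.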

Workaround

Always include the protocol:

uv run kurt content map url "https://juhache.substack.com"

Proposed Fix

Add URL normalization to ensure a protocol is present. This should be done in one of these locations:

  1. Option 1: CLI command level (src/kurt/commands/content/map.py:99)
    • Normalize URL before passing to map_url_content()

  2. Option 2: Utility function (src/kurt/utils/url_utils.py)
    • Add ensure_protocol(url: str) -> str function
    • Call it in is_single_page_url() before parsing

  3. Option 3: Content mapping layer (src/kurt/content/map/__init__.py:41)
    • Normalize in map_url_content() function

Suggested implementation:

def ensure_protocol(url: str) -> str:
    """Add https:// protocol if missing."""
    if not url.startswith(('http://', 'https://')):
        return f'https://{url}'
    return url
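
The suggested implementation above covers the reported case. A slightly more defensive variant might also handle surrounding whitespace, uppercase schemes, and protocol-relative input. The sketch below is one possible shape, not a definitive fix: the name normalize_url_scheme() and the default_scheme parameter are hypothetical, and leaving any other explicit scheme untouched is an assumption about the desired behavior.

```python
import re

# Matches an explicit scheme such as "http://", "HTTPS://" or "ftp://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://")


def normalize_url_scheme(url: str, default_scheme: str = "https") -> str:
    """Hypothetical helper: return url with a scheme, adding one if missing."""
    url = url.strip()
    if url.startswith("//"):
        # Protocol-relative input: only the scheme name is missing.
        return f"{default_scheme}:{url}"
    if _SCHEME_RE.match(url):
        # An explicit scheme is already present; leave it untouched.
        return url
    return f"{default_scheme}://{url}"


for raw in ("juhache.substack.com", " HTTPS://juhache.substack.com ", "//juhache.substack.com"):
    print(repr(raw), "->", normalize_url_scheme(raw))
```

Matching the scheme case-insensitively matters because a check such as startswith('https://') would prepend a second protocol to input like HTTPS://juhache.substack.com.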

Impact

  • Users must remember to always include https:// when using the map url command
  • Inconsistent with user expectations (most tools auto-add protocol)
  • Affects all URL-based commands that use is_single_page_url()

Related Files

  • src/kurt/utils/url_utils.py:19 - is_single_page_url() function
  • src/kurt/commands/content/map.py:99 - CLI command entry point
  • src/kurt/content/map/__init__.py:41 - map_url_content() function
