Releases: phase3dev/advanced-sitemap-parser

v1.0.2

08 Apr 15:48
9bfb5fa

What's Changed

Improvements

  • Improved per-sitemap output filename readability so exported .txt files are easier to match to their source sitemap URL or local sitemap file
  • Remote sitemap outputs now include cleaner host/path/query hints in the readable filename prefix
  • Local sitemap outputs now include parent-directory and file-stem hints in the readable filename prefix
  • Kept the trailing short hash derived from the full original source string, so collision safety is preserved
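
The naming scheme described above (a readable prefix plus a short hash of the full source string) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the SHA-256 hash, the 8-character hash length, and the sanitizing rule are all hypothetical and are not taken from the project's code.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlsplit


def _sanitize(text: str) -> str:
    """Collapse anything outside [A-Za-z0-9._-] into single underscores."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", text).strip("_")


def readable_output_name(source: str, hash_len: int = 8) -> str:
    """Build '<readable prefix>_<short hash>.txt' for one sitemap source.

    Hypothetical helper: the real tool's hash function and length may differ.
    """
    # The hash always covers the full, unmodified source string, so two
    # sources with the same readable prefix still get distinct filenames.
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()[:hash_len]

    parts = urlsplit(source)
    if parts.scheme in ("http", "https"):
        # Remote source: host, path, and query hints.
        hints = [parts.netloc, parts.path, parts.query]
    else:
        # Local source: parent directory name and file stem.
        path = Path(source)
        stem = path.name
        for suffix in (".gz", ".xml"):
            if stem.endswith(suffix):
                stem = stem[: -len(suffix)]
        hints = [path.parent.name, stem]

    prefix = "_".join(filter(None, (_sanitize(h) for h in hints)))
    return f"{prefix or 'sitemap'}_{digest}.txt"
```

With this sketch, `https://example.com/sitemap.xml?page=1` yields a name beginning with `example.com_sitemap.xml_page_1_`, and `data/maps/products.xml.gz` yields one beginning with `maps_products_`. A downstream script that only needs a stable key could match on the trailing hash rather than the prefix.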

Docs

  • Updated README output filename examples and descriptions to reflect the more identifiable naming format

Notes

  • This release changes the human-readable prefix of per-sitemap output filenames
  • The hash basis is unchanged, but any downstream scripts that match the old filename prefix format may need to be updated

Full Changelog: v1.0.1...v1.0.2

v1.0.1

08 Apr 14:19
cfffaad

What's Changed

Bug Fixes

  • Fixed per-sitemap filename collisions for child sitemap URLs sharing the same host/path but differing by query string; output filenames now use a sanitized readable base plus a short hash of the full source URL
  • Ensured each per-sitemap file contains only the URLs extracted from that specific sitemap source
  • Corrected the README clone URL to phase3dev/sitemap-extract
  • Fixed local sitemap handling so directory-based inputs and nested local sitemap references resolve correctly
  • Replaced the bare except: in get_current_ip() with except Exception: so KeyboardInterrupt and SystemExit are not swallowed
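
The difference between a bare except: and except Exception: can be shown with a small sketch. The helper and fetch functions below are hypothetical stand-ins, not the project's get_current_ip(); they only demonstrate that KeyboardInterrupt and SystemExit derive from BaseException and so pass through an except Exception: handler.

```python
def safe_lookup(fetch):
    """Return fetch()'s result, or None if it fails with an ordinary error."""
    try:
        return fetch()
    except Exception:
        # Catches OSError, ValueError, and similar errors, but not
        # KeyboardInterrupt or SystemExit, which subclass BaseException
        # directly and therefore keep propagating to the caller.
        return None


def failing_fetch():
    """Hypothetical lookup that fails with an ordinary network-style error."""
    raise OSError("network unreachable")


def interrupted_fetch():
    """Hypothetical lookup interrupted by the user pressing Ctrl+C."""
    raise KeyboardInterrupt
```

Calling `safe_lookup(failing_fetch)` returns None, while `safe_lookup(interrupted_fetch)` lets the KeyboardInterrupt propagate. A bare except: in the same position would have swallowed both.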

Improvements

  • Added all_extracted_urls.txt, always written at the end of a run from the deduplicated union of all extracted page URLs
  • Deduplicated per-sitemap URL files before writing
  • Tightened directory scanning so --directory matches only .xml and .xml.gz, not suffix variants like .xml.bak
  • Normalized save_dir once before processor construction so the processor always receives a canonical path
  • Made --stealth behavior consistent across runtime, CLI help, and README by forcing max_workers=1 instead of only emitting a warning
  • Replaced non-interruptible retry sleeps with a shared interruptible sleep helper so Ctrl+C responds promptly during retry backoff and stagger delays
  • Preserved global pacing semantics under multithreading by locking the shared request clock around delay calculation and sleep
  • Removed request-scoped proxy and user-agent state from shared instance fields, keeping them local per request to eliminate cross-thread races
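
One possible shape for the interruptible sleep helper and the locked request clock mentioned above is sketched here. It is an assumption-laden illustration: stop_event, interruptible_sleep, and RequestPacer are hypothetical names, and using threading.Event for interruption is one common approach, not necessarily the one the project uses.

```python
import threading
import time

# Hypothetical shared stop flag; a real tool might set it from its
# Ctrl+C handling so that sleeping worker threads wake up promptly.
stop_event = threading.Event()


def interruptible_sleep(seconds: float) -> bool:
    """Sleep for up to `seconds`; return True if interrupted early."""
    # Event.wait returns True as soon as the flag is set, False on timeout.
    return stop_event.wait(timeout=max(0.0, seconds))


class RequestPacer:
    """Enforce a minimum gap between requests across all threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._last_request = None  # monotonic timestamp of the last slot

    def wait_turn(self) -> float:
        """Block until this caller's request slot; return the slot time."""
        # Holding the lock across both the delay calculation and the sleep
        # keeps threads from reading the same timestamp and firing together.
        with self._lock:
            if self._last_request is not None:
                elapsed = time.monotonic() - self._last_request
                remaining = self.min_interval - elapsed
                if remaining > 0:
                    interruptible_sleep(remaining)
            self._last_request = time.monotonic()
            return self._last_request
```

Each worker would call `pacer.wait_turn()` immediately before sending a request. Because the sleep happens inside the lock, requests are serialized at the pacing point, which trades some throughput for a predictable global rate.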

Tests

  • Added minimal stdlib regression tests covering interruptible sleep, proxy/IP formatting, interrupt propagation, concurrent stats/failure tracking, and a threaded local sitemap run

Docs & Maintenance

  • Documented supported Python as 3.9+ in the README
  • Added a minimal SECURITY.md
  • Removed unused lxml from requirements.txt

Full Changelog: v1.0.0...v1.0.1

v1.0.0

12 Mar 00:38
9be2d88

Initial Release