Releases: phase3dev/advanced-sitemap-parser
v1.0.2
What's Changed
Improvements
- Improved per-sitemap output filename readability so exported `.txt` files are easier to identify against their source sitemap URL or local sitemap file
- Remote sitemap outputs now include cleaner host/path/query hints in the readable filename prefix
- Local sitemap outputs now include parent-directory and file-stem hints in the readable filename prefix
- Kept the trailing short hash derived from the full original source string, so collision safety is preserved
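
For downstream scripts that need to reason about the new names, the naming scheme described above can be sketched roughly as follows. This is an illustration only: `readable_output_name` is a hypothetical helper, and the hash algorithm (SHA-256), hash length (8), underscore separator, and sanitization rules are all assumptions rather than the project's actual choices.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlsplit


def readable_output_name(source: str, hash_len: int = 8) -> str:
    """Build '<readable-prefix>_<short-hash>.txt' for a sitemap source.

    Hypothetical sketch: algorithm, lengths, and separators are assumed.
    """
    parts = urlsplit(source)
    if parts.scheme in ("http", "https"):
        # Remote source: use host, path, and query as readable hints.
        hint = "_".join(p for p in (parts.netloc, parts.path, parts.query) if p)
    else:
        # Local source: use parent-directory and file-stem hints.
        path = Path(source)
        name = path.name
        for suffix in (".xml.gz", ".xml"):
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                break
        hint = f"{path.parent.name}_{name}" if path.parent.name else name
    # Collapse every run of filename-unsafe characters into one underscore.
    prefix = re.sub(r"[^A-Za-z0-9.-]+", "_", hint).strip("_")[:80]
    # Hash the full, unmodified source string, so two sources that sanitize
    # to the same prefix still get different names (barring hash collisions).
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()[:hash_len]
    return f"{prefix}_{digest}.txt"
```

Because only the prefix changed in this release, a script that keys on the trailing hash segment rather than the readable prefix would be unaffected by the rename, assuming it computes the hash the same way the tool does.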
Docs
- Updated README output filename examples and descriptions to reflect the more identifiable naming format
Notes
- This release changes the human-readable prefix of per-sitemap output filenames
- The hash basis is unchanged, but any downstream scripts that match the old filename prefix format may need to be updated
Full Changelog: v1.0.1...v1.0.2
v1.0.1
What's Changed
Bug Fixes
- Fixed per-sitemap filename collisions for child sitemap URLs sharing the same host/path but differing by query string; output filenames now use a sanitized readable base plus a short hash of the full source URL
- Ensured each per-sitemap file contains only the URLs extracted from that specific sitemap source
- Corrected the README clone URL to `phase3dev/sitemap-extract`
- Fixed local sitemap handling so directory-based inputs and nested local sitemap references resolve correctly
- Replaced the bare `except:` in `get_current_ip()` with `except Exception:` so `KeyboardInterrupt` and `SystemExit` are not swallowed
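
The reason the `except Exception:` change matters is that `KeyboardInterrupt` and `SystemExit` derive from `BaseException`, not `Exception`. A minimal sketch of the pattern, using a hypothetical `get_current_ip_sketch` function and a stand-in `fetch` callable (neither is the project's actual code):

```python
def get_current_ip_sketch(fetch):
    """Return the IP reported by fetch(), or None if the lookup fails.

    `fetch` is a hypothetical callable standing in for the real lookup.
    """
    try:
        return fetch()
    except Exception:
        # Catches ordinary failures (timeouts, bad responses, etc.) only.
        # KeyboardInterrupt and SystemExit are BaseException subclasses,
        # so they propagate to the caller instead of being swallowed here.
        return None
```

With a bare `except:`, pressing Ctrl+C during the lookup would have been silently turned into a `None` result.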
Improvements
- Added `all_extracted_urls.txt`, always written at the end of a run from the deduplicated union of all extracted page URLs
- Deduplicated per-sitemap URL files before writing
- Tightened directory scanning so `--directory` matches only `.xml` and `.xml.gz`, not suffix variants like `.xml.bak`
- Normalized `save_dir` once before processor construction so the processor always receives a canonical path
- Made `--stealth` behavior consistent across runtime, CLI help, and README by forcing `max_workers=1` instead of warning only
- Replaced non-interruptible retry sleeps with a shared interruptible sleep helper so Ctrl+C responds promptly during retry backoff and stagger delays
- Preserved global pacing semantics under multithreading by locking the shared request clock around delay calculation and sleep
- Removed request-scoped proxy and user-agent state from shared instance fields, keeping them local per request to eliminate cross-thread races
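
The interruptible sleep and locked request clock described above could look roughly like this. It is a sketch under assumptions, not the project's implementation: `stop_event`, `interruptible_sleep`, and `RequestPacer` are hypothetical names, and it assumes the main thread sets the event when it catches Ctrl+C.

```python
import threading
import time

# Hypothetical shared flag; assumed to be set by the main thread on Ctrl+C.
stop_event = threading.Event()


def interruptible_sleep(seconds: float) -> bool:
    """Sleep up to `seconds`; return True if stop_event cut the sleep short."""
    # Event.wait returns as soon as the event is set, so worker threads
    # wake promptly instead of finishing a long time.sleep().
    return stop_event.wait(timeout=max(0.0, seconds))


class RequestPacer:
    """Enforce a minimum gap between requests across all worker threads."""

    def __init__(self, min_delay: float) -> None:
        self.min_delay = min_delay
        self._lock = threading.Lock()
        self._last_request = None  # monotonic timestamp of the last request

    def wait_turn(self) -> None:
        # Hold the lock across both the delay calculation and the sleep,
        # so two workers cannot read the same timestamp and fire together.
        with self._lock:
            if self._last_request is not None:
                remaining = self._last_request + self.min_delay - time.monotonic()
                if remaining > 0:
                    # Returns early if stop_event is set; callers are
                    # expected to check stop_event before sending a request.
                    interruptible_sleep(remaining)
            self._last_request = time.monotonic()
```

Holding the lock while sleeping serializes the workers on the shared clock, which is the point: the delay applies globally rather than per thread.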
Tests
- Added minimal stdlib regression tests covering interruptible sleep, proxy/IP formatting, interrupt propagation, concurrent stats/failure tracking, and a threaded local sitemap run
Docs & Maintenance
- Documented supported Python as 3.9+ in the README
- Added a minimal `SECURITY.md`
- Removed unused `lxml` from `requirements.txt`
Full Changelog: v1.0.0...v1.0.1