Scraper runner: fail-fast on source load failure #105

Merged
AB-Law merged 1 commit into dev from bugfix/101-source-load-failure-visibility on Mar 17, 2026

AB-Law (Owner) commented Mar 17, 2026

Summary

  • Re-raise exceptions when loading active global sources fails, instead of swallowing them and returning, so that scraper jobs fail visibly and are eligible for retry/alerting.
  • Added regression unit tests covering both successful source iteration and failure propagation.
  • Added a scraper function wiki note describing the new fail-fast behavior for visibility and operational response.
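
For reviewers who want the behavior change in miniature, here is a rough sketch of swallow-and-return versus fail-fast. It is not the code from this PR: `load_sources_swallowing`, `load_sources_fail_fast`, and `broken_query` are hypothetical names made up for illustration, and the real loader may be structured differently.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("scraper_runner_sketch")


def load_sources_swallowing(query_sources: Callable[[], Iterable[dict]]) -> List[dict]:
    """Swallow-and-return: the failure is logged, but the job still looks successful."""
    try:
        return list(query_sources())
    except Exception:
        logger.exception("Failed to load active global sources")
        return []


def load_sources_fail_fast(query_sources: Callable[[], Iterable[dict]]) -> List[dict]:
    """Fail-fast: log for context, then re-raise so the caller (and host) sees the error."""
    try:
        return list(query_sources())
    except Exception:
        logger.exception("Failed to load active global sources")
        raise


def broken_query() -> Iterable[dict]:
    # Hypothetical stand-in for a source query that fails.
    raise ConnectionError("source store unavailable")
```

With the swallowing variant, `load_sources_swallowing(broken_query)` returns an empty list and the run ends quietly. With the fail-fast variant, the same call raises `ConnectionError`, which is what lets the surrounding job be marked failed and picked up by retry or alerting.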

Issue

Test plan

  • ============================= test session starts ==============================
    platform darwin -- Python 3.11.10, pytest-9.0.2, pluggy-1.6.0
    rootdir: /Users/akshayb/projects/Pluck-It-101-source-load-failure-visibility
    configfile: pytest.ini
    plugins: mock-3.15.1, anyio-4.11.0, asyncio-1.3.0, cov-7.0.0, langsmith-0.4.33
    asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
    collected 0 items

    ============================ no tests ran in 0.00s =============================
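
As a rough illustration of the two regression cases named in the Summary (successful source iteration and failure propagation), tests along these lines could be used. This is only a sketch under assumptions: `run_global_scrapers_sketch` is a hypothetical stand-in for the real runner, the query string is invented, and the container is a `MagicMock` rather than a real Cosmos client.

```python
from unittest.mock import MagicMock


def run_global_scrapers_sketch(container, scrape_one):
    """Hypothetical stand-in: load sources, scrape each one, let load errors propagate."""
    sources = list(
        container.query_items(
            query="SELECT * FROM c",  # illustrative query, not the project's
            enable_cross_partition_query=True,
        )
    )
    for source in sources:
        scrape_one(source)
    return len(sources)


def test_iterates_sources_on_success():
    container = MagicMock()
    container.query_items.return_value = iter([{"id": "s1"}, {"id": "s2"}])
    scrape_one = MagicMock()

    assert run_global_scrapers_sketch(container, scrape_one) == 2
    # Each loaded source is passed to the scraper, in order.
    assert [c.args[0]["id"] for c in scrape_one.call_args_list] == ["s1", "s2"]


def test_propagates_load_failure():
    container = MagicMock()
    container.query_items.side_effect = RuntimeError("load failed")
    scrape_one = MagicMock()

    try:
        run_global_scrapers_sketch(container, scrape_one)
    except RuntimeError as exc:
        assert str(exc) == "load failed"
    else:
        raise AssertionError("expected the load error to propagate")
    # Nothing is scraped when the source load itself fails.
    scrape_one.assert_not_called()
```

The second test is the one that guards the fix: it fails if the runner ever goes back to catching the load error and returning normally.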

Propagate load errors from run_global_scrapers so scraper jobs fail visibly and can retry.

coderabbitai bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e87b273a-35f1-4596-8328-12a1cc9eb77e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

AB-Law merged commit 17975df into dev on Mar 17, 2026
8 checks passed
AB-Law added a commit that referenced this pull request Mar 17, 2026
* docs: add local dev setup for full emulation-based development

- Rewrite QUICKSTART.md with Azurite + Cosmos emulator setup, container
  creation steps, port reference, and 5-tab run guide
- Add CONTRIBUTING.md: branch strategy, tech stack, per-layer conventions,
  PR checklist, and secrets policy
- Add local.settings.json.example for both PluckIt.Functions and
  PluckIt.Processor pre-filled with emulator connection strings and
  placeholder values for Azure OpenAI credentials

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: auto-sync dev from main after every merge

* fix: use http for vnext-preview Cosmos emulator (no TLS on ARM Mac)

* feat: add setup-local-cosmos.py script to bootstrap emulator containers

* fix: use UseDevelopmentStorage=true for Azurite blob container creation

* feat: add use-env.sh script to switch between local and prod settings

* Potential fix for code scanning alert no. 10: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* fix: use UseDevelopmentStorage=true for Azurite storage connection strings

* Refactor/angular lint (#69)

* feat: integrate ESLint for improved code quality

- Added ESLint configuration to the project for TypeScript and Angular files.
- Updated `angular.json` to include linting options.
- Installed necessary ESLint packages for Angular and TypeScript support.
- Refactored code to replace `any` with `unknown` in several service files for better type safety.
- Enhanced unit tests to align with the new type definitions and ensure consistency.

* refactor: remove unused page

* fix: correct Cosmos emulator key (invalid base64 Tg== -> g==)

* fix: update Cosmos DB keys and enhance Azurite compatibility for local development

* fix: enhance UUID generation and improve error handling in BlobSasService and setup-local-cosmos.py

* chore: update CI/CD workflows to include 'dev' branch for push and pull request triggers

* fix: improve UUID generation logic in DashboardComponent and update CI to disable watch mode for tests

* feat: enhance wardrobe loading logic (#71)

* feat: enhance wardrobe loading logic
- Updated the `_load_user_wardrobe` function to prefetch item IDs and limit the number of items returned based on wear count.
- Enhanced unit tests for `_load_user_wardrobe` to verify correct behavior and query execution.

* test: enhance user wardrobe loading tests to verify wearCount conditions
- Added assertions to check for the presence of "IS_DEFINED(c.wearCount)" and "c.wearCount > 0" in the SQL query for loading user wardrobe items.

* Feature/52 resolve phash linear scan (#72)

* feat: Enhance deduplication logic in RunDeduplicator by implementing prefix-based pHash storage. Introduced methods for registering pHashes by prefix and identifying candidate buckets for Hamming distance checks. Updated tests to validate detection of duplicates with nearby prefix variations.

* refactor: Replace candidate bucket method with a cached version for improved performance in deduplication logic. The new method optimizes pHash prefix bucket retrieval by caching results, enhancing efficiency during duplicate checks.

* refactor: Simplify candidate bucket retrieval by removing max value check in deduplicator. Added new tests to validate deduplication behavior at threshold boundaries and just under threshold conditions.

* Refactor wardrobe item loading and enhance query limits (#76)

* Refactor wardrobe item loading and enhance query limits

- Updated `_load_user_wardrobe` to first fetch item IDs and then retrieve top scored items with a limit of 50.
- Introduced a constant `_WARDROBE_SCAN_LIMIT` to cap the number of items fetched in wardrobe queries to 500.
- Enhanced `_load_wardrobe_items` in both `gaps.py` and `wear_patterns.py` to include the new limit and ensure recent wear events are capped.
- Added unit tests to verify the new loading logic and query limits for wardrobe items.

* refactor: Refactor item loading in `_load_user_wardrobe` for improved query

* fix: remove duplicate sorting in `get_wear_patterns` function

* test: Update assertions in `test_load_user_wardrobe_fetches_ids_then_top_scored_items` to validate query parameters and limits

- Refactored test to extract query parameters from the second call of `sync_wardrobe.query_items`.
- Updated assertions to check for the correct limit parameter in the SQL query.

* feat: add Cosmos image cleanup index functionality (#78)

- Introduced a new Cosmos DB container for image cleanup indexing, enhancing the wardrobe management system.
- Updated the `WardrobeRepository` to support syncing and deleting entries in the image cleanup index.
- Modified `CleanupFunctions` to utilize the new index for identifying known item IDs.
- Enhanced local settings and infrastructure scripts to accommodate the new container.
- Added unit tests to ensure correct behavior of the image cleanup index operations.

* refactor: query normalization and caching in wardrobe tool (#77)

- Removed the `_normalise_query_text` and `_get_cache_scope` functions to simplify the codebase.
- Updated `_expand_query_cached` to use a shared cache key based on normalized terms, improving cache efficiency across users and sessions.
- Adjusted unit tests to reflect changes in caching behavior and ensure correct functionality with normalized queries.

* feat: Implement caching for SAS URL generation in BlobSasService

- Added in-memory caching to the GenerateSasUrl method to improve performance by returning cached SAS URLs for repeated requests within a validity window.
- Introduced MemoryCache for managing cached SAS URLs and defined cache skew and minimum cache duration constants.
- Updated unit tests to verify caching behavior for allowed containers and validity windows, ensuring correct functionality under various scenarios.
- Included Microsoft.Extensions.Caching.Memory package for caching support.

* feat: Add Redis caching support for SAS URL generation (#81)

- Implemented Redis caching for SAS URL generation in BlobSasService to enhance performance and scalability.
- Updated local.settings.json.example and QUICKSTART.md to include configuration for Redis cache.
- Modified Program.cs to conditionally use Redis or in-memory caching based on configuration settings.
- Enhanced unit tests to validate caching behavior with Redis and ensure correct functionality.
- Updated Terraform scripts to support new SAS cache settings.

* bugfix: Enable cross-partition queries in scraper source retrieval (#82)

- Added `enable_cross_partition_query=True` to various query items in `function_app.py` and `scraper_runner.py` to enhance data retrieval across partitions.
- Updated relevant functions to ensure efficient querying of active and global scraper sources, improving overall performance and reliability.

* fix: codeRabbit comments

* refactor: Update MakeItem method in WardrobeRepositoryTests for improved clarity

* chore: Increase maxPollingInterval in host.json from 2 seconds to 15 seconds for improved queue processing efficiency

* feat: add robots.txt and update staticwebapp.config.json for sitemap inclusion

- Added a new robots.txt file to allow all user agents and specify the sitemap location.
- Updated staticwebapp.config.json to exclude robots.txt and well-known paths from navigation fallback.

* fix: test case

* fix: cap vault insights scans and extend cache TTL (#88)

Limit wardrobe and wear-event scans for vault insights and extend cache TTL to reduce RU usage while preserving bounded analytics freshness.

* Fix issue #59: validate LLM category output (#89)

* Fix: paginate digest profile reads (#90)

Prevent unbounded profile reads during weekly digest startup by switching
from list(read_all_items()) to a paginated query on user profile id only.

* Optimize mood canonicalization embeddings via batching. (#91)

Canonicalized mood names are now re-embedded in one Azure OpenAI batch call
instead of one request per item, with a safe fallback to existing embeddings
when batch re-embedding fails.

* fix: clamp wear history maxResults (#93)

* Migrate mood processor to async Cosmos container calls (#92)

* fix: Enhance mood processor to filter out unknown primary moods and improve error logging in digest agent. Added partition key to user data fetch in vault insights. Updated tests for mood processor and vault insights to reflect changes.

* feat: Enhance traceability and metadata handling in function_app and digest_agent (#103)

- Added support for extracting trace ID from headers in `_metadata_request_id_from_headers`.
- Updated `_set_trace_identifier_kwargs` to accept additional metadata and conditionally include it in the kwargs.
- Modified `_build_langfuse_callbacks` to accept metadata for improved context in callback generation.
- Enhanced `run_digest_now` to forward trace ID from the request headers.
- Introduced `_build_digest_langfuse_callbacks` in `digest_agent` for consistent metadata handling.
- Updated tests to verify trace ID propagation and metadata inclusion in LLM calls.

* Fix silent scraper source-load failures (#105)

Propagate load errors from run_global_scrapers so scraper jobs fail visibly and can retry.

* bugfix: reuse digest LLM instance for batch digest (#104)

* Fix: reuse digest LLM instance across batch runs

Digest generation now keeps a single Azure OpenAI client in-process to avoid
creating a fresh client per user while preserving prompt and save behavior.

* Fix test mock for digest llm invoke config path

* Refactor auth token resolution into shared service (#106)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AB-Law added a commit that referenced this pull request Apr 18, 2026
* refactor: Remove enable_cross_partition_query from queries in function_app.py and update ScraperSources partition key in setup-local-cosmos.py

- Removed `enable_cross_partition_query=True` from several query items in `function_app.py` to streamline query execution.
- Updated the partition key for the `ScraperSources` container in `setup-local-cosmos.py` from `/id` to `/sourceType` for improved data organization.

* test: Enhance refresh session tests with additional time mocking

- Updated `test_refresh_session_rotates_tokens_and_replaces_previous` and `test_refresh_session_rejects_expired_refresh_token` to include a mock for the current UTC time, improving the accuracy of token expiration handling in tests.
- Refactored the context manager for patching to streamline the mocking process.

* fix: Re-enable cross-partition queries in function_app.py

- Added `enable_cross_partition_query=True` back to several query items in `function_app.py` to improve data retrieval across partitions.
- Updated the partition key for user subscriptions to enhance query efficiency and ensure accurate data fetching.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>