Scraper runner: fail-fast on source load failure #105
Merged
Propagate load errors from run_global_scrapers so scraper jobs fail visibly and can retry.
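A minimal sketch of the fail-fast pattern described above, not the actual change in this PR. Only the name `run_global_scrapers` comes from the description; its signature, the `SourceLoadError` type, and the `load_sources` / `run_one` callables are hypothetical and exist purely for illustration.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("scraper_runner")


class SourceLoadError(RuntimeError):
    """Hypothetical error type: the scraper source list could not be loaded."""


def run_global_scrapers(
    load_sources: Callable[[], Iterable[dict]],
    run_one: Callable[[dict], None],
) -> int:
    """Run every global scraper source and return how many were run.

    Hypothetical signature; the real function's parameters are not shown in this PR page.
    """
    try:
        sources = list(load_sources())
    except Exception as exc:
        # Fail fast: log and re-raise instead of treating the failure as "no sources".
        logger.error("Failed to load global scraper sources: %s", exc)
        raise SourceLoadError("global scraper source load failed") from exc

    for source in sources:
        run_one(source)
    return len(sources)


if __name__ == "__main__":
    def broken_loader():
        raise ConnectionError("source store unavailable")

    try:
        run_global_scrapers(broken_loader, run_one=print)
    except SourceLoadError as err:
        # The caller (for example a job host) sees the failure and can retry.
        print(f"job failed: {err}")
```

Re-raising with `from exc` keeps the original exception as `__cause__`, so the underlying load error stays visible in the traceback instead of being swallowed.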
AB-Law added a commit that referenced this pull request on Mar 17, 2026
* docs: add local dev setup for full emulation-based development
  - Rewrite QUICKSTART.md with Azurite + Cosmos emulator setup, container creation steps, port reference, and 5-tab run guide
  - Add CONTRIBUTING.md: branch strategy, tech stack, per-layer conventions, PR checklist, and secrets policy
  - Add local.settings.json.example for both PluckIt.Functions and PluckIt.Processor pre-filled with emulator connection strings and placeholder values for Azure OpenAI credentials
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ci: auto-sync dev from main after every merge
* fix: use http for vnext-preview Cosmos emulator (no TLS on ARM Mac)
* feat: add setup-local-cosmos.py script to bootstrap emulator containers
* fix: use UseDevelopmentStorage=true for Azurite blob container creation
* feat: add use-env.sh script to switch between local and prod settings
* Potential fix for code scanning alert no. 10: Workflow does not contain permissions
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* fix: use UseDevelopmentStorage=true for Azurite storage connection strings
* Refactor/angular lint (#69)
* feat: integrate ESLint for improved code quality
  - Added ESLint configuration to the project for TypeScript and Angular files.
  - Updated `angular.json` to include linting options.
  - Installed necessary ESLint packages for Angular and TypeScript support.
  - Refactored code to replace `any` with `unknown` in several service files for better type safety.
  - Enhanced unit tests to align with the new type definitions and ensure consistency.
* refactor: remove unused page
* fix: correct Cosmos emulator key (invalid base64 Tg== -> g==)
* fix: update Cosmos DB keys and enhance Azurite compatibility for local development
* fix: enhance UUID generation and improve error handling in BlobSasService and setup-local-cosmos.py
* chore: update CI/CD workflows to include 'dev' branch for push and pull request triggers
* fix: improve UUID generation logic in DashboardComponent and update CI to disable watch mode for tests
* feat: enhance wardrobe loading logic (#71)
* feat: enhance wardrobe loading logic
  - Updated the `_load_user_wardrobe` function to prefetch item IDs and limit the number of items returned based on wear count.
  - Enhanced unit tests for `_load_user_wardrobe` to verify correct behavior and query execution.
* test: enhance user wardrobe loading tests to verify wearCount conditions
  - Added assertions to check for the presence of "IS_DEFINED(c.wearCount)" and "c.wearCount > 0" in the SQL query for loading user wardrobe items.
* Feature/52 resolve phash linear scan (#72)
* feat: Enhance deduplication logic in RunDeduplicator by implementing prefix-based pHash storage. Introduced methods for registering pHashes by prefix and identifying candidate buckets for Hamming distance checks. Updated tests to validate detection of duplicates with nearby prefix variations.
* refactor: Replace candidate bucket method with a cached version for improved performance in deduplication logic. The new method optimizes pHash prefix bucket retrieval by caching results, enhancing efficiency during duplicate checks.
* refactor: Simplify candidate bucket retrieval by removing max value check in deduplicator. Added new tests to validate deduplication behavior at threshold boundaries and just under threshold conditions.
* Refactor wardrobe item loading and enhance query limits (#76)
* Refactor wardrobe item loading and enhance query limits
  - Updated `_load_user_wardrobe` to first fetch item IDs and then retrieve top scored items with a limit of 50.
  - Introduced a constant `_WARDROBE_SCAN_LIMIT` to cap the number of items fetched in wardrobe queries to 500.
  - Enhanced `_load_wardrobe_items` in both `gaps.py` and `wear_patterns.py` to include the new limit and ensure recent wear events are capped.
  - Added unit tests to verify the new loading logic and query limits for wardrobe items.
* refactor: Refactor item loading in `_load_user_wardrobe` for improved query
* fix: remove duplicate sorting in `get_wear_patterns` function
* test: Update assertions in `test_load_user_wardrobe_fetches_ids_then_top_scored_items` to validate query parameters and limits
  - Refactored test to extract query parameters from the second call of `sync_wardrobe.query_items`.
  - Updated assertions to check for the correct limit parameter in the SQL query.
* feat: add Cosmos image cleanup index functionality (#78)
  - Introduced a new Cosmos DB container for image cleanup indexing, enhancing the wardrobe management system.
  - Updated the `WardrobeRepository` to support syncing and deleting entries in the image cleanup index.
  - Modified `CleanupFunctions` to utilize the new index for identifying known item IDs.
  - Enhanced local settings and infrastructure scripts to accommodate the new container.
  - Added unit tests to ensure correct behavior of the image cleanup index operations.
* refactor: query normalization and caching in wardrobe tool (#77)
  - Removed the `_normalise_query_text` and `_get_cache_scope` functions to simplify the codebase.
  - Updated `_expand_query_cached` to use a shared cache key based on normalized terms, improving cache efficiency across users and sessions.
  - Adjusted unit tests to reflect changes in caching behavior and ensure correct functionality with normalized queries.
* feat: Implement caching for SAS URL generation in BlobSasService
  - Added in-memory caching to the GenerateSasUrl method to improve performance by returning cached SAS URLs for repeated requests within a validity window.
  - Introduced MemoryCache for managing cached SAS URLs and defined cache skew and minimum cache duration constants.
  - Updated unit tests to verify caching behavior for allowed containers and validity windows, ensuring correct functionality under various scenarios.
  - Included Microsoft.Extensions.Caching.Memory package for caching support.
* feat: Add Redis caching support for SAS URL generation (#81)
  - Implemented Redis caching for SAS URL generation in BlobSasService to enhance performance and scalability.
  - Updated local.settings.json.example and QUICKSTART.md to include configuration for Redis cache.
  - Modified Program.cs to conditionally use Redis or in-memory caching based on configuration settings.
  - Enhanced unit tests to validate caching behavior with Redis and ensure correct functionality.
  - Updated Terraform scripts to support new SAS cache settings.
* bugfix: Enable cross-partition queries in scraper source retrieval (#82)
  - Added `enable_cross_partition_query=True` to various query items in `function_app.py` and `scraper_runner.py` to enhance data retrieval across partitions.
  - Updated relevant functions to ensure efficient querying of active and global scraper sources, improving overall performance and reliability.
* fix: codeRabbit comments
* refactor: Update MakeItem method in WardrobeRepositoryTests for improved clarity
* chore: Increase maxPollingInterval in host.json from 2 seconds to 15 seconds for improved queue processing efficiency
* feat: add robots.txt and update staticwebapp.config.json for sitemap inclusion
  - Added a new robots.txt file to allow all user agents and specify the sitemap location.
  - Updated staticwebapp.config.json to exclude robots.txt and well-known paths from navigation fallback.
* fix: test case
* fix: cap vault insights scans and extend cache TTL (#88)
  Limit wardrobe and wear-event scans for vault insights and extend cache TTL to reduce RU usage while preserving bounded analytics freshness.
* Fix issue #59: validate LLM category output (#89)
* Fix: paginate digest profile reads (#90)
  Prevent unbounded profile reads during weekly digest startup by switching from list(read_all_items()) to a paginated query on user profile id only.
* Optimize mood canonicalization embeddings via batching. (#91)
  Canonicalized mood names are now re-embedded in one Azure OpenAI batch call instead of one request per item, with a safe fallback to existing embeddings when batch re-embedding fails.
* fix: clamp wear history maxResults (#93)
* Migrate mood processor to async Cosmos container calls (#92)
* fix: Enhance mood processor to filter out unknown primary moods and improve error logging in digest agent. Added partition key to user data fetch in vault insights. Updated tests for mood processor and vault insights to reflect changes.
* feat: Enhance traceability and metadata handling in function_app and digest_agent (#103)
  - Added support for extracting trace ID from headers in `_metadata_request_id_from_headers`.
  - Updated `_set_trace_identifier_kwargs` to accept additional metadata and conditionally include it in the kwargs.
  - Modified `_build_langfuse_callbacks` to accept metadata for improved context in callback generation.
  - Enhanced `run_digest_now` to forward trace ID from the request headers.
  - Introduced `_build_digest_langfuse_callbacks` in `digest_agent` for consistent metadata handling.
  - Updated tests to verify trace ID propagation and metadata inclusion in LLM calls.
* Fix silent scraper source-load failures (#105)
  Propagate load errors from run_global_scrapers so scraper jobs fail visible and can retry.
* bugfix: reuse digest LLM instance for batch digest (#104)
* Fix: reuse digest LLM instance across batch runs
  Digest generation now keeps a single Azure OpenAI client in-process to avoid creating a fresh client per user while preserving prompt and save behavior.
* Fix test mock for digest llm invoke config path
* Refactor auth token resolution into shared service (#106)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AB-Law added a commit that referenced this pull request on Apr 18, 2026
Same commit list as the Mar 17, 2026 commit above, through "Refactor auth token resolution into shared service (#106)", followed by:
* refactor: Remove enable_cross_partition_query from queries in function_app.py and update ScraperSources partition key in setup-local-cosmos.py
  - Removed `enable_cross_partition_query=True` from several query items in `function_app.py` to streamline query execution.
  - Updated the partition key for the `ScraperSources` container in `setup-local-cosmos.py` from `/id` to `/sourceType` for improved data organization.
* test: Enhance refresh session tests with additional time mocking
  - Updated `test_refresh_session_rotates_tokens_and_replaces_previous` and `test_refresh_session_rejects_expired_refresh_token` to include a mock for the current UTC time, improving the accuracy of token expiration handling in tests.
  - Refactored the context manager for patching to streamline the mocking process.
* fix: Re-enable cross-partition queries in function_app.py
  - Added `enable_cross_partition_query=True` back to several query items in `function_app.py` to improve data retrieval across partitions.
  - Updated the partition key for user subscriptions to enhance query efficiency and ensure accurate data fetching.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Summary
Issue
Test plan
platform darwin -- Python 3.11.10, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/akshayb/projects/Pluck-It-101-source-load-failure-visibility
configfile: pytest.ini
plugins: mock-3.15.1, anyio-4.11.0, asyncio-1.3.0, cov-7.0.0, langsmith-0.4.33
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 0 items
============================ no tests ran in 0.00s =============================
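The pytest session above collected no tests, so it does not exercise the change. As a rough sketch of what a regression test for error propagation could look like, the following uses a mocked container. `load_global_sources`, the query text, and the `isGlobal` field are hypothetical stand-ins, not code from this repository; only `query_items` and `enable_cross_partition_query=True` appear in the commit messages above.

```python
from unittest.mock import MagicMock


def load_global_sources(container):
    """Hypothetical loader: query global sources and let any error propagate."""
    return list(
        container.query_items(
            query="SELECT * FROM c WHERE c.isGlobal = true",  # assumed query text
            enable_cross_partition_query=True,
        )
    )


def test_source_load_failure_propagates():
    container = MagicMock()
    container.query_items.side_effect = RuntimeError("query failed")
    try:
        load_global_sources(container)
    except RuntimeError as exc:
        # The load error reaches the caller instead of becoming an empty list.
        assert str(exc) == "query failed"
    else:
        raise AssertionError("expected the load error to propagate")


def test_sources_returned_on_success():
    container = MagicMock()
    container.query_items.return_value = iter([{"id": "s1"}])
    assert load_global_sources(container) == [{"id": "s1"}]
    container.query_items.assert_called_once()
```

Both functions follow pytest's `test_` naming convention, so pytest would collect them if the file itself matches its discovery pattern (for example `test_*.py`).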