[Entity Analytics] EUID-based risk scoring and entity resolution via Entity Store maintainer#259732
- Renamed `entityAnalytics94ModeEnabled` to `entityAnalyticsEntityStoreV2Enabled` for clarity.
- Updated conditional checks in the plugin to use the new feature flag.
- Enhanced error handling in routes to return appropriate responses when entity analytics V2 mode is enabled.
- Added tests to validate behavior when entity analytics 9.4 mode is enabled, ensuring proper error responses.
- Introduced new translations for error messages related to the V2 mode API restrictions.
…sform initialization
- Added `initLegacyTransforms` method to `RiskScoreDataClient` for initializing legacy risk engine transforms.
- Updated `RiskEngineDataClient` to call `initLegacyTransforms` during initialization.
- Enhanced mock implementation in `risk_score_data_client.mock.ts` to include `initLegacyTransforms`.
- Added tests for `initLegacyTransforms` to ensure proper functionality across namespaces.
- Improved error logging for legacy transform initialization failures.
Replaces references to "entity analytics 9.4 mode" with "Entity Store V2" in error messages and test descriptions to align with the updated feature flag terminology. Made-with: Cursor
Introduce a shared StepResult shape for scoring steps so pipeline orchestration and metrics consume a consistent summary contract across base, resolution, and reset phases. Made-with: Cursor
Route reset-to-zero writes through shared persistence helpers and pass the orchestrator writer dependency so reset behavior matches base and resolution best-effort bulk handling. Made-with: Cursor
The uninstall call during test cleanup returns 403 when the store was never installed, which is normal between tests. Logging at warn creates noise that obscures real failures. Made-with: Cursor
…ForMaintainerRun The modifier-shape test intermittently timed out because stopMaintainer only prevents future scheduling — it doesn't kill in-flight executions. After stop+start the task often entered an indeterminate "currently running" state for >60s. Remove the stop/start dance entirely: set up both modifiers while the maintainer runs freely, confirm they're in the entity store, then trigger a fresh run that will see both. Also improves the shared waitForMaintainerRun helper to retry the manual trigger inside the poll loop (instead of fire-once) and fail fast if the maintainer errors during the wait. Made-with: Cursor
Same approach as the modifier-shape test: set up the watchlist while the maintainer runs freely, confirm it in the entity store, then trigger a fresh run. Safe because the retry loop already filters by run_id and checks for the watchlist modifier. Made-with: Cursor
Remove the strict lastErrorTimestamp fail-fast in waitForMaintainerRun so custom-namespace setup can recover from transient early maintainer errors while still relying on the successful new-run count condition. Made-with: Cursor
…p-dev/kibana into risk-score-maintainer-phase-1
@tiansivive I have refactored the risk score maintainer pipeline to make the flow easier to read and reason about by breaking it into more helpers. I've also added a few comments to hopefully make things clearer. The code is much better now: each stage is much more aligned with the others, and they use some shared helpers.
const baseScores = applyScoreModifiersFromEntities({
  now,
  identifierType: entityType,
  scoreType: 'base',
  calculationRunId,
  page: {
    scores: buildZeroScores(baseEntityIds),
    identifierField,
  },
  entities,
  watchlistConfigs,
});

const resolutionScores = applyScoreModifiersFromEntities({
  now,
  identifierType: entityType,
  scoreType: 'resolution',
  calculationRunId,
  page: {
    scores: buildZeroScores(resolutionEntityIds),
    identifierField,
  },
  entities,
  watchlistConfigs,
});
maybe something for later, but this could be concurrent?
if (isMaintainerAlreadyRunningError(error)) {
  if (!alreadyRunningHandled) {
nit: we can flatten this
let requiredNewRuns = minRuns;
let manualRunTriggered = false;
let alreadyRunningHandled = false;
I'm really struggling with this file/fn; I just don't have the overall view of why this is needed.
This is just the test helper. It has so many resilience fixes in it now from 20 rounds of debugging, but I think it gets a pass because it's test infra.
/sync-ci
…ntity maintainer tests Embeds settle logic directly into `waitForMaintainerRun` to ensure the task is idle before returning. This prevents version_conflict_engine_exceptions when a caller immediately stops the maintainer while a follow-up run is still saving its state. Made-with: Cursor
- Removes redundant `scale` test which duplicated coverage of existing modifier tests.
- Simplifies `reset-to-zero` test by removing unnecessary dual-host setup.
- Merges redundant dual-write resolution test into the main resolution scoring test to save a slow maintainer setup/teardown cycle.
- Removes unused `waitForMaintainerToSettle` import and calls after embedding settle logic directly into `waitForMaintainerRun`.

Made-with: Cursor
…p-dev/kibana into risk-score-maintainer-phase-1
  const scores = normalizeScores(await readRiskScores(es));
  const hostScores = scores.filter((s) => s.id_value === host.expectedEuid);
  expect(hostScores.every((s) => (s.calculated_score_norm ?? 0) > 0)).to.be(true);
});
🟡 Medium trial_license_complete_tier/risk_score_calculation.ts:668
The assertion `hostScores.every((s) => (s.calculated_score_norm ?? 0) > 0)` at line 670 silently passes when `hostScores` is empty because `[].every(...)` returns `true` in JavaScript. If the host entity has no scores, the test incorrectly passes without verifying that reset-to-zero was actually disabled. Consider asserting `hostScores.length > 0` before the `every()` check.
+ // The entity should NOT have been reset to zero — only positive scores should exist
+ const scores = normalizeScores(await readRiskScores(es));
+ const hostScores = scores.filter((s) => s.id_value === host.expectedEuid);
+ expect(hostScores.length).to.be.greaterThan(0);
expect(hostScores.every((s) => (s.calculated_score_norm ?? 0) > 0)).to.be(true);
Evidence trail:
File: x-pack/solutions/security/test/security_solution_api_integration/test_suites/entity_analytics/risk_score_maintainer/trial_license_complete_tier/risk_score_calculation.ts, lines 669-670 (REVIEWED_COMMIT). The code shows `const hostScores = scores.filter((s) => s.id_value === host.expectedEuid);` followed by `expect(hostScores.every((s) => (s.calculated_score_norm ?? 0) > 0)).to.be(true);` with no length check. JavaScript's `Array.prototype.every()` returns `true` for empty arrays per ECMAScript spec (https://tc39.es/ecma262/#sec-array.prototype.every).
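The vacuous-truth behavior behind this finding can be reproduced in isolation. The following is a minimal sketch rather than the test itself: `ScoreRow`, `allPositive`, and `nonEmptyAndAllPositive` are hypothetical names, and only the `id_value` and `calculated_score_norm` fields come from the reviewed snippet.

```typescript
// Hypothetical row shape; only these two field names appear in the reviewed test.
interface ScoreRow {
  id_value: string;
  calculated_score_norm?: number;
}

// Unguarded check, mirroring the predicate used in the original assertion.
const allPositive = (rows: ScoreRow[]): boolean =>
  rows.every((s) => (s.calculated_score_norm ?? 0) > 0);

// Guarded check: an empty result set counts as a failure instead of a pass.
const nonEmptyAndAllPositive = (rows: ScoreRow[]): boolean =>
  rows.length > 0 && allPositive(rows);

const noRows: ScoreRow[] = [];
const positiveRows: ScoreRow[] = [{ id_value: 'host:a', calculated_score_norm: 42 }];
const zeroedRows: ScoreRow[] = [{ id_value: 'host:a', calculated_score_norm: 0 }];

console.log(allPositive(noRows)); // true: every() is vacuously true on []
console.log(nonEmptyAndAllPositive(noRows)); // false
console.log(nonEmptyAndAllPositive(positiveRows)); // true
console.log(nonEmptyAndAllPositive(zeroedRows)); // false
```

In the test, the equivalent guard is the separate `expect(hostScores.length).to.be.greaterThan(0)` assertion from the suggestion, which also produces a clearer failure message than a combined boolean would.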
💛 Build succeeded, but was flaky
cc @hop-dev
jaredburgettelastic left a comment
Desk test only, thank you for the incredible work 🎉
Summary
Implements the v2 risk scoring pipeline — a new ES|QL-based maintainer that scores entities by EUID, reads modifiers directly from the Entity Store, uses run-aware reset logic, and resolves entity groups via a lookup index. All new behaviour is feature-flagged and runs alongside the existing risk engine without affecting it, including the preview risk scores API.
Foundation work was merged separately in #258197 (maintainer registration, legacy API guards, saved object setup).
What changed from v1
[v1 vs. v2 comparison table; only the following cell fragments are recoverable]
- `user.name` / `host.name`
- … `host:<id>`, `user:<email@namespace>`) via `getEuidEsqlEvaluation()`
- … `entity.identity.asset.criticality`, `entity.attributes.watchlists`)
- `is_privileged_user` boolean
- `riskModifier` values, applied multiplicatively
- `LOOKUP JOIN` (see below)
- … `entity.risk.*` for base, `entity.relationships.resolution.risk.*` for resolution)
- `write_now`, `defer_to_phase_2`, or `not_in_store` to route between phases
- … `excludedEntities`)
- `calculation_run_id`: dual pass for `base` and `resolution` score types
- `risk_score_service` → transforms
- `risk_score_maintainer` registered with the Entity Store maintainer framework
- `entity.id`-based identifiers; resolution scoring and reset-to-zero are excluded as preview is stateless

How reset-to-zero changed
v1 accumulates a global `scoredEntityIds` array across all pages, then resets any entity with a positive score whose ID is not in that array. This requires unbounded in-memory state and breaks down in a multi-phase pipeline where later phases write additional score documents after base pagination completes.

v2 replaces this with run-aware stale detection:
- … `STATS ... LAST(score, @timestamp) BY id_value`)
- … `WHERE score_type IS NULL OR score_type == "<type>"`), runs independently for `base` and `resolution`
- … `calculation_run_id IS NULL OR calculation_run_id != currentRunId`)
- `LIMIT 10000` per run; remaining stale entities are handled by subsequent scheduled runs

This removes the need for per-page exclusion state and composes cleanly with multiple score types.
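To make the run-aware rule concrete, here is a minimal in-memory sketch of the same idea in TypeScript. It is not the maintainer's ES|QL query: `RiskScoreDoc`, `findStaleEntityIds`, and the numeric `timestamp` field are hypothetical, the order of the filter and latest-per-entity steps is an assumption, and the "positive score" condition is carried over from the v1 description rather than confirmed for v2.

```typescript
// Hypothetical document shape; `timestamp` stands in for @timestamp (epoch millis).
interface RiskScoreDoc {
  id_value: string;
  timestamp: number;
  score: number;
  score_type?: 'base' | 'propagated' | 'resolution'; // absent on v1 documents
  calculation_run_id?: string; // absent on v1 documents
}

function findStaleEntityIds(
  docs: RiskScoreDoc[],
  scoreType: 'base' | 'resolution',
  currentRunId: string,
  limit = 10000
): string[] {
  // Keep the latest document per entity for this score type.
  // Documents without score_type (v1) are included, mirroring `score_type IS NULL OR ...`.
  const latest = new Map<string, RiskScoreDoc>();
  for (const doc of docs) {
    if (doc.score_type !== undefined && doc.score_type !== scoreType) continue;
    const prev = latest.get(doc.id_value);
    if (prev === undefined || doc.timestamp >= prev.timestamp) {
      latest.set(doc.id_value, doc);
    }
  }

  // An entity is stale when its latest score was not written by the current run.
  const stale: string[] = [];
  for (const doc of latest.values()) {
    if (stale.length >= limit) break; // cap per run; the rest waits for a later run
    const fromOtherRun =
      doc.calculation_run_id === undefined || doc.calculation_run_id !== currentRunId;
    if (fromOtherRun && doc.score > 0) stale.push(doc.id_value); // assumption: only positive scores need a reset
  }
  return stale;
}

const docs: RiskScoreDoc[] = [
  { id_value: 'host:a', timestamp: 1, score: 50, score_type: 'base', calculation_run_id: 'run-1' },
  { id_value: 'host:a', timestamp: 2, score: 60, score_type: 'base', calculation_run_id: 'run-2' },
  { id_value: 'host:b', timestamp: 1, score: 30, score_type: 'base', calculation_run_id: 'run-1' },
  { id_value: 'host:c', timestamp: 1, score: 0, score_type: 'base', calculation_run_id: 'run-1' },
  { id_value: 'host:d', timestamp: 1, score: 10 }, // v1-style document
  { id_value: 'host:e', timestamp: 1, score: 20, score_type: 'resolution', calculation_run_id: 'run-1' },
];

console.log(findStaleEntityIds(docs, 'base', 'run-2'));
console.log(findStaleEntityIds(docs, 'resolution', 'run-2'));
```

Because the check depends only on each entity's latest document and the current run ID, no list of scored entity IDs has to be carried between pages or phases.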
Reset ES|QL query
Schema updates
Two new fields are added to the risk score index mapping. Both are optional and backward-compatible — existing v1 documents without these fields continue to work.
New risk score document fields
- `score_type` (`keyword`): `base`, `propagated`, or `resolution`. Absent on v1 documents.
- `calculation_run_id` (`keyword`)
- `related_entities`: … `{ entity_id, relationship_type }`

Modifier type changes
The `Modifier<'watchlist'>` type was generalized.

Multiple watchlist modifiers per entity are supported and combine multiplicatively. The `riskScoreDocFactory()` third parameter changed from `Modifier<'watchlist'> | undefined` to `Array<Modifier<'watchlist'>>`. Backward compatibility with the existing privmon modifier is preserved.

Pipeline walkthrough
The maintainer `run()` executes per entity type (host, user, etc.):

Phase 1: Base scoring
- `WatchlistConfigClient`
- `MV_PSERIES_WEIGHTED_SUM(TOP(risk_score, sampleSize, "desc"), 1.5)`
- `crudClient.listEntities()` for the scored page
- `applyScoreModifiersFromEntities()` reads criticality + watchlists from entity documents
- `write_now` / `defer_to_phase_2` / `not_in_store`; dual-write to risk index + Entity Store
- `.entity_analytics.risk_score.lookup-{namespace}` (indexed with `index.mode: lookup`)

Phase 2: Resolution scoring
- `resolution_target_id` in the lookup index
- `LOOKUP JOIN` joins the lookup index to combine alerts from all aliases in the group
- `related_entities` from the join result
- … `score_type: "resolution"`) and Entity Store (`entity.relationships.resolution.risk.*`)

Cleanup
- … `base` and `resolution`)

Preview API (synchronous)
The existing `/internal/risk_score/preview` endpoint is extended with a V2 code path, gated by the `entityAnalyticsEntityStoreV2` experimental feature flag. When V2 is active:
- `id_field: "entity.id"` and raw EUIDs as `id_value` (e.g. `host:<id>`, `user:<email>`), with no translation to legacy identifier fields

Telemetry
New event-based telemetry for observability:
- `risk_score_maintainer_run_summary`: one per `{namespace, entityType, calculationRunId}` with outcome, counters, and duration
- `risk_score_maintainer_stage_summary`: one per stage (`phase1_base_scoring`, `phase1_lookup_sync`, `phase2_resolution_scoring`, `reset_to_zero`) with per-stage timing and error details
- `scoresWrittenBase`, `scoresWrittenResolution`, `lookupDocsUpserted`, `lookupDocsDeleted`, `lookupPrunedDocs`
- … `success | error | skipped | aborted`), skip reasons, and error kinds

Testing
Feature flags
Add to `kibana.yml` (or `kibana.dev.yml`):

Then enable Entity Store V2 and the ID-based risk scoring advanced setting (`securitySolution:entityStoreEnableV2`) via Stack Management → Advanced Settings.

Manual testing with the document generator
The security-documents-generator has a companion PR (#342) with a `risk-score-v2` command that seeds entities, ingests alerts, triggers maintainer runs, and prints a scorecard:

# In the security-documents-generator repo, on the risk-scoring-v2 branch:
yarn start risk-score-v2

The command supports interactive follow-on actions (expand entities, tweak scores, view risk docs, run comparisons) and includes graph workflows for resolution + propagation.
What to verify
- … `.ds-risk-score.risk-score-*`) and on Entity Store documents (`entity.risk.*`)
- `score_type: "base"` and `calculation_run_id` are present on new risk score documents
- … `score_type: "resolution"` and `related_entities`
- … `entity.relationships.resolution.risk.*`
- … `entity.id`-based identifiers when V2 is enabled, and falls back to V1 scoring when disabled

FTR test coverage
- `setup_and_status.ts`
- `task_execution.ts`
- `task_execution_nondefault_spaces.ts`
- `risk_score_calculation.ts`
- `resolution_scoring.ts`
- `preview_api.ts`

Checklist
- … `release_note:*` label is applied per the guidelines

Risks
- … `idBasedRiskScoringEnabled` and errors are non-fatal (logged as warnings)
- … `index.mode: lookup` for efficient joins