
[Entity Analytics] EUID-based risk scoring and entity resolution via Entity Store maintainer #259732

Merged
jaredburgettelastic merged 236 commits into elastic:main from hop-dev:risk-score-maintainer-phase-1
Apr 3, 2026

hop-dev (Contributor) commented Mar 26, 2026

Summary

Implements the v2 risk scoring pipeline: a new ES|QL-based maintainer that scores entities by EUID, reads modifiers directly from the Entity Store, uses run-aware reset logic, and resolves entity groups via a lookup index. All new behaviour, including the V2 path of the preview risk scores API, is feature-flagged and runs alongside the existing risk engine without affecting it.

Foundation work was merged separately in #258197 (maintainer registration, legacy API guards, saved object setup).


What changed from v1

| Concern | v1 (legacy risk engine) | v2 (risk score maintainer) |
| --- | --- | --- |
| Entity identity | user.name / host.name | EUID (host:<id>, user:<email@namespace>) via getEuidEsqlEvaluation() |
| Scoring engine | Composite agg → ES\|QL per page | Same two-step pattern, but the ES\|QL query now computes EUIDs inline and scores by entity.id |
| Modifier source | Separate queries to asset criticality index + privmon index | Read directly from pre-fetched Entity Store documents (entity.asset.criticality, entity.attributes.watchlists) |
| Watchlist model | Hardcoded is_privileged_user boolean | Generic watchlist system: any number of named watchlists with configurable riskModifier values, applied multiplicatively |
| Resolution scoring | Not supported | Phase 2: entities sharing a resolution target are grouped and scored as one via LOOKUP JOIN (see below) |
| Persistence | Risk index only | Dual-write: risk index + Entity Store (entity.risk.* for base, entity.relationships.resolution.risk.* for resolution) |
| Shadow entities | All scored entities are written | Entities not in the Entity Store are intentionally dropped from persistence |
| Entity categorization | None | Each scored entity is classified as write_now, defer_to_phase_2, or not_in_store to route between phases |
| Reset mechanism | Per-page exclusion list (excludedEntities) | Run-aware stale detection via calculation_run_id, with a dual pass for base and resolution score types |
| Orchestration | risk_score_service → transforms | risk_score_maintainer registered with the Entity Store maintainer framework |
| Preview API | Full v1 scoring pipeline (stateful) | V2 preview returns base scores only with entity.id-based identifiers; resolution scoring and reset-to-zero are excluded as preview is stateless |
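
To make the EUID format in the comparison concrete, here is a minimal sketch of splitting an EUID such as host:<id> into its type prefix and identifier. `parseEuid` and `ParsedEuid` are hypothetical names used only for illustration; the real EUID logic lives in getEuidEsqlEvaluation() and is not reproduced here.

```typescript
// Hypothetical shape for illustration; not part of the PR.
interface ParsedEuid {
  entityType: string; // e.g. "host" or "user"
  identifier: string; // everything after the first colon
}

// Splits on the FIRST colon only, so identifiers that themselves
// contain colons are kept intact.
function parseEuid(euid: string): ParsedEuid | undefined {
  const separator = euid.indexOf(':');
  // Reject values with no type prefix or an empty identifier.
  if (separator <= 0 || separator === euid.length - 1) {
    return undefined;
  }
  return {
    entityType: euid.slice(0, separator),
    identifier: euid.slice(separator + 1),
  };
}

const parsedUser = parseEuid('user:alice@example.com@default');
```

Splitting on the first colon is an assumption about the format shown in the description, not a statement about how the Entity Store validates identifiers.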

How reset-to-zero changed

v1 accumulates a global scoredEntityIds array across all pages, then resets any entity with a positive score whose ID is not in that array. This requires unbounded in-memory state and breaks down in a multi-phase pipeline where later phases write additional score documents after base pagination completes.

v2 replaces this with run-aware stale detection:

  1. Query the risk index for each entity's latest score doc (STATS ... LAST(score, @timestamp) BY id_value)
  2. Scope to the target score type (WHERE score_type IS NULL OR score_type == "<type>") — runs independently for base and resolution
  3. Reset only if the latest positive score was not produced by the current run (calculation_run_id IS NULL OR calculation_run_id != currentRunId)
  4. Bounded to LIMIT 10000 per run — remaining stale entities are handled by subsequent scheduled runs

This removes the need for per-page exclusion state and composes cleanly with multiple score types.

Reset ES|QL query
FROM {risk_score_alias}
  | EVAL id_value = TO_STRING({entityField})
  | EVAL score = TO_DOUBLE({scoreField})
  | EVAL score_type = TO_STRING({scoreTypeField})
  | EVAL calculation_run_id = TO_STRING({runIdField})
  | WHERE id_value IS NOT NULL AND id_value != ""
  | WHERE score_type IS NULL OR score_type == "{targetScoreType}"
  | STATS
      score = LAST(score, @timestamp),
      calculation_run_id = LAST(calculation_run_id, @timestamp)
    BY id_value
  | WHERE score > 0
  | WHERE calculation_run_id IS NULL OR calculation_run_id != "{currentRunId}"
  | KEEP id_value
  | LIMIT 10000
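
The selection logic of the reset query can be sketched in plain TypeScript as an in-memory approximation. `RiskScoreDoc` and `findStaleEntityIds` are hypothetical names, and the sketch takes the latest document per entity as a whole, whereas the query evaluates LAST(...) per field, so edge cases with missing values may differ.

```typescript
// Hypothetical document shape; `timestamp` stands in for @timestamp.
interface RiskScoreDoc {
  id_value: string;
  timestamp: number; // epoch millis
  score: number;
  score_type?: 'base' | 'propagated' | 'resolution';
  calculation_run_id?: string;
}

function findStaleEntityIds(
  docs: RiskScoreDoc[],
  targetScoreType: 'base' | 'resolution',
  currentRunId: string,
  limit = 10000
): string[] {
  // Step 1 + 2: keep the latest in-scope document per entity.
  const latestById = new Map<string, RiskScoreDoc>();
  for (const doc of docs) {
    if (doc.id_value === '') continue;
    // v1 documents have no score_type and are in scope for every pass.
    if (doc.score_type !== undefined && doc.score_type !== targetScoreType) continue;
    const current = latestById.get(doc.id_value);
    if (current === undefined || doc.timestamp >= current.timestamp) {
      latestById.set(doc.id_value, doc);
    }
  }

  // Step 3 + 4: positive score, not written by the current run, bounded.
  const stale: string[] = [];
  latestById.forEach((doc, id) => {
    const fromOtherRun = doc.calculation_run_id !== currentRunId;
    if (stale.length < limit && doc.score > 0 && fromOtherRun) {
      stale.push(id);
    }
  });
  return stale;
}
```

Because the check depends only on the run ID stored with each document, the same function can be called once per score type without sharing any state between passes.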

Schema updates

Three new fields are added to the risk score index mapping. All are optional and backward-compatible: existing v1 documents without these fields continue to work.

New risk score document fields

| Field | Type | Purpose |
| --- | --- | --- |
| score_type | keyword | Classifies the score document: base, propagated, or resolution. Absent on v1 documents. |
| calculation_run_id | keyword | UUID of the maintainer run that produced the document. Used for run-aware stale detection in reset. |
| related_entities | array of { entity_id, relationship_type } | Present on resolution score documents; lists the contributing aliases in the resolution group. |

Modifier type changes

The Modifier<'watchlist'> type was generalized:

// Before (v1)
{ type: 'watchlist', subtype: 'privmon', metadata: { is_privileged_user: boolean } }

// After (v2)
{ type: 'watchlist', subtype: string, metadata: { watchlist_id: string } }

Multiple watchlist modifiers per entity are supported and combine multiplicatively. The riskScoreDocFactory() third parameter changed from Modifier<'watchlist'> | undefined to Array<Modifier<'watchlist'>>. Backward compatibility with the existing privmon modifier is preserved.
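
As a rough illustration of "combine multiplicatively", the sketch below builds v2-shaped watchlist modifiers and multiplies their riskModifier values. `WatchlistConfig`, `toWatchlistModifiers`, and `combinedWatchlistMultiplier` are hypothetical names, using the watchlist name as `subtype` is an assumption, and how the resulting multiplier is applied to (or capped on) the final score is not specified in this description.

```typescript
// Hypothetical config shape; only riskModifier is named in the PR text.
interface WatchlistConfig {
  id: string;
  name: string;
  riskModifier: number;
}

// Mirrors the v2 modifier shape shown above.
interface WatchlistModifier {
  type: 'watchlist';
  subtype: string;
  metadata: { watchlist_id: string };
}

// One modifier per watchlist the entity is on; unknown IDs are skipped.
function toWatchlistModifiers(
  entityWatchlistIds: string[],
  configs: WatchlistConfig[]
): WatchlistModifier[] {
  const modifiers: WatchlistModifier[] = [];
  for (const id of entityWatchlistIds) {
    const config = configs.find((c) => c.id === id);
    if (config !== undefined) {
      // Assumption: subtype carries the watchlist name.
      modifiers.push({ type: 'watchlist', subtype: config.name, metadata: { watchlist_id: id } });
    }
  }
  return modifiers;
}

// Product of the riskModifier values; 1 means "no change".
function combinedWatchlistMultiplier(
  modifiers: WatchlistModifier[],
  configs: WatchlistConfig[]
): number {
  return modifiers.reduce((product, modifier) => {
    const config = configs.find((c) => c.id === modifier.metadata.watchlist_id);
    return config === undefined ? product : product * config.riskModifier;
  }, 1);
}
```

Passing an array rather than a single optional modifier matches the changed third parameter of riskScoreDocFactory() described above: an entity on no watchlists is simply an empty array with a multiplier of 1.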

Pipeline walkthrough

The maintainer run() executes per entity type (host, user, etc.):

Phase 1 — Base scoring

  1. Fetch watchlist configs once per run via WatchlistConfigClient
  2. Paginate entity IDs — composite aggregation with a Painless runtime mapping that computes EUIDs
  3. Score the page — ES|QL query bounded to the page's EUID range, using MV_PSERIES_WEIGHTED_SUM(TOP(risk_score, sampleSize, "desc"), 1.5)
  4. Fetch entities from Entity Store via crudClient.listEntities() for the scored page
  5. Apply modifiers: applyScoreModifiersFromEntities() reads criticality + watchlists from entity documents
  6. Categorize & persist — entities classified into write_now / defer_to_phase_2 / not_in_store; dual-write to risk index + Entity Store
  7. Sync lookup index — upsert alias→target relationship docs for entities with resolution relationships into .entity_analytics.risk_score.lookup-{namespace} (indexed with index.mode: lookup)
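
The scoring expression in step 3 can be illustrated outside ES|QL. This is a minimal sketch of what MV_PSERIES_WEIGHTED_SUM(TOP(risk_score, sampleSize, "desc"), 1.5) computes for one entity: the top sampleSize alert scores in descending order, each divided by its 1-based rank raised to the power p. `pSeriesWeightedSum` is a hypothetical helper name, and any normalization applied afterwards is not shown.

```typescript
// Sum of score_i / i^p over the top `sampleSize` scores, highest first.
function pSeriesWeightedSum(riskScores: number[], sampleSize: number, p = 1.5): number {
  const top = riskScores
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b - a)
    .slice(0, sampleSize);
  return top.reduce((sum, score, index) => sum + score / Math.pow(index + 1, p), 0);
}

// Two alerts with scores 10 and 20: 20 / 1^1.5 + 10 / 2^1.5
const twoAlertSum = pSeriesWeightedSum([10, 20], 10);
```

The weights shrink quickly with rank, so the highest-scoring alerts dominate and a long tail of low-scoring alerts adds little to the total.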

Phase 2 — Resolution scoring

  1. Paginate resolution groups — composite aggregation over resolution_target_id in the lookup index
  2. Score each group — ES|QL query with LOOKUP JOIN joins the lookup index to combine alerts from all aliases in the group
  3. Parse related entities — contributing aliases extracted as related_entities from the join result
  4. Apply group-level modifiers — highest asset criticality across the group, union of all watchlists
  5. Persist — resolution scores written to risk index (score_type: "resolution") and Entity Store (entity.relationships.resolution.risk.*)
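
Step 4 can be sketched as a small reduction over the members of a resolution group: take the highest asset criticality and the union of watchlist IDs. `GroupMember`, `GroupModifiers`, and `resolveGroupModifiers` are hypothetical names, and the member shape is an assumption rather than the Entity Store document schema.

```typescript
type CriticalityLevel = 'low_impact' | 'medium_impact' | 'high_impact' | 'extreme_impact';

// Ordering used to pick the "highest" level.
const CRITICALITY_RANK: Record<CriticalityLevel, number> = {
  low_impact: 1,
  medium_impact: 2,
  high_impact: 3,
  extreme_impact: 4,
};

// Hypothetical, simplified view of an entity in a resolution group.
interface GroupMember {
  entityId: string;
  criticality?: CriticalityLevel;
  watchlists: string[];
}

interface GroupModifiers {
  criticality?: CriticalityLevel;
  watchlists: string[];
}

function resolveGroupModifiers(members: GroupMember[]): GroupModifiers {
  let highest: CriticalityLevel | undefined;
  const seen = new Set<string>();
  const watchlists: string[] = [];

  for (const member of members) {
    const level = member.criticality;
    if (
      level !== undefined &&
      (highest === undefined || CRITICALITY_RANK[level] > CRITICALITY_RANK[highest])
    ) {
      highest = level;
    }
    // Union of watchlists, preserving first-seen order.
    for (const id of member.watchlists) {
      if (!seen.has(id)) {
        seen.add(id);
        watchlists.push(id);
      }
    }
  }
  return { criticality: highest, watchlists };
}
```

Members without a criticality assignment are ignored when picking the highest level, so a group with no assignments yields no criticality modifier at all.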

Cleanup

  1. Reset to zero — run-aware stale detection clears decayed scores (dual pass for base and resolution)
  2. Prune lookup index — removes stale relationship docs for entities no longer in the Entity Store

Preview API (synchronous)

The existing /internal/risk_score/preview endpoint is extended with a V2 code path, gated by the entityAnalyticsEntityStoreV2 experimental feature flag. When V2 is active:

  • Runs the same ES|QL base scoring query as Phase 1, but synchronously within the HTTP request
  • Returns id_field: "entity.id" and raw EUIDs as id_value (e.g. host:<id>, user:<email>) — no translation to legacy identifier fields
  • Applies Entity Store modifiers (asset criticality + watchlists) when entities are enrolled
  • Omits resolution scoring and reset-to-zero, as preview is stateless and read-only
  • Falls back transparently to the V1 scoring path when the flag is disabled

Telemetry

New event-based telemetry for observability:

  • risk_score_maintainer_run_summary — one per {namespace, entityType, calculationRunId} with outcome, counters, and duration
  • risk_score_maintainer_stage_summary — one per stage (phase1_base_scoring, phase1_lookup_sync, phase2_resolution_scoring, reset_to_zero) with per-stage timing and error details
  • Run counters include: scoresWrittenBase, scoresWrittenResolution, lookupDocsUpserted, lookupDocsDeleted, lookupPrunedDocs
  • Bounded taxonomies for status (success | error | skipped | aborted), skip reasons, and error kinds

Testing

Feature flags

Add to kibana.yml (or kibana.dev.yml):

xpack.securitySolution.enableExperimental:
  - entityAnalyticsEntityStoreV2
  - entityAnalyticsWatchlistEnabled

Then enable Entity Store V2 and the ID-based risk scoring advanced setting (securitySolution:entityStoreEnableV2) via Stack Management → Advanced Settings.

Manual testing with the document generator

The security-documents-generator has a companion PR (#342) with a risk-score-v2 command that seeds entities, ingests alerts, triggers maintainer runs, and prints a scorecard:

# In the security-documents-generator repo, on the risk-scoring-v2 branch:
yarn start risk-score-v2

The command supports interactive follow-on actions (expand entities, tweak scores, view risk docs, run comparisons) and includes graph workflows for resolution + propagation.

What to verify

  • Entities enrolled in the Entity Store receive risk scores; entities not in the store do not
  • Scores appear in both the risk index (.ds-risk-score.risk-score-*) and on Entity Store documents (entity.risk.*)
  • score_type: "base" and calculation_run_id are present on new risk score documents
  • Asset criticality modifiers are applied correctly (sourced from entity documents, not the criticality index)
  • Watchlist modifiers apply multiplicatively when an entity is on multiple watchlists
  • Entities sharing a resolution target produce a single resolution score with score_type: "resolution" and related_entities
  • Resolution scores carry the highest criticality and union of watchlists across the group
  • Aliases in a resolution group do not receive their own resolution score
  • Resolution scores are dual-written to Entity Store under entity.relationships.resolution.risk.*
  • When alerts age out of the risk window, scores decay to zero on the next maintainer run
  • Legacy risk engine routes return 400 when V2 is enabled
  • Disabling the feature flags restores normal v1 behavior with no side effects
  • Preview API returns V2 base scores with entity.id-based identifiers when V2 is enabled, and falls back to V1 scoring when disabled

FTR test coverage

node scripts/jest_integration \
  --config x-pack/solutions/security/test/security_solution_api_integration/test_suites/entity_analytics/risk_score_maintainer/trial_license_complete_tier/configs/ess.config.ts

| Suite | Covers |
| --- | --- |
| setup_and_status.ts | Maintainer registration, index template creation |
| task_execution.ts | Scoring, restart, manual run, asset criticality integration |
| task_execution_nondefault_spaces.ts | Namespace-scoped scoring isolation |
| risk_score_calculation.ts | Single/multi-entity scoring, criticality modifiers, watchlist modifiers |
| resolution_scoring.ts | Resolution group aggregation, multi-alias groups, group-level modifiers, Entity Store dual-write |
| preview_api.ts | V2 base-only preview scores, entity.id-based identifiers, Entity Store modifier application, V1 fallback |

Checklist

  • Unit or functional tests were updated or added to match the most common scenarios
  • Flaky Test Runner was used on any tests changed
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

Risks

| Risk | Severity | Mitigation |
| --- | --- | --- |
| New maintainer scoring diverges from v1 in edge cases | Medium | FTR tests assert parity; v2 is fully feature-flagged and does not affect v1 code paths |
| Dual-write to Entity Store adds write load | Low | Writes are conditional on idBasedRiskScoringEnabled and errors are non-fatal (logged as warnings) |
| Reset-to-zero batch limit (10k) may leave stale scores temporarily | Low | By design: remaining stale entities are cleared in subsequent scheduled runs |
| Lookup index grows with entity relationships | Low | Pruned each maintainer run; index uses index.mode: lookup for efficient joins |
| Resolution scoring adds a second pagination pass | Low | Only runs when resolution relationships exist; bounded by the same composite pagination limits |

hop-dev and others added 30 commits March 11, 2026 12:13
- Renamed `entityAnalytics94ModeEnabled` to `entityAnalyticsEntityStoreV2Enabled` for clarity.
- Updated conditional checks in the plugin to use the new feature flag.
- Enhanced error handling in routes to return appropriate responses when entity analytics V2 mode is enabled.
- Added tests to validate behavior when entity analytics 9.4 mode is enabled, ensuring proper error responses.
- Introduced new translations for error messages related to the V2 mode API restrictions.
…sform initialization

- Added `initLegacyTransforms` method to `RiskScoreDataClient` for initializing legacy risk engine transforms.
- Updated `RiskEngineDataClient` to call `initLegacyTransforms` during initialization.
- Enhanced mock implementation in `risk_score_data_client.mock.ts` to include `initLegacyTransforms`.
- Added tests for `initLegacyTransforms` to ensure proper functionality across namespaces.
- Improved error logging for legacy transform initialization failures.
Replaces references to "entity analytics 9.4 mode" with "Entity Store V2"
in error messages and test descriptions to align with the updated
feature flag terminology.

Made-with: Cursor
hop-dev added 5 commits April 2, 2026 13:54
Introduce a shared StepResult shape for scoring steps so pipeline orchestration and metrics consume a consistent summary contract across base, resolution, and reset phases.

Made-with: Cursor
Route reset-to-zero writes through shared persistence helpers and pass the orchestrator writer dependency so reset behavior matches base and resolution best-effort bulk handling.

Made-with: Cursor
The uninstall call during test cleanup returns 403 when the store was
never installed, which is normal between tests. Logging at warn creates
noise that obscures real failures.

Made-with: Cursor
…ForMaintainerRun

The modifier-shape test intermittently timed out because stopMaintainer
only prevents future scheduling — it doesn't kill in-flight executions.
After stop+start the task often entered an indeterminate "currently
running" state for >60s. Remove the stop/start dance entirely: set up
both modifiers while the maintainer runs freely, confirm they're in the
entity store, then trigger a fresh run that will see both.

Also improves the shared waitForMaintainerRun helper to retry the manual
trigger inside the poll loop (instead of fire-once) and fail fast if the
maintainer errors during the wait.

Made-with: Cursor
Same approach as the modifier-shape test: set up the watchlist while the
maintainer runs freely, confirm it in the entity store, then trigger a
fresh run. Safe because the retry loop already filters by run_id and
checks for the watchlist modifier.

Made-with: Cursor
hop-dev added 3 commits April 2, 2026 15:11
Remove the strict lastErrorTimestamp fail-fast in waitForMaintainerRun so
custom-namespace setup can recover from transient early maintainer errors
while still relying on the successful new-run count condition.

Made-with: Cursor
hop-dev (Contributor, Author) commented Apr 2, 2026

@tiansivive I have refactored the risk score maintainer pipeline into more helpers to make the flow easier to read and reason about. I've also added a few comments to hopefully make things clearer.

The code is much better now: each stage is much more aligned with the others, and they use some shared helpers.

@hop-dev hop-dev requested a review from tiansivive April 2, 2026 14:34
Comment on lines +150 to +174
const baseScores = applyScoreModifiersFromEntities({
  now,
  identifierType: entityType,
  scoreType: 'base',
  calculationRunId,
  page: {
    scores: buildZeroScores(baseEntityIds),
    identifierField,
  },
  entities,
  watchlistConfigs,
});

const resolutionScores = applyScoreModifiersFromEntities({
  now,
  identifierType: entityType,
  scoreType: 'resolution',
  calculationRunId,
  page: {
    scores: buildZeroScores(resolutionEntityIds),
    identifierField,
  },
  entities,
  watchlistConfigs,
});

maybe something for later, but this could be concurrent?

Comment on lines +165 to +166
if (isMaintainerAlreadyRunningError(error)) {
  if (!alreadyRunningHandled) {

nit: we can flatten this

Comment on lines +147 to +149
let requiredNewRuns = minRuns;
let manualRunTriggered = false;
let alreadyRunningHandled = false;

I'm really struggling with this file/fn; I just don't really have the overall view of why this is needed.

Contributor Author

This is just the test helper. It has so many resilience fixes in it now from 20 rounds of debugging, but I think it gets a pass because it's test infra.

hop-dev (Contributor, Author) commented Apr 2, 2026

/sync-ci

hop-dev and others added 5 commits April 2, 2026 20:15
…ntity maintainer tests

Embeds settle logic directly into `waitForMaintainerRun` to ensure the task is idle before returning. This prevents version_conflict_engine_exceptions when a caller immediately stops the maintainer while a follow-up run is still saving its state.

Made-with: Cursor
- Removes redundant `scale` test which duplicated coverage of existing modifier tests.
- Simplifies `reset-to-zero` test by removing unnecessary dual-host setup.
- Merges redundant dual-write resolution test into the main resolution scoring test to save a slow maintainer setup/teardown cycle.
- Removes unused `waitForMaintainerToSettle` import and calls after embedding settle logic directly into `waitForMaintainerRun`.

Made-with: Cursor
Comment on lines +668 to +671
const scores = normalizeScores(await readRiskScores(es));
const hostScores = scores.filter((s) => s.id_value === host.expectedEuid);
expect(hostScores.every((s) => (s.calculated_score_norm ?? 0) > 0)).to.be(true);
});

🟡 Medium trial_license_complete_tier/risk_score_calculation.ts:668

The assertion hostScores.every((s) => (s.calculated_score_norm ?? 0) > 0) at line 670 silently passes when hostScores is empty because [].every(...) returns true in JavaScript. If the host entity has no scores, the test incorrectly passes without verifying that reset-to-zero was actually disabled. Consider asserting hostScores.length > 0 before the every() check.

+      // The entity should NOT have been reset to zero — only positive scores should exist
+      const scores = normalizeScores(await readRiskScores(es));
+      const hostScores = scores.filter((s) => s.id_value === host.expectedEuid);
+      expect(hostScores.length).to.be.greaterThan(0);
       expect(hostScores.every((s) => (s.calculated_score_norm ?? 0) > 0)).to.be(true);
Evidence trail:
File: x-pack/solutions/security/test/security_solution_api_integration/test_suites/entity_analytics/risk_score_maintainer/trial_license_complete_tier/risk_score_calculation.ts, lines 669-670 (REVIEWED_COMMIT). The code shows `const hostScores = scores.filter((s) => s.id_value === host.expectedEuid);` followed by `expect(hostScores.every((s) => (s.calculated_score_norm ?? 0) > 0)).to.be(true);` with no length check. JavaScript's `Array.prototype.every()` returns `true` for empty arrays per ECMAScript spec (https://tc39.es/ecma262/#sec-array.prototype.every).
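
The vacuous-truth behaviour described in this comment is easy to reproduce in isolation. The sketch below uses a hypothetical `ScoreDoc` shape and a hypothetical `allScoresPositive` helper to contrast the unguarded check with one that first requires a non-empty array; it is not the test code from the PR.

```typescript
// Hypothetical, minimal shape of a risk score document.
interface ScoreDoc {
  id_value: string;
  calculated_score_norm?: number;
}

// Guarded variant: an empty result set is treated as a failure.
function allScoresPositive(scores: ScoreDoc[]): boolean {
  return scores.length > 0 && scores.every((s) => (s.calculated_score_norm ?? 0) > 0);
}

const noScores: ScoreDoc[] = [];

// Unguarded: every() on an empty array is true, so the check passes vacuously.
const unguardedResult = noScores.every((s) => (s.calculated_score_norm ?? 0) > 0);

// Guarded: the same input now fails, which is what the test intends.
const guardedResult = allScoresPositive(noScores);
```

In a test, asserting the length separately (as in the suggested diff) gives a clearer failure message than folding both conditions into one boolean.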

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • Scout: [ platform / navigation ] plugin / local-serverless-security_complete - navigation - has security serverless side nav
  • Rule Management - Security Solution Cypress Tests #4 / Rules table - privileges securitySolutionRulesV1.all should be able to adjust snooze settings should be able to adjust snooze settings
  • Rule Management - Security Solution Cypress Tests #4 / Rules table - privileges securitySolutionRulesV1.read should not be able to adjust snooze settings should not be able to adjust snooze settings

Metrics

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 138 | 139 | +1 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 11.6MB | 11.6MB | +578.0B |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 207 | 208 | +1 |


cc @hop-dev

jaredburgettelastic (Contributor) left a comment

Desk test only, thank you for the incredible work 🎉

@jaredburgettelastic jaredburgettelastic merged commit 043dca3 into elastic:main Apr 3, 2026
16 checks passed

Labels

  • backport:skip (This PR does not require backporting)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • Team:Entity Analytics (Security Entity Analytics Team)
  • v9.4.0
