diff --git a/_aggregations/bucket/significant-terms.md b/_aggregations/bucket/significant-terms.md index c255379dd8b..91cc0de9612 100644 --- a/_aggregations/bucket/significant-terms.md +++ b/_aggregations/bucket/significant-terms.md @@ -9,28 +9,89 @@ redirect_from: # Significant terms aggregations -The `significant_terms` aggregation lets you spot unusual or interesting term occurrences in a filtered subset relative to the rest of the data in an index. +`significant_terms` helps you surface terms that are unusually frequent in a subset of documents (foreground set) compared to a broader reference set (background set). It’s the right choice when a plain `terms` aggregation shows you the *most common* values, but you want the *most over‑represented* values. -A foreground set is the set of documents that you filter. A background set is a set of all documents in an index. -The `significant_terms` aggregation examines all documents in the foreground set and finds a score for significant occurrences in contrast to the documents in the background set. +- Foreground set: the documents matched by your query. +- Background set: by default, all documents in the target indexes. You can narrow it with `background_filter`. -In the sample web log data, each document has a field containing the `user-agent` of the visitor. This example searches for all requests from an iOS operating system. A regular `terms` aggregation on this foreground set returns Firefox because it has the most number of documents within this bucket. On the other hand, a `significant_terms` aggregation returns Internet Explorer (IE) because IE has a significantly higher appearance in the foreground set as compared to the background set. +Each result bucket includes: + +- `key`: the term value. +- `doc_count`: the number of foreground docs containing the term. +- `bg_count`: the number of background docs containing the term. +- `score`: how strongly the term stands out in the foreground relative to the background, see [Heuristics and scoring](#heuristics-and-scoring) for further details. + +If the aggregation returns no buckets, you likely didn’t filter the foreground, for example using `match_all`, or the foreground has the same distribution of terms as the background. +{: .note} + +## Basic example: Find out what's distinctive about high‑value returns + +You can use the following query to ask, among orders that were *returned and cost over 500*, which `payment_method` values are unusually common compared to the whole index? ```json -GET opensearch_dashboards_sample_data_logs/_search +GET retail_orders/_search { "size": 0, "query": { - "terms": { - "machine.os.keyword": [ - "ios" + "bool": { + "filter": [ + { "term": { "status.keyword": "RETURNED" } }, + { "range": { "order_total": { "gte": 500 } } } ] } }, "aggs": { - "significant_response_codes": { + "payment_signals": { "significant_terms": { - "field": "agent.keyword" + "field": "payment_method.keyword" + } + } + } +} +``` +{% include copy-curl.html %} + +## Multi‑set analysis + +You can compute "what’s unusual" per category by first splitting documents into buckets, then running `significant_terms` inside each bucket. + +### Example: Unusual `cancel_reason` per region + +```json +GET rides/_search +{ + "size": 0, + "aggs": { + "by_region": { + "terms": { "field": "region.keyword", "size": 5 }, + "aggs": { + "odd_cancellations": { + "significant_terms": { "field": "cancel_reason.keyword" } + } + } + } + } +} +``` +{% include copy-curl.html %} + +### Example: Hotspots on a map + +You have a dataset with field incidents for sites across a country. Each document has a point location `site.location` with type `geo_point` and a categorical field `issue.keyword`, for example `POWER_OUTAGE`, `FIBER_CUT`, `VANDALISM`. And you want to spot which issue types are over‑represented inside particular map tiles compared to a broader reference set. You can use `geotile_grid` that divides the map into zoom‑level tiles, higher `precision` means smaller tiles, such as street or city blocks, lower `precision` means larger tiles, for example city or region. Then run `significant_terms` inside each tile to find the local outliers. + +The following request segments by tiles and queries which `issue.keyword` is unusually frequent in those tiles: + +```json +GET field_ops/_search +{ + "size": 0, + "aggs": { + "tiles": { + "geotile_grid": { "field": "site.location", "precision": 6 }, + "aggs": { + "odd_issues": { + "significant_terms": { "field": "issue.keyword" } + } } } } @@ -38,34 +99,143 @@ GET opensearch_dashboards_sample_data_logs/_search ``` {% include copy-curl.html %} -#### Example response +## Focus the background with `background_filter` + +By default, the background is the entire index. Sometimes you want a **narrower reference set**. + +### Example: Toronto verses the rest of Canada ```json -... -"aggregations" : { - "significant_response_codes" : { - "doc_count" : 2737, - "bg_count" : 14074, - "buckets" : [ - { - "key" : "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)", - "doc_count" : 818, - "score" : 0.01462731514608217, - "bg_count" : 4010 - }, - { - "key" : "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1", - "doc_count" : 1067, - "score" : 0.009062566630410223, - "bg_count" : 5362 +GET news/_search +{ + "size": 0, + "query": { "term": { "city.keyword": "Toronto" } }, + "aggs": { + "unusual_topics": { + "significant_terms": { + "field": "topic.keyword", + "background_filter": { + "term": { "country.keyword": "Canada" } + } } - ] + } } - } } ``` +{% include copy-curl.html %} + +A custom background requires extra work, each candidate term’s background frequency must be computed by filtering, which can be slower than using the index‑wide counts. +{: .warning} + +## Free‑text fields and keyword fields + +`significant_terms` works best on exact value fields, for example, `keyword`, `numeric`. Running it on heavily tokenized text can be memory‑intensive. For analyzed text, consider [`significant_text`]({{site.url}}{{site.baseurl}}/aggregations/bucket/significant-text/) instead, which is designed for free‑text and supports the same significance heuristics. + +## Heuristics and scoring + +The `score` ranks terms by how much their foreground frequency departs from the background frequency. It has no units and is meaningful only for comparison within the same request and heuristic. + +You can choose one heuristic per request by adding its object under `significant_terms`. + +### JLH (balanced absolute × relative lift) + +Good general‑purpose choice. Favors terms that increase both *absolutely* and *relatively*. -If the `significant_terms` aggregation doesn't return any result, you might have not filtered the results with a query. Alternatively, the distribution of terms in the foreground set might be the same as the background set, implying that there isn't anything unusual in the foreground set. +```json +"significant_terms": { + "field": "payment_method.keyword", + "jlh": {} +} +``` + +#### JLH scoring + +JLH score is calculated as follows: + +`fg_pct = doc_count / foreground_total` and `bg_pct = bg_count / background_total`. JLH ≈ `(fg_pct − bg_pct) * (fg_pct / bg_pct)`. A term whose share rises a little but from a much bigger baseline will rank higher than one with the same absolute rise from a tiny background share. + +#### Score example calculation using JLH + +If your foreground (high‑value returns) has `2,000` orders and the background (all orders) has `120,000`. One bucket reports: + +- `doc_count = 160` +- `bg_count = 3,200` + +Percentages: + +- `fg_pct = 160 / 2000 = 0.08` +- `bg_pct = 3200 / 120000 ≈ 0.026666…` + +JLH ≈ `(0.08 − 0.026666…) * (0.08 / 0.026666…) ≈ 0.053333… * 3 ≈ 0.16` + +This positive score means searched term is notably more prevalent in high‑value returns than overall. Scores are relative, therefore use them to rank terms, not as absolute probabilities. + +### Mutual information + +Mutual information (MI) prefers frequent terms, it can pick up popular but still distinctive terms. Set `include_negatives: false` to ignore terms that are less common in the foreground than the background. If your background is not a superset of the foreground, set `background_is_superset: false`. See following example: + +```json +"significant_terms": { + "field": "product.keyword", + "mutual_information": { + "include_negatives": false, + "background_is_superset": true + } +} +``` + +### Chi‑square + +Similar to [MI](#mutual-information-mi). Also supports `include_negatives` and `background_is_superset`. + +```json +"significant_terms": { + "field": "error.keyword", + "chi_square": { "include_negatives": false } +} +``` + +### Google Normalized Distance + +Google Normalized Distance (GND) favors strong co‑occurrence. Useful for synonym discovery or items that tend to appear together. + +```json +"significant_terms": { + "field": "tag.keyword", + "gnd": {} +} +``` + +### Percentage + +Sorts terms by the ratio `doc_count`/`bg_count`, displays how many foreground hits a term has relative to its background hits, but it doesn’t account for the overall sizes of the two sets, so very rare terms can dominate. + +```json +"significant_terms": { + "field": "sku.keyword", + "percentage": {} +} +``` + +### Scripted heuristic + +Supply a custom formula using the following variables: + +- `_subset_freq`: term docs in the foreground +- `_superset_freq`: term docs in the background +- `_subset_size`: foreground size +- `_superset_size`: background size + +```json +"significant_terms": { + "field": "field.keyword", + "script_heuristic": { + "script": { + "lang": "painless", + "source": "params._subset_freq / (params._superset_freq - params._subset_freq + 1)" + } + } +} +``` -The default source of statistical information for background term frequencies is the entire index. You can narrow this scope with a background filter for more focus