# Significant terms aggregations

`significant_terms` helps you surface terms that are unusually frequent in a subset of documents (foreground set) compared to a broader reference set (background set). It’s the right choice when a plain `terms` aggregation shows you the *most common* values, but you want the *most over‑represented* values.

- Foreground set: the documents matched by your query.
- Background set: by default, all documents in the target indexes. You can narrow it with `background_filter`.

Each result bucket includes:

- `key`: the term value.
- `doc_count`: the number of foreground documents containing the term.
- `bg_count`: the number of background documents containing the term.
- `score`: how strongly the term stands out in the foreground relative to the background. For more information, see [Heuristics and scoring](#heuristics-and-scoring).

If the aggregation returns no buckets, you likely didn't filter the foreground set (for example, you used `match_all`), or the foreground has the same distribution of terms as the background.
{: .note}

## Basic example: Find out what's distinctive about high‑value returns

You can use the following query to ask: among orders that were *returned and cost 500 or more*, which `payment_method` values are unusually common compared to the whole index?

```json
GET retail_orders/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "status.keyword": "RETURNED" } },
        { "range": { "order_total": { "gte": 500 } } }
      ]
    }
  },
  "aggs": {
    "payment_signals": {
      "significant_terms": {
        "field": "payment_method.keyword"
      }
    }
  }
}
```
{% include copy-curl.html %}
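
The response contains one bucket for each significant `payment_method` value, along with the foreground (`doc_count`) and background (`bg_count`) set sizes. A trimmed, hypothetical response fragment (only the `aggregations` section is shown; the term, counts, and score are illustrative) might look like the following:

```json
"aggregations": {
  "payment_signals": {
    "doc_count": 2000,
    "bg_count": 120000,
    "buckets": [
      {
        "key": "gift_card",
        "doc_count": 160,
        "score": 0.16,
        "bg_count": 3200
      }
    ]
  }
}
```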

## Multi‑set analysis

You can compute "what’s unusual" per category by first splitting documents into buckets, then running `significant_terms` inside each bucket.

### Example: Unusual `cancel_reason` per region

```json
GET rides/_search
{
  "size": 0,
  "aggs": {
    "by_region": {
      "terms": { "field": "region.keyword", "size": 5 },
      "aggs": {
        "odd_cancellations": {
          "significant_terms": { "field": "cancel_reason.keyword" }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
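
In the response, each `by_region` bucket contains its own `odd_cancellations` result, so significant cancellation reasons are computed per region. A trimmed, hypothetical fragment (the region, counts, and scores are illustrative) might look like the following:

```json
"by_region": {
  "buckets": [
    {
      "key": "NORTH",
      "doc_count": 4200,
      "odd_cancellations": {
        "doc_count": 4200,
        "bg_count": 25000,
        "buckets": [
          { "key": "DRIVER_NO_SHOW", "doc_count": 310, "score": 0.12, "bg_count": 700 }
        ]
      }
    }
  ]
}
```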

### Example: Hot spots on a map

Suppose you have a dataset of field incidents for sites across a country. Each document has a `geo_point` field named `site.location` and a categorical field `issue.keyword` with values such as `POWER_OUTAGE`, `FIBER_CUT`, and `VANDALISM`. You want to spot which issue types are over‑represented in particular map tiles compared to a broader reference set. You can use a `geotile_grid` aggregation, which divides the map into zoom‑level tiles (a higher `precision` means smaller tiles, such as streets or city blocks; a lower `precision` means larger tiles, such as cities or regions), and then run `significant_terms` inside each tile to find the local outliers.

The following request buckets documents into tiles and asks which `issue.keyword` values are unusually frequent in each tile:

```json
GET field_ops/_search
{
  "size": 0,
  "aggs": {
    "tiles": {
      "geotile_grid": { "field": "site.location", "precision": 6 },
      "aggs": {
        "odd_issues": {
          "significant_terms": { "field": "issue.keyword" }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
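
Each `geotile_grid` bucket key uses the `{zoom}/{x}/{y}` tile format. A trimmed, hypothetical response fragment (the tile, counts, and scores are illustrative) might look like the following:

```json
"tiles": {
  "buckets": [
    {
      "key": "6/18/23",
      "doc_count": 1340,
      "odd_issues": {
        "doc_count": 1340,
        "bg_count": 98000,
        "buckets": [
          { "key": "FIBER_CUT", "doc_count": 95, "score": 0.21, "bg_count": 410 }
        ]
      }
    }
  ]
}
```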

## Focus the background with `background_filter`

By default, the background is the entire index. Sometimes you want a **narrower reference set**.

### Example: Toronto versus the rest of Canada

```json
GET news/_search
{
  "size": 0,
  "query": { "term": { "city.keyword": "Toronto" } },
  "aggs": {
    "unusual_topics": {
      "significant_terms": {
        "field": "topic.keyword",
        "background_filter": {
          "term": { "country.keyword": "Canada" }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

A custom background requires extra work: each candidate term's background frequency must be computed by applying the filter, which can be slower than using the index‑wide counts.
{: .warning}

## Free‑text fields and keyword fields

`significant_terms` works best on exact-value fields, such as `keyword` or numeric fields. Running it on heavily tokenized text can be memory‑intensive. For analyzed text, consider the [`significant_text`]({{site.url}}{{site.baseurl}}/aggregations/bucket/significant-text/) aggregation instead; it is designed for free text and supports the same significance heuristics.
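
For example, a minimal sketch of the same kind of analysis on an analyzed text field, reusing the earlier `news` example and assuming a hypothetical `content` text field, might look like the following:

```json
GET news/_search
{
  "size": 0,
  "query": { "term": { "city.keyword": "Toronto" } },
  "aggs": {
    "unusual_phrases": {
      "significant_text": { "field": "content" }
    }
  }
}
```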

## Heuristics and scoring

The `score` ranks terms by how much their foreground frequency departs from the background frequency. It has no units and is meaningful only for comparison within the same request and heuristic.

You can choose one heuristic per request by adding its object under `significant_terms`.

### JLH (balanced absolute × relative lift)

JLH is a good general‑purpose choice. It favors terms that increase both *absolutely* and *relatively*.

```json
"significant_terms": {
"field": "payment_method.keyword",
"jlh": {}
}
```

#### JLH scoring

The JLH score is calculated as follows:

`fg_pct = doc_count / foreground_total` and `bg_pct = bg_count / background_total`. JLH ≈ `(fg_pct − bg_pct) * (fg_pct / bg_pct)`. Multiplying the absolute change by the relative change balances the two: a rare term needs a large relative lift to rank highly, while an already common term needs a meaningful absolute increase.

#### Example JLH calculation

Suppose your foreground (high‑value returns) contains `2,000` orders and the background (all orders) contains `120,000` orders. One bucket reports:

- `doc_count = 160`
- `bg_count = 3,200`

Percentages:

- `fg_pct = 160 / 2000 = 0.08`
- `bg_pct = 3200 / 120000 ≈ 0.026666…`

JLH ≈ `(0.08 − 0.026666…) * (0.08 / 0.026666…) ≈ 0.053333… * 3 ≈ 0.16`

This positive score means the term is notably more prevalent in high‑value returns than overall. Scores are relative, so use them to rank terms rather than as absolute probabilities.

### Mutual information

Mutual information (MI) prefers frequent terms, so it can surface popular but still distinctive terms. Set `include_negatives: false` to ignore terms that are less common in the foreground than in the background. If your background is not a superset of the foreground, set `background_is_superset: false`. The following is an example MI configuration:

```json
"significant_terms": {
"field": "product.keyword",
"mutual_information": {
"include_negatives": false,
"background_is_superset": true
}
}
```

### Chi‑square

The chi‑square heuristic is similar to [mutual information](#mutual-information) and also supports `include_negatives` and `background_is_superset`.

```json
"significant_terms": {
"field": "error.keyword",
"chi_square": { "include_negatives": false }
}
```

### Google normalized distance

Google normalized distance (GND) favors terms with strong co‑occurrence. It's useful for synonym discovery or for finding items that tend to appear together.

```json
"significant_terms": {
"field": "tag.keyword",
"gnd": {}
}
```

### Percentage

The `percentage` heuristic scores terms by the ratio `doc_count / bg_count`, that is, the share of a term's background occurrences that also appear in the foreground. It doesn't account for the overall sizes of the two sets, so very rare terms can dominate the results.

```json
"significant_terms": {
"field": "sku.keyword",
"percentage": {}
}
```

### Scripted heuristic

Supply a custom formula using the following variables:

- `_subset_freq`: the number of foreground documents containing the term.
- `_superset_freq`: the number of background documents containing the term.
- `_subset_size`: the number of documents in the foreground set.
- `_superset_size`: the number of documents in the background set.

```json
"significant_terms": {
"field": "field.keyword",
"script_heuristic": {
"script": {
"lang": "painless",
"source": "params._subset_freq / (params._superset_freq - params._subset_freq + 1)"
}
}
}
```

The default source of statistical information for background term frequencies is the entire index. You can narrow this scope with a `background_filter` for a more focused comparison.
