2 changes: 2 additions & 0 deletions docs/api/datahub-apis.md
@@ -33,6 +33,8 @@ We recommend using the GraphQL API if you're getting started with DataHub since

- Search for datasets with conditions
- Update a certain field of a dataset
- Iterate over large datasets using `scrollAcrossEntities`
- Determine the total number of records using `aggregateAcrossEntities`

Learn more about the GraphQL API:

76 changes: 75 additions & 1 deletion docs/api/graphql/getting-started.md
@@ -67,9 +67,83 @@ The search term can be a simple string, or it can be a more complex query using

:::note
Note that by default Elasticsearch only allows pagination through 10,000 entities via the search API.
If you need to paginate through more, you can change the default value for the `index.max_result_window` setting in Elasticsearch, or using the scroll API to read from the index directly.
If you need to paginate through more, you can change the default value for the `index.max_result_window` setting in Elasticsearch, or use the `scrollAcrossEntities` GraphQL API to iterate over large datasets.
:::

### Iterating Over Large Datasets

To iterate over result sets larger than 10,000 entities, use the `scrollAcrossEntities` GraphQL API. It returns results in batches, and each response includes a `nextScrollId` cursor for fetching the next batch.

**Initial Query**:

```graphql
{
  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10 }) {
    nextScrollId
    count
    searchResults {
      entity {
        type
        ... on Dataset {
          urn
          type
          platform {
            name
          }
          name
        }
      }
    }
  }
}
```

**Subsequent Queries**:
Pass the `nextScrollId` from each response as the `scrollId` of the next request. Stop when the returned `nextScrollId` is null, which indicates that there are no more results to fetch.

```graphql
{
  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10, scrollId: "your_nextScrollId_here" }) {
    nextScrollId
    count
    searchResults {
      entity {
        type
        ... on Dataset {
          urn
          type
          platform {
            name
          }
          name
        }
      }
    }
  }
}
```
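
The two queries above can be combined into a single loop. The following Python sketch is illustrative only: `scroll_all` and the `execute` callable are hypothetical names, and the operation signature (`ScrollAcrossEntitiesInput`) is an assumption that should be checked against your DataHub GraphQL schema. `execute` is expected to send a query plus variables to the GraphQL endpoint and return the decoded JSON response.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Assumption: the input type is named ScrollAcrossEntitiesInput in your schema.
SCROLL_QUERY = """
query scroll($input: ScrollAcrossEntitiesInput!) {
  scrollAcrossEntities(input: $input) {
    nextScrollId
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def scroll_all(
    execute: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    batch_size: int = 10,
) -> Iterator[Dict[str, Any]]:
    """Yield every dataset entity, following nextScrollId until it is null."""
    scroll_id: Optional[str] = None
    while True:
        params: Dict[str, Any] = {
            "types": ["DATASET"],
            "query": "*",
            "count": batch_size,
        }
        # The first request carries no scrollId; later ones reuse the cursor.
        if scroll_id is not None:
            params["scrollId"] = scroll_id

        response = execute(SCROLL_QUERY, {"input": params})
        page = response["data"]["scrollAcrossEntities"]

        for result in page["searchResults"]:
            yield result["entity"]

        # A null (or missing) nextScrollId means the last batch was reached.
        scroll_id = page.get("nextScrollId")
        if not scroll_id:
            break
```

Passing the transport in as a callable keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub before pointing it at a real instance.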

### Determining the Total Number of Records

To determine the total number of records, use the `aggregateAcrossEntities` GraphQL query.

```graphql
query aggregateAcrossEntities {
  aggregateAcrossEntities(input: { types: [DATASET], query: "*", facets: ["_entityType"] }) {
    facets {
      field
      displayName
      aggregations {
        value
        count
      }
    }
  }
}
```

This query aggregates matching entities by the `_entityType` facet. Because the input restricts `types` to `DATASET`, the `count` of the aggregation whose `value` is `DATASET` is the total number of datasets.
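
As a rough illustration of reading that count out of a response, the Python sketch below walks the `facets` list and returns the matching aggregation's `count`. The helper name `count_for_entity_type` is hypothetical, the sample response values are made up, and the response shape is assumed to mirror the fields selected in the query above.

```python
from typing import Any, Dict


def count_for_entity_type(
    response: Dict[str, Any],
    entity_type: str = "DATASET",
    facet_field: str = "_entityType",
) -> int:
    """Return the aggregation count for one entity type, or 0 if it is absent."""
    facets = response["data"]["aggregateAcrossEntities"]["facets"] or []
    for facet in facets:
        if facet["field"] != facet_field:
            continue
        for bucket in facet["aggregations"]:
            if bucket["value"] == entity_type:
                return bucket["count"]
    return 0


# Made-up response for illustration; real values depend on your instance.
sample_response = {
    "data": {
        "aggregateAcrossEntities": {
            "facets": [
                {
                    "field": "_entityType",
                    "displayName": "Type",
                    "aggregations": [{"value": "DATASET", "count": 1234}],
                }
            ]
        }
    }
}

total_datasets = count_for_entity_type(sample_response)
```

Returning 0 when no matching aggregation exists is a design choice for this sketch; raising an error instead may be preferable if an empty result should never happen.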

## Modifying an Entity: Mutations

:::note
43 changes: 31 additions & 12 deletions docs/how/search.md
@@ -205,14 +205,16 @@ query searchEntities {
}
```

### Searching at Scale
### Iterating Over Large Datasets

For queries that return more than 10k entities we recommend using the [scrollAcrossEntities](https://datahubproject.io/docs/graphql/queries/#scrollacrossentities) GraphQL API:
To iterate over large result sets, use the [scrollAcrossEntities](https://datahubproject.io/docs/graphql/queries/#scrollacrossentities) GraphQL API. This API lets you scroll through results in batches, which avoids the 10,000-entity pagination limit that Elasticsearch applies by default.

```
# Example query
#### Initial Query
Start by making an initial query to `scrollAcrossEntities` to begin scrolling through the datasets. This will return a `nextScrollId` which you will use in subsequent queries.

```graphql
{
scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10}) {
scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10 }) {
nextScrollId
count
searchResults {
@@ -232,14 +234,12 @@ For queries that return more than 10k entities we recommend using the [scrollAcr
}
```

This will return a response containing a `nextScrollId` value which must be used in subsequent queries to retrieve more data, i.e:
#### Subsequent Queries
Pass the `nextScrollId` from each response as the `scrollId` of the next request. Repeat until the returned `nextScrollId` is null, which indicates that there are no more results to fetch.

```
```graphql
{
scrollAcrossEntities(input:
{ types: [DATASET], query: "*", count: 10,
scrollId: "eyJzb3J0IjpbMy4wLCJ1cm46bGk6ZGF0YXNldDoodXJuOmxpOmRhdGFQbGF0Zm9ybTpiaWdxdWVyeSxiaWdxdWVyeS1wdWJsaWMtZGF0YS5jb3ZpZDE5X2dlb3RhYl9tb2JpbGl0eV9pbXBhY3QucG9ydF90cmFmZmljLFBST0QpIl0sInBpdElkIjpudWxsLCJleHBpcmF0aW9uVGltZSI6MH0="}
) {
scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10, scrollId: "your_nextScrollId_here" }) {
nextScrollId
count
searchResults {
@@ -259,7 +259,26 @@ This will return a response containing a `nextScrollId` value which must be used
}
```

In order to complete scrolling through all of the results, continue to request data in batches until the `nextScrollId` returned is null or undefined.
### Determining the Total Number of Records

To determine the total number of datasets in DataHub, use the `aggregateAcrossEntities` GraphQL query. It returns aggregated entity counts per facet value without fetching the entities themselves.

```graphql
query aggregateAcrossEntities {
  aggregateAcrossEntities(input: { types: [DATASET], query: "*", facets: ["_entityType"] }) {
    facets {
      field
      displayName
      aggregations {
        value
        count
      }
    }
  }
}
```

This query aggregates matching entities by the `_entityType` facet. Because the input restricts `types` to `DATASET`, the `count` of the aggregation whose `value` is `DATASET` is the total number of datasets.

### DataHub Blog