diff --git a/docs/api/datahub-apis.md b/docs/api/datahub-apis.md
index c46aacde3a0cb..1ec37c3de4227 100644
--- a/docs/api/datahub-apis.md
+++ b/docs/api/datahub-apis.md
@@ -33,6 +33,8 @@ We recommend using the GraphQL API if you're getting started with DataHub since
 
 - Search for datasets with conditions
 - Update a certain field of a dataset
+- Iterate over large datasets using `scrollAcrossEntities`
+- Determine the total number of records using `aggregateAcrossEntities`
 
 Learn more about the GraphQL API:
 
diff --git a/docs/api/graphql/getting-started.md b/docs/api/graphql/getting-started.md
index dfa556051bd4d..6bd280326589f 100644
--- a/docs/api/graphql/getting-started.md
+++ b/docs/api/graphql/getting-started.md
@@ -67,9 +67,83 @@ The search term can be a simple string, or it can be a more complex query using
 
 :::note
 Note that by default Elasticsearch only allows pagination through 10,000 entities via the search API.
-If you need to paginate through more, you can change the default value for the `index.max_result_window` setting in Elasticsearch, or using the scroll API to read from the index directly.
+If you need to paginate through more, you can change the default value for the `index.max_result_window` setting in Elasticsearch, or use the `scrollAcrossEntities` GraphQL API to iterate over large datasets.
 :::
 
+### Iterating Over Large Datasets
+
+To handle large datasets exceeding 10,000 records, use the `scrollAcrossEntities` GraphQL API. This allows you to scroll through results in batches.
+
+**Initial Query**:
+
+```graphql
+{
+  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10 }) {
+    nextScrollId
+    count
+    searchResults {
+      entity {
+        type
+        ... on Dataset {
+          urn
+          type
+          platform {
+            name
+          }
+          name
+        }
+      }
+    }
+  }
+}
+```
+
+**Subsequent Queries**:
+Use the `nextScrollId` from the response to fetch the next batch of results. Continue until `nextScrollId` is null or undefined.
+
+```graphql
+{
+  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10, scrollId: "your_nextScrollId_here" }) {
+    nextScrollId
+    count
+    searchResults {
+      entity {
+        type
+        ... on Dataset {
+          urn
+          type
+          platform {
+            name
+          }
+          name
+        }
+      }
+    }
+  }
+}
+```
+
+### Determining the Total Number of Records
+
+To determine the total number of records, use the `aggregateAcrossEntities` GraphQL query.
+
+```graphql
+query aggregateAcrossEntities {
+  aggregateAcrossEntities(input: { types: [DATASET], query: "*", facets: ["_entityType"] }) {
+    facets {
+      field
+      displayName
+      aggregations {
+        value
+        count
+      }
+    }
+  }
+}
+```
+
+This query aggregates across the `DATASET` entity type; the `count` returned under the `_entityType` facet is the total number of datasets.
+
 ## Modifying an Entity: Mutations
 
 :::note
diff --git a/docs/how/search.md b/docs/how/search.md
index 2274fe7c09240..10b084d0ee219 100644
--- a/docs/how/search.md
+++ b/docs/how/search.md
@@ -205,14 +205,16 @@ query searchEntities {
 }
 ```
 
-### Searching at Scale
+### Iterating Over Large Datasets
 
-For queries that return more than 10k entities we recommend using the [scrollAcrossEntities](https://datahubproject.io/docs/graphql/queries/#scrollacrossentities) GraphQL API:
+To handle large datasets, use the [scrollAcrossEntities](https://datahubproject.io/docs/graphql/queries/#scrollacrossentities) GraphQL API. This API allows you to scroll through results in batches, bypassing the 10,000-record limit imposed by Elasticsearch.
 
-```
-# Example query
+#### Initial Query
+Start by making an initial query to `scrollAcrossEntities` to begin scrolling through the datasets. This will return a `nextScrollId`, which you will use in subsequent queries.
+
+```graphql
 {
-  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10}) {
+  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10 }) {
     nextScrollId
     count
     searchResults {
@@ -232,14 +234,12 @@ For queries that return more than 10k entities we recommend using the [scrollAcr
 }
 ```
 
-This will return a response containing a `nextScrollId` value which must be used in subsequent queries to retrieve more data, i.e:
+#### Subsequent Queries
+Use the `nextScrollId` from the response of the initial query to fetch the next batch of results. Continue this process until the `nextScrollId` returned is null or undefined, indicating that there are no more results to fetch.
 
-```
+```graphql
 {
-  scrollAcrossEntities(input:
-    { types: [DATASET], query: "*", count: 10,
-      scrollId: "eyJzb3J0IjpbMy4wLCJ1cm46bGk6ZGF0YXNldDoodXJuOmxpOmRhdGFQbGF0Zm9ybTpiaWdxdWVyeSxiaWdxdWVyeS1wdWJsaWMtZGF0YS5jb3ZpZDE5X2dlb3RhYl9tb2JpbGl0eV9pbXBhY3QucG9ydF90cmFmZmljLFBST0QpIl0sInBpdElkIjpudWxsLCJleHBpcmF0aW9uVGltZSI6MH0="}
-  ) {
+  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10, scrollId: "your_nextScrollId_here" }) {
     nextScrollId
     count
     searchResults {
@@ -259,7 +259,26 @@ This will return a response containing a `nextScrollId` value which must be used
 }
 ```
 
-In order to complete scrolling through all of the results, continue to request data in batches until the `nextScrollId` returned is null or undefined.
+### Determining the Total Number of Records
+
+To determine the total number of records (datasets) in DataHub, use the `aggregateAcrossEntities` GraphQL query. It returns aggregated entity counts, broken down by the facets you request.
+
+```graphql
+query aggregateAcrossEntities {
+  aggregateAcrossEntities(input: { types: [DATASET], query: "*", facets: ["_entityType"] }) {
+    facets {
+      field
+      displayName
+      aggregations {
+        value
+        count
+      }
+    }
+  }
+}
+```
+
+This query aggregates across the `DATASET` entity type; the `count` returned under the `_entityType` facet is the total number of datasets.
 
 ### DataHub Blog
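
The loop these docs describe (re-issue the query with the returned `nextScrollId` until it comes back null) is straightforward to script against the GraphQL endpoint. Below is a minimal sketch, not part of the patch above: it assumes the GMS GraphQL endpoint is reachable at `http://localhost:8080/api/graphql` and that a personal access token is available in a `DATAHUB_TOKEN` environment variable, and it passes `scrollId` as a nullable variable so the first request starts a fresh scroll.

```python
# Minimal sketch of the scroll loop described above.
# The endpoint URL and DATAHUB_TOKEN environment variable are assumptions;
# adjust them for your deployment.
import os

import requests

GRAPHQL_URL = "http://localhost:8080/api/graphql"  # assumed GMS endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['DATAHUB_TOKEN']}"}

QUERY = """
query scrollDatasets($scrollId: String) {
  scrollAcrossEntities(
    input: { types: [DATASET], query: "*", count: 10, scrollId: $scrollId }
  ) {
    nextScrollId
    searchResults {
      entity {
        urn
      }
    }
  }
}
"""

scroll_id = None  # no scrollId on the first request starts a fresh scroll
urns = []
while True:
    response = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY, "variables": {"scrollId": scroll_id}},
        headers=HEADERS,
    )
    response.raise_for_status()
    result = response.json()["data"]["scrollAcrossEntities"]
    urns.extend(hit["entity"]["urn"] for hit in result["searchResults"])
    scroll_id = result["nextScrollId"]
    if not scroll_id:  # a null/undefined nextScrollId means no more batches
        break

print(f"Fetched {len(urns)} dataset URNs")
```

The same request shape works for the `aggregateAcrossEntities` query shown in the patch; only the query text and the fields read from the `data` payload change.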