RunLLM · runllm-pr-agent · Apr 18, 2025
diff --git a/docs/api/datahub-apis.md b/docs/api/datahub-apis.md
@@ -33,6 +33,8 @@ We recommend using the GraphQL API if you're getting started with DataHub since
 
 - Search for datasets with conditions
 - Update a certain field of a dataset
+- Iterate over large datasets using `scrollAcrossEntities`
+- Determine the total number of records using `aggregateAcrossEntities`
 
 Learn more about the GraphQL API:
 

diff --git a/docs/api/graphql/getting-started.md b/docs/api/graphql/getting-started.md
@@ -67,9 +67,83 @@ The search term can be a simple string, or it can be a more complex query using
 
 :::note
 Note that by default Elasticsearch only allows pagination through 10,000 entities via the search API.
-If you need to paginate through more, you can change the default value for the `index.max_result_window` setting in Elasticsearch, or using the scroll API to read from the index directly.
+If you need to paginate through more, you can change the default value for the `index.max_result_window` setting in Elasticsearch, or use the `scrollAcrossEntities` GraphQL API to iterate over large datasets.
 :::
 
+### Iterating Over Large Datasets
+
+To page through result sets larger than 10,000 entities, use the `scrollAcrossEntities` GraphQL API, which returns results in batches.
+
+**Initial Query**:
+
+```graphql
+{
+  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10 }) {
+    nextScrollId
+    count
+    searchResults {
+      entity {
+        type
+        ... on Dataset {
+          urn
+          type
+          platform {
+            name
+          }
+          name
+        }
+      }
+    }
+  }
+}
+```
+
+**Subsequent Queries**:
+Pass the `nextScrollId` from the previous response as `scrollId` to fetch the next batch of results. Continue until `nextScrollId` is null, which indicates that there are no more results to fetch.
+
+```graphql
+{
+  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10, scrollId: "your_nextScrollId_here" }) {
+    nextScrollId
+    count
+    searchResults {
+      entity {
+        type
+        ... on Dataset {
+          urn
+          type
+          platform {
+            name
+          }
+          name
+        }
+      }
+    }
+  }
+}
+```
+
+### Determining the Total Number of Records
+
+To determine the total number of records, use the `aggregateAcrossEntities` GraphQL query.
+
+```graphql
+query aggregateAcrossEntities {
+  aggregateAcrossEntities(input: { types: [DATASET], query: "*", facets: ["_entityType"] }) {
+    facets {
+      field
+      displayName
+      aggregations {
+        value
+        count
+      }
+    }
+  }
+}
+```
+
+This query aggregates across the `DATASET` entity type and returns one aggregation per matching entity type. The `count` of the aggregation whose `value` is `DATASET` is the total number of datasets.
+
 ## Modifying an Entity: Mutations
 
 :::note

diff --git a/docs/how/search.md b/docs/how/search.md
@@ -205,14 +205,16 @@ query searchEntities {
 }
 ```
 
-### Searching at Scale
+### Iterating Over Large Datasets
 
-For queries that return more than 10k entities we recommend using the [scrollAcrossEntities](https://datahubproject.io/docs/graphql/queries/#scrollacrossentities) GraphQL API: 
+To page through large result sets, use the [scrollAcrossEntities](https://datahubproject.io/docs/graphql/queries/#scrollacrossentities) GraphQL API. This API returns results in batches, bypassing the 10,000-entity pagination limit imposed by Elasticsearch.
 
-```
-# Example query
+#### Initial Query
+Start by making an initial query to `scrollAcrossEntities` to begin scrolling through the datasets. This will return a `nextScrollId` which you will use in subsequent queries.
+
+```graphql
 {
-  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10}) {
+  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10 }) {
     nextScrollId
     count
     searchResults {
@@ -232,14 +234,12 @@ For queries that return more than 10k entities we recommend using the [scrollAcr
 }
 ```
 
-This will return a response containing a `nextScrollId` value which must be used in subsequent queries to retrieve more data, i.e:
+#### Subsequent Queries
+Pass the `nextScrollId` from the previous response as `scrollId` to fetch the next batch of results. Continue until the `nextScrollId` returned is null, which indicates that there are no more results to fetch.
 
-```
+```graphql
 {
-  scrollAcrossEntities(input: 
-    { types: [DATASET], query: "*", count: 10,
-    scrollId: "eyJzb3J0IjpbMy4wLCJ1cm46bGk6ZGF0YXNldDoodXJuOmxpOmRhdGFQbGF0Zm9ybTpiaWdxdWVyeSxiaWdxdWVyeS1wdWJsaWMtZGF0YS5jb3ZpZDE5X2dlb3RhYl9tb2JpbGl0eV9pbXBhY3QucG9ydF90cmFmZmljLFBST0QpIl0sInBpdElkIjpudWxsLCJleHBpcmF0aW9uVGltZSI6MH0="}
-  ) {
+  scrollAcrossEntities(input: { types: [DATASET], query: "*", count: 10, scrollId: "your_nextScrollId_here" }) {
     nextScrollId
     count
     searchResults {
@@ -259,7 +259,26 @@ This will return a response containing a `nextScrollId` value which must be used
 }
 ```
 
-In order to complete scrolling through all of the results, continue to request data in batches until the `nextScrollId` returned is null or undefined.
+### Determining the Total Number of Records
+
+To determine the total number of records (datasets) in DataHub, use the `aggregateAcrossEntities` GraphQL query. This query provides aggregated counts of entities, which can help you understand the total number of datasets available.
+
+```graphql
+query aggregateAcrossEntities {
+  aggregateAcrossEntities(input: { types: [DATASET], query: "*", facets: ["_entityType"] }) {
+    facets {
+      field
+      displayName
+      aggregations {
+        value
+        count
+      }
+    }
+  }
+}
+```
+
+This query aggregates across the `DATASET` entity type and returns one aggregation per matching entity type. The `count` of the aggregation whose `value` is `DATASET` is the total number of datasets.
 
 
 ### DataHub Blog
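
The scroll loop documented in this change could be driven from a client roughly as follows. This is a minimal Python sketch and not part of the diff: `scroll_all` and the injected `execute` callable are hypothetical names, and the `ScrollAcrossEntitiesInput` variable type is an assumption about the GraphQL schema. `execute` stands in for whatever HTTP or SDK call sends a query plus variables to the DataHub GraphQL endpoint and returns the `data` object of the response.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Same fields as the documented query, but parameterised with a variable.
# The input type name is an assumption, not confirmed by the diff.
SCROLL_QUERY = """
query scroll($input: ScrollAcrossEntitiesInput!) {
  scrollAcrossEntities(input: $input) {
    nextScrollId
    count
    searchResults {
      entity {
        type
        ... on Dataset {
          urn
          name
        }
      }
    }
  }
}
"""

# Hypothetical signature: (query, variables) -> the "data" part of the response.
Execute = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(execute: Execute, batch_size: int = 10) -> Iterator[Dict[str, Any]]:
    """Yield every dataset entity, following nextScrollId until it is null."""
    scroll_id: Optional[str] = None
    while True:
        # Enum values are sent as plain strings when passed via JSON variables.
        scroll_input: Dict[str, Any] = {
            "types": ["DATASET"],
            "query": "*",
            "count": batch_size,
        }
        if scroll_id is not None:
            scroll_input["scrollId"] = scroll_id

        page = execute(SCROLL_QUERY, {"input": scroll_input})["scrollAcrossEntities"]
        for result in page["searchResults"]:
            yield result["entity"]

        # A null (or missing) nextScrollId means there are no more batches.
        scroll_id = page.get("nextScrollId")
        if not scroll_id:
            break
```

Injecting `execute` keeps the loop independent of any particular HTTP client or authentication scheme, so the pagination logic can be exercised against a stub before pointing it at a real DataHub instance.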