Problem Description
There is currently no efficient method to list all document names in a very large collection when it's required to also include "phantom" documents (i.e., documents that do not actually exist but do have subcollections).
The goal is to perform a full scan of a collection's document IDs, similar to a keys-only query, but with the ability to find phantom documents, and to do so in parallel to make it practical for large collections (e.g., 500k to 1M+ documents).
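To make the term "phantom document" concrete, here is a minimal sketch that derives phantom IDs from a flat list of stored document paths. The function name `findPhantomIds` and the in-memory path list are purely illustrative assumptions; nothing here calls Firestore, and a real scan would not have such a list available up front.

```typescript
// Given the full paths of documents that are actually stored, return the IDs
// in `collectionPath` that only appear as a parent of deeper documents
// (i.e., "phantom" documents). Illustrative helper, not a Firestore API.
function findPhantomIds(collectionPath: string, existingDocPaths: string[]): string[] {
  const prefix = collectionPath + "/";
  const existing: { [id: string]: boolean } = {};
  const hasDescendants: { [id: string]: boolean } = {};

  existingDocPaths.forEach((path) => {
    if (path.indexOf(prefix) !== 0) return; // path is outside this collection
    const rest = path.slice(prefix.length).split("/");
    const id = rest[0];
    if (rest.length === 1) {
      existing[id] = true; // the document itself is stored
    } else {
      hasDescendants[id] = true; // something is stored beneath this ID
    }
  });

  return Object.keys(hasDescendants)
    .filter((id) => !existing[id])
    .sort();
}

const examplePaths = [
  "users/alice",
  "users/alice/posts/p1",
  "users/bob/posts/p2", // "users/bob" was never written, so it is a phantom
];
console.log(findPhantomIds("users", examplePaths)); // prints [ 'bob' ]
```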
Currently, there are two primary APIs for this, but they are mutually exclusive in their capabilities:
v1.listDocuments with showMissing: true:
- What it does well: It correctly returns all document names, including phantoms.
- The limitation: It is an inherently sequential, paginated API. It does not support range queries (e.g., __name__ > 'a' AND __name__ < 'b') or partition cursors. This forces a single, long-running sequential scan, which is unacceptably slow for large collections. A scan of ~420,000 documents can take over two hours.
v1.partitionQuery + v1.runQuery:
- What it does well: This is the recommended and highly effective method for parallelizing reads across a large collection. It is very fast.
- The limitation: The runQuery RPC does not support a showMissing flag. As a result, it only returns existing documents and silently skips over phantom documents, making it unsuitable for use cases that require a complete list of all document paths.
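The sequential nature of a page-token scan can be sketched as follows. The `Page` shape and the `FetchPage` callback are hypothetical stand-ins; in real use the callback would presumably wrap the ListDocuments RPC with showMissing enabled, but no client library is called here.

```typescript
// One page of results, loosely modeled on a paginated list response.
interface Page {
  names: string[];
  nextPageToken?: string; // absent or empty on the last page
}

// Hypothetical page fetcher supplied by the caller.
type FetchPage = (pageToken?: string) => Promise<Page>;

async function listAllNames(fetchPage: FetchPage): Promise<string[]> {
  const all: string[] = [];
  let token: string | undefined = undefined;
  do {
    // The token for page N+1 is only known once page N has returned,
    // so these requests cannot be issued concurrently.
    const page: Page = await fetchPage(token);
    all.push(...page.names);
    token = page.nextPageToken || undefined;
  } while (token);
  return all;
}
```

Because every request depends on the previous response, total time grows linearly with the number of pages, no matter how many workers are available.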
This leaves users with a difficult choice: either a correct but impractically slow scan, or a fast but incomplete one.
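For contrast, the parallel pattern that partition cursors make possible can be sketched like this. `KeyRange`, `cursorsToRanges`, `scanInParallel`, and the `ScanRange` callback are hypothetical names; the sketch assumes the split points arrive already in key order, and the callback stands in for whatever range-bounded read is available (today that would be a runQuery call, which skips phantoms).

```typescript
// A half-open key range [start, end); null means "unbounded" on that side.
interface KeyRange {
  start: string | null;
  end: string | null;
}

// N split points (assumed to be in key order) define N + 1 contiguous ranges.
function cursorsToRanges(splitPoints: string[]): KeyRange[] {
  const ranges: KeyRange[] = [];
  for (let i = 0; i <= splitPoints.length; i++) {
    ranges.push({
      start: i === 0 ? null : splitPoints[i - 1],
      end: i === splitPoints.length ? null : splitPoints[i],
    });
  }
  return ranges;
}

// Hypothetical range reader supplied by the caller.
type ScanRange = (range: KeyRange) => Promise<string[]>;

// Run the range scans with at most `concurrency` requests in flight,
// returning names in range order.
async function scanInParallel(
  ranges: KeyRange[],
  scanRange: ScanRange,
  concurrency: number
): Promise<string[]> {
  const results: string[][] = new Array(ranges.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < ranges.length) {
      const i = next++; // claim the next unprocessed range
      results[i] = await scanRange(ranges[i]);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, ranges.length));
  const workers: Promise<void>[] = [];
  for (let w = 0; w < workerCount; w++) {
    workers.push(worker());
  }
  await Promise.all(workers);

  const out: string[] = [];
  results.forEach((names) => out.push(...names));
  return out;
}
```

Unlike page tokens, each range is independent of the others, so the scans can run concurrently and a failed range can be retried on its own.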
Use Case & Business Impact
This limitation is a significant bottleneck for critical back-office operations, data migrations, and large-scale data integrity checks. For example, when migrating data, it's essential to have a complete map of the existing structure, including documents that serve as parents for subcollections.
The inability to parallelize this task means that scripts which should take minutes can take many hours, making them difficult to run, monitor, and recover from in a production environment. This increases operational risk and engineering cost.
Proposed Solutions
To address this, we propose adding capabilities to the Firestore API that would bridge this gap. Any of the following solutions would be a massive improvement:
- Add showMissing: true support to the runQuery RPC.
  - This is perhaps the most direct solution. If runQuery supported showMissing, we could continue using the existing partitionQuery API to generate cursors and then execute parallel, keys-only queries that also find phantom documents. This seems like a natural extension of the existing parallel-scan pattern.
- Enhance listDocuments to support partitioning.
  - Add support for startAt and endAt cursors (like those from partitionQuery) to the IListDocumentsRequest. This would allow us to partition the keyspace and then make parallel listDocuments calls, each with showMissing: true.
  - Alternative: Allow where filters on the __name__ property within a listDocuments request. This would enable manual partitioning of the keyspace (e.g., by character ranges: a*, b*, etc.) and allow for parallel execution. The client library documentation for listDocuments states that showMissing may not be combined with where or orderBy, so this restriction would need to be lifted to allow for manual range scans. (From docs: "Requests with show_missing may not specify where or order_by".)
- Introduce a new, dedicated API for parallel keyspace enumeration.
  - Create a new RPC specifically designed for this task. It could be a "partitionable list" or a "keyspace scan" API that takes a parent path and returns a stream of all document names (existing and missing) within that path, with built-in support for parallel execution. This would provide a purpose-built tool for a common and important administrative task.
- Enhance IListenRequest to notify on phantom document changes.
  - Extend the listen RPC to send notifications when phantom documents are implicitly created (because a subcollection document is added) or removed (because their last subcollection document is removed). This would allow for real-time tracking of the complete keyspace, which is crucial for maintaining live caches or indexes of a collection's structure.
- Introduce a new API for traversing and watching parameterized paths.
  - Provide a new API that accepts a fully parameterized path pattern (e.g., users/{userId}/posts/{postId}/comments/{commentId}). This API would stream all matching documents, including phantom documents along the path, and return the extracted, typed parameters for each document (e.g., { userId: 'user-123', postId: 'post-abc', ... }). It should also support watching for new documents that match the pattern. This is conceptually similar to how Firebase Functions v2 triggers can be defined with wildcards (e.g., onDocumentWritten("users/{userId}/posts/{postId}")), which has proven to be a very powerful and intuitive pattern. This would massively simplify complex data traversal and synchronization logic.
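As a rough illustration of the parameter extraction described in the last proposal, here is a client-side sketch. `matchPath` is a hypothetical helper, not an existing Firestore or Firebase Functions API; it only shows what "extracted parameters" could look like for a single document path.

```typescript
// Match a document path against a pattern such as "users/{userId}/posts/{postId}".
// Returns the extracted parameters, or null if the path does not match.
function matchPath(pattern: string, path: string): { [name: string]: string } | null {
  const patternSegs = pattern.split("/");
  const pathSegs = path.split("/");
  if (patternSegs.length !== pathSegs.length) return null;

  const params: { [name: string]: string } = {};
  for (let i = 0; i < patternSegs.length; i++) {
    const seg = pathSegs[i];
    if (seg === "") return null; // empty segments are not valid IDs
    const wildcard = /^\{(\w+)\}$/.exec(patternSegs[i]);
    if (wildcard) {
      params[wildcard[1]] = seg; // wildcard segment: capture the ID
    } else if (patternSegs[i] !== seg) {
      return null; // literal segment must match exactly
    }
  }
  return params;
}

console.log(matchPath("users/{userId}/posts/{postId}", "users/user-123/posts/post-abc"));
// prints { userId: 'user-123', postId: 'post-abc' }
```

A server-side version would additionally need to enumerate the matching paths, including phantom parents, which is exactly the capability requested above.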
For us, an efficient, parallel, and complete scan of a collection's keyspace would be a very useful tool for managing Firestore data at scale. We understand this requires changes to the backend and would be grateful if you could consider this proposal and forward it to the appropriate team.