Fix HTTP 500 in filestore getMetadata racing with concurrent delete #4367

epugh merged 4 commits into apache:main
Conversation
`ClusterFileStore.getMetadata` could return a 500 when a file was deleted between `fileStore.list(...)` returning the entry and `convertToResponse` calling `FileDetails.size()`/`getTimeStamp()` on it. The underlying `Files.size()`/`Files.getLastModifiedTime()` would throw `NoSuchFileException`, which `DistribFileStore` wrapped as `SolrException(SERVER_ERROR)`.

Reproducible via beasting `TestDistribFileStore.testFileStoreManagement` (seen ~1/20 with seed 30453DE51CF44AB7, US-ASCII, security manager).

Fix:
- `DistribFileStore.FileInfo`: catch `NoSuchFileException` in `size()` (returns -1) and `getTimeStamp()` (returns null) so callers can distinguish a concurrent-delete race from a real I/O failure.
- `ClusterFileStore.convertToResponse`: returns null when `size() < 0`, signalling the entry vanished mid-call.
- `ClusterFileStore.getMetadata`: filters null entries from the DIRECTORY listing; the FILE/METADATA branch already maps an empty result to null, which now also covers the race case so the response matches NOFILE semantics (`singletonMap(path, null)`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
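A minimal, self-contained sketch of the sentinel pattern described above (the `sizeOrSentinel` helper name is hypothetical, and this is not the actual Solr code): a delete that lands between the directory listing and the attribute read surfaces as -1 rather than a thrown server error, so the caller can map it to "not found".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class SentinelSizeDemo {
  // Hypothetical helper mirroring the PR's pattern: a concurrent delete
  // between listing and attribute read becomes a sentinel, while any
  // other I/O failure still propagates as an error.
  static long sizeOrSentinel(Path p) {
    try {
      return Files.size(p);
    } catch (NoSuchFileException e) {
      return -1; // entry vanished mid-call; caller maps this to "not found"
    } catch (IOException e) {
      throw new RuntimeException("Real I/O failure for: " + p, e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path f = Files.createTempFile("filestore", ".bin");
    Files.write(f, new byte[] {1, 2, 3});
    System.out.println(sizeOrSentinel(f)); // 3
    Files.delete(f); // simulate the concurrent delete
    System.out.println(sizeOrSentinel(f)); // -1
  }
}
```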
Pull request overview
Addresses a filestore concurrency race where GET /cluster/filestore/metadata... could intermittently return HTTP 500 if a file is deleted between directory listing and attribute reads, aiming to make the response match “not found” semantics instead of erroring.
Changes:
- Catch `NoSuchFileException` in `DistribFileStore.FileInfo` attribute accessors (`size()`, `getTimeStamp()`), returning sentinel values for concurrent-delete races.
- Update `ClusterFileStore.getMetadata`/`convertToResponse` to treat vanished entries as absent (including filtering nulls from directory listings).
- Add an unreleased changelog entry describing the behavioral fix.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| solr/core/src/java/org/apache/solr/filestore/DistribFileStore.java | Converts concurrent-delete NoSuchFileException into sentinel return values for size/timestamp reads. |
| solr/core/src/java/org/apache/solr/filestore/ClusterFileStore.java | Treats sentinel size as “vanished” and filters null entries from directory listings. |
| changelog/unreleased/PR#4367-filestore-getmetadata-delete-race.yml | Documents the fix for release notes. |
Comments suppressed due to low confidence (2)
solr/core/src/java/org/apache/solr/filestore/ClusterFileStore.java:268
`convertToResponse` calls `details.getMetaData()` twice. Besides doing duplicate I/O, this can still surface the same concurrent-delete race as an unhandled `RuntimeException` (e.g., `DistribFileStore` wraps metadata read failures in `RuntimeException`), causing the HTTP 500 this PR is trying to eliminate. Fetch the metadata once into a local variable and handle the case where metadata can't be read because the entry disappeared (treat it as null/vanished).

```java
entryMetadata.timestamp = details.getTimeStamp();
if (details.getMetaData() != null) {
  details.getMetaData().toMap(entryMetadata.unknownProperties());
}
```
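A runnable sketch of the suggested single-read shape, using hypothetical stand-in types (`FileDetails`, `toResponse`) rather than the actual Solr classes: the metadata is fetched once into a local, and a wrapped read failure is treated as a vanished entry.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class SingleMetadataReadDemo {
  // Hypothetical stand-in for the Solr FileDetails type in the snippet above.
  interface FileDetails {
    Map<String, Object> getMetaData(); // may throw if the entry vanished
    Date getTimeStamp();
  }

  // Fetch the metadata exactly once; a failure means the entry vanished,
  // so return null and let the caller apply "not found" semantics.
  static Map<String, Object> toResponse(FileDetails details) {
    Map<String, Object> meta;
    try {
      meta = details.getMetaData();
    } catch (RuntimeException e) {
      return null; // entry deleted between listing and metadata read
    }
    Map<String, Object> entry = new HashMap<>();
    entry.put("timestamp", details.getTimeStamp());
    if (meta != null) {
      entry.putAll(meta);
    }
    return entry;
  }

  public static void main(String[] args) {
    FileDetails vanished = new FileDetails() {
      public Map<String, Object> getMetaData() {
        throw new RuntimeException("wrapped NoSuchFileException");
      }
      public Date getTimeStamp() { return null; }
    };
    System.out.println(toResponse(vanished)); // null
  }
}
```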
solr/core/src/java/org/apache/solr/filestore/DistribFileStore.java:316
`FileDetails.size()` and `getTimeStamp()` now use sentinel values (-1/null) to indicate a concurrent-delete race. Since `FileStore.FileDetails` doesn't document these return values, it's easy for future callers to misinterpret them as valid data. Consider documenting this contract (ideally on `FileStore.FileDetails`) or at least adding a clarifying comment here indicating the sentinel values are part of an intentional API convention.
```java
@Override
public Date getTimeStamp() {
  try {
    return new Date(Files.getLastModifiedTime(realPath()).toMillis());
  } catch (NoSuchFileException e) {
    // File was deleted concurrently between listing and reading its attributes.
    return null;
  } catch (IOException e) {
    throw new SolrException(
        SERVER_ERROR, "Failed to retrieve the last modified time for: " + realPath(), e);
  }
}

@Override
public boolean isDir() {
  return type == FileType.DIRECTORY;
}

@Override
public long size() {
  try {
    return Files.size(realPath());
  } catch (NoSuchFileException e) {
    // File was deleted concurrently between listing and reading its attributes.
    return -1;
  } catch (IOException e) {
    throw new SolrException(
        SERVER_ERROR, "Failed to retrieve the file size for: " + realPath(), e);
  }
}
```
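On the caller side, the PR describes `ClusterFileStore` mapping a sentinel size to null and filtering nulls out of the directory listing. A self-contained sketch of that filtering step, with hypothetical names (`Entry`, `convert`) standing in for the actual Solr types:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterVanishedDemo {
  // Hypothetical stand-in for a listed filestore entry.
  record Entry(String path, long size) {}

  // Mirrors convertToResponse's described behavior: a negative size means
  // the entry vanished mid-call, so convert it to null.
  static Entry convert(Entry e) {
    return e.size() < 0 ? null : e;
  }

  public static void main(String[] args) {
    // One live entry and one that was deleted between listing and stat.
    List<Entry> listed = Arrays.asList(
        new Entry("a.txt", 10), new Entry("b.txt", -1));
    List<Entry> out = new ArrayList<>();
    for (Entry e : listed) {
      Entry converted = convert(e);
      if (converted != null) out.add(converted); // drop vanished entries
    }
    System.out.println(out.size()); // 1
  }
}
```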
```yaml
type: fixed # added, changed, fixed, deprecated, removed, dependency_update, security, other
authors:
  - name: Eric Pugh
links: []
```
Changelog entries in this directory typically include a link (PR and/or JIRA) under `links:` rather than an empty list. Please add an appropriate link entry so the change is traceable from the release notes.
```diff
-links: []
+links:
+  - https://github.com/apache/solr/pull/4367
```
…java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I discovered this issue while working on #4249 (SOLR-16341: fix blank file zip handling).