File indexing overwrite prompts and fix index key handling error#242
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refines file-index overwrite behavior in MPI contexts and hardens index merging against non-entry metadata keys (e.g., statistics_start_time).
Changes:
- Introduces
check_overwrite()to prompt only on MPI rank 0 and broadcast the decision to all ranks. - Adds an
overwriteoptional argument tomerge_index_files()and updates overwrite/skip logic to require explicit confirmation. - Prevents type errors in
merge_index_files()by skipping non-dict top-level index keys during consolidation.
Comments suppressed due to low confidence (4)
pysemtools/postprocessing/file_indexing.py:39
check_overwritetreats the response as case/whitespace-sensitive (ans in ['y','yes']). Inputs likeYes,Y, oryes\nwill evaluate to False and skip overwriting unexpectedly. Consider normalizing withans.strip().lower()(and optionally accepting common variants) before deciding.
if comm.Get_rank() == 0:
logger.write("warning", f"File {fname} exists. Overwrite?")
ans = input("input: [yes/no] ")
overwrite = ans in ['y', 'yes']
pysemtools/postprocessing/file_indexing.py:355
- In
index_files_from_folder, whenoverwriteis None and multiple existing index files are detected, the code will prompt once per file type because the return value ofcheck_overwrite(...)isn’t stored back intooverwrite. If the intended behavior is a single prompt whose answer applies to all collisions (as before), assign the result tooverwritethe first time and reuse it.
if file_exists:
if overwrite is False or \
(overwrite is None and not check_overwrite(comm, index_fname)):
remove.append(ftype)
pysemtools/postprocessing/file_indexing.py:526
merge_index_filesonly carries forwardsimulation_start_time; index files produced for statistics usestatistics_start_time(a float) and will now be silently dropped due to theisinstance(..., dict)guard. This can leaveconsolidated_index['simulation_start_time']at the 1e12 sentinel and lose the start-time metadata. Consider explicitly handlingstatistics_start_time(and/or setting the consolidated start time key based on which one is present).
consolidated_index = {}
consolidated_index["simulation_start_time"] = 1e12
consolidated_key = 0
for index_file in index_list:
try:
with open(index_file, "r") as infile:
index = json.load(infile)
except FileNotFoundError:
logger.write(
"warning",
f"Expected file {index_file} but it does not exist. skipping it",
)
continue
logger.write("info", f"Reading index file: {index_file}")
for key in index.keys():
if key == "simulation_start_time":
if index[key] < consolidated_index["simulation_start_time"]:
consolidated_index["simulation_start_time"] = index[key]
continue
elif isinstance(index[key], dict):
if index[key]["path"] != "file_not_in_folder":
consolidated_index[consolidated_key] = index[key]
consolidated_key += 1
pysemtools/postprocessing/file_indexing.py:573
merge_index_fileswritesoutput_fnamefrom every MPI rank (with open(output_fname, 'w') ...is not rank-guarded). On multi-rank runs this can corrupt the JSON output or fail intermittently. Only rank 0 should write the file, thencomm.Barrier()to synchronize.
logger.write("info", f"Writing consolidated index file: {output_fname}")
write_index = True
file_exists = os.path.exists(output_fname)
if file_exists:
if overwrite is False or \
(overwrite is None and not check_overwrite(comm, output_fname)):
write_index = False
logger.write("warning", f"Skipping writing index {output_fname}")
if write_index:
with open(output_fname, "w") as outfile:
outfile.write(json.dumps(consolidated_index, indent=4))
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
One question is whether the user should select overwrite for each file or only once for all subsequent files |
I guess it is a bit annoying to need to say yes individually. I think at least I have been annoyed before haha. But I have no strong opinions. Mark it as ready when you want and we merge it. |
I guess it's best to check for each file because the filename is presented to the user. If one wants to avoid the prompts then they can supply One thing I don't know is whether the |
Cleanup of the handling of overwrite prompts and fix
statistics_start_timetype error in index key handling:merge_index_filesFalse