File indexing overwrite prompts and fix index key handling error #242

Merged
adperezm merged 2 commits into ExtremeFLOW:main from nick-morse:enhancement/file_indexing_handling
Feb 26, 2026

Conversation

@nick-morse
Contributor

Cleans up the handling of overwrite prompts and fixes a statistics_start_time type error in index key handling:

  • Adds an optional overwrite argument to merge_index_files
  • The optional overwrite argument now behaves correctly when set to False
  • The overwrite prompt now appears only on rank 0, and the response is broadcast to all ranks
  • Overwrites only if the user answers 'y' or 'yes'

Copilot AI review requested due to automatic review settings February 26, 2026 12:52

Copilot AI left a comment

Pull request overview

This PR refines file-index overwrite behavior in MPI contexts and hardens index merging against non-entry metadata keys (e.g., statistics_start_time).

Changes:

  • Introduces check_overwrite() to prompt only on MPI rank 0 and broadcast the decision to all ranks.
  • Adds an overwrite optional argument to merge_index_files() and updates overwrite/skip logic to require explicit confirmation.
  • Prevents type errors in merge_index_files() by skipping non-dict top-level index keys during consolidation.
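
The rank-0 prompt and broadcast described above could look roughly like the following. This is a minimal sketch, not the code merged in the PR: the name `check_overwrite_sketch` and the injectable `ask` parameter are assumptions added for illustration, and `comm` is assumed to be an mpi4py-style communicator providing `Get_rank()` and `bcast()`.

```python
def check_overwrite_sketch(comm, fname, ask=input):
    """Ask on rank 0 whether to overwrite fname, then share the answer.

    `ask` is a hypothetical hook (defaults to input) so the prompt can be
    replaced in tests; it is not part of the PR.
    """
    overwrite = None
    if comm.Get_rank() == 0:
        # Only rank 0 talks to the user; the other ranks never read stdin.
        ans = ask(f"File {fname} exists. Overwrite? [yes/no] ")
        overwrite = ans in ["y", "yes"]
    # Every rank calls bcast; non-root ranks receive rank 0's decision.
    return comm.bcast(overwrite, root=0)
```

Because every rank returns the same boolean, all ranks take the same branch afterwards and none is left waiting on a prompt it cannot see.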

Comments suppressed due to low confidence (4)

pysemtools/postprocessing/file_indexing.py:39

  • check_overwrite treats the response as case- and whitespace-sensitive (ans in ['y','yes']). Inputs like Yes, Y, or yes followed by a space will evaluate to False and skip overwriting unexpectedly (input() strips the trailing newline but no other whitespace). Consider normalizing with ans.strip().lower() (and optionally accepting common variants) before deciding.
    if comm.Get_rank() == 0:
        logger.write("warning", f"File {fname} exists. Overwrite?")
        ans = input("input: [yes/no] ")
        overwrite = ans in ['y', 'yes']
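
The normalization suggested in this comment can be sketched as a small helper. The name `is_affirmative` is hypothetical and does not exist in pysemtools; the sketch only shows the `strip().lower()` step applied before the membership test.

```python
def is_affirmative(ans):
    """Return True only for an explicit yes, ignoring case and whitespace."""
    # Normalize first so "Yes", " Y " and "YES" are all accepted.
    return ans.strip().lower() in ("y", "yes")
```

Anything other than an explicit y or yes, including an empty response, still counts as "do not overwrite", which matches the PR's stated intent of requiring explicit confirmation.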

pysemtools/postprocessing/file_indexing.py:355

  • In index_files_from_folder, when overwrite is None and multiple existing index files are detected, the code will prompt once per file type because the return value of check_overwrite(...) isn’t stored back into overwrite. If the intended behavior is a single prompt whose answer applies to all collisions (as before), assign the result to overwrite the first time and reuse it.
        if file_exists:

            if overwrite is False or \
               (overwrite is None and not check_overwrite(comm, index_fname)): 
                remove.append(ftype)
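
If a single answer were meant to cover all collisions, the first response could be stored and reused, as in this sketch. It is an illustration under stated assumptions, not the repository code: `select_types_to_skip` and its `ask` callback (a stand-in for `check_overwrite`) are hypothetical names, and `existing` is assumed to map each file type to an index file that already exists.

```python
def select_types_to_skip(existing, overwrite=None, ask=None):
    """Return the file types whose existing index file should be kept.

    existing:  dict mapping file type -> index filename already on disk.
    overwrite: True, False, or None (None means "ask the user").
    ask:       hypothetical callable taking a filename and returning a bool.
    """
    remove = []
    for ftype, index_fname in existing.items():
        if overwrite is None:
            # First collision: ask once and remember the answer.
            overwrite = ask(index_fname)
        if not overwrite:
            remove.append(ftype)
    return remove
```

The trade-off is that the user sees only the first filename, so one answer silently applies to files that were never named in a prompt.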

pysemtools/postprocessing/file_indexing.py:526

  • merge_index_files only carries forward simulation_start_time; index files produced for statistics use statistics_start_time (a float) and will now be silently dropped due to the isinstance(..., dict) guard. This can leave consolidated_index['simulation_start_time'] at the 1e12 sentinel and lose the start-time metadata. Consider explicitly handling statistics_start_time (and/or setting the consolidated start time key based on which one is present).
    consolidated_index = {}
    consolidated_index["simulation_start_time"] = 1e12
    consolidated_key = 0

    for index_file in index_list:

        try:
            with open(index_file, "r") as infile:
                index = json.load(infile)
        except FileNotFoundError:
            logger.write(
                "warning",
                f"Expected file {index_file} but it does not exist. skipping it",
            )
            continue

        logger.write("info", f"Reading index file: {index_file}")

        for key in index.keys():

            if key == "simulation_start_time":
                if index[key] < consolidated_index["simulation_start_time"]:
                    consolidated_index["simulation_start_time"] = index[key]
                continue

            elif isinstance(index[key], dict):
                if index[key]["path"] != "file_not_in_folder":
                    consolidated_index[consolidated_key] = index[key]
                    consolidated_key += 1
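
One way to keep both start-time keys is sketched below. Whether the consolidated index should record statistics_start_time at all is not settled in this PR, so treat this as one possible design. The function name `consolidate` and the `START_TIME_KEYS` tuple are assumptions, the function works on already-loaded dicts rather than files, and it uses `.get("path")` instead of direct indexing.

```python
# Metadata keys treated as start times (assumed set, taken from this PR).
START_TIME_KEYS = ("simulation_start_time", "statistics_start_time")

def consolidate(indices):
    """Merge already-loaded index dicts, keeping the earliest start time per key."""
    consolidated = {}
    entry_key = 0
    for index in indices:
        for key, value in index.items():
            if key in START_TIME_KEYS:
                # Keep the minimum seen so far; no 1e12 sentinel is needed.
                consolidated[key] = min(value, consolidated.get(key, value))
            elif isinstance(value, dict):
                # File entries are dicts; skip placeholders for missing files.
                if value.get("path") != "file_not_in_folder":
                    consolidated[entry_key] = value
                    entry_key += 1
            # Any other non-dict metadata is ignored, as in the PR.
    return consolidated
```

With this approach a start-time key appears in the output only if at least one input index contained it, so a merge of statistics indices never carries a sentinel simulation_start_time.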

pysemtools/postprocessing/file_indexing.py:573

  • merge_index_files writes output_fname from every MPI rank (with open(output_fname, 'w') ... is not rank-guarded). On multi-rank runs this can corrupt the JSON output or fail intermittently. Only rank 0 should write the file, then comm.Barrier() to synchronize.
    logger.write("info", f"Writing consolidated index file: {output_fname}")

    write_index = True
    file_exists = os.path.exists(output_fname)
    if file_exists:

        if overwrite is False or \
           (overwrite is None and not check_overwrite(comm, output_fname)): 
            write_index = False
            logger.write("warning", f"Skipping writing index {output_fname}")

    if write_index:
        with open(output_fname, "w") as outfile:
            outfile.write(json.dumps(consolidated_index, indent=4))
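
The rank-guarded write suggested in this comment might be sketched as follows. The helper name `write_index_on_root` is hypothetical, and `comm` is assumed to be an mpi4py-style communicator providing `Get_rank()` and `Barrier()`.

```python
import json

def write_index_on_root(comm, output_fname, consolidated_index):
    """Write the consolidated index from rank 0 only, then synchronize."""
    if comm.Get_rank() == 0:
        # A single writer avoids interleaved or truncated JSON output.
        with open(output_fname, "w") as outfile:
            json.dump(consolidated_index, outfile, indent=4)
    # All ranks wait here so none continues before the file is written.
    comm.Barrier()
```

The barrier matters only if other ranks read the file afterwards; it costs one synchronization per merge.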


@nick-morse
Contributor Author

One question is whether the user should select overwrite for each file or only once for all subsequent files

@nick-morse nick-morse marked this pull request as draft February 26, 2026 14:35
@adperezm
Collaborator

> One question is whether the user should select overwrite for each file or only once for all subsequent files

I guess it is a bit annoying to need to say yes individually. I think at least I have been annoyed before haha. But I have no strong opinions. Mark it as ready when you want and we merge it.

@nick-morse nick-morse marked this pull request as ready for review February 26, 2026 15:28
@nick-morse
Contributor Author

> I guess it is a bit annoying to need to say yes individually. I think at least I have been annoyed before haha. But I have no strong opinions. Mark it as ready when you want and we merge it.

I guess it's best to check for each file because the filename is presented to the user. If one wants to avoid the prompts, they can supply overwrite in the function calls.

One thing I don't know is whether statistics_start_time should be recorded in consolidated_index. It looks like simulation_start_time is recorded but doesn't exist for stats files, and vice versa for statistics_start_time, which spurred the error fixed here.

@adperezm adperezm merged commit 77c8837 into ExtremeFLOW:main Feb 26, 2026
14 checks passed