
fixed AsynchronousXopt.to_json() and dump #343

Open
ndwang wants to merge 2 commits into xopt-org:main from ndwang:main

Conversation

@ndwang

ndwang commented Jul 23, 2025

I put in a fix for #234 and #235.

The error is caused by duplicate indices in AsynchronousXopt.data.

When adding new data, Xopt.add_data() checks whether self.data already exists. If this is the first batch of data, the incoming indices are kept; if data already exists, the new data is reindexed and then concatenated.

Xopt/xopt/base.py, lines 379 to 388 in 7dc1b5c:

# Set internal dataframe.
if self.data is not None:
    new_data = pd.DataFrame(new_data, copy=True)  # copy for reindexing
    new_data.index = np.arange(len(self.data), len(self.data) + len(new_data))
    self.data = pd.concat([self.data, new_data], axis=0)
else:
    if new_data.index.dtype != np.int64:
        new_data.index = new_data.index.astype(np.int64)
    self.data = new_data
self.generator.add_data(new_data)

new_data.index starts from 0 in Xopt; in AsynchronousXopt, however, it starts from 1! This causes AsynchronousXopt.data to end up with two entries at index 1.
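As a minimal, hypothetical standalone repro (plain pandas, not Xopt code), the collision can be reproduced by letting the first batch start at index 1 and then applying the same reindexing that Xopt.add_data() uses for later batches:

import numpy as np
import pandas as pd

# First batch: AsynchronousXopt.prepare_input_data() starts indexing at 1.
first = pd.DataFrame({"x": [0.1]}, index=[1])
data = first  # Xopt.add_data() keeps the index of the first batch

# Second batch: Xopt.add_data() reindexes from len(data) == 1.
second = pd.DataFrame({"x": [0.2, 0.3]})
second.index = np.arange(len(data), len(data) + len(second))  # -> [1, 2]
data = pd.concat([data, second], axis=0)

print(data.index.tolist())   # [1, 1, 2] -- index 1 appears twice
print(data.index.is_unique)  # False, which is what later breaks serialization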

This index shift traces back to AsynchronousXopt.prepare_input_data():

# Reindex input dataframe
input_data.index = np.arange(
    self._ix_last + 1, self._ix_last + 1 + len(input_data)
)

It's not clear to me whether this shift is required to manage futures, so my proposal is simply to force Xopt.add_data() to always reindex the first batch to start from 0.
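For illustration, here is a minimal sketch of that proposal as a standalone helper (hypothetical code, not the actual diff in this PR); the only change relative to the current behaviour is that the first batch is always reindexed from 0:

from typing import Optional

import numpy as np
import pandas as pd

def add_data(existing: Optional[pd.DataFrame], new_data: pd.DataFrame) -> pd.DataFrame:
    new_data = pd.DataFrame(new_data, copy=True)  # copy before reindexing
    if existing is not None:
        # continue numbering after the existing rows
        new_data.index = np.arange(len(existing), len(existing) + len(new_data))
        return pd.concat([existing, new_data], axis=0)
    # first batch: drop whatever index the caller supplied and start at 0
    new_data.index = np.arange(len(new_data))
    return new_data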

@nikitakuklev
Collaborator

As far as I can recall, the indices are intentionally not reset, so as to support step removal.

If I make steps [1,2,3], remove [1], and then try to add a dataframe with [2,3] to a new generator, the expectation would be for that generator to keep the [2,3] indexing.

Is there a reason to not change AsynchronousXopt instead? Setting _ix_last: int = -1 would start counting at 0.
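A quick check of that suggestion (hypothetical standalone snippet), plugging _ix_last = -1 into the reindexing line from prepare_input_data() quoted above:

import numpy as np

_ix_last = -1
n_points = 3  # size of a hypothetical first batch
print(np.arange(_ix_last + 1, _ix_last + 1 + n_points))  # [0 1 2] -- starts at 0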

@ndwang
Author

ndwang commented Jul 29, 2025

Thanks for taking a look at this.

Is there a reason to not change AsynchronousXopt instead? Setting _ix_last: int = -1 would start counting at 0.

This also works. The reason I changed Xopt.add_data() is that it enforces a unique index internally and makes sure that's always the case. But if the rest of the codebase expects it to keep the index of the first batch of data, I can look into changing only AsynchronousXopt.

If I make steps [1,2,3], remove [1], and then try to add dataframe with [2,3] to a new generator, the expectation would be for that generator to keep [2,3] indexing.

The current code does keep the [2,3] indexing, but if you then try to add more data, say steps [1,2,3], Xopt.data becomes [2,3,2,3,4].
Generator.add_data() already doesn't reindex, so Generator.data is [2,3,1,2,3]. But the generator also never serializes its internal dataframe, so the duplicate index is not a problem there.

Xopt/xopt/generator.py, lines 114 to 124 in 7dc1b5c:

def add_data(self, new_data: pd.DataFrame):
    """
    update dataframe with results from new evaluations.
    This is intended for generators that maintain their own data.
    """
    if self.data is not None:
        self.data = pd.concat([self.data, new_data], axis=0)
    else:
        self.data = new_data
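A minimal standalone repro of the scenario above (hypothetical code, outside Xopt), showing both behaviours side by side:

import numpy as np
import pandas as pd

# Existing data keeps the [2, 3] indexing; three more steps are then added.
existing = pd.DataFrame({"x": [0.2, 0.3]}, index=[2, 3])
new = pd.DataFrame({"x": [0.4, 0.5, 0.6]}, index=[1, 2, 3])

# Xopt.add_data() path: reindex from len(existing) == 2 -> [2, 3, 4]
xopt_new = new.copy()
xopt_new.index = np.arange(len(existing), len(existing) + len(xopt_new))
xopt_data = pd.concat([existing, xopt_new], axis=0)
print(xopt_data.index.tolist())  # [2, 3, 2, 3, 4]

# Generator.add_data() path: plain concat, no reindexing
gen_data = pd.concat([existing, new], axis=0)
print(gen_data.index.tolist())   # [2, 3, 1, 2, 3]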

@roussel-ryan
Collaborator

@nikitakuklev what do you think about this answer?

@nikitakuklev
Collaborator

My concern with reindexing is that external tools/scripts get confused when hand-manipulating data if the indices of specific points change. X.data[10] should give the same answer for the whole optimization run. Moreover, X.generator.data[10] should always agree with X.data[10] - partial reindexing is even worse.

@ndwang makes a good point that there is no enforcement of this - indexing can get duplicated and messed up. This doesn't break BO, but might cause issues for MOGA - seems like a bug.

I'd propose to:

  1. change async xopt to _ix_last: int = -1 for logical consistency
  2. change add_data to use the last index of the existing data for indexing new points: last_idx+1, last_idx+1+len(new_data) instead of len(self.data), len(self.data) + len(new_data)
  3. raise an Exception if there are any duplicate indices in data (points 2 and 3 are sketched below)
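A minimal sketch of points 2 and 3 as a standalone helper (hypothetical code, not a final Xopt patch):

from typing import Optional

import numpy as np
import pandas as pd

def add_data(existing: Optional[pd.DataFrame], new_data: pd.DataFrame) -> pd.DataFrame:
    new_data = pd.DataFrame(new_data, copy=True)  # copy before reindexing
    if existing is not None:
        # continue numbering from the last existing index, not from len(existing)
        last_idx = existing.index.max()
        new_data.index = np.arange(last_idx + 1, last_idx + 1 + len(new_data))
        combined = pd.concat([existing, new_data], axis=0)
    else:
        combined = new_data
    if not combined.index.is_unique:
        raise ValueError("duplicate indices in data")
    return combined

With existing data indexed [2,3], adding three new points would then yield [2,3,4,5,6] instead of [2,3,2,3,4].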

@nikitakuklev
Collaborator

@ndwang any update on your end/thoughts about proposed fixes?

