I followed the commands in the README to train on the AlfWorld dataset, but the test results were poor.
Here are the training accuracy and training logs:

alfworld_train_true.txt
These are the insights extracted from the training logs:
alfworld_insight.txt
And here are the test accuracy and test logs:

alfworld_eval_true.txt
All of the results are far from those reported in the paper, and I don't know where the problem lies. Could it be because I replaced the model with GPT-4o?