I cannot reproduce the results in the paper.

I followed the commands in the README to train on the AlfWorld dataset, but the test results were poor.

Here are the training accuracy and training logs:

![Image](https://github.com/user-attachments/assets/2efc5ece-cf1c-4da6-8ba1-ecff303a424b)
[alfworld_train_true.txt](https://github.com/user-attachments/files/19797427/alfworld_train_true.txt)

These are the insights extracted from the training logs:

[alfworld_insight.txt](https://github.com/user-attachments/files/19797435/alfworld_insight.txt)

And here are the test accuracy and test logs：

![Image](https://github.com/user-attachments/assets/7a348818-009d-4fad-b238-6a6bd571f7f1)
[alfworld_eval_true.txt](https://github.com/user-attachments/files/19797451/alfworld_eval_true.txt)

All the results are far from what was reported in the paper. I don't know where the problem lies. Maybe it's because I replaced the model with GPT-4o?

I cannot reproduce the results in the paper. #8