Confused about evaluation code #12

@alterdim

Description

Hello,
the functions in the "data extraction" folder seem to be dedicated to evaluation, but they produce very strange results (e.g., a generated answer of "Yes" gets full points where the gold label was "No"). Am I missing something? Google reported around 70 with a 4B model in their paper, and I assume they didn't do simple "==" matching.
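For comparison, here is a minimal sketch (my own, not from this repo) of the SQuAD-style normalized exact-match scoring that QA evaluations commonly use — it tolerates surface differences like casing and punctuation, but would never let "Yes" match "No":

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match(pred: str, gold: str) -> bool:
    """Score 1 only when the normalized strings agree."""
    return normalize(pred) == normalize(gold)

print(exact_match("Yes.", "yes"))  # True: same answer, different surface form
print(exact_match("Yes", "No"))    # False: opposite answers never match
```

Is the intended scoring something along these lines, or is there a separate matching step I'm not seeing?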
