Documentation, WebArena 2.0, Evaluation Cache#62

Open
shuyanzhou wants to merge 8 commits into web-arena-x:main from shuyanzhou:main

@shuyanzhou shuyanzhou commented Sep 8, 2024

Documentation

Update README files for compatibility with both WebArena (WA) and VisualWebArena (VWA).

WebArena 2.0

WebArena 2.0 addresses annotation issues reported by various users. Specifically:

  • WebArena 2.0 minimizes the use of exact_match and must_include for information-seeking tasks evaluated with StringEvaluator. The migration from the old evaluators to the new ones generally follows these rules:
    - exact_match -> fuzzy_exact_match
    - must_include, fuzzy_match:
        - If the list contains 1 item -> fuzzy_exact_match
        - If the list contains > 1 item:
            - For elements on the same level and the same topic -> fuzzy_must_include
            - For elements covering different aspects -> context_qa
    - na -> fuzzy_na_match, which explicitly evaluates the reasoning behind unachievable outcomes.
    - Reddit post-related -> qa
        - `context_qa` evaluates content based on both the intent and the answer.
        - `qa` evaluates based only on the answer, as the intent is not relevant.
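
As a rough illustration of the migration rules above, here is a minimal sketch. The evaluator key names come from this description; the function `migrate_reference_answers` and its `same_topic` flag are hypothetical and are not part of the repository. The flag stands in for the annotator's judgment about whether list items share a level and topic. The Reddit rule is left out because the description does not say which old key it replaces.

```python
from typing import Any, Dict


def migrate_reference_answers(
    old: Dict[str, Any], same_topic: bool = True
) -> Dict[str, Any]:
    """Map old StringEvaluator keys to the WebArena 2.0 keys (illustrative only)."""
    new: Dict[str, Any] = {}
    for key, value in old.items():
        if key == "exact_match":
            new["fuzzy_exact_match"] = value
        elif key in ("must_include", "fuzzy_match"):
            # Normalize a bare string to a one-item list.
            items = value if isinstance(value, list) else [value]
            if len(items) == 1:
                new["fuzzy_exact_match"] = items[0]
            elif same_topic:
                new["fuzzy_must_include"] = items
            else:
                new["context_qa"] = items
        elif key == "na":
            new["fuzzy_na_match"] = value
        else:
            # Keys not covered by the rules pass through unchanged.
            new[key] = value
    return new
```

For example, `migrate_reference_answers({"must_include": ["a", "b"]}, same_topic=False)` would yield a `context_qa` entry, while a one-item list collapses to `fuzzy_exact_match`.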

The prompts are tested in evaluation_harness/eval_evaluators.

  • Other fixes
**Fixes from GitHub issues**
https://github.com/web-arena-x/webarena/issues/100
2: product type is very vague. Removed
3: update the intent to indicate tied rank
4: update the intent to indicate tied rank
5: type is too vague, add the scope

https://github.com/web-arena-x/webarena/issues/135
45: update the intent to be more accurate

https://github.com/web-arena-x/webarena/issues/137
425: update the intent to be more accurate

**Individual fixes**
Template 324: remove the ranking requirement.
Template 204: use a combination of context_qa and must_include.
792 and 793 were deleted because the reason is not very sound.
Fix errors found by THU group [THU-Webarena-lite Bug Fixing](https://docs.google.com/spreadsheets/d/13BRuRlU_Z_UBcucjQ5myvrRdB0P0ID3Nj-dWlzawuYo/edit#gid=1021875443) 

**Typo, grammar**
by far -> so far
https://github.com/web-arena-x/webarena/issues/133
correpong -> corresponding 
telll -> tell
canlled -> cancelled
what could -> how could
competative -> competitive

Evaluator

Support a result cache so that evaluation can be run offline. This is helpful if we accept submissions in the future: participants only need to upload their cached files, and we can perform evaluation quickly without rerunning their models.
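
To make the idea concrete, here is a minimal sketch of what offline evaluation from a cache could look like. This is not the implementation in this PR: the JSON layout (task id mapped to final answer), the function names `save_cache` and `evaluate_offline`, and the pluggable `scorer` are all assumptions for illustration.

```python
import json
from typing import Callable, Dict


def save_cache(path: str, answers: Dict[str, str]) -> None:
    """Write each task's final answer to a JSON file (the file a participant would upload)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(answers, f, indent=2, sort_keys=True)


def evaluate_offline(
    path: str,
    references: Dict[str, str],
    scorer: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score cached answers against references without rerunning any model."""
    with open(path, encoding="utf-8") as f:
        cached = json.load(f)
    scores: Dict[str, float] = {}
    for task_id, reference in references.items():
        if task_id in cached:
            scores[task_id] = scorer(cached[task_id], reference)
        else:
            # A task missing from the cache counts as a failure.
            scores[task_id] = 0.0
    return scores
```

Passing the scorer in as a callable keeps the sketch independent of any particular evaluator; a real setup would presumably plug in the repository's own evaluators instead.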

@shuyanzhou shuyanzhou changed the title from "Document update and evaluation update" to "[WIP] Document update and evaluation update" Sep 23, 2024
@shuyanzhou

@kohjingyu @ljang0 can you check a few examples on vwa to make sure it is not broken?

@shuyanzhou shuyanzhou changed the title from "Document update and evaluation update" to "Documentation, WebArena 2.0, Evaluation Cache" Sep 23, 2024
@kohjingyu kohjingyu mentioned this pull request Nov 3, 2024