Some questions about the paper

1. What is the base model in Table 1 and in Table 4? I see a gap in the GPG results between the two tables.
2. Do we filter the M right/wrong answers in the B samples, or just leave them as they are?