1. What is the base model used in Table 1 and Table 4? I notice a gap between the GPG results reported in the two tables.
2. Do you filter the M right/wrong answers within the B samples, or leave them as they are?