Thank you for integrating and open-sourcing the benchmark dataset.
I noticed some inconsistencies between the statistics reported in the paper and the released data in benchmarks/CodonBERT/data. Here are the parts that confuse me:
- For the MLOS flu vaccine data, Table 1 of the paper lists 543 mRNA samples, but I found only 167 samples in the released data.
- For the SARS-CoV-2 vaccine degradation data, Table 1 of the paper lists 2400 mRNA samples, but I found only 233 samples in the released data.
Could you kindly clarify these discrepancies?
BTW, I noticed that some of the datasets are very small. With a single 0.7/0.15/0.15 split on such a small dataset, the test set contains only a few dozen samples, so metrics like correlation computed on it are not reliable. It would be better to use k-fold cross-validation.
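
To make the suggestion concrete, here is a minimal sketch of what I mean by k-fold evaluation, using only the Python standard library. Everything in it is an assumption for illustration: the data is synthetic (167 points, matching the sample count I found for the flu data), the one-feature least-squares fit is just a stand-in for the real model, and the function names are hypothetical rather than anything from this repo. The idea is that every sample is predicted exactly once by a model that did not see it, and the correlation is computed over all out-of-fold predictions instead of over a single 15% test split.

```python
import random
from statistics import mean


def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle indices once and split them into k nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b (placeholder for the real model)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def cross_validated_correlation(xs, ys, k=5, seed=0):
    """Pool out-of-fold predictions so every sample is scored exactly once."""
    preds = [None] * len(xs)
    for test_idx in kfold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test_idx:
            preds[i] = a * xs[i] + b
    return pearson(preds, ys)


# Synthetic stand-in data, NOT the benchmark itself.
rng = random.Random(42)
xs = [rng.uniform(0, 1) for _ in range(167)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

r = cross_validated_correlation(xs, ys, k=5)
print(f"5-fold out-of-fold Pearson r: {r:.3f}")
```

Reporting the per-fold mean and standard deviation would work as well; either way the metric is based on all samples rather than on one small held-out subset.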