The paper "Protein−Ligand Scoring with Convolutional Neural Networks" says: "The performance of trained CNN models were evaluated by 3-fold cross-validation for both the pose prediction and virtual screening tasks. To avoid evaluating models on targets similar to those in the training set, training and test folds were constructed by clustering data based on target families rather than individual targets."
However, I couldn't find the dataset here, and I don't understand how you constructed the test folds by target family. I also couldn't find the test folds for pose prediction here.
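To make the question concrete, here is a minimal sketch of what I understand by "folds constructed by target family". It is only my guess, not the method from the paper: the function `family_folds`, the target names, and the family labels are all made up for illustration, and it assumes a family label per target is already available instead of being derived by clustering.

```python
from collections import defaultdict


def family_folds(target_to_family, n_folds=3):
    """Assign targets to folds so that no family is split across folds."""
    # Group targets by their (hypothetical) family label.
    families = defaultdict(list)
    for target, family in target_to_family.items():
        families[family].append(target)

    # Greedy balancing: place the largest families first, each into the
    # fold that currently holds the fewest targets.
    folds = [[] for _ in range(n_folds)]
    for family in sorted(families, key=lambda f: (-len(families[f]), f)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(families[family]))
    return folds


# Made-up targets and families, for illustration only.
example = {
    "t1": "kinase", "t2": "kinase", "t3": "kinase",
    "t4": "protease", "t5": "protease",
    "t6": "gpcr",
    "t7": "nuclear_receptor",
}

folds = family_folds(example)
for i, test_fold in enumerate(folds):
    # Train on every fold except the held-out one.
    train = [t for j, fold in enumerate(folds) if j != i for t in fold]
    print(f"fold {i}: test={test_fold} train={train}")
```

With a split like this, every target in a test fold belongs to a family that never appears in the corresponding training folds. Is this roughly what was done, and if so, how were the families defined?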