What script did you use to generate the files matching `data/*.clean`? In particular, did you apply any preprocessing to the data?