-
Notifications
You must be signed in to change notification settings - Fork 12
Description
This bug has been driving me mental for a month, and caused me to tear the last of my hair out, but now I've localized it. Posting a report here for posterity. The bug comes from the CASA synthesis libs. I don't know if it's been fixed in CASA (and thus in the new https://github.com/radio-astro/casasynthesis package), we need to follow this up. (If it's not fixed in CASA then CHILES is fucked basically, since this affects the major loop of CLEAN).
The bug manifested itself as follows. When using lwimager to predict visibilities into MODEL_DATA from a large image cube, for example:
lwimager fillmodel=1 operation=csclean wprojplanes=0 field=0 spwid=0 stokes=I padding=1 cachesize=4096 niter=0 nchan=116 chanstart=16 chanstep=1 npix=6144 cellsize=0.06arcsec img_chanstart=16 mode=channel ms=CYG-C-1scan-testpaste.MS model=model3 -0-14.img fixed=1
I was getting different results on different machines (MODEL_DATA visibilities with a 10-20% difference!) Funnily enough, visibilities in the higher channels were always identical, it was only the lower-numbered channels that diverged.
I haven't pinpointed the bug but through experiments I've localized its behaviour. Somewhere in https://github.com/casacore/casarest/blob/master/synthesis/MeasurementEquations/CubeSkyEquation.cc is some funky code for deciding if an image cube is "too big" w.r.t. available RAM. If it is "too big", the predict() loop iterates over chunks of channels, picking a chunk size based on RAM size. If it's not too big, predict() processes the whole cube at once.
It looks like only the last chunk is predicted properly. Preceding chunks manifest the abovementioned error. The magnitude of the error depends on chunk size. So on a big-RAM machine where the whole cube is processed in one go, the visibilities are correct. On smaller-RAM machines, as soon as chunking kicks in, the lower channels get predicted incorrectly. Exact numbers depend on RAM size :))) so I was getting crazy inconsistencies between different machines.
Since it doesn't actually report on chunking explicitly, you need to look at the following console output:
2015-10-02 17:40:28 INFO CSCleanImageSkyModel::solve niter=0, assuming predict-only mode, converting model image to MODEL_DATA visibilities
0%....10....20....30....40....50....60....70....80....90....100%
It prints a 0 to 100% meter for each chunk. If you only see one 0...100% meter you're good, it did the whole cube in one go. If you see two or more meters, the bug has kicked in.
Obvious workaround for this use case is to split the cube in frequency to ensure single-chunk processing, and invoke lwimager multiple times. Will post more on this elsewhere.