I have some files whose names are badly encoded. I think they came from a FAT32 pen drive that used the Latin-1 encoding, and they are now on an ext4 filesystem. When I view the directory in rover, those files show up as an empty line; not even the size is shown.
I can reproduce the behaviour by creating a bogus file (I'm using a Spanish locale, but any UTF-8 locale should work):
$ touch "$(printf 'bad\355char')"
$ ls
'bad'$'\355''char'
$ locale
LANG=es_ES.UTF-8
LC_CTYPE="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_PAPER="es_ES.UTF-8"
LC_NAME="es_ES.UTF-8"
LC_ADDRESS="es_ES.UTF-8"
LC_TELEPHONE="es_ES.UTF-8"
LC_MEASUREMENT="es_ES.UTF-8"
LC_IDENTIFICATION="es_ES.UTF-8"
LC_ALL=
$ rover
The problem is that the \355 byte (í in Latin-1) is 0xED, which in UTF-8 is the lead byte of a 3-byte sequence. Since the bytes that follow are not continuation bytes, the name is not a valid UTF-8 string.
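To make the byte-level reasoning concrete, here is a minimal sketch of a structural UTF-8 check. The names utf8_seq_len() and utf8_first_bad() are hypothetical and not part of rover; the check is simplified and does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Expected length of a UTF-8 sequence given its lead byte, or 0 if the
 * byte cannot start a sequence (continuation byte or invalid lead). */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 110xxxxx */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* 1110xxxx, includes 0xED */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 11110xxx */
    return 0;
}

/* Offset of the first byte that starts a malformed sequence, or -1 if
 * every sequence has the right number of continuation bytes. */
static long utf8_first_bad(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    long i = 0;

    while (p[i]) {
        size_t n = utf8_seq_len(p[i]);
        if (n == 0)
            return i;
        for (size_t k = 1; k < n; k++) {
            /* A NUL terminator also fails this test, so we never read
             * past the end of the string. */
            if ((p[i + k] & 0xC0) != 0x80)
                return i;
        }
        i += (long)n;
    }
    return -1;
}
```

With this sketch, utf8_first_bad("bad\355char") reports offset 3, the position of the 0xED byte, while the correctly encoded name "bad\303\255char" passes.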
The functions mbstowcs() and swprintf() fail silently: they return -1 because they cannot convert the string. So nothing gets copied to the WBUF buffer, and the row remains empty.
If you create a bogus directory, the behavior is even more interesting. WBUF keeps its contents from the previous use, so the directory shows up with the name of the CWD or of the previous directory.
$ mkdir bad-$'\355'
$ rover
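One way to avoid both the empty row and the stale name would be to check the conversion result and reset the buffer on failure. The following is only a sketch: name_to_wide() is a hypothetical helper, and I have not checked how it would fit into rover's actual use of WBUF.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper, not rover's code: convert a multibyte name into
 * dst, which has room for n wide characters (n > 0).
 * Returns 0 on success. On failure returns -1 and leaves dst as an empty
 * string, so the caller never sees leftovers from a previous conversion. */
static int name_to_wide(wchar_t *dst, size_t n, const char *src)
{
    size_t r = mbstowcs(dst, src, n - 1);

    if (r == (size_t)-1) {
        /* Invalid sequence in the current locale: discard whatever was
         * partially written or left over from an earlier call. */
        dst[0] = L'\0';
        return -1;
    }
    dst[r] = L'\0';  /* r <= n - 1, so this stays inside the buffer */
    return 0;
}
```

The caller can then decide what to show when the function returns -1, instead of silently printing whatever the buffer held before.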
I was thinking about how to solve the issue; perhaps a workaround like the one ls(1) uses, replacing the spurious character with a ? symbol or similar. Deletion and other operations work fine.