Wednesday, December 27, 2006

Command of the day: convmv

If you move files from one system (Ubuntu: default encoding is utf8) to another (Debian: the historical default encoding for Catalans who want to use the euro sign is latin1), or get files from CDs burned on Windows (Spanish Windows encoding is cp850), you will surely get strange signs in the filenames instead of accents, tilded n's, c's with cedillas and opening question marks, not to mention the (for us) weird Nordic and Eastern signs. You may say that using such names for files is asking for trouble, but I like them for mp3s so that I can convert back and forth between id3 tags and filenames.

Now that I have to move to another office and another computer, I ran into exactly these filename encoding problems. I knew how to convert the encoding of a file's content with iconv, but not how to solve the same problem for the file names. 'apt-cache search' gave me the solution: convmv.
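For comparison, this is roughly what the iconv side of the story looks like. It is only a small sketch: the file names and the temporary directory are made up for the illustration, and it converts the content of a file, never its name.

```shell
# Work in a throwaway directory (hypothetical names, just for the demo).
workdir=$(mktemp -d)

# A tiny Latin-1 file: "año" with ñ stored as the single byte 0xF1 (octal 361).
printf 'a\361o\n' > "$workdir/sample-latin1.txt"

# Convert the content from Latin-1 to UTF-8; the file name is left alone.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/sample-latin1.txt" > "$workdir/sample-utf8.txt"

# The UTF-8 copy is one byte longer, because ñ now takes two bytes (0xC3 0xB1).
wc -c < "$workdir/sample-latin1.txt"
wc -c < "$workdir/sample-utf8.txt"
```

iconv reads and writes streams of bytes, so it has no way of renaming anything, which is the gap convmv fills.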

You can use it recursively with the '-r' option, it detects which files are already in utf8, and you have to explicitly indicate the 'from' and 'to' encodings. It would be nice if it detected the original encoding when there is no ambiguity, but at least it lets you do a tentative conversion and see how the names would look afterwards. So the way of working is:

Do a tentative conversion of all the files you need to convert (without '--notest', convmv only prints what it would do and renames nothing):

convmv -f latin1 -t utf8 -r . | less

Check the output and see whether the non-utf8 files would be properly converted. If some files were neither latin1 nor utf8, their names will not come out right; just try a different source encoding for those particular files. Once you know which encoding to convert from, just execute:

convmv -f latin1 -t utf8 -r . --notest
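When the test run shows garbage for a few names, the "try a different source encoding" step can also be rehearsed on a single name with iconv before touching convmv again. The sketch below is only an illustration: `preview_name` is a hypothetical helper, and the sample name is invented, assuming a file whose ñ was stored as the cp850 byte 0xA4.

```shell
# Hypothetical helper: show how the raw bytes of a name read under a
# candidate source encoding. $1 = candidate encoding, $2 = raw file name.
preview_name() {
    printf '%s' "$2" | iconv -f "$1" -t UTF-8
}

# Invented sample: "españa.mp3" with ñ stored as byte 0xA4 (octal 244), as cp850 does.
raw_name=$(printf 'espa\244a.mp3')

# Wrong guess: in latin1 the byte 0xA4 is the generic currency sign.
preview_name ISO-8859-1 "$raw_name"; echo

# Right guess: in cp850 the byte 0xA4 is ñ.
preview_name CP850 "$raw_name"; echo
```

Once a guess reads correctly, the same encoding name should be the one to pass to convmv's '-f' option for those particular files, first without '--notest' to double check.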

If the applied conversion was not the proper one, just apply the conversion backwards and you'll have the original names again.
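Why going backwards works can be seen on the bytes of a single name. This is a sketch with iconv rather than convmv itself, on an invented name, and it relies on latin1 assigning a character to every byte; a source encoding with unassigned bytes might refuse the conversion instead of round-tripping.

```shell
# Invented sample: a name that is really cp850 (ñ stored as byte 0xA4, octal 244).
original=$(printf 'espa\244a.mp3')

# The wrong conversion: treat the name as latin1 and convert it to utf8.
wrong=$(printf '%s' "$original" | iconv -f ISO-8859-1 -t UTF-8)

# The backwards conversion: utf8 back to latin1.
restored=$(printf '%s' "$wrong" | iconv -f UTF-8 -t ISO-8859-1)

# Compare the restored bytes with the original ones.
if [ "$restored" = "$original" ]; then
    echo "round trip restored the original bytes"
fi
```

With convmv, the backwards step would presumably be the same command with the '-f' and '-t' values swapped.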

1 comment:

Anonymous said...

convmv -f ISO-8859-1 -t utf8 -r .
for files imported from JFS that made Python crash. Fixed, thanks!