Archives Containing Non-ASCII Filenames

Archiving files whose names contain non-ASCII characters can cause problems, because support for such filenames varies considerably among the many implementations of each archive format, although the situation is improving.

Recent tar implementations on UNIX and Unix-like systems support the POSIX format specified by POSIX.1-2001 (also known as the pax format), so non-ASCII filenames are handled safely. On the MS Windows platform, a number of archival utilities store filenames using the current codepage, so the names of files extracted from such archives can become garbled.
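As an illustration of how the POSIX.1-2001 format keeps such names intact, the following minimal sketch uses Python's standard tarfile module to write an in-memory archive in that format and read it back. The helper names make_pax_archive and list_members and the sample filename are assumptions made for this example only.

```python
import io
import tarfile


def make_pax_archive(member_name, payload):
    """Build an in-memory tar archive in POSIX.1-2001 (pax) format."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_members(archive_bytes):
    """Return the member names stored in an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return tar.getnames()


# Hypothetical sample name containing non-ASCII characters.
archive = make_pax_archive("résumé.txt", b"hello\n")
print(list_members(archive))

# The pax extended header carries the name as a UTF-8 "path" record.
print("path=résumé.txt".encode("utf-8") in archive)
```

Because the name travels as UTF-8 in an extended header record, it does not depend on the codepage of the system that created the archive.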

In that case, you can use the convmv(1) tool to repair the names if the codepage is known. Note that convmv only reports the renames it would perform unless the --notest option is given:

$ convmv -f cp437 -t utf8 my_extracted_filename
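The repair amounts to reinterpreting the bytes of each filename in the source codepage and re-encoding them in the target encoding. The sketch below shows that idea in Python; it is not how convmv itself is implemented. The function names recode_name and repair_directory are hypothetical, and skipping names that already decode as the target encoding is a simple heuristic chosen for this example to avoid converting a name twice.

```python
import os


def recode_name(raw_name, from_enc, to_enc):
    """Reinterpret filename bytes from one encoding into another."""
    return raw_name.decode(from_enc).encode(to_enc)


def repair_directory(directory, from_enc="cp437", to_enc="utf-8", dry_run=True):
    """Plan (and optionally perform) renames for entries in `directory`.

    `directory` must be a bytes path so that os.listdir() returns the raw
    filename bytes. Returns a list of (old_name, new_name) pairs.
    """
    planned = []
    for raw_name in os.listdir(directory):
        try:
            # Heuristic: a name that is already valid in the target
            # encoding (including plain ASCII) is left alone.
            raw_name.decode(to_enc)
            continue
        except UnicodeDecodeError:
            pass
        new_name = recode_name(raw_name, from_enc, to_enc)
        planned.append((raw_name, new_name))
        if not dry_run:
            os.rename(os.path.join(directory, raw_name),
                      os.path.join(directory, new_name))
    return planned


# Byte 0x82 is "é" in code page 437.
print(recode_name(b"caf\x82.txt", "cp437", "utf-8").decode("utf-8"))
```

The dry_run default means nothing is renamed until the planned list has been reviewed, which is worthwhile because a wrong guess at the source codepage produces a different kind of garbling.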

In Zip files, the original specification sets the encoding of file names and file comments to IBM437 (code page 437). In 2007 PKWARE extended the specification to also allow UTF-8. In the meantime, various zip implementations, usually on the MS Windows platform, adopted the strategy of using the current codepage as the filename encoding.
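In the extended specification, an entry signals UTF-8 names by setting bit 11 (0x800) of its general purpose bit flag. Python's standard zipfile module exposes this as ZipInfo.flag_bits, so the following sketch can report which entries declare UTF-8. The helper name utf8_flags and the sample member names are assumptions for this example. An entry without the flag may be in IBM437 or in whatever codepage the creating tool used; the archive does not say which.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: names and comments are UTF-8


def utf8_flags(archive_bytes):
    """Map each member name to True if the entry declares UTF-8 names."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


# Build a small in-memory archive with one ASCII and one non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"a")
    zf.writestr("résumé.txt", b"b")

for name, is_utf8 in utf8_flags(buf.getvalue()).items():
    print(name, "declares UTF-8" if is_utf8 else "no UTF-8 flag")
```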

Info-ZIP's Zip 3.0, used in Oracle Solaris 10 and Oracle Solaris 11, stores filenames in UTF-8, so if an archive is both created and extracted with this version, the filenames are not corrupted.

When a zip archive using a non-UTF-8 encoding to store the file names is extracted on Oracle Solaris, the file names might get garbled. You can use the convmv(1) tool to repair them, if the codepage is known:

$ convmv -f cp437 -t utf8 my-unzipped-filename
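The names can also be corrected while reading the archive, before anything is written to disk. Python's zipfile module decodes names that lack the UTF-8 flag as code page 437, so the original bytes can be recovered by encoding back to cp437 and then decoding with the codepage that was really used. The sketch below assumes that codepage is known; fix_member_name and list_names are hypothetical helper names.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: names and comments are UTF-8


def fix_member_name(name_as_read, flag_bits, real_encoding):
    """Undo the cp437 decoding applied to a name stored in another codepage."""
    if flag_bits & UTF8_FLAG:
        return name_as_read  # declared UTF-8, already decoded correctly
    # cp437 maps every byte value, so this recovers the stored bytes exactly.
    return name_as_read.encode("cp437").decode(real_encoding)


def list_names(zip_file, real_encoding):
    """Return member names, reinterpreting unflagged names as real_encoding.

    `zip_file` is a path or a binary file object; `real_encoding` is the
    codepage the archive was assumed to be created with, for example "cp866".
    """
    with zipfile.ZipFile(zip_file) as zf:
        return [fix_member_name(info.filename, info.flag_bits, real_encoding)
                for info in zf.infolist()]
```

For example, list_names("archive.zip", "cp866") would list the members of a hypothetical archive.zip created on a system using code page 866.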