fsexamc - Examine encoding of a file name or content and convert to UTF-8
fsexamc [-a] [-b] [-d dry-run-result-file] [-E module-name] [-e encoding-list] [-F] [-f 'expression'] [-g history-length] [-H] [-k] [-L logfile] [-l] [-n] [-P] [-p] [-R] [-r] [-s] [-t] [-w]
fsexamc [-V]
fsexamc [-?]
The fsexamc utility examines file names or file contents and tries to convert them from legacy encodings to UTF-8 using a given encoding list, the system default encoding list, or both.
When converting file names, fsexamc processes regular file names, directory names, and symbolic links by default. When converting file content, it handles only regular plain text files by default. Use "-E module-name" to enable special file handling.
fsexamc ignores most non-plain-text files, such as binary files, office document files, and image files. Forcing conversion of such files with the -F option might produce unexpected results. Internally, fsexamc uses the file(1) utility to determine whether a file is a plain text file.
By default, fsexamc converts file names. To convert file contents instead, specify the -t option.
To help find the best encoding, fsexamc has encoding lists for supported languages. Each list contains the most popular codesets or encodings of the corresponding language. For example, fsexamc specifies GB18030, BIG5, EUC-TW, and so on for Simplified Chinese. The list is used to generate conversion candidates. You can use the "-e encoding-list" option to add encodings beyond the system pre-defined ones. If the -a option is specified, encodings suggested by the encoding auto-detection library are added to the encoding list for possible use. Encodings specified with the -e option have higher priority than automatically detected encodings.
The following options are supported:
-a
Enable encoding auto-detection. fsexamc can guess the encodings of file names or file contents with the help of encoding auto-detection library interfaces. Use this option when you do not know the encodings of files. Note that in file name conversions, statistics-based auto-detection may not be reliable because file names contain only a small number of characters.
-b
Batch mode, also known as non-interactive mode. In this mode, fsexamc does not display candidates or wait for the user's selection or confirmation.
Make sure your terminal can display UTF-8 characters when using this option. Otherwise, illegible or gibberish characters may be presented.
-d dry-run-result-file
Specifies the dry run result file. When used with the -n option, the dry run result is stored in the file. When used without the -n option, fsexamc converts based on the scenario in the supplied dry run result file.
The dry run result file will be created if it does not exist. If it exists as a regular file, the file will be truncated to zero length and overwritten.
When fsexamc creates a dry run result file, you can edit it and then feed it back to fsexamc to perform conversions based on the edited content. Edit carefully and preserve the file format. The recommended edit is to delete any wrong or inappropriate candidates and move the right one to the first position. For more information, refer to fsexam(5).
If the edited file does not conform to the file format described in fsexam(5), then fsexamc will print out a warning message and quit without doing anything.
-E module-name
Enable special file handling. Currently the only module is "COMPRESS". "ALL" can be used to enable all available modules.
The COMPRESS module supports several popular compression and archive formats. Currently, the module supports the .tar, .tar.gz, .tar.bz2, .zip, and .tar.Z file formats. When used with the -t option, fsexamc converts the contents of files inside archives, compressed files, or compressed archives. Without the -t option, fsexamc converts file names.
Note that the COMPRESS module ignores symbolic links inside archived or compressed files. It also ignores the -n option. The COMPRESS module handles archived or compressed files only if the -R option is specified. If there is no suitable ISO8859-1 codeset locale in the system, this option is not supported, as described in the NOTES section.
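The archive round trip that this page describes (unpack into a temporary directory, convert, repack, replace the original) can be pictured with ordinary tools. The following is a minimal sketch for .tar.gz files only, not the COMPRESS module's actual implementation. The function name repack_tar_gz is hypothetical, and the command given after the archive name stands in for the conversion step.

```shell
# repack_tar_gz ARCHIVE COMMAND [ARG...]
# Unpack ARCHIVE into a temporary directory, run COMMAND inside it,
# then rebuild the archive and replace the original.
# Hypothetical helper for illustration; not part of fsexamc.
repack_tar_gz() {
    archive=$1
    shift
    work=$(mktemp -d) || return 1
    tar -xzf "$archive" -C "$work" &&
        ( cd "$work" && "$@" ) &&
        tar -czf "$archive.new" -C "$work" . &&
        mv "$archive.new" "$archive"
    status=$?
    # Always remove the temporary tree, whether or not a step failed.
    rm -rf "$work"
    return "$status"
}
```

The original archive is replaced only if every step succeeds; a failure in the conversion command leaves it untouched.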
-e encoding-list
Specifies one or more colon or comma separated encodings to be used during conversion.
If neither this option nor the -a option is specified, fsexamc uses the system pre-defined encoding list for the current locale.
If specified without the -a, -p, or -P options, the list of encodings supplied with the -e option replaces the system pre-defined encoding list for this session.
Use -p to prepend it to the system pre-defined encoding list. Use -P to append it to the pre-defined encoding list. If you want to make the encoding-list permanent, instead of only for the current session, use the -S option.
When used with the -a option, fsexamc merges the supplied encoding list and the auto-detected encoding list. Note that the supplied encoding-list has higher priority than the auto-detected encodings.
In non-interactive mode, the first encoding that successfully converts the file name or file content to UTF-8 is used. In interactive mode, fsexamc displays all candidates that were successfully converted to UTF-8 from the encodings in the list. Encodings for which the conversion fails are not displayed in the list of candidates.
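The selection rule above can be approximated with iconv(1). This is only a sketch of the idea, assuming an iconv that exits with a non-zero status on invalid input; the helper names list_candidates and first_candidate are hypothetical and fsexamc's real logic may differ.

```shell
# list_candidates NAME ENC1:ENC2:...
# Print "encoding<TAB>converted name" for every encoding in the list that
# converts NAME to UTF-8 without error (the interactive-mode view).
# Hypothetical helper; not fsexamc code.
list_candidates() {
    name=$1
    old_ifs=$IFS
    IFS=':,'            # the list may be colon or comma separated
    set -- $2           # split the list into positional parameters
    IFS=$old_ifs
    for enc in "$@"; do
        if converted=$(printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%s\t%s\n' "$enc" "$converted"
        fi
    done
}

# Batch mode corresponds to keeping only the first successful candidate.
first_candidate() {
    list_candidates "$1" "$2" | head -n 1
}
```

A successful conversion does not prove that the encoding is the right one: single-byte encodings accept almost any byte sequence, which is why the order of the list matters.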
-F
Forcible conversion mode. By default, fsexamc first determines whether the file name or file content is already in UTF-8 and, if so, does not convert it. However, fsexamc has no completely accurate way to determine whether a string is in UTF-8, so a byte sequence in a legacy encoding is sometimes treated as a valid UTF-8 string. For example, three Simplified Chinese characters in GB2312 (two bytes per character) could be treated as two valid UTF-8 characters (three bytes per character). Use this option to bypass the verification step and force the conversion.
This option has to be used with caution, and you should avoid using it with the -R option whenever possible. It may convert real UTF-8 encoded file names or file contents to unintended characters.
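The ambiguity described above can be reproduced with six bytes that are well-formed UTF-8 and also fall entirely within the two-byte GB2312 range. The sketch below uses iconv(1) as a stand-in validity check; the helper name is_utf8 is hypothetical and fsexamc's own check may work differently.

```shell
# Six bytes: two 3-byte UTF-8 sequences, or three 2-byte GB2312 pairs.
ambiguous=$(printf '\344\270\260\344\270\260')

# is_utf8 STRING: succeed if STRING is well-formed UTF-8 according to iconv.
# Hypothetical helper for illustration.
is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

if is_utf8 "$ambiguous"; then
    echo "well-formed UTF-8: skipped unless conversion is forced"
fi

# The same bytes read as GB2312, if this iconv supports that encoding.
if gb=$(printf '%s' "$ambiguous" | iconv -f GB2312 -t UTF-8 2>/dev/null); then
    printf 'read as GB2312: %s\n' "$gb"
else
    echo "could not be converted from GB2312 with this iconv"
fi
```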
-f 'expression'
Search files according to 'expression'. The 'expression' here is a subset of the 'expression' used in find(1). Unlike find(1), however, the 'expression' here must begin with the path name of the starting point in the directory hierarchy that you want to search. Following the path name, the items valid for the expression are the following options and their combinations: -name, -amin, -atime, -cmin, -ctime, -group, -mmin, -mtime, and -user. Refer to find(1) for more information. Internally, fsexamc uses find(1) to perform the search.
Enclose the whole expression in single quotes, because the shell may expand special characters in it if you use double quotes.
When this option is used, any other operands are ignored.
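To see what such an expression selects, it can help to run the same starting point and predicates through find(1) directly. The sketch below does that in a throwaway directory; the file names are made up for the illustration.

```shell
# Build a small tree to search (throwaway directory).
demo=$(mktemp -d)
touch "$demo/notes.txt" "$demo/image.png"

# The expression '. -name "*.txt"' names the starting point first and the
# predicates after it; written as a plain find(1) call it reads:
matches=$(cd "$demo" && find . -name "*.txt")
echo "$matches"

rm -rf "$demo"
```

Inside the single quotes the inner double quotes and the asterisk reach fsexamc unchanged, so the pattern is not expanded by the calling shell.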
-g history-length
Set the history length. fsexamc saves information about what it has done and uses that information to handle restore operations.
By default, fsexamc will save history information for 100 fsexamc executions as long as disk space permits. A single batch conversion counts as one. Use this option to change the default value.
If you change the length from a higher value to a lower value, the older history information will be purged.
When the number of history entries reaches the limit, fsexamc discards the oldest history information to make room for new history information.
-H
Handles hidden files. Unless this option is specified, hidden files (files whose names start with a dot) are ignored.
-k
By default, during file name conversions, if both a symbolic link and its source belong to the user-supplied list of files or lie under a starting directory given as an operand, fsexamc tries to keep them consistent. In other words, if a source name is converted, then not only the symbolic link itself (when applicable) but also the content of the symbolic link is converted. If a given source name is not converted for some reason, the corresponding symbolic link content is also not converted and a warning message is issued. If either is not in the operand-specified list, fsexamc may break the symbolic link.
This default symbolic link processing needs more resources and computation time, so using the -k option to bypass it is recommended if you have no symbolic links.
During content conversions and dry run conversions, fsexamc does not care about the symbolic link contents.
-l
List all available encodings supported by fsexamc.
-L logfile
If specified, fsexamc writes a log to the given log file. By default, no log file is written.
The basic log file format is:
(category) fullpath: message
The possible "category" values are "ERROR", "WARNING", and "INFO". The "fullpath" is the full path of the file that is handled. The "message" briefly describes the operation result.
If the "fullpath" or the "message" contain non-UTF-8 characters, fsexamc writes their hexadecimal byte values prefixed with "x" such as "xAEx89" into the file.
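The byte notation can be reproduced with od(1), which is handy when searching a log for a particular file name. This is a minimal sketch; the helper name hex_escape is hypothetical, and unlike fsexamc it escapes every byte it is given rather than only the bytes that are not valid UTF-8.

```shell
# hex_escape: read bytes on standard input and print each one as xHH.
# Hypothetical helper; it escapes all input bytes.
hex_escape() {
    od -An -tx1 | tr -d ' \n' | tr 'abcdef' 'ABCDEF' | sed 's/../x&/g'
    echo
}

printf '\256\211' | hex_escape    # prints xAEx89
```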
-n
Dry run mode. In this mode, fsexamc writes conversion information to the dry-run-result-file specified with the -d option instead of actually converting the file names or contents.
If used with the -a option, the dry-run-result-file may contain more candidates.
Note that compressed or archived files are not supported in this mode, and consistency between symbolic links and their sources is not maintained.
-P
When used with the -e option, fsexamc appends the encoding-list to the system pre-defined encoding list. Otherwise, it has no effect.
-p
When used with the -e option, fsexamc prepends the encoding-list to the system pre-defined encoding list. Otherwise, it has no effect.
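The three ways a supplied encoding list can combine with the pre-defined one (replace, prepend, append) amount to simple string concatenation. The sketch below spells that out; the function effective_list and its mode names are hypothetical, and the lists are stand-ins built from encodings mentioned on this page, not the real pre-defined list of any locale.

```shell
# effective_list MODE USER_LIST SYSTEM_LIST
# MODE is "replace" (-e alone), "prepend" (-e with -p) or "append" (-e with -P).
# Hypothetical helper that mirrors the stated rules.
effective_list() {
    case $1 in
        replace) printf '%s\n' "$2" ;;
        prepend) printf '%s:%s\n' "$2" "$3" ;;
        append)  printf '%s:%s\n' "$3" "$2" ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

system_list="GB18030:BIG5:EUC-TW"   # stand-in for the pre-defined list
effective_list prepend "GB2312" "$system_list"
effective_list append  "GB2312" "$system_list"
```

Because batch mode takes the first encoding that converts successfully, prepending gives the supplied encodings precedence while appending makes them a fallback.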
-R
Recursive mode. In this mode, fsexamc recursively converts all applicable files and subdirectories under the directories specified as operands.
-r
With this option, fsexamc also handles files on NFS and other remote file systems. Without the option, fsexamc handles files on local disks only.
Mounting or unmounting a file system in the directory hierarchy that is being examined while fsexamc is running is not recommended.
-s
Restores file names to their original names. To restore file contents, also specify the -t option.
This option is useful when you want to restore files to their last states in case wrong conversions have been made.
When this option is used on a file, fsexamc restores its name or content. When used on a directory together with the -R option, fsexamc restores all files and subdirectories under the directory, including the directory itself, to their original names or contents.
-t
Converts file contents rather than file names. By default, fsexamc handles plain text files only.
Internally, fsexamc uses file(1) to determine whether a file is a plain text file or not.
Convert file names first, before converting contents, if there are files or directories that contain multi-byte characters in their names. Otherwise, you may get illegible characters in your log-file or dry-run-result-file.
-w
If specified with the -R option, fsexamc follows symbolic links to directories as if they were regular directories. If the -R option is not specified, fsexamc tries to convert the symbolic link and its source only. If the source is also a symbolic link, fsexamc continues with the source's source, and so on. By default, fsexamc does not follow symbolic links.
-V
Display version information of fsexamc and exit.
-?
Display usage information and exit.
The following operand is supported:
The pathname of a file or a directory to be converted. All arguments after "--" are treated as operands, even if they begin with a '-' character. If fsexamc encounters '-' as an operand, or no operand at all, it reads pathnames from the standard input.
The following will convert the name of file "myfile" using the system pre-defined encoding list:
% fsexamc myfile
If there is no pre-defined encoding for the current locale, fsexamc will exit with error messages.
Example 2 Recursively convert the names of files and subdirectories under the directory "mydir" with the given encoding list
% fsexamc -e GB18030:BIG5:EUC-TW --recursive mydir
Example 3 Dry run fsexamc with auto-detected encoding
The following will scan the directory "mydir" and try to convert file and directory names under the directory with the system pre-defined plus auto-detected encodings to UTF-8 and store the result into the file, "mydryrunresult" without actually changing the names:
% fsexamc --auto-detect --dry-run -d mydryrunresult \
    --recursive mydir
Example 4 Perform scenario based conversions using a dry run result file
The following will perform scenario based conversions by using the "mydryrunresult". The first candidate for each file name is used. If there is no candidate, no action will be taken on the file:
% fsexamc -d mydryrunresult
Example 5 Forcibly convert a file name
The following will convert the file "myfile" by using the system pre-defined encodings even if fsexamc thinks it is UTF-8 encoding already. This option should be used with caution as it may corrupt the already UTF-8 file names and contents:
% fsexamc --force myfile
Example 6 Convert files generated by other utility
The following two examples have the same effect: they convert the files found by the find(1) command, using the system pre-defined and auto-detected encodings:
% /usr/bin/find . -name "*.txt" | fsexamc --auto-detect
% fsexamc --auto-detect `/usr/bin/find . -name "*.txt"`
The following is similar to the two examples above, except that it uses only the system pre-defined encodings and takes its file list from the ls(1) utility:
% /usr/bin/ls *.txt | fsexamc
The following will search for all files whose names end with '.txt' under the current directory and convert them using the system pre-defined encoding list:
% fsexamc -f '. -name "*.txt"'
Example 7 Batch mode conversion
The following will use GB18030 and BIG5 to recursively convert file names under the directory "mydir" and use the first candidate to convert the file names:
% fsexamc --batch -e GB18030:BIG5 --recursive mydir
Example 8 Follow symbolic links and handle hidden files
The following will follow all symbolic links in the directory "mydir" and symbolic links in the symbolic link source's directory. Hidden files under the directory will be converted also:
% fsexamc --follow --hidden --recursive mydir
Example 9 Convert file contents recursively using specified encoding list
The following will recursively scan files under the directory "mydir". For each plain text file, it will automatically detect its possible encodings, combine them with GB18030 and BIG5, and try the resulting encodings one by one. As soon as a conversion succeeds, fsexamc is done with the file and the remaining encodings are not tried. If a file is a compressed or archived file, fsexamc will first uncompress and unarchive it into a temporary directory, perform the above operation, compress and archive the result again, and replace the original file:
% fsexamc --conv-content --recursive -e GB18030:BIG5 \
    --auto-detect --enable-module COMPRESS mydir
Example 10 Restore a file name or file content
The following restores the file "myfile" to its original name:
% fsexamc --restore myfile
The following restores the content of "myfile" to its original content:
% fsexamc --conv-content --restore myfile
The following exit values are returned:
File names or contents are converted successfully or corresponding information is written to a dry run result file successfully.
An error occurred. More information can be retrieved from a log file if the "-L log-file" option and option argument are supplied.
See attributes(7) for descriptions of the following attributes:
When you want to convert names of many files, do not convert them one by one in a loop. Try to construct a list of files and give the list to fsexamc for conversions. For example, the following is not recommended:
for file in *
do
    fsexamc -b $file
done
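The loop above starts one fsexamc process per file. A single invocation that receives the whole list as operands could look like the following sketch. The function name convert_dir and the FSEXAMC variable are hypothetical; the variable exists only so that the sketch can be tried with a stand-in command.

```shell
# convert_dir DIR: hand every non-hidden entry of DIR to one fsexamc run
# in batch mode. "--" marks everything after it as an operand, so names
# that begin with '-' are not mistaken for options.
# FSEXAMC is a hypothetical override used only for experimenting.
convert_dir() {
    ( cd "$1" && "${FSEXAMC:-fsexamc}" -b -- * )
}
```

With a very large directory the argument list may exceed the system limit; in that case feeding the names on standard input, which fsexamc reads when no operand is given, avoids the problem.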
It is highly recommended to run this utility in a UTF-8 locale. Otherwise, you may see illegible or garbled characters. Because fsexamc has system pre-defined lists of the most popular encodings for every language, and a UTF-8 locale offers the best multiscript capability, the utility works most smoothly when run in the UTF-8 locale of your language.
As shown in the NOTES section of the tar(1) man page, if an archive is created that contains files whose names were created by processes running in multiple or different locales, a locale that uses a full 8-bit coding space, i.e., 0x0 to 0xff, such as en_US.ISO8859-1 should be used both to create the archive and to extract files from the archive. Due to that, when you specify the COMPRESS module with the -E option, fsexamc tries to use the en_US.ISO8859-1, fr_FR.ISO8859-1, de_DE.ISO8859-1, es_ES.ISO8859-1, it_IT.ISO8859-1, or sv_SE.ISO8859-1 locales. If there is no such locale in the current system, use of the -E option is ignored and a warning message is issued.
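Whether one of these locales is installed can be checked before relying on the -E option. The sketch below filters the output of "locale -a"; the helper name list_latin1_locales is hypothetical, and the pattern is an assumption that also accepts the spelling "iso88591" used on some systems.

```shell
# list_latin1_locales: read locale names on standard input (one per line,
# as printed by "locale -a") and print those that match the list above.
# Hypothetical helper for illustration.
list_latin1_locales() {
    grep -E -i '^(en_US|fr_FR|de_DE|es_ES|it_IT|sv_SE)\.iso-?8859-?1$'
}

locale -a 2>/dev/null | list_latin1_locales ||
    echo "no suitable ISO8859-1 locale found"
```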