UTFCast Pro User Manual

What is a warning

You may often see warnings when detecting or converting files. Warnings are not errors. A warning is raised when the detection engine is not sure what type of file it is looking at or which codepage the file uses.

Codepage detection is based on data statistics, so no detection technology can be 100% sure of its result. UTFCast Professional runs four different detection engines together to make the result as reliable as possible. However, if the file is not large enough, or the text in the file is too short, it is very hard to determine what kind of data the file contains. In this case, a warning occurs.

Files in warning status are not converted. If you have chosen to copy unconverted files, they are copied to the output directory without conversion. You can use the preview panel to verify whether the detection result is correct, then right-click the file and use the "Accept result" function to ignore the warning and convert the file with the detected codepage. Alternatively, use the "Make correction" function to specify a different codepage for the conversion.

File name filters

Wildcard filter

A wildcard is a symbol that represents an unknown character or a set of characters. UTFCast Professional supports two wildcard symbols: the asterisk (*), which stands for any number of unknown characters, and the question mark (?), which stands for exactly one unknown character.

You can mix and match the asterisk (*) and the question mark (?), and combine multiple wildcard strings with the semicolon (;). If a file name does not match any of the wildcard strings you provide, the file is ignored.

Examples

Given the below string, only the file names starting with the character W and ending with the .TXT extension will be picked:

w*.txt

Given the below string, only the file names with two characters and with the .PHP extension will be picked:

??.php

Given the below string, only the file names with two or three characters and any extension will be picked:

??.*; ???.*
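The matching rules above can be sketched in a few lines of Python. This is only an illustration, not UTFCast's own implementation: the function names are hypothetical, and the sketch assumes that matching is case-insensitive and that the whole file name must match.

```python
import re

def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate one wildcard string: * is any run of characters, ? is exactly one."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # Assumption: file names are compared without regard to case.
    return re.compile("".join(parts), re.IGNORECASE)

def matches_filter(file_name: str, filter_string: str) -> bool:
    """Return True if file_name matches any of the semicolon-separated wildcard strings."""
    patterns = [p.strip() for p in filter_string.split(";") if p.strip()]
    return any(wildcard_to_regex(p).fullmatch(file_name) for p in patterns)

for name, wildcard in [
    ("Welcome.TXT", "w*.txt"),
    ("ab.php", "??.php"),
    ("abc.ini", "??.*; ???.*"),
    ("abcd.log", "??.*; ???.*"),
]:
    print(name, wildcard, matches_filter(name, wildcard))
```

The last file name has four characters before the extension, so neither wildcard string in the combined filter matches it.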

Regular expression filter

UTFCast Professional supports ECMAScript (ECMA-262) regular expressions. For the full specification of the ECMAScript standard, please refer to the Ecma website or download the specification document from http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf.

Example

Given the below regular expression, the numeric file names with a .TXT extension will be picked:

\d+\.txt
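If you want to try an expression before using it as a filter, a quick check such as the following may help. It uses Python's re module, which is not an ECMAScript engine but treats this simple expression the same way. The sketch assumes that the whole file name must match and that case is ignored; neither assumption is confirmed by this manual.

```python
import re

# re.ASCII limits \d to the digits 0-9, as in ECMAScript.
# Assumptions: the whole name must match, and matching ignores case.
NUMERIC_TXT = re.compile(r"\d+\.txt", re.IGNORECASE | re.ASCII)

def is_picked(file_name: str) -> bool:
    """Return True if the file name consists of digits followed by .txt."""
    return NUMERIC_TXT.fullmatch(file_name) is not None

for name in ["2024.txt", "007.TXT", "notes.txt", "12a.txt", "123.txt.bak"]:
    print(name, is_picked(name))
```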

Settings

UTFCast Professional works well with the default settings in most cases. However, you may notice performance differences when converting very large files or a large number of small files. The settings allow you to fine-tune the converter parameters for your needs and maximize performance on your system. They also allow you to change the default behavior of Instant Conversion and the default parameters in the Custom Conversion dialog.

Default Settings

Changing the below settings affects the behavior of Instant Conversion and the default settings of Custom Conversion.

Copy unconverted files

Instant Conversion does not copy unconverted files to the output directory by default. To change this behavior, change this setting. It also affects the Copy Unconverted setting in the Custom Conversion function.

Write BOM to converted files

If you would like Instant Conversion to write a BOM to output files, turn this setting on. It also affects the BOM setting in the Custom Conversion function.
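As background, a BOM (byte order mark) is a short byte sequence placed at the very start of a file to identify its Unicode encoding. The Python sketch below shows the standard BOM bytes for the main Unicode output encodings and what a file looks like with and without one. The dictionary keys reuse the encoding names from this manual, and the function name is hypothetical.

```python
import codecs

# Standard byte order marks for each output encoding.
BOMS = {
    "UTF8": codecs.BOM_UTF8,          # EF BB BF
    "UTF16LE": codecs.BOM_UTF16_LE,   # FF FE
    "UTF16BE": codecs.BOM_UTF16_BE,   # FE FF
    "UTF32LE": codecs.BOM_UTF32_LE,   # FF FE 00 00
    "UTF32BE": codecs.BOM_UTF32_BE,   # 00 00 FE FF
}

# Python codec names that encode without adding a BOM on their own.
PYTHON_CODECS = {
    "UTF8": "utf-8",
    "UTF16LE": "utf-16-le",
    "UTF16BE": "utf-16-be",
    "UTF32LE": "utf-32-le",
    "UTF32BE": "utf-32-be",
}

def encode_text(text: str, enc: str, write_bom: bool) -> bytes:
    """Encode text and optionally put the matching BOM in front of it."""
    body = text.encode(PYTHON_CODECS[enc])
    return (BOMS[enc] if write_bom else b"") + body

print(encode_text("A", "UTF8", True).hex())      # BOM followed by the letter A
print(encode_text("A", "UTF16LE", False).hex())  # the letter A only
```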

Process hidden files

Hidden files and hidden folders in source folder are ignored when this option is turned off.

Output encoding

Specify output encoding for Instant Conversion, In-place Conversion and the default setting for Custom Conversion.

Return type

Specify return type for Instant Conversion, In-place Conversion and the default setting for Custom Conversion.

Advanced Settings

Converter Threads

The Converter Threads setting is auto-detected by default. Its value is based on the number of logical CPUs in your system. Decreasing this value results in lower CPU usage when converting files. However, increasing it does not always mean better performance. The best performance depends on the overall performance of your system, especially hard disk and memory performance. Leaving this setting at its default value suits most cases.

Chunk Size

A chunk is a block of memory that UTFCast uses to hold the content of a file. A smaller chunk size results in more hard disk input/output operations per file, but faster memory operations and lower memory usage. A larger chunk size has the opposite effect. However, using a very large chunk to process a small file wastes time allocating unnecessary memory. If you regularly convert large files, it is better to increase the Chunk Size. For example, if you need to convert 1000 files of 10 MB each, you can set the Chunk Size to its maximum because it reduces hard disk I/O operations and increases performance.
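The trade-off can be estimated with simple arithmetic: each file needs roughly one read operation per chunk. The sketch below works through the 1000 files of 10 MB example. The three chunk sizes are hypothetical values chosen for illustration, not the actual range offered by the setting, and real disk activity also depends on caching by the operating system.

```python
KB = 1024
MB = 1024 * KB

def read_operations(file_size: int, chunk_size: int) -> int:
    """Number of chunk-sized reads needed to cover a file (rounded up)."""
    return (file_size + chunk_size - 1) // chunk_size

file_size = 10 * MB
file_count = 1000

# Hypothetical chunk sizes, for illustration only.
for chunk_size in (64 * KB, 1 * MB, 16 * MB):
    per_file = read_operations(file_size, chunk_size)
    print(f"chunk {chunk_size // KB:>6} KB: {per_file:>4} reads per file, "
          f"{per_file * file_count:>7} reads in total")
```

With the smallest of these chunk sizes each file needs 160 reads, while a chunk at least as large as the file needs only one.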

Sample Size

The Sample Size is only used for codepage detection. A larger sample size gives a more precise detection result, but also uses more memory and slows down detection. This setting does not affect files that are smaller than the sample size.

Use codepage GB18030 instead of GB2312

See the section: GB18030 Support

Binary List Acceleration

The binary list is a list of file extensions. Files such as .exe, .rar, .zip and .pdf files are well-known binary file types. Detecting these files takes a long time, and the detection result is imprecise. Enabling this setting skips detection for such files to increase detection speed. If you have text files that are always ignored by the detection engine, turn this option off. Otherwise, keep it on to increase detection speed.

Use Consolidated Buffer

Consolidated Buffer is an improved memory management method available since version 1.5. It reduces memory fragmentation and provides faster memory access, and thus slightly increases conversion performance, especially when converting a huge number of files. A larger buffer size provides higher memory performance but uses more memory. This option is enabled by default.

Codepage Reference

GB18030 Support

GB18030 is a separate standard used in the People's Republic of China for encoding Chinese characters. It supersedes GB2312. In GB18030, a character can occupy 1, 2, or 4 bytes. GB18030 support is turned on by default.
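The three byte lengths are easy to observe with any GB18030 encoder. The sketch below uses Python's built-in gb18030 codec, not UTFCast itself, and the three sample characters are chosen only for illustration.

```python
def byte_length(character: str) -> int:
    """Number of bytes one character occupies in GB18030."""
    return len(character.encode("gb18030"))

# An ASCII letter, a common Chinese character, and a rare supplementary-plane character.
for character in ["A", "\u4e2d", "\U00020000"]:
    encoded = character.encode("gb18030")
    print(f"U+{ord(character):04X}: {len(encoded)} byte(s), hex {encoded.hex()}")

# The older GB2312 standard cannot represent the third character at all.
try:
    "\U00020000".encode("gb2312")
except UnicodeEncodeError:
    print("U+20000 is not encodable in GB2312")
```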

Supported Codepages

UTFCast Professional supports detecting and reading the below input codepages:

  • US-ASCII
  • Big5
  • EUC-JP (EUC 20932 subset)
  • EUC-KR
  • EUC-TW
  • GB18030
  • GB2312
  • HZ-GB2312
  • IBM855
  • IBM866
  • ISO-2022-CN
  • ISO-2022-JP / JIS
  • ISO-2022-KR
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-7
  • ISO-8859-8
  • KOI8-R
  • Shift-JIS
  • UCS-4-2143
  • UCS-4-3412
  • UTF-16 Big Endian
  • UTF-16 Little Endian
  • UTF-32 Big Endian
  • UTF-32 Little Endian
  • UTF-8
  • Windows-1250 / ISO-8859-1 / Latin-2
  • Windows-1251
  • Windows-1252 / ANSI / Latin-1
  • Windows-1253
  • Windows-1255
  • Windows-874 / TIS 620
  • MAC-Cyrillic / x-mac-cyrillic

Codepage Identifiers

A codepage identifier is a number that tells UTFCast Pro which codepage you are referring to. It is usually used with the /cp command line switch. For example:

UTFCastPro.exe /in:"C:\My Files" /out:"D:\My Output" /cp:1252

The below table shows the complete list of supported codepage identifiers.

Codepage Name | Codepage Identifier
BIG5 | 950
EUC-JP (EUC 20932 subset) | 20932
EUC-KR | 51949
EUC-TW | 51950
GB2312 | 936
GB18030 | 54936
HZ-GB2312 | 52936
IBM855 | 855
IBM866 | 866
ISO-2022-JP / JIS | 50222
ISO-2022-KR | 50225
ISO-2022-CN | 50227
ISO-8859-2 | 28592
ISO-8859-5 | 28595
ISO-8859-7 | 28597
ISO-8859-8 | 28598
KOI8-R | 20866
MAC-Cyrillic / x-mac-cyrillic | 10007
Shift-JIS | 932
UCS-4-3412 | 3412
UCS-4-2143 | 2143
UTF-8 | 65001
UTF-16LE | 1200
UTF-16BE | 1201
UTF-32LE | 12000
UTF-32BE | 12001
Windows-874 / TIS 620 | 874
Windows-1250 / ISO-8859-1 / Latin-2 | 1250
Windows-1251 | 1251
Windows-1252 / ANSI / Latin-1 | 1252
Windows-1253 | 1253
Windows-1255 | 1255
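To see why choosing the right identifier matters, consider what a manually specified decoder does: the same source bytes produce different text depending on the codepage used to read them. The Python sketch below illustrates the idea with a handful of identifiers. The mapping to Python codec names and the function name are assumptions made for this example and are not part of UTFCast.

```python
# Hypothetical mapping from a few codepage identifiers to Python codec names.
PYTHON_CODEC_BY_ID = {
    1251: "cp1251",
    1252: "cp1252",
    20866: "koi8_r",
    28592: "iso8859_2",
    54936: "gb18030",
    65001: "utf-8",
}

def convert_to_utf8(raw: bytes, codepage_id: int) -> bytes:
    """Decode raw bytes with the chosen codepage, then re-encode them as UTF-8."""
    text = raw.decode(PYTHON_CODEC_BY_ID[codepage_id])
    return text.encode("utf-8")

raw = b"caf\xe9"
# Read as Windows-1252, byte E9 is an accented Latin letter.
print(convert_to_utf8(raw, 1252).decode("utf-8"))
# Read as Windows-1251, the same byte is a Cyrillic letter.
print(convert_to_utf8(raw, 1251).decode("utf-8"))
```

Both conversions succeed without error, which is why a wrong identifier silently produces wrong text rather than a failure.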

Supported output Return-Types (Also known as CR/LF Style)

  • No change
  • Force CRLF (Windows Style)
  • Force CR Only (Macintosh Style)
  • Force LF Only (Unix/Linux Style)
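What each return type does to a file with mixed line endings can be sketched as follows. This is a minimal Python illustration rather than UTFCast's implementation. The keys match the /rt arguments listed in the Command Line Reference, and the function name is hypothetical.

```python
import re

# Target line ending for each return type.
LINE_ENDINGS = {"CRLF": "\r\n", "CR": "\r", "LF": "\n"}

def set_return_type(text: str, return_type: str) -> str:
    """Rewrite every line ending in text to the requested style."""
    key = return_type.upper()
    if key == "NOCHANGE":
        return text
    target = LINE_ENDINGS[key]
    # Match CRLF first so that it is treated as one line ending, not two.
    return re.sub(r"\r\n|\r|\n", lambda match: target, text)

mixed = "one\r\ntwo\rthree\nfour"
for return_type in ("NOCHANGE", "CRLF", "CR", "LF"):
    print(return_type, repr(set_return_type(mixed, return_type)))
```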

Command Line Reference

Command Line Modes

UTFCast Pro supports two command line modes: the Windows Application Command Line Mode and the Console Command Line Mode.

The Windows Application Command Line Mode can run in GUI mode which shows the current status of a session, and can interact with a user to pause or stop the session. It can also run in Quiet mode which does not generate an interface for a user to control the session, and can perfectly work with a System Service in the background.

The Console Command Line Mode provides a text-only interface that runs in a console window, such as a Windows Command Prompt window or a PowerShell window.

Most features provided in both modes are the same, and in many cases they can be used instead of each other. However, they run in different contexts and may provide different conveniences in different cases.

Console Command Line Mode

UTFCast Pro's Console Command Line Mode is provided by a separate application. You can find UTFCastCon.exe in UTFCast Pro's install folder. To use this mode, open a Windows Command Prompt window or a PowerShell window first, and run UTFCastCon.exe inside that window. UTFCast Pro's installer automatically adds its install folder to your system's PATH environment variable, so you don't need to type the full path to the install folder when running the console command line mode. If for any reason the PATH environment variable does not contain UTFCast Pro's install folder, it is recommended that you add it manually for convenience.

Command Line Syntax

The command line syntax for UTFCast Pro is:

UTFCastPro.exe /switch:argument /switch /switch:"argument contains space characters"

The command line syntax for UTFCast Console is:

UTFCastCon.exe /switch:argument /switch /switch:"argument contains space characters"
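If you generate command lines from a script, the syntax above is easy to assemble. The Python sketch below builds a command line string from a list of switches and wraps an argument in double quotes whenever it contains a space. The helper names are hypothetical, and the sketch only builds the string without running UTFCast.

```python
from typing import List, Optional, Tuple

def format_switch(name: str, argument: Optional[str] = None) -> str:
    """Format one switch as /name or /name:argument, quoting arguments with spaces."""
    if argument is None:
        return f"/{name}"
    if " " in argument:
        argument = f'"{argument}"'
    return f"/{name}:{argument}"

def build_command_line(executable: str,
                       switches: List[Tuple[str, Optional[str]]]) -> str:
    """Join the executable name and all formatted switches with single spaces."""
    return " ".join([executable] + [format_switch(n, a) for n, a in switches])

command = build_command_line("UTFCastCon.exe", [
    ("in", r"C:\My Files"),
    ("out", r"D:\MyOutput"),
    ("r", None),
    ("enc", "utf8"),
])
print(command)
```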

Switches And Arguments

The below table lists all switches and their available arguments. Switches and arguments are case insensitive. If an argument contains at least one space character, the argument must be in a pair of double quotes.

Switch | Argument | Description | Comment
/in | "A path to a folder or a file" | Specify which folder or file to input |
/out | "A path to a folder or a file" | Specify which folder or file to output | In DIR mode, if this switch is not present, a sibling folder name will be generated, for example a folder named Source_Folder (Converted). In FILE mode, this switch must be present. If any part of the output path does not exist, a corresponding folder will be created. If the /d switch is present, this switch is ignored.
/r | | Recursive conversion |
/c | | Copy unconverted files |
/h | | Process hidden files | If this switch is not specified, hidden files and hidden folders in the source folder are ignored.
/quiet | | Quiet mode | Suppress all messages and user interactions. To record errors, detection results or conversion results in quiet mode, use it in combination with the /logfile and /export switches.
/mode | DIR | The source is a folder | The output must be a folder. If this switch is not present, DIR mode is assumed.
 | FILE | The source is a file | The output must be a file.
 | BACHUITE | The source is a Bachuite file |
/enc | UTF8 | Convert files to UTF-8 | UTF-8 is assumed if this switch is not present.
 | UTF16 | Equivalent to UTF16LE |
 | UTF16LE | Convert files to UTF-16 Little Endian |
 | UTF16BE | Convert files to UTF-16 Big Endian |
 | UTF32 | Equivalent to UTF32LE |
 | UTF32LE | Convert files to UTF-32 Little Endian |
 | UTF32BE | Convert files to UTF-32 Big Endian |
 | 2143 | Convert files to UCS-4-2143 |
 | 3412 | Convert files to UCS-4-3412 |
/bom | YES | Write a BOM to a converted file | A BOM will be written if this switch is not present.
 | NO | Do not write a BOM to a converted file |
/rt | CR | Set return type to CR (Macintosh) | Return type will not be changed if this switch is not present.
 | LF | Set return type to LF (Unix) |
 | CRLF | Set return type to CRLF (Windows) |
 | NOCHANGE | Do not change return type |
/wf | "A wildcard string" | Apply wildcard filter | If both /wf and /rf are present, /wf is used unless its value is empty.
/rf | "A regular expression" | Apply regular expression filter |
/cp | A codepage identifier | Skip auto-detection and manually specify the codepage decoder | If the source file is a Unicode text file with a BOM, the codepage identifier is ignored. Refer to the Codepage Identifiers section for the full list of available identifiers.
/logfile | "A path to a log file" | Write debug messages to the specified file | If the log file is in a system folder or a folder that needs additional privileges to access, make sure UTFCast Pro is running with the required privileges, or run UTFCast Pro as administrator. Otherwise logging will fail.
/d | | Detection only | Detect the file (in FILE mode) or directory (in DIR mode) provided with the /in switch. If this switch is present, the /out switch is ignored. To record the detection result, use it in combination with the /export switch. The Console Command Line Mode also outputs the detection result to the console window.
/export | "Path to a CSV file" | Export detection or conversion result to a CSV file | By default, the exported result file is in UTF-8. To specify a different encoding, use the /exportenc switch.
/exportenc | UTF8 | Encode the exported result file in UTF-8 | UTF-8 is assumed if this switch is not present.
 | UTF16 | Equivalent to UTF16LE |
 | UTF16LE | Encode the exported result file in UTF-16 Little Endian |
 | UTF16BE | Encode the exported result file in UTF-16 Big Endian |
 | UTF32 | Equivalent to UTF32LE |
 | UTF32LE | Encode the exported result file in UTF-32 Little Endian |
 | UTF32BE | Encode the exported result file in UTF-32 Big Endian |
/exportbom | YES | Add a BOM to the exported result file | If this switch is not present, a BOM will be added.
 | NO | Do not add a BOM to the exported result file |
/resetlayout | | Reset the GUI layout data to its initial state | If you cannot reach some GUI elements after a change of monitor resolution, or have problems after installing a different version of UTFCast Pro, try resetting the layout.
/cmdfile | "A path to a command line file" | Read the command line from a text file instead | Windows has a 260-character path length limit. To pass a very long command line to UTFCast Pro, you can use a Command Line File. See the details in Using Command Line File.
/ver | | Show version info | If this switch is present, other switches are ignored.
/register | | Prompt to enter license information | Use this switch to change license information without showing the main GUI.
/clearlicense | | Delete license information from the system | If you need to move your license to a new PC, you should clear your license information from the old one.

Using Command Line File

Windows has a 260-character path length limit. That means you cannot access a file or a directory, or run a command line, whose total length is longer than 260 characters. UTFCast Professional provides various command line switches and arguments to control the command line mode, and some of them accept a path to a file or a directory. If you combine multiple switches, your command line may exceed 260 characters. In addition, although UTFCast Professional itself can access paths longer than 260 characters, Windows does not allow you to pass such a long path as a command line argument.

The Command Line File was introduced in UTFCast Professional version 2.8. It is simply a text file whose content is the command line switches and arguments. Because a text file can store very long text, UTFCast Professional can read a command line of up to 32768 characters from it.

Using a Command Line File is easy. Store all of your command line switches and arguments in the first line of the text file, and pass the /cmdfile switch with an argument pointing to that file. UTFCast Professional will do the rest. For example:

UTFCastPro.exe /cmdfile:"C:\My UTFCast Command Line.txt"

Your C:\My UTFCast Command Line.txt can then contain a full command line in its first line, as in the example below (note that the keyword UTFCastPro.exe or UTFCastCon.exe must not be in the command line file):

/in:"C:\A Very Very Very Long Long Long Path That Causes The Command Line to Exceed 260 characters\Input.txt" /out:"C:\Another Very Very Very Long Long Long Path That Causes The Command Line to Exceed 260 characters\Output.txt" /mode:file /bom:yes /enc:utf8 /export:"D:\My UTFCast Logs\Today.log"
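If you create Command Line Files from a script, it may be worth checking the rules above before writing the file. The Python sketch below is one possible way to do that. The function name is hypothetical, the file is written to a temporary folder only for demonstration, and the use of UTF-8 for the file is an assumption that this manual does not confirm.

```python
import os
import tempfile

MAX_COMMAND_LINE_CHARS = 32768  # upper limit stated in this manual
EXECUTABLE_NAMES = ("utfcastpro.exe", "utfcastcon.exe")

def write_command_line_file(path: str, switches: str) -> None:
    """Validate a switch string and store it as the first line of a text file."""
    if switches.lstrip().lower().startswith(EXECUTABLE_NAMES):
        raise ValueError("leave the executable name out of the command line file")
    if "\r" in switches or "\n" in switches:
        raise ValueError("all switches must be on the first line")
    if len(switches) > MAX_COMMAND_LINE_CHARS:
        raise ValueError("command line is longer than 32768 characters")
    # Assumption: UTF-8 is an acceptable encoding for the command line file.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(switches)

long_name = "Long " * 60 + "Path"
switches = f'/in:"C:\\{long_name}\\Input.txt" /out:"D:\\Output.txt" /mode:file /enc:utf8'

with tempfile.TemporaryDirectory() as folder:
    cmdfile = os.path.join(folder, "My UTFCast Command Line.txt")
    write_command_line_file(cmdfile, switches)
    print(len(switches), "characters written")
    print(f'UTFCastPro.exe /cmdfile:"{cmdfile}"')
```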

Command Line Examples

To convert every text file in C:\MyFolder, including files in subfolders but excluding hidden files and hidden folders, and save the converted files to D:\MyOutput as UTF-16BE without a BOM, the command line is:

UTFCastPro.exe /in:"C:\MyFolder" /out:"D:\MyOutput" /r /enc:utf16be /bom:no

To convert the file C:\MyFile.txt to D:\MyConvertedFile.txt as UTF-8 with a BOM, skipping auto-detection and manually specifying the Windows-1252 decoder to read the source file, the command line is:

UTFCastPro.exe /in:"C:\MyFile.txt" /out:"D:\MyConvertedFile.txt" /enc:utf8 /bom:yes /mode:file /cp:1252

Logging

UTFCastPro.exe /in:"C:\MyFiles" /out:"D:\MyConvertedFiles" /enc:utf8 /bom:yes /mode:file /cp:1252 /logfile:"D:\UTFCastPro.log"

Bachuite Reference

Merging multiple tasks

Let's start with an example. The below command line converts your files to another folder:

UTFCastPro.exe /in:"D:\My Files" /out:"D:\My Output" /enc:utf8 /rt:crlf /bom:YES

With Bachuite, instead, you use XML elements to describe the same command, only the command line arguments turn into XML attributes:

<dir in="D:\My Files" out="D:\My Output" enc="utf8" rt="crlf" bom="yes"/>

If you want to convert multiple sibling folders and some single files with the command line, you'll need to run the command line multiple times, one time for each task:

          
UTFCastPro.exe /in:"D:\My Files A" /out:"D:\My Output A" /enc:UTF8 /bom:YES /rt:CRLF
UTFCastPro.exe /in:"D:\My Files B" /out:"D:\My Output B" /enc:UTF8 /bom:NO /rt:CRLF
UTFCastPro.exe /in:"D:\Single File A.txt" /out:"D:\Single File Output A.txt" /enc:UTF16LE /mode:FILE /rt:CRLF /bom:YES
UTFCastPro.exe /in:"D:\Single File B.txt" /out:"D:\Single File Output B.txt" /enc:UTF16LE /mode:FILE /rt:CRLF /bom:NO
UTFCastPro.exe /in:"D:\Single File C.txt" /out:"D:\Single File Output C.txt" /enc:UTF16LE /mode:FILE /rt:CRLF /bom:YES
          
        

With Bachuite, you can instead wrap the command lines in Bachuite XML and get the job done by running UTFCast only once:

          
<dir in="D:\My Files A" out="D:\My Output A" enc="utf8" bom="yes" rt="crlf"/>
<dir in="D:\My Files B" out="D:\My Output B" enc="utf8" bom="no" rt="crlf"/>
<file in="D:\Single File A.txt" out="D:\Single File Output A.txt" enc="utf16le" rt="crlf" bom="yes" />
<file in="D:\Single File B.txt" out="D:\Single File Output B.txt" enc="utf16le" rt="crlf" bom="no" />
<file in="D:\Single File C.txt" out="D:\Single File Output C.txt" enc="utf16le" rt="crlf" bom="yes" />
          
        

In fact, simple wrapping is just one of the options. Bachuite can do multiple tasks with ease by using Sets.

Using Sets

A set is a group of elements that share the same predefined attributes. Here's an example of using Sets:

          
<set rt="crlf" bom="yes" enc="utf8">
 
  <!-- The below tasks inherit rt, bom and enc from the parent set -->
  <dir in="D:\My Files A" out="D:\My Output A" />
  <dir in="D:\My Files B" out="D:\My Output B" bom="no" />
 
    <!-- A child set inherits properties too -->
    <set enc="utf16le">
      <file in="D:\Single File A.txt" out="D:\Single File Output A.txt" />
      <file in="D:\Single File B.txt" out="D:\Single File Output B.txt" bom="no"/>
      <file in="D:\Single File C.txt" out="D:\Single File Output C.txt" />
    </set>
 
</set>
          
        

As you can see, you don't need to specify an attribute on a child element (a child set or a child task) if its value is identical to that of its parent. Attributes are inherited by default, but you can override them at any time.
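The inheritance rule can be made concrete with a short script that computes the effective attributes of every task in a Bachuite fragment. The Python sketch below reimplements the rule as described in this section and is not UTFCast's own parser. The function name is hypothetical, and the sketch handles only set, dir and file elements.

```python
import xml.etree.ElementTree as ET

BACHUITE = r'''
<bachuite version="1.0">
  <set rt="crlf" bom="yes" enc="utf8">
    <dir in="D:\My Files A" out="D:\My Output A" />
    <dir in="D:\My Files B" out="D:\My Output B" bom="no" />
    <set enc="utf16le">
      <file in="D:\Single File A.txt" out="D:\Single File Output A.txt" />
      <file in="D:\Single File B.txt" out="D:\Single File Output B.txt" bom="no" />
    </set>
  </set>
</bachuite>
'''

def effective_tasks(element, inherited=None):
    """Return (tag, attributes) for each task, with inherited values filled in."""
    inherited = dict(inherited or {})
    tasks = []
    for child in element:
        # Explicit attributes on the child override inherited ones.
        merged = {**inherited, **child.attrib}
        if child.tag == "set":
            tasks.extend(effective_tasks(child, merged))
        elif child.tag in ("dir", "file"):
            tasks.append((child.tag, merged))
    return tasks

for tag, attributes in effective_tasks(ET.fromstring(BACHUITE)):
    print(tag, attributes["in"], attributes["enc"], attributes["bom"], attributes["rt"])
```

The two file tasks pick up enc from the inner set but still take rt and bom from the outer set, unless they override a value themselves.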

Links can be used for reusing Bachuite files. Here's an example:

          
<link src="D:\MyBachuite1.xml" />
<link src="MyBachuite2.xml" enc="utf32be" bom="no" />
          
        

The src attribute must point to an existing Bachuite file. Otherwise, the whole Bachuite will not run.

Using Profiles

A profile is a set of predefined attributes that can be reused in other elements. A profile element is like a set element with a name, but it cannot have children. When a profile is defined, its scope covers its sibling elements and their children, and only elements within that scope can access it. If an element is assigned an existing profile, the element does not inherit any attribute from its own parent. Instead, it clones all attributes from the profile, including those the profile inherited from its parent. Bachuite applies all attributes of the profile to the element first, and then applies the attributes explicitly present on the element.

NOTE: Bachuite profile elements are similar to the setting profile feature in the GUI, but they are not the same feature. They are designed for and work in different environments. There is no way to load saved setting profiles in Bachuite, or Bachuite profiles in the GUI.

Here's an example of using profiles:

          
<set enc="utf8" in="D:\Text Files">
    <!-- A profile also inherits attributes from its parent, just like other elements. -->
    <!-- The below profiles also have the enc attribute set to "utf8" and the in attribute set to "D:\Text Files" even though these attributes are not explicitly present. -->
    <profile name="with_bom" bom="yes" rt="crlf" />
    <profile name="without_bom" bom="no" rt="crlf" />
 
    <!-- All elements below here and their children can use the two profiles defined above -->
    <dir out="D:\Output with bom" profile="with_bom" />
    <dir out="D:\Output without bom" profile="without_bom" />
 
    <!-- A profile can also be assigned to a set, a link, or even another profile -->
    <!-- Explicitly setting an attribute value (enc="utf16" in this example) overrides it -->
    <profile name="different_enc" enc="utf16" out="D:\Profile Out" profile="without_bom" />
 
    <set out="D:\New Output">
        <!-- Because a profile is assigned, this element does not inherit any attribute from its parent. The out attribute value "D:\Profile Out", which is copied from the profile, is used -->
        <dir profile="different_enc" />
    </set>
</set>
<set enc="utf16le">
    <!-- ERROR, the below element is out of the "with_bom" profile's scope -->
    <dir in="D:\Text Files" out="D:\Output with bom" profile="with_bom" />
</set>
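The order in which attributes are applied can be traced with plain dictionaries. The Python sketch below follows the rules described in this section for the different_enc profile in the example: a profile replaces inheritance from the parent, and explicit attributes are applied last. It is a simplified model with a hypothetical function name. It does not check profile scope, and the name attribute of a profile is left out because it is not a conversion setting.

```python
def resolve(parent_attributes, explicit_attributes, profiles):
    """Effective attributes of one element.

    profiles maps a profile name to that profile's already resolved attributes.
    """
    explicit = dict(explicit_attributes)
    profile_name = explicit.pop("profile", None)
    if profile_name is None:
        base = dict(parent_attributes)        # normal inheritance from the parent
    else:
        base = dict(profiles[profile_name])   # clone the profile instead
    base.update(explicit)                     # explicit attributes win
    return base

outer_set = {"enc": "utf8", "in": r"D:\Text Files"}
profiles = {}
profiles["with_bom"] = resolve(outer_set, {"bom": "yes", "rt": "crlf"}, profiles)
profiles["without_bom"] = resolve(outer_set, {"bom": "no", "rt": "crlf"}, profiles)
profiles["different_enc"] = resolve(
    outer_set,
    {"enc": "utf16", "out": r"D:\Profile Out", "profile": "without_bom"},
    profiles,
)

# The inner set has its own out value, but the task below ignores it.
inner_set = resolve(outer_set, {"out": r"D:\New Output"}, profiles)
task = resolve(inner_set, {"profile": "different_enc"}, profiles)
print(task)
```

The resulting task uses the out value from the profile, not the one from its parent set, which matches the comment in the example.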
          
        

Resolving absolute and relative paths

Any path in an element can be an absolute path like C:\MyFile.txt, or a relative path like MyFile.txt or ..\MyFile.txt. A relative path is resolved relative to the location of the current Bachuite file. For example:

In C:\First.xml:

          
<link src="D:\Second.xml" />
<link src="Third.xml" />
<file in="MyFile.txt" out="SubDir\MyFileOutput.txt" />
          
        

When linking to Third.xml in C:\First.xml, the path of Third.xml is resolved to C:\Third.xml.

The same applies to paths in other elements. In C:\First.xml, MyFile.txt and SubDir\MyFileOutput.txt are resolved to C:\MyFile.txt and C:\SubDir\MyFileOutput.txt.
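The resolution rule can be reproduced with Python's ntpath module, which applies Windows path rules on any operating system. This is a sketch of the rule as described here, with a hypothetical function name, and not UTFCast's own code.

```python
import ntpath

def resolve_path(bachuite_file: str, path: str) -> str:
    """Resolve a path found in a Bachuite file against that file's folder."""
    base_folder = ntpath.dirname(bachuite_file)
    # ntpath.join keeps an absolute path unchanged and ignores the base folder.
    return ntpath.normpath(ntpath.join(base_folder, path))

for path in ["D:\\Second.xml", "Third.xml", "MyFile.txt", "SubDir\\MyFileOutput.txt"]:
    print(path, "->", resolve_path("C:\\First.xml", path))
```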

Running Bachuite

The Bachuite XML must be saved as an XML file. Its content is nothing more than a normal XML file with the Bachuite root element and the Bachuite XML schema. For example, save the below XML to D:\MyBachuite.xml:

          
<?xml version="1.0" encoding="UTF-8"?>
<bachuite version="1.0">
 
  <!-- Your Bachuite XML goes here -->
 
</bachuite>
          
        

Run the Bachuite file using the below command line:

UTFCastPro.exe /in:"D:\MyBachuite.xml" /mode:bachuite
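If your list of tasks comes from a script or another tool, you can also generate the Bachuite file instead of writing it by hand. The Python sketch below builds the XML text with the standard library, using the root element and the dir and file elements shown in this manual. The function name is hypothetical, and saving the text to disk is left to the caller.

```python
import xml.etree.ElementTree as ET

def build_bachuite(tasks) -> str:
    """Build Bachuite XML text from (element name, attributes) pairs."""
    root = ET.Element("bachuite", {"version": "1.0"})
    for tag, attributes in tasks:
        ET.SubElement(root, tag, attributes)
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

xml_text = build_bachuite([
    ("dir", {"in": r"D:\My Files A", "out": r"D:\My Output A",
             "enc": "utf8", "bom": "yes", "rt": "crlf"}),
    ("file", {"in": r"D:\Single File A.txt", "out": r"D:\Single File Output A.txt",
              "enc": "utf16le", "bom": "no", "rt": "crlf"}),
])
print(xml_text)
```

Save the resulting text as a UTF-8 encoded file, for example D:\MyBachuite.xml, and run it with /mode:bachuite as shown above.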

Bachuite Attributes

All available attributes are listed in the table below. Element and attribute names are case sensitive, but attribute values are case insensitive.

Supported Elements | Attribute | Value | Description | Comment
set, file, dir, link, profile | in | A path to a file or a folder | Specify which folder or file to input |
 | out | A path to a file or a folder | Specify which folder or file to output | In a dir element, if this attribute is not present, or its value is empty, a sibling folder name will be generated, for example a folder named Source_Folder (Converted). In a file element, this attribute must have a value. If any part of the output path does not exist, a corresponding folder will be created.
 | r | YES | Recursive conversion. | NO is assumed if the attribute is not present.
 | | NO | Non-recursive conversion. |
 | c | YES | Copy unconverted files. | NO is assumed if the attribute is not present.
 | | NO | Ignore unconverted files. |
 | h | YES | Process hidden files. | NO is assumed if the attribute is not present.
 | | NO | Do not process hidden files. |
 | enc | UTF8 | Convert to UTF-8. | UTF8 is assumed if the attribute is not present.
 | | UTF16LE | Convert to UTF-16 Little Endian. |
 | | UTF16BE | Convert to UTF-16 Big Endian. |
 | | UTF32LE | Convert to UTF-32 Little Endian. |
 | | UTF32BE | Convert to UTF-32 Big Endian. |
 | | 2143 | Convert to UCS-4-2143. |
 | | 3412 | Convert to UCS-4-3412. |
 | bom | YES | Write a BOM to a converted file. | YES is assumed if the attribute is not present.
 | | NO | Do not write a BOM to a converted file. |
 | rt | CR | Set return type to CR (Macintosh) | NOCHANGE is assumed if the attribute is not present.
 | | LF | Set return type to LF (Unix) |
 | | CRLF | Set return type to CRLF (Windows) |
 | | NOCHANGE | Do not change return type |
 | cp | A codepage identifier | Skip auto-detection and manually specify the codepage decoder | If the source file is a Unicode text file with a BOM, the codepage identifier is ignored. Refer to the Codepage Identifiers section for the full list of available identifiers.
 | wf | A wildcard string | Apply wildcard filter | If both wf and rf are present, wf is used unless its value is empty. To disable filters, either set both values to empty, or do not provide either of them in the element or in any parent element.
 | rf | A regular expression | Apply regular expression filter |
 | profile | A profile name | Assign an attribute profile to the element | Assigning an attribute profile to an element clones all attribute values (both implicit and explicit) from the profile. Only profiles defined in the same scope or a parent scope can be accessed by the current element.
link | src | A path to a Bachuite file | Link to an external Bachuite file | Linking to a non-existent Bachuite file makes the whole Bachuite refuse to run.
profile | name | A unique element name | Define a profile | The profile can be accessed by all sibling elements and their children.