This is the Byte Order Mark (BOM) in some encoding schemes. It is usually added by mistake or by misbehaving software. You can remove this special character using dos2unix filename.txt. This command modifies the file in place.
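If dos2unix is not available, the same in-place removal can be sketched in Python. This is a minimal sketch that only handles the UTF-8 BOM; the function name strip_utf8_bom is made up for illustration.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file in place.

    Returns True if a BOM was found and removed, False otherwise.
    (Hypothetical helper; it does not handle UTF-16/UTF-32 BOMs.)
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        # Rewrite the file without the three BOM bytes EF BB BF.
        p.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False
```

Like dos2unix, this overwrites the file, so keep a copy if you need the original bytes.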
Read more here
iconv is a convenient command-line tool for converting a file from one encoding to another. Suppose you have a file.txt that is GBK encoded and you want a UTF-8 version. You can use the following command.
iconv -f gbk -t utf-8 file.txt
The converted result is sent to standard output by default. You can redirect stdout to save it in a file called output.txt.
iconv -f gbk -t utf-8 file.txt > output.txt
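For comparison, roughly the same conversion can be done in Python when iconv is not at hand. This is a sketch rather than a drop-in replacement; convert_encoding is a hypothetical helper name, and it reads the whole file into memory.

```python
from pathlib import Path


def convert_encoding(src, dst, from_enc="gbk", to_enc="utf-8"):
    """Decode src using from_enc and write the text to dst encoded as to_enc.

    Working on bytes avoids any newline translation by the platform.
    Raises UnicodeDecodeError if src is not valid from_enc data.
    """
    text = Path(src).read_bytes().decode(from_enc)
    Path(dst).write_bytes(text.encode(to_enc))
```

Calling convert_encoding("file.txt", "output.txt") mirrors the iconv command above.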
Suppose you have the original GBK-encoded files in orig/ and want to write UTF-8 encoded text files to utf-8/. The following command does the batch conversion.
ls -1 orig | xargs -I{} sh -c 'iconv -f gbk -t utf-8 orig/"{}" > utf-8/"{}"'
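The batch conversion can also be sketched in Python, which sidesteps shell quoting issues with unusual file names. The function name convert_directory is an assumption for illustration. It converts only the regular files directly under the source directory and creates the target directory if needed.

```python
from pathlib import Path


def convert_directory(src_dir, dst_dir, from_enc="gbk", to_enc="utf-8"):
    """Convert every regular file directly under src_dir into dst_dir.

    Each output file keeps the name of its source file.
    Returns the list of converted file names, sorted.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(src_dir.iterdir()):
        if not src.is_file():
            continue  # skip subdirectories and other non-file entries
        text = src.read_bytes().decode(from_enc)
        (dst_dir / src.name).write_bytes(text.encode(to_enc))
        converted.append(src.name)
    return converted
```

With the layout above, convert_directory("orig", "utf-8") would do the same job as the xargs pipeline.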
macOS users can install iconv via brew.
By default, the .py file you write uses the system encoding. On macOS and Linux, this has long been unified to utf-8. Windows systems are usually a bit trickier. Chinese versions of Windows may use a common encoding other than utf-8: gbk in mainland China; big5 in Hong Kong and Taiwan. That makes your script less portable to other systems, and you may run into problems, especially when your program relies heavily on string operations.
To declare an encoding other than the default one, add a special comment as the first line of the file (or the second line, if the first is a shebang). The syntax is as follows:
# coding={the codec name here}
# Remaining part of your Python script
For example, if you have Chinese characters in your code and your operating system is a Chinese version of Windows, it is better to declare the encoding by writing the following as the first line of your source file:
# coding=utf-8
or # -*- coding: utf-8 -*-
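To check which encoding Python will actually use for a source file, the standard library's tokenize.detect_encoding reads the declaration the same way the interpreter does. The wrapper below is a small sketch under that assumption; declared_encoding is a hypothetical name.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would use to read this source code.

    tokenize.detect_encoding looks for a BOM or a coding comment
    in the first two lines of the source.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


print(declared_encoding(b"# coding=gbk\nprint('hi')\n"))
print(declared_encoding(b"# -*- coding: utf-8 -*-\nprint('hi')\n"))
```

Both comment styles shown above are recognized. On Python 3, a file with no declaration and no BOM is reported as utf-8.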
Other situations and encodings may come up, and each has to be handled case by case. If you run into one that is not covered here, please feel free to open an issue.