2.17 Wide character support

SWI-Prolog supports wide characters: characters with character codes above 255, which cannot be represented in a single byte. The Universal Character Set (UCS) is defined by the ISO/IEC 10646 standard, which specifies a unique 31-bit unsigned integer for any character in any language. It is a superset of 16-bit UNICODE, which in turn is a superset of ISO 8859-1 (ISO Latin-1), itself a superset of US-ASCII. UCS can handle strings holding characters from multiple languages, and character classification (uppercase, lowercase, digit, etc.) and operations such as case conversion are unambiguously defined.

For this reason SWI-Prolog has two representations for atoms and string objects (see section 4.23). If the text fits in ISO Latin-1, it is represented as an array of 8-bit characters. Otherwise the text is represented as an array of 32-bit numbers. This representational issue is completely transparent to the Prolog user. Users of the foreign language interface described in section 9 sometimes need to be aware of these issues, though.

Character coding comes into view when characters or strings need to be read from or written to a file, or when they have to be communicated to other software components using the foreign language interface. In this section we only deal with I/O through streams, which includes file I/O as well as I/O through network sockets.

2.17.1 Wide character encodings on streams

Although characters are uniquely coded using the UCS standard internally, streams and files are byte (8-bit) oriented, and there are a variety of ways to represent the larger UCS codes in an 8-bit octet stream. The most popular one, especially in the context of the web, is UTF-8. Bytes 0 ... 127 simply represent the corresponding US-ASCII character, while bytes 128 ... 255 are used for multi-byte encoding of characters placed higher in the UCS space. The 16-bit UNICODE standard, represented by pairs of bytes, is also popular, especially on MS-Windows.
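
As a rough illustration of the UTF-8 scheme described above, the sketch below converts a list of character codes into the bytes that would appear on a UTF-8 stream. It assumes that library(utf8) and its DCG utf8_codes//1 are available in the running system; the helper name utf8_bytes/2 is made up for this example.

```prolog
:- use_module(library(utf8)).   % assumed available; provides utf8_codes//1

%!  utf8_bytes(+Codes, -Bytes)
%
%   Hypothetical helper: Bytes is the UTF-8 byte sequence that
%   encodes the list of character codes Codes.
utf8_bytes(Codes, Bytes) :-
    phrase(utf8_codes(Codes), Bytes).
```

For example, the goal utf8_bytes([0'A, 0xE9, 0x20AC], Bytes) binds Bytes to [65, 195, 169, 226, 130, 172]: the US-ASCII letter A stays a single byte below 128, while U+00E9 and U+20AC become sequences of two and three bytes, all in the range 128 ... 255.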

Prolog I/O streams have a property called encoding, which specifies the encoding in use and affects get_code/2 and put_code/2 as well as all other text I/O predicates.

The default encoding for files is derived from the Prolog flag encoding, which is initialised from the environment. If the environment variable LANG ends in "UTF-8", this encoding is assumed. Otherwise the default is ISO Latin-1. The encoding can be specified explicitly in load_files/2 for loading Prolog source with an alternative encoding, in open/4 when opening files, or using set_stream/2 on any open stream. For Prolog source files we also provide the encoding/1 directive, which can be used to switch between UTF-8 and ISO Latin-1 inside the file.
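
The two stream-level mechanisms mentioned above, the encoding option of open/4 and set_stream/2, might be used as in the following minimal sketch. The helper names save_text/3 and load_codes/3 are hypothetical, and the sketch assumes setup_call_cleanup/3 and read_stream_to_codes/2 (library(readutil)) are available.

```prolog
%!  save_text(+File, +Encoding, +Text)
%
%   Hypothetical helper: write Text to File, selecting the stream
%   encoding through the encoding option of open/4.
save_text(File, Encoding, Text) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(Encoding)]),
        write(Out, Text),
        close(Out)).

%!  load_codes(+File, +Encoding, -Codes)
%
%   Hypothetical helper: open File with the default encoding, switch
%   the open stream to Encoding using set_stream/2, and read all
%   character codes up to end of file.
load_codes(File, Encoding, Codes) :-
    setup_call_cleanup(
        open(File, read, In),
        ( set_stream(In, encoding(Encoding)),
          read_stream_to_codes(In, Codes)
        ),
        close(In)).
```

After save_text(File, utf8, 'caf\u00E9'), reading the file back with load_codes(File, utf8, Codes) should yield the four codes of the original atom, whereas load_codes(File, iso_latin_1, Codes) yields five codes, because the two bytes that encode U+00E9 are then returned untranslated.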

SWI-Prolog currently defines and supports the following encodings:

ascii
7-bit encoding in 8-bit bytes. Equivalent to iso_latin_1, but generates errors and warnings on encountering values above 127.

iso_latin_1
8-bit encoding supporting many western languages. This causes the stream to be read and written fully untranslated.

utf8
Multi-byte encoding of full UCS, compatible with ascii. See above.

unicode_be
UNICODE Big Endian. Reads input in pairs of bytes, most significant byte first. Can only represent 16-bit characters.

unicode_le
UNICODE Little Endian. Reads input in pairs of bytes, least significant byte first. Can only represent 16-bit characters.
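
To make the byte-order difference between unicode_be and unicode_le concrete, the sketch below writes a single character to a temporary file in a given encoding and reads the raw bytes back through an iso_latin_1 stream, which leaves them untranslated. The helper encoded_bytes/3 is hypothetical; the sketch assumes tmp_file/2, setup_call_cleanup/3, read_stream_to_codes/2 and the bom option of open/4 are available, and it passes bom(false) so that no byte order mark is written or interpreted.

```prolog
%!  encoded_bytes(+Encoding, +Code, -Bytes)
%
%   Hypothetical helper: Bytes is the list of raw bytes produced when
%   the single character Code is written to a file using Encoding.
encoded_bytes(Encoding, Code, Bytes) :-
    tmp_file(enc, File),
    % Write the character using the requested encoding.
    setup_call_cleanup(
        open(File, write, Out, [encoding(Encoding), bom(false)]),
        put_code(Out, Code),
        close(Out)),
    % Read the file back byte by byte, without translation.
    setup_call_cleanup(
        open(File, read, In, [encoding(iso_latin_1), bom(false)]),
        read_stream_to_codes(In, Bytes),
        close(In)),
    delete_file(File).
```

With this helper, encoded_bytes(unicode_be, 0x20AC, Bytes) gives [32, 172] (most significant byte 0x20 first), while encoded_bytes(unicode_le, 0x20AC, Bytes) gives [172, 32].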