Firebird Character Sets and Collations

Every CHAR or VARCHAR field can (or, better: must) have a character set assigned. Firebird uses this information to correctly store the bytes that make up the character string.

In order to be able to sort or compare strings, you also need to define a collation. A collation defines the sort ordering and uppercase conversions for a string.

Firebird is unable to transliterate between character sets. So you must set the correct values on the server and on the client if everything is to work smoothely.

An Example

In the German language there are the "Umlauts", special vowels with a double-dot (diaeresis) over them. A common last name in Germany is Müller. (If you don't have umlauts on your keyboard, you could also write "Mueller", but that's not what we want to discuss here ;-)

When you convert Müller to uppercase you get MÜLLER, so there is an uppercase Ü and a lowercase ü.

When you want to perform a lexicographic compare on the name, you have several options:

You can treat the Ü like a U (German "Duden" dictionary)
You can treat the Ü like UE (German telephone book)
You can treat the Ü like a special character, sorted in after Z (a common practice in Scandinavia)

Creating a Database

You can define the default character set for a new database in the CREATE DATABASE statement:

CREATE DATABASE <database>

  USER <username>

  PASSWORD <password>

  PAGE_SIZE <pagesize>

  DEFAULT CHARACTER SET <charset>

For example:

CREATE DATABASE localhost:meter

  USER SYSDBA

  PASSWORD masterkey

  PAGE_SIZE 4096

  DEFAULT CHARACTER SET ISO8859_1;

From now on, any VARCHAR or CHAR field will default to the ISO8859_1 character set. You can, however, specify a special character set for each column:

CREATE TABLE users (

  CZECH_NAME VARCHAR(50) CHARACTER SET ISO8859_2,

  ...

Collations

There is no default collation. So you should define a collation for every field that is to be used for sorting (ORDER BY) or comparing (UPPER):

CREATE TABLE users (

  NAME VARCHAR(50) COLLATE DE_DE,

  ...

COLLATE DE_DE means: "use a collation for the German language (the first DE), applying the rules from Germany (the second DE)"

You can also specify the collation with the ORDER BY clause:

ORDER BY LASTNAME COLLATE FR_CA, FIRSTNAME COLLATE FR_CA

or with the WHERE clause:

WHERE LASTNAME COLLATE FR_CA = :lastnametosearch

or when searching:

WHERE UPPER (LAST_NAME COLLATE SV_SV) = 'PAULSEN';

The UPPER() function

UPPER() only works correctly if there is a collation defined for the parameter field:

WHERE UPPER (NAME COLLATE DE_DE) = 'MÜLLER';

Specifying the client character set

ISQL

SET NAMES ISO8859_1;

InterBase Objects (Ibo) by Jason Wharton

The TIb_Connection class has a string property named CharSet. Assign it the name of the character set to use:

Ib_Connection1.CharSet := 'ISO8859_1';

InterBase Express (IBX), built into Delphi

The TIbDatabase class has a TStrings property named Params. Add a field with the name lc_ctype and specify the character set:

IbDatabase1.Params.Add ('lc_ctype=ISO8859_1');

PHP

In PHP you define the Client Character Set when you connect (or pconnect) to the database.

$db = ibase_connect ($Name, $Usr, $Pwd, "ISO8859_1");

Conversions

Conversions between character sets are always done as: CHARSET1 -> UNICODE -> CHARSET2

With NONE or OCTETS as the connection character set, the bytes are just copied: NONE/OCTETS -> CHARSET2 and CHARSET1 -> NONE/OCTETS.

Case insensitive searching

I have written a separate article about this.

Character Sets and Collations

Character Set	Languages	Collation	Comments
Generic Character Sets
NONE	All	NONE	No character set applied. With this character set setting, Firebird is unable to perform conversion operations like UPPER() correctly on anything other than the standard 26 latin letters.
OCTETS \| BINARY	All	OCTETS	Same as NONE. Cannot be used as client connection character set. Space character is #x00. Will be displayed as hex in ISQL 2.0.
ASCII	English	ASCII	English
Unicode based Character Sets
UNICODE_FSS	All	UNICODE_FSS	Unicode UTF-8. An old implementation that accepts malformed strings and does not enforce correct max. string length. All characters 3 bytes, no case mapping. Superseded in Firebird 2.0 with the UTF8 character set. Deprecated.
UTF8	All	UCS_BASIC	UCS_BASIC sorts in Unicode code-point order (Firebird 2.0)
		UNICODE	Sorts using the Unicode Collation Algorithm (UCA) (Firebird 2.0)
		UTF-8	Case insensitive collation (Firebird 2.1)
		UNICODE_CI_AI	Case insentive, Accent insensitive collation for Unicode (Firebird 2.5)
Current Character Sets
ISO8859_1	Western Europe	ISO8859_1	Latin-1
		DA_DA	Danish/Danmark
		DE_DE	German/Germany
		DU_NL	Dutch/The Netherlands
		EN_UK	English/United Kingdom
		EN_US	English/USA
		ES_ES	Spanish/Spain
		ES_ES_CI_AI	Spanish/Spain, case insensitive, accent insensitive (Firebird 2.0)
		FI_FI	Finnish/Finnland
		FR_CA	French/Canada
		FR_FR	French/France
		FR_FR_CI_AI	French/France, case insensitive, accent insensitive (Firebird 2.1)
		IS_IS	Icelandic/Iceland
		IT_IT	Italian/Italy
		NO_NO	Norwegian/Norway
		PT_PT	Portuguese/Portugal
		PT_BR	Portuguese/Brasil (Firebird 2.0). Case+Accent insensitive
		SV_SV	Swedish/Sweden
ISO8859_2	Central Europe	ISO8859_2	Central Europe
		CS_CZ	Czech
		ISO_HUN	Hungarian
		ISO_PLK	Polish (Firebird 2.0)
ISO8859_3	Southern Europe	ISO8859_3	Maltese, Esperanto
ISO8859_4	North European	ISO8859_4	Estonian, Latvian, Lithuanian, Greenlandic, Lappish
ISO8859_5	Cyrillic	ISO8859_5	Russian, Ukrainian
ISO8859_6	Arabic	ISO8859_6
ISO8859_7	Modern Greek	ISO8859_7
ISO8859_8	Hebrew	ISO8859_8
ISO8859_9	Turkish	ISO8859_9
ISO8859_13	Baltic	ISO8859_13	Baltic
ISO8859_13	Baltic	LT_LT	Lithuanian
WIN1250	Central Europe	WIN1250	Central Europe
		BS_BA	Bosnian (Firebird 2.0)
		WIN_CZ	Czech, case-insensitive (Firebird 2.0)
		WIN_CZ_CI_AI	Czech, case-insensitive, accent-insensitive (Firebird 2.0)
		PXW_CSY	Czech
		PXW_HUN	Hungarian
		PXW_HUNDC	Hungarian, Dictionary sort
		PXW_PLK	Polish
		PXW_SLOV	Slovanian
WIN1251	Cyrillic	WIN1251	Cyrillic
		WIN1251_UA	Ukrainian
		PXW_CYRL	Cyrillic, Paradox compatibility
WIN1252	Western Europe, America	WIN1252	Latin-1 with Windows extensions
		WIN_PTBR	Brasilian Portuguese (Firebird 2.0). Case+Accent insensitive
		PXW_INTL	Paradox ANSI International
		PXW_INTL850	Paradox Multi-Lingual Latin-1
		PXW_NORDAN4	Paradox Norwegian and Danish
		PXW_SPAN	Paradox Spanish
		PXW_SWEDFIN	Paradox Swedish, Finnish
WIN1253	Modern Greek	WIN1253
WIN1253	Modern Greek	PXW_GREEK	Paradox Greek
WIN1254	Turkish	WIN1254
WIN1254	Turkish	PXW_TURK	Paradox Turkish
WIN1255	Hebrew	WIN1255
WIN1256	Arabic	WIN1256
WIN1257	Baltic	WIN1257	Baltic
		WIN1257_LV	Latvian dictionary collation (Firebird 2.0)
		WIN1257_LT	Lithuanian dictionary collation (Firebird 2.0)
		WIN1257_EE	Estonian dictionary collation (Firebird 2.0)
WIN1258	Vietnamese		Vietnamese (Firebird 2.0)
MS-DOS, dBASE and Paradox compatibility
DOS437	Western Europe, America	DOS437	English/USA
		DB_DEU437	dBASE German
		DB_ESP437	dBASE Spanish
		DB_FIN437	dBASE Finnish
		DB_FRA437	dBASE French
		DB_ITA437	dBASE Italian
		DB_NLD437	dBASE Dutch
		DB_SVE437	dBASE Swedisch
		DB_UK437	dBASE English/UK
		DB_US437	dBASE English/US
		PDOX_ASCII	Paradox ASCII code page
		PDOX_INTL	Paradox International English code page
		PDOX_SWEDFIN	Paradox Swedish/Finnish code page
DOS737	Greek	DOS737	Greek
DOS775	Baltic	DOS775	Baltic
DOS850	Western Europe, America	DOS850	Latin-1 (without Euro € symbol)
		DB_DEU850	dBASE German
		DB_ESP850	dBASE Spanish
		DB_FRA850	dBASE French/France
		DB_FRC850	dBASE French/Canada
		DB_ITA850	dBASE Italian
		DB_NLD850	dBASE Dutch
		DB_PTB850	dBASE Portuguese/Brasil
		DB_SVE850	dBASE Swedish
		DB_UK850	dBASE English/UK
		DB_US850	dBASE English/USA
DOS852	Central Europe	DOS852	Latin-2 (Central Europe)
		DB_CSY	dBASE Czech
		DB_PLK	dBASE Polish
		DB_SLO	dBASE Slovakian
		PDOX_CSY	Paradox Czech
		PDOX_HUN	Paradox Hungarian
		PDOX_PLK	Paradox Polish
		PDOX_SLO	Paradox Slovakian
DOS857	Turkish	DOS857	Turkish
DOS857	Turkish	DB_TRK	dBASE Turkish
DOS858		DOS858	Latin-1 plus Euro symbol €
DOS860	Portuguese	DOS860	Portuguese
DOS860	Portuguese	DB_PTG860	dBASE Portuguese
DOS861	Icelandic	DOS861	Icelandic
DOS861	Icelandic	PDOX_ISL	Paradox Icelandic
DOS862	Hebrew	DOS862	Hebrew
DOS863	Canadian French	DOS863	French/Canada
DOS863	Canadian French	DB_FRC863	dBASE French/Canada
DOS864	Arabic	DOS864	Arabic
DOS865	Scandinavian	DOS865	Nordic
		DB_NOR865	dBASE Norwegian
		DB_DAN865	dBASE Danish
		PDOX_NORDAN4	Paradox Norwegian
DOS866	Russian	DOS866	Russian
DOS869	Greek	DOS869	Modern Greek
Others
BIG_5	Chinese	BIG_5	Chinese
KOI8R	Russian		Russian character set and dictionary collation (Firebird 2.0)
KOI8U	Ukrainian		Ukrainian character set and dictionary collation (Firebird 2.0)
CYRL	Russian/Ukrainian	CYRL	Cyrillic
		DB_RUS	dBASE Russian
		PDOX_CYRL	Paradox Cyrillic
KSC_5601	Korean	KSC_5601	Unified Korean Hangeul, also known as windows-949
KSC_5601	Korean	KSC_DICTIONARY	Korean dictionary ordering
NEXT	NeXT Computers	NEXT	NeXTSTEP encoding
		NXT_DEU	German
		NXT_ESP	Spanish
		NXT_FRA	French
		NXT_ITA	Italian
		NXT_US	US-English
SJIS_0208	Japanese	SJIS_0208	Shift-JIS
EUCJ_0208	Japanese	EUCJ_0208	EUC Japanese
GB_2312	Chinese	GB_2312	Simplified Chinese (HongKong, PRC), a subset of GBK/windows-936
CP943C	Japanese	CP943C_UNICODE	Japanese character set (Firebird 2.1)
TIS620	Thai	TIS620_UNICODE	Thai character set, single byte (Firebird 2.1)

Which one to choose?

The question now is: which character set do I choose for my database?

Note: I don't have any experience with Asian scripts (Chinese, Korean, Japanese) so I can't give you any hint on these

You should chose the DOS, dBASE and Paradox character sets only if you have legacy applications to support
The WINxxx character sets are extensions of the corresponding ISOxxx character sets, however you will have problems on non-Windows systems. So if you have a cross-platform application, stay with the ISOxxx character sets.
The ISOxxx character sets are missing a few characters of the WINxxx character sets (like typographic dash signs) or they can have some different characters, so your application must be prepared to handle this.

Unicode?

The Unicode situation dramatically improved with Firebird 2.0. Now there is the new UTF8 character set that correctly handles Unicode strings in UTF-8 format. The Unicode collation algorithm has been implemented so now you can use UPPER() and the new LOWER() function without the need to specify a collation.