Firebird Character Sets and Collations

Every CHAR or VARCHAR field can (or, better: must) have a character set assigned. Firebird uses this information to correctly store the bytes that make up the character string.

In order to be able to sort or compare strings, you also need to define a collation. A collation defines the sort ordering and uppercase conversions for a string.

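Firebird can only convert correctly between character sets when it knows which ones are in use on both sides; with NONE, the bytes are copied as they are (see Conversions below). So you must set the correct values on the server and on the client if everything is to work smoothly.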

An Example

In the German language there are the "Umlauts", special vowels with a double-dot (diaeresis) over them. A common last name in Germany is Müller. (If you don't have umlauts on your keyboard, you could also write "Mueller", but that's not what we want to discuss here ;-)

When you convert Müller to uppercase you get MÜLLER, so there is an uppercase Ü and a lowercase ü.

When you want to perform a lexicographic compare on the name, you have several options:

  • You can treat the Ü like a U (German "Duden" dictionary)
  • You can treat the Ü like UE (German telephone book)
  • You can treat the Ü like a special character, sorted in after Z (a common practice in Scandinavia)
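The three options above can be sketched with plain sort keys. This is only an illustration in Python, not how Firebird implements its collations; the function names and the sample names are made up for the example, and real collations handle many more characters than the three umlauts.

```python
# Illustrative sort keys for the three ways of ordering German umlauts.
# Simplified sketch: only ä, ö, ü are handled, and case is ignored.

# Option 1 (Duden dictionary): treat Ü like U
DUDEN = str.maketrans({"ä": "a", "ö": "o", "ü": "u",
                       "Ä": "A", "Ö": "O", "Ü": "U"})

# Option 2 (telephone book): treat Ü like UE
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                            "Ä": "AE", "Ö": "OE", "Ü": "UE"})

# Option 3: treat Ü as a separate letter after Z.
# '{', '|' and '}' all have code points greater than 'z'.
AFTER_Z = str.maketrans({"ä": "{", "ö": "|", "ü": "}"})


def duden_key(name: str) -> str:
    return name.translate(DUDEN).lower()


def phone_book_key(name: str) -> str:
    return name.translate(PHONE_BOOK).lower()


def after_z_key(name: str) -> str:
    return name.lower().translate(AFTER_Z)


NAMES = ["Muller", "Müller", "Mueller", "Mufti", "Muz"]

for key in (duden_key, phone_book_key, after_z_key):
    print(key.__name__, sorted(NAMES, key=key))
```

The same five names come out in three different orders, which is exactly why the collation has to be chosen deliberately.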

Creating a Database

You can define the default character set for a new database in the CREATE DATABASE statement:

CREATE DATABASE '<database>'
USER '<username>'
PASSWORD '<password>'
PAGE_SIZE <pagesize>
DEFAULT CHARACTER SET <charset>

For example:

CREATE DATABASE 'localhost:meter'
USER 'SYSDBA'
PASSWORD 'masterkey'
PAGE_SIZE 4096
DEFAULT CHARACTER SET ISO8859_1;

From now on, any VARCHAR or CHAR field will default to the ISO8859_1 character set. You can, however, specify a special character set for each column:

CREATE TABLE users (
CZECH_NAME VARCHAR(50) CHARACTER SET ISO8859_2,
...

Collations

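There is no language-specific default collation: unless you specify one, a column gets the plain default collation of its character set. So you should define a collation for every field that is to be used for sorting (ORDER BY) or for case conversion (UPPER):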

CREATE TABLE users (
NAME VARCHAR(50) COLLATE DE_DE,
...

COLLATE DE_DE means: "use a collation for the German language (the first DE), applying the rules from Germany (the second DE)"

You can also specify the collation with the ORDER BY clause:

ORDER BY LASTNAME COLLATE FR_CA, FIRSTNAME COLLATE FR_CA

or with the WHERE clause:

WHERE LASTNAME COLLATE FR_CA = :lastnametosearch

or when searching:

WHERE UPPER (LAST_NAME COLLATE SV_SV) = 'PAULSEN';

The UPPER() function

UPPER() only works correctly if there is a collation defined for the parameter field:

WHERE UPPER (NAME COLLATE DE_DE) = 'MÜLLER';
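To see why the search above can fail without a collation, here is a simplified Python model of an uppercase conversion that only knows the 26 unaccented Latin letters. The function ascii_upper is a made-up name for this sketch and is not Firebird's actual implementation.

```python
def ascii_upper(text: str) -> str:
    """Uppercase only the letters a-z; leave every other character alone."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


name = "Müller"

# Without language knowledge the umlaut stays lowercase ...
print(ascii_upper(name))

# ... so a comparison against the fully uppercased name does not match.
print(ascii_upper(name) == "MÜLLER")

# A conversion that knows the letter pair ü/Ü produces the expected result.
print(name.upper() == "MÜLLER")
```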

Specifying the client character set

ISQL

SET NAMES ISO8859_1;

InterBase Objects (Ibo) by Jason Wharton

The TIb_Connection class has a string property named CharSet. Assign it the name of the character set to use:

Ib_Connection1.CharSet := 'ISO8859_1';

InterBase Express (IBX), built into Delphi

The TIbDatabase class has a TStrings property named Params. Add an entry named lc_ctype and set it to the character set:

IbDatabase1.Params.Add ('lc_ctype=ISO8859_1');

PHP

In PHP you define the Client Character Set when you connect (or pconnect) to the database.

$db = ibase_connect ($Name, $Usr, $Pwd, "ISO8859_1");

Conversions

Conversions between character sets are always done as: CHARSET1 -> UNICODE -> CHARSET2

With NONE or OCTETS as the connection character set, the bytes are just copied: NONE/OCTETS -> CHARSET2 and CHARSET1 -> NONE/OCTETS.
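The two cases can be sketched in Python, whose codecs also use Unicode as the pivot. This is an illustration only: the Python codec names iso8859_1 and cp850 stand in for Firebird's ISO8859_1 and DOS850, and transliterate is a made-up helper name.

```python
def transliterate(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one character set to another via Unicode."""
    text = data.decode(source)   # CHARSET1 -> UNICODE
    return text.encode(target)   # UNICODE  -> CHARSET2


stored = "Müller".encode("iso8859_1")
print(stored)                                   # ü is the single byte 0xFC

converted = transliterate(stored, "iso8859_1", "cp850")
print(converted)                                # ü is now the byte 0x81

# With NONE the bytes are copied unchanged. A client that then
# interprets them as DOS850 sees a different character instead of ü.
copied = bytes(stored)
print(copied.decode("cp850"))
```

The second half is the practical reason to avoid NONE: nothing fails, but the text is silently wrong on the client.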

Case insensitive searching

I have written a separate article about this.


Character Sets and Collations

Character Set | Languages | Collation | Comments

Generic Character Sets
NONE | All | NONE | No character set applied. With this character set setting, Firebird is unable to perform conversion operations like UPPER() correctly on anything other than the standard 26 Latin letters.
OCTETS / BINARY | All | OCTETS | Same as NONE. Cannot be used as client connection character set. Space character is #x00. Will be displayed as hex in ISQL 2.0.
ASCII | English | ASCII | English

Unicode based Character Sets
UNICODE_FSS | All | UNICODE_FSS | Unicode UTF-8. An old implementation that accepts malformed strings and does not enforce the correct maximum string length. All characters 3 bytes, no case mapping. Superseded in Firebird 2.0 by the UTF8 character set. Deprecated.
UTF8 | All | UCS_BASIC | UCS_BASIC sorts in Unicode code-point order (Firebird 2.0)
 | | UNICODE | Sorts using the Unicode Collation Algorithm (UCA) (Firebird 2.0)
 | | UNICODE_CI | Case insensitive collation (Firebird 2.1)
 | | UNICODE_CI_AI | Case insensitive, accent insensitive collation for Unicode (Firebird 2.5)

Current Character Sets
ISO8859_1 | Western Europe | ISO8859_1 | Latin-1
 | | DA_DA | Danish/Denmark
 | | DE_DE | German/Germany
 | | DU_NL | Dutch/The Netherlands
 | | EN_UK | English/United Kingdom
 | | EN_US | English/USA
 | | ES_ES | Spanish/Spain
 | | ES_ES_CI_AI | Spanish/Spain, case insensitive, accent insensitive (Firebird 2.0)
 | | FI_FI | Finnish/Finland
 | | FR_CA | French/Canada
 | | FR_FR | French/France
 | | FR_FR_CI_AI | French/France, case insensitive, accent insensitive (Firebird 2.1)
 | | IS_IS | Icelandic/Iceland
 | | IT_IT | Italian/Italy
 | | NO_NO | Norwegian/Norway
 | | PT_PT | Portuguese/Portugal
 | | PT_BR | Portuguese/Brazil (Firebird 2.0). Case and accent insensitive
 | | SV_SV | Swedish/Sweden
ISO8859_2 | Central Europe | ISO8859_2 | Central Europe
 | | CS_CZ | Czech
 | | ISO_HUN | Hungarian
 | | ISO_PLK | Polish (Firebird 2.0)
ISO8859_3 | Southern Europe | ISO8859_3 | Maltese, Esperanto
ISO8859_4 | North European | ISO8859_4 | Estonian, Latvian, Lithuanian, Greenlandic, Lappish
ISO8859_5 | Cyrillic | ISO8859_5 | Russian, Ukrainian
ISO8859_6 | Arabic | ISO8859_6 |
ISO8859_7 | Modern Greek | ISO8859_7 |
ISO8859_8 | Hebrew | ISO8859_8 |
ISO8859_9 | Turkish | ISO8859_9 |
ISO8859_13 | Baltic | ISO8859_13 | Baltic
 | | LT_LT | Lithuanian
WIN1250 | Central Europe | WIN1250 | Central Europe
 | | BS_BA | Bosnian (Firebird 2.0)
 | | WIN_CZ | Czech, case-insensitive (Firebird 2.0)
 | | WIN_CZ_CI_AI | Czech, case-insensitive, accent-insensitive (Firebird 2.0)
 | | PXW_CSY | Czech
 | | PXW_HUN | Hungarian
 | | PXW_HUNDC | Hungarian, dictionary sort
 | | PXW_PLK | Polish
 | | PXW_SLOV | Slovenian
WIN1251 | Cyrillic | WIN1251 | Cyrillic
 | | WIN1251_UA | Ukrainian
 | | PXW_CYRL | Cyrillic, Paradox compatibility
WIN1252 | Western Europe, America | WIN1252 | Latin-1 with Windows extensions
 | | WIN_PTBR | Brazilian Portuguese (Firebird 2.0). Case and accent insensitive
 | | PXW_INTL | Paradox ANSI International
 | | PXW_INTL850 | Paradox Multi-Lingual Latin-1
 | | PXW_NORDAN4 | Paradox Norwegian and Danish
 | | PXW_SPAN | Paradox Spanish
 | | PXW_SWEDFIN | Paradox Swedish, Finnish
WIN1253 | Modern Greek | WIN1253 |
 | | PXW_GREEK | Paradox Greek
WIN1254 | Turkish | WIN1254 |
 | | PXW_TURK | Paradox Turkish
WIN1255 | Hebrew | WIN1255 |
WIN1256 | Arabic | WIN1256 |
WIN1257 | Baltic | WIN1257 | Baltic
 | | WIN1257_LV | Latvian dictionary collation (Firebird 2.0)
 | | WIN1257_LT | Lithuanian dictionary collation (Firebird 2.0)
 | | WIN1257_EE | Estonian dictionary collation (Firebird 2.0)
WIN1258 | Vietnamese | | Vietnamese (Firebird 2.0)

MS-DOS, dBASE and Paradox compatibility
DOS437 | Western Europe, America | DOS437 | English/USA
 | | DB_DEU437 | dBASE German
 | | DB_ESP437 | dBASE Spanish
 | | DB_FIN437 | dBASE Finnish
 | | DB_FRA437 | dBASE French
 | | DB_ITA437 | dBASE Italian
 | | DB_NLD437 | dBASE Dutch
 | | DB_SVE437 | dBASE Swedish
 | | DB_UK437 | dBASE English/UK
 | | DB_US437 | dBASE English/US
 | | PDOX_ASCII | Paradox ASCII code page
 | | PDOX_INTL | Paradox International English code page
 | | PDOX_SWEDFIN | Paradox Swedish/Finnish code page
DOS737 | Greek | DOS737 | Greek
DOS775 | Baltic | DOS775 | Baltic
DOS850 | Western Europe, America | DOS850 | Latin-1 (without Euro € symbol)
 | | DB_DEU850 | dBASE German
 | | DB_ESP850 | dBASE Spanish
 | | DB_FRA850 | dBASE French/France
 | | DB_FRC850 | dBASE French/Canada
 | | DB_ITA850 | dBASE Italian
 | | DB_NLD850 | dBASE Dutch
 | | DB_PTB850 | dBASE Portuguese/Brazil
 | | DB_SVE850 | dBASE Swedish
 | | DB_UK850 | dBASE English/UK
 | | DB_US850 | dBASE English/USA
DOS852 | Central Europe | DOS852 | Latin-2 (Central Europe)
 | | DB_CSY | dBASE Czech
 | | DB_PLK | dBASE Polish
 | | DB_SLO | dBASE Slovakian
 | | PDOX_CSY | Paradox Czech
 | | PDOX_HUN | Paradox Hungarian
 | | PDOX_PLK | Paradox Polish
 | | PDOX_SLO | Paradox Slovakian
DOS857 | Turkish | DOS857 | Turkish
 | | DB_TRK | dBASE Turkish
DOS858 | | DOS858 | Latin-1 plus Euro symbol €
DOS860 | Portuguese | DOS860 | Portuguese
 | | DB_PTG860 | dBASE Portuguese
DOS861 | Icelandic | DOS861 | Icelandic
 | | PDOX_ISL | Paradox Icelandic
DOS862 | Hebrew | DOS862 | Hebrew
DOS863 | Canadian French | DOS863 | French/Canada
 | | DB_FRC863 | dBASE French/Canada
DOS864 | Arabic | DOS864 | Arabic
DOS865 | Scandinavian | DOS865 | Nordic
 | | DB_NOR865 | dBASE Norwegian
 | | DB_DAN865 | dBASE Danish
 | | PDOX_NORDAN4 | Paradox Norwegian
DOS866 | Russian | DOS866 | Russian
DOS869 | Greek | DOS869 | Modern Greek

Others
BIG_5 | Chinese | BIG_5 | Chinese
KOI8R | Russian | | Russian character set and dictionary collation (Firebird 2.0)
KOI8U | Ukrainian | | Ukrainian character set and dictionary collation (Firebird 2.0)
CYRL | Russian/Ukrainian | CYRL | Cyrillic
 | | DB_RUS | dBASE Russian
 | | PDOX_CYRL | Paradox Cyrillic
KSC_5601 | Korean | KSC_5601 | Unified Korean Hangeul, also known as windows-949
 | | KSC_DICTIONARY | Korean dictionary ordering
NEXT | NeXT Computers | NEXT | NeXTSTEP encoding
 | | NXT_DEU | German
 | | NXT_ESP | Spanish
 | | NXT_FRA | French
 | | NXT_ITA | Italian
 | | NXT_US | US-English
SJIS_0208 | Japanese | SJIS_0208 | Shift-JIS
EUCJ_0208 | Japanese | EUCJ_0208 | EUC Japanese
GB_2312 | Chinese | GB_2312 | Simplified Chinese (HongKong, PRC), a subset of GBK/windows-936
CP943C | Japanese | CP943C_UNICODE | Japanese character set (Firebird 2.1)
TIS620 | Thai | TIS620_UNICODE | Thai character set, single byte (Firebird 2.1)

Which one to choose?

The question now is: which character set do I choose for my database?

Note: I don't have any experience with Asian scripts (Chinese, Korean, Japanese), so I can't give you any hints on these.

  • You should choose the DOS, dBASE and Paradox character sets only if you have legacy applications to support.
  • The WINxxx character sets are extensions of the corresponding ISOxxx character sets; however, you will have problems on non-Windows systems. So if you have a cross-platform application, stay with the ISOxxx character sets.
  • The ISOxxx character sets are missing a few characters of the WINxxx character sets (like typographic dash signs), or they can have some different characters, so your application must be prepared to handle this.
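A quick way to check whether a given text fits into an ISO character set is to try encoding it. The sketch below uses Python's cp1252 and iso8859_1 codecs as stand-ins for Firebird's WIN1252 and ISO8859_1; can_encode is a made-up helper name.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the character set."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


EN_DASH = "\u2013"   # typographic dash
EURO = "\u20ac"      # Euro symbol

for label, char in (("u umlaut", "ü"), ("en dash", EN_DASH), ("euro", EURO)):
    print(label,
          "WIN1252:", can_encode(char, "cp1252"),
          "ISO8859_1:", can_encode(char, "iso8859_1"))
```

The umlaut fits into both character sets, while the dash and the Euro symbol exist only in the Windows one, so a cross-platform application has to replace or reject such characters before storing them.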

Unicode?

The Unicode situation dramatically improved with Firebird 2.0. Now there is the new UTF8 character set that correctly handles Unicode strings in UTF-8 format. The Unicode collation algorithm has been implemented so now you can use UPPER() and the new LOWER() function without the need to specify a collation.
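To get a feeling for the difference between a code-point ordering such as UCS_BASIC and a case and accent insensitive collation such as UNICODE_CI_AI, here is a rough Python approximation. It is not the Unicode Collation Algorithm and not Firebird's implementation; ci_ai_key is a made-up name for the sketch.

```python
import unicodedata


def ci_ai_key(text: str) -> str:
    """Approximate a case and accent insensitive comparison key."""
    # Split accented letters into base letter + combining mark ...
    decomposed = unicodedata.normalize("NFD", text)
    # ... drop the combining marks, then ignore case.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


words = ["zebra", "Äpfel", "apple", "Zucker"]

# Plain code-point order: uppercase ASCII first, accented letters last.
print(sorted(words))

# Case and accent insensitive order.
print(sorted(words, key=ci_ai_key))

# Under such a key, a search for 'muller' also finds 'MÜLLER'.
print(ci_ai_key("MÜLLER") == ci_ai_key("muller"))
```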


See also: www.firebirdsql.org/index.php?op=doc&id=fb_1_5_charsets

See also: www.collation-charts.org
