Wrong codepoints for non-ASCII characters inserted in UTF-8 database using CLP
Technote (troubleshooting)
Problem (Abstract)
During an insert from the CLP, no codepage conversion takes place if the operating system codepage and the database codepage are both UTF-8. In this case the data to be inserted must also be UTF-8 encoded.
If the data has a different encoding than the database codepage (this can be verified with any hex editor), change the operating system codepage to match the data's encoding so that the data is converted to the database codepage during the insert.
Symptom
Error executing Select SQL statement. Caught by java.io.CharConversionException. ERRORCODE=-4220
Caused by: java.nio.charset.MalformedInputException: Input length = 4759 at com.ibm.db2.jcc.b.u.a(u.java:19) at com.ibm.db2.jcc.b.bc.a(bc.java:1762)
Cause
During an insert from the CLP, character data does not go through codepage conversion when the operating system codepage and the database codepage are the same. If both are UTF-8 but the data to be inserted is not Unicode, the data stored in the database might have incorrect (non-Unicode) codepoints, and the error above is raised when the data is retrieved.
To verify the encoding of the data to be inserted, use any editor that shows the hex representation of characters and check the codepoints of the non-ASCII characters you are trying to insert. If you see only 1 byte per non-ASCII character, the data is not UTF-8 and you need to force codepage conversion during the insert from the CLP into the UTF-8 database.
To force codepage conversion during an insert from the CLP, make sure that the operating system codepage is non-Unicode and matches the codepage of the data whenever you insert data from a non-Unicode source into a Unicode database.
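The hex check described above can also be scripted. The following is a minimal sketch in Python, not part of the original technote; the function names `hex_view` and `first_invalid_utf8_offset` are hypothetical helpers introduced only for illustration. It prints the bytes of a sample in hex and reports the offset of the first byte that is not valid UTF-8.

```python
def hex_view(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02X}" for b in data)


def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The letter 'a' followed by U+00A9 (the copyright sign), encoded two ways.
text = "a\u00a9"
single_byte_sample = text.encode("iso-8859-1")  # 1 byte for the non-ASCII character
utf8_sample = text.encode("utf-8")              # 2 bytes for the non-ASCII character

print(hex_view(single_byte_sample), first_invalid_utf8_offset(single_byte_sample))
print(hex_view(utf8_sample), first_invalid_utf8_offset(utf8_sample))
```

A non-None offset means the data is not UTF-8 and would be stored unconverted if the operating system codepage were UTF-8.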
Problem Details
An example problem scenario is as follows:
- Create a database of type UTF-8:
CREATE DATABASE <db> USING CODESET utf-8 TERRITORY US
- Create a table that holds character data:
CREATE TABLE test (col char(20))
- Check the operating system locale:
locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
- Insert the non-ASCII characters 'Ã', '³', and '©', which have the codepoints 0x'C3', 0x'B3', and 0x'A9' in codepage 819, into the table:
INSERT INTO test VALUES ('Ã')
INSERT INTO test VALUES ('³')
INSERT INTO test VALUES ('©')
- Run the following statement to see that each INSERT statement stored only one byte for its character:
SELECT col, HEX(col) FROM test
Ã C3
³ B3
© A9
However, the UTF-8 representations of those characters are 0x'C383' for 'Ã', 0x'C2B3' for '³', and 0x'C2A9' for '©', so these three rows contain invalid UTF-8 data.
- When a JDBC application selects from the column, the following error occurs. This is expected because the table contains invalid UTF-8 data:
Error executing Select SQL statement. Caught by java.io.CharConversionException. ERRORCODE=-4220
Caused by: java.nio.charset.MalformedInputException: Input length = 4759 at com.ibm.db2.jcc.b.u.a(u.java:19) at com.ibm.db2.jcc.b.bc.a(bc.java:1762)
- Delete all rows with incorrect Unicode codepoints from the test table:
DELETE FROM test
- Change the locale to one that matches the codepage of the data to be inserted:
export LANG=en_US
Locale names are platform-specific; run locale -a to list the locales available on your system and choose one whose codeset matches the data (codepage 819, that is ISO 8859-1, in this example). One way to determine the codepage of your data is described here: http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text. If you prepare the data yourself in an editor, check the editor's documentation to find out how to set the codepage it uses when saving data.
- Insert the data into the table:
INSERT INTO test VALUES ('Ã')
INSERT INTO test VALUES ('³')
INSERT INTO test VALUES ('©')
- Verify that the inserted data was converted to UTF-8 during the insert:
SELECT col, HEX(col) FROM test
Ã C383
³ C2B3
© C2A9
- Run your Java application that selects the Unicode data. No exception should be reported.
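The byte values in this scenario can be reproduced outside the database. The sketch below is an illustration in Python rather than anything DB2 itself runs; it assumes that Python's `iso-8859-1` codec corresponds to codepage 819, and the helper name `convert_cp819_to_utf8` is hypothetical. It models the conversion that takes place once the operating system codepage matches the data.

```python
# Characters from the scenario, written as escapes so that the encoding of
# this source file does not matter: U+00C3, U+00B3, U+00A9.
SAMPLE_CHARS = ["\u00c3", "\u00b3", "\u00a9"]


def convert_cp819_to_utf8(raw: bytes) -> bytes:
    """Model the codepage conversion: interpret bytes as codepage 819, emit UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


for ch in SAMPLE_CHARS:
    raw = ch.encode("iso-8859-1")           # what the data source contains
    converted = convert_cp819_to_utf8(raw)  # what a UTF-8 database should store
    print(f"U+{ord(ch):04X}  cp819={raw.hex().upper()}  utf8={converted.hex().upper()}")
```

The first hex column corresponds to the HEX(col) values seen before the locale change, the second to the values seen after it.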
Environment
UNIX, Linux, Unicode database
Diagnosing the problem
Verify that non-ASCII data in the Unicode database has valid Unicode (UTF-8) codepoints, for example by inspecting the output of HEX() on the affected columns.
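When many rows need checking, the hex strings returned by HEX() can be validated in bulk. This is a minimal Python sketch under the assumption that the HEX(col) values have already been exported as plain strings; `is_valid_utf8_hex` is a hypothetical helper, not a DB2 or JDBC function.

```python
def is_valid_utf8_hex(hex_value: str) -> bool:
    """Return True if the bytes described by a HEX() string decode as UTF-8."""
    try:
        bytes.fromhex(hex_value).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# HEX() values taken from the scenario, before and after the re-insert.
for hex_value in ["C3", "B3", "A9", "C383", "C2B3", "C2A9"]:
    status = "ok" if is_valid_utf8_hex(hex_value) else "INVALID UTF-8"
    print(hex_value, status)
```

Rows flagged as invalid are the ones that need to be reinserted with codepage conversion in effect.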
Resolving the problem
Reinsert the data with codepage conversion enforced, by setting the operating system codepage to match the codepage of the data to be inserted.