[Reference] IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States
Problem (Abstract)
When WebSphere Application Server converts the contents of a file or string, numbers may be converted to their word equivalents and sun.io.MalformedInputException errors may be reported. This is especially likely when PDFBOX is used to extract text.
Symptom
Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.
For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.
Diagnosing the problem
When extracting text from a source that uses UTF-8, you may find that numbers and non-alphabetic characters are transformed into their word equivalents. The transformation is seen on Linux; MalformedInputException errors, however, are likely to be seen on other operating systems.
A stand-alone test case can confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command, using the Java™ executable that is bundled with your WebSphere Application Server:
[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"
If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace
Resolving the problem
These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:
-Dibm.stream.nio=true
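To confirm that the argument actually reached the JVM, the property can be read back at run time. The following is a minimal sketch and not part of the IBM test case: the class name NioFlagCheck and the helper describeProperty are hypothetical, and the code only reports the raw property values. It does not verify that the SDK honors them.

```java
public class NioFlagCheck {
    /** Returns the value of a system property, or "(not set)" when it is absent. */
    public static String describeProperty(String name) {
        String value = System.getProperty(name);
        return value == null ? "(not set)" : value;
    }

    public static void main(String[] args) {
        // ibm.stream.nio is the property named in this technote; other JVMs ignore it.
        System.out.println("ibm.stream.nio = " + describeProperty("ibm.stream.nio"));
        // file.encoding reflects the default encoding the JVM derived from its environment.
        System.out.println("file.encoding  = " + describeProperty("file.encoding"));
    }
}
```

After compiling, run it with the same executable and argument as the server, for example: [JAVA_EXECUTABLE] -Dibm.stream.nio=true -cp . NioFlagCheck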
I am getting a MalformedInputException. How can I resolve this?
This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are encountered. If you switch to NIO, these exceptions are caught and not reported to the log.
You can resolve these errors by forcing NIO, but there is an alternative. Check whether the LANG environment variable is set to a UTF-8 locale. It may read something like this:
# echo $LANG
en_US.UTF-8
Alter the variable to remove the .UTF-8 suffix from the end of the string. From the command prompt on UNIX and Linux, you can type the following:
# export LANG=en_US
Alternatively, you can add this environment variable from the administration console.
A MalformedInputException may also occur when your application runs on WebSphere Application Server; in that case it is written to standard error.
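Because the server process may not inherit the same environment as your shell, it can help to check LANG from inside the JVM. The sketch below is an illustration only: the class LangCheck and the helper codesetOf are hypothetical names, and the parsing assumes the usual POSIX layout language_TERRITORY.codeset@modifier.

```java
import java.nio.charset.Charset;

public class LangCheck {
    /** Returns the codeset part of a locale string such as "en_US.UTF-8", or "" if none. */
    public static String codesetOf(String lang) {
        if (lang == null) {
            return "";
        }
        int dot = lang.indexOf('.');
        if (dot < 0) {
            return "";
        }
        String codeset = lang.substring(dot + 1);
        int at = codeset.indexOf('@'); // strip an optional modifier such as "@euro"
        return at < 0 ? codeset : codeset.substring(0, at);
    }

    public static void main(String[] args) {
        // LANG is null on platforms that do not define it.
        String lang = System.getenv("LANG");
        System.out.println("LANG            = " + lang);
        System.out.println("LANG codeset    = " + codesetOf(lang));
        System.out.println("default charset = " + Charset.defaultCharset().name());
    }
}
```

An empty codeset after the change indicates that the .UTF-8 suffix is no longer present in the environment the JVM sees.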
Why is Java IO used for converting text?
The IBM SDK retains Java IO, rather than NIO (New IO), for performance reasons. By design, Java IO throws exceptions when errors are encountered, such as the MalformedInputException error, while NIO does not.
You can force the JVM to use NIO by adding the JVM argument -Dibm.stream.nio=true, as stated above.
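The difference between reporting and silently handling malformed input can be illustrated with the public java.nio.charset API. This is a sketch under stated assumptions: it does not reproduce the IBM SDK's internal sun.io converters, the class and method names are hypothetical, and the exception raised here is java.nio.charset.MalformedInputException, which is a different class from sun.io.MalformedInputException.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeModes {
    /** Lenient decode: malformed bytes become U+FFFD and no exception is raised. */
    public static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict decode: malformed bytes raise a CharacterCodingException. */
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] good = "123 Hello Motto".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'1', '2', '3', (byte) 0xFF}; // 0xFF never occurs in valid UTF-8

        System.out.println(decodeLenient(good));
        System.out.println(decodeLenient(bad));
        try {
            decodeStrict(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode failed: " + e);
        }
    }
}
```

In both modes the well-formed part of the input decodes to the same string; only the handling of the invalid byte differs.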
Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.