【参考】IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States
Problem(Abstract)
When converting contents from a file or string using WebSphere Application Server, numbers may be converted to their word equivalents, especially if using PDFBOX to extract text, along with sun.io.MalformedInputExceptions.
Symptom
Text extracted from UTF-8 sources, such as PDFs, are displayed incorrectly.
For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.
Diagnosing the problem
When extracting text from a source that using UTF-8, you may find that numbers and non-alpha characters are transforming into word equivalents. This is a problem seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems.
We have a stand-alone test case that can confirm if the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and execute the following against the Java™ executable that is bundled with your WebSphere Application Server.
[JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"

If the test fails, you will see output similar to the following:
onetwothreespaceHellospaceMottospace
Resolving the problem
Because of the IBM SDK's use of Java IO for text and font conversion, these transformation issues occur. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:
-Dibm.stream.nio=true
I am getting a MalformedInputException. How can I resolve this?
This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are reported. By switching to NIO, these exceptions would be caught and not reported to the log.
You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see if it set to UTF-8. It may read something like this:
# echo $LANG
en_US.UTF-8
Alter the variable and remove the .UTF-8 appended to the end of the string. From the command prompt on UNIX and Linux, you can type the following:
# export LANG=en_US
Alternatively, you can add this environment variable from the administration console.
MalformedInputException may also occur when running your application on WebSphere Application Server and would be output to the standard error.
Why is Java IO used for converting text?
Java IO is retained in the IBM SDK for performance reasons instead of using NIO, or New IO. By design, Java IO will throw exceptions when errors are encountered, such as the MalformedInputExcpetion error, while NIO will not.
The JVM can be forced to use NIO if the JVM argument is used as stated above.
Does the Oracle JDK suffer similar problems?
Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.
【参考】IBM sun.io.MalformedInputException and text encoding conversions transforms numerals to their word equivalents - United States的更多相关文章
- c#字符编码,System.Text.Encoding类,字符编码大全:如Unicode编码、GB18030、UTF-8,UTF-7,GB2312,ASCII,UTF32,Big5
本页列出来目前window下所有支持的字符编码 ---c#通过 System.Text.Encoding.GetEncodings()获取,里面可以对其进行查询,筛选,对同一个字符,在不同编码进行查 ...
- System.Text.Encoding.Default
string strTmp = "abcdefg某某某";int i= System.Text.Encoding.Default.GetBytes(strTmp).Length;/ ...
- java.io.IOException: Malformed \uxxxx encoding.
java.io.IOException: Malformed \uxxxx encoding. at com.dong.frame.util.ReadProperties.read(ReadProp ...
- System.Text.Encoding.cs
ylbtech-System.Text.Encoding.cs 1.程序集 mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77 ...
- LookupError: 'hex' is not a text encoding; use codecs.decode() to handle arbitrary codecs
问题代码: b=b'\x01\x02\x03' x=binascii.b2a_hex(b.decode('hex')[::-1].encode('hex')) python2下是不报错的,因为pyth ...
- UnicodeMath数学公式编码_翻译(Unicode Nearly Plain - Text Encoding of Mathematics Version 3)
目录 完整目录 1. 简介 2. 编码简单数学表达式 2.1 分数 2.2 上标和下标 2.3 空白(空格)字符使用 3. 编码其他数学表达式 3.1 分隔符 强烈推荐本文简明版UnicodeMath ...
- (https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014550004)Topic: Caught java.io.CharConversionException. ERRORCODE=-4220, SQLSTATE=null
270002WDPN 3 Posts 0 people l ...
- SqlException with message "Caught java.io.CharConversionException." and ERRORCODE=-4220
Technote (troubleshooting) Problem(Abstract) When an application uses the IBM Data Server Driver for ...
- sqoop从DB2迁移数据到HDFS
Sqoop import job failed to read data from DB2 database which has UTF8 encoding. Essentially, even th ...
随机推荐
- http接口服务方结合策略模式实现总结
在项目中,我们经常会使用到http+xml的接口,而且不仅仅的是一个,可能会有多个http的接口需要实时的交互.但是http接口的接收消息的公共部分是一样的,只有每个接口的报文解析和返回报文是不同的, ...
- C++的头文件(转)
这几天在写比较困难的一部分,所以也没有时间总结一些东西了,不过昨天翻我的笔记本,发现了一篇还不错的笔记,给大家看看. C/C++头文件一览 C.传统 C++ #include <assert.h ...
- ajax第一天总结
AJAX开发步骤 步一:创建AJAX异步对象,例如:createAJAX() 步二:准备发送异步请求,例如:ajax.open(method,url) 步三:如果是POST请求的话,一定要设置AJAX ...
- node——文件写入,文件读取
ru //实行文件操作 //文件写入 //1.加载文件操作,fs模块 var fs = require('fs'); //2.实现文件写入操作 var msg='Hello world'; //调用f ...
- 队列Queue的get方法
写了一段生产者消费者模型的代码: from time import sleep from random import randint, random from multiprocessing impo ...
- 流媒体应用程序Mobdro或存在安全隐患
Mobdro是一款流媒体应用程序,可以安装在任何Android设备上,包括手机,平板电脑,亚马逊的Fire TV Stick和Google的Chromecast.它现在已经流行了一段时间,特别是在围绕 ...
- 小学生绞尽脑汁也学不会的python(反射)
小学生绞尽脑汁也学不会的python(反射) 1. issubclass, type, isinstance issubclass 判断xxxx类是否是xxxx类的子类 type 给出xxx的数据类型 ...
- JS中的DOM操作怎样添加、移除、移动、复制、创建和查找节点
DOM操作怎样添加.移除.移动.复制.创建和查找节点? (1)创建新节点 createDocumentFragment() //创建一个DOM片段 createElement() //创建一个具体的元 ...
- IDEA使用快捷键
sout+TAB键---->System.out.println();你可以按ctrl+j里面各种快捷键模板都可以看到. Intellij Idea get/set方法快捷键:Alt+Inse ...
- css3特效第二篇--行走的线条&&置顶导航栏
一.行走的线条. 效果图(加载可能会慢一点儿,请稍等...): html代码: <div class="movingLines"> <img src=" ...