UTF-8 Invalid Byte Sequences

Chances are, some of you have run into the "invalid byte sequence in UTF-8" error when dealing with user-submitted data. A Google search shows that my hunch isn’t off.

Among the search results are plenty of answers, some using the deprecated iconv library, that might lead you to a sufficient fix. However, few of them explain how to reliably reproduce and test the issue.

In developing the Griddler gem we ran into some cases where the data being posted back to our controller had invalid UTF-8 bytes. For Griddler, our failing test case needs to simulate an email body that is encoded as UTF-8 but contains an invalid byte.

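What are valid and invalid bytes? This table on Wikipedia tells us bytes 192, 193, and 245-255 are off limits. A continuation byte (128-191) that does not follow a lead byte is invalid as well. In Ruby’s string literal we can produce such a byte with an escape. Note that \255 is an octal escape, so it yields byte 173 (0xAD), a stray continuation byte, rather than byte 255: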

> "hi \255"

 => "hi \xAD"
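To see which byte an escape actually produces, the string can be inspected directly. The following is a minimal sketch, assuming a Ruby version where string literals default to UTF-8; the variable names are illustrative only. It compares the octal escape used above with a hex escape for byte 255.

```ruby
# "\255" is an octal escape: 0o255 == 173 == 0xAD (a stray continuation byte).
octal = "hi \255".dup.force_encoding('UTF-8')
# "\xFF" is a hex escape for byte 255, which never appears in valid UTF-8.
hex = "hi \xFF".dup.force_encoding('UTF-8')

octal_last = octal.bytes.to_a.last
hex_last   = hex.bytes.to_a.last

puts octal_last             # 173
puts hex_last               # 255
puts octal.valid_encoding?  # false
puts hex.valid_encoding?    # false
```

Either string works as a failing fixture; the hex form simply makes the intended byte easier to read.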

There’s our string with the invalid byte! How do we know for sure? In that IRB session we can trigger a comparable issue by sending the string a message it won’t like, such as split or gsub.

> "hi \255".split(' ')

ArgumentError: invalid byte sequence in UTF-8

  from (irb):9:in `split'

  from (irb):9

  from /Users/joel/.rvm/rubies/ruby-1.9.3-p125/bin/irb:16:in `<main>'

Yup. It certainly does not like that.
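Rather than waiting for an exception, a string can also be asked whether its bytes are valid for its encoding. The sketch below assumes nothing beyond core Ruby; `error_from` is a hypothetical helper written for this illustration, not part of Griddler or RSpec.

```ruby
# Hypothetical helper: runs the block and returns the class of the
# ArgumentError it raised, or nil if the block finished normally.
def error_from
  yield
  nil
rescue ArgumentError => e
  e.class
end

broken = "hi \255".dup.force_encoding('UTF-8')
clean  = "hi there".dup.force_encoding('UTF-8')

broken_ok = broken.valid_encoding?   # false: lone continuation byte
clean_ok  = clean.valid_encoding?    # true

# A regexp-based gsub raises on the broken string only.
broken_error = error_from { broken.gsub(/hi/, 'yo') }
clean_error  = error_from { clean.gsub(/hi/, 'yo') }

puts broken_ok
puts clean_ok
puts broken_error.inspect
puts clean_error.inspect
```

`valid_encoding?` gives a cheap way to assert in a test that a fixture really is broken before exercising the code under test.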

Let’s create a very real-world, enterprise-level, business-critical test case:

invalid_byte_spec.rb

require 'rspec'

def replace_name(body, name)

  body.gsub(/joel/, name)

end

describe 'replace_name' do

  it 'removes my name' do

    body = "hello joel"

    replace_name(body, 'hank').should eq "hello hank"

  end

  it 'clears out invalid UTF-8 bytes' do

    body = "hello joel\255"

    replace_name(body, 'hank').should eq "hello hank"

  end

end

The first test passes as expected, and the second will fail as expected but not with the error we want. By adding that extra byte we should see an exception raised similar to what we simulated in IRB. Instead it’s failing in the comparison with the expected value.

1) replace_name clears out invalid UTF-8 bytes

   Failure/Error: replace_name(body, 'hank').should eq "hello hank"

     expected: "hello hank"

          got: "hello hank\xAD"

     (compared using ==)

   # ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'

Why isn’t it failing properly? If we pry into our running test we find out that inside our file the strings being passed around are encoded as ASCII-8BIT instead of UTF-8.

[2] pry(#<RSpec::Core::ExampleGroup::Nested_1>)> body.encoding

=> #<Encoding:ASCII-8BIT>
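The difference between the two encodings can be reproduced outside of the spec. This is a small sketch, assuming core Ruby only, showing that a binary (ASCII-8BIT) string accepts any byte, so gsub succeeds, and that force_encoding changes the encoding label without touching the bytes.

```ruby
# In binary (ASCII-8BIT) every byte is valid, so gsub does not raise.
binary   = "hello joel\xAD".dup.force_encoding('ASCII-8BIT')
replaced = binary.gsub(/joel/, 'hank')

# force_encoding only relabels the string; the bytes stay the same.
utf8       = binary.dup.force_encoding('UTF-8')
same_bytes = utf8.bytes.to_a == binary.bytes.to_a

puts binary.valid_encoding?   # true
puts replaced.bytes.to_a.last # 173, the stray byte is still there
puts same_bytes               # true
puts utf8.valid_encoding?     # false
```

This is why the spec only fails with the encoding error once the string is tagged as UTF-8.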

As a result we’ll have to force that string’s encoding to UTF-8:

it 'clears out invalid UTF-8 bytes' do

  body = "hello joel\255".force_encoding('UTF-8')

  replace_name(body, 'hank').should_not raise_error(ArgumentError)

  replace_name(body, 'hank').should eq "hello hank"

end

By running the test now we will see our desired exception:

1) replace_name clears out invalid UTF-8 bytes

   Failure/Error: body.gsub(/joel/, name)

   ArgumentError:

     invalid byte sequence in UTF-8

   # ./invalid_byte_spec.rb:4:in `gsub'

   # ./invalid_byte_spec.rb:4:in `replace_name'

   # ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'

Finished in 0.00426 seconds

2 examples, 1 failure

Now that we’re comfortably in the red part of red/green/refactor we can move on to getting this passing by updating our replace_name method.

def replace_name(body, name)

  body

    .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

    .gsub(/joel/, name)

end

And the test?

Finished in 0.04252 seconds

2 examples, 0 failures
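One caveat with treating the input as 'binary' is that every byte above 127 is dropped, including bytes that belong to valid multibyte characters. On Ruby 2.1 and later, String#scrub is an alternative that removes only the invalid sequences. The comparison below is a sketch under that version assumption; the sample text is made up for illustration.

```ruby
# Valid "é" (two bytes in UTF-8) followed later by a stray 0xAD byte.
body = "h\u00E9llo joel" + "\xAD"

# Interpreting the bytes as binary drops every non-ASCII byte, "é" included.
via_binary = body.encode('UTF-8', 'binary',
                         invalid: :replace, undef: :replace, replace: '')

# scrub (Ruby >= 2.1) removes only the invalid sequence.
via_scrub = body.scrub('')

puts via_binary   # hllo joel
puts via_scrub    # héllo joel
```

For ASCII-only input such as the spec above, both approaches give the same result; they differ only once valid non-ASCII text is involved.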

For such a small piece of code we admittedly had to jump through some hoops. Through that process, however, we learned a bit about character encoding and how to put ourselves in the right position—through the red/green/refactor cycle—to fix bugs we will undoubtedly run into while writing software.

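A related trick for cleaning a file line by line is to round-trip each line through UTF-16BE, replacing invalid bytes with "?":

# encoding: utf-8

f = "dsp-cpi"

File.open(f).each do |line|

  # Converting to UTF-16BE forces Ruby to validate the bytes; invalid ones become "?".

  line = line.encode("UTF-16BE", :invalid => :replace, :replace => "?").encode("UTF-8")

  # Use the cleaned line here.

end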
