Extracting substrings from mixed Chinese/English UTF-8 strings in Lua

Reference blog post: UTF8字符串在lua的截取和字数统计【转载】 (Substring extraction and character counting for UTF-8 strings in Lua [repost])

Requirements

Extract a substring by counting visible characters rather than bytes.

function(string, start position, number of characters)
 
utf8sub("你好1世界哈哈", 2, 5)    =    好1世界哈
utf8sub("1你好1世界哈哈", 2, 5)    =    你好1世界
utf8sub("你好世界1哈哈", 1, 5)    =    你好世界1
utf8sub("", 1, 5)    =
utf8sub("øpø你好pix", 2, 5)    =    pø你好p

Incorrect approaches

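I looked at several algorithms online and none of them were quite right: they either produce garbled output or only handle the 3-byte Chinese case, which is not general enough.

1. string.sub(s, 1, length*3)

Many posts simply use `string.sub(s, 1, length*3)`, which cannot be right for a mixed Chinese/English string. For example, the characters of `你好1世界` occupy `3,3,1,3,3` bytes. Asking for 4 characters takes 4*3 = 12 = 3+3+1+3+2 bytes, so only the first 2 bytes of `界` are included and the result ends in garbage.

2. if byte>128 then index = index + 3

This treats every non-ASCII character as exactly 3 bytes wide, so it goes wrong on 2-byte characters such as `ø` and on 4-byte characters.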
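The failure of the fixed-width approach can be reproduced in a few lines. This is a minimal sketch: `naive_sub` is a hypothetical helper written only for this illustration, the width of 3 bytes is the assumption such code makes about Chinese characters, and the source file is assumed to be saved as UTF-8.

```lua
-- Hypothetical helper: the fixed-width approach, assuming every
-- character occupies `width` bytes.
local function naive_sub(s, n, width)
    return string.sub(s, 1, n * width)
end

local s = "你好1世界"            -- byte lengths: 3, 3, 1, 3, 3 (13 bytes)
local cut = naive_sub(s, 4, 3)   -- asks for 4 characters, takes 12 bytes

print(#s)                            -- 13
print(#cut)                          -- 12
-- The wanted result "你好1世" is only 10 bytes long, so `cut` carries
-- two extra bytes: the first two bytes of "界", an incomplete sequence.
print(cut == "你好1世")              -- false
print(cut:sub(1, 10) == "你好1世")   -- true
print(string.byte(cut, 11, 12))      -- 231 149
```

The two trailing bytes are what a terminal renders as a replacement character or garbage.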

Key points

1. UTF-8 characters are variable-length

2. The length follows a pattern

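As laid out in the article on character encodings, UTF-8 is an encoding scheme for the Unicode character set. Its variable-length layout is:

1 byte: 0*******

2 bytes: 110*****, 10******

3 bytes: 1110****, 10******, 10******

4 bytes: 11110***, 10******, 10******, 10******

5 bytes: 111110**, 10******, 10******, 10******, 10******

6 bytes: 1111110*, 10******, 10******, 10******, 10******, 10******

The 5-byte and 6-byte forms come from the original UTF-8 design. RFC 3629 limits valid UTF-8 to at most 4 bytes, so they should not occur in well-formed text.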
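The patterns above can be checked against real characters. The sketch below prints the first byte of a few sample characters in binary; `to_binary` is a hypothetical helper introduced for this illustration, and the source file is assumed to be saved as UTF-8.

```lua
-- Hypothetical helper: format one byte (0..255) as 8 binary digits.
local function to_binary(byte)
    local bits = {}
    for i = 7, 0, -1 do
        local bit = math.floor(byte / 2 ^ i) % 2
        bits[#bits + 1] = (bit == 1) and "1" or "0"
    end
    return table.concat(bits)
end

-- Columns: character, byte length, first byte, first byte in binary.
local samples = { "1", "ø", "你", "😀" }
for _, ch in ipairs(samples) do
    local lead = string.byte(ch, 1)
    print(ch, #ch, lead, to_binary(lead))
end
-- 1    1    49     00110001
-- ø    2    195    11000011
-- 你   3    228    11100100
-- 😀   4    240    11110000
```

In each row the number of leading 1 bits in the first byte matches the byte length, except for the single-byte case, which starts with 0.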

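So, given a byte string, you only need the first byte of a UTF-8 character to find its length in bytes: following the patterns above, the value of that byte tells you how many bytes encode the character.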
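One consequence of this layout is that every character contains exactly one byte that is not of the form 10******, so counting those bytes counts the characters. A minimal sketch, assuming well-formed UTF-8 input; `count_chars` is a hypothetical name used only here:

```lua
-- Hypothetical helper: count characters by counting the bytes that are
-- NOT continuation bytes. Continuation bytes (10******) are 128..191.
local function count_chars(str)
    -- gsub returns the new string and the number of matches; keep the count.
    local _, n = string.gsub(str, "[^\128-\191]", "")
    return n
end

print(count_chars("1你好"))         -- 3
print(count_chars("øpø你好pix"))    -- 8
print(count_chars(""))              -- 0
```

This only counts; it does not tell you where each character starts, which is what substring extraction needs.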

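The code that determines the length from the first byte, together with the counting and substring functions, is as follows: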

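-- Return the byte length of a UTF-8 character, given its first byte.
-- Thresholds are the lead-byte patterns in decimal:
-- 192 = 110*****, 224 = 1110****, 240 = 11110***, 248 = 111110**, 252 = 1111110*
local function charsize(ch)
    if not ch then return 0
    elseif ch >= 252 then return 6
    elseif ch >= 248 and ch < 252 then return 5
    elseif ch >= 240 and ch < 248 then return 4
    elseif ch >= 224 and ch < 240 then return 3
    elseif ch >= 192 and ch < 224 then return 2
    elseif ch < 192 then return 1
    end
end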

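-- Count the characters in a UTF-8 string; every character counts as one.
-- For example, utf8len("1你好") => 3
-- Also returns the number of single-byte and multi-byte characters.
function utf8len(str)
    local len = 0
    local aNum = 0 -- number of single-byte characters (letters, digits)
    local hNum = 0 -- number of multi-byte characters (e.g. Chinese)
    local currentIndex = 1
    while currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        local cs = charsize(char)
        currentIndex = currentIndex + cs
        len = len + 1
        if cs == 1 then
            aNum = aNum + 1
        elseif cs >= 2 then
            hNum = hNum + 1
        end
    end
    return len, aNum, hNum
end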

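-- Extract a substring from a UTF-8 string
-- str:          the string to extract from
-- startChar:    index of the first character, starting from 1
-- numChars:     number of characters to extract
function utf8sub(str, startChar, numChars)
    -- Skip startChar - 1 characters to find the starting byte index.
    local startIndex = 1
    while startChar > 1 do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + charsize(char)
        startChar = startChar - 1
    end

    local currentIndex = startIndex

    -- Advance over numChars characters, stopping at the end of the string.
    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + charsize(char)
        numChars = numChars - 1
    end
    return str:sub(startIndex, currentIndex - 1)
end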
 
-- Self-test
function test()
    -- test utf8len
    assert(utf8len("你好1世界哈哈") == 7)
    assert(utf8len("你好世界1哈哈 ") == 8)
    assert(utf8len(" 你好世 界1哈哈") == 9)
    assert(utf8len("") == 0)
    assert(utf8len("øpø你好pix") == 8)

    -- test utf8sub
    assert(utf8sub("你好1世界哈哈", 2, 5) == "好1世界哈")
    assert(utf8sub("1你好1世界哈哈", 2, 5) == "你好1世界")
    assert(utf8sub(" 你好1世界 哈哈", 2, 6) == "你好1世界 ")
    assert(utf8sub("你好世界1哈哈", 1, 5) == "你好世界1")
    assert(utf8sub("", 1, 5) == "")
    assert(utf8sub("øpø你好pix", 2, 5) == "pø你好p")

    print("all test succ")
end
 
test()
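For comparison, Lua 5.3 and later ship a standard `utf8` library whose `utf8.len` and `utf8.offset` functions cover the same ground. The sketch below is an alternative, not the method of this post: `utf8sub_std` is a hypothetical name, it assumes Lua 5.3 or newer (the library is absent from Lua 5.1 and stock LuaJIT), and it returns an empty string for invalid UTF-8 or out-of-range arguments, which is a design choice made here.

```lua
-- Hypothetical alternative built on the standard utf8 library (Lua 5.3+).
local function utf8sub_std(str, startChar, numChars)
    local total = utf8.len(str)   -- nil (plus a position) on invalid UTF-8
    if not total or startChar < 1 or startChar > total or numChars <= 0 then
        return ""
    end
    -- Byte index where the startChar-th character begins.
    local startIndex = utf8.offset(str, startChar)
    -- First character NOT included, clamped to one past the last character;
    -- utf8.offset(str, total + 1) yields #str + 1.
    local stopChar = math.min(startChar + numChars, total + 1)
    local endIndex = utf8.offset(str, stopChar) - 1
    return str:sub(startIndex, endIndex)
end

print(utf8sub_std("你好1世界哈哈", 2, 5))    -- 好1世界哈
print(utf8sub_std("øpø你好pix", 2, 5))      -- pø你好p
print(utf8sub_std("", 1, 5) == "")          -- true
```

The hand-written version above remains useful on runtimes without the `utf8` library, and it accepts the obsolete 5-byte and 6-byte lead bytes.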