Matching Chinese characters with grep regular expressions in the shell

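The test file is demo_exe.c. Pay attention to the encoding it is saved in, because it affects what is printed to the terminal; on a Chinese-language operating system the default ANSI encoding is GBK. Its contents are as follows: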

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <locale.h>
    #include <dlfcn.h>

    /*
     * export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/local/tmp; /data/local/tmp/demo_exe
     */
    int main(int argc, char** argv) {
        // 这个是中文
        void *handle = NULL;
        char* locname = setlocale(LC_ALL, "");
        // 这个是中文
        //
        if ((handle = dlopen(demo_dso_so, RTLD_NOW)) == NULL) {
            printf("dlopen出错: %s\n", dlerror());
        }
        printf("@%s[%s]dlopen return handle = %p.\n", __FILE__, __FUNCTION__, handle);
        // 这个是
        // 中文
        return 0;
    }

1. Matching specific characters:

$ grep -nP "\xE4\xB8\xAD\xE6\x96\x87|\xD6\xD0\xCE\xC4" ./demo_exe.c
12:// 这个是中文
15:// 这个是中文
22:// 中文

Encoding  中         文         Online code chart
GBK       D6D0       CEC4       http://www.lhelper.org/tech/chinese_internal_code_specification_classified.txt
Unicode   4E2D       6587
UTF-8     %E4%B8%AD  %E6%96%87  http://wenku.baidu.com/link?url=DfbzjKLcRaQ7yVIA_EHVP7mKdVbkggq4hwkCmmO9uR76Jib_5Y1Y_h616NnI21XY_x85YZqN1SQBAdCFQjklS_

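GBK: 中 = D6D0, 文 = CEC4

Unicode: 中 = 4E2D, 文 = 6587

UTF-8: 中 = %E4%B8%AD, 文 = %E6%96%87

These are the byte values written as \x escapes in the grep pattern above: the first alternative is the UTF-8 form of 中文 and the second is the GBK form, so the same command matches a file saved in either encoding.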
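The relation between the Unicode row and the UTF-8 row can be checked by hand. Below is a minimal sketch in C, not part of the original post: `utf8_encode` and `bytes_to_grep_escapes` are hypothetical helper names introduced only for illustration. The first turns a code point such as 4E2D into its UTF-8 bytes, and the second prints bytes in the \xHH form used in the grep pattern. The GBK bytes cannot be computed this way; they have to be looked up in a code chart like the one linked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: write n bytes as "\xHH" escapes into dst.
 * Returns 0 on success, -1 if dst is too small (needs 4 * n + 1 chars). */
int bytes_to_grep_escapes(const unsigned char *src, size_t n,
                          char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    if (dst_size < n * 4 + 1)
        return -1;
    for (size_t i = 0; i < n; i++)
        snprintf(dst + i * 4, 5, "\\x%02X", (unsigned)src[i]);
    dst[n * 4] = '\0';                      /* also covers n == 0 */
    return 0;
}
```

Calling `utf8_encode(0x4E2D, buf)` should fill `buf` with E4 B8 AD, and passing those three bytes to `bytes_to_grep_escapes` should give the text `\xE4\xB8\xAD`, which is the first half of the pattern used for 中.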

2. Matching a range of characters

$ grep -nP "[\xB0\xA1-\xF7\xFE]+" /home/fangss/c/dynamic_share_object_test/demo_exe.c
12:// 这个是中文
15:// 这个是中文
18: printf("dlopen出错: %s\n", dlerror());
21:// 这个是
22:// 中文

Range:

GBK/2 (GB2312 Chinese characters): B0A1 ... F7FE, that is, first byte B0 to F7 and second byte A1 to FE.
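The GBK/2 block covers double-byte codes from B0A1 to F7FE. As far as I can tell, the bracket expression in the grep command is read one byte at a time (B0, the range A1 to F7, and FE), which is enough to flag lines containing such bytes but is looser than a true two-byte test. A stricter check can be sketched in C as follows. This is only an illustration: `is_gbk2_hanzi` and `contains_gbk2_hanzi` are hypothetical names, and the buffer is assumed to be GBK-encoded.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the byte pair (lead, trail) lies inside the GBK/2 block
 * B0A1..F7FE: lead byte B0..F7 and trail byte A1..FE. */
int is_gbk2_hanzi(unsigned char lead, unsigned char trail)
{
    return lead >= 0xB0 && lead <= 0xF7 && trail >= 0xA1 && trail <= 0xFE;
}

/* Scans n bytes that are ASSUMED to be GBK text and returns 1 as soon as
 * a GBK/2 Chinese character is found, 0 otherwise. */
int contains_gbk2_hanzi(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < n) {
        /* Bytes below 0x81 and the byte 0xFF stand alone in GBK; a lead
         * byte with nothing after it is skipped the same way. */
        if (s[i] < 0x81 || s[i] == 0xFF || i + 1 >= n) {
            i++;
            continue;
        }
        if (is_gbk2_hanzi(s[i], s[i + 1]))
            return 1;
        i += 2;                 /* step over the whole double-byte character */
    }
    return 0;
}
```

With the GBK codes listed earlier, `is_gbk2_hanzi(0xD6, 0xD0)` (中) and `is_gbk2_hanzi(0xCE, 0xC4)` (文) are both inside the block, while a symbol-area code such as A1A1 is not.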

Regular expressions

Regular Expressions: A 30-Minute Introductory Tutorial

Reposted from: version v2.33 (2013-1-10), author: deerchao

Get XRegExp 2.0: minified (3.5 KB gzipped), or with comments. Get the full package or the latest development build at GitHub.

Regular expressions in Java, link: Regular Expressions of Java Tutorial

Java regular expression tutorial

  1. <!--<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  2. <!-- saved from url=(0029)http://tool.chinaz.com/ -->
  3. <html xmlns="http://www.w3.org/1999/xhtml">
  4. <head>
  5. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  6. <title>正则表达式在线测试 - 站长工具</title>
  7. <meta name="keywords" content="正则表达式在线测试,正则表达式测试工具">
  8. <meta name="description" content="该工具主要针对程序开发人员,通过该工具可以快速准备的判断所写的正则是否能正确匹配相应的字符">
  9. <link rel="icon" href="http://tool.chinaz.com/Chinaz.ico" type="image/x-icon">
  10. <!--<link href="http://tool.chinaz.com/template/default/styles/toolsite.css?ver=2011_11" rel="stylesheet" type="text/css">-->
  11. <!--<script src="http://tool.chinaz.com/template/default/js/globals.js?ver=2011_10" type="text/javascript"></script>-->
  12. <!--<link rel="Stylesheet" type="text/css" href="http://tool.chinaz.com/template/default/styles/topbar.css">-->
  13. <script language="JavaScript">
  14. //if (self != top) { top.location = self.location; }
  15. </script>
  16. <style type="text/css">
  17. ul, ol, li, dl, dd, p, h1, h2, h3, h4, h5, h6, form, fieldset {
  18. margin: 0;
  19. padding: 0;
  20. }
  21.  
  22. h1, h2, h3, h4, h5, h6 {
  23. font-size: 1em;
  24. }
  25.  
  26. ul, ol {
  27. list-style: none;
  28. }
  29.  
  30. img {
  31. vertical-align: middle;
  32. border: 0;
  33. }
  34.  
  35. a {
  36. color: #0d8c21;
  37. text-decoration: none;
  38. }
  39.  
  40. a:hover {
  41. color: red;
  42. text-decoration: underline;
  43. }
  44.  
  45. select {
  46. font-size: 15px;
  47. }
  48.  
  49. .blue {
  50. color: #3333ff;
  51. }
  52.  
  53. .fl {
  54. float: left;
  55. }
  56.  
  57. .fr {
  58. float: right;
  59. }
  60.  
  61. .box h1 {
  62. line-height: 37px;
  63. height: 37px;
  64. padding-left: 20px;
  65. background: url("http://tool.chinaz.com/template/default/images/h1-bg.gif") repeat-x;
  66. color: #0066CC;
  67. border: 1px solid #c5e2f2;
  68. border-bottom: 0;
  69. font-size: 14px;
  70. font-weight: normal;
  71. }
  72.  
  73. .box h1 span {
  74. float: right;
  75. background: url("http://tool.chinaz.com/template/default/images/h1.gif") no-repeat right;
  76. padding: 10px 12px 0 0;
  77. font-weight: normal;
  78. }
  79.  
  80. .box h1 a {
  81. color: #3333ff;
  82. }
  83.  
  84. .box .titright {
  85. float: right;
  86. padding-right: 10px;
  87. }
  88.  
  89. .box .titleft {
  90. float: left;
  91. }
  92.  
  93. .box .notice {
  94. color: red;
  95. margin-bottom: 5px;
  96. background: none repeat scroll 0% 0% transparent;
  97. border: 1px solid rgb(197, 226, 242);
  98. }
  99.  
  100. .clear {
  101. clear: both;
  102. font-size: 0;
  103. line-height: 0;
  104. height: 10px;
  105. }
  106.  
  107. .input {
  108. border: 1px solid #94c6e1;
  109. background: #fff;
  110. color: #22ac38;
  111. font-weight: bold;
  112. padding: 5px;
  113. margin-bottom: 5px;
  114. }
  115.  
  116. .input {
  117. font-size: 13px;
  118. }
  119.  
  120. .but {
  121. width: 90px;
  122. border: 1px solid #c5e2f2;
  123. background: #cde4f2 url('http://tool.chinaz.com/template/default/images/but.gif') repeat-x 50% top;
  124. height: 30px;
  125. margin-left: 5px;
  126. cursor: pointer;
  127. margin-bottom: 5px;
  128. }
  129.  
  130. .but2 {
  131. border: 1px solid #c5e2f2;
  132. background: #cde4f2 url('http://tool.chinaz.com/template/default/images/but.gif') repeat-x 50% top;
  133. height: 30px;
  134. margin-left: 5px;
  135. cursor: pointer;
  136. margin-bottom: 5px;
  137. width: 90px;
  138. }
  139.  
  140. .but3 {
  141. border: 1px solid #c5e2f2;
  142. background: #cde4f2 url('http://tool.chinaz.com/template/default/images/but.gif') repeat-x 50% top;
  143. height: 30px;
  144. margin-left: 5px;
  145. cursor: pointer;
  146. margin-bottom: 5px;
  147. width: 50px;
  148. }
  149.  
  150. .but4 {
  151. border: 1px solid #c5e2f2;
  152. background: #cde4f2 url('http://tool.chinaz.com/template/default/images/but.gif') repeat-x 50% top;
  153. height: 30px;
  154. margin-left: 5px;
  155. cursor: pointer;
  156. margin-bottom: 5px;
  157. width: 120px;
  158. }
  159.  
  160. .input1 {
  161. border: 1px solid #7f9db9;
  162. background: #fff;
  163. color: #333;
  164. font-weight: bold;
  165. padding: 3px 5px;
  166. margin-bottom: 5px;
  167. }
  168.  
  169. .but1 {
  170. border: 1px solid #7f9db9;
  171. background: #f0f7fd;
  172. height: 23px;
  173. margin-left: 5px;
  174. cursor: pointer;
  175. overflow: visible;
  176. padding: 0 15px;
  177. margin-bottom: 5px;
  178. }
  179. /*w4648*/
  180. . {
  181. margin: auto;
  182. width: 900px;
  183. clear: both;
  184. }
  185.  
  186. td, th {
  187. border: 1px solid #C0C0C0;
  188. border-collapse: collapse;
  189. padding: 5px;
  190. }
  191.  
  192. table {
  193. border-collapse: collapse;
  194. border: 1px solid #C0C0C0;
  195. margin: 0 auto;
  196. }
  197.  
  198. .menu-list {
  199. z-index: 5;
  200. }
  201.  
  202. #mainbody {
  203. padding-top: 10px;
  204. padding-bottom: 10px;
  205. }
  206.  
  207. #condition ul li {
  208. float: left;
  209. }
  210.  
  211. #search {
  212. height: 180px;
  213. width: 99.75%;
  214. }
  215.  
  216. #input {
  217. height: 375px;
  218. margin-top: 10px;
  219. width: 99.75%;
  220. }
  221.  
  222. .smartField {
  223. border: 1px solid #CCCCCC;
  224. overflow: auto;
  225. position: relative;
  226. }
  227.  
  228. .smartField pre, .smartField textarea {
  229. width: 100%;
  230. padding: 0;
  231. margin: 0;
  232. font: 100% "courier new",monospace;
  233. }
  234.  
  235. .smartField pre {
  236. text-align: left;
  237. color: #F9F9F9;
  238. z-index: 1;
  239. }
  240.  
  241. .smartField textarea {
  242. background: none repeat scroll 0 0 transparent;
  243. border: 0 none;
  244. height: 100%;
  245. overflow: hidden;
  246. position: absolute;
  247. left: 0px;
  248. top: 0px;
  249. z-index: 2;
  250. }
  251.  
  252. b, i, u {
  253. font-style: normal;
  254. font-weight: normal;
  255. text-decoration: none;
  256. }
  257.  
  258. #input b {
  259. background: none repeat scroll 0 0 #FFF000;
  260. color: #FFF000;
  261. }
  262.  
  263. #input i {
  264. background: none repeat scroll 0 0 #80C0FF;
  265. color: #80C0FF;
  266. }
  267.  
  268. #search b {
  269. background: none repeat scroll 0 0 #AAD1F7;
  270. color: #AAD1F7;
  271. }
  272.  
  273. #search i {
  274. background: none repeat scroll 0 0 #F9CA69;
  275. color: #F9CA69;
  276. }
  277.  
  278. #search i b {
  279. background: none repeat scroll 0 0 #F7A700;
  280. color: #F7A700;
  281. }
  282.  
  283. #search i u {
  284. background: none repeat scroll 0 0 #EFBA4A;
  285. color: #EFBA4A;
  286. }
  287.  
  288. #search b.g1 {
  289. background: none repeat scroll 0 0 #D2F854;
  290. color: #D2F854;
  291. }
  292.  
  293. #search b.g2 {
  294. background: none repeat scroll 0 0 #9EC70C;
  295. color: #9EC70C;
  296. }
  297.  
  298. #search b.g3 {
  299. background: none repeat scroll 0 0 #ECC9F7;
  300. color: #ECC9F7;
  301. }
  302.  
  303. #search b.g4 {
  304. background: none repeat scroll 0 0 #54B70B;
  305. color: #54B70B;
  306. }
  307.  
  308. #search b.g5 {
  309. background: none repeat scroll 0 0 #B688CF;
  310. color: #B688CF;
  311. }
  312.  
  313. #search b.err {
  314. background: none repeat scroll 0 0 #FF4300 !important;
  315. color: #FF4300 !important;
  316. }
  317. </style>
  318. </head>
  319. <body>
  320. <div class="w4648">
  321. <!--main-->
  322. <div class="main">
  323. <div class="box">
  324. <div id="b_1">
  325. <h1><a style="color: #3333ff;" href="http://tool.chinaz.com/regex/">正则表达式在线测试</a></h1>
  326. <div class="box1" style="text-align:center;">
  327. <div id="condition">
  328. <ul>
  329. <li style=" display:none;"><input type="checkbox" checked="checked" id="toolG"><label for="toolG">全局</label></li>
  330. <li><input type="checkbox" id="toolI"><label for="toolI">不区分大小写</label></li>
  331. <li><input type="checkbox" id="toolM"><label for="toolM">对^$前后换行也支持</label></li>
  332. <li><input type="checkbox" id="toolS"><label for="toolS">符号.匹配所有</label></li>
  333. </ul>
  334. <span><input type="checkbox" checked="checked" id="highSyntax"><label for="highSyntax">对正则着色</label></span>
  335. <span><input type="checkbox" checked="checked" id="highMatch"><label for="highMatch">对匹配结果着色</label></span>
  336. <span><input type="checkbox" id="invertMatch"><label for="invertMatch">对无匹配结果着色</label></span>
  337. </div>
  338. <div id="mainbody">
  339. <div class="smartField" id="search">
  340. <textarea spellcheck="false" tabindex="1" rows="3" cols="100" id="searchText" style="height: 180px; margin-left: 0px; width: 856px;">(\w+\.){2}\w+</textarea>
  341. </div>
  342. <div class="smartField" id="input" style="height: 180px;">
  343. <textarea spellcheck="false" tabindex="2" rows="10" cols="100" id="inputText" style="height: 180px; margin-left: 0px; width: 856px;">tool.chinaz.com|888|http://www.cnblogs.com/Fang3s/p/4338103.html</textarea>
  344. </div>
  345. </div>
  346. </div>
  347. </div>
  348. </div>
  349. <!--<script type="text/javascript" src="http://tool.chinaz.com/template/default/js/regbase.js"></script>-->
  350. <!--<script type="text/javascript" src="http://tool.chinaz.com/template/default/js/reg.js"></script>-->
  351. <script type="text/javascript">
  352. //http://tool.chinaz.com/template/default/js/regbase.js
  353. /*
  354. XRegExp 0.2.5
  355. (c)Steven Levithan
  356. MIT license
  357.  
  358. Provides an augmented, cross-browser implementation of regular
  359. expressions, including support for additional flags.
  360. */
  361. (function () { if (window.XRegExp) return; var real = { RegExp: RegExp, exec: RegExp.prototype.exec, match: String.prototype.match, replace: String.prototype.replace }; var re = { extended: /(?:[^[#\s\\]+|\\(?:[\S\s]|$)|\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?)+|(\s*#[^\n\r\u2028\u2029]*\s*|\s+)([?*+]|{[0-9]+(?:,[0-9]*)?})?/g, singleLine: /(?:[^[\\.]+|\\(?:[\S\s]|$)|\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?)+|\./g, characterClass: /(?:[^\\[]+|\\(?:[\S\s]|$))+|\[\^?(]?)(?:[^\\\]]+|\\(?:[\S\s]|$))*]?/g, capturingGroup: /(?:[^[(\\]+|\\(?:[\S\s]|$)|\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?|\((?=\?))+|(\()(?:<([$\w]+)>)?/g, namedBackreference: /(?:[^\\[]+|\\(?:[^k]|$)|\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?|\\k(?!<[$\w]+>))+|\\k<([$\w]+)>([0-9]?)/g, replacementVariable: /(?:[^$]+|\$(?![1-9$&`']|{[$\w]+}))+|\$(?:([1-9]\d*|[$&`'])|{([$\w]+)})/g }; XRegExp = function (pattern, flags) { flags = flags || ""; if (flags.indexOf("x") > -1) { pattern = real.replace.call(pattern, re.extended, function ($0, $1, $2) { return $1 ? ($2 || "(?:)") : $0 }) }; var hasNamedCapture = false; if (flags.indexOf("k") > -1) { var captureNames = []; pattern = real.replace.call(pattern, re.capturingGroup, function ($0, $1, $2) { if ($1) { if ($2) hasNamedCapture = true; captureNames.push($2 || null); return "(" } else { return $0 } }); if (hasNamedCapture) { pattern = real.replace.call(pattern, re.namedBackreference, function ($0, $1, $2) { var index = $1 ? captureNames.indexOf($1) : -1; return index > -1 ? "\\" + (index + 1) + ($2 ? "(?:)" + $2 : "") : $0 }) } }; pattern = real.replace.call(pattern, re.characterClass, function ($0, $1) { return $1 ? real.replace.call($0, "]", "\\]") : $0 }); if (flags.indexOf("s") > -1) { pattern = real.replace.call(pattern, re.singleLine, function ($0) { return $0 === "." ? 
"[\\S\\s]" : $0 }) }; var regex = real.RegExp(pattern, real.replace.call(flags, /[sxk]+/g, "")); if (hasNamedCapture) regex._captureNames = captureNames; return regex }; RegExp.prototype.addFlags = function (flags) { flags = (flags || "") + (this.global ? "g" : "") + (this.ignoreCase ? "i" : "") + (this.multiline ? "m" : ""); var regex = new XRegExp(this.source, flags); if (!regex._captureNames && this._captureNames) regex._captureNames = this._captureNames.slice(0); return regex }; RegExp.prototype.exec = function (str) { var result = real.exec.call(this, str); if (!(this._captureNames && result && result.length > 1)) return result; for (var i = 1; i < result.length; i++) { var name = this._captureNames[i - 1]; if (name) result[name] = result[i] }; return result }; String.prototype.match = function (regex) { if (!regex._captureNames || regex.global) return real.match.call(this, regex); return regex.exec(this) }; String.prototype.replace = function (search, replacement) { if (!(search instanceof real.RegExp && search._captureNames)) return real.replace.apply(this, arguments); if (typeof replacement === "function") { return real.replace.call(this, search, function () { arguments[0] = new String(arguments[0]); for (var i = 0; i < search._captureNames.length; i++) { if (search._captureNames[i]) arguments[0][search._captureNames[i]] = arguments[i + 1] }; return replacement.apply(window, arguments) }) } else { return real.replace.call(this, search, function () { var args = arguments; return real.replace.call(replacement, re.replacementVariable, function ($0, $1, $2) { if ($1) { switch ($1) { case "$": return "$"; case "&": return args[0]; case "`": return args[args.length - 1].slice(0, args[args.length - 2]); case "'": return args[args.length - 1].slice(args[args.length - 2] + args[0].length); default: var literalNumbers = ""; $1 = +$1; while ($1 > search._captureNames.length) { literalNumbers = $1.split("").pop() + literalNumbers; $1 = Math.floor($1 / 10) }; return ($1 
? args[$1] : "$") + literalNumbers } } else if ($2) { var index = search._captureNames.indexOf($2); return index > -1 ? args[index + 1] : $0 } else { return $0 } }) }) } } })(); XRegExp.cache = function (pattern, flags) { var key = "/" + pattern + "/" + (flags || ""); return XRegExp.cache[key] || (XRegExp.cache[key] = new XRegExp(pattern, flags)) }; XRegExp.overrideNative = function () { RegExp = XRegExp }; if (!Array.prototype.indexOf) { Array.prototype.indexOf = function (item, from) { var len = this.length; for (var i = (from < 0) ? Math.max(0, len + from) : from || 0; i < len; i++) { if (this[i] === item) return i }; return -1 } }
  362.  
  363. // http://tool.chinaz.com/template/default/js/reg.js
  364. function $(el) { if (el.nodeName) return el; if (typeof el === "string") return document.getElementById(el); return false }; var trim = function () { var lSpace = /^\s\s*/, rSpace = /\s\s*$/; return function (str) { return str.replace(lSpace, "").replace(rSpace, "") } }(); function replaceHtml(el, html) { var oldEl = $(el); var newEl = oldEl.cloneNode(false); newEl.innerHTML = html; oldEl.parentNode.replaceChild(newEl, oldEl); return newEl }; function replaceOuterHtml(el, html) { el = replaceHtml(el, ""); if (el.outerHTML) { var id = el.id, className = el.className, nodeName = el.nodeName; el.outerHTML = "<" + nodeName + " id=\"" + id + "\" class=\"" + className + "\">" + html + "</" + nodeName + ">"; el = $(id) } else { el.innerHTML = html }; return el }; function getElementsByClassName(className, tagName, parentNode) { var els = ($(parentNode) || document).getElementsByTagName(tagName || "*"), results = []; for (var i = 0; i < els.length; i++) { if (hasClass(className, els[i])) results.push(els[i]) }; return results }; function hasClass(className, el) { return XRegExp.cache("(?:^|\\s)" + className + "(?:\\s|$)").test($(el).className) }; function addClass(className, el) { el = $(el); if (!hasClass(className, el)) { el.className = trim(el.className + " " + className) } }; function removeClass(className, el) { el = $(el); el.className = trim(el.className.replace(XRegExp.cache("(?:^|\\s)" + className + "(?:\\s|$)", "g"), " ")) }; function toggleClass(className, el) { if (hasClass(className, el)) { removeClass(className, el) } else { addClass(className, el) } }; function swapClass(oldClass, newClass, el) { removeClass(oldClass, el); addClass(newClass, el) }; function replaceSelection(textbox, str) { if (textbox.setSelectionRange) { var start = textbox.selectionStart, end = textbox.selectionEnd, offset = (start + str.length); textbox.value = (textbox.value.substring(0, start) + str + textbox.value.substring(end)); textbox.setSelectionRange(offset, offset) } else 
if (document.selection) { var range = document.selection.createRange(); range.text = str; range.select() } }; function extend(to, from) { for (var property in from) to[property] = from[property]; return to }; function purge(d) { var a = d.attributes, i, l, n; if (a) { l = a.length; for (i = 0; i < l; i += 1) { n = a[i].name; if (typeof d[n] === 'function') { d[n] = null } } }; a = d.childNodes; if (a) { l = a.length; for (i = 0; i < l; i += 1) { purge(d.childNodes[i]) } } }; var isWebKit = navigator.userAgent.indexOf("WebKit") > -1, isIE, isIE6 = isIE && !window.XMLHttpRequest; var RegexPal = { fields: { search: new SmartField("search"), input: new SmartField("input"), options: { flags: { g: $("toolG"), i: $("toolI"), m: $("toolM"), s: $("toolS") }, highlightSyntax: $("highSyntax"), highlightMatches: $("highMatch"), invertMatches: $("invertMatch") } } }; extend(RegexPal, function () { var f = RegexPal.fields, o = f.options; return { highlightMatches: function () { var re = { matchPair: /`~\{((?:[^}]+|\}(?!~`))*)\}~`((?:[^`]+|`(?!~\{(?:[^}]+|\}(?!~`))*\}~`))*)(?:`~\{((?:[^}]+|\}(?!~`))*)\}~`)?/g, sansTrailingAlternator: /^(?:[^\\|]+|\\[\S\s]?|\|(?=[\S\s]))*/ }; return function () { var search = String(f.search.textbox.value), input = String(f.input.textbox.value); if (XRegExp.cache('<[bB] class="?err"?>').test(f.search.bg.innerHTML) || (!search.length && !o.invertMatches.checked) || !o.highlightMatches.checked) { f.input.clearBg(); return }; try { var searchRegex = new XRegExp(re.sansTrailingAlternator.exec(search)[0], (o.flags.g.checked ? "g" : "") + (o.flags.i.checked ? "i" : "") + (o.flags.m.checked ? "m" : "") + (o.flags.s.checked ? 
"s" : "")) } catch (err) { f.input.clearBg(); return }; if (o.invertMatches.checked) { var output = ("`~{" + input.replace(searchRegex, "}~`$&`~{") + "}~`").replace(XRegExp.cache("`~\\{\\}~`|\\}~``~\\{", "g"), "") } else { var output = input.replace(searchRegex, "`~{$&}~`") }; output = output.replace(XRegExp.cache("[<&>]", "g"), "_").replace(re.matchPair, "<b>$1</b>$2<i>$3</i>"); f.input.setBgHtml(output) } }(), highlightSearchSyntax: function () { if (o.highlightSyntax.checked) { f.search.setBgHtml(parseRegex(f.search.textbox.value)) } else { f.search.clearBg() } } } }()); var parseRegex = function () { var re = { regexToken: /\[\^?]?(?:[^\\\]]+|\\[\S\s]?)*]?|\\(?:0(?:[0-3][0-7]{0,2}|[4-7][0-7]?)?|[1-9][0-9]*|x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|c[A-Za-z]|[\S\s]?)|\((?:\?[:=!]?)?|(?:[?*+]|\{[0-9]+(?:,[0-9]*)?\})\??|[^.?*+^${[()|\\]+|./g, characterClassParts: /^(<opening>\[\^?)(<contents>]?(?:[^\\\]]+|\\[\S\s]?)*)(<closing>]?)$/.addFlags("k"), characterClassToken: /[^\\-]+|-|\\(?:[0-3][0-7]{0,2}|[4-7][0-7]?|x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|c[A-Za-z]|[\S\s]?)/g, quantifier: /^(?:[?*+]|\{[0-9]+(?:,[0-9]*)?\})\??$/ }, type = { NONE: 0, RANGE_HYPHEN: 1, METACLASS: 2, ALTERNATOR: 3 }; function errorStr(str) { return '<b class="err">' + str + '</b>' }; function getTokenCharCode(token) { if (token.length > 1 && token.charAt(0) === "\\") { var t = token.slice(1); if (XRegExp.cache("^c[A-Za-z]$").test(t)) { return "ABCDEFGHIJKLMNOPQRSTUVWXYZ".indexOf(t.charAt(1).toUpperCase()) + 1 } else if (XRegExp.cache("^(?:x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4})$").test(t)) { return parseInt(t.slice(1), 16) } else if (XRegExp.cache("^(?:[0-3][0-7]{0,2}|[4-7][0-7]?)$").test(t)) { return parseInt(t, 8) } else if (t.length === 1 && "cuxDdSsWw".indexOf(t) > -1) { return false } else if (t.length === 1) { switch (t) { case "b": return 8; case "f": return 12; case "n": return 10; case "r": return 13; case "t": return 9; case "v": return 11; default: return t.charCodeAt(0) } } } else if (token !== "\\") { 
return token.charCodeAt(0) }; return false }; function parseCharacterClass(value) { var output = "", parts = re.characterClassParts.exec(value), parser = re.characterClassToken, lastToken = { rangeable: false, type: type.NONE }, match, m; output += parts.closing ? parts.opening : errorStr(parts.opening); while (match = parser.exec(parts.contents)) { m = match[0]; if (m.charAt(0) === "\\") { if (XRegExp.cache("^\\\\[cux]$").test(m)) { output += errorStr(m); lastToken = { rangeable: lastToken.type !== type.RANGE_HYPHEN } } else if (XRegExp.cache("^\\\\[dsw]$", "i").test(m)) { output += "<b>" + m + "</b>"; lastToken = { rangeable: lastToken.type !== type.RANGE_HYPHEN, type: type.METACLASS } } else if (m === "\\") { output += errorStr(m) } else { output += "<b>" + m.replace(XRegExp.cache("[<&>]"), "_") + "</b>"; lastToken = { rangeable: lastToken.type !== type.RANGE_HYPHEN, charCode: getTokenCharCode(m) } } } else if (m === "-") { if (lastToken.rangeable) { var lastIndex = parser.lastIndex, nextToken = parser.exec(parts.contents); if (nextToken) { var nextTokenCharCode = getTokenCharCode(nextToken[0]); if ((nextTokenCharCode !== false && lastToken.charCode > nextTokenCharCode) || lastToken.type === type.METACLASS || XRegExp.cache("^\\\\[dsw]$", "i").test(nextToken[0])) { output += errorStr("-") } else { output += "<u>-</u>" }; lastToken = { rangeable: false, type: type.RANGE_HYPHEN } } else { if (parts.closing) { output += "-" } else { output += "<u>-</u>"; break } }; parser.lastIndex = lastIndex } else { output += "-"; lastToken = { rangeable: lastToken.type !== type.RANGE_HYPHEN } } } else { output += m.replace(XRegExp.cache("[<&>]", "g"), "_"); lastToken = { rangeable: (m.length > 1 || lastToken.type !== type.RANGE_HYPHEN), charCode: m.charCodeAt(m.length - 1) } } }; return output + parts.closing }; return function (value) { var output = "", capturingGroupCount = 0, groupStyleDepth = 0, openGroups = [], lastToken = { quantifiable: false, type: type.NONE }, match, m; 
function groupStyleStr(str) { return '<b class="g' + groupStyleDepth + '">' + str + '</b>' }; while (match = re.regexToken.exec(value)) { m = match[0]; switch (m.charAt(0)) { case "[": output += "<i>" + parseCharacterClass(m) + "</i>"; lastToken = { quantifiable: true }; break; case "(": if (m.length === 2) { output += errorStr(m) } else { if (m.length === 1) capturingGroupCount++; groupStyleDepth = groupStyleDepth === 5 ? 1 : groupStyleDepth + 1; openGroups.push({ index: output.length + 14, opening: m }); output += groupStyleStr(m) }; lastToken = { quantifiable: false }; break; case ")": if (!openGroups.length) { output += errorStr(")"); lastToken = { quantifiable: false } } else { output += groupStyleStr(")"); lastToken = { quantifiable: !XRegExp.cache("^[=!]").test(openGroups[openGroups.length - 1].opening.charAt(2)), style: "g" + groupStyleDepth }; groupStyleDepth = groupStyleDepth === 1 ? 5 : groupStyleDepth - 1; openGroups.pop() }; break; case "\\": if (XRegExp.cache("^[1-9]").test(m.charAt(1))) { var nonBackrefDigits = "", num = +m.slice(1); while (num > capturingGroupCount) { nonBackrefDigits = XRegExp.cache("[0-9]$").exec(num)[0] + nonBackrefDigits; num = Math.floor(num / 10) }; if (num > 0) { output += "<b>\\" + num + "</b>" + nonBackrefDigits } else { var parts = XRegExp.cache("^\\\\([0-3][0-7]{0,2}|[4-7][0-7]?|[89])([0-9]*)").exec(m); output += "<b>\\" + parts[1] + "</b>" + parts[2] } } else if (XRegExp.cache("^[0bBcdDfnrsStuvwWx]").test(m.charAt(1))) { if (XRegExp.cache("^\\\\[cux]$").test(m)) { output += errorStr(m); lastToken = { quantifiable: false }; break }; output += "<b>" + m + "</b>"; if ("bB".indexOf(m.charAt(1)) > -1) { lastToken = { quantifiable: false }; break } } else if (m === "\\") { output += errorStr(m) } else { output += m.replace(XRegExp.cache("[<&>]"), "_") }; lastToken = { quantifiable: true }; break; default: if (re.quantifier.test(m)) { if (lastToken.quantifiable) { var interval = 
XRegExp.cache("^\\{([0-9]+)(?:,([0-9]*))?").exec(m); if (interval && ((interval[1] > 65535) || (interval[2] && ((interval[2] > 65535) || (+interval[1] > +interval[2]))))) { output += errorStr(m) } else { output += (lastToken.style ? '<b class="' + lastToken.style + '">' : '<b>') + m + '</b>' } } else { output += errorStr(m) }; lastToken = { quantifiable: false } } else if (m === "|") { if (lastToken.type === type.NONE || (lastToken.type === type.ALTERNATOR && !openGroups.length)) { output += errorStr(m) } else { output += openGroups.length ? groupStyleStr("|") : "<b>|</b>" }; lastToken = { quantifiable: false, type: type.ALTERNATOR } } else if ("^$".indexOf(m) > -1) { output += "<b>" + m + "</b>"; lastToken = { quantifiable: false } } else if (m === ".") { output += "<b>.</b>"; lastToken = { quantifiable: true } } else { output += m.replace(XRegExp.cache("[<&>]", "g"), "_"); lastToken = { quantifiable: true } } } }; var numCharsAdded = 0; for (var i = 0; i < openGroups.length; i++) { var errorIndex = openGroups[i].index + numCharsAdded; output = (output.slice(0, errorIndex) + errorStr(openGroups[i].opening) + output.slice(errorIndex + openGroups[i].opening.length)); numCharsAdded += errorStr("").length }; return output } }(); function SmartField(el) { el = $(el); var textboxEl = el.getElementsByTagName("textarea")[0], bgEl = document.createElement("pre"); textboxEl.id = el.id + "Text"; bgEl.id = el.id + "Bg"; el.insertBefore(bgEl, textboxEl); textboxEl.onkeydown = function (e) { SmartField.prototype._onKeyDown(e) }; textboxEl.onkeyup = function (e) { SmartField.prototype._onKeyUp(e) }; if (isIE) el.style.overflowX = "hidden"; if (isWebKit) textboxEl.style.marginLeft = 0; this.field = el; this.textbox = textboxEl; this.bg = bgEl }; extend(SmartField.prototype, { setBgHtml: function (html) { html = html.replace(XRegExp.cache("^\\n"), "\n\n"); this.bg = replaceOuterHtml(this.bg, html + "<br>&nbsp;"); this.setDimensions() }, clearBg: function () { 
this.setBgHtml(this.textbox.value.replace(XRegExp.cache("[<&>]", "g"), "_")) }, setDimensions: function () { this.textbox.style.width = ""; var scrollWidth = this.textbox.scrollWidth, offsetWidth = this.textbox.offsetWidth; this.textbox.style.width = (scrollWidth === offsetWidth ? offsetWidth - 1 : scrollWidth + 8) + "px"; this.textbox.style.height = Math.max(this.bg.offsetHeight, this.field.offsetHeight - 2) + "px" }, _onKeyDown: function (e) { e = e || event; if (!this._filterKeys(e)) return false; var srcEl = e.srcElement || e.target; switch (srcEl) { case RegexPal.fields.search.textbox: setTimeout(function () { RegexPal.highlightSearchSyntax.call(RegexPal) }, 0); break }; if (isWebKit && srcEl.selectionEnd === srcEl.value.length) { srcEl.parentNode.scrollTop = srcEl.scrollHeight }; this._testKeyHold(e) }, _onKeyUp: function (e) { e = e || event; var srcEl = e.srcElement || e.target; this._keydownCount = 0; if (this._matchOnKeyUp) { this._matchOnKeyUp = false; switch (srcEl) { case RegexPal.fields.search.textbox: case RegexPal.fields.input.textbox: RegexPal.highlightMatches(); break } } }, _testKeyHold: function (e) { var srcEl = e.srcElement || e.target; this._keydownCount++; if (this._keydownCount > 2) { RegexPal.fields.input.clearBg(); this._matchOnKeyUp = true } else { switch (srcEl) { case RegexPal.fields.search.textbox: case RegexPal.fields.input.textbox: setTimeout(function () { RegexPal.highlightMatches.call(RegexPal) }, 0); break } } }, _filterKeys: function (e) { var srcEl = e.srcElement || e.target, f = RegexPal.fields; if (this._deadKeys.indexOf(e.keyCode) > -1) return false; if ((e.keyCode === 9) && (srcEl === f.input.textbox || (srcEl === f.search.textbox && !e.shiftKey))) { if (srcEl === f.input.textbox) { if (e.shiftKey) { f.search.textbox.focus() } else { replaceSelection(srcEl, "\t"); if (window.opera) setTimeout(function () { srcEl.focus() }, 0) } } else { f.input.textbox.focus() }; if (e.preventDefault) e.preventDefault(); else e.returnValue 
= false }; return true }, _matchOnKeyUp: false, _keydownCount: 0, _deadKeys: [16, 17, 18, 19, 20, 27, 33, 34, 35, 36, 37, 38, 39, 40, 44, 45, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 144, 145] }); (function () { var f = RegexPal.fields, o = f.options; onresize = function (e) { var isIE1 = !!window.ActiveXObject; var isIE61 = isIE1 && !window.XMLHttpRequest; if (isIE61) f.input.field.style.height = Math.max((window.innerHeight || document.documentElement.clientHeight) - 310, 180) + "px"; else f.input.field.style.height = Math.max((window.innerHeight || document.documentElement.clientHeight) - 610, 180) + "px"; f.search.setDimensions(); f.input.setDimensions() }; onresize(); RegexPal.highlightSearchSyntax(); RegexPal.highlightMatches(); for (var flag in o.flags) { o.flags[flag].onclick = RegexPal.highlightMatches }; o.highlightSyntax.onclick = RegexPal.highlightSearchSyntax; o.highlightMatches.onclick = RegexPal.highlightMatches; o.invertMatches.onclick = RegexPal.highlightMatches; function makeResetter(field) { return function () { field.clearBg(); field.textbox.value = ""; field.textbox.onfocus = null } }; if (f.search.textbox.value == "(\\w+\\.){2}\\w+") { f.search.textbox.onfocus = makeResetter(f.search) }; if (f.input.textbox.value === "tool.chinaz.com|888") { f.input.textbox.onfocus = makeResetter(f.input) } })();
  365. </script>
  366.  
  367. <div class="box">
  368. <div id="b_14">
  369. <h1>工具简介</h1>
  370. <div class="box1">
  371. <span class="info2" style=" font-size: 14px; line-height: 24px; text-align: left;white-space:normal; width:860px;overflow:hidden;">
  372. <span style=" font-weight:bold; color:Red;">What exactly is a regular expression?</span><br>When writing programs or web pages that process strings, you often need to find strings that match some complex rule. Regular expressions are the tool for describing such rules; in other words, a regular expression is code that records a text rule.<br><span style=" font-weight:bold; color:Red;">Common metacharacters</span><br><table cellspacing="0"><thead><tr><th scope="col">Code</th><th scope="col">Description</th></tr></thead><tbody><tr><td><span class="code">.</span></td><td><span class="desc">Matches any character except a newline</span></td></tr><tr><td><span class="code">\w</span></td><td><span class="desc">Matches a letter, digit, underscore, or Chinese character</span></td></tr><tr><td><span class="code">\s</span></td><td><span class="desc">Matches any whitespace character</span></td></tr><tr><td><span class="code">\d</span></td><td><span class="desc">Matches a digit</span></td></tr><tr><td><span class="code">\b</span></td><td><span class="desc">Matches the beginning or end of a word</span></td></tr><tr><td><span class="code">^</span></td><td><span class="desc">Matches the start of the string</span></td></tr><tr><td><span class="code">$</span></td><td><span class="desc">Matches the end of the string</span></td></tr></tbody></table><br><span style=" font-weight:bold; color:Red;">Common quantifiers</span><br><table cellspacing="0"><thead><tr><th scope="col">Code/Syntax</th><th scope="col">Description</th></tr></thead><tbody><tr><td><span class="code">*</span></td><td><span class="desc">Repeats zero or more times</span></td></tr><tr><td><span class="code">+</span></td><td><span class="desc">Repeats one or more times</span></td></tr><tr><td><span class="code">?</span></td><td><span class="desc">Repeats zero or one time</span></td></tr><tr><td><span class="code">{n}</span></td><td><span class="desc">Repeats exactly n times</span></td></tr><tr><td><span class="code">{n,}</span></td><td><span class="desc">Repeats n or more times</span></td></tr><tr><td><span class="code">{n,m}</span></td><td><span class="desc">Repeats n to m times</span></td></tr></tbody></table><br><span style=" font-weight:bold; color:Red;">Common negated classes</span><br><table cellspacing="0"><thead><tr><th scope="col">Code/Syntax</th><th scope="col">Description</th></tr></thead><tbody><tr><td><span class="code">\W</span></td><td><span class="desc">Matches any character that is not a letter, digit, underscore, or Chinese character</span></td></tr><tr><td><span class="code">\S</span></td><td><span class="desc">Matches any character that is not whitespace</span></td></tr><tr><td><span class="code">\D</span></td><td><span class="desc">Matches any character that is not a digit</span></td></tr><tr><td><span class="code">\B</span></td><td><span class="desc">Matches a position that is not the beginning or end of a word</span></td></tr><tr><td><span class="code">[^x]</span></td><td><span class="desc">Matches any character except x</span></td></tr><tr><td><span class="code">[^aeiou]</span></td><td><span class="desc">Matches any character except the letters a, e, i, o, u</span></td></tr></tbody></table><br>
  373. </span>
  374. </div>
  375. </div>
  376. <div style=" height:5px;"></div>
  377. </div>
  378. </div>
  379. </div>
  380. </body>
  381. </html>-->

Online regular expression tester

    /* Length in bytes of a UTF-8 sequence, indexed by its first byte.
     * Lead-byte thresholds: 0xc0, 0xe0, 0xf0, 0xf8, 0xfc.
     * Continuation bytes (0x80-0xBF) and 0xFE/0xFF map to 1 so that a scan
     * always advances. */
    #include <stdio.h>
    #include <string.h>

    unsigned char mblen_table_utf8[] =
    {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x00 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0x80 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* 0xC0 */
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* 0xE0 */
        4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1  /* 0xF0 */
    };

    #define BYTE_WIDTH_UTF8(x) mblen_table_utf8[(unsigned char)(x)]

    /* Save this file as UTF-8 so the string literal holds UTF-8 bytes. */
    int main(void)
    {
        char str[] = "hello,中文字";
        size_t len = strlen(str);
        int utf8CharLen;
        for (size_t i = 0; i < len; i += utf8CharLen)
        {
            utf8CharLen = BYTE_WIDTH_UTF8(str[i]);
            printf("str[%zu] is a character with %d bytes\n", i, utf8CharLen);
        }
        getchar();
        return 0;
    }
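The lookup table can also be expressed as a chain of comparisons against the same lead-byte thresholds (0xc0, 0xe0, 0xf0, 0xf8, 0xfc). The following is a minimal sketch, not part of the original listing: `utf8_lead_width` and `utf8_count` are hypothetical names, and the code assumes the conventional table values. It classifies by the first byte only and does not validate continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a first byte.
 * Continuation bytes and 0xFE/0xFF return 1 so a scan always advances. */
static int utf8_lead_width(unsigned char b)
{
    if (b < 0x80) return 1; /* ASCII */
    if (b < 0xC0) return 1; /* stray continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return 1;               /* 0xFE and 0xFF never appear in UTF-8 */
}

/* Hypothetical helper: count characters in len bytes by hopping lead bytes. */
static size_t utf8_count(const char *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        i += (size_t)utf8_lead_width((unsigned char)s[i]);
        count++;
    }
    return count;
}
```

For the sample string "hello,中文字" (15 bytes in UTF-8), `utf8_count` reports 9 characters: six ASCII bytes plus three 3-byte sequences.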

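The functions below decode and encode UTF-8 by hand.

If a run of bytes holds the actual bytes of a UTF-8 encoded string, decoding it works much like constructing a Java String with public String(byte bytes[], String charsetName) and then reading its code points.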

    /* UTF8 utilities */

    /* This parses a UTF8 string one character at a time. It is passed a pointer
     * to the string and the length of the string. It sets 'value' to the value of
     * the current character. It returns the number of characters read or a
     * negative error code:
     * -1 = string too short
     * -2 = illegal character
     * -3 = subsequent characters not of the form 10xxxxxx
     * -4 = character encoded incorrectly (not minimal length).
     */

    int UTF8_getc(const unsigned char *str, int len, unsigned long *val)
    {
        const unsigned char *p;
        unsigned long value;
        int ret;
        if(len <= 0) return 0;
        p = str;

        /* Check syntax and work out the encoded value (if correct) */
        if((*p & 0x80) == 0) {
            value = *p++ & 0x7f;
            ret = 1;
        } else if((*p & 0xe0) == 0xc0) {
            if(len < 2) return -1;
            if((p[1] & 0xc0) != 0x80) return -3;
            value = (*p++ & 0x1f) << 6;
            value |= *p++ & 0x3f;
            if(value < 0x80) return -4;
            ret = 2;
        } else if((*p & 0xf0) == 0xe0) {
            if(len < 3) return -1;
            if( ((p[1] & 0xc0) != 0x80)
               || ((p[2] & 0xc0) != 0x80) ) return -3;
            value = (*p++ & 0xf) << 12;
            value |= (*p++ & 0x3f) << 6;
            value |= *p++ & 0x3f;
            if(value < 0x800) return -4;
            ret = 3;
        } else if((*p & 0xf8) == 0xf0) {
            if(len < 4) return -1;
            if( ((p[1] & 0xc0) != 0x80)
               || ((p[2] & 0xc0) != 0x80)
               || ((p[3] & 0xc0) != 0x80) ) return -3;
            value = ((unsigned long)(*p++ & 0x7)) << 18;
            value |= (*p++ & 0x3f) << 12;
            value |= (*p++ & 0x3f) << 6;
            value |= *p++ & 0x3f;
            if(value < 0x10000) return -4;
            ret = 4;
        } else if((*p & 0xfc) == 0xf8) {
            if(len < 5) return -1;
            if( ((p[1] & 0xc0) != 0x80)
               || ((p[2] & 0xc0) != 0x80)
               || ((p[3] & 0xc0) != 0x80)
               || ((p[4] & 0xc0) != 0x80) ) return -3;
            value = ((unsigned long)(*p++ & 0x3)) << 24;
            value |= ((unsigned long)(*p++ & 0x3f)) << 18;
            value |= ((unsigned long)(*p++ & 0x3f)) << 12;
            value |= (*p++ & 0x3f) << 6;
            value |= *p++ & 0x3f;
            if(value < 0x200000) return -4;
            ret = 5;
        } else if((*p & 0xfe) == 0xfc) {
            if(len < 6) return -1;
            if( ((p[1] & 0xc0) != 0x80)
               || ((p[2] & 0xc0) != 0x80)
               || ((p[3] & 0xc0) != 0x80)
               || ((p[4] & 0xc0) != 0x80)
               || ((p[5] & 0xc0) != 0x80) ) return -3;
            value = ((unsigned long)(*p++ & 0x1)) << 30;
            value |= ((unsigned long)(*p++ & 0x3f)) << 24;
            value |= ((unsigned long)(*p++ & 0x3f)) << 18;
            value |= ((unsigned long)(*p++ & 0x3f)) << 12;
            value |= (*p++ & 0x3f) << 6;
            value |= *p++ & 0x3f;
            if(value < 0x4000000) return -4;
            ret = 6;
        } else return -2;
        *val = value;
        return ret;
    }

    /* This takes a character 'value' and writes the UTF8 encoded value in
     * 'str' where 'str' is a buffer containing 'len' characters. Returns
     * the number of characters written or -1 if 'len' is too small. 'str' can
     * be set to NULL in which case it just returns the number of characters.
     * It will need at most 6 characters.
     */

    int UTF8_putc(unsigned char *str, int len, unsigned long value)
    {
        if(!str) len = 6; /* Maximum we will need */
        else if(len <= 0) return -1;
        if(value < 0x80) {
            if(str) *str = (unsigned char)value;
            return 1;
        }
        if(value < 0x800) {
            if(len < 2) return -1;
            if(str) {
                *str++ = (unsigned char)(((value >> 6) & 0x1f) | 0xc0);
                *str = (unsigned char)((value & 0x3f) | 0x80);
            }
            return 2;
        }
        if(value < 0x10000) {
            if(len < 3) return -1;
            if(str) {
                *str++ = (unsigned char)(((value >> 12) & 0xf) | 0xe0);
                *str++ = (unsigned char)(((value >> 6) & 0x3f) | 0x80);
                *str = (unsigned char)((value & 0x3f) | 0x80);
            }
            return 3;
        }
        if(value < 0x200000) {
            if(len < 4) return -1;
            if(str) {
                *str++ = (unsigned char)(((value >> 18) & 0x7) | 0xf0);
                *str++ = (unsigned char)(((value >> 12) & 0x3f) | 0x80);
                *str++ = (unsigned char)(((value >> 6) & 0x3f) | 0x80);
                *str = (unsigned char)((value & 0x3f) | 0x80);
            }
            return 4;
        }
        if(value < 0x4000000) {
            if(len < 5) return -1;
            if(str) {
                *str++ = (unsigned char)(((value >> 24) & 0x3) | 0xf8);
                *str++ = (unsigned char)(((value >> 18) & 0x3f) | 0x80);
                *str++ = (unsigned char)(((value >> 12) & 0x3f) | 0x80);
                *str++ = (unsigned char)(((value >> 6) & 0x3f) | 0x80);
                *str = (unsigned char)((value & 0x3f) | 0x80);
            }
            return 5;
        }
        if(len < 6) return -1;
        if(str) {
            *str++ = (unsigned char)(((value >> 30) & 0x1) | 0xfc);
            *str++ = (unsigned char)(((value >> 24) & 0x3f) | 0x80);
            *str++ = (unsigned char)(((value >> 18) & 0x3f) | 0x80);
            *str++ = (unsigned char)(((value >> 12) & 0x3f) | 0x80);
            *str++ = (unsigned char)(((value >> 6) & 0x3f) | 0x80);
            *str = (unsigned char)((value & 0x3f) | 0x80);
        }
        return 6;
    }

http://www.leonerd.org.uk/code/libtickit/doc/

tickit_string_putchar(3) tickit_string_putchar - append a UTF-8 encoded codepoint to a buffer
tickit_string_seqlen(3) tickit_string_seqlen - determine the length of a UTF-8 codepoint encoding
    #include <stddef.h>
    #include <stdint.h>

    /* Decode one code point from str; returns bytes consumed, or -1 on error. */
    static int next_utf8(const char *str, size_t len, uint32_t *cp)
    {
        unsigned char b0;
        int nbytes;

        if(!len)
            return -1;

        /* Read the first byte only after confirming the buffer is non-empty */
        b0 = (str++)[0];

        if(!b0)
            return -1;
        else if(b0 < 0x80) { // ASCII
            *cp = b0; return 1;
        }
        else if(b0 < 0xc0) // C1 or continuation
            return -1;
        else if(b0 < 0xe0) {
            nbytes = 2; *cp = b0 & 0x1f;
        }
        else if(b0 < 0xf0) {
            nbytes = 3; *cp = b0 & 0x0f;
        }
        else if(b0 < 0xf8) {
            nbytes = 4; *cp = b0 & 0x07;
        }
        else
            return -1;

        if(len < (size_t)nbytes)
            return -1;

        for(int i = 1; i < nbytes; i++) {
            b0 = (str++)[0];
            if(!b0)
                return -1;

            *cp <<= 6;
            *cp |= b0 & 0x3f;
        }

        return nbytes;
    }

    int tickit_string_seqlen(long codepoint)
    {
        if(codepoint < 0x0000080) return 1;
        if(codepoint < 0x0000800) return 2;
        if(codepoint < 0x0010000) return 3;
        if(codepoint < 0x0200000) return 4;
        if(codepoint < 0x4000000) return 5;
        return 6;
    }

    size_t tickit_string_putchar(char *str, size_t len, long codepoint)
    {
        int nbytes = tickit_string_seqlen(codepoint);
        if(!str)
            return nbytes;
        if(len < (size_t)nbytes)
            return -1; /* converts to (size_t)-1, i.e. SIZE_MAX */

        // This is easier done backwards
        int b = nbytes;
        while(b > 1) {
            b--;
            str[b] = 0x80 | (codepoint & 0x3f);
            codepoint >>= 6;
        }

        switch(nbytes) {
            case 1: str[0] = (codepoint & 0x7f); break;
            case 2: str[0] = 0xc0 | (codepoint & 0x1f); break;
            case 3: str[0] = 0xe0 | (codepoint & 0x0f); break;
            case 4: str[0] = 0xf0 | (codepoint & 0x07); break;
            case 5: str[0] = 0xf8 | (codepoint & 0x03); break;
            case 6: str[0] = 0xfc | (codepoint & 0x01); break;
        }

        return nbytes;
    }

Computing how many bytes a code point occupies once it is converted to UTF-8:

    /* The following functions copied and adapted from libtermkey
     *
     * http://www.leonerd.org.uk/code/libtermkey/
     */
    static inline unsigned int utf8_seqlen(long codepoint)
    {
        if(codepoint < 0x0000080) return 1;
        if(codepoint < 0x0000800) return 2;
        if(codepoint < 0x0010000) return 3;
        if(codepoint < 0x0200000) return 4;
        if(codepoint < 0x4000000) return 5;
        return 6;
    }

    /* Does NOT NUL-terminate the buffer */
    static int fill_utf8(long codepoint, char *str)
    {
        int nbytes = utf8_seqlen(codepoint);

        // This is easier done backwards
        int b = nbytes;
        while(b > 1) {
            b--;
            str[b] = 0x80 | (codepoint & 0x3f);
            codepoint >>= 6;
        }

        switch(nbytes) {
            case 1: str[0] = (codepoint & 0x7f); break;
            case 2: str[0] = 0xc0 | (codepoint & 0x1f); break;
            case 3: str[0] = 0xe0 | (codepoint & 0x0f); break;
            case 4: str[0] = 0xf0 | (codepoint & 0x07); break;
            case 5: str[0] = 0xf8 | (codepoint & 0x03); break;
            case 6: str[0] = 0xfc | (codepoint & 0x01); break;
        }

        return nbytes;
    }
    /* end copy */

liblinebreak-2.0 : Line breaking in a Unicode sequence. Designed to be used in a generic text renderer.

    #include <assert.h>
    #include <stddef.h>

    typedef unsigned char utf8_t; /**< Type for UTF-8 data points */
    typedef unsigned short utf16_t; /**< Type for UTF-16 data points */
    typedef unsigned int utf32_t; /**< Type for UTF-32 data points */

    /* EOS is the end-of-input marker defined in liblinebreak's headers. */

    /**
     * Gets the next Unicode character in a UTF-8 sequence. The index will
     * be advanced to the next complete character, unless the end of string
     * is reached in the middle of a UTF-8 sequence.
     *
     * @param[in] s input UTF-8 string
     * @param[in] len length of the string in bytes
     * @param[in,out] ip pointer to the index
     * @return the Unicode character beginning at the index; or
     * #EOS if end of input is encountered
     */
    utf32_t lb_get_next_char_utf8(
        const utf8_t *s,
        size_t len,
        size_t *ip)
    {
        utf8_t ch;
        utf32_t res;

        assert(*ip <= len);
        if (*ip == len)
            return EOS;
        ch = s[*ip];

        if (ch < 0xC2 || ch > 0xF4)
        { /* One-byte sequence, tail (should not occur), or invalid */
            *ip += 1;
            return ch;
        }
        else if (ch < 0xE0)
        { /* Two-byte sequence */
            if (*ip + 2 > len)
                return EOS;
            res = ((ch & 0x1F) << 6) + (s[*ip + 1] & 0x3F);
            *ip += 2;
            return res;
        }
        else if (ch < 0xF0)
        { /* Three-byte sequence */
            if (*ip + 3 > len)
                return EOS;
            res = ((ch & 0x0F) << 12) +
                  ((s[*ip + 1] & 0x3F) << 6) +
                  ((s[*ip + 2] & 0x3F));
            *ip += 3;
            return res;
        }
        else
        { /* Four-byte sequence */
            if (*ip + 4 > len)
                return EOS;
            res = ((ch & 0x07) << 18) +
                  ((s[*ip + 1] & 0x3F) << 12) +
                  ((s[*ip + 2] & 0x3F) << 6) +
                  ((s[*ip + 3] & 0x3F));
            *ip += 4;
            return res;
        }
    }

    /**
     * Gets the next Unicode character in a UTF-16 sequence. The index will
     * be advanced to the next complete character, unless the end of string
     * is reached in the middle of a UTF-16 surrogate pair.
     *
     * @param[in] s input UTF-16 string
     * @param[in] len length of the string in words
     * @param[in,out] ip pointer to the index
     * @return the Unicode character beginning at the index; or
     * #EOS if end of input is encountered
     */
    utf32_t lb_get_next_char_utf16(
        const utf16_t *s,
        size_t len,
        size_t *ip)
    {
        utf16_t ch;

        assert(*ip <= len);
        if (*ip == len)
            return EOS;
        ch = s[(*ip)++];

        if (ch < 0xD800 || ch > 0xDBFF)
        { /* If the character is not a high surrogate */
            return ch;
        }
        if (*ip == len)
        { /* If the input ends here (an error) */
            --(*ip);
            return EOS;
        }
        if (s[*ip] < 0xDC00 || s[*ip] > 0xDFFF)
        { /* If the next character is not the low surrogate (an error) */
            return ch;
        }
        /* Return the constructed character and advance the index again */
        return (((utf32_t)ch & 0x3FF) << 10) + (s[(*ip)++] & 0x3FF) + 0x10000;
    }

    /**
     * Gets the next Unicode character in a UTF-32 sequence. The index will
     * be advanced to the next character.
     *
     * @param[in] s input UTF-32 string
     * @param[in] len length of the string in dwords
     * @param[in,out] ip pointer to the index
     * @return the Unicode character beginning at the index; or
     * #EOS if end of input is encountered
     */
    utf32_t lb_get_next_char_utf32(
        const utf32_t *s,
        size_t len,
        size_t *ip)
    {
        assert(*ip <= len);
        if (*ip == len)
            return EOS;
        return s[(*ip)++];
    }
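The surrogate-pair arithmetic in lb_get_next_char_utf16 can be checked by running it in both directions. The following is a minimal sketch rather than liblinebreak code: `to_surrogates` and `from_surrogates` are hypothetical names, and the input is assumed to be a code point in the range 0x10000 to 0x10FFFF.

```c
#include <stdint.h>

/* Hypothetical helper: split a supplementary code point into a UTF-16
 * high/low surrogate pair. Returns 1 on success, 0 if cp is out of range. */
static int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                            /* 20 bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* top 10 bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low 10 bits */
    return 1;
}

/* Hypothetical helper: recombine a pair using the same masks, shift,
 * and 0x10000 offset as the decoder above. */
static uint32_t from_surrogates(uint16_t hi, uint16_t lo)
{
    return (((uint32_t)hi & 0x3FF) << 10) + (lo & 0x3FF) + 0x10000;
}
```

For example, U+1F600 splits into the pair 0xD83D, 0xDE00, and recombining that pair gives 0x1F600 again.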

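A Java String is internally a char[] (an array of char), and each char is 16 bits, that is, 2 bytes. Take a character such as the Chinese character "中": encoded as UTF-8, its byte array is {-28, -72, -83}.

Represented as a Java String, the internal char[] is {0x4e2d} with length 1. Represented in C as a char[] or char*, it is stored directly as the bytes {-28, -72, -83}, with length 3.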

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class CharDemo {
        // Minimal stand-in for the bytesToHexString helper the snippet calls.
        static String bytesToHexString(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }

        public static void main(String[] args) {
            String one = "中";
            int unicodeCodeUnitCount = one.length(); // number of UTF-16 code units
            String unicodeCodePointValue = Integer.toHexString(one.codePointAt(0)); // Unicode code point value
            // Name the charset explicitly: getBytes() with no argument uses the platform default.
            byte[] storedBytesUtf8 = one.getBytes(StandardCharsets.UTF_8);
            char char16Bit = one.charAt(0);

            System.out.println("sizeof char = " + Character.SIZE + " (bits in Java)");
            System.out.println("one.length() = " + unicodeCodeUnitCount + ", one.codePointAt(0) = " + unicodeCodePointValue);
            System.out.println("one.getBytes().length = " + storedBytesUtf8.length + " : " + bytesToHexString(storedBytesUtf8)
                    + ", " + Arrays.toString(storedBytesUtf8));
            System.out.println("one.charAt(0) = " + char16Bit + " = " + Integer.toBinaryString(char16Bit) + " = "
                    + Integer.toHexString(char16Bit));
            System.out.println(Character.charCount(char16Bit));

            System.out.println(Integer.toHexString(
                    new String(new byte[] { -28, -72, -83 }, StandardCharsets.UTF_8).charAt(0)));

            // sizeof char = 16 (bits in Java)
            // one.length() = 1, one.codePointAt(0) = 4e2d
            // one.getBytes().length = 3 : e4b8ad, [-28, -72, -83]
            // one.charAt(0) = 中 = 100111000101101 = 4e2d
            // 1
            // 4e2d
        }
    }
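The C side of the same comparison can be sketched as follows. This is an illustrative example, not code from the original post: `utf8_zhong`, `zhong_byte_length`, and `zhong_byte_as_signed` are made-up names, and the bytes are written as hex escapes so the result does not depend on how the source file is saved.

```c
#include <stddef.h>
#include <string.h>

/* "中" (U+4E2D) as UTF-8 bytes, written with hex escapes. */
static const char utf8_zhong[] = "\xE4\xB8\xAD";

/* strlen counts bytes, not characters, so one Chinese character reports 3. */
static size_t zhong_byte_length(void)
{
    return strlen(utf8_zhong);
}

/* View byte i as a signed 8-bit value, which is how Java's byte prints it. */
static int zhong_byte_as_signed(size_t i)
{
    return (int)(signed char)utf8_zhong[i];
}
```

Here `zhong_byte_length()` is 3, and the three bytes read as signed values are -28, -72, -83, the same numbers Java prints for the byte array.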

UTF-8 is a variable-length encoding:

Unicode range (hex)   | Bytes | First-byte range | Binary layout
----------------------+-------+------------------+------------------------------------------------------
0000 0000 - 0000 007F | 1     | [0x00, 0x7F]     | 0xxxxxxx
0000 0080 - 0000 07FF | 2     | [0xC0, 0xE0)     | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF | 3     | [0xE0, 0xF0)     | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 001F FFFF | 4     | [0xF0, 0xF8)     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000 - 03FF FFFF | 5     | [0xF8, 0xFC)     | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 - 7FFF FFFF | 6     | [0xFC, 0xFE)     | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 5- and 6-byte forms belong to the original UTF-8 definition; RFC 3629 limits UTF-8 to at most four bytes, covering code points up to U+10FFFF.
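As a worked example of the three-byte row, U+4E2D ("中") is 0100 1110 0010 1101 in binary; splitting it into 4 + 6 + 6 bits and adding the 1110 / 10 / 10 prefixes gives E4 B8 AD. A minimal sketch in C, where `encode_3byte` is a hypothetical helper that handles only this one row and does not reject surrogate code points:

```c
#include <stdint.h>

/* Hypothetical helper: encode a code point in 0x0800..0xFFFF as three
 * UTF-8 bytes. Returns 3 on success, 0 if cp is outside that row. */
static int encode_3byte(uint32_t cp, unsigned char out[3])
{
    if (cp < 0x0800 || cp > 0xFFFF)
        return 0;
    out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx: top 4 bits */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); /* 10xxxxxx: middle 6 bits */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx: low 6 bits */
    return 3;
}
```

These are the same bytes that appear as \xE4\xB8\xAD in the grep pattern for "中".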


http://ideone.com/

    #include <stdio.h>
    #include <stddef.h>

    typedef unsigned char utf8_t; //< Type for UTF-8 data points
    typedef unsigned short utf16_t; //< Type for UTF-16 data points
    typedef unsigned int utf32_t; //< Type for UTF-32 data points

    #define STRING_TOO_SHORT (-1) //string too short
    #define ILLEGAL_CHARACTER (-2) //illegal character: does not start with 0xxxxxxx, 110xxxxx, 1110xxxx, etc.
    #define UNEXPECTED_CHARACTER (-3) //subsequent characters not of the form 10xxxxxx
    // -4 = character encoded incorrectly (not minimal length); not checked here.

    // Unicode range (hex)   | Bytes | First-byte range | Binary layout (UTF-8, variable length)
    //-----------------------+-------+------------------+---------------------------------------
    // 0000 0000 - 0000 007F | 1     | [0x00, 0x7F]     | 0xxxxxxx
    // 0000 0080 - 0000 07FF | 2     | [0xC0, 0xE0)     | 110xxxxx 10xxxxxx
    // 0000 0800 - 0000 FFFF | 3     | [0xE0, 0xF0)     | 1110xxxx 10xxxxxx 10xxxxxx
    // 0001 0000 - 001F FFFF | 4     | [0xF0, 0xF8)     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    // 0020 0000 - 03FF FFFF | 5     | [0xF8, 0xFC)     | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    // 0400 0000 - 7FFF FFFF | 6     | [0xFC, 0xFE)     | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    static inline int utf8_seqlen(utf32_t codepoint)
    {
        if (codepoint < 0x0000080) return 1;
        if (codepoint < 0x0000800) return 2;
        if (codepoint < 0x0010000) return 3;
        if (codepoint < 0x0200000) return 4;
        if (codepoint < 0x4000000) return 5;
        return 6;
    }

    // Returns the number of bytes written (or needed, if str is NULL),
    // or STRING_TOO_SHORT if the buffer is too small. Returns int, not
    // size_t, so the negative error code survives.
    static int utf8_putc(char *str, size_t len, utf32_t codepoint)
    {
        int nbytes = utf8_seqlen(codepoint);
        if (!str)
            return nbytes;
        if (len < (size_t)nbytes)
            return STRING_TOO_SHORT;

        // This is easier done backwards
        int b = nbytes;
        while (b > 1) {
            b--;
            str[b] = 0x80 | (codepoint & 0x3f);
            codepoint >>= 6;
        }

        switch (nbytes) {
            case 1: str[0] = (codepoint & 0x7f); break;
            case 2: str[0] = 0xc0 | (codepoint & 0x1f); break;
            case 3: str[0] = 0xe0 | (codepoint & 0x0f); break;
            case 4: str[0] = 0xf0 | (codepoint & 0x07); break;
            case 5: str[0] = 0xf8 | (codepoint & 0x03); break;
            case 6: str[0] = 0xfc | (codepoint & 0x01); break;
        }

        return nbytes;
    }

    // Returns the number of bytes read, or a negative error code.
    static int utf8_getc(const utf8_t *str, size_t len, utf32_t *cp)
    {
        unsigned char b0, b0mask;
        int nbytes; // number of bytes the next UTF-8 encoded character occupies

        if (len < 1)
            return STRING_TOO_SHORT;
        b0 = *str++; // read only after confirming the buffer is non-empty

        if (b0 < 0x80) { // ASCII
            *cp = b0; return 1;
        }else if (b0 < 0xc0){ // C1 or continuation
            return ILLEGAL_CHARACTER;
        }else if (b0 < 0xe0) {
            nbytes = 2; b0mask = 0x1f;
        }else if (b0 < 0xf0) {
            nbytes = 3; b0mask = 0x0f;
        }else if (b0 < 0xf8) {
            nbytes = 4; b0mask = 0x07;
        }else if (b0 < 0xfc){
            nbytes = 5; b0mask = 0x03;
        }else if (b0 < 0xfe){
            nbytes = 6; b0mask = 0x01;
        }else
            return ILLEGAL_CHARACTER;

        if (len < (size_t)nbytes)
            return STRING_TOO_SHORT;

        *cp = b0 & b0mask;
        for (int i = 1; i < nbytes; i++) {
            b0 = *str++;
            if ((b0 & 0xc0) != 0x80)
                return UNEXPECTED_CHARACTER;
            *cp <<= 6;
            *cp |= b0 & 0x3f;
        }

        return nbytes;
    }

    int main(void) {
        // Assumed sample input: "中" (U+4E2D) as UTF-8 bytes.
        const utf8_t sample[] = "\xE4\xB8\xAD";
        utf32_t cp = 0;
        int seqlen = utf8_getc(sample, sizeof sample - 1, &cp);
        printf("%d, 0x%x\n", seqlen, cp); // prints "3, 0x4e2d"
        return 0;
    }
