Processing Multi-line Logs with Python Regular Expressions: A Worked Example
For regular-expression basics, please refer to "Fundamentals of Regular Expressions". This article uses regular expressions to match multi-line log entries and parse the relevant information out of them.
Suppose we have a SQL slow log like the following:
SELECT * FROM open_app WHERE 1 and `client_id` = 'a08f5e32909cc9418f' and `is_valid` = '1' order by id desc limit 32700,100;
# Time: 160616 10:05:10
# User@Host: shuqin[qqqq] @ [1.1.1.1] Id: 46765069
# Schema: db_xxx Last_errno: 0 Killed: 0
# Query_time: 0.561383 Lock_time: 0.000048 Rows_sent: 100 Rows_examined: 191166 Rows_affected: 0
# Bytes_sent: 14653
SET timestamp=1466042710;
SELECT * FROM open_app WHERE 1 and `client_id` = 'a08f5e32909cc9418f' and `is_valid` = '1' order by id desc limit 36700,100;
# User@Host: shuqin[ssss] @ [2.2.2.2] Id: 46765069
# Schema: db_yyy Last_errno: 0 Killed: 0
# Query_time: 0.501094 Lock_time: 0.000042 Rows_sent: 100 Rows_examined: 192966 Rows_affected: 0
# Bytes_sent: 14966
SET timestamp=1466042727;
The task is to parse the relevant information out of this log. The key points are as follows:
(1) By default, a regex treats the input as a single unit: ^ and $ only anchor the whole string, so to make them match at the start and end of every line you need to enable multi-line mode (re.MULTILINE). The dot does not match newlines by default, so to match across lines you also need DOTALL mode (re.DOTALL). A small demo of points (1), (2) and (5) follows this list.
(2) To match each multi-line log entry, you must use non-greedy matching, i.e. append ? to .* ; otherwise the first match will swallow everything up to the end of the text.
(3) Divide and conquer. Writing one correct regular expression for a long string is not easy; the strategy here is to break the whole string into several substrings and match them separately. Here each substring is one line, and once a line is matched, finer-grained matching can be done within it.
(4) Whitespace shows up everywhere, so use \s* or \s+ to make the pattern robust; fixed literal strings act as good markers for the individual substrings in the regex and make them easier to match.
(5) Python's re module has two commonly used functions here: re.findall and re.match. The former returns a list whose elements are tuples, one tuple per multi-line log entry; each tuple element holds the string captured by the corresponding group. re.match returns a Match object, and group(n) retrieves the string captured by each group. The program below deliberately uses both: re.findall for the multi-line match and re.match for the intra-line matches. Beginners often ask what the difference between the two is; just try them and you will see.
(6) Use a map for the result structure. Once the results are parsed, they will have to be displayed or turned into a report, and a composite structure combining maps and lists is usually a very good fit. In this example, to show the details of all SQL log entries you can build
{"tablename1": [{sqlobj11}, {sqlobj12}], ..., "tablenameN": [{sqlobjN1}, {sqlobjN2}] }, where each sqlobj has the form
{"sql": "select xxx", "QueryTime": 0.5600, ...}
For a brief report, such as per-table SQL statistics, you can build
{"tablename1": {"sql11": 98, "sql12": 16}, ..., "tablenameN": {"sqlN1": 75, "sqlN2": 23} }
Python implementation:
import re

# Regexes: globalRegex matches one whole multi-line entry; costRegex and schemaRegex
# are used for the finer-grained matching inside single lines.
globalRegex = r'^\s*(.*?)# (User@Host:.*?)# (Schema:.*?)# (Query_time:.*?)# Bytes_sent:(.*?)SET timestamp=(\d+);\s*$'
costRegex = r'Query_time:\s*(.*)\s*Lock_time:\s*(.*)\s*Rows_sent:\s*(\d+)\s*Rows_examined:\s*(\d+)\s*Rows_affected:\s*(\d+)\s*'
schemaRegex = r'Schema:\s*(.*)\s*Last_errno:(.*)\s*Killed:\s*(.*)\s*'

def readSlowSqlFile(slowSqlFilename):
    f = open(slowSqlFilename)
    ftext = ''
    for line in f:
        ftext += line
    f.close()
    return ftext

def findInText(regex, text):
    # DOTALL + MULTILINE: see points (1) and (2) above
    return re.findall(regex, text, flags=re.DOTALL + re.MULTILINE)

def parseSql(sqlobj, sqlText):
    # the first captured group may contain the SQL plus a trailing '# Time: ...' line
    try:
        if sqlText.find('#') != -1:
            sqlobj['sql'] = sqlText.split('#')[0].strip()
            sqlobj['time'] = sqlText.split('#')[1].strip()
        else:
            sqlobj['sql'] = sqlText.strip()
            sqlobj['time'] = ''
    except:
        sqlobj['sql'] = sqlText.strip()

def parseCost(sqlobj, costText):
    matched = re.match(costRegex, costText)
    sqlobj['Cost'] = costText
    if matched:
        sqlobj['QueryTime'] = matched.group(1).strip()
        sqlobj['LockTime'] = matched.group(2).strip()
        sqlobj['RowsSent'] = int(matched.group(3))
        sqlobj['RowsExamined'] = int(matched.group(4))
        sqlobj['RowsAffected'] = int(matched.group(5))

def parseSchema(sqlobj, schemaText):
    matched = re.match(schemaRegex, schemaText)
    sqlobj['Schema'] = schemaText
    if matched:
        sqlobj['Schema'] = matched.group(1).strip()
        sqlobj['LastErrno'] = int(matched.group(2))
        sqlobj['Killed'] = int(matched.group(3))

def parseSQLObj(matched):
    sqlobj = {}
    try:
        if matched and len(matched) > 0:
            parseSql(sqlobj, matched[0].strip())
            sqlobj['UserHost'] = matched[1].strip()
            sqlobj['ByteSent'] = int(matched[4])
            sqlobj['timestamp'] = int(matched[5])
            parseCost(sqlobj, matched[3].strip())
            parseSchema(sqlobj, matched[2].strip())
        return sqlobj
    except:
        return sqlobj

if __name__ == '__main__':
    files = ['slow_sqls.txt']
    alltext = ''
    for f in files:
        text = readSlowSqlFile(f)
        alltext += text

    allmatched = findInText(globalRegex, alltext)
    tablenames = ['open_app']
    if not allmatched or len(allmatched) == 0:
        print 'No matches. exit.'
        exit(1)

    # group parsed sql objects by table name: {tablename: [sqlobj, ...]}
    sqlobjMap = {}
    for matched in allmatched:
        sqlobj = parseSQLObj(matched)
        if len(sqlobj) == 0:
            continue
        for tablename in tablenames:
            if sqlobj['sql'].find(tablename) != -1:
                if not sqlobjMap.get(tablename):
                    sqlobjMap[tablename] = []
                sqlobjMap[tablename].append(sqlobj)
                break

    # per-table SQL statistics: {tablename: {sql: count}}
    resultMap = {}
    for (tablename, sqlobjlist) in sqlobjMap.iteritems():
        sqlstat = {}
        for sqlobj in sqlobjlist:
            if sqlobj['sql'] not in sqlstat:
                sqlstat[sqlobj['sql']] = 0
            sqlstat[sqlobj['sql']] += 1
        resultMap[tablename] = sqlstat

    f_res = open('/tmp/res.txt', 'w')
    f_res.write('-------------------------------------: \n')
    f_res.write('Brief results: \n')
    for (tablename, sqlstat) in resultMap.iteritems():
        f_res.write('tablename: ' + tablename + '\n')
        sortedsqlstat = sorted(sqlstat.iteritems(), key=lambda d: d[1], reverse=True)
        for sortedsql in sortedsqlstat:
            f_res.write('sql = %s\ncounts: %d\n\n' % (sortedsql[0], sortedsql[1]))
        f_res.write('-------------------------------------: \n\n')

    f_res.write('-------------------------------------: \n')
    f_res.write('Detail results: \n')
    for (tablename, sqlobjlist) in sqlobjMap.iteritems():
        f_res.write('tablename: ' + tablename + '\n')
        f_res.write('sqlinfo: \n')
        for sqlobj in sqlobjlist:
            f_res.write('sql: ' + sqlobj['sql'] + ' QueryTime: ' + str(sqlobj.get('QueryTime')) + ' LockTime: ' + str(sqlobj.get('LockTime')) + '\n')
            f_res.write(str(sqlobj) + '\n\n')
        f_res.write('-------------------------------------: \n')
    f_res.close()
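As a quick sanity check (my addition, not part of the original article), the snippet below runs globalRegex against the sample log shown at the top of this article. It assumes that sample has been saved as slow_sqls.txt and that the functions above are already in scope, for example pasted into the same file or an interactive session:

# Sanity-check sketch (assumption: the sample log above is stored in slow_sqls.txt,
# and globalRegex / readSlowSqlFile / findInText from the script above are in scope).
sampleText = readSlowSqlFile('slow_sqls.txt')
sampleMatches = findInText(globalRegex, sampleText)
print 'entries matched:', len(sampleMatches)          # the sample log holds 2 entries
for m in sampleMatches:
    print 'timestamp:', m[5], 'sql:', m[0].split('#')[0].strip()[:60]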
Making it configurable
In fact, this can be made configurable. As long as you provide the sets of keywords that separate the lines of an entry and the fields within each line, the multi-line entries and the intra-line fields can be split apart and the corresponding content extracted for each.
The basic building block is the function matchOneLine: given a list of keywords that split the content of one line in order, it matches that line and returns the content corresponding to each keyword. This function handles the intra-line matching.
The configuration is a list of lists: each inner list is the keyword list that splits and matches one line, and each keyword delimits one region of that line. To improve parsing performance, the regexes built from the keyword lists are precompiled, so that no work is repeated while the strings are being parsed.
The code is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import re

# config: line keywords used to separate lines and the fields within them.
ksconf = [['S'], ['# User@Host:', 'Id:'], ['# Schema:', 'Last_errno:', 'Killed:'], ['# Query_time:', 'Lock_time:', 'Rows_sent:', 'Rows_examined:', 'Rows_affected:'], ['# Bytes_sent:'], ['SET timestamp=']]
files = ['slow_sqls.txt']

#ksconf = [['id:'], ['name:'], ['able:']]
#files = ['stu.txt']

globalConf = {'ksconf': ksconf, 'files': files}

def produceRegex(keywordlistInOneLine):
    ''' build the regex to match keywords in the list of keywordlistInOneLine '''
    oneLineRegex = "^\s*"
    oneLineRegex += "(.*?)".join(keywordlistInOneLine)
    oneLineRegex += "(.*?)\s*$"
    return oneLineRegex

def readFile(filename):
    f = open(filename)
    ftext = ''
    for line in f:
        ftext += line
    f.close()
    return ftext

def readAllFiles(files):
    return ''.join(map(readFile, files))

def findInText(regex, text, linesConf):
    '''
    return a list of maps, each map is a match to multilines,
    in a map, key is the line keyword
    and value is the content corresponding to the key
    '''
    matched = regex.findall(text)
    if empty(matched):
        return []
    allMatched = []
    linePatternMap = buildLinePatternMap(linesConf)
    for onematch in matched:
        oneMatchedMap = buildOneMatchMap(linesConf, onematch, linePatternMap)
        allMatched.append(oneMatchedMap)
    return allMatched

def buildOneMatchMap(linesConf, onematch, linePatternMap):
    # rebuild each line from its leading keyword plus the captured content, then match within it
    sepLines = map(lambda ks: ks[0], linesConf)
    lenOflinesInOneMatch = len(sepLines)
    lineMatchedMap = {}
    for i in range(lenOflinesInOneMatch):
        lineContent = sepLines[i] + onematch[i].strip()
        linekey = getLineKey(linesConf[i])
        lineMatchedMap.update(matchOneLine(linesConf[i], lineContent, linePatternMap))
    return lineMatchedMap

def matchOneLine(keywordlistOneLine, lineContent, patternMap):
    '''
    match lineContent with a list of keywords , and return a map
    in which key is the keyword and value is the content matched the key.
    eg.
    keywordlistOneLine = ["host:", "ip:"] , lineContent = "host: qinhost ip: 1.1.1.1"
    return {"host": "qinhost", "ip": "1.1.1.1"}
    '''
    ksmatchedResult = {}
    if len(keywordlistOneLine) == 0 or lineContent.strip() == "":
        return {}
    linekey = getLineKey(keywordlistOneLine)
    if empty(patternMap):
        linePattern = getLinePattern(keywordlistOneLine)
    else:
        linePattern = patternMap.get(linekey)
    lineMatched = linePattern.findall(lineContent)
    if empty(lineMatched):
        return {}
    kslen = len(keywordlistOneLine)
    if kslen == 1:
        ksmatchedResult[cleankey(keywordlistOneLine[0])] = lineMatched[0].strip()
    else:
        for i in range(kslen):
            ksmatchedResult[cleankey(keywordlistOneLine[i])] = lineMatched[0][i].strip()
    return ksmatchedResult

def empty(obj):
    return obj is None or len(obj) == 0

def cleankey(dirtykey):
    ''' clean unused characters in key '''
    return re.sub(r"[# :]", "", dirtykey)

def printMatched(allMatched, linesConf):
    allks = []
    for kslist in linesConf:
        allks.extend(kslist)
    for matched in allMatched:
        for k in allks:
            print cleankey(k), "=>", matched.get(cleankey(k))
        print '\n'

def buildLinePatternMap(linesConf):
    # precompile one pattern per configured line to avoid repeated work during parsing
    linePatternMap = {}
    for keywordlistOneLine in linesConf:
        linekey = getLineKey(keywordlistOneLine)
        linePatternMap[linekey] = getLinePattern(keywordlistOneLine)
    return linePatternMap

def getLineKey(keywordlistForOneLine):
    return "_".join(keywordlistForOneLine)

def getLinePattern(keywordlistForOneLine):
    return re.compile(produceRegex(keywordlistForOneLine))

def testMatchOneLine():
    assert len(matchOneLine([], "haha", {})) == 0
    assert len(matchOneLine(["host"], "", {})) == 0
    assert len(matchOneLine("", "haha", {})) == 0
    assert len(matchOneLine(["host", "ip"], "host:qqq addr: 1.1.1.1", {})) == 0

    lineMatchMap1 = matchOneLine(["id:"], "id: 123456", {"id:": re.compile(produceRegex(["id:"]))})
    assert lineMatchMap1.get("id") == "123456"

    lineMatchMap2 = matchOneLine(["host:", "ip:"], "host: qinhost ip: 1.1.1.1 ", {"host:_ip:": re.compile(produceRegex(["host:", "ip:"]))})
    assert lineMatchMap2.get("host") == "qinhost"
    assert lineMatchMap2.get("ip") == "1.1.1.1"
    print 'testMatchOneLine passed.'

if __name__ == '__main__':
    testMatchOneLine()

    files = globalConf['files']
    linesConf = globalConf['ksconf']
    sepLines = map(lambda ks: ks[0], linesConf)

    text = readAllFiles(files)
    wholeRegex = produceRegex(sepLines)
    print 'wholeRegex: ', wholeRegex

    compiledPattern = re.compile(wholeRegex, flags=re.DOTALL + re.MULTILINE)
    allMatched = findInText(compiledPattern, text, linesConf)
    printMatched(allMatched, linesConf)
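For reference (my own derivation, worth checking against the script's actual output): with the SQL ksconf above, sepLines is ['S', '# User@Host:', '# Schema:', '# Query_time:', '# Bytes_sent:', 'SET timestamp='], so the wholeRegex printed by the script should come out as

^\s*S(.*?)# User@Host:(.*?)# Schema:(.*?)# Query_time:(.*?)# Bytes_sent:(.*?)SET timestamp=(.*?)\s*$

The lone 'S' in the first keyword list simply anchors each entry on the leading letter of its SELECT statement line.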
To parse a multi-line text file like the one below, you only need to change the config to ksconf = [['id:'], ['name:'], ['able:']] (and point files at the new file):
id:1
name:shu
able:swim,study
id:2
name:qin
able:sleep,run
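A rough sketch of the expected result (my derivation, assuming the text above is saved as stu.txt and files is switched to ['stu.txt']): produceRegex assembles the whole-entry regex ^\s*id:(.*?)name:(.*?)able:(.*?)\s*$, and printMatched should then emit roughly:

id => 1
name => shu
able => swim,study

id => 2
name => qin
able => sleep,run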