Oracle and Elasticsearch Data Synchronization
1. Versions
Python 2.7.12 (x64)
Oracle 12.1.0.2.0 (x64) and Elasticsearch 2.2.0
If you connect to the database and to ES remotely, pay close attention to the versions of the client modules you install (cx_Oracle, pyes). Choose versions that match the server versions, otherwise you will run into problems.
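A quick way to catch such mismatches early is to print the client-library versions before attempting a remote connection. This is a minimal sketch of my own (not part of the original script); it only assumes cx_Oracle and pyes are already installed:

# version check sketch -- not part of the original script
import cx_Oracle
import pkg_resources

print 'cx_Oracle module version:', cx_Oracle.version
print 'Oracle client version:', '.'.join(str(n) for n in cx_Oracle.clientversion())
print 'pyes version:', pkg_resources.get_distribution('pyes').version

Compare the printed versions against the Oracle and Elasticsearch servers you plan to connect to before running the sync script.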
3. Problems you may run into during installation
4. Source code
# -*- coding: utf-8 -*-
"""
Author: 陈龙
Date: 2016-07-22
Purpose: synchronize data from an Oracle database into Elasticsearch
"""
import os
import sys
import datetime, time
# import fcntl
import threading
import pyes  # pyes module: Elasticsearch client
import cx_Oracle  # cx_Oracle module: Oracle client

os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'  # Chinese character encoding for the Oracle client
reload(sys)  # set the default encoding to utf-8 (Python 2 only)
sys.setdefaultencoding('utf-8')

# Create the ES connection and return it
def connect_ES(addr):
    try:
        global conn
        conn = pyes.ES(addr)  # connect to ES, e.g. '127.0.0.1:9200'
        print 'ES connection established'
        return conn
    except Exception:
        print 'ES connection failed'
        pass
# Create the ES mappings; pay attention to the type of each field
def create_ESmapping():
    global spiderInfo_mapping, involveVideo_mapping, involveCeefax_mapping, keyWord_mapping, sensitiveWord_mapping
    spiderInfo_mapping = {'tableName': {'index': 'not_analyzed', 'type': 'string'},
                          'tableId': {'index': 'not_analyzed', 'type': 'integer'},
                          'title': {'index': 'analyzed', 'type': 'string'},
                          'author': {'index': 'not_analyzed', 'type': 'string'},
                          'content': {'index': 'analyzed', 'type': 'string'},
                          'publishTime': {'index': 'not_analyzed', 'type': 'string'},
                          'browseNum': {'index': 'not_analyzed', 'type': 'integer'},
                          'commentNum': {'index': 'not_analyzed', 'type': 'integer'},
                          'dataType': {'index': 'not_analyzed', 'type': 'integer'}}  # mapping for general spider data (everything except the "involve us" content)
    involveVideo_mapping = {'tableName': {'index': 'not_analyzed', 'type': 'string'},
                            'tableId': {'index': 'not_analyzed', 'type': 'integer'},
                            'title': {'index': 'analyzed', 'type': 'string'},
                            'author': {'index': 'not_analyzed', 'type': 'string'},
                            'summary': {'index': 'analyzed', 'type': 'string'},
                            'publishTime': {'index': 'not_analyzed', 'type': 'string'},
                            'url': {'index': 'not_analyzed', 'type': 'string'},
                            'imgUrl': {'index': 'not_analyzed', 'type': 'string'},
                            'ranking': {'index': 'not_analyzed', 'type': 'integer'},
                            'playNum': {'index': 'not_analyzed', 'type': 'integer'},
                            'dataType': {'index': 'not_analyzed', 'type': 'integer'}}  # mapping for "involve us" audio/video content
    involveCeefax_mapping = {'tableName': {'index': 'not_analyzed', 'type': 'string'},
                             'tableId': {'index': 'not_analyzed', 'type': 'integer'},
                             'title': {'index': 'analyzed', 'type': 'string'},
                             'author': {'index': 'not_analyzed', 'type': 'string'},
                             'content': {'index': 'analyzed', 'type': 'string'},
                             'publishTime': {'index': 'not_analyzed', 'type': 'string'},
                             'keyWords': {'index': 'not_analyzed', 'type': 'string'},
                             'popularity': {'index': 'not_analyzed', 'type': 'integer'},
                             'url': {'index': 'not_analyzed', 'type': 'string'},
                             'dataType': {'index': 'not_analyzed', 'type': 'integer'}}  # mapping for "involve us" text/news content
    keyWord_mapping = {'id': {'index': 'not_analyzed', 'type': 'integer'},
                       'keywords': {'index': 'not_analyzed', 'type': 'string'}}
    sensitiveWord_mapping = {'id': {'index': 'not_analyzed', 'type': 'integer'},
                             'sensitiveType': {'index': 'not_analyzed', 'type': 'string'},
                             'sensitiveTopic': {'index': 'not_analyzed', 'type': 'string'},
                             'sensitiveWords': {'index': 'not_analyzed', 'type': 'string'}}
# Create the ES index and the types (mappings) under it
def create_ESindex(ES_index, index_type1, index_type2, index_type3, index_type4, index_type5):
    if conn.indices.exists_index(ES_index):
        pass
    else:
        conn.indices.create_index(ES_index)  # create the index only if it does not exist yet
    create_ESmapping()
    conn.indices.put_mapping(index_type1, {'properties': spiderInfo_mapping}, [ES_index])  # create the _type "spiderInfo" under the index
    conn.indices.put_mapping(index_type2, {'properties': involveVideo_mapping}, [ES_index])  # create the _type "involveVideo" under the index
    conn.indices.put_mapping(index_type3, {'properties': involveCeefax_mapping}, [ES_index])  # create the _type "involveCeefax" under the index
    conn.indices.put_mapping(index_type4, {'properties': keyWord_mapping}, [ES_index])
    conn.indices.put_mapping(index_type5, {'properties': sensitiveWord_mapping}, [ES_index])
    # conn.ensure_index

# Create the Oracle connection and return it
def connect_Oracle(name, password, address):
    try:
        global conn1
        # conn1 = cx_Oracle.connect('c##chenlong', '1234567890', 'localhost:1521/ORCL')  # connect to a local database
        conn1 = cx_Oracle.connect(name, password, address)  # connect to the remote database, e.g. "pom", "Bohui@123", "172.17.7.118:1521/ORCL"
        print 'Oracle connection established'
        return conn1
    except Exception:
        print 'The sync script cannot connect to the database; check the connect parameters and the module versions'
        pass

def fetch_account(accountcode):  # extract the account name in front of the first '_'
    end = accountcode.find('_')
    return accountcode[0:end].strip()
# One handler per table
# Read each table's recorded ID from the tracking store and check whether it has changed
# Read the relevant rows from each table
# Read each table's current ID and the recorded ID (kept in a text file or in the database) and compare them
"""def read_compare_ID():
    global tuple_tableName_IdNum
    global cur
    tuple_tableName_IdNum = {}
    tablename = []
    cur = conn1.cursor()
    result1 = cur.execute("select * from tabs")  # read the names of all tables
    row = result1.fetchall()
    for x in row:
        tablename.append(x[0])  # collect the table names
        result2 = cur.execute('select {}_ID from {}'.format(x[0], x[0]))
        ID_num = result2.fetchall()
        tuple_tableName_IdNum[x[0]] = ID_num"""
def readOracle_writeES(tableName, ES_index, index_type):
    global cc
    cur = conn1.cursor()
    # result_AlltableNames = cur.execute("select * from tabs")
    result_latestId = cur.execute("select max({}_Id) from {} ".format(tableName, tableName))
    num1 = result_latestId.fetchone()  # current maximum ID in the table
    print 'current max ID in the table: {}'.format(num1[0])
    result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName.upper()))  # last synced ID from the tracking table; table names are stored in upper case
    num2 = result_rememberId.fetchone()  # ID recorded during the previous run
    print 'ID recorded during the previous run: {}'.format(num2[0])
    if tableName.upper() == 'T_SOCIAL':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,title,author,content,publishTime,browseNum,likeNum,forwardNum,commentNum,accountCode from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # limit the batch with rownum in SQL and then fetchall(); this keeps memory bounded and is more efficient than fetchmany on an unbounded query
            for i in result_tuple1:  # writing row by row is slow, so the documents go through the bulk interface (bulk=True)
                aa = (i[5] + i[6])
                bb = (i[7] + i[8])
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[1]), 'author': unicode(i[2]),
                         'content': unicode(i[3]), 'publishTime': str(i[4]), 'browseNum': aa,
                         'commentNum': bb, 'dataType': fetch_account(i[9])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]  # only advanced when the write succeeded
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} -> {} sync finished".format(tableName, index_type)
    if tableName.upper() == 'T_HOTSEARCH':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,accountCode,title,publishTime from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]), 'author': '', 'content': '', 'publishTime': str(i[3]), 'browseNum': 0,
                         'commentNum': 0, 'dataType': fetch_account(i[1])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} -> {} sync finished".format(tableName, index_type)
    if tableName.upper() == 'T_VIDEO_HOT':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,accountCode,title,Author,publishTime from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]), 'author': unicode(i[3]),
                         'content': '', 'publishTime': str(i[4]), 'browseNum': 0,
                         'commentNum': 0, 'dataType': fetch_account(i[1])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} sync finished".format(tableName)
    if tableName.upper() == 'T_PRESS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute(
                "select {}_ID,accountCode,title,Author,PublishDate,Content from {} where {}_ID > {} and rownum<=40 ".format(
                    tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]), 'author': unicode(i[3]),
                         'content': unicode(i[5]), 'publishTime': str(i[4]), 'browseNum': 0,
                         'commentNum': 0, 'dataType': fetch_account(i[1])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} sync finished".format(tableName)
    if tableName.upper() == 'T_INDUSTRY':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute(
                "select {}_ID,accountCode,title,Author,PublishTime,Content,BrowseNum from {} where {}_ID > {} and rownum<=40 ".format(
                    tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]), 'author': unicode(i[3]),
                         'content': unicode(i[5]), 'publishTime': str(i[4]), 'browseNum': i[6],
                         'commentNum': 0, 'dataType': fetch_account(i[1])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} sync finished".format(tableName)
    if tableName.upper() == 'T_SOCIAL_SITESEARCH':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute('select {}_ID,title,author,content,publishTime,keyWords,browseNum,likeNum,forwardNum,commentNum,url,accountCode from {} where ({}_ID > {})'.format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchmany(50)  # the query is unbounded, so fetch only 50 rows at a time to keep memory usage bounded
            for i in result_tuple1:  # indexed through the bulk interface
                popularity = (i[6] + i[7] + i[8] * 2 + i[9] * 2)
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[1]), 'author': unicode(i[2]),
                         'content': unicode(i[3]), 'publishTime': str(i[4]), 'keyWords': unicode(i[5]),
                         'popularity': popularity, 'url': i[10],
                         'dataType': fetch_account(i[11])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} sync finished".format(tableName)
    if tableName.upper() == 'T_REALTIME_NEWS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,title,author,content,publishTime,browseNum,commentNum,accountCode,url from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                popularity = (i[5] + i[6] * 2)
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[1]), 'author': unicode(i[2]),
                         'content': unicode(i[3]), 'publishTime': str(i[4]), 'keyWords': unicode(''),
                         'popularity': popularity, 'url': i[8], 'dataType': fetch_account(i[7])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} -> {} sync finished".format(tableName, index_type)
    if tableName.upper() == 'T_KEY_NEWS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,title,author,content,publishTime,browseNum,commentNum,accountCode,url from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                popularity = (i[5] + i[6] * 2)
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[1]), 'author': unicode(i[2]),
                         'content': unicode(i[3]), 'publishTime': str(i[4]), 'keyWords': unicode(''),
                         'popularity': popularity, 'url': i[8], 'dataType': fetch_account(i[7])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} -> {} sync finished".format(tableName, index_type)
    if tableName.upper() == 'T_LOCAL_NEWS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,title,author,content,publishTime,browseNum,commentNum,accountCode,url from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                popularity = (i[5] + i[6] * 2)
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[1]), 'author': unicode(i[2]),
                         'content': unicode(i[3]), 'publishTime': str(i[4]), 'keyWords': unicode(''),
                         'popularity': popularity, 'url': i[8], 'dataType': fetch_account(i[7])},
                        ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} -> {} sync finished".format(tableName, index_type)
    if tableName.upper() == 'T_VIDEO_SITESEARCH':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,accountCode,title,Author,publishTime,url,imgUrl,playNum,keyWords from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                if conn.index(
                        {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]), 'author': unicode(i[3]),
                         'summary': unicode('0'), 'publishTime': str(i[4]), 'url': i[5], 'imgUrl': i[6], 'ranking': 0,
                         'playNum': i[7], 'dataType': fetch_account(i[1])},
                        ES_index, index_type, bulk=True):  # write the document into ES; playNum comes from column 7 of the query, matching the involveVideo mapping
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} -> {} sync finished".format(tableName, index_type)
    if tableName.upper() == 'T_BASE_KEYWORDS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute('select {}_ID,keywords from {} where {}_ID > {} and rownum<=50'.format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                if conn.index({'id': i[0], 'keywords': i[1]}, ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} sync finished".format(tableName)
    if tableName.upper() == 'T_BASE_SENSITIVEWORDS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute('select {}_ID,SensitiveType,SensitiveTopic,SensitiveWords from {} where {}_ID > {} and rownum<=50'.format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # bounded batch via rownum, then fetchall()
            for i in result_tuple1:  # indexed through the bulk interface
                if conn.index({'id': i[0],
                               'sensitiveType': unicode(i[1]),
                               'sensitiveTopic': unicode(i[2]),
                               'sensitiveWords': unicode(i[3])}, ES_index, index_type, bulk=True):  # write the document into ES
                    cc += 1
                    print 'ID imported via bulk: {}'.format(i[0])
                    rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the recorded ID
            num2 = result_rememberId.fetchone()
        print "{} sync finished".format(tableName)
    else:
        pass
def ww(a):
    while True:
        print a
        time.sleep(0.5)  # helper function used only for a multithreading experiment

if __name__ == "__main__":
    cc = 0
    connect_ES('172.17.5.66:9200')
    # conn.indices.delete_index('_all')  # drop all indices
    create_ESindex("pom", "spiderInfo", "involveVideo", "involveCeefax", "keyWord", "sensitiveWord")
    connect_Oracle("pom", "Bohui@123", "172.17.7.118:1521/ORCL")
    # thread.start_new_thread(readOracle_writeES, ("T_SOCIAL", "pom", "spiderInfo"),)  # start a worker thread
    # thread.start_new_thread(readOracle_writeES, ("T_SOCIAL_SITESEARCH", "pom", "spiderInfo"),)  # start a worker thread
    mm = time.clock()
    readOracle_writeES("T_SOCIAL", "pom", "spiderInfo")  # the function upper-cases table names, but passing them already in upper case is safer
    readOracle_writeES("T_HOTSEARCH", "pom", "spiderInfo")
    readOracle_writeES("T_VIDEO_HOT", "pom", "spiderInfo")
    readOracle_writeES("T_PRESS", "pom", "spiderInfo")
    readOracle_writeES("T_INDUSTRY", "pom", "spiderInfo")
    readOracle_writeES("T_VIDEO_SITESEARCH", "pom", "involveVideo")
    readOracle_writeES("T_REALTIME_NEWS", "pom", "involveCeefax")
    readOracle_writeES("T_KEY_NEWS", "pom", "involveCeefax")
    readOracle_writeES("T_LOCAL_NEWS", "pom", "involveCeefax")
    readOracle_writeES("T_SOCIAL_SITESEARCH", "pom", "involveCeefax")
    readOracle_writeES("T_BASE_KEYWORDS", "pom", "keyWord")
    readOracle_writeES("T_BASE_SENSITIVEWORDS", "pom", "sensitiveWord")
    nn = time.clock()
    # conn.indices.close_index('pom')
    conn1.close()
    print 'elapsed time: {}s, documents written: {}'.format(nn - mm, cc)

# multithreading experiments (kept for reference, not executed)
"""
while a < 100:
    conn.index(
        {'tableName': 'T_base_account', 'type': '1', 'tableId': '123', 'title': unicode('陈龙'), 'author': 'ABC',
         'content': 'ABC', 'publishTime': '12:00:00', 'browseNum': '12', 'commentNum': '12', 'dataType': '1'},
        "pom", "spiderInfo", )  # write the document into the spiderInfo type of the pom index
    a += 1
    print time.ctime()
"""
"""
threads = []
t1 = threading.Thread(target=readOracle_writeES, args=("T_SOCIAL", "pom", "spiderInfo"))
threads.append(t1)
# t3 = threading.Thread(target=ww, args=(10,))
# threads.append(t3)
# t2 = threading.Thread(target=readOracle_writeES, args=("T_SOCIAL_SITESEARCH", "pom", "spiderInfo"))
# threads.append(t2)
print time.ctime()
for t in threads:
    t.setDaemon(True)
    t.start()
t.join()
"""
5. Notes from debugging: cx_Oracle connection and cursor methods
rollback(): roll back the current transaction.
callproc(self, procname, args): call a stored procedure; takes the procedure name and a parameter list.
execute(self, query, args): execute a single SQL statement with the given bind parameters; the number of affected rows is exposed through cursor.rowcount.
executemany(self, query, args): execute one SQL statement repeatedly, once for each parameter set in the list.
nextset(self): move to the next result set.
Cursor methods for retrieving results:
fetchall(self): fetch all remaining rows of the result set.
fetchmany(self, size=None): fetch the next size rows; if size is omitted, cursor.arraysize rows are fetched, and if fewer rows remain only those are returned.
fetchone(self): fetch the next single row.
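For reference, a short example of these calls. This is a sketch: the DEMO_ITEMS table, the ADD_ITEM procedure and the connection string are made up for illustration and are not part of the sync script.

import cx_Oracle

conn = cx_Oracle.connect('user', 'password', 'localhost:1521/ORCL')  # hypothetical credentials
cur = conn.cursor()

# execute(): run a single statement with bind variables
cur.execute("insert into DEMO_ITEMS (id, name) values (:1, :2)", [1, 'first'])

# executemany(): run the same statement once per parameter tuple
cur.executemany("insert into DEMO_ITEMS (id, name) values (:1, :2)",
                [(2, 'second'), (3, 'third')])
print cur.rowcount        # affected rows are reported through the rowcount attribute

conn.rollback()           # rollback(): discard the uncommitted inserts

# fetchone()/fetchmany()/fetchall(): consume a query's result set
cur.execute("select id, name from DEMO_ITEMS where id > :1", [0])
print cur.fetchone()      # next row, or None when the result set is exhausted
print cur.fetchmany(5)    # up to 5 of the remaining rows
print cur.fetchall()      # whatever is left

# callproc(): run a stored procedure, e.g. a hypothetical ADD_ITEM(p_id, p_name)
# cur.callproc('ADD_ITEM', [4, 'fourth'])

cur.close()
conn.close()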