Python: traversing Hadoop data and extracting the records that contain a value from a given list

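import sys

import tstree

fname = 'high_freq_site.list'
tree = tstree.TernarySearchTrie()
tree.loadData(fname)

token = ''    # URL of the group currently being accumulated
counter = 0   # running sum of the counts for that URL
post = []     # post times seen for that URL

# Input fields: url, count, posttime.
# Records with the same url must arrive consecutively.
for line in sys.stdin:
    line = line.strip()
    arr = line.split()
    if len(arr) != 3:
        continue
    # The indexes below follow the field order given in the comment above.
    url = arr[0]
    num = arr[1]
    posttime = int(arr[2])
    if token == '':
        # First record: start the first group.
        token = url
        counter = int(num)
        post.append(posttime)
    elif token == url:
        # Same URL: keep accumulating.
        counter += int(num)
        post.append(posttime)
    else:
        # URL changed: emit the finished group if it matches a listed site.
        ret = tree.maxMatch(token)
        if ret and post:
            print('%s\t%s\t%s\t%s' % (ret, token, counter, min(post)))
        token = url
        counter = int(num)
        post = [posttime]  # keep the first post time of the new group

# Emit the last group.
ret = tree.maxMatch(token)
if ret and post:
    print('%s\t%s\t%s\t%s' % (ret, token, counter, min(post)))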
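The grouping logic of the script above can also be sketched with `itertools.groupby`, which makes the "sum the counts and keep the earliest post time per URL" step explicit. This is only an illustrative sketch: the field order (url, count, posttime) is taken from the comment in the script, while the sample lines, the `sites` list, and the `prefix_match` stand-in (used here instead of the trie) are hypothetical.

```python
from itertools import groupby


def aggregate(lines, match):
    """Yield (site, url, total_count, earliest_posttime) for each matching URL.

    `lines` must already be grouped by URL, as sorted streaming input would be.
    `match` is any callable that returns the matched site or None.
    """
    records = (line.split() for line in lines)
    records = (r for r in records if len(r) == 3)  # skip malformed lines
    for url, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        site = match(url)
        if site:
            total = sum(int(r[1]) for r in group)
            earliest = min(int(r[2]) for r in group)
            yield site, url, total, earliest


# Hypothetical sample input: url, count, posttime.
sample = [
    "a.com/x 2 100",
    "a.com/x 3 90",
    "b.org/y 1 50",
    "malformed line with too many fields",
    "c.net/z 4 70",
    "c.net/z 1 80",
]
sites = ["a.com", "c.net"]  # hypothetical stand-in for high_freq_site.list


def prefix_match(url):
    """Stand-in for tree.maxMatch: first listed site that prefixes the URL."""
    return next((s for s in sites if url.startswith(s)), None)


results = list(aggregate(sample, prefix_match))
for row in results:
    print("%s\t%s\t%s\t%s" % row)
```

Unlike the streaming loop, this version holds one URL's records in memory at a time, which is fine as long as no single URL has an extremely large number of records.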

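# tstree.py: the module imported by the script above

class TSTNode(object):
    """One node of the ternary search trie."""

    def __init__(self, splitchar):
        self.splitchar = splitchar  # character stored at this node
        self.data = None            # the full word, if a word ends here
        self.loNode = None          # subtree for characters < splitchar
        self.eqNode = None          # subtree for the next character of the word
        self.hiNode = None          # subtree for characters > splitchar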

class TernarySearchTrie(object):

    def __init__(self):
        self.rootNode = None

    def loadData(self, fname):
        # Insert every non-empty line of the file as one word.
        with open(fname) as f:
            for line in f:
                line = line.strip()
                node = self.addWord(line)
                if node:
                    node.data = line

    def addWord(self, word):
        # Insert word and return the node of its last character.
        if not word:
            return None
        charIndex = 0
        if not self.rootNode:
            self.rootNode = TSTNode(word[0])
        currentNode = self.rootNode
        while True:
            charComp = ord(word[charIndex]) - ord(currentNode.splitchar)
            if charComp == 0:
                # Character matches: advance to the next character.
                charIndex += 1
                if charIndex == len(word):
                    return currentNode
                if not currentNode.eqNode:
                    currentNode.eqNode = TSTNode(word[charIndex])
                currentNode = currentNode.eqNode
            elif charComp < 0:
                if not currentNode.loNode:
                    currentNode.loNode = TSTNode(word[charIndex])
                currentNode = currentNode.loNode
            else:
                if not currentNode.hiNode:
                    currentNode.hiNode = TSTNode(word[charIndex])
                currentNode = currentNode.hiNode

    def maxMatch(self, url):
        # Return the longest stored word that is a prefix of url, or None.
        ret = None
        currentNode = self.rootNode
        charIndex = 0
        while currentNode:
            if charIndex >= len(url):
                break
            charComp = ord(url[charIndex]) - ord(currentNode.splitchar)
            if charComp == 0:
                charIndex += 1
                if currentNode.data:
                    # A stored word ends here; remember it as the best match so far.
                    ret = currentNode.data
                if charIndex == len(url):
                    return ret
                currentNode = currentNode.eqNode
            elif charComp < 0:
                currentNode = currentNode.loNode
            else:
                currentNode = currentNode.hiNode
        return ret

if __name__ == '__main__':
    # Quick manual test: print the matched site for each URL read from stdin.
    import sys

    fname = 'high_freq_site.list'
    tree = TernarySearchTrie()
    tree.loadData(fname)
    for url in sys.stdin:
        url = url.strip()
        ret = tree.maxMatch(url)
        print(ret)
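As read from the code above, maxMatch returns the longest list entry that is a prefix of the URL, so an entry only matches when the URL starts with it. A brute-force version of the same lookup can serve as a reference when checking the trie on a small list. This is a sketch, not part of the original module: `longest_prefix` and the sample entries are hypothetical.

```python
def longest_prefix(words, url):
    """Return the longest entry of `words` that is a prefix of `url`, or None."""
    best = None
    for word in words:
        # Empty entries are ignored, just as addWord ignores empty lines.
        if word and url.startswith(word):
            if best is None or len(word) > len(best):
                best = word
    return best


# Hypothetical list entries, standing in for high_freq_site.list.
words = ["news.example.com", "news.example.com/sports", "blog.example.org"]

# The longer of two nested entries wins.
print(longest_prefix(words, "news.example.com/sports/1.html"))
# Only the shorter entry is a prefix here.
print(longest_prefix(words, "news.example.com/tech/2.html"))
# No entry is a prefix of this URL, so nothing matches.
print(longest_prefix(words, "example.com/index.html"))
```

The brute-force lookup costs one `startswith` call per list entry for every URL, whereas the trie walks the URL once, which is presumably why the trie is used for a large site list.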
