paper about spring
1. Parse the JSON file of raw user information
#!/usr/bin/python
# -*- coding=utf-8 -*-
import os
import sys
import json

def main():
    root_dir = sys.argv[1]
    # load the province list used to normalize locations
    province_file = root_dir + "/conf/province.list"
    fin = open(province_file, 'r')
    provinces = set()
    for line in fin:
        province = line.strip()
        provinces.add(province)
    fin.close()

    input_file = root_dir + "/source_data/userinfo.json"
    output_file = root_dir + "/result_data/userinfo.data"
    fin = open(input_file, 'r')
    fout = open(output_file, 'w')
    for line in fin:
        if line.strip() == "[]":
            continue
        json_file = json.loads(line.strip())
        userid = json_file['userId']
        sex = json_file['sex']
        location = json_file['location']
        birthday = json_file['birthday']
        attentioncount = json_file['attentionCount']
        fanscount = json_file['fansCount']
        weibocount = json_file['weiboCount']
        label_list = json_file['labelList']
        user_introduce = json_file['userIntroduce']
        if not sex:
            sex = 'null'
        # keep only the province part of the location
        if location.find(' ') != -1:
            fields = location.split(' ')
            location = fields[0]
        elif location:
            for province in provinces:
                if location.find(province) != -1:
                    location = province
        if not location:
            location = 'null'
        # keep only the birth year
        index = birthday.find('年')
        if index != -1:
            birthday = birthday[0:index]
        else:
            birthday = 'null'
        if not attentioncount:
            attentioncount = ''
        if not fanscount:
            fanscount = ''
        if not weibocount:
            weibocount = ''
        if not label_list or not label_list.strip():
            label_list = 'null'
        if not user_introduce or not user_introduce.strip():
            user_introduce = 'null'
        print>>fout, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (userid, sex, location, birthday, attentioncount, fanscount, weibocount, label_list, user_introduce)
    fin.close()
    fout.close()

if __name__ == "__main__":
    main()
UserInfoParser
1) The user labels still need further word segmentation.
2) In this dataset, roughly one third of all users carry labels (a quick check of this ratio is sketched below UserShell).
3) Users without labels are marked with null here.
If the records without labels need to be dropped, the corresponding shell commands are:
cat userinfo.data | awk -F '\t' '{print $8}' | sed '/null/d'
cat userinfo.data | cut -f 8 | sed '/null/d'
cat userinfo.data | awk -F '\t' '$8 != "null" {print $8}'
UserShell
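To sanity-check the roughly one-third figure from note 2, a small sketch along these lines could count labeled vs. unlabeled records, assuming column 8 of userinfo.data is the label field produced above:

#!/usr/bin/python
# Count labeled vs. unlabeled users in userinfo.data (column 8 is the label field).
labeled, unlabeled = 0, 0
fin = open('userinfo.data', 'r')
for line in fin:
    fields = line.strip().split('\t')
    if fields[7] == 'null':
        unlabeled += 1
    else:
        labeled += 1
fin.close()
print "%s labeled, %s unlabeled, ratio %.2f" % (labeled, unlabeled, labeled / float(labeled + unlabeled))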
2. Mapping and sorting operations
#!/usr/bin/python
import os
import sys

def main():
    root_dir = sys.argv[1]
    topN = int(sys.argv[2])

    # load the mapping from topic id to topic text
    topic_total_file = root_dir + '/result_data/topic_id.data.total'
    id_topic = {}
    fin = open(topic_total_file, 'r')
    for line in fin:
        fields = line.strip().split('\t')
        id_topic[fields[1]] = fields[0]
    fin.close()

    # count how many documents mention each topic id
    topicid_count = {}
    sources = ['sina', 'tencent']
    for source in sources:
        input_file = root_dir + '/result_data/' + source + '.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            if fields[2] == '-1':
                continue
            topics = fields[2].split(':')
            for topic in topics:
                if topic in topicid_count:
                    topicid_count[topic] += 1
                else:
                    topicid_count[topic] = 1
        fin.close()

    # sort by count and keep only the top N topics
    sort_topic = sorted(topicid_count.items(), key=lambda d: d[1], reverse=True)
    if len(sort_topic) < topN:
        topN = len(sort_topic)
    output_file = root_dir + '/result_data/topic_id.data'
    fout = open(output_file, 'w')
    for i in range(topN):
        print>>fout, "%s\t%s\t%s" % (sort_topic[i][0], id_topic[sort_topic[i][0]], topicid_count[sort_topic[i][0]])
    fout.close()

if __name__ == "__main__":
    main()
TopN_topic
1) Build two dictionaries: one stores the mapping from topic id to topic text, the other stores, for each topic id, how many times that topic occurs.
2) Sorting by count is done with sorted(dict.items(), key=lambda d: d[1], reverse=True), which returns a list of (key, value) tuples.
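A minimal, self-contained illustration of this sorting idiom; the dictionary contents are invented for the example:

#!/usr/bin/python
counts = {'t1': 5, 't2': 12, 't3': 7}
sorted_counts = sorted(counts.items(), key=lambda d: d[1], reverse=True)
# sorted_counts is a list of (key, count) tuples: [('t2', 12), ('t3', 7), ('t1', 5)]
for topic, count in sorted_counts[:2]:   # keep only the top 2
    print "%s\t%s" % (topic, count)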
3. Document id allocation and stopword handling
#!/usr/bin/python
import os
import sys

def main():
    if len(sys.argv) != 4:
        print "error parameters!"
        sys.exit(0)
    root_dir = sys.argv[1][0:sys.argv[1].rfind('/')]
    input_dir = sys.argv[1]
    output_root_dir = sys.argv[2]
    topic_multiple = float(sys.argv[3])

    # stopwords
    stopwords_file = root_dir + '/conf/stopwords.list'
    fin = open(stopwords_file, 'r')
    stopwords = set()
    for line in fin:
        word = line.strip()
        stopwords.add(word)
    fin.close()

    # generate ntopics_alpha.data
    cmd = "wc -l " + root_dir + "/result_data/topic_id.data | awk -F' ' '{print $1}'"
    num_topics = int(int(os.popen(cmd).read().strip()) * topic_multiple)
    alpha = 50 / float(num_topics)
    ntopics_alpha_file = output_root_dir + '/ntopics_alpha.data'
    fout = open(ntopics_alpha_file, 'w')
    print>>fout, "%s\t%s" % (num_topics, alpha)
    fout.close()

    # allocate docid and remove stopwords
    source_list = ['sina', 'tencent', 'tianya']
    for source in source_list:
        input_file = input_dir + '/' + source + '.data'
        cmd = "wc -l " + input_file + " | awk -F' ' '{print $1}'"
        line_number = os.popen(cmd).read().strip()
        output_file = output_root_dir + '/' + source + '/source.data'
        fin = open(input_file, 'r')
        fout = open(output_file, 'w')
        # the first line of source.data is the number of documents
        print>>fout, line_number
        docid = {}
        allocate_id = 0
        for line in fin:
            fields = line.strip().split('\t')
            doc = fields[0]
            docid[doc] = allocate_id
            allocate_id += 1
            line = ""
            for word in fields[1].split(' '):
                if word.strip() and word not in stopwords:
                    line += word + '\t'
            if len(line) == 0:
                print>>fout, 'null'
            else:
                print>>fout, line
        fin.close()
        fout.close()
        # save the original-doc-id to allocated-id mapping
        docid_file = output_root_dir + '/' + source + '/docid.map'
        fout = open(docid_file, 'w')
        for doc in docid:
            print>>fout, "%s\t%s" % (doc, docid[doc])
        fout.close()

if __name__ == "__main__":
    main()
allocateDocId
1) How to remove stopwords.
2) How to allocate an id to each document and save the mapping.
3) In practice, whether the document id and the word itself are used as keys directly, or processed further, can be decided case by case.
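A stripped-down sketch of the same two steps (id allocation plus stopword filtering); the stopword set and the two documents are in-memory placeholders rather than the real input files:

#!/usr/bin/python
stopwords = set(['the', 'a', 'of'])
docs = [('doc_001', 'the price of gold'), ('doc_002', 'a new phone')]

docid = {}          # original doc id -> allocated integer id
allocate_id = 0
for doc, text in docs:
    docid[doc] = allocate_id
    allocate_id += 1
    kept = [w for w in text.split(' ') if w.strip() and w not in stopwords]
    print "%s\t%s" % (docid[doc], '\t'.join(kept) if kept else 'null')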
4. generate_nw_nd
#!/usr/bin/python
import os
import sys

def main():
    root_dir = sys.argv[1]
    cmd = "cat " + root_dir + "/lda_model/ntopics_alpha.data | awk -F' ' '{print $1}' "
    num_topics = int(os.popen(cmd).read().strip())

    source_list = ['sina', 'tencent', 'tianya']
    for source in source_list:
        tassign_file = root_dir + '/lda_model/' + source + '/model-final.tassign'
        nd_file = root_dir + '/lda_model/' + source + '/nd.data'
        # the first line of wordmap.txt is the vocabulary size
        cmd = "head -1 " + root_dir + "/lda_model/" + source + "/wordmap.txt"
        num_tokens = int(os.popen(cmd).read().strip())
        # nw is a flat list used as a num_tokens x num_topics matrix
        nw = [0 for i in range(num_topics * num_tokens)]
        fin = open(tassign_file, 'r')
        fout = open(nd_file, 'w')
        docid = 0
        for line in fin:
            fields = line.strip().split(' ')
            nd = [0 for i in range(num_topics)]
            # each pair in the tassign file is "wordid:topicid"
            for pair in fields:
                parts = pair.split(':')
                wordid = int(parts[0])
                topicid = int(parts[1])
                nw[wordid*num_topics + topicid] += 1
                nd[topicid] += 1
            print>>fout, "%s\t%s" % (docid, "\t".join([str(i) for i in nd]))
            docid += 1
        fin.close()
        fout.close()

        nw_file = root_dir + '/lda_model/' + source + '/nw.data'
        fout = open(nw_file, 'w')
        for wordid in range(num_tokens):
            line = ''
            for topicid in range(num_topics):
                line += str(nw[wordid*num_topics + topicid]) + '\t'
            print>>fout, line
        fout.close()

if __name__ == "__main__":
    main()
generate_nw_nd
1) A flat Python list is used as a two-dimensional matrix, indexed as nw[wordid*num_topics + topicid].
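A minimal sketch of this flat-list-as-matrix indexing, with made-up sizes:

#!/usr/bin/python
num_tokens = 3   # rows (vocabulary size), illustrative value
num_topics = 4   # columns (number of topics), illustrative value
nw = [0 for i in range(num_tokens * num_topics)]

def incr(wordid, topicid):
    # equivalent to nw[wordid][topicid] += 1 in matrix terms
    nw[wordid * num_topics + topicid] += 1

incr(2, 1)
incr(2, 1)
print nw[2 * num_topics + 1]   # prints 2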
5. topic_mapping
#!/usr/bin/python
import os
import sys

def similarity(real_vector, lda_vector):
    # dot product of two sparse vectors represented as dicts
    score = float(0)
    words = set()
    for word in real_vector:
        if word not in words:
            words.add(word)
    for word in lda_vector:
        if word not in words:
            words.add(word)
    real_list = []
    lda_list = []
    for word in words:
        if word in real_vector:
            real_list.append(real_vector[word])
        else:
            real_list.append(float(0))
        if word in lda_vector:
            lda_list.append(lda_vector[word])
        else:
            lda_list.append(float(0))
    for i in range(len(real_list)):
        score += real_list[i] * lda_list[i]
    return score

def topic_mapping(realtopic_vector, ldatopic_vector):
    # for each real topic, pick the most similar lda topic
    real_lda = {}
    for realtopic in realtopic_vector:
        max_topic = ''
        max_score = float(0)
        for ldatopic in ldatopic_vector:
            score = similarity(realtopic_vector[realtopic], ldatopic_vector[ldatopic])
            if score > max_score:
                max_topic = ldatopic
                max_score = score
        real_lda[realtopic] = max_topic
    return real_lda

def main():
    root_dir = sys.argv[1]
    twords = int(sys.argv[2])
    realtopic_words = int(sys.argv[3])
    source_list = ['sina', 'tencent', 'tianya']

    # generate vsm of real topic
    topicid_file = root_dir + "/result_data/topic_id.data"
    realtopic_vsm = {}
    fin = open(topicid_file, 'r')
    for line in fin:
        fields = line.strip().split('\t')
        realtopic_vsm[fields[0]] = {}
    fin.close()
    topic_source_list = ['sina', 'tencent']
    for topic_source in topic_source_list:
        input_file = root_dir + '/result_data/' + topic_source + '.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            topicid = fields[2]
            if topicid == '-1':
                continue
            for topic in topicid.split(':'):
                if topic not in realtopic_vsm:
                    continue
                for word in fields[1].split(' '):
                    if word not in realtopic_vsm[topic]:
                        realtopic_vsm[topic][word] = 1
                    else:
                        realtopic_vsm[topic][word] += 1
        fin.close()

    # generate vector of real topic
    realtopic_vector = {}
    for topic in realtopic_vsm:
        realtopic_vector[topic] = {}
        length = realtopic_words
        sorted_tmp = sorted(realtopic_vsm[topic].items(), key=lambda d: d[1], reverse=True)
        if len(sorted_tmp) < length:
            length = len(sorted_tmp)
        sum_count = 0
        for i in range(length):
            sum_count += sorted_tmp[i][1]
        for i in range(length):
            realtopic_vector[topic][sorted_tmp[i][0]] = sorted_tmp[i][1] / float(sum_count)

    # mapping real topic with lda topic
    for source in source_list:
        input_file = root_dir + "/lda_model/" + source + "/model-final.twords"
        # re-build the lda topic vectors from the twords file
        ldatopic_vector = {}
        fin = open(input_file, 'r')
        cur_topic = ""
        for line in fin:
            line = line.strip()
            if line.find('Topic') != -1:
                fields = line.split(' ')
                cur_topic = fields[1][0: fields[1].find('th')]
                ldatopic_vector[cur_topic] = {}
            else:
                fields = line.split('\t')
                word = fields[0]
                weight = float(fields[1])
                if weight > 0.0:
                    ldatopic_vector[cur_topic][word] = weight
        fin.close()
        real_lda = topic_mapping(realtopic_vector, ldatopic_vector)
        output_file = root_dir + "/lda_model/" + source + "/topic_mapping.data"
        fout = open(output_file, 'w')
        for realtopic in real_lda:
            print>>fout, "%s\t%s" % (realtopic, real_lda[realtopic])
        fout.close()

if __name__ == "__main__":
    main()
topic_mapping
1) Map each real_topic to an lda_topic
(the real_topic's word vector is built by counting; the lda_topic's word vector comes from training).
2) Calculate the similarity of two dictionaries, i.e. two sparse vectors.
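A compact sketch of the same kind of similarity computation, treating each dict as a sparse vector; the two example vectors are invented. Since missing keys contribute 0, iterating over one dict and looking up the other gives the same dot product as building the full key union:

#!/usr/bin/python
def similarity(real_vector, lda_vector):
    # dot product over shared keys; absent keys count as 0
    score = 0.0
    for word in real_vector:
        score += real_vector[word] * lda_vector.get(word, 0.0)
    return score

real = {'gold': 0.6, 'price': 0.4}
lda = {'gold': 0.3, 'market': 0.2}
print similarity(real, lda)   # 0.18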
6. final_data
#!/usr/bin/python
import sys

def main():
    root_dir = sys.argv[1]
    topn = 2  # the top n lda topics are treated as the real topic distribution of a document
    source_list = ['sina', 'tencent', 'tianya']
    for source in source_list:
        # allocated doc id -> list of its top n lda topics
        allocateid_ldatopic = {}
        input_file = root_dir + '/lda_model/' + source + '/nd.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            allocateid = fields[0]
            topic_distribution = {}
            for i in range(1, len(fields)-1):
                topic_distribution[i-1] = int(fields[i])
            sorted_tmp = sorted(topic_distribution.items(), key=lambda d: d[1], reverse=True)
            allocateid_ldatopic[allocateid] = []
            for i in range(topn):
                allocateid_ldatopic[allocateid].append(sorted_tmp[i][0])
        fin.close()

        # lda topic -> list of real topics (the inverse of topic_mapping.data)
        ldatopic_realtopic = {}
        input_file = root_dir + '/lda_model/' + source + '/topic_mapping.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            ldatopic = fields[1]
            realtopic = fields[0]
            if ldatopic not in ldatopic_realtopic:
                ldatopic_realtopic[ldatopic] = [realtopic]
            else:
                ldatopic_realtopic[ldatopic].append(realtopic)
        fin.close()

        # user id -> profile fields
        userid_profile = {}
        input_file = root_dir + '/result_data/userinfo.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            userid = fields[0]
            sex = fields[1]
            location = fields[2]
            age = fields[3]
            fanscount = fields[5]
            weibocount = fields[6]
            userid_profile[userid] = [sex, location, age, fanscount, weibocount]
        fin.close()

        # original doc id -> allocated id
        docid_allocateid = {}
        input_file = root_dir + '/lda_model/' + source + '/docid.map'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            docid_allocateid[fields[0]] = fields[1]
        fin.close()

        # final.data
        input_file = root_dir + '/result_data/' + source + '.data'
        output_file = root_dir + '/lda_model/' + source + '/final.data'
        fin = open(input_file, 'r')
        fout = open(output_file, 'w')
        for line in fin:
            fields = line.strip().split('\t')
            docid = fields[0]
            allocateid = docid_allocateid[docid]
            topic_set = set()
            if fields[2] != '-1':
                for topic in fields[2].split(':'):
                    if topic in topic_set:
                        continue
                    topic_set.add(topic)
            for ldatopic in allocateid_ldatopic[allocateid]:
                if str(ldatopic) not in ldatopic_realtopic:
                    continue
                for topic in ldatopic_realtopic[str(ldatopic)]:
                    if topic not in topic_set:
                        topic_set.add(topic)
            if topic_set:
                topics = ':'.join(topic_set)
            else:
                topics = 'null'
            comment = fields[3]
            retweet = fields[4]
            praise = fields[5]
            userid = fields[6]
            if userid in userid_profile:
                user_profile = '\t'.join(userid_profile[userid])
            else:
                user_profile = 'null\tnull\tnull\tnull\tnull'
            print>>fout, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (docid, allocateid, topics, comment, retweet, praise, userid, user_profile)
        fin.close()
        fout.close()

if __name__ == "__main__":
    main()
final_data
1) Assign each document its top-2 LDA topics.
2) Invert a dict's keys and values: keys are unique, but the same value can come from several keys, so each inverted value has to be a list.
3) Assign each document a set of related topics: its own labeled topics plus the real topics mapped from its top-2 LDA topics.
4) Merge allocateid_ldatopic, ldatopic_realtopic, userid_profile and docid_allocateid into a single file.
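A short sketch of the inversion described in point 2; the sample mapping is made up:

#!/usr/bin/python
real_to_lda = {'r1': '3', 'r2': '3', 'r3': '7'}   # real topic -> lda topic

lda_to_real = {}   # lda topic -> list of real topics
for realtopic in real_to_lda:
    ldatopic = real_to_lda[realtopic]
    if ldatopic not in lda_to_real:
        lda_to_real[ldatopic] = [realtopic]
    else:
        lda_to_real[ldatopic].append(realtopic)

print lda_to_real   # {'3': ['r1', 'r2'], '7': ['r3']} (list order may vary)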
7. Visualization
#!/usr/bin/python
# -*- coding=utf-8 -*-
import sys
from string import Template
def replace_template(template_file, replaceDict, output_file):
fh = open(template_file, 'r')
content = fh.read()
fh.close()
content_template = Template(content)
content_final = content_template.safe_substitute(replaceDict)
fout = open(output_file, 'w')
fout.write(content_final)
fout.close()
def bar_categories(categories_list):
categories = "["
for i in range(len(categories_list)):
if i == len(categories_list)-1:
categories += "'"+ categories_list[i] +"']"
else:
categories += "'"+ categories_list[i] +"',"
return categories
def bar_series(data_list):
series = "[{ name: 'count', data: ["
for i in range(len(data_list)):
if i == len(data_list)-1:
series += str(data_list[i]) +"]}]"
else:
series += str(data_list[i]) +","
return series
def pie_data(data_map):
data = "["
index = 0
for item in data_map:
if index == len(data_map)-1:
data += "['"+ str(item) +"',"+ str(data_map[item]) +"]"
else:
data += "['"+ str(item) +"',"+ str(data_map[item]) +"],"
index += 1
data += "]"
return data
def main():
root_dir = sys.argv[1]
# topicid and topic's content
topicid_content = {}
input_file = root_dir +'/result_data/topic_id.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
topicid_content[fields[0]] = fields[1]
fin.close()
# 1. Topic distribution
source_list = ['sina', 'tencent', 'tianya']
topicid_count = {}
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null':
continue
for topic in fields[2].split(':'):
if topic in topicid_count:
topicid_count[topic] += 1
else:
topicid_count[topic] = 1
# all topics sorted by their total count
sorted_result = sorted(topicid_count.items(), key = lambda d:d[1], reverse=True)
topN = 20
replaceDict = {}
replaceDict['title'] = "'话题分布'"
replaceDict['subtitle'] = "''"
categories_list = []
for i in range(topN):
categories_list.append(topicid_content[ sorted_result[i][0] ])
replaceDict['categories'] = bar_categories(categories_list)
replaceDict['x_name'] = "'相关微博或帖子条数'"
data_list = []
for i in range(topN):
data_list.append(sorted_result[i][1])
replaceDict['series'] = bar_series(data_list)
template_file = root_dir +'/template/horizontal_bar.tpl'
output_file = root_dir +'/final_html/1.htm'
replace_template(template_file, replaceDict, output_file)
# 2. Trend of the topic distribution over time
# 3. Gender ratio of the users following each topic
topN = 10
topicid_sex = {}
for i in range(topN):
topicid_sex[sorted_result[i][0]] = [0, 0]
source_list = ['sina'] # we only have user profiles for sina currently
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[7] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_sex:
continue
if fields[7] == "男":
topicid_sex[topic][0] += 1
if fields[7] == "女":
topicid_sex[topic][1] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/pie.tpl'
output_file = root_dir +'/final_html/3-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户男女比例'"
sum_count = topicid_sex[sorted_result[i][0]][0] + topicid_sex[sorted_result[i][0]][1]
sex_map = {}
sex_map['男'] = topicid_sex[sorted_result[i][0]][0] / float(sum_count)
sex_map['女'] = topicid_sex[sorted_result[i][0]][1] / float(sum_count)
replaceDict['data'] = pie_data(sex_map)
replace_template(template_file, replaceDict, output_file)
# 4. Regional distribution of the users following each topic
topN = 10
province_conf = root_dir +'/conf/province.list'
province_list = []
province_map = {}
fin = open(province_conf, 'r')
index = 0
for line in fin:
province = line.strip()
province_list.append(province)
province_map[province] = index
index += 1
fin.close()
source_list = ['sina']
topicid_province = {}
for i in range(topN):
topicid_province[sorted_result[i][0]] = []
for j in range(len(province_list)):
topicid_province[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[8] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_province:
continue
province_index = int(province_map[fields[8]])
topicid_province[topic][province_index] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/horizontal_bar.tpl'
output_file = root_dir +'/final_html/4-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户地域分布'"
replaceDict['subtitle'] = "''"
replaceDict['x_name'] = "'相关微博或帖子条数'"
replaceDict['categories'] = bar_categories(province_list)
replaceDict['series'] = bar_series(topicid_province[sorted_result[i][0]])
replace_template(template_file, replaceDict, output_file)
# 5. Age distribution of the users following each topic
topN = 10
age_list = ['10岁以下', '10-19岁', '20-29岁', '30-39岁', '40-49岁', '50-59岁', '60岁以上']
source_list = ['sina']
topicid_age = {}
for i in range(topN):
topicid_age[sorted_result[i][0]] = []
for j in range(len(age_list)):
topicid_age[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[9] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_age:
continue
age = 2013 -int(fields[9])
if age <= 9:
topicid_age[topic][0] += 1
elif age >= 10 and age <= 19:
topicid_age[topic][1] += 1
elif age >= 20 and age <= 29:
topicid_age[topic][2] += 1
elif age >= 30 and age <= 39:
topicid_age[topic][3] += 1
elif age >= 40 and age <= 49:
topicid_age[topic][4] += 1
elif age >= 50 and age <= 59:
topicid_age[topic][5] += 1
elif age >= 60:
topicid_age[topic][6] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/vertical_bar.tpl'
output_file = root_dir +'/final_html/5-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户年龄分布'"
replaceDict['subtitle'] = "''"
replaceDict['y_name'] = "'人数'"
replaceDict['categories'] = bar_categories(age_list)
replaceDict['series'] = bar_series(topicid_age[sorted_result[i][0]])
replace_template(template_file, replaceDict, output_file)
# 6. Share of each source medium per topic
topN = 10
source_list = ['sina', 'tencent', 'tianya']
topicid_source = {}
for i in range(topN):
topicid_source[sorted_result[i][0]] = []
for j in range(len(source_list)):
topicid_source[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_source:
continue
if source == "sina":
topicid_source[topic][0] += 1
if source == "tencent":
topicid_source[topic][1] += 1
if source == "tianya":
topicid_source[topic][2] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/pie.tpl'
output_file = root_dir +'/final_html/6-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 话题来源媒体分布'"
source_map = {}
source_map['sina'] = topicid_source[sorted_result[i][0]][0]
source_map['tencent'] = topicid_source[sorted_result[i][0]][1]
source_map['tianya'] = topicid_source[sorted_result[i][0]][2]
replaceDict['data'] = pie_data(source_map)
replace_template(template_file, replaceDict, output_file)
# 7. Core users of each topic
topN = 10
coreuser = 5
source_list = ['sina']
topicid_user = {}
for i in range(topN):
topicid_user[sorted_result[i][0]] = {}
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[6] == 'null':
continue
userid = fields[6]
for topic in fields[2].split(':'):
if topic not in topicid_user:
continue
if userid not in topicid_user[topic]:
topicid_user[topic][userid] = 1
else:
topicid_user[topic][userid] += 1
fin.close()
output_file = root_dir +'/final_html/topic_coreuser.list'
fout = open(output_file, 'w')
for i in range(topN):
title = "#"+ topicid_content[sorted_result[i][0]] +"# 话题核心关注人物"
print>>fout, title
sorted_tmp = sorted(topicid_user[sorted_result[i][0]].items(), key = lambda d:d[1], reverse =True)
if len(sorted_tmp) < coreuser:
coreuser = len(sorted_tmp)
for j in range(coreuser):
print>>fout, "\t%s\t%s"%(sorted_tmp[j][0], sorted_tmp[j][1]) # userid and related documents count
fout.close()
# 8. Follower-count distribution of the users following each topic
topN = 10
fans_list = ['0-100', '101-1000', '1001-10000', '10001-100000', '100001-500000', '500000以上']
source_list = ['sina']
topicid_fans = {}
for i in range(topN):
topicid_fans[sorted_result[i][0]] = []
for j in range(len(fans_list)):
topicid_fans[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[6] == 'null' or fields[10] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_fans:
continue
fans = int(fields[10])
if fans <= 100:
topicid_fans[topic][0] += 1
elif fans >= 101 and fans <= 1000:
topicid_fans[topic][1] += 1
elif fans >= 1001 and fans <= 10000:
topicid_fans[topic][2] += 1
elif fans >= 10001 and fans <= 100000:
topicid_fans[topic][3] += 1
elif fans >= 100001 and fans <= 500000:
topicid_fans[topic][4] += 1
elif fans >= 500001:
topicid_fans[topic][5] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/horizontal_bar.tpl'
output_file = root_dir +'/final_html/8-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户粉丝数分布'"
replaceDict['subtitle'] = "''"
replaceDict['x_name'] = "'粉丝数'"
replaceDict['categories'] = bar_categories(fans_list)
replaceDict['series'] = bar_series(topicid_fans[sorted_result[i][0]])
replace_template(template_file, replaceDict, output_file)
# 9. Weibo-count distribution of the users following each topic
topN = 10
weibo_list = ['0-100', '101-1000', '1001-3000', '3001-5000', '5001-10000', '10000以上']
source_list = ['sina']
topicid_weibo = {}
for i in range(topN):
topicid_weibo[sorted_result[i][0]] = []
for j in range(len(weibo_list)):
topicid_weibo[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[6] == 'null' or fields[11] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_weibo:
continue
weibo = int(fields[11]) # fields[11] is the weibo count; fields[10] is the fans count
if weibo <= 100:
topicid_weibo[topic][0] += 1
elif weibo >= 101 and weibo <= 1000:
topicid_weibo[topic][1] += 1
elif weibo >= 1001 and weibo <= 3000:
topicid_weibo[topic][2] += 1
elif weibo >= 3001 and weibo <= 5000:
topicid_weibo[topic][3] += 1
elif weibo >= 5001 and weibo <= 10000:
topicid_weibo[topic][4] += 1
elif weibo >= 10001:
topicid_weibo[topic][5] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/horizontal_bar.tpl'
output_file = root_dir +'/final_html/9-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户微博数分布'"
replaceDict['subtitle'] = "''"
replaceDict['x_name'] = "'微博数'"
replaceDict['categories'] = bar_categories(weibo_list)
replaceDict['series'] = bar_series(topicid_weibo[sorted_result[i][0]])
replace_template(template_file, replaceDict, output_file)
# 10. Attention, diffusion and activity of each topic
topN = 10
source_list = ['sina', 'tencent', 'tianya']
topicid_attention = {}
topicid_diffuse = {}
topicid_active = {}
for i in range(topN):
topicid_attention[sorted_result[i][0]] = set() #userlist
topicid_diffuse[sorted_result[i][0]] = {} #user and fans
topicid_active[sorted_result[i][0]] = 0 #comment and retweet and praise
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_attention:
continue
if fields[6] != 'null':
if fields[6] not in topicid_attention[topic]:
topicid_attention[topic].add(fields[6])
if fields[10] != 'null':
if fields[6] not in topicid_diffuse[topic]:
topicid_diffuse[topic][fields[6]] = int(fields[10])
if fields[3] != 'null':
topicid_active[topic] += int(fields[3])
if fields[4] != 'null':
topicid_active[topic] += int(fields[4])
if fields[5] != 'null':
topicid_active[topic] += int(fields[5])
fin.close()
output_file = root_dir +'/final_html/topic_attention_diffuse_active.list'
fout = open(output_file, 'w')
for i in range(topN):
title = "#"+ topicid_content[sorted_result[i][0]] +"# 关注度、传播度、活跃度"
print>>fout, title
attention = len(topicid_attention[sorted_result[i][0]])
diffuse = 0
for user in topicid_diffuse[sorted_result[i][0]]:
diffuse += topicid_diffuse[sorted_result[i][0]][user]
active = topicid_active[sorted_result[i][0]]
print>>fout, "\t%s\t%s\t%s"%(attention, diffuse, active)
fout.close()
if __name__ == "__main__":
main()
visualization
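The visualization step only fills placeholders in pre-built HTML chart templates via string.Template. A minimal illustration of that mechanism follows; the template text and values here are invented, not the real horizontal_bar.tpl:

#!/usr/bin/python
from string import Template

tpl = Template("title: ${title}\ncategories: ${categories}\nseries: ${series}")
replaceDict = {'title': "'Topic distribution'",
               'categories': "['t1','t2']",
               'series': "[{ name: 'count', data: [12,7]}]"}
# safe_substitute leaves unknown ${...} placeholders untouched instead of raising
print tpl.safe_substitute(replaceDict)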