paper about spring
1. Parse the JSON file of raw user information
#!/usr/bin/python
# -*- coding=utf-8 -*-
import os
import sys
import json

def main():
    root_dir = sys.argv[1]
    # load the province list used to normalize locations
    province_file = root_dir + "/conf/province.list"
    fin = open(province_file, 'r')
    provinces = set()
    for line in fin:
        province = line.strip()
        provinces.add(province)
    fin.close()

    input_file = root_dir + "/source_data/userinfo.json"
    output_file = root_dir + "/result_data/userinfo.data"
    fin = open(input_file, 'r')
    fout = open(output_file, 'w')
    for line in fin:
        if line.strip() == "[]":
            continue
        json_file = json.loads(line.strip())
        userid = json_file['userId']
        sex = json_file['sex']
        location = json_file['location']
        birthday = json_file['birthday']
        attentioncount = json_file['attentionCount']
        fanscount = json_file['fansCount']
        weibocount = json_file['weiboCount']
        label_list = json_file['labelList']
        user_introduce = json_file['userIntroduce']
        if not sex:
            sex = 'null'
        # keep only the province part of the location
        if location.find(' ') != -1:
            fields = location.split(' ')
            location = fields[0]
        elif location:
            for province in provinces:
                if location.find(province) != -1:
                    location = province
        if not location:
            location = 'null'
        # keep only the birth year
        index = birthday.find('年')
        if index != -1:
            birthday = birthday[0:index]
        else:
            birthday = 'null'
        if not attentioncount:
            attentioncount = ''
        if not fanscount:
            fanscount = ''
        if not weibocount:
            weibocount = ''
        if not label_list or not label_list.strip():
            label_list = 'null'
        if not user_introduce or not user_introduce.strip():
            user_introduce = 'null'
        print>>fout, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (userid, sex, location, birthday, attentioncount, fanscount, weibocount, label_list, user_introduce)
    fin.close()
    fout.close()

if __name__ == "__main__":
    main()
UserInfoParser
1) The user labels still need further word segmentation.
2) In this dataset, roughly one third of all users carry labels (a quick check of this ratio is sketched below UserShell).
3) Users without labels are marked with null here.
If the records without labels need to be dropped, the corresponding shell commands are:
cat userinfo.data | awk -F '\t' '{print $8}' | sed '/null/d'
cat userinfo.data | cut -f 8 | sed '/null/d'
cat userinfo.data | awk -F '\t' '$8 != "null" {print $8}'
UserShell
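To sanity-check the roughly one-third figure from note 2, a small sketch along these lines could count labeled vs. unlabeled records, assuming column 8 of userinfo.data is the label field produced above:

#!/usr/bin/python
# Count labeled vs. unlabeled users in userinfo.data (column 8 is the label field).
labeled, unlabeled = 0, 0
fin = open('userinfo.data', 'r')
for line in fin:
    fields = line.strip().split('\t')
    if fields[7] == 'null':
        unlabeled += 1
    else:
        labeled += 1
fin.close()
print "%s labeled, %s unlabeled, ratio %.2f" % (labeled, unlabeled, labeled / float(labeled + unlabeled))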
2. Mapping and sorting operations
#!/usr/bin/python
import os
import sys

def main():
    root_dir = sys.argv[1]
    topN = int(sys.argv[2])

    # load the mapping from topic id to topic text
    topic_total_file = root_dir + '/result_data/topic_id.data.total'
    id_topic = {}
    fin = open(topic_total_file, 'r')
    for line in fin:
        fields = line.strip().split('\t')
        id_topic[fields[1]] = fields[0]
    fin.close()

    # count how many documents mention each topic id
    topicid_count = {}
    sources = ['sina', 'tencent']
    for source in sources:
        input_file = root_dir + '/result_data/' + source + '.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            if fields[2] == '-1':
                continue
            topics = fields[2].split(':')
            for topic in topics:
                if topic in topicid_count:
                    topicid_count[topic] += 1
                else:
                    topicid_count[topic] = 1
        fin.close()

    # sort by count and keep only the top N topics
    sort_topic = sorted(topicid_count.items(), key=lambda d: d[1], reverse=True)
    if len(sort_topic) < topN:
        topN = len(sort_topic)
    output_file = root_dir + '/result_data/topic_id.data'
    fout = open(output_file, 'w')
    for i in range(topN):
        print>>fout, "%s\t%s\t%s" % (sort_topic[i][0], id_topic[sort_topic[i][0]], topicid_count[sort_topic[i][0]])
    fout.close()

if __name__ == "__main__":
    main()
TopN_topic
1) Build two dictionaries: one stores the mapping from topic id to topic text, the other stores, for each topic id, how many times that topic occurs.
2) Sorting by count is done with sorted(dict.items(), key=lambda d: d[1], reverse=True), which returns a list of (key, value) tuples.
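A minimal, self-contained illustration of this sorting idiom; the dictionary contents are invented for the example:

#!/usr/bin/python
counts = {'t1': 5, 't2': 12, 't3': 7}
sorted_counts = sorted(counts.items(), key=lambda d: d[1], reverse=True)
# sorted_counts is a list of (key, count) tuples: [('t2', 12), ('t3', 7), ('t1', 5)]
for topic, count in sorted_counts[:2]:   # keep only the top 2
    print "%s\t%s" % (topic, count)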
3. Document id allocation and stopword handling
#!/usr/bin/python
import os
import sys

def main():
    if len(sys.argv) != 4:
        print "error parameters!"
        sys.exit(0)
    root_dir = sys.argv[1][0:sys.argv[1].rfind('/')]
    input_dir = sys.argv[1]
    output_root_dir = sys.argv[2]
    topic_multiple = float(sys.argv[3])

    # stopwords
    stopwords_file = root_dir + '/conf/stopwords.list'
    fin = open(stopwords_file, 'r')
    stopwords = set()
    for line in fin:
        word = line.strip()
        stopwords.add(word)
    fin.close()

    # generate ntopics_alpha.data
    cmd = "wc -l " + root_dir + "/result_data/topic_id.data | awk -F' ' '{print $1}'"
    num_topics = int(int(os.popen(cmd).read().strip()) * topic_multiple)
    alpha = 50 / float(num_topics)
    ntopics_alpha_file = output_root_dir + '/ntopics_alpha.data'
    fout = open(ntopics_alpha_file, 'w')
    print>>fout, "%s\t%s" % (num_topics, alpha)
    fout.close()

    # allocate docid and remove stopwords
    source_list = ['sina', 'tencent', 'tianya']
    for source in source_list:
        input_file = input_dir + '/' + source + '.data'
        cmd = "wc -l " + input_file + " | awk -F' ' '{print $1}'"
        line_number = os.popen(cmd).read().strip()
        output_file = output_root_dir + '/' + source + '/source.data'
        fin = open(input_file, 'r')
        fout = open(output_file, 'w')
        # the first line of source.data is the number of documents
        print>>fout, line_number
        docid = {}
        allocate_id = 0
        for line in fin:
            fields = line.strip().split('\t')
            doc = fields[0]
            docid[doc] = allocate_id
            allocate_id += 1
            line = ""
            for word in fields[1].split(' '):
                if word.strip() and word not in stopwords:
                    line += word + '\t'
            if len(line) == 0:
                print>>fout, 'null'
            else:
                print>>fout, line
        fin.close()
        fout.close()
        # save the original-doc-id to allocated-id mapping
        docid_file = output_root_dir + '/' + source + '/docid.map'
        fout = open(docid_file, 'w')
        for doc in docid:
            print>>fout, "%s\t%s" % (doc, docid[doc])
        fout.close()

if __name__ == "__main__":
    main()
allocateDocId
1) How to remove stopwords.
2) How to allocate an id to each document and save the mapping.
3) In practice, whether the document id and the word itself are used as keys directly, or processed further, can be decided case by case.
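A stripped-down sketch of the same two steps (id allocation plus stopword filtering); the stopword set and the two documents are in-memory placeholders rather than the real input files:

#!/usr/bin/python
stopwords = set(['the', 'a', 'of'])
docs = [('doc_001', 'the price of gold'), ('doc_002', 'a new phone')]

docid = {}          # original doc id -> allocated integer id
allocate_id = 0
for doc, text in docs:
    docid[doc] = allocate_id
    allocate_id += 1
    kept = [w for w in text.split(' ') if w.strip() and w not in stopwords]
    print "%s\t%s" % (docid[doc], '\t'.join(kept) if kept else 'null')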
4. generate_nw_nd
#!/usr/bin/python
import os
import sys

def main():
    root_dir = sys.argv[1]
    cmd = "cat " + root_dir + "/lda_model/ntopics_alpha.data | awk -F' ' '{print $1}' "
    num_topics = int(os.popen(cmd).read().strip())

    source_list = ['sina', 'tencent', 'tianya']
    for source in source_list:
        tassign_file = root_dir + '/lda_model/' + source + '/model-final.tassign'
        nd_file = root_dir + '/lda_model/' + source + '/nd.data'
        # the first line of wordmap.txt is the vocabulary size
        cmd = "head -1 " + root_dir + "/lda_model/" + source + "/wordmap.txt"
        num_tokens = int(os.popen(cmd).read().strip())
        # nw is a flat list used as a num_tokens x num_topics matrix
        nw = [0 for i in range(num_topics * num_tokens)]
        fin = open(tassign_file, 'r')
        fout = open(nd_file, 'w')
        docid = 0
        for line in fin:
            fields = line.strip().split(' ')
            nd = [0 for i in range(num_topics)]
            # each pair in the tassign file is "wordid:topicid"
            for pair in fields:
                parts = pair.split(':')
                wordid = int(parts[0])
                topicid = int(parts[1])
                nw[wordid*num_topics + topicid] += 1
                nd[topicid] += 1
            print>>fout, "%s\t%s" % (docid, "\t".join([str(i) for i in nd]))
            docid += 1
        fin.close()
        fout.close()

        nw_file = root_dir + '/lda_model/' + source + '/nw.data'
        fout = open(nw_file, 'w')
        for wordid in range(num_tokens):
            line = ''
            for topicid in range(num_topics):
                line += str(nw[wordid*num_topics + topicid]) + '\t'
            print>>fout, line
        fout.close()

if __name__ == "__main__":
    main()
generate_nw_nd
1) A flat Python list is used as a two-dimensional matrix, indexed as nw[wordid*num_topics + topicid].
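A minimal sketch of this flat-list-as-matrix indexing, with made-up sizes:

#!/usr/bin/python
num_tokens = 3   # rows (vocabulary size), illustrative value
num_topics = 4   # columns (number of topics), illustrative value
nw = [0 for i in range(num_tokens * num_topics)]

def incr(wordid, topicid):
    # equivalent to nw[wordid][topicid] += 1 in matrix terms
    nw[wordid * num_topics + topicid] += 1

incr(2, 1)
incr(2, 1)
print nw[2 * num_topics + 1]   # prints 2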
5. topic_mapping
#!/usr/bin/python
import os
import sys

def similarity(real_vector, lda_vector):
    # dot product of two sparse vectors represented as dicts
    score = float(0)
    words = set()
    for word in real_vector:
        if word not in words:
            words.add(word)
    for word in lda_vector:
        if word not in words:
            words.add(word)
    real_list = []
    lda_list = []
    for word in words:
        if word in real_vector:
            real_list.append(real_vector[word])
        else:
            real_list.append(float(0))
        if word in lda_vector:
            lda_list.append(lda_vector[word])
        else:
            lda_list.append(float(0))
    for i in range(len(real_list)):
        score += real_list[i] * lda_list[i]
    return score

def topic_mapping(realtopic_vector, ldatopic_vector):
    # for each real topic, pick the most similar lda topic
    real_lda = {}
    for realtopic in realtopic_vector:
        max_topic = ''
        max_score = float(0)
        for ldatopic in ldatopic_vector:
            score = similarity(realtopic_vector[realtopic], ldatopic_vector[ldatopic])
            if score > max_score:
                max_topic = ldatopic
                max_score = score
        real_lda[realtopic] = max_topic
    return real_lda

def main():
    root_dir = sys.argv[1]
    twords = int(sys.argv[2])
    realtopic_words = int(sys.argv[3])
    source_list = ['sina', 'tencent', 'tianya']

    # generate vsm of real topic
    topicid_file = root_dir + "/result_data/topic_id.data"
    realtopic_vsm = {}
    fin = open(topicid_file, 'r')
    for line in fin:
        fields = line.strip().split('\t')
        realtopic_vsm[fields[0]] = {}
    fin.close()
    topic_source_list = ['sina', 'tencent']
    for topic_source in topic_source_list:
        input_file = root_dir + '/result_data/' + topic_source + '.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            topicid = fields[2]
            if topicid == '-1':
                continue
            for topic in topicid.split(':'):
                if topic not in realtopic_vsm:
                    continue
                for word in fields[1].split(' '):
                    if word not in realtopic_vsm[topic]:
                        realtopic_vsm[topic][word] = 1
                    else:
                        realtopic_vsm[topic][word] += 1
        fin.close()

    # generate vector of real topic
    realtopic_vector = {}
    for topic in realtopic_vsm:
        realtopic_vector[topic] = {}
        length = realtopic_words
        sorted_tmp = sorted(realtopic_vsm[topic].items(), key=lambda d: d[1], reverse=True)
        if len(sorted_tmp) < length:
            length = len(sorted_tmp)
        sum_count = 0
        for i in range(length):
            sum_count += sorted_tmp[i][1]
        for i in range(length):
            realtopic_vector[topic][sorted_tmp[i][0]] = sorted_tmp[i][1] / float(sum_count)

    # mapping real topic with lda topic
    for source in source_list:
        input_file = root_dir + "/lda_model/" + source + "/model-final.twords"
        # re-build the lda topic vectors from the twords file
        ldatopic_vector = {}
        fin = open(input_file, 'r')
        cur_topic = ""
        for line in fin:
            line = line.strip()
            if line.find('Topic') != -1:
                fields = line.split(' ')
                cur_topic = fields[1][0: fields[1].find('th')]
                ldatopic_vector[cur_topic] = {}
            else:
                fields = line.split('\t')
                word = fields[0]
                weight = float(fields[1])
                if weight > 0.0:
                    ldatopic_vector[cur_topic][word] = weight
        fin.close()
        real_lda = topic_mapping(realtopic_vector, ldatopic_vector)
        output_file = root_dir + "/lda_model/" + source + "/topic_mapping.data"
        fout = open(output_file, 'w')
        for realtopic in real_lda:
            print>>fout, "%s\t%s" % (realtopic, real_lda[realtopic])
        fout.close()

if __name__ == "__main__":
    main()
topic_mapping
1) Map each real_topic to an lda_topic
(the real_topic's word vector is built by counting; the lda_topic's word vector comes from training).
2) Calculate the similarity of two dictionaries, i.e. two sparse vectors.
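A compact sketch of the same kind of similarity computation, treating each dict as a sparse vector; the two example vectors are invented. Since missing keys contribute 0, iterating over one dict and looking up the other gives the same dot product as building the full key union:

#!/usr/bin/python
def similarity(real_vector, lda_vector):
    # dot product over shared keys; absent keys count as 0
    score = 0.0
    for word in real_vector:
        score += real_vector[word] * lda_vector.get(word, 0.0)
    return score

real = {'gold': 0.6, 'price': 0.4}
lda = {'gold': 0.3, 'market': 0.2}
print similarity(real, lda)   # 0.18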
6. final_data
#!/usr/bin/python
import sys

def main():
    root_dir = sys.argv[1]
    topn = 2  # the top n lda topics are treated as the real topic distribution of a document
    source_list = ['sina', 'tencent', 'tianya']
    for source in source_list:
        # allocated doc id -> list of its top n lda topics
        allocateid_ldatopic = {}
        input_file = root_dir + '/lda_model/' + source + '/nd.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            allocateid = fields[0]
            topic_distribution = {}
            for i in range(1, len(fields)-1):
                topic_distribution[i-1] = int(fields[i])
            sorted_tmp = sorted(topic_distribution.items(), key=lambda d: d[1], reverse=True)
            allocateid_ldatopic[allocateid] = []
            for i in range(topn):
                allocateid_ldatopic[allocateid].append(sorted_tmp[i][0])
        fin.close()

        # lda topic -> list of real topics (the inverse of topic_mapping.data)
        ldatopic_realtopic = {}
        input_file = root_dir + '/lda_model/' + source + '/topic_mapping.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            ldatopic = fields[1]
            realtopic = fields[0]
            if ldatopic not in ldatopic_realtopic:
                ldatopic_realtopic[ldatopic] = [realtopic]
            else:
                ldatopic_realtopic[ldatopic].append(realtopic)
        fin.close()

        # user id -> profile fields
        userid_profile = {}
        input_file = root_dir + '/result_data/userinfo.data'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            userid = fields[0]
            sex = fields[1]
            location = fields[2]
            age = fields[3]
            fanscount = fields[5]
            weibocount = fields[6]
            userid_profile[userid] = [sex, location, age, fanscount, weibocount]
        fin.close()

        # original doc id -> allocated id
        docid_allocateid = {}
        input_file = root_dir + '/lda_model/' + source + '/docid.map'
        fin = open(input_file, 'r')
        for line in fin:
            fields = line.strip().split('\t')
            docid_allocateid[fields[0]] = fields[1]
        fin.close()

        # final.data
        input_file = root_dir + '/result_data/' + source + '.data'
        output_file = root_dir + '/lda_model/' + source + '/final.data'
        fin = open(input_file, 'r')
        fout = open(output_file, 'w')
        for line in fin:
            fields = line.strip().split('\t')
            docid = fields[0]
            allocateid = docid_allocateid[docid]
            topic_set = set()
            if fields[2] != '-1':
                for topic in fields[2].split(':'):
                    if topic in topic_set:
                        continue
                    topic_set.add(topic)
            for ldatopic in allocateid_ldatopic[allocateid]:
                if str(ldatopic) not in ldatopic_realtopic:
                    continue
                for topic in ldatopic_realtopic[str(ldatopic)]:
                    if topic not in topic_set:
                        topic_set.add(topic)
            if topic_set:
                topics = ':'.join(topic_set)
            else:
                topics = 'null'
            comment = fields[3]
            retweet = fields[4]
            praise = fields[5]
            userid = fields[6]
            if userid in userid_profile:
                user_profile = '\t'.join(userid_profile[userid])
            else:
                user_profile = 'null\tnull\tnull\tnull\tnull'
            print>>fout, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (docid, allocateid, topics, comment, retweet, praise, userid, user_profile)
        fin.close()
        fout.close()

if __name__ == "__main__":
    main()
final_data
1) Assign each document its top-2 LDA topics.
2) Invert a dict's keys and values: keys are unique, but the same value can come from several keys, so each inverted value has to be a list.
3) Assign each document a set of related topics: its own labeled topics plus the real topics mapped from its top-2 LDA topics.
4) Merge allocateid_ldatopic, ldatopic_realtopic, userid_profile and docid_allocateid into a single file.
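A short sketch of the inversion described in point 2; the sample mapping is made up:

#!/usr/bin/python
real_to_lda = {'r1': '3', 'r2': '3', 'r3': '7'}   # real topic -> lda topic

lda_to_real = {}   # lda topic -> list of real topics
for realtopic in real_to_lda:
    ldatopic = real_to_lda[realtopic]
    if ldatopic not in lda_to_real:
        lda_to_real[ldatopic] = [realtopic]
    else:
        lda_to_real[ldatopic].append(realtopic)

print lda_to_real   # {'3': ['r1', 'r2'], '7': ['r3']} (list order may vary)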
7. Visualization
#!/usr/bin/python
# -*- coding=utf-8 -*-
import sys
from string import Template
def replace_template(template_file, replaceDict, output_file):
fh = open(template_file, 'r')
content = fh.read()
fh.close()
content_template = Template(content)
content_final = content_template.safe_substitute(replaceDict)
fout = open(output_file, 'w')
fout.write(content_final)
fout.close()
def bar_categories(categories_list):
categories = "["
for i in range(len(categories_list)):
if i == len(categories_list)-1:
categories += "'"+ categories_list[i] +"']"
else:
categories += "'"+ categories_list[i] +"',"
return categories
def bar_series(data_list):
series = "[{ name: 'count', data: ["
for i in range(len(data_list)):
if i == len(data_list)-1:
series += str(data_list[i]) +"]}]"
else:
series += str(data_list[i]) +","
return series
def pie_data(data_map):
data = "["
index = 0
for item in data_map:
if index == len(data_map)-1:
data += "['"+ str(item) +"',"+ str(data_map[item]) +"]"
else:
data += "['"+ str(item) +"',"+ str(data_map[item]) +"],"
index += 1
data += "]"
return data
def main():
root_dir = sys.argv[1]
# topicid and topic's content
topicid_content = {}
input_file = root_dir +'/result_data/topic_id.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
topicid_content[fields[0]] = fields[1]
fin.close()
# 1. Topic distribution
source_list = ['sina', 'tencent', 'tianya']
topicid_count = {}
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null':
continue
for topic in fields[2].split(':'):
if topic in topicid_count:
topicid_count[topic] += 1
else:
topicid_count[topic] = 1
# all topics sorted by their total count
sorted_result = sorted(topicid_count.items(), key = lambda d:d[1], reverse=True)
topN = 20
replaceDict = {}
replaceDict['title'] = "'话题分布'"
replaceDict['subtitle'] = "''"
categories_list = []
for i in range(topN):
categories_list.append(topicid_content[ sorted_result[i][0] ])
replaceDict['categories'] = bar_categories(categories_list)
replaceDict['x_name'] = "'相关微博或帖子条数'"
data_list = []
for i in range(topN):
data_list.append(sorted_result[i][1])
replaceDict['series'] = bar_series(data_list)
template_file = root_dir +'/template/horizontal_bar.tpl'
output_file = root_dir +'/final_html/1.htm'
replace_template(template_file, replaceDict, output_file)
# 2. Trend of the topic distribution over time
# 3. Gender ratio of the users following each topic
topN = 10
topicid_sex = {}
for i in range(topN):
topicid_sex[sorted_result[i][0]] = [0, 0]
source_list = ['sina'] # we only have user profiles for sina currently
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[7] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_sex:
continue
if fields[7] == "男":
topicid_sex[topic][0] += 1
if fields[7] == "女":
topicid_sex[topic][1] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/pie.tpl'
output_file = root_dir +'/final_html/3-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户男女比例'"
sum_count = topicid_sex[sorted_result[i][0]][0] + topicid_sex[sorted_result[i][0]][1]
sex_map = {}
sex_map['男'] = topicid_sex[sorted_result[i][0]][0] / float(sum_count)
sex_map['女'] = topicid_sex[sorted_result[i][0]][1] / float(sum_count)
replaceDict['data'] = pie_data(sex_map)
replace_template(template_file, replaceDict, output_file)
# 4. Regional distribution of the users following each topic
topN = 10
province_conf = root_dir +'/conf/province.list'
province_list = []
province_map = {}
fin = open(province_conf, 'r')
index = 0
for line in fin:
province = line.strip()
province_list.append(province)
province_map[province] = index
index += 1
fin.close()
source_list = ['sina']
topicid_province = {}
for i in range(topN):
topicid_province[sorted_result[i][0]] = []
for j in range(len(province_list)):
topicid_province[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[8] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_province:
continue
province_index = int(province_map[fields[8]])
topicid_province[topic][province_index] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/horizontal_bar.tpl'
output_file = root_dir +'/final_html/4-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户地域分布'"
replaceDict['subtitle'] = "''"
replaceDict['x_name'] = "'相关微博或帖子条数'"
replaceDict['categories'] = bar_categories(province_list)
replaceDict['series'] = bar_series(topicid_province[sorted_result[i][0]])
replace_template(template_file, replaceDict, output_file)
# 5. Age distribution of the users following each topic
topN = 10
age_list = ['10岁以下', '10-19岁', '20-29岁', '30-39岁', '40-49岁', '50-59岁', '60岁以上']
source_list = ['sina']
topicid_age = {}
for i in range(topN):
topicid_age[sorted_result[i][0]] = []
for j in range(len(age_list)):
topicid_age[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[9] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_age:
continue
age = 2013 -int(fields[9])
if age <= 9:
topicid_age[topic][0] += 1
elif age >= 10 and age <= 19:
topicid_age[topic][1] += 1
elif age >= 20 and age <= 29:
topicid_age[topic][2] += 1
elif age >= 30 and age <= 39:
topicid_age[topic][3] += 1
elif age >= 40 and age <= 49:
topicid_age[topic][4] += 1
elif age >= 50 and age <= 59:
topicid_age[topic][5] += 1
elif age >= 60:
topicid_age[topic][6] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/vertical_bar.tpl'
output_file = root_dir +'/final_html/5-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户年龄分布'"
replaceDict['subtitle'] = "''"
replaceDict['y_name'] = "'人数'"
replaceDict['categories'] = bar_categories(age_list)
replaceDict['series'] = bar_series(topicid_age[sorted_result[i][0]])
replace_template(template_file, replaceDict, output_file)
# 6. Share of each source medium per topic
topN = 10
source_list = ['sina', 'tencent', 'tianya']
topicid_source = {}
for i in range(topN):
topicid_source[sorted_result[i][0]] = []
for j in range(len(source_list)):
topicid_source[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_source:
continue
if source == "sina":
topicid_source[topic][0] += 1
if source == "tencent":
topicid_source[topic][1] += 1
if source == "tianya":
topicid_source[topic][2] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/pie.tpl'
output_file = root_dir +'/final_html/6-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 话题来源媒体分布'"
source_map = {}
source_map['sina'] = topicid_source[sorted_result[i][0]][0]
source_map['tencent'] = topicid_source[sorted_result[i][0]][1]
source_map['tianya'] = topicid_source[sorted_result[i][0]][2]
replaceDict['data'] = pie_data(source_map)
replace_template(template_file, replaceDict, output_file)
# 7. Core users of each topic
topN = 10
coreuser = 5
source_list = ['sina']
topicid_user = {}
for i in range(topN):
topicid_user[sorted_result[i][0]] = {}
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[6] == 'null':
continue
userid = fields[6]
for topic in fields[2].split(':'):
if topic not in topicid_user:
continue
if userid not in topicid_user[topic]:
topicid_user[topic][userid] = 1
else:
topicid_user[topic][userid] += 1
fin.close()
output_file = root_dir +'/final_html/topic_coreuser.list'
fout = open(output_file, 'w')
for i in range(topN):
title = "#"+ topicid_content[sorted_result[i][0]] +"# 话题核心关注人物"
print>>fout, title
sorted_tmp = sorted(topicid_user[sorted_result[i][0]].items(), key = lambda d:d[1], reverse =True)
if len(sorted_tmp) < coreuser:
coreuser = len(sorted_tmp)
for j in range(coreuser):
print>>fout, "\t%s\t%s"%(sorted_tmp[j][0], sorted_tmp[j][1]) # userid and related documents count
fout.close()
# 8. Follower-count distribution of the users following each topic
topN = 10
fans_list = ['0-100', '101-1000', '1001-10000', '10001-100000', '100001-500000', '500000以上']
source_list = ['sina']
topicid_fans = {}
for i in range(topN):
topicid_fans[sorted_result[i][0]] = []
for j in range(len(fans_list)):
topicid_fans[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[6] == 'null' or fields[10] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_fans:
continue
fans = int(fields[10])
if fans <= 100:
topicid_fans[topic][0] += 1
elif fans >= 101 and fans <= 1000:
topicid_fans[topic][1] += 1
elif fans >= 1001 and fans <= 10000:
topicid_fans[topic][2] += 1
elif fans >= 10001 and fans <= 100000:
topicid_fans[topic][3] += 1
elif fans >= 100001 and fans <= 500000:
topicid_fans[topic][4] += 1
elif fans >= 500001:
topicid_fans[topic][5] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/horizontal_bar.tpl'
output_file = root_dir +'/final_html/8-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户粉丝数分布'"
replaceDict['subtitle'] = "''"
replaceDict['x_name'] = "'粉丝数'"
replaceDict['categories'] = bar_categories(fans_list)
replaceDict['series'] = bar_series(topicid_fans[sorted_result[i][0]])
replace_template(template_file, replaceDict, output_file)
# 9. Weibo-count distribution of the users following each topic
topN = 10
weibo_list = ['0-100', '101-1000', '1001-3000', '3001-5000', '5001-10000', '10000以上']
source_list = ['sina']
topicid_weibo = {}
for i in range(topN):
topicid_weibo[sorted_result[i][0]] = []
for j in range(len(weibo_list)):
topicid_weibo[sorted_result[i][0]].append(0)
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null' or fields[6] == 'null' or fields[11] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_weibo:
continue
weibo = int(fields[11]) # fields[11] is the weibo count; fields[10] is the fans count
if weibo <= 100:
topicid_weibo[topic][0] += 1
elif weibo >= 101 and weibo <= 1000:
topicid_weibo[topic][1] += 1
elif weibo >= 1001 and weibo <= 3000:
topicid_weibo[topic][2] += 1
elif weibo >= 3001 and weibo <= 5000:
topicid_weibo[topic][3] += 1
elif weibo >= 5001 and weibo <= 10000:
topicid_weibo[topic][4] += 1
elif weibo >= 10001:
topicid_weibo[topic][5] += 1
fin.close()
for i in range(topN):
template_file = root_dir +'/template/horizontal_bar.tpl'
output_file = root_dir +'/final_html/9-'+ str(i) +'.htm'
replaceDict = {}
replaceDict['title'] = "'#"+ topicid_content[sorted_result[i][0]] +"# 关注用户微博数分布'"
replaceDict['subtitle'] = "''"
replaceDict['x_name'] = "'微博数'"
replaceDict['categories'] = bar_categories(weibo_list)
replaceDict['series'] = bar_series(topicid_weibo[sorted_result[i][0]])
replace_template(template_file, replaceDict, output_file)
# 10. Attention, diffusion and activity of each topic
topN = 10
source_list = ['sina', 'tencent', 'tianya']
topicid_attention = {}
topicid_diffuse = {}
topicid_active = {}
for i in range(topN):
topicid_attention[sorted_result[i][0]] = set() #userlist
topicid_diffuse[sorted_result[i][0]] = {} #user and fans
topicid_active[sorted_result[i][0]] = 0 #comment and retweet and praise
for source in source_list:
input_file = root_dir +'/lda_model/'+ source +'/final.data'
fin = open(input_file, 'r')
for line in fin:
fields = line.strip().split('\t')
if fields[2] == 'null':
continue
for topic in fields[2].split(':'):
if topic not in topicid_attention:
continue
if fields[6] != 'null':
if fields[6] not in topicid_attention[topic]:
topicid_attention[topic].add(fields[6])
if fields[10] != 'null':
if fields[6] not in topicid_diffuse[topic]:
topicid_diffuse[topic][fields[6]] = int(fields[10])
if fields[3] != 'null':
topicid_active[topic] += int(fields[3])
if fields[4] != 'null':
topicid_active[topic] += int(fields[4])
if fields[5] != 'null':
topicid_active[topic] += int(fields[5])
fin.close()
output_file = root_dir +'/final_html/topic_attention_diffuse_active.list'
fout = open(output_file, 'w')
for i in range(topN):
title = "#"+ topicid_content[sorted_result[i][0]] +"# 关注度、传播度、活跃度"
print>>fout, title
attention = len(topicid_attention[sorted_result[i][0]])
diffuse = 0
for user in topicid_diffuse[sorted_result[i][0]]:
diffuse += topicid_diffuse[sorted_result[i][0]][user]
active = topicid_active[sorted_result[i][0]]
print>>fout, "\t%s\t%s\t%s"%(attention, diffuse, active)
fout.close()
if __name__ == "__main__":
main()
visualization
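The visualization step only fills placeholders in pre-built HTML chart templates via string.Template. A minimal illustration of that mechanism follows; the template text and values here are invented, not the real horizontal_bar.tpl:

#!/usr/bin/python
from string import Template

tpl = Template("title: ${title}\ncategories: ${categories}\nseries: ${series}")
replaceDict = {'title': "'Topic distribution'",
               'categories': "['t1','t2']",
               'series': "[{ name: 'count', data: [12,7]}]"}
# safe_substitute leaves unknown ${...} placeholders untouched instead of raising
print tpl.safe_substitute(replaceDict)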