Coursera课程笔记----P4E.Capstone----Week 6&7
Visualizing Email Data(Week 6&7)
code segment
gword.py
import sqlite3
import time
import zlib
import string
conn = sqlite3.connect('index.sqlite')
cur = conn.cursor()
cur.execute('SELECT id, subject FROM Subjects')
subjects = dict()
for message_row in cur :
subjects[message_row[0]] = message_row[1]
# cur.execute('SELECT id, guid,sender_id,subject_id,headers,body FROM Messages')
cur.execute('SELECT subject_id FROM Messages')
counts = dict()
for message_row in cur :
text = subjects[message_row[0]]
text = text.translate(str.maketrans('','',string.punctuation))
text = text.translate(str.maketrans('','','1234567890'))
text = text.strip()
text = text.lower()
words = text.split()
for word in words:
if len(word) < 4 : continue
counts[word] = counts.get(word,0) + 1
x = sorted(counts, key=counts.get, reverse=True)
highest = None
lowest = None
for k in x[:100]:
if highest is None or highest < counts[k] :
highest = counts[k]
if lowest is None or lowest > counts[k] :
lowest = counts[k]
print('Range of counts:',highest,lowest)
# Spread the font sizes across 20-100 based on the count
bigsize = 80
smallsize = 20
fhand = open('gword.js','w')
fhand.write("gword = [")
first = True
for k in x[:100]:
if not first : fhand.write( ",\n")
first = False
size = counts[k]
size = (size - lowest) / float(highest - lowest)
size = int((size * bigsize) + smallsize)
fhand.write("{text: '"+k+"', size: "+str(size)+"}")
fhand.write( "\n];\n")
fhand.close()
print("Output written to gword.js")
print("Open gword.htm in a browser to see the vizualization")
gline.py
import sqlite3
import time
import zlib
conn = sqlite3.connect('index.sqlite')
cur = conn.cursor()
cur.execute('SELECT id, sender FROM Senders')
senders = dict()
for message_row in cur :
senders[message_row[0]] = message_row[1]
cur.execute('SELECT id, guid,sender_id,subject_id,sent_at FROM Messages')
messages = dict()
for message_row in cur :
messages[message_row[0]] = (message_row[1],message_row[2],message_row[3],message_row[4])
print("Loaded messages=",len(messages),"senders=",len(senders))
sendorgs = dict()
for (message_id, message) in list(messages.items()):
sender = message[1]
pieces = senders[sender].split("@")
if len(pieces) != 2 : continue
dns = pieces[1]
sendorgs[dns] = sendorgs.get(dns,0) + 1
# pick the top schools
orgs = sorted(sendorgs, key=sendorgs.get, reverse=True)
orgs = orgs[:10]
print("Top 10 Organizations")
print(orgs)
counts = dict()
months = list()
# cur.execute('SELECT id, guid,sender_id,subject_id,sent_at FROM Messages')
for (message_id, message) in list(messages.items()):
sender = message[1]
pieces = senders[sender].split("@")
if len(pieces) != 2 : continue
dns = pieces[1]
if dns not in orgs : continue
month = message[3][:7]
if month not in months : months.append(month)
key = (month, dns)
counts[key] = counts.get(key,0) + 1
months.sort()
# print counts
# print months
fhand = open('gline.js','w')
fhand.write("gline = [ ['Month'")
for org in orgs:
fhand.write(",'"+org+"'")
fhand.write("]")
for month in months:
fhand.write(",\n['"+month+"'")
for org in orgs:
key = (month, org)
val = counts.get(key,0)
fhand.write(","+str(val))
fhand.write("]");
fhand.write("\n];\n")
fhand.close()
print("Output written to gline.js")
print("Open gline.htm to visualize the data")
Coursera课程笔记----P4E.Capstone----Week 6&7的更多相关文章
- Coursera课程笔记----P4E.Capstone----Week 4&5
Spidering and Modeling Email Data(week4&5) Mailing List - Gmane Crawl the archive of a mailing l ...
- Coursera课程笔记----P4E.Capstone----Week 2&3
Building a Search Engine(week 2&3) Search Engine Architecture Web Crawling Index Building Search ...
- 操作系统学习笔记----进程/线程模型----Coursera课程笔记
操作系统学习笔记----进程/线程模型----Coursera课程笔记 进程/线程模型 0. 概述 0.1 进程模型 多道程序设计 进程的概念.进程控制块 进程状态及转换.进程队列 进程控制----进 ...
- Coursera课程笔记----C++程序设计----Week3
类和对象(Week 3) 内联成员函数和重载成员函数 内联成员函数 inline + 成员函数 整个函数题出现在类定义内部 class B{ inline void func1(); //方式1 vo ...
- Coursera课程笔记----Write Professional Emails in English----Week 3
Introduction and Announcement Emails (Week 3) Overview of Introduction & Announcement Emails Bas ...
- Coursera课程笔记----Write Professional Emails in English----Week 1
Get to Know Basic Email Writing Structures(Week 1) Introduction to Course Email and Editing Basics S ...
- Coursera课程笔记----C程序设计进阶----Week 5
指针(二) (Week 5) 字符串与指针 指向数组的指针 int a[10]; int *p; p = a; 指向字符串的指针 指向字符串的指针变量 char a[10]; char *p; p = ...
- Coursera课程笔记----Write Professional Emails in English----Week 5
Culture Matters(Week 5) High/Low Context Communication High Context Communication The Middle East, A ...
- Coursera课程笔记----Write Professional Emails in English----Week 4
Request and Apology Emails(Week 4) How to Write Request Emails Write more POLITELY & SINCERELUY ...
随机推荐
- Qt发送一次信号触发两次槽函数的原因
在手动为控件编写槽函数的时候,如果将槽函数名字按如下格式编辑,则不需要再次进行手动关联 void on_pushButton_1_clicked(); void on_radioButton_clic ...
- Gatling脚本编写技巧篇(二)
脚本示例: import io.gatling.core.Predef._ import io.gatling.http.Predef._ import scala.concurrent.durati ...
- ubuntu-18.0.4 samba安装
(1)安装 sudo apt-get -y install samba samba-common (2)创建一个用于分享的samba目录. mkdir /home/myshare (3)给创建的这个目 ...
- linux CVE-2019-13272 本地特权漏洞
漏洞描述 在5.1.17之前的Linux内核中,kernel / ptrace.c中的ptrace_link错误地处理了想要创建ptrace关系的进程的凭据记录,这允许本地用户通过利用父子的某些方案来 ...
- PHP代码审计理解(三)---EMLOG某插件文件写入
此漏洞存在于emlog下的某个插件---友言社会化评论1.3. 我们可以看到, uyan.php 文件在判断权限之前就可以接收uid参数.并且uid未被安全过滤即写入到了$uyan_code中. 我们 ...
- windows 系统使用技巧
1 自定义发送到C:\Users\adm\AppData\Roaming\Microsoft\Windows\SendTo 2 自定义关机shutdown -s -t time,可写在快捷方式中shu ...
- Redis学习与应用-位图
什么是位图 位图bitmap是通过一个bit来表示某个元素对应的值或者状态,是由一组bit位组成,每个bit位对应0和1两个状态,虽然内部还是采用string类型进行存储,但是redis提供了直接操作 ...
- 重大更新!Druid 0.18.0 发布—Join登场,支持Java11
Apache Druid本质就是一个分布式支持实时数据分析的数据存储系统. 能够快速的实现查询与数据分析,高可用,高扩展能力. 距离上一次更新刚过了二十多天,距离0.17版本刚过了三个多月,Druid ...
- 一不小心实现了RPC
前言 随着最近关注 cim 项目的人越发增多,导致提的问题以及 Bug 也在增加,在修复问题的过程中难免代码洁癖又上来了. 看着一两年前写的东西总是怀疑这真的是出自自己手里嘛?有些地方实在忍不住了便开 ...
- ES6新增的 Set 和 WeakSet 是什么玩意?在此揭晓
现在的章节内容会更加的紧密,如果大家看不懂可以先去看以前的文章,当然看了的忘了,也可以去看一下,这样学习后面的内容才会更加容易. 什么是Set结构 Set是ES6给开发者带来的一种新的数据结构,你可以 ...