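Python: Crawling Hot Words and Analyzing Them by Category - [Hot-Word Classification + Directory Generation]
Date: 2020.02.04
Blog issue: 143
Tuesday
[If you want to use the code in this post, please leave a comment below first (just let me know).]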
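All related posts in this series:
a. [Basic Preparation]
b. [Word Cloud + Data Import]
c. [Topology Data]
d. [Data Repair]
e. [Explanation Repair + Hot-Word References]
f. [JSP Demo + Page Navigation]
g. [Hot-Word Classification + Directory Generation] (this post)
h. [Hot-Word Relationship Graph + Report Generation]
i. [App Development]
j. [Security Hardening]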
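As the figure below shows, the requirements I have already finished are highlighted in yellow. That leaves four: hot-word classification, directory generation, the hot-word relationship graph, and data report export. These are the most urgent ones, so it's time to roll up my sleeves and get to work!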
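1. Hot-Word Classification
My teacher said to follow the categories used by the major platforms, so I'll simply adopt the ones from cnblogs (博客园). (I honestly can't follow how the machine-learning approaches work; I'm nowhere near even the entry level.) As the figure below shows, cnblogs news is split into these categories: Internet (互联网类), IT Industry (IT业界类), Software Development (软件开发类), Open Source (开源类), Computer Hardware (电脑硬件类), Games (游戏类), Startups (创业类), Mobile (手机相关类), Science (科学类), and Other (其他类). I'll label the hot words crawled from each category's news with that category. (Looks like I have to crawl the data all over again.)
Before crawling, a few notes. This should be the last change to the crawler. I also noticed that every category has 100 pages, which... is the same for every category, so I can't guarantee the data is free of error. To cut down the data volume, I'm raising the threshold from "frequency of 15" to "frequency of 20"; otherwise how would I ever finish crawling? My plan is to write this post over today and tomorrow, then write a summary post tomorrow and call this small goal done. I may add a WeChat mini program or an app part at the end; we'll see.
With these 10 news categories, what data do we actually need to crawl?
First, request the start page https://news.cnblogs.com/ with request headers set and collect the links to the 10 categories above. Then work out how the page links inside each category are constructed and start a new crawl from them. Every hot word crawled from a category gets a "hot-word type" label appended, which means modifying the KeyWords class: add a kind attribute and rewrite the __toString() member function. After that, update every place that uses the KeyWords class. (News does not need changes.)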
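A minimal sketch of what the extended record might look like. Only the class name KeyWords and the new kind attribute come from this post; the word and num fields and the tab-separated output are assumptions for illustration. The post's method is called __toString(); the sketch swaps in Python's standard __str__ instead, because a method name with two leading underscores and no trailing ones is name-mangled inside a class and cannot be called from outside under that name.

```python
class KeyWords:
    """Hypothetical shape of the KeyWords record after adding `kind`."""

    def __init__(self, word, num, kind):
        self.word = word  # the hot word itself (assumed field)
        self.num = num    # occurrence count (assumed field)
        self.kind = kind  # category label, e.g. "互联网类" (the new attribute)

    def __str__(self):
        # One tab-separated line per record; the column order is an assumption.
        return "\t".join([self.word, str(self.num), self.kind])


kw = KeyWords("5G", 23, "互联网类")
line = str(kw)
```

Every place that writes a KeyWords record to the output file would then pick up the category automatically through str(kw).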
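How the category pages are constructed:
First, the original news URL: https://news.cnblogs.com/
Next, taking "Internet" as an example: https://news.cnblogs.com/n/c1101
Then the address of page 100: https://news.cnblogs.com/n/c1101?page=100
It's easy to see that a category URL is the original URL plus the href of that category's a tag (here, Internet), with the page number appended as the page query parameter. Combining these pieces gives the crawl links.
While crawling, though, I hit a problem: I couldn't scrape the category links themselves. So I collected them by hand. It's only 10 entries, so no big deal. I was being lazy and wanted the page to do it for me, but apparently cnblogs wants me to be diligent. Hahaha!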
Category links:
Internet (互联网类): https://news.cnblogs.com/n/c1101
IT Industry (IT业界类): https://news.cnblogs.com/n/c1102
Software Development (软件开发类): https://news.cnblogs.com/n/c1103
Open Source (开源类): https://news.cnblogs.com/n/c1109
Computer Hardware (电脑硬件类): https://news.cnblogs.com/n/c1111
Games (游戏类): https://news.cnblogs.com/n/c1110
Startups (创业类): https://news.cnblogs.com/n/c1112
Mobile (手机相关类): https://news.cnblogs.com/n/c1113
Science (科学类): https://news.cnblogs.com/n/c1114
Other (其他类): https://news.cnblogs.com/n/c1199
In the Surapity class, create a dictionary that stores each category name and its link.
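The dictionary and the URL rule described above could be sketched as follows. The paths are copied from the link list; the names BASE_URL, KIND_PATHS, HEADERS, page_url, build_request, and all_page_urls are hypothetical, the User-Agent value is a placeholder, and the sketch keeps everything at module level rather than inside the Surapity class, whose code is not shown here.

```python
from urllib.request import Request

BASE_URL = "https://news.cnblogs.com"

# Category name -> category path, copied from the link list in this post.
KIND_PATHS = {
    "互联网类": "/n/c1101",
    "IT业界类": "/n/c1102",
    "软件开发类": "/n/c1103",
    "开源类": "/n/c1109",
    "电脑硬件类": "/n/c1111",
    "游戏类": "/n/c1110",
    "创业类": "/n/c1112",
    "手机相关类": "/n/c1113",
    "科学类": "/n/c1114",
    "其他类": "/n/c1199",
}

# Placeholder header; the post only says the request carries headers.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def page_url(kind, page=1):
    """Build the URL of one result page of a category."""
    url = BASE_URL + KIND_PATHS[kind]
    return url if page == 1 else f"{url}?page={page}"


def build_request(kind, page=1):
    """Prepare (but do not send) a request object with the headers attached."""
    return Request(page_url(kind, page), headers=HEADERS)


def all_page_urls(pages=100):
    """Yield (kind, url) for every page of every category."""
    for kind in KIND_PATHS:
        for page in range(1, pages + 1):
            yield kind, page_url(kind, page)
```

With 10 categories and 100 pages each, all_page_urls() yields 1,000 list pages, which goes some way toward explaining the crawl time.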
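The crawl took a long time, from 4:51 PM until now, 1:44 AM the next day. The process was bumpy and hard to summarize briefly.
Along the way, several pages caused the crawler to terminate, for example "Apple Watch UI动效解析" in the Other category. Ugh, every attempt hung on that page. A programmer knows no greater pain!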
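One way to keep a single bad page from ending the whole run is to retry it a few times and then skip it, recording the failure for later. This is only a sketch under assumptions: crawl_all and its fetch parameter are hypothetical names, and the demo uses a fake fetch function rather than the crawler from this series. A real fetch function would also need a timeout so that a hanging page raises instead of blocking forever.

```python
def crawl_all(urls, fetch, max_retries=2):
    """Fetch every URL, skipping the ones that keep failing.

    `fetch` is any callable that takes a URL and returns the page content
    or raises an exception.
    """
    results = {}
    failed = []
    for url in urls:
        for _attempt in range(max_retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception as exc:  # keep going whatever the page throws
                last_error = exc
        else:
            # Runs only when no attempt succeeded (the loop never hit break).
            failed.append((url, repr(last_error)))
    return results, failed


def fake_fetch(url):
    """Stand-in for a real downloader: one URL always fails."""
    if "bad" in url:
        raise TimeoutError("page hung")
    return "<html>ok</html>"


pages, skipped = crawl_all(
    ["https://example.com/good", "https://example.com/bad"], fake_fetch
)
```

The skipped list can be written to a file at the end of the run so the problem pages can be inspected or retried by hand.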
The base data totals 17,469 records, and the file is about 1.96 MB.
Now let's build the data table (first modify fileR.py):
- import codecs
- def makeSql():
-     file_path = "../../testFile/frc/words_sql.txt"
-     # Read every tab-separated record from the crawl output.
-     with open("../../testFile/frc/word.txt", mode='r', encoding='utf-8') as fw:
-         tmp = fw.readlines()
-     # Mode "w" truncates the old file; then write one INSERT per record.
-     with codecs.open(file_path, "w", 'utf-8') as f:
-         for line in tmp:
-             group = line.split("\t")
-             group[0] = "'" + group[0] + "'"
-             # Drop the last character of the fourth field, then quote it.
-             group[3] = "'" + group[3][:-1] + "'"
-             f.write("Insert into words values ("+group[0]+","+group[1]+",'"+group[2]+"',"+group[3]+",'"+group[4]+"');"+"\n")
- makeSql()
fileR.py
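Building INSERT statements by string concatenation breaks as soon as a field contains a single quote. The sketch below shows one hedged way to quote values first. The helper names sql_literal and insert_statement are hypothetical, the sample record is made up, and the escaping assumes MySQL in its default SQL mode, where backslash is also an escape character. Treating only the second column as numeric mirrors the unquoted group[1] in fileR.py.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal (MySQL default mode assumed)."""
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"


def insert_statement(table, values, numeric_columns=()):
    """Build one INSERT statement; columns listed as numeric are not quoted."""
    parts = []
    for index, value in enumerate(values):
        if index in numeric_columns:
            parts.append(str(int(value)))  # int() rejects non-numeric input
        else:
            parts.append(sql_literal(value))
    return f"INSERT INTO {table} VALUES ({', '.join(parts)});"


# A made-up record in the same tab-separated shape that fileR.py reads.
record = "O'Reilly\t21\thttps://example.com/a\tsome title\t互联网类\n"
fields = record.rstrip("\n").split("\t")
statement = insert_statement("words", fields, numeric_columns={1})
```

If the data is loaded through a database driver rather than a text import, parameterized queries would make the manual escaping unnecessary.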
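Run it and import the data the same way as before. In my case, I had used the PC Manager utility (电脑管家) to clean up the C: drive and Navicat broke afterwards, really broke (it can no longer create queries; if I find a fix I'll write another post about it). So, no tricks: I imported directly from the text file!
Create the keywords table (or view) the same way as in the post before last, to get the total count for each hot word: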
- CREATE TABLE keywords
- AS
- (
-     SELECT
-         word AS word,
-         SUM(num) AS num
-     FROM
-         words
-     GROUP BY word
-     ORDER BY num DESC
- )
CreateKeywordsTable.sql
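For readers who think more easily in Python than in SQL, the GROUP BY / SUM / ORDER BY above does roughly the following. The function name aggregate and the sample rows are made up for illustration.

```python
from collections import Counter


def aggregate(rows):
    """Sum `num` per word and return (word, total) pairs, largest first."""
    totals = Counter()
    for word, num in rows:
        totals[word] += num
    return totals.most_common()


sample = [("5G", 3), ("AI", 5), ("5G", 4)]
ranked = aggregate(sample)
```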
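Hahaha! The hot-word count has passed ten thousand! I hope my computer can hold up; keep crawling! (But it's already 2 AM, so I'm setting an alarm for 2 hours from now and letting the topology data crawl by itself.)
About the WebConnector class, I want to point out that for this crawl I commented out the following code: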
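- # This call removed every sentence containing the characters 年, 月, or 日 (year, month, day), along with everything after it. The goal back then was to strip unneeded explanation text, but it now looks unnecessary. The more data the better!
- tpl = StrSpecialDealer.ut_date(tpl)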
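The real StrSpecialDealer.ut_date is not shown in this post, so the following is only a rough guess at what such a date cutoff does. The function name cut_from_date, the sentence separator "。", and the regex (a digit followed by 年, 月, or 日) are all assumptions.

```python
import re

# Assumed pattern: a number directly followed by 年, 月, or 日.
_DATE = re.compile(r"\d+\s*[年月日]")


def cut_from_date(text, sep="。"):
    """Keep sentences up to, but not including, the first one with a date."""
    kept = []
    for sentence in text.split(sep):
        if _DATE.search(sentence):
            break
        kept.append(sentence)
    return sep.join(kept)


shortened = cut_from_date("苹果公司是一家科技公司。1976年4月1日成立。总部位于加州。")
```

Skipping this step keeps the full explanation text, which is the behavior chosen for this crawl.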
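I woke up in the morning to a big problem: the computer had put itself to sleep. Sigh. Lesson learned, I hope!
When the computer is crawling overnight, make sure to turn sleep mode off. In the settings it looks like this:
The topology data is done too, after roughly another 5 hours. The catch is that I can't use the computer for anything else while it crawls (especially screenshot software; run that and the crawler is guaranteed to crash).
At last the data is complete, so let's start processing it!
The data has been aggregated and processed per category (meaning there is nothing left for Python to do), and with that, hot-word classification is finished.
2. Hot-Word Directory Generation
We need to show the top 10 entries of each category and build the first page from them.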
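To make the requirement concrete, here is a small Python sketch of "top 10 per category". In the project itself this is done in SQL, not Python; the function name top_words_by_kind and the sample rows are invented for illustration.

```python
from collections import Counter, defaultdict


def top_words_by_kind(rows, limit=10):
    """Group (word, num, kind) rows by kind and keep the `limit` largest totals."""
    per_kind = defaultdict(Counter)
    for word, num, kind in rows:
        per_kind[kind][word] += num
    return {kind: counts.most_common(limit) for kind, counts in per_kind.items()}


rows = [
    ("5G", 3, "互联网类"),
    ("AI", 5, "科学类"),
    ("5G", 4, "互联网类"),
    ("云", 1, "互联网类"),
]
directory = top_words_by_kind(rows, limit=1)
```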
You can either create new views or write one long SQL statement directly. I'm lazy, so I went with the long statement:
- package com.servlet;
- import java.io.IOException;
- import java.sql.SQLException;
- import java.util.List;
- import javax.servlet.ServletException;
- import javax.servlet.ServletOutputStream;
- import javax.servlet.annotation.WebServlet;
- import javax.servlet.http.HttpServlet;
- import javax.servlet.http.HttpServletRequest;
- import javax.servlet.http.HttpServletResponse;
- import org.json.JSONArray;
- import org.json.JSONObject;
- import com.dblink.basic.utils.SqlUtils;
- import com.dblink.basic.utils.sqlKind.MySql_s;
- import com.dblink.basic.utils.user.UserInfo;
- import com.dblink.bean.BeanGroup;
- import com.dblink.sql.DBLink;
- @SuppressWarnings("unused")
- public class ServletForMoreInfo extends HttpServlet{
- /**
- *
- */
- private static final long serialVersionUID = 1L;
- //----------------------------------------------------------------------//
- public void doPost(HttpServletRequest request,HttpServletResponse response) throws ServletException, IOException
- {
- request.setCharacterEncoding("utf-8");
- response.setCharacterEncoding("utf-8");
- response.setContentType("application/json");
- response.setHeader("Cache-Control", "no-cache");
- response.setHeader("Pragma", "no-cache");
- String kind = request.getParameter("kind");
- JSONArray jsonArray = new JSONArray();
- JSONObject jsonObj = new JSONObject();
- DBLink dbLink = new DBLink(new SqlUtils(new MySql_s("rc"),new UserInfo("root","123456")));
- BeanGroup bg = null;
- try {
- bg = dbLink.getSelect("Select word As word , SUM(num) As num From ( Select * From words Where kind = '"+kind+"' ) As t Group By word Order By num DESC Limit 0,10 ").beans;
- int leng = bg.size();
- jsonObj.put("Length",leng);
- // wordkind.js reads KindNess to find the target <div> for this category.
- jsonObj.put("KindNess",kind);
- jsonArray.put(jsonObj);
- for(int i=0;i<leng;++i)
- {
- JSONObject jsonObject = new JSONObject();
- jsonObject.put("word",bg.get(i).get(0));
- jsonObject.put("num",bg.get(i).get(1));
- jsonArray.put(jsonObject);
- }
- } catch (SQLException e) {
- // Do Nothing ...
- }
- dbLink.free();
- ServletOutputStream os = response.getOutputStream();
- os.write(jsonArray.toString().getBytes("utf-8"));
- os.flush();
- os.close();
- }
- //---------------------------------------------------------------------------------//
- }
ServletForMoreInfo.java
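If you have created views for the 10 categories, you can add the following servlet. (Otherwise, replace the view name with the SELECT statement that defines the view.)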
- package com.servlet;
- import java.io.IOException;
- import java.sql.SQLException;
- import java.util.List;
- import javax.servlet.ServletException;
- import javax.servlet.ServletOutputStream;
- import javax.servlet.annotation.WebServlet;
- import javax.servlet.http.HttpServlet;
- import javax.servlet.http.HttpServletRequest;
- import javax.servlet.http.HttpServletResponse;
- import org.json.JSONArray;
- import org.json.JSONObject;
- import com.dblink.basic.utils.SqlUtils;
- import com.dblink.basic.utils.sqlKind.MySql_s;
- import com.dblink.basic.utils.user.UserInfo;
- import com.dblink.bean.BeanGroup;
- import com.dblink.sql.DBLink;
- @SuppressWarnings("unused")
- public class ServletForKindKeyWords extends HttpServlet{
- /**
- *
- */
- private static final long serialVersionUID = 1L;
- //----------------------------------------------------------------------//
- public void doPost(HttpServletRequest request,HttpServletResponse response) throws ServletException, IOException
- {
- request.setCharacterEncoding("utf-8");
- response.setCharacterEncoding("utf-8");
- response.setContentType("application/json");
- response.setHeader("Cache-Control", "no-cache");
- response.setHeader("Pragma", "no-cache");
- String table = request.getParameter("table");
- String sql_rest = request.getParameter("sql");
- JSONArray jsonArray = new JSONArray();
- JSONObject jsonObj = new JSONObject();
- DBLink dbLink = new DBLink(new SqlUtils(new MySql_s("rc"),new UserInfo("root","123456")));
- BeanGroup bg = null;
- try {
- bg = dbLink.getSelect("Select * From "+table+" "+sql_rest).beans;
- int leng = bg.size();
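- int maxSize = dbLink.getSelect("Select * From "+table+" ").beans.size();
- // Ceiling division by the page size (30); also safe when the current page is empty.
- int page = (maxSize + 29) / 30;
- jsonObj.put("Length",leng);
- jsonObj.put("MaxSize",maxSize);
- jsonObj.put("Page",page);
- // wordkind.js reads KindNess back when it rebuilds the paging buttons.
- jsonObj.put("KindNess",table);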
- jsonArray.put(jsonObj);
- for(int i=0;i<leng;++i)
- {
- JSONObject jsonObject = new JSONObject();
- jsonObject.put("word",bg.get(i).get(0));
- jsonObject.put("num",bg.get(i).get(1));
- jsonObject.put("exp",bg.get(i).get(2));
- jsonArray.put(jsonObject);
- }
- } catch (SQLException e) {
- // Do Nothing ...
- }
- dbLink.free();
- ServletOutputStream os = response.getOutputStream();
- os.write(jsonArray.toString().getBytes("utf-8"));
- os.flush();
- os.close();
- }
- //---------------------------------------------------------------------------------//
- }
ServletForKindKeyWords.java
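The paging arithmetic shared by the servlet and the front end is easy to get wrong, so here is a small sketch of it in Python. The page size of 30 and the window of 4 pages on either side of the current page come from the code in this post; the function names page_count, limit_clause, and page_window are hypothetical.

```python
PAGE_SIZE = 30


def page_count(total_rows, page_size=PAGE_SIZE):
    """Number of pages needed for total_rows (ceiling division)."""
    return (total_rows + page_size - 1) // page_size


def limit_clause(page, page_size=PAGE_SIZE):
    """LIMIT clause for a 1-based page number."""
    return f"LIMIT {(page - 1) * page_size},{page_size}"


def page_window(current, last, radius=4):
    """First and last page buttons to show around the current page."""
    return max(1, current - radius), min(last, current + radius)
```

Note that the page count depends only on the total row count and the page size, never on how many rows the current page happens to contain.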
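Next, the JS part:
First display the categories, then load the data into each one with nested requests:
Clicking "获取本类更多热词..." (get more hot words in this category) jumps to that category's page!
Like this:
The new JS code to add: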
- function makePageToKind()
- {
- var Area = '';
- Area += '<div class="row">';
- Area += ' <div class="col-md-12">';
- Area += ' <h2>热词目录</h2>';
- Area += ' </div>';
- Area += '</div>';
- Area += '<hr />';
- Area += '<br>';
- Area += '<br>';
- Area += '<div id="MessageArea">';
- Area += '</div>';
- document.getElementById("page-inner").innerHTML = Area;
- madeAllKindP();
- }
- function madeAllKindP()
- {
- var Area = '';
- Area += '<div>';
- Area += ' <ul>';
- Area += ' <li>';
- Area += ' <b>互联网类</b>';
- Area += ' <div id="hlw"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>IT业界类</b>';
- Area += ' <div id="ityj"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>软件开发类</b>';
- Area += ' <div id="rjkf"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>开源类</b>';
- Area += ' <div id="ky"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>电脑硬件类</b>';
- Area += ' <div id="dnyj"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>游戏类</b>';
- Area += ' <div id="yx"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>创业类</b>';
- Area += ' <div id="cy"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>手机相关类</b>';
- Area += ' <div id="sjxg"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>科学类</b>';
- Area += ' <div id="kx"></div>';
- Area += ' </li>';
- Area += ' <li>';
- Area += ' <b>其他类</b>';
- Area += ' <div id="qt"></div>';
- Area += ' </li>';
- Area += ' </ul>';
- Area += '</div>';
- document.getElementById("MessageArea").innerHTML = Area;
- makeNextStepOfGroupK("互联网类");
- makeNextStepOfGroupK("IT业界类");
- makeNextStepOfGroupK("软件开发类");
- makeNextStepOfGroupK("开源类");
- makeNextStepOfGroupK("电脑硬件类");
- makeNextStepOfGroupK("游戏类");
- makeNextStepOfGroupK("创业类");
- makeNextStepOfGroupK("手机相关类");
- makeNextStepOfGroupK("科学类");
- makeNextStepOfGroupK("其他类");
- }
- function getKindWordsByKindName(word)
- {
- var id_t = "";
- if(word=="互联网类")
- id_t = "hlw";
- else if(word=="IT业界类")
- id_t = "ityj";
- else if(word=="软件开发类")
- id_t = "rjkf";
- else if(word=="开源类")
- id_t = "ky";
- else if(word=="电脑硬件类")
- id_t = "dnyj";
- else if(word=="游戏类")
- id_t = "yx";
- else if(word=="创业类")
- id_t = "cy";
- else if(word=="手机相关类")
- id_t = "sjxg";
- else if(word=="科学类")
- id_t = "kx";
- else if(word=="其他类")
- id_t = "qt";
- return id_t;
- }
- function makeNextStepOfGroupK(word_t)
- {
- var xmlHttp = null;
- try{
- xmlHttp = new XMLHttpRequest();
- } catch (e1) {
- try {
- xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
- } catch (e2) {
- alert("Your browser does not support XMLHTTP!");
- return;
- }
- }
- xmlHttp.onreadystatechange = function() {
- if (xmlHttp.readyState == 4) {
- if (xmlHttp.status == 200)
- {
- var Area = " ";
- var InformationSet = JSON.parse(xmlHttp.responseText);
- var leng = InformationSet[0].Length;
- var kindness = InformationSet[0].KindNess;
- for(var i=1;i<=leng;++i)
- {
- var word_s = InformationSet[i].word;
- var num = InformationSet[i].num;
- Area += " ";
- Area += "<a href='#' title='在本类型中引用次数:"+num+"' onclick='toSomeWhere(\""+word_s+"\")'>"+word_s+"</a>";
- Area += " ";
- }
- Area += " ";
- Area += " ";
- Area += "<a href='#' onclick='makePageToOneKind(\""+kindness+"\")'>获取本类更多热词...</a>";
- Area += " ";
- Area += " ";
- var id_t = getKindWordsByKindName(kindness);
- document.getElementById(id_t).innerHTML = Area;
- }
- }
- };
- var url ="../com/servlet/ServletForMoreInfo";
- var server = "kind="+encodeURIComponent(word_t);
- xmlHttp.open("POST", url, true);
- xmlHttp.setRequestHeader("Content-Type","application/x-www-form-urlencoded");
- xmlHttp.send(server);
- }
- function makePageToOneKind(kind)
- {
- var Area = '';
- Area += '<div class="row">';
- Area += ' <div class="col-md-12">';
- Area += ' <h2>'+kind+'</h2>';
- Area += ' </div>';
- Area += '</div>';
- Area += '<hr />';
- Area += '<br>';
- Area += '<div style="background:rgb(0,153,255);margin-left:20px;margin-right:20px;height:25px;">';
- Area += ' <div style="margin-left:10px;margin-right:10px;margin-top:5px;margin-bottom:5px;">';
- Area += ' <b style="float:left;">热词表</b>';
- Area += ' <div style="float:right;">';
- Area += ' <select id="sty" onchange="simpleReset_Kind(\''+kind+'\')">';
- Area += ' <option value="0" selected>按照词频顺序</option>';
- Area += ' <option value="1">按照字母表顺序</option>';
- Area += ' </select>';
- Area += ' ';
- Area += ' <select id="order" onchange="simpleReset_Kind(\''+kind+'\')">';
- Area += ' <option value="0" selected>降序</option>';
- Area += ' <option value="1">增序</option>';
- Area += ' </select>';
- Area += ' ';
- Area += ' </div>';
- Area += ' </div>';
- Area += '</div>';
- Area += '<br>';
- Area += '<br>';
- Area += '<div id="MessageArea">';
- Area += '</div>';
- document.getElementById("page-inner").innerHTML = Area;
- simpleReset_Kind(kind);
- }
- function simpleReset_Kind(kind)
- {
- wordPage = 1;
- resetAndFresh_Kind(kind);
- }
- function XReset_Kind(p,kind)
- {
- wordPage = p;
- wordPage = parseInt(""+wordPage);
- resetAndFresh_Kind(kind);
- }
- function makeSurePage_Kind(kind)
- {
- wordPage = document.getElementById("selPage").value;
- wordPage = parseInt(""+wordPage);
- resetAndFresh_Kind(kind);
- }
- function resetAndFresh_Kind(kind)
- {
- var sty = document.getElementById("sty").value;
- var order = document.getElementById("order").value;
- var xmlHttp = null;
- try{
- xmlHttp = new XMLHttpRequest();
- } catch (e1) {
- try {
- xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
- } catch (e2) {
- alert("Your browser does not support XMLHTTP!");
- return;
- }
- }
- xmlHttp.onreadystatechange = function() {
- if (xmlHttp.readyState == 4) {
- if (xmlHttp.status == 200)
- {
- var Area = "";
- var InformationSet = JSON.parse(xmlHttp.responseText);
- var leng = InformationSet[0].Length;
- var max = InformationSet[0].MaxSize;
- var pageNum = InformationSet[0].Page;
- var kind = InformationSet[0].KindNess;
- Area += "<table class='WhatATable' style='margin-left:200px;float:left;'>";
- Area += "<tr>";
- Area += "<th style='width:100px;'>热词</th>";
- Area += "<th style='width:100px;'>词频</th>";
- Area += "<th style='width:100px;'>详细信息链接</th>";
- Area += "</tr>";
- if(leng<10)
- {
- for (var i=1;i<=leng;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- }
- else
- {
- for (var i=1;i<=10;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- }
- Area += "</table>";
- if(leng>10)
- {
- Area += "<table class='WhatATable' style='margin-left:10px;float:left;'>";
- Area += "<tr>";
- Area += "<th style='width:100px;'>热词</th>";
- Area += "<th style='width:100px;'>词频</th>";
- Area += "<th style='width:100px;'>详细信息链接</th>";
- Area += "</tr>";
- if(leng<=20)
- {
- for (var i=11;i<=leng;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- }
- else
- {
- for (var i=11;i<=20;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- }
- Area += "</table>";
- }
- if(leng>20)
- {
- Area += "<table class='WhatATable' style='margin-left:10px;float:left;'>";
- Area += "<tr>";
- Area += "<th style='width:100px;'>热词</th>";
- Area += "<th style='width:100px;'>词频</th>";
- Area += "<th style='width:100px;'>详细信息链接</th>";
- Area += "</tr>";
- for (var i=21;i<=leng;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- Area += "</table>";
- }
- Area += "<div style='clear:both;'></div>";
- Area += "<br>";
- Area += "<br>";
- Area += "<br>";
- Area += "<br>";
- Area += "<p style='margin-left:30px;margin-right:30px;'>";
- Area += " <button onclick='simpleReset_Kind(\""+kind+"\")'>起始页</button> ";
- var start = ((wordPage-4)>=1)?wordPage-4:1;
- var end = ((wordPage+4)<=pageNum)?(wordPage+4):pageNum;
- //alert(parseInt(wordPage+4+""));
- if(start!=1)
- {
- Area += " ... ";
- }
- for(var i=start;i<=end;++i)
- {
- Area += " <button onclick='XReset_Kind(\""+i+"\",\""+kind+"\")'>"+i+"</button> ";
- }
- if(end!=pageNum)
- {
- Area += " ... ";
- }
- Area += " <button onclick='XReset_Kind("+pageNum+",\""+kind+"\")'>结束页</button> ";
- Area += " <b>选择页数跳转</b> ";
- Area += "<select id='selPage' onchange='makeSurePage_Kind(\""+kind+"\")'>";
- for(var i=1;i<=pageNum;++i)
- {
- Area += "<option value='"+i+"'>"+i+"</option>";
- }
- Area += "</select>";
- Area += "</p>";
- document.getElementById("MessageArea").innerHTML = Area;
- surePage_Kind();
- }
- }
- };
- var url ="../com/servlet/ServletForKindKeyWords";
- var server = "sql=";
- // Sort by word frequency
- if(sty==0)
- {
- server += " order by num ";
- }
- // Sort alphabetically
- else if(sty==1)
- {
- server += " order by word ";
- }
- // Descending order (ascending is the SQL default)
- if(order==0)
- {
- server += " DESC ";
- }
- server += (" Limit "+((wordPage-1)*30)+",30 ");
- server += "&table="+encodeURIComponent(kind);
- xmlHttp.open("POST", url, true);
- xmlHttp.setRequestHeader("Content-Type","application/x-www-form-urlencoded");
- xmlHttp.send(server);
- }
- function surePage_Kind(kind)
- {
- document.getElementById("selPage").selectedIndex = wordPage-1;
- }
wordkind.js
- var wordPage = 1;
- function makePageToWord()
- {
- var Area = '';
- Area += '<div class="row">';
- Area += '<div class="col-md-12">';
- Area += '<h2>全部热词</h2>';
- Area += '</div>';
- Area += '</div>';
- Area += '<hr />';
- Area += '<br>';
- Area += '<div style="background:rgb(0,153,255);margin-left:20px;margin-right:20px;height:25px;">';
- Area += ' <div style="margin-left:10px;margin-right:10px;margin-top:5px;margin-bottom:5px;">';
- Area += ' <b style="float:left;">热词表</b>';
- Area += ' <div style="float:right;">';
- Area += ' <select id="sty" onchange="simpleReset()">';
- Area += ' <option value="0" selected>按照词频顺序</option>';
- Area += ' <option value="1">按照字母表顺序</option>';
- Area += ' </select>';
- Area += ' ';
- Area += ' <select id="order" onchange="simpleReset()">';
- Area += ' <option value="0" selected>降序</option>';
- Area += ' <option value="1">增序</option>';
- Area += ' </select>';
- Area += ' ';
- Area += ' </div>';
- Area += ' </div>';
- Area += '</div>';
- Area += '<br>';
- Area += '<br>';
- Area += '<div id="MessageArea">';
- Area += '</div>';
- document.getElementById("page-inner").innerHTML = Area;
- simpleReset();
- }
- function simpleReset()
- {
- wordPage = 1;
- resetAndFresh();
- }
- function XReset(p)
- {
- wordPage = p;
- wordPage = parseInt(""+wordPage);
- resetAndFresh();
- }
- function resetAndFresh()
- {
- var sty = document.getElementById("sty").value;
- var order = document.getElementById("order").value;
- var xmlHttp = null;
- try{
- xmlHttp = new XMLHttpRequest();
- } catch (e1) {
- try {
- xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
- } catch (e2) {
- alert("Your browser does not support XMLHTTP!");
- return;
- }
- }
- xmlHttp.onreadystatechange = function() {
- if (xmlHttp.readyState == 4) {
- if (xmlHttp.status == 200)
- {
- var Area = "";
- var InformationSet = JSON.parse(xmlHttp.responseText);
- var leng = InformationSet[0].Length;
- var max = InformationSet[0].MaxSize;
- var pageNum = InformationSet[0].Page;
- Area += "<table class='WhatATable' style='margin-left:200px;float:left;'>";
- Area += "<tr>";
- Area += "<th style='width:100px;'>热词</th>";
- Area += "<th style='width:100px;'>词频</th>";
- Area += "<th style='width:100px;'>详细信息链接</th>";
- Area += "</tr>";
- if(leng<10)
- {
- for (var i=1;i<=leng;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- }
- else
- {
- for (var i=1;i<=10;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- }
- Area += "</table>";
- if(leng>10)
- {
- Area += "<table class='WhatATable' style='margin-left:10px;float:left;'>";
- Area += "<tr>";
- Area += "<th style='width:100px;'>热词</th>";
- Area += "<th style='width:100px;'>词频</th>";
- Area += "<th style='width:100px;'>详细信息链接</th>";
- Area += "</tr>";
- if(leng<=20)
- {
- for (var i=11;i<=leng;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- }
- else
- {
- for (var i=11;i<=20;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- }
- Area += "</table>";
- }
- if(leng>20)
- {
- Area += "<table class='WhatATable' style='margin-left:10px;float:left;'>";
- Area += "<tr>";
- Area += "<th style='width:100px;'>热词</th>";
- Area += "<th style='width:100px;'>词频</th>";
- Area += "<th style='width:100px;'>详细信息链接</th>";
- Area += "</tr>";
- for (var i=21;i<=leng;++i)
- {
- Area += "<tr>";
- Area += " <td>";
- Area += InformationSet[i].word;
- Area += " </td>";
- Area += " <td>";
- Area += InformationSet[i].num;
- Area += " </td>";
- Area += " <td>";
- Area += " <a href='#' onclick='toSomeWhere(\""+InformationSet[i].word+"\")'>详细信息</a>";
- Area += " </td>";
- Area += "</tr>";
- }
- Area += "</table>";
- }
- Area += "<div style='clear:both;'></div>";
- Area += "<br>";
- Area += "<br>";
- Area += "<br>";
- Area += "<br>";
- Area += "<p style='margin-left:30px;margin-right:30px;'>";
- Area += " <button onclick='simpleReset()'>起始页</button> ";
- var start = ((wordPage-4)>=1)?wordPage-4:1;
- var end = ((wordPage+4)<=pageNum)?(wordPage+4):pageNum;
- //alert(parseInt(wordPage+4+""));
- if(start!=1)
- {
- Area += " ... ";
- }
- for(var i=start;i<=end;++i)
- {
- Area += " <button onclick='XReset("+i+")'>"+i+"</button> ";
- }
- if(end!=pageNum)
- {
- Area += " ... ";
- }
- Area += " <button onclick='XReset("+pageNum+")'>结束页</button> ";
- Area += " <b>选择页数跳转</b> ";
- Area += "<select id='selPage' onchange='makeSurePage()'>";
- for(var i=1;i<=pageNum;++i)
- {
- Area += "<option value='"+i+"'>"+i+"</option>";
- }
- Area += "</select>";
- Area += "</p>";
- document.getElementById("MessageArea").innerHTML = Area;
- surePage();
- }
- }
- };
- var url ="../com/servlet/ServletForAllKeyWords";
- var server = "sql=";
- // Sort by word frequency
- if(sty==0)
- {
- server += " order by num ";
- }
- // Sort alphabetically
- else if(sty==1)
- {
- server += " order by word ";
- }
- // Descending order (ascending is the SQL default)
- if(order==0)
- {
- server += " DESC ";
- }
- server += (" Limit "+((wordPage-1)*30)+",30 ");
- xmlHttp.open("POST", url, true);
- xmlHttp.setRequestHeader("Content-Type","application/x-www-form-urlencoded");
- xmlHttp.send(server);
- }
- function toSomeWhere(word)
- {
- var Area = '';
- Area += '<div class="row">';
- Area += ' <div class="col-md-12">';
- Area += ' <h2>'+word+'</h2>';
- Area += ' </div>';
- Area += '</div>';
- Area += '<hr />';
- Area += '<br>';
- Area += '<div id="MessageArea">';
- Area += '</div>';
- document.getElementById("page-inner").innerHTML = Area;
- var xmlHttp = null;
- try{
- xmlHttp = new XMLHttpRequest();
- } catch (e1) {
- try {
- xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
- } catch (e2) {
- alert("Your browser does not support XMLHTTP!");
- return;
- }
- }
- xmlHttp.onreadystatechange = function() {
- if (xmlHttp.readyState == 4) {
- if (xmlHttp.status == 200)
- {
- var Area = "";
- var InformationSet = JSON.parse(xmlHttp.responseText);
- var word = InformationSet[1].word;
- var num = InformationSet[1].num;
- var exp = InformationSet[1].exp;
- Area += "<p><b id='word' style='font-size:120%;'>"+word+"</b></p>";
- Area += "<p style='color:rgb(200,200,200);'> 引用次数:"+num+"</p>"
- Area += "<p style='font-family:\"楷体\";font-size:90%;'> ";
- if(exp=="")
- {
- Area += "目前百度百科上并没有相关解释信息...";
- }
- else
- {
- Area += exp;
- }
- Area += "</p>";
- Area += "<br>";
- Area += "<div id='finalDIV'></div>"
- document.getElementById("MessageArea").innerHTML = Area;
- getLinksForKey(word);
- }
- }
- };
- var url ="../com/servlet/ServletForAllKeyWords";
- var server = "sql="+encodeURIComponent(" where word='"+word+"'");
- xmlHttp.open("POST", url, true);
- xmlHttp.setRequestHeader("Content-Type","application/x-www-form-urlencoded");
- xmlHttp.send(server);
- }
- function getLinksForKey(word)
- {
- var xmlHttp = null;
- try{
- xmlHttp = new XMLHttpRequest();
- } catch (e1) {
- try {
- xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
- } catch (e2) {
- alert("Your browser does not support XMLHTTP!");
- return;
- }
- }
- xmlHttp.onreadystatechange = function() {
- if (xmlHttp.readyState == 4) {
- if (xmlHttp.status == 200)
- {
- var Area = "";
- Area += "<br>";
- Area += "<br>";
- Area += "<b style='font-size:120%;'>引用网页:</b>";
- Area += "<br>";
- Area += "<br>";
- Area += "<ul>";
- var InformationSet = JSON.parse(xmlHttp.responseText);
- var leng = InformationSet[0].Length;
- for(var i=1;i<=leng;++i)
- {
- var word = InformationSet[i].word;
- var num = InformationSet[i].num;
- var title = InformationSet[i].title;
- var link = InformationSet[i].link;
- Area += "<li>";
- Area += "<a href='"+link+"' title='引用次数:"+num+"'>"+title+"</a>"
- Area += "</li>";
- }
- Area += "</ul>";
- document.getElementById("finalDIV").innerHTML = Area;
- }
- }
- };
- var url ="../com/servlet/ServletForLinkData";
- var server = "word="+encodeURIComponent(word);
- xmlHttp.open("POST", url, true);
- xmlHttp.setRequestHeader("Content-Type","application/x-www-form-urlencoded");
- xmlHttp.send(server);
- }
- function surePage()
- {
- document.getElementById("selPage").selectedIndex = wordPage-1;
- }
- function makeSurePage()
- {
- wordPage = document.getElementById("selPage").value;
- wordPage = parseInt(""+wordPage);
- resetAndFresh();
- }
word.js
Update the servlet references in web.xml:
- <?xml version="1.0" encoding="UTF-8"?>
- <web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://xmlns.jcp.org/xml/ns/javaee" xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd" id="WebApp_ID" version="4.0">
- <display-name>HotWord</display-name>
- <servlet>
- <description>This is the description of my J2EE component</description>
- <display-name>This is the display name of my J2EE component</display-name>
- <servlet-name>ServletForWords</servlet-name>
- <servlet-class>com.servlet.ServletForWords</servlet-class>
- </servlet>
- <servlet-mapping>
- <servlet-name>ServletForWords</servlet-name>
- <url-pattern>/com/servlet/ServletForWords</url-pattern>
- </servlet-mapping>
- <servlet>
- <description>This is the description of my J2EE component</description>
- <display-name>This is the display name of my J2EE component</display-name>
- <servlet-name>ServletForAllKeyWords</servlet-name>
- <servlet-class>com.servlet.ServletForAllKeyWords</servlet-class>
- </servlet>
- <servlet-mapping>
- <servlet-name>ServletForAllKeyWords</servlet-name>
- <url-pattern>/com/servlet/ServletForAllKeyWords</url-pattern>
- </servlet-mapping>
- <servlet>
- <description>This is the description of my J2EE component</description>
- <display-name>This is the display name of my J2EE component</display-name>
- <servlet-name>ServletForLinkData</servlet-name>
- <servlet-class>com.servlet.ServletForLinkData</servlet-class>
- </servlet>
- <servlet-mapping>
- <servlet-name>ServletForLinkData</servlet-name>
- <url-pattern>/com/servlet/ServletForLinkData</url-pattern>
- </servlet-mapping>
- <servlet>
- <description>This is the description of my J2EE component</description>
- <display-name>This is the display name of my J2EE component</display-name>
- <servlet-name>ServletForMoreInfo</servlet-name>
- <servlet-class>com.servlet.ServletForMoreInfo</servlet-class>
- </servlet>
- <servlet-mapping>
- <servlet-name>ServletForMoreInfo</servlet-name>
- <url-pattern>/com/servlet/ServletForMoreInfo</url-pattern>
- </servlet-mapping>
- <servlet>
- <description>This is the description of my J2EE component</description>
- <display-name>This is the display name of my J2EE component</display-name>
- <servlet-name>ServletForKindKeyWords</servlet-name>
- <servlet-class>com.servlet.ServletForKindKeyWords</servlet-class>
- </servlet>
- <servlet-mapping>
- <servlet-name>ServletForKindKeyWords</servlet-name>
- <url-pattern>/com/servlet/ServletForKindKeyWords</url-pattern>
- </servlet-mapping>
- <welcome-file-list>
- <welcome-file>index.html</welcome-file>
- <welcome-file>index.htm</welcome-file>
- <welcome-file>index.jsp</welcome-file>
- <welcome-file>default.html</welcome-file>
- <welcome-file>default.htm</welcome-file>
- <welcome-file>default.jsp</welcome-file>
- </welcome-file-list>
- </web-app>
web.xml
Update the JSP page code:
- <%@ page language="java" contentType="text/html; charset=utf-8"
- pageEncoding="utf-8"%>
- <!DOCTYPE html>
- <html><!-- xmlns="http://www.w3.org/1999/xhtml" -->
- <head>
- <!--<meta charset="utf-8" />-->
- <meta name="viewport" content="width=device-width, initial-scale=1.0" charset="utf-8"/>
- <title>热词分析</title>
- <!-- BOOTSTRAP STYLES-->
- <link href="../assets/css/bootstrap.css" rel="stylesheet" />
- <!-- FONTAWESOME STYLES-->
- <link href="../assets/css/font-awesome.css" rel="stylesheet" />
- <!-- CUSTOM STYLES-->
- <link href="../assets/css/custom.css" rel="stylesheet" />
- <!-- PERSONAL FONTS-->
- <link href='../cssFiles/basic.css' rel='stylesheet' type='text/css' />
- <!-- GOOGLE FONTS-->
- <link href='http://fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css' />
- </head>
- <script src="../jsFiles/jquery/jquery-3.4.1.min.js" charset="utf-8"></script>
- <script src="../jsFiles/echarts/echarts.min.js" charset="utf-8"></script>
- <script src="../jsFiles/echarts/echarts-wordcloud-master/dist/echarts-wordcloud.min.js" charset="utf-8"></script>
- <!-- <script src="../jsFiles/echarts/echarts-wordcloud-master/dist/echarts-wordcloud.min.js" charset="utf-8"></script> -->
- <script src="../jsFiles/basic.js" charset="utf-8"></script>
- <script src='../jsFiles/echarts/echarts.simple.js'></script>
- <script src="../jsFiles/word.js" charset="utf-8"></script>
- <script src="../jsFiles/wordkind.js" charset="utf-8"></script>
- <script src="../jsFiles/cloud.js" charset="utf-8"></script>
- <body>
- <div id="wrapper">
- <div class="navbar navbar-inverse navbar-fixed-top">
- <div class="adjust-nav">
- <div class="navbar-header">
- <button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".sidebar-collapse">
- <span class="icon-bar"></span>
- <span class="icon-bar"></span>
- <span class="icon-bar"></span>
- </button>
- <a class="navbar-brand"><i class="fa fa-square-o "></i> 欢迎您使用本热词分析系统</a>
- </div>
- </div>
- </div>
- <!-- /. NAV TOP -->
- <div class="navbar-default navbar-side"> <!-- nav role="navigation" -->
- <div class="sidebar-collapse">
- <ul class="nav" id="main-menu">
- <li class="text-center user-image-back">
- <img src="../assets/img/find_user.png" class="img-responsive" />
- </li>
- <li>
- <a href="#" onclick="makePageToMain()"><i class="fa fa-table "></i>主页</a>
- </li>
- <li>
- <a href="#" onclick="makePageToWord()"><i class="fa fa-key "></i>全部热词</a>
- </li>
- <li>
- <a href="#" onclick="makePageToKind()"><i class="fa fa-key "></i>热词目录</a>
- </li>
- <li>
- <a href="#"><i class="fa fa-edit "></i>热词需求<span class="fa arrow"></span></a>
- <ul class="nav nav-second-level">
- <li>
- <a href="#" onclick="makePageToCl()">热词云图</a>
- </li>
- <li>
- <a href="#" onclick="makePageToRe()">热词关系图</a>
- </li>
- </ul>
- </li>
- </ul>
- </div>
- </div>
- <!-- /. NAV SIDE -->
- <div id="page-wrapper" >
- <div id="page-inner">
- <div class="row">
- <div class="col-md-12">
- <h2>主页</h2>
- </div>
- </div>
- <!-- /. ROW -->
- <hr />
- <!-- /. ROW -->
- <br>
- <br>
- <div id="MessageArea">
- <br>
- <h3>欢迎您使用本热词分析系统</h3>
- </div>
- </div>
- <!-- /. PAGE INNER -->
- </div>
- <!-- /. PAGE WRAPPER -->
- </div>
- <!-- /. WRAPPER -->
- <!-- SCRIPTS - AT THE BOTTOM TO REDUCE THE LOAD TIME -->
- <!-- JQUERY SCRIPTS -->
- <script src="../assets/js/jquery-1.10.2.js"></script>
- <!-- BOOTSTRAP SCRIPTS -->
- <script src="../assets/js/bootstrap.min.js"></script>
- <!-- METISMENU SCRIPTS -->
- <script src="../assets/js/jquery.metisMenu.js"></script>
- <!-- CUSTOM SCRIPTS -->
- <script src="../assets/js/custom.js"></script>
- </body>
- </html>
index.jsp
As for the remaining parts, I've thought it over and decided to write them up in a separate post!