Learning Python Crawlers (11): Writing My Own "AC Automaton"

0. Preface

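This post records the birth of an "AC automaton"! (Here that means a bot that gets problems Accepted automatically, not the Aho-Corasick algorithm.)

I had seen people write AC automatons in C++, in C#, and even one in Node.js.

Their code felt overly long, and their AC rates weren't great either.

It so happened that I was chatting about this with a junior schoolmate on the way back to the dorm.

I gave the approach some casual thought, decided it was fairly simple, and wrote one on the spot. The results are acceptable.

A screenshot first:

It could probably keep going: with some code changes, or by adding a few other search engines, it could AC more problems,

but I deliberately kept it at around 3000 ACs, and deliberately stayed ranked just behind the "Five Tiger Generals" (五虎上将).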

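1. Crawler approach

The approach is actually very clear:

1. Simulate a login to HDU
2. For a given problem
- Search for AC code
  - Extract the code with a regular expression
  - Process the code with HTMLParser (unescape the HTML entities)
- Submit
  - If it gets AC, go back to step 2 with the next problem
  - Otherwise, keep submitting code (at most 10 submissions here)
  - If it still isn't AC after 10 submissions, give up on this problem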
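The submit-and-retry step above can be sketched on its own. This is a minimal Python 3 sketch, not the script from this post: `try_problem`, `submit` and `is_accepted` are hypothetical names, and the two callables stand in for the real HTTP submission and the status check.

```python
from typing import Callable, Iterable, Optional

MAX_ATTEMPTS = 10  # the cap mentioned in the list above


def try_problem(pid: str,
                candidates: Iterable[str],
                submit: Callable[[str, str], None],
                is_accepted: Callable[[str], bool],
                max_attempts: int = MAX_ATTEMPTS) -> Optional[int]:
    """Submit candidate solutions one by one.

    Returns the 1-based attempt number that got AC, or None if the
    candidates ran out or the attempt cap was reached.
    """
    for attempt, code in enumerate(candidates, start=1):
        if attempt > max_attempts:
            break                      # give up on this problem
        submit(pid, code)
        if is_accepted(pid):
            return attempt             # AC: the caller moves to the next problem
    return None


if __name__ == "__main__":
    # A fake judge (hypothetical) that accepts any code containing "correct".
    accepted = set()

    def fake_submit(pid: str, code: str) -> None:
        if "correct" in code:
            accepted.add(pid)

    print(try_problem("1000", ["wrong", "correct", "unused"],
                      fake_submit, lambda pid: pid in accepted))  # prints 2
```

Passing the network calls in as functions keeps the retry logic testable without touching the judge.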

2. Quick-and-dirty code

# coding: utf-8
# Python 2 script; requires the third-party `requests` package.
import requests, re, HTMLParser, time, getpass

host_url = 'http://acm.hdu.edu.cn'
post_url = 'http://acm.hdu.edu.cn/userloginex.php?action=login'
sub_url = 'http://acm.hdu.edu.cn/submit.php?action=submit'
csdn_url = 'http://so.csdn.net/so/search/s.do'
head = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36' }
html_parser = HTMLParser.HTMLParser()
s = requests.session()  # shared session, so the login cookie is reused when submitting

def login(usr, psw):
    s.get(host_url)  # visit the home page first to obtain a session cookie
    data = {'username': usr, 'userpass': psw, 'login': 'Sign In'}
    s.post(post_url, data=data)

def check_lan(lan):
    # map the class of the blog's code block to an HDU language id
    if 'java' in lan:
        return '5'
    return '0'

def parser_code(code):
    # turn HTML entities (&lt;, &amp;, ...) back into source characters
    return html_parser.unescape(code).encode('utf-8')

def is_ac(pid, usr):
    tmp = requests.get('http://acm.hdu.edu.cn/userstatus.php?user=' + usr).text
    accept = re.search('List of solved problems</font></h3>.*?<p align=left><script language=javascript>(.*?)</script><br></p>', tmp, re.S)
    # accept is None when the solved list is missing from the page
    if accept is not None and pid in accept.group(1):
        print '%s was solved' % pid
        return True
    return False

def search_csdn(PID, usr):
    get_data = {'q': 'HDU ' + PID, 't': 'blog', 'o': '', 's': '', 'l': 'null'}
    search_html = requests.get(csdn_url, params=get_data).text
    linklist = re.findall('<dd class="search-link"><a href="(.*?)" target="_blank">', search_html, re.S)
    for l in linklist:
        print l
        tm_html = requests.get(l, headers=head).text
        title = re.search('<title>(.*?)</title>', tm_html, re.S)
        # skip pages whose title does not mention this HDU problem
        if title is None:
            continue
        title = title.group(1).lower()
        if PID not in title or 'hdu' not in title:
            continue
        tmp = re.search('name="code" class="(.*?)">(.*?)</pre>', tm_html, re.S)
        if tmp is None:
            print 'code not found'
            continue
        LAN = check_lan(tmp.group(1))
        CODE = parser_code(tmp.group(2))
        # only submit text that looks like C/C++ or Java source
        if 'include' not in CODE and 'import java' not in CODE:
            continue
        print PID, LAN
        print '--------------'
        submit_data = {'check': '0', 'problemid': PID, 'language': LAN, 'usercode': CODE}
        s.post(sub_url, headers=head, data=submit_data)
        time.sleep(5)  # give the judge time to finish before checking the result
        if is_ac(PID, usr):
            break

if __name__ == '__main__':
    usr = raw_input('input your username:')
    psw = getpass.getpass('input your password:')
    login(usr, psw)
    for pro_cnt in range(1000, 5680):  # problem ids 1000..5679
        PID = str(pro_cnt)
        if is_ac(PID, usr):
            continue
        search_csdn(PID, usr)

The code isn't long, fewer than 80 lines. Yes, that's all there is to it!
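For readers on Python 3, the extraction step can be tried in isolation. The sketch below reuses the regular expression and the language/content checks from the script above, but `extract_code` and the `sample` page are made up for illustration, and `html.unescape` is the Python 3 standard-library replacement for `HTMLParser.unescape`.

```python
import html
import re
from typing import Optional, Tuple

# Same pattern as in the script: group 1 is the class attribute, group 2 the escaped code.
CODE_RE = re.compile(r'name="code" class="(.*?)">(.*?)</pre>', re.S)


def extract_code(page: str) -> Optional[Tuple[str, str]]:
    """Return (language_id, source) for the first code block, or None."""
    m = CODE_RE.search(page)
    if m is None:
        return None
    lang = '5' if 'java' in m.group(1) else '0'
    source = html.unescape(m.group(2))  # &lt; becomes <, &amp; becomes &, ...
    # Same sanity check as the script: keep only C/C++ or Java looking text.
    if 'include' not in source and 'import java' not in source:
        return None
    return lang, source


# A made-up page fragment, only for demonstration.
sample = ('<pre name="code" class="cpp">#include &lt;cstdio&gt;\n'
          'int main(){ return 0; }</pre>')

if __name__ == "__main__":
    print(extract_code(sample))
```

Returning `None` for both "no code block" and "does not look like source" keeps the caller's loop simple: it just moves on to the next search result.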

3. TODO

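I currently have no plans to polish this post, and I don't recommend studying this thing either. What I do recommend is learning real algorithms, haha!

Here is a real AC automaton (Aho-Corasick) that I wrote myself a long, long time ago: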

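#include <cstdio>
#include <cstring>
#include <algorithm>
#include <queue>
using namespace std;

#define clr( a, b ) memset( a, b, sizeof(a) )

const int SIGMA_SIZE = 26;           // lowercase letters 'a'..'z' only
const int NODE_SIZE = 500000 + 10;   // maximum number of trie nodes

struct ac_automaton{
    int ch[ NODE_SIZE ][ SIGMA_SIZE ];  // trie edges; getfail() fills in the missing transitions
    // f: fail link, val: number of patterns ending at a node,
    // last: nearest proper suffix node at which a pattern ends
    int f[ NODE_SIZE ], val[ NODE_SIZE ], last[ NODE_SIZE ];
    int sz;                             // number of nodes in use

    void init(){
        sz = 1;
        clr( ch[0], 0 ), clr( val, 0 );
    }

    // Add one pattern to the trie.
    void insert( char *s ){
        int u = 0, i = 0;
        for( ; s[i]; ++i ){
            int c = s[i] - 'a';
            if( !ch[u][c] ){
                clr( ch[sz], 0 );
                val[sz] = 0;
                ch[u][c] = sz++;
            }
            u = ch[u][c];
        }
        val[u]++;
    }

    // Build fail and last links with a BFS over the trie.
    void getfail(){
        queue<int> q;
        f[0] = 0;
        for( int c = 0; c < SIGMA_SIZE; ++c ){
            int u = ch[0][c];
            if( u ) f[u] = 0, q.push(u), last[u] = 0;
        }
        while( !q.empty() ){
            int r = q.front(); q.pop();
            for( int c = 0; c < SIGMA_SIZE; ++c ){
                int u = ch[r][c];
                if( !u ){
                    // missing edge: reuse the transition of the fail node
                    ch[r][c] = ch[ f[r] ][c];
                    continue;
                }
                q.push( u );
                int v = f[r];
                while( v && !ch[v][c] ) v = f[v];
                f[u] = ch[v][c];
                last[u] = val[ f[u] ] ? f[u] : last[ f[u] ];
            }
        }
    }

    // Count the inserted patterns that occur in s. Each pattern is counted once,
    // because val[] is cleared as matches are found.
    int work( char* s ){
        int res = 0;
        int u = 0, i = 0, e;
        for( ; s[i]; ++i ){
            int c = s[i] - 'a';
            u = ch[u][c];
            // Start at u if a pattern ends here, otherwise at the nearest suffix
            // that ends one; starting at u alone would miss e.g. "bc" inside "abc"
            // when "abcd" is also a pattern.
            e = val[u] ? u : last[u];
            while( e && val[e] ){
                res += val[e];
                val[e] = 0;
                e = last[e];
            }
        }
        return res;
    }
}ac;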
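For comparison with the crawler's language, here is a compact Python 3 sketch of the same idea. It is not a line-by-line port: the class and method names are my own, it uses dictionaries instead of a fixed 26-letter table, it follows fail links directly instead of keeping a `last` array, and it marks visited nodes instead of clearing the counts, so one automaton can scan several texts.

```python
from collections import deque
from typing import Dict, Iterable, List


class AhoCorasick:
    def __init__(self, patterns: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie edges of each node
        self.fail: List[int] = [0]              # fail link of each node
        self.count: List[int] = [0]             # patterns ending at each node
        for p in patterns:
            self._insert(p)
        self._build()

    def _insert(self, pattern: str) -> None:
        u = 0
        for ch in pattern:
            if ch not in self.goto[u]:
                self.goto.append({})
                self.fail.append(0)
                self.count.append(0)
                self.goto[u][ch] = len(self.goto) - 1
            u = self.goto[u][ch]
        self.count[u] += 1

    def _build(self) -> None:
        q = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while q:
            r = q.popleft()
            for ch, u in self.goto[r].items():
                q.append(u)
                v = self.fail[r]
                while v and ch not in self.goto[v]:
                    v = self.fail[v]
                self.fail[u] = self.goto[v].get(ch, 0)

    def _step(self, u: int, ch: str) -> int:
        while u and ch not in self.goto[u]:
            u = self.fail[u]
        return self.goto[u].get(ch, 0)

    def count_patterns(self, text: str) -> int:
        """Count inserted patterns (with multiplicity) that occur in text at least once."""
        seen = [False] * len(self.goto)
        total = 0
        u = 0
        for ch in text:
            u = self._step(u, ch)
            e = u
            while e and not seen[e]:  # a seen node means its whole fail chain was handled
                seen[e] = True
                total += self.count[e]
                e = self.fail[e]
        return total


if __name__ == "__main__":
    ac = AhoCorasick(["she", "he", "say", "shr", "her"])
    print(ac.count_patterns("yasherhs"))  # prints 3
```

Walking plain fail links visits some nodes that end no pattern, which the `last` array in the C++ version avoids; the `seen` marks keep the total work linear in the text length plus the number of nodes.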
