spider_keeper
1. Introduction
spider_keeper is an open-source spider management tool. It lets you start, pause, and schedule crawls, and in a distributed deployment it can show the logs of every spider and report each spider's execution status.
2. Installation and deployment
Environment:
ubuntu16.04
python3.5
pip3 install scrapy
pip3 install scrapyd
pip3 install scrapyd-client
pip3 install scrapy-redis
pip3 install SpiderKeeper
Deployment:
# Be careful not to overwrite SpiderKeeper.db
rsync -avz spider_keeper crawler_server1:/data --exclude '*.log' --exclude '*.pyc' --exclude '*.db' --exclude 'env'
pip install -r requirements.txt
Running:
Run inside a virtualenv (note: the environment above lists python3.5; point virtualenv at whichever Python 3 interpreter you actually have installed):
virtualenv -p /usr/bin/python3.6 env
source env/bin/activate
python run.py --port=50002 --username=spider --password=sxkjspider2018 --server=http://10.10.4.2:6800 --server=http://10.10.4.11:6800 --server=http://10.10.4.12:6800
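The repeated --server flags imply that run.py accumulates every scrapyd node into a list. A minimal argparse sketch of that pattern (hypothetical; SpiderKeeper's actual run.py may parse options differently):

```python
import argparse

parser = argparse.ArgumentParser(description="SpiderKeeper-style launcher (sketch)")
parser.add_argument("--port", type=int, default=5000)
parser.add_argument("--username")
parser.add_argument("--password")
# action='append' collects one scrapyd URL per --server flag
parser.add_argument("--server", action="append", default=[])

args = parser.parse_args([
    "--port=50002",
    "--username=spider",
    "--server=http://10.10.4.2:6800",
    "--server=http://10.10.4.11:6800",
])
```

With `action='append'`, each `--server=...` occurrence adds one entry, so args.server ends up holding the whole scrapyd cluster.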
3. The four database tables and what they are for
Inspecting the schema with the sqlite3 shell:
List all tables:
.tables
Column headers are off by default; turn them on first with:
.headers on
To see the structure of one specific table, for example an emperors table:
select * from sqlite_master where type="table" and name="emperors";
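The same inspection can be scripted with Python's stdlib sqlite3 module, reading sqlite_master directly (a sketch; point db_path at your SpiderKeeper.db):

```python
import sqlite3

def list_tables(db_path):
    """Return (table_name, CREATE statement) pairs for every table
    in a SQLite database, read straight from sqlite_master."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
```

Calling `list_tables('SpiderKeeper.db')` should surface the four sk_* tables described below.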
sk_job_execution: the job-execution table; every job execution (running, cancelled, pending) is recorded here.
table|sk_job_execution|sk_job_execution|10|CREATE TABLE sk_job_execution (
id INTEGER NOT NULL,
date_created DATETIME,
date_modified DATETIME,
project_id INTEGER NOT NULL,
service_job_execution_id VARCHAR(50) NOT NULL,
job_instance_id INTEGER NOT NULL,
create_time DATETIME,
start_time DATETIME,
end_time DATETIME,
running_status INTEGER,
running_on TEXT,
PRIMARY KEY (id)
)
sk_job_instance: the scheduled-job table; holds the scheduled jobs of every project.
table|sk_job_instance|sk_job_instance|5|CREATE TABLE sk_job_instance (
id INTEGER NOT NULL,
date_created DATETIME,
date_modified DATETIME,
spider_name VARCHAR(100) NOT NULL,
project_id INTEGER NOT NULL,
tags TEXT,
spider_arguments TEXT,
priority INTEGER,
"desc" TEXT,
cron_minutes VARCHAR(20),
cron_hour VARCHAR(20),
cron_day_of_month VARCHAR(20),
cron_day_of_week VARCHAR(20),
cron_month VARCHAR(20),
enabled INTEGER,
run_type VARCHAR(20),
PRIMARY KEY (id)
)
sk_project: the project table; holds all projects.
type|name|tbl_name|rootpage|sql
table|sk_project|sk_project|2|CREATE TABLE sk_project (
id INTEGER NOT NULL,
date_created DATETIME,
date_modified DATETIME,
project_name VARCHAR(50),
PRIMARY KEY (id)
)
sk_spider: the spider table; holds all spiders.
table|sk_spider|sk_spider|3|CREATE TABLE sk_spider (
id INTEGER NOT NULL,
date_created DATETIME,
date_modified DATETIME,
spider_name VARCHAR(100),
project_id INTEGER NOT NULL,
PRIMARY KEY (id)
)
All four models are defined in model.py under the spider package.
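Given the schemas above, recent executions can be joined to their job instance with plain SQL (a sketch using the stdlib sqlite3 module; column names are taken from the CREATE statements above):

```python
import sqlite3

RECENT_EXECUTIONS = """
SELECT e.id, i.spider_name, e.running_status, e.start_time, e.end_time
FROM sk_job_execution AS e
JOIN sk_job_instance AS i ON i.id = e.job_instance_id
ORDER BY e.id DESC
LIMIT 10
"""

def recent_executions(db_path):
    """Return the latest job executions together with their spider names."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(RECENT_EXECUTIONS).fetchall()
```

This mirrors what the dashboard shows: sk_job_execution rows linked back to sk_job_instance via job_instance_id.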
4. API
Defined in controller.py under the spider package.
The Api part:
api.add_resource(ProjectCtrl, "/api/projects") # returns the project list (project id and project name)
Response:
'''
[
{
"project_id": 1,
"project_name": "baidutieba_wz"
}
] '''
api.add_resource(SpiderCtrl, "/api/projects/<project_id>/spiders") # returns the spider list for a project
Response:
'''
[
{
"spider_instance_id": 1,
"spider_name": "all_detail",
"project_id": 1
}
]
'''
api.add_resource(SpiderDetailCtrl, "/api/projects/<project_id>/spiders/<spider_id>") # returns a spider dict, looked up by project id and spider id
'''
{
"spider_instance_id": 1,
"spider_name": "all_detail",
"project_id": 1
} '''
api.add_resource(JobCtrl, "/api/projects/<project_id>/jobs") # returns all scheduled jobs for a project
'''
[
{
"job_instance_id": 2,
"spider_name": "all_detail",
"tags": null,
"spider_arguments": "",
"priority": 0,
"desc": null,
"cron_minutes": "0",
"cron_hour": "*",
"cron_day_of_month": "*",
"cron_day_of_week": "*",
"cron_month": "*",
"enabled": true,
"run_type": "periodic"
}
]
'''
api.add_resource(JobDetailCtrl, "/api/projects/<project_id>/jobs/<job_id>") # its purpose is not obvious from the code
api.add_resource(JobExecutionCtrl, "/api/projects/<project_id>/jobexecs") # job executions of every status, grouped into a dict
''' {
"PENDING": [],
"RUNNING": [],
"COMPLETED": [
{
"project_id": 1,
"job_execution_id": 2,
"job_instance_id": 2,
"service_job_execution_id": "f91f3ed0341311e9a72c645aedeb0f3b",
"create_time": "2019-02-19 15:00:00",
"start_time": "2019-02-19 15:00:03",
"end_time": "2019-02-19 15:24:53",
"running_status": 3,
"running_on": "http://127.0.0.1:6800",
"job_instance": {
"job_instance_id": 2,
"spider_name": "all_detail",
"tags": null,
"spider_arguments": "",
"priority": 0,
"desc": null,
"cron_minutes": "0",
"cron_hour": "*",
"cron_day_of_month": "*",
"cron_day_of_week": "*",
"cron_month": "*",
"enabled": true,
"run_type": "periodic"
}
} ]
}
'''
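Since SpiderKeeper is launched with --username/--password, these endpoints presumably sit behind HTTP Basic auth. A hedged stdlib sketch that builds (but does not send) an authenticated request for the project list; the host, port, and credentials are the example values from the run command above:

```python
import base64
import urllib.request

def projects_request(base_url, username, password):
    """Build a GET request for /api/projects carrying an HTTP Basic auth header."""
    req = urllib.request.Request(base_url.rstrip("/") + "/api/projects")
    token = base64.b64encode("{}:{}".format(username, password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req

req = projects_request("http://127.0.0.1:50002", "spider", "sxkjspider2018")
# send with urllib.request.urlopen(req) against a running instance,
# then json.loads() the body into the list shown above
```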
The Router part
project_create: create a project; corresponds to scrapyd's AddVersion
Method: POST
Parameter: project_name (the project name)
Code:
@app.route("/project/create", methods=['post'])
def project_create():
    project_name = request.form['project_name']
    project = Project()
    project.project_name = project_name
    db.session.add(project)
    db.session.commit()
    return redirect("/project/%s/spider/deploy" % project.id, code=302)
project_delete: delete a project; corresponds to scrapyd's DeleteProject
Method: GET
Parameter: project_id (the project's id)
Code:
@app.route("/project/<project_id>/delete")
def project_delete(project_id):
    project = Project.find_project_by_id(project_id)
    agent.delete_project(project)
    db.session.delete(project)
    db.session.commit()
    return redirect("/project/manage", code=302)
project_manage: the project landing page; project changes (deletes and creates) redirect back here
Method: GET
Code:
@app.route("/project/manage")
def project_manage():
    return render_template("project_manage.html")
project_index: shows all of a project's job executions (pending, running, cancelled, finished)
Method: GET
Parameter: project_id (the project ID)
Code:
@app.route("/project/<project_id>")
def project_index(project_id):
    session['project_id'] = project_id
    return redirect("/project/%s/job/dashboard" % project_id, code=302)
index: the start page; redirects to project_manage when no project exists, otherwise to the first project's job dashboard (project_index)
Method: GET
Code:
@app.route("/")
def index():
    project = Project.query.first()
    if project:
        return redirect("/project/%s/job/dashboard" % project.id, code=302)
    return redirect("/project/manage", code=302)
job_periodic: lists all of this project's periodic (scheduled) jobs
Method: GET
Parameter: project_id (the project ID)
Code:
@app.route("/project/<project_id>/job/periodic")
def job_periodic(project_id):
    project = Project.find_project_by_id(project_id)
    job_instance_list = [job_instance.to_dict() for job_instance in
                         JobInstance.query.filter_by(run_type="periodic", project_id=project_id).all()]
    return render_template("job_periodic.html",
                           job_instance_list=job_instance_list)
job_add: add a job (scheduled or one-time; the daemon field may be 'auto')
Method: POST (the route below only accepts POST)
Parameters:
    project_id: project ID
    spider_name: spider name
    spider_arguments: comma-separated arguments forwarded to the spider (see the daemon handling below)
    priority: job priority (defaults to 0)
    run_type: run type (one-time or periodic)
    cron_minutes: minute field
    cron_hour: hour field
    cron_day_of_month: day-of-month field
    cron_day_of_week: day-of-week field
    cron_month: month field
Code:
@app.route("/project/<project_id>/job/add", methods=['post'])
def job_add(project_id):
    project = Project.find_project_by_id(project_id)
    job_instance = JobInstance()
    job_instance.spider_name = request.form['spider_name']
    job_instance.project_id = project_id
    job_instance.spider_arguments = request.form['spider_arguments']
    job_instance.priority = request.form.get('priority', 0)
    job_instance.run_type = request.form['run_type']
    # choose daemon manually
    if request.form['daemon'] != 'auto':
        spider_args = []
        if request.form['spider_arguments']:
            spider_args = request.form['spider_arguments'].split(",")
        spider_args.append("daemon={}".format(request.form['daemon']))
        job_instance.spider_arguments = ','.join(spider_args)
    if job_instance.run_type == JobRunType.ONETIME:
        job_instance.enabled = -1
        db.session.add(job_instance)
        db.session.commit()
        agent.start_spider(job_instance)
    if job_instance.run_type == JobRunType.PERIODIC:
        job_instance.cron_minutes = request.form.get('cron_minutes') or '0'
        job_instance.cron_hour = request.form.get('cron_hour') or '*'
        job_instance.cron_day_of_month = request.form.get('cron_day_of_month') or '*'
        job_instance.cron_day_of_week = request.form.get('cron_day_of_week') or '*'
        job_instance.cron_month = request.form.get('cron_month') or '*'
        # set cron expression manually
        if request.form.get('cron_exp'):
            job_instance.cron_minutes, job_instance.cron_hour, job_instance.cron_day_of_month, job_instance.cron_day_of_week, job_instance.cron_month = \
                request.form['cron_exp'].split(' ')
        db.session.add(job_instance)
        db.session.commit()
    return redirect(request.referrer, code=302)
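The cron_exp handling above is worth calling out: the expression is unpacked into minutes, hour, day_of_month, day_of_week, month, so the last two fields are swapped relative to standard crontab order (min hour dom month dow). A small sketch mirroring that unpacking:

```python
def parse_cron_exp(cron_exp):
    """Split a SpiderKeeper cron_exp the same way job_add does.

    Field order follows the unpacking in job_add above:
    minutes, hour, day_of_month, day_of_week, month --
    note day_of_week and month are swapped versus standard crontab.
    """
    fields = cron_exp.split(' ')
    if len(fields) != 5:
        raise ValueError("expected 5 space-separated fields, got %d" % len(fields))
    minutes, hour, day_of_month, day_of_week, month = fields
    return {
        "cron_minutes": minutes,
        "cron_hour": hour,
        "cron_day_of_month": day_of_month,
        "cron_day_of_week": day_of_week,
        "cron_month": month,
    }
```

For example, "0 * * 1 *" here means minute 0 of every hour on day-of-week 1, not month 1 as a standard crontab would read it.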
job_stop: stop a running job
Method: GET
Parameters:
    project_id: project ID
    job_exec_id: job execution ID
Code:
@app.route("/project/<project_id>/jobexecs/<job_exec_id>/stop")
def job_stop(project_id, job_exec_id):
    job_execution = JobExecution.query.filter_by(project_id=project_id, id=job_exec_id).first()
    agent.cancel_spider(job_execution)
    return redirect(request.referrer, code=302)
job_log: show a job's log
Method: GET
Parameters:
    project_id: project ID
    job_exec_id: job execution ID
Code:
@app.route("/project/<project_id>/jobexecs/<job_exec_id>/log")
def job_log(project_id, job_exec_id):
    job_execution = JobExecution.query.filter_by(project_id=project_id, id=job_exec_id).first()
    # ask scrapyd for only the last 5000 bytes of the log (HTTP suffix byte range)
    num_bytes = -5000
    while True:
        res = requests.get(agent.log_url(job_execution),
                           headers={'Range': 'bytes={}'.format(num_bytes)})
        if '<span>builtins.OSError</span>: <span>[Errno 22] Invalid argument</span>' not in res.text:
            break
        # the log is shorter than the requested suffix: retry with a smaller range
        # (integer division; the original `bytes / 10` yields a float under
        # Python 3 and produces an invalid Range header)
        num_bytes //= 10
    res.encoding = 'utf8'
    raw = res.text
    return render_template("job_log.html", log_lines=raw.split('\n'))
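The negative Range value is an RFC 7233 suffix byte range: bytes=-5000 requests the last 5000 bytes of the log. The retry loop shrinks that suffix whenever scrapyd reports the file is shorter than requested; the sequence of headers it tries can be sketched as:

```python
def suffix_ranges(start=5000, factor=10):
    """Yield Range header values for progressively smaller log tails,
    mirroring job_log's shrink-and-retry loop."""
    n = start
    while n > 0:
        yield "bytes=-{}".format(n)
        n //= factor
```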
job_run: manually start a scheduled job
Method: GET
Parameters:
    project_id: project ID
    job_instance_id: scheduled job ID
Code (the original post pasted the job_remove handler here by mistake; the sketch below is the presumed job_run handler, which looks up the JobInstance and hands it to agent.start_spider):
@app.route("/project/<project_id>/job/<job_instance_id>/run")
def job_run(project_id, job_instance_id):
    job_instance = JobInstance.query.filter_by(project_id=project_id, id=job_instance_id).first()
    agent.start_spider(job_instance)
    return redirect(request.referrer, code=302)
job_remove: remove a scheduled job
Method: GET
Parameters:
    project_id: project ID
    job_instance_id: scheduled job ID
Code:
@app.route("/project/<project_id>/job/<job_instance_id>/remove")
def job_remove(project_id, job_instance_id):
    job_instance = JobInstance.query.filter_by(project_id=project_id, id=job_instance_id).first()
    db.session.delete(job_instance)
    db.session.commit()
    return redirect(request.referrer, code=302)
job_switch: toggle a scheduled job's enabled state
Method: GET
Parameters:
    project_id: project ID
    job_instance_id: scheduled job ID
Code:
@app.route("/project/<project_id>/job/<job_instance_id>/switch")
def job_switch(project_id, job_instance_id):
    job_instance = JobInstance.query.filter_by(project_id=project_id, id=job_instance_id).first()
    job_instance.enabled = -1 if job_instance.enabled == 0 else 0
    db.session.commit()
    return redirect(request.referrer, code=302)
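The enabled column is not a plain boolean: judging from job_switch and job_add above, 0 means enabled and -1 means disabled (one-time jobs are created with -1 so the scheduler skips them). The toggle can be isolated as:

```python
def toggle_enabled(enabled):
    """Flip a JobInstance.enabled flag the way job_switch does:
    0 (enabled) becomes -1 (disabled), and anything else becomes 0."""
    return -1 if enabled == 0 else 0
```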
spider_dashboard: spider overview
Method: GET
Parameter:
    project_id: project ID
Code:
@app.route("/project/<project_id>/spider/dashboard")
def spider_dashboard(project_id):
    spider_instance_list = SpiderInstance.list_spiders(project_id)
    return render_template("spider_dashboard.html",
                           spider_instance_list=spider_instance_list)
spider_deploy: project deployment page (the egg upload form)
Method: GET
Parameter:
    project_id: project ID
Code:
@app.route("/project/<project_id>/spider/deploy")
def spider_deploy(project_id):
    project = Project.find_project_by_id(project_id)
    return render_template("spider_deploy.html")
spider_egg_upload: upload an egg file
Method: POST
Parameters:
    project_id: project ID
    file: the egg file
Code:
@app.route("/project/<project_id>/spider/upload", methods=['post'])
def spider_egg_upload(project_id):
    project = Project.find_project_by_id(project_id)
    if 'file' not in request.files:
        flash('No file part')
        return redirect(request.referrer)
    file = request.files['file']
    # if the user does not select a file, some browsers
    # submit an empty part without a filename
    if file.filename == '':
        flash('No selected file')
        return redirect(request.referrer)
    if file:
        filename = secure_filename(file.filename)
        dst = os.path.join(tempfile.gettempdir(), filename)
        file.save(dst)
        agent.deploy(project, dst)
        flash('deploy success!')
    return redirect(request.referrer)
project_stats: per-project run statistics (job executions grouped by hour), rendered as a chart page
Method: GET
Parameter:
    project_id: project ID
Code:
@app.route("/project/<project_id>/project/stats")
def project_stats(project_id):
    project = Project.find_project_by_id(project_id)
    run_stats = JobExecution.list_run_stats_by_hours(project_id)
    return render_template("project_stats.html", run_stats=run_stats)
service_stats: server statistics page
Method: GET
Parameter:
    project_id: project ID
Code:
@app.route("/project/<project_id>/server/stats")
def service_stats(project_id):
    project = Project.find_project_by_id(project_id)
    run_stats = JobExecution.list_run_stats_by_hours(project_id)
    return render_template("server_stats.html", run_stats=run_stats)