



Source: https://github.com/google/cluster-data


Dataset description: The clusterdata-2011-2 trace represents 29 days' worth of cell information from May 2011, on a cluster of about 12.5k machines.


job events table:

row:1672923    286.86 MB (300,795,688)

index:jobid(btree) 35.56 MB (37,285,888)

machine events table:

row:37780    2.99 MB (3,138,540)

machine attribute:

row:10748566    1.09 GB (1,175,642,124)

task constraints:

row:28485619    2.95 GB (3,163,127,240)

task usage:

row:1232799308    182.55 GB (196,015,089,972)

index:69.61 GB (74,743,799,808)

machineid(btree) jobid(btree)

task events: (the data import has some problems; currently being fixed)

row:144648292 12.76 GB (13,700,652,148)

index: 6.90 GB (7,414,187,008)
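
The parenthesized numbers above look like raw byte counts, and the rounded sizes appear to use 1024-based units. A minimal sketch to check that reading (the helper name `human_size` is hypothetical, not part of any tool used for these notes):

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count with 1024-based units and two decimals."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


print(human_size(196_015_089_972))  # task usage rows: 182.55 GB
print(human_size(300_795_688))      # job events rows: 286.86 MB
```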


explain part1: fields

explain part2: tables








machine event type:0.add 1.remove 2.update

job and task event type:0.submit 1.schedule 2.evict 3.fail 4.kill 5.finish 6.lost 7.update_pending 8.update_running
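
The two code lists above can be kept as enums so that queries and scripts use names instead of bare integers. This is only a sketch: the class names `MachineEventType` and `JobTaskEventType` are hypothetical, and the values are copied from the lists above.

```python
from enum import IntEnum


class MachineEventType(IntEnum):
    ADD = 0
    REMOVE = 1
    UPDATE = 2


class JobTaskEventType(IntEnum):
    SUBMIT = 0
    SCHEDULE = 1
    EVICT = 2
    FAIL = 3
    KILL = 4
    FINISH = 5
    LOST = 6
    UPDATE_PENDING = 7
    UPDATE_RUNNING = 8


# Decode an integer code read from a table row.
print(JobTaskEventType(5).name)   # FINISH
print(MachineEventType(1).name)   # REMOVE
```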


infrastructure (11): this is the highest (most entitled to get resources) priority in the trace and accounts for most of the recorded disk I/O, so we speculate it includes some storage services;
monitoring (10)
normal production (9): this is the lowest (and most occupied) of the priorities labeled ‘production’. The trace providers indicate that jobs at this priority and higher which are latency-sensitive should not be “evicted due to over-allocation of machine resources”.
other (2-8): we speculate that these priorities are dominated by batch jobs;
gratis (free) (0-1): the trace providers indicate that resources used by tasks at these priorities are generally not charged.
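
As a sketch, the priority groups listed above can be turned into a small lookup. The function name `priority_class` and the returned label strings are hypothetical; only the numeric ranges come from the list.

```python
def priority_class(priority: int) -> str:
    """Map a numeric priority (0-11) to the group it belongs to."""
    if priority == 11:
        return "infrastructure"
    if priority == 10:
        return "monitoring"
    if priority == 9:
        return "normal production"
    if 2 <= priority <= 8:
        return "other"
    if 0 <= priority <= 1:
        return "gratis"
    raise ValueError(f"priority out of range: {priority}")


print(priority_class(4))   # other
print(priority_class(11))  # infrastructure
```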

missing info: NULL for normal records; 0-2 for records with missing data.

0.SNAPSHOT_BUT_NO_TRANSITION: we did not find a record representing the given event, but a later snapshot of the job or task state indicated that the transition must have occurred. The timestamp of the synthesized event is the timestamp of the snapshot.

1.NO_SNAPSHOT_OR_TRANSITION: we did not find a record representing the given termination event, but the job or task disappeared from later snapshots of cluster states, so it must have been terminated. The timestamp of the synthesized event is a pessimistic upper bound on its actual termination time, assuming it could have legitimately been missing from one snapshot.

2.EXISTS_BUT_NO_CREATION: we did not find a record representing the creation of the given task or job. In this case, we may be missing metadata (job name, resource requests, etc.) about the job or task and we may have placed SCHEDULE or SUBMIT events later than they actually are.
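
A minimal sketch of decoding the missing info field when reading rows in a script. It assumes that a NULL in the database shows up as `None` or as an empty string in a CSV export; the names `MissingInfo` and `parse_missing_info` are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class MissingInfo(IntEnum):
    SNAPSHOT_BUT_NO_TRANSITION = 0
    NO_SNAPSHOT_OR_TRANSITION = 1
    EXISTS_BUT_NO_CREATION = 2


def parse_missing_info(raw: Optional[str]) -> Optional[MissingInfo]:
    """Return None for a normal record, else the matching MissingInfo code."""
    # Assumption: NULL arrives as None or as an empty field.
    if raw is None or raw.strip() == "":
        return None
    return MissingInfo(int(raw))


print(parse_missing_info(""))   # None
print(parse_missing_info("2"))  # MissingInfo.EXISTS_BUT_NO_CREATION
```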



comparison operator:??





1.Machine events
Each machine is described by one or more records in the machine event table. The majority of records describe machines that existed at the start of the trace.
1. timestamp
2. machine ID
3. event type
4. platform ID
5. capacity: CPU
6. capacity: memory
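
The six fields above can be read into a typed record. This is a sketch under assumptions: the column order follows the list above, the Python types are guesses (integer timestamp and IDs, platform ID as text, capacities as floats that may be empty), and the sample rows are made up for illustration rather than taken from the trace.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MachineEvent:
    timestamp: int
    machine_id: int
    event_type: int                  # 0 add, 1 remove, 2 update
    platform_id: str
    cpu_capacity: Optional[float]
    memory_capacity: Optional[float]


def _opt_float(value: str) -> Optional[float]:
    """Convert a CSV field to float, treating an empty field as missing."""
    return float(value) if value else None


def parse_machine_events(text: str) -> List[MachineEvent]:
    """Parse header-less CSV text with the six machine event columns."""
    events = []
    for ts, machine_id, event_type, platform, cpu, mem in csv.reader(io.StringIO(text)):
        events.append(MachineEvent(int(ts), int(machine_id), int(event_type),
                                   platform, _opt_float(cpu), _opt_float(mem)))
    return events


# Made-up sample rows, not real trace data.
sample = "0,5,0,platformA,0.5,0.25\n1000,5,1,platformA,,\n"
for event in parse_machine_events(sample):
    print(event)
```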

2.job events & task events

The two event tables describe jobs/tasks and their lifecycles. The constraints table describes task placement constraints that restrict the machines onto which tasks can schedule.

The simplest case is shown by the top path in the diagram above: a job is SUBMITted and gets put into a pending queue; soon afterwards, it is SCHEDULEd onto a machine and starts running; some time later it FINISHes successfully.
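
For that simplest path, the time spent pending and the time spent running fall out of the three event timestamps. The sketch below is an illustration only: the function name `simple_lifecycle_durations` is hypothetical, the event codes are the ones listed earlier in these notes (0 submit, 1 schedule, 5 finish), and the result is in whatever unit the timestamps use.

```python
from typing import Iterable, Optional, Tuple

SUBMIT, SCHEDULE, FINISH = 0, 1, 5


def simple_lifecycle_durations(
    events: Iterable[Tuple[int, int]]
) -> Optional[Tuple[int, int]]:
    """events: (timestamp, event_type) pairs for one job or task.

    Returns (pending_time, running_time) if the events are exactly
    SUBMIT, SCHEDULE, FINISH in time order, otherwise None.
    """
    ordered = sorted(events)
    if [etype for _, etype in ordered] != [SUBMIT, SCHEDULE, FINISH]:
        return None
    (t_submit, _), (t_schedule, _), (t_finish, _) = ordered
    return t_schedule - t_submit, t_finish - t_schedule


# Made-up timestamps for illustration.
print(simple_lifecycle_durations([(10, SUBMIT), (25, SCHEDULE), (100, FINISH)]))  # (15, 75)
print(simple_lifecycle_durations([(10, SUBMIT), (25, SCHEDULE), (40, 2)]))        # None
```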


3.task usage



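The intermediate tables are, respectively: the machine IDs contained in each platform, all medium-priority tasks (priority 2-8), and all tasks that were successfully queued (event type 1). Each table has a corresponding index. (With these intermediate tables, query time dropped from several hours to under 1 min.)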
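
The idea can be sketched as follows: materialize each filtered subset once, then index it, so later queries never scan the full tables. This uses Python's built-in `sqlite3` only as a stand-in for the database actually used for these notes; all table, column, and index names are hypothetical, and the sample rows are made up.

```python
import sqlite3


def build_intermediate_tables(conn: sqlite3.Connection) -> None:
    """Create the three filtered tables and an index on each."""
    cur = conn.cursor()
    # Machine IDs contained in each platform.
    cur.execute("CREATE TABLE platform_machines AS "
                "SELECT DISTINCT platform_id, machine_id FROM machine_events")
    cur.execute("CREATE INDEX idx_pm_platform ON platform_machines (platform_id)")
    # Medium-priority task rows (priority 2-8).
    cur.execute("CREATE TABLE mid_priority_tasks AS "
                "SELECT * FROM task_events WHERE priority BETWEEN 2 AND 8")
    cur.execute("CREATE INDEX idx_mp_job ON mid_priority_tasks (job_id)")
    # Task rows with event type 1.
    cur.execute("CREATE TABLE event1_tasks AS "
                "SELECT * FROM task_events WHERE event_type = 1")
    cur.execute("CREATE INDEX idx_e1_machine ON event1_tasks (machine_id)")
    conn.commit()


def count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Tiny made-up data set, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machine_events "
             "(ts INTEGER, machine_id INTEGER, event_type INTEGER, platform_id TEXT)")
conn.execute("CREATE TABLE task_events (ts INTEGER, job_id INTEGER, task_index INTEGER, "
             "machine_id INTEGER, event_type INTEGER, priority INTEGER)")
conn.executemany("INSERT INTO machine_events VALUES (?, ?, ?, ?)",
                 [(0, 5, 0, "A"), (0, 6, 0, "B"), (50, 5, 2, "A")])
conn.executemany("INSERT INTO task_events VALUES (?, ?, ?, ?, ?, ?)",
                 [(0, 1, 0, None, 0, 9), (10, 1, 0, 5, 1, 9),
                  (0, 2, 0, None, 0, 4), (20, 2, 0, 6, 1, 4),
                  (30, 3, 0, 6, 1, 0)])
build_intermediate_tables(conn)

print(count(conn, "platform_machines"))   # 2
print(count(conn, "mid_priority_tasks"))  # 2
print(count(conn, "event1_tasks"))        # 3
```

The trade-off is extra disk space and a one-time build cost in exchange for much smaller tables to scan and index on every later query.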


