翻译——1_Project Overview, Data Wrangling and Exploratory Analysis-checkpoint
为提高提高大学能源效率进行建筑能源需求预测
本文翻译哈佛大学的能源分析和预测报告,这是原文
暂无数据源,个人认为学习分析方法就足够
内容:
- 项目概述
- 了解数据
- 探索性分析
- 使用不同的机器学习方法进行预测
- 总结
- 结论
- 讨论
1. 项目概述
用机器学习来进行能源预测,希望能够节约能源
2. 了解数据
有三种类型的能源消耗,电力,冷水和热水。图显示了哈佛工厂供应的冷水和热水的建筑物。
图:哈佛的冷水和热水供应。(左:冷水,用蓝色标出。右边:热水,用黄色突出显示。)
我们选择了一栋建筑,并获得了2011年7月1日至2014年10月31日的能耗数据。由于仪表故障,有几个月的数据丢失。数据分辨率是每小时一次。在原始数据中,每小时的数据是仪表读数。为了得到每小时的消耗,我们需要抵消数据然后减去。我们有2012年1月到2014年10月的每小时天气和能源数据(2.75年)。天气数据来自剑桥气象站。
在本节中,我们将完成以下任务。
。手动从哈佛能源见证网站下载原始数据,获取每小时的电力、冷水和热水。
。干净的天气数据,增加了更多的功能,包括冷度,热度和湿度比。
。根据假期、学年和周末估算每日入住率。
。创建与小时相关的特性,即cos(hourOfDay * 2 * pi / 24)。
。合并电力、冷水和热水数据流与天气、时间和占用功能。
%matplotlib inline
import requests
from StringIO import StringIO
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting
import datetime as dt # module for manipulating dates and times
import numpy.linalg as lin # 执行线性代数运算的模块
from __future__ import division
from math import log10,exp
pd.options.display.mpl_style = 'default'
原始能耗数据
原始数据从哈佛能源见证网站下载
然后我们用Pandas 把它们放在一个dataframe里。
file = 'Data/Org/0701-0930-2011.xls'
df = pd.read_excel(file, header = 0, skiprows = np.arange(0,6))
files = ['Data/Org/1101-1130-2011.xls',
'Data/Org/1201-2011-0131-2012.xls',
'Data/Org/0201-0331-2012.xls',
'Data/Org/0401-0531-2012.xls',
'Data/Org/0101-0228-2013.xls',
'Data/Org/0301-0430-2013.xls',
'Data/Org/0501-0630-2013.xls',
'Data/Org/0701-0831-2013.xls',
'Data/Org/0901-1031-2013.xls',
'Data/Org/1101-1231-2013.xls',
'Data/Org/0101-0228-2014.xls',
'Data/Org/0301-0430-2014.xls',
'Data/Org/0501-0630-2014.xls',
'Data/Org/0701-0831-2014.xls',
'Data/Org/0901-1031-2014.xls']
for file in files:
data = pd.read_excel(file, header = 0, skiprows = np.arange(0,6))
df = df.append(data)
df.head()
WARNING *** file size (2481102) not 512 + multiple of sector size (512)
WARNING *** file size (848833) not 512 + multiple of sector size (512)
WARNING *** file size (1694257) not 512 + multiple of sector size (512)
WARNING *** file size (1640459) not 512 + multiple of sector size (512)
WARNING *** file size (1667907) not 512 + multiple of sector size (512)
WARNING *** file size (847258) not 512 + multiple of sector size (512)
WARNING *** file size (1691449) not 512 + multiple of sector size (512)
WARNING *** file size (1666647) not 512 + multiple of sector size (512)
WARNING *** file size (1665736) not 512 + multiple of sector size (512)
WARNING *** file size (1614814) not 512 + multiple of sector size (512)
WARNING *** file size (1665980) not 512 + multiple of sector size (512)
WARNING *** file size (1667276) not 512 + multiple of sector size (512)
WARNING *** file size (1691736) not 512 + multiple of sector size (512)
WARNING *** file size (1666704) not 512 + multiple of sector size (512)
WARNING *** file size (1665920) not 512 + multiple of sector size (512)
WARNING *** file size (1614900) not 512 + multiple of sector size (512)
WARNING *** file size (1666228) not 512 + multiple of sector size (512)
WARNING *** file size (1666191) not 512 + multiple of sector size (512)
WARNING *** file size (1691845) not 512 + multiple of sector size (512)
WARNING *** file size (1663846) not 512 + multiple of sector size (512)
Unnamed: 0 | Unnamed: 1 | Gund Bus-A 15 Min Block Demand - kW | Gund Bus-A CurrentA - Amps | Unnamed: 4 | Unnamed: 5 | Gund Bus-A CurrentB - Amps | Unnamed: 7 | Gund Bus-A CurrentC - Amps | Unnamed: 9 | ... | Gund Main Demand - Tons | Gund Main Energy - Ton-Days | Gund Main FlowRate - gpm | Gund Main FlowTotal - kgal(1000) | Gund Main SignalAeration - Count | Gund Main SignalStrength - Count | Gund Main SonicVelocity - Ft/Sec | Gund Main TempDelta - Deg F | Gund Main TempReturn - Deg F | Gund Main TempSupply - Deg F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-07-01 01:00:00 | White | 48.458733 | 65.977882 | NaN | NaN | 52.631417 | NaN | 55.603840 | NaN | ... | 4.677294 | 17912.537804 | 6.916454 | 48168.083414 | 0.693405 | 57.208127 | 1437.640543 | 16.238684 | 59.757447 | 43.516103 |
1 | 2011-07-01 02:00:00 | White | 40.472697 | 57.230223 | NaN | NaN | 42.483092 | NaN | 50.243230 | NaN | ... | 4.586403 | 17912.853518 | 6.739337 | 48168.645429 | 0.567355 | 57.082909 | 1438.030719 | 16.263573 | 59.710199 | 43.495128 |
2 | 2011-07-01 03:00:00 | #d2e4b0 | 39.472809 | 55.487443 | NaN | NaN | 41.911784 | NaN | 48.482163 | NaN | ... | 4.462877 | 17913.169232 | 6.725142 | 48169.207444 | 0.441304 | 57.001646 | 1439.111130 | 15.797043 | 59.248158 | 43.457344 |
3 | 2011-07-01 04:00:00 | White | 39.198879 | 55.849806 | NaN | NaN | 41.525529 | NaN | 48.987457 | NaN | ... | 4.696993 | 17913.484946 | 7.041330 | 48169.769458 | 0.315254 | 57.000000 | 1440.768604 | 15.947392 | 59.207097 | 43.267682 |
4 | 2011-07-01 05:00:00 | White | 39.297522 | 55.736219 | NaN | NaN | 41.299381 | NaN | 48.710408 | NaN | ... | 4.550372 | 17913.800660 | 6.863004 | 48170.331473 | 0.189204 | 57.000000 | 1442.426077 | 15.903679 | 59.282707 | 43.372615 |
5 rows × 55 columns
以上是原始的每小时数据。
正如你所看到的,它很乱。首先要删除没有意义的列。
df.rename(columns={'Unnamed: 0':'Datetime'}, inplace=True)
nonBlankColumns = ['Unnamed' not in s for s in df.columns]
columns = df.columns[nonBlankColumns]
df = df[columns]
df = df.set_index(['Datetime'])
df.index.name = None
df.head()
Gund Bus-A 15 Min Block Demand - kW | Gund Bus-A CurrentA - Amps | Gund Bus-A CurrentB - Amps | Gund Bus-A CurrentC - Amps | Gund Bus-A CurrentN - Amps | Gund Bus-A EnergyReal - kWhr | Gund Bus-A Freq - Hertz | Gund Bus-A Max Monthly Demand - kW | Gund Bus-A PowerApp - kVA | Gund Bus-A PowerReac - kVAR | ... | Gund Main Demand - Tons | Gund Main Energy - Ton-Days | Gund Main FlowRate - gpm | Gund Main FlowTotal - kgal(1000) | Gund Main SignalAeration - Count | Gund Main SignalStrength - Count | Gund Main SonicVelocity - Ft/Sec | Gund Main TempDelta - Deg F | Gund Main TempReturn - Deg F | Gund Main TempSupply - Deg F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011-07-01 01:00:00 | 48.458733 | 65.977882 | 52.631417 | 55.603840 | 15.982278 | 1796757.502803 | 59.837524 | 96.117915 | 48.757073 | 12.344712 | ... | 4.677294 | 17912.537804 | 6.916454 | 48168.083414 | 0.693405 | 57.208127 | 1437.640543 | 16.238684 | 59.757447 | 43.516103 |
2011-07-01 02:00:00 | 40.472697 | 57.230223 | 42.483092 | 50.243230 | 13.423762 | 1796800.145991 | 60.005569 | 96.117915 | 42.238685 | 12.967984 | ... | 4.586403 | 17912.853518 | 6.739337 | 48168.645429 | 0.567355 | 57.082909 | 1438.030719 | 16.263573 | 59.710199 | 43.495128 |
2011-07-01 03:00:00 | 39.472809 | 55.487443 | 41.911784 | 48.482163 | 13.478933 | 1796840.146023 | 59.833880 | 96.117915 | 41.278573 | 12.732046 | ... | 4.462877 | 17913.169232 | 6.725142 | 48169.207444 | 0.441304 | 57.001646 | 1439.111130 | 15.797043 | 59.248158 | 43.457344 |
2011-07-01 04:00:00 | 39.198879 | 55.849806 | 41.525529 | 48.987457 | 13.603309 | 1796879.023607 | 59.673044 | 96.117915 | 41.345776 | 12.687845 | ... | 4.696993 | 17913.484946 | 7.041330 | 48169.769458 | 0.315254 | 57.000000 | 1440.768604 | 15.947392 | 59.207097 | 43.267682 |
2011-07-01 05:00:00 | 39.297522 | 55.736219 | 41.299381 | 48.710408 | 13.797331 | 1796918.273558 | 59.986672 | 96.117915 | 41.166736 | 12.437842 | ... | 4.550372 | 17913.800660 | 6.863004 | 48170.331473 | 0.189204 | 57.000000 | 1442.426077 | 15.903679 | 59.282707 | 43.372615 |
5 rows × 48 columns
然后我们打印出所有的列名。只有几根柱子可用来获得每小时的电力、冷水和热水。
for item in df.columns:
print item
Gund Bus-A 15 Min Block Demand - kW
Gund Bus-A CurrentA - Amps
Gund Bus-A CurrentB - Amps
Gund Bus-A CurrentC - Amps
Gund Bus-A CurrentN - Amps
Gund Bus-A EnergyReal - kWhr
Gund Bus-A Freq - Hertz
Gund Bus-A Max Monthly Demand - kW
Gund Bus-A PowerApp - kVA
Gund Bus-A PowerReac - kVAR
Gund Bus-A PowerReal - kW
Gund Bus-A TruePF - PF
Gund Bus-A VoltageAB - Volts
Gund Bus-A VoltageAN - Volts
Gund Bus-A VoltageBC - Volts
Gund Bus-A VoltageBN - Volts
Gund Bus-A VoltageCA - Volts
Gund Bus-A VoltageCN - Volts
Gund Bus-B 15 Min Block Demand - kW
Gund Bus-B CurrentA - Amps
Gund Bus-B CurrentB - Amps
Gund Bus-B CurrentC - Amps
Gund Bus-B CurrentN - Amps
Gund Bus-B EnergyReal - kWhr
Gund Bus-B Freq - Hertz
Gund Bus-B Max Monthly Demand - kW
Gund Bus-B PowerApp - kVA
Gund Bus-B PowerReac - kVAR
Gund Bus-B PowerReal - kW
Gund Bus-B TruePF - PF
Gund Bus-B VoltageAB - Volts
Gund Bus-B VoltageAN - Volts
Gund Bus-B VoltageBC - Volts
Gund Bus-B VoltageBN - Volts
Gund Bus-B VoltageCA - Volts
Gund Bus-B VoltageCN - Volts
Gund Condensate Counter - count
Gund Condensate FlowTotal - LBS
Gund Main Demand - Tons
Gund Main Energy - Ton-Days
Gund Main FlowRate - gpm
Gund Main FlowTotal - kgal(1000)
Gund Main SignalAeration - Count
Gund Main SignalStrength - Count
Gund Main SonicVelocity - Ft/Sec
Gund Main TempDelta - Deg F
Gund Main TempReturn - Deg F
Gund Main TempSupply - Deg F
电力
以电力为例,“Gund Bus A”和“Gund Bus B”。“EnergyReal - kWhr”记录累计消耗量。我们不确定什么是“PowerReal”。为了以防万一,我们也把它放进了电日计。
electricity=df[['Gund Bus-A EnergyReal - kWhr',
'Gund Bus-B EnergyReal - kWhr',
'Gund Bus-A PowerReal - kW',
'Gund Bus-B PowerReal - kW',]]
electricity.head()
Gund Bus-A EnergyReal - kWhr | Gund Bus-B EnergyReal - kWhr | Gund Bus-A PowerReal - kW | Gund Bus-B PowerReal - kW | |
---|---|---|---|---|
2011-07-01 01:00:00 | 1796757.502803 | 3657811.582122 | 47.184015 | 63.486186 |
2011-07-01 02:00:00 | 1796800.145991 | 3657873.464938 | 40.208796 | 61.270542 |
2011-07-01 03:00:00 | 1796840.146023 | 3657934.837505 | 39.209866 | 61.464394 |
2011-07-01 04:00:00 | 1796879.023607 | 3657995.470348 | 39.378507 | 59.396581 |
2011-07-01 05:00:00 | 1796918.273558 | 3658054.470285 | 39.240837 | 58.911729 |
通过检查每月的能耗来验证我们的数据处理方法
为了检验我们对数据的理解是否正确,我们想从每小时的数据中计算出每个月的用电量,然后将结果与facalities提供的每个月的数据进行比较,这些数据也可以在Energy Witness上找到。
以下是facalities提供的月度数据,"Bus A & B"以月度形式称为"CE603B kWh"和"CE604B kWh"。请注意,查表周期不是公历月份。
file = 'Data/monthly electricity.csv'
monthlyElectricityFromFacility = pd.read_csv(file, header=0)
monthlyElectricityFromFacility
monthlyElectricityFromFacility = monthlyElectricityFromFacility.set_index(['month'])
monthlyElectricityFromFacility.head()
startDate | endDate | CE603B kWh | CE604B kWh | |
---|---|---|---|---|
month | ||||
Jul 11 | 6/16/11 | 7/18/11 | 43968.1 | 106307.1 |
Aug 11 | 7/18/11 | 8/17/11 | 41270.1 | 83121.1 |
Sep 11 | 8/17/11 | 9/16/11 | 51514.1 | 107083.1 |
Oct 11 | 9/16/11 | 10/18/11 | 65338.1 | 114350.1 |
Nov 11 | 10/18/11 | 11/17/11 | 65453.1 | 115318.1 |
我们用“EnergyReal - kWhr”列表示两个电表。我们计算了查表周期的开始日期和结束日期的数字,用结束日期的数字减去开始日期的数字,就得到了每月的电量消耗。
monthlyElectricityFromFacility['startDate'] = pd.to_datetime(monthlyElectricityFromFacility['startDate'], format="%m/%d/%y")
values = monthlyElectricityFromFacility.index.values
keys = np.array(monthlyElectricityFromFacility['startDate'])
dates = {}
for key, value in zip(keys, values):
dates[key] = value
sortedDates = np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]
months = []
monthlyElectricityOrg = np.zeros((len(sortedDates) - 1, 2))
for i in range(len(sortedDates) - 1):
begin = sortedDates[i]
end = sortedDates[i+1]
months.append(dates[sortedDates[i]])
monthlyElectricityOrg[i, 0] = (np.round(electricity.loc[end,'Gund Bus-A EnergyReal - kWhr']
- electricity.loc[begin,'Gund Bus-A EnergyReal - kWhr'], 1))
monthlyElectricityOrg[i, 1] = (np.round(electricity.loc[end,'Gund Bus-B EnergyReal - kWhr']
- electricity.loc[begin,'Gund Bus-B EnergyReal - kWhr'], 1))
monthlyElectricity = pd.DataFrame(data = monthlyElectricityOrg, index = months, columns = ['CE603B kWh', 'CE604B kWh'])
plt.figure()
fig, ax = plt.subplots()
fig = monthlyElectricity.plot(marker = 'o', figsize=(15,6), rot = 40, fontsize = 13, ax = ax, linestyle='')
fig.set_axis_bgcolor('w')
plt.xlabel('Billing month', fontsize = 15)
plt.ylabel('kWh', fontsize = 15)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 13)
plt.xticks(np.arange(0,len(months)),months)
plt.title('Original monthly consumption from hourly data',fontsize = 17)
text = 'Meter malfunction'
ax.annotate(text, xy = (9, 4500000),
xytext = (5, 2), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'top')
ax.annotate(text, xy = (8, -4500000),
xytext = (5, 2), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'bottom')
ax.annotate(text, xy = (14, -2500000),
xytext = (5, 2), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'bottom')
ax.annotate(text, xy = (15, 2500000),
xytext = (5, 2), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'top')
plt.show()
上面是使用我们的数据处理方法的每月用电量图。显然,这两个电表故障了好几个月。有两组圆点“CE603B”和“CE604B”,它们来自两个电表。有两个电表。它们的总和就是该建筑物的总耗电量。
monthlyElectricity.loc['Aug 12','CE604B kWh'] = np.nan
monthlyElectricity.loc['Sep 12','CE604B kWh'] = np.nan
monthlyElectricity.loc['Feb 13','CE603B kWh'] = np.nan
monthlyElectricity.loc['Mar 13','CE603B kWh'] = np.nan
fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
plt.bar(np.arange(0, len(monthlyElectricity))-0.5,monthlyElectricity['CE603B kWh'], label='Our data processing from hourly data')
plt.plot(monthlyElectricityFromFacility.loc[months,'CE603B kWh'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0, len(monthlyElectricity)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities, Meter CE603B',fontsize=20)
text = 'Meter malfunction \n estimated by facility'
ax.annotate(text, xy = (14, monthlyElectricityFromFacility.loc['Feb 13','CE603B kWh']),
xytext = (5, 50), fontsize = 15,
arrowprops=dict(facecolor='black', shrink=0.15),
textcoords = 'offset points', ha = 'center', va = 'bottom')
ax.annotate(text, xy = (15, monthlyElectricityFromFacility.loc['Mar 13','CE603B kWh']),
xytext = (5, 50), fontsize = 15,
arrowprops=dict(facecolor='black', shrink=0.15),
textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()
fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
plt.bar(np.arange(0, len(monthlyElectricity))-0.5, monthlyElectricity['CE604B kWh'], label='Our data processing from hourly data')
plt.plot(monthlyElectricityFromFacility.loc[months,'CE604B kWh'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0, len(monthlyElectricity)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities, Meter CE604B',fontsize=20)
ax.annotate(text, xy = (9, monthlyElectricityFromFacility.loc['Sep 12','CE604B kWh']),
xytext = (5, 50), fontsize = 15,
arrowprops=dict(facecolor='black', shrink=0.15),
textcoords = 'offset points', ha = 'center', va = 'bottom')
ax.annotate(text, xy = (8, monthlyElectricityFromFacility.loc['Aug 12','CE604B kWh']),
xytext = (5, 50), fontsize = 15,
arrowprops=dict(facecolor='black', shrink=0.15),
textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()
图中的点是设施每月提供的数据。它们是钞票上的账单。当电表发生故障时,该设施只是对表的能量消耗进行估算。我们排除的数据点不在图中。它们只是比正常水平高得多(是其他正常消费水平的30倍)。“仪表故障,设施预估”是指本月仪表出现故障,这个点是设施预估的。
2013年2月和3月,电表“CE603B”出现故障。2012年8月和9月,电表“CE604B”出现故障。我们把这几个月的关键字设为np.nan。简单地把他们排除在回归之外。再次绘制它们,并与设施月度数据进行比较。 当然,当仪表出故障时除外。在这几个月里,这些设施为了记账而估算能源消耗。对于我们来说,我们只是在回归中剔除了这些月份的每小时和每天的数据点。
建筑总用电量是这电表的总和。为了以防万一,我们把“power”和“energy”放在一起比较
electricity['energy'] = electricity['Gund Bus-A EnergyReal - kWhr'] + electricity['Gund Bus-B EnergyReal - kWhr']
electricity['power'] = electricity['Gund Bus-A PowerReal - kW'] + electricity['Gund Bus-B PowerReal - kW']
electricity.head()
Gund Bus-A EnergyReal - kWhr | Gund Bus-B EnergyReal - kWhr | Gund Bus-A PowerReal - kW | Gund Bus-B PowerReal - kW | energy | power | startTime | endTime | |
---|---|---|---|---|---|---|---|---|
2011-07-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | 2011-07-01 00:00:00 | 2011-07-01 01:00:00 |
2011-07-01 01:00:00 | 1796757.502803 | 3657811.582122 | 47.184015 | 63.486186 | 5454569.084925 | 110.670201 | 2011-07-01 01:00:00 | 2011-07-01 02:00:00 |
2011-07-01 02:00:00 | 1796800.145991 | 3657873.464938 | 40.208796 | 61.270542 | 5454673.610929 | 101.479338 | 2011-07-01 02:00:00 | 2011-07-01 03:00:00 |
2011-07-01 03:00:00 | 1796840.146023 | 3657934.837505 | 39.209866 | 61.464394 | 5454774.983528 | 100.674260 | 2011-07-01 03:00:00 | 2011-07-01 04:00:00 |
2011-07-01 04:00:00 | 1796879.023607 | 3657995.470348 | 39.378507 | 59.396581 | 5454874.493955 | 98.775088 | 2011-07-01 04:00:00 | 2011-07-01 05:00:00 |
获得每小时消耗
这个小时和下一个小时之间的每小时能耗是下一个小时减去这个小时的电表kWh读数(dataframe中的“energy”)。我们假设电表读数是在一小时结束时记录的。为了避免混淆,我们还在dataframe中标记了抄表的开始时间和结束时间。我们将下一小时的每小时电量与推导出的每小时电量进行了比较,大多数时候,电量接近每小时电量。有时,会有很大的不同。
电表记录了累计能耗。假设今天一开始,数字是10。一小时后,用电量为1,然后电表号加1变成11,等等。所以仪表的数量应该不断增加。然而,我们发现,有时,水表读数突然下降到0,过了一段时间,又得到一个很高的正数。这将产生负的每小时消费,然后是一个荒谬的高正电荷数。我们用“t + 1时刻的电表读数- t时刻的电表读数”来计算每小时/每天的能源消耗。如果结果是负数,或者高得离谱,我们就认为仪表出了故障。我们把这些数据点设为np.nan。通过查看excel文件中的原始数据和每月的图表,我们计算出了仪表发生故障的确切日期。
除此之外,偶尔还会有更多的无意义数据点,它们比正常值高出一千倍。正常的小时消耗量范围在100到400之间。我们创建了一个过滤器:“index = abs(hourlyEnergy) < 200000”,这意味着只保留低于200000的值。
# 如果有任何遗漏的小时,重新索引,以获得整个时间跨度。填写nan数据。
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
# 不知何故,重索引并不能很好地工作。2011年10月和其他几个小时的失踪。
# 基本上就是原始长度的长度。
# electricity.reindex(hourlyTimestamp, inplace = True, fill_value = np.nan)
startTime = hourlyTimestamp
endTime = hourlyTimestamp + np.timedelta64(1,'h')
hourlyTime = pd.DataFrame(data = np.transpose([startTime, endTime]), index = hourlyTimestamp, columns = ['startTime', 'endTime'])
electricity = electricity.join(hourlyTime, how = 'outer')
# 以防万一,为了使用diff方法,时间戳必须是按升序排列的。
electricity.sort_index(inplace = True)
hourlyEnergy = electricity.diff(periods=1)['energy']
hourlyElectricity = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['electricity-kWh'])
hourlyElectricity = hourlyElectricity.join(hourlyTime, how = 'inner')
print "Data length: ", len(hourlyElectricity)/24, " days"
hourlyElectricity.head()
Data length: 1218.04166667 days
electricity-kWh | startTime | endTime | |
---|---|---|---|
2011-07-01 00:00:00 | NaN | 2011-07-01 00:00:00 | 2011-07-01 01:00:00 |
2011-07-01 01:00:00 | NaN | 2011-07-01 01:00:00 | 2011-07-01 02:00:00 |
2011-07-01 02:00:00 | 104.526004 | 2011-07-01 02:00:00 | 2011-07-01 03:00:00 |
2011-07-01 03:00:00 | 101.372599 | 2011-07-01 03:00:00 | 2011-07-01 04:00:00 |
2011-07-01 04:00:00 | 99.510428 | 2011-07-01 04:00:00 | 2011-07-01 05:00:00 |
以上是小时耗电量。数据被导出到excel文件中。我们假设电表读数是在一小时结束时记录的。
# 过滤数据,保留NaN并生成两个excels,有NaN和没有NaN
hourlyElectricity.loc[abs(hourlyElectricity['electricity-kWh']) > 100000,'electricity-kWh'] = np.nan
time = hourlyElectricity.index
index = ((time > np.datetime64('2012-07-26')) & (time < np.datetime64('2012-08-18'))) \
| ((time > np.datetime64('2013-01-21')) & (time < np.datetime64('2013-03-08')))
hourlyElectricity.loc[index,'electricity-kWh'] = np.nan
hourlyElectricityWithoutNaN = hourlyElectricity.dropna(axis=0, how='any')
hourlyElectricity.to_excel('Data/hourlyElectricity.xlsx')
hourlyElectricityWithoutNaN.to_excel('Data/hourlyElectricityWithoutNaN.xlsx')
plt.figure()
fig = hourlyElectricity.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the hourly electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()
plt.figure()
fig = hourlyElectricity.iloc[26200:27400,:].plot(marker = 'o',label='hourly electricity', fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('Hourly electricity data of selected days', fontsize = 16)
plt.ylabel('kWh')
plt.legend()
plt.show()
上面是每小时的数据图。在第一个图中,空白部分是由于仪表故障时数据丢失造成的。在第二张图中,你可以看到白天和晚上,工作日和周末的区别。
获得日常消耗
我们通过“reindex”来获得每天的用电量。
dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
electricityReindexed = electricity.reindex(dailyTimestamp, inplace = False)
# 以防万一,为了使用diff方法,时间戳必须是按升序排列的。
electricityReindexed.sort_index(inplace = True)
dailyEnergy = electricityReindexed.diff(periods=1)['energy']
dailyElectricity = pd.DataFrame(data = dailyEnergy.values, index = electricityReindexed.index - np.timedelta64(1,'D'), columns = ['electricity-kWh'])
dailyElectricity['startDay'] = dailyElectricity.index
dailyElectricity['endDay'] = dailyElectricity.index + np.timedelta64(1,'D')
# 过滤数据,保留NaN并生成两个excels,有NaN和没有NaN
dailyElectricity.loc[abs(dailyElectricity['electricity-kWh']) > 2000000,'electricity-kWh'] = np.nan
time = dailyElectricity.index
index = ((time > np.datetime64('2012-07-26')) & (time < np.datetime64('2012-08-18'))) | ((time > np.datetime64('2013-01-21')) & (time < np.datetime64('2013-03-08')))
dailyElectricity.loc[index,'electricity-kWh'] = np.nan
dailyElectricityWithoutNaN = dailyElectricity.dropna(axis=0, how='any')
dailyElectricity.to_excel('Data/dailyElectricity.xlsx')
dailyElectricityWithoutNaN.to_excel('Data/dailyElectricityWithoutNaN.xlsx')
dailyElectricity.head()
electricity-kWh | startDay | endDay | |
---|---|---|---|
2011-06-30 | NaN | 2011-06-30 | 2011-07-01 |
2011-07-01 | NaN | 2011-07-01 | 2011-07-02 |
2011-07-02 | 3630.398480 | 2011-07-02 | 2011-07-03 |
2011-07-03 | 3750.648885 | 2011-07-03 | 2011-07-04 |
2011-07-04 | 4568.724044 | 2011-07-04 | 2011-07-05 |
以上是日用电量。
plt.figure()
fig = dailyElectricity.plot(figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All the daily electricity data', fontsize = 16)
plt.ylabel('kWh')
plt.show()
plt.figure()
fig = dailyElectricity.iloc[1000:1130,:].plot(marker = 'o', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('Daily electricity data of selected days', fontsize = 16)
plt.ylabel('kWh')
plt.show()
上面是每日电量图。
超低耗电量发生在学期结束后,包括圣诞节假期。另外,如图2所示,开学时夏季的能耗较低。
冷水
我们处理冷水数据的方式与电力相同。
chilledWater = df[['Gund Main Energy - Ton-Days']]
chilledWater.head()
Gund Main Energy - Ton-Days | |
---|---|
2011-07-01 01:00:00 | 17912.537804 |
2011-07-01 02:00:00 | 17912.853518 |
2011-07-01 03:00:00 | 17913.169232 |
2011-07-01 04:00:00 | 17913.484946 |
2011-07-01 05:00:00 | 17913.800660 |
file = 'Data/monthly chilled water.csv'
monthlyChilledWaterFromFacility = pd.read_csv(file, header=0)
monthlyChilledWaterFromFacility.set_index(['month'], inplace = True)
monthlyChilledWaterFromFacility.head()
startDate | endDate | chilledWater | |
---|---|---|---|
month | |||
11-Jul | 6/12/11 | 7/12/11 | 2258 |
11-Aug | 7/12/11 | 8/12/11 | 2095 |
11-Sep | 8/12/11 | 9/12/11 | 2200 |
11-Oct | 9/12/11 | 10/12/11 | 1664 |
11-Nov | 10/12/11 | 11/12/11 | 447 |
monthlyChilledWaterFromFacility['startDate'] = pd.to_datetime(monthlyChilledWaterFromFacility['startDate'], format="%m/%d/%y")
values = monthlyChilledWaterFromFacility.index.values
keys = np.array(monthlyChilledWaterFromFacility['startDate'])
dates = {}
for key, value in zip(keys, values):
dates[key] = value
sortedDates = np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]
months = []
monthlyChilledWaterOrg = np.zeros((len(sortedDates) - 1))
for i in range(len(sortedDates) - 1):
begin = sortedDates[i]
end = sortedDates[i+1]
months.append(dates[sortedDates[i]])
monthlyChilledWaterOrg[i] = (np.round(chilledWater.loc[end,:] - chilledWater.loc[begin,:], 1))
monthlyChilledWater = pd.DataFrame(data = monthlyChilledWaterOrg, index = months, columns = ['chilledWater-TonDays'])
fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
#plt.plot(monthlyChilledWater, label='Our data processing from hourly data', marker = 'x', markersize = 15, linestyle = '')
plt.bar(np.arange(len(monthlyChilledWater))-0.5, monthlyChilledWater.values, label='Our data processing from hourly data')
plt.plot(monthlyChilledWaterFromFacility[5:-1]['chilledWater'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('kWh',fontsize=15)
plt.xlim([0,len(months)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Data Validation: comparison between our data processing and facilities',fontsize=20)
text = 'Match! Our processing method is valid.'
ax.annotate(text, xy = (15, 2000),
xytext = (5, 50), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()
。
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
chilledWater.reindex(hourlyTimestamp, inplace = True)
# 以防万一,为了使用diff方法,时间戳必须是按升序排列的。
chilledWater.sort_index(inplace = True)
hourlyEnergy = chilledWater.diff(periods=1)
hourlyChilledWater = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['chilledWater-TonDays'])
hourlyChilledWater['startTime'] = hourlyChilledWater.index
hourlyChilledWater['endTime'] = hourlyChilledWater.index + np.timedelta64(1,'h')
hourlyChilledWater.loc[abs(hourlyChilledWater['chilledWater-TonDays']) > 50,'chilledWater-TonDays'] = np.nan
hourlyChilledWaterWithoutNaN = hourlyChilledWater.dropna(axis=0, how='any')
hourlyChilledWater.to_excel('Data/hourlyChilledWater.xlsx')
hourlyChilledWaterWithoutNaN.to_excel('Data/hourlyChilledWaterWithoutNaN.xlsx')
plt.figure()
fig = hourlyChilledWater.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the hourly chilled water data', fontsize = 16)
plt.ylabel('Ton-Days')
plt.show()
hourlyChilledWater.head()
chilledWater-TonDays | startTime | endTime | |
---|---|---|---|
2011-07-01 01:00:00 | NaN | 2011-07-01 01:00:00 | 2011-07-01 02:00:00 |
2011-07-01 02:00:00 | 0.315714 | 2011-07-01 02:00:00 | 2011-07-01 03:00:00 |
2011-07-01 03:00:00 | 0.315714 | 2011-07-01 03:00:00 | 2011-07-01 04:00:00 |
2011-07-01 04:00:00 | 0.315714 | 2011-07-01 04:00:00 | 2011-07-01 05:00:00 |
2011-07-01 05:00:00 | 0.315714 | 2011-07-01 05:00:00 | 2011-07-01 06:00:00 |
dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
chilledWaterReindexed = chilledWater.reindex(dailyTimestamp, inplace = False)
chilledWaterReindexed.sort_index(inplace = True)
dailyEnergy = chilledWaterReindexed.diff(periods=1)['Gund Main Energy - Ton-Days']
dailyChilledWater = pd.DataFrame(data = dailyEnergy.values, index = chilledWaterReindexed.index - np.timedelta64(1,'D'), columns = ['chilledWater-TonDays'])
dailyChilledWater['startDay'] = dailyChilledWater.index
dailyChilledWater['endDay'] = dailyChilledWater.index + np.timedelta64(1,'D')
dailyChilledWaterWithoutNaN = dailyChilledWater.dropna(axis=0, how='any')
dailyChilledWater.to_excel('Data/dailyChilledWater.xlsx')
dailyChilledWaterWithoutNaN.to_excel('Data/dailyChilledWaterWithoutNaN.xlsx')
plt.figure()
fig = dailyChilledWater.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the daily chilled water data', fontsize = 16)
plt.ylabel('Ton-Days')
plt.show()
dailyChilledWater.head()
chilledWater-TonDays | startDay | endDay | |
---|---|---|---|
2011-06-30 | NaN | 2011-06-30 | 2011-07-01 |
2011-07-01 | NaN | 2011-07-01 | 2011-07-02 |
2011-07-02 | 54.741028 | 2011-07-02 | 2011-07-03 |
2011-07-03 | 55.649728 | 2011-07-03 | 2011-07-04 |
2011-07-04 | 109.049077 | 2011-07-04 | 2011-07-05 |
热水
steam = df[['Gund Condensate FlowTotal - LBS']]
steam.head()
Gund Condensate FlowTotal - LBS | |
---|---|
2011-07-01 01:00:00 | 15443350.388455 |
2011-07-01 02:00:00 | 15443459.322917 |
2011-07-01 03:00:00 | 15443574.687500 |
2011-07-01 04:00:00 | 15443701.953125 |
2011-07-01 05:00:00 | 15443818.359375 |
file = 'Data/monthly steam.csv'
monthlySteamFromFacility = pd.read_csv(file, header=0)
monthlySteamFromFacility.set_index(['month'], inplace = True)
monthlySteamFromFacility.head()
startDate | endDate | steam | |
---|---|---|---|
month | |||
Jul 11 | 6/17/2011 | 7/21/2011 | 0.0 |
Aug 11 | 7/21/2011 | 8/20/2011 | 0.0 |
Sep 11 | 8/20/2011 | 9/17/2011 | 0.0 |
Oct 11 | 9/17/2011 | 10/20/2011 | 246.5 |
Nov 11 | 10/20/2011 | 11/20/2011 | 786.1 |
monthlySteamFromFacility['startDate'] = pd.to_datetime(monthlySteamFromFacility['startDate'], format="%m/%d/%Y")
values = monthlySteamFromFacility.index.values
keys = np.array(monthlySteamFromFacility['startDate'])
dates = {}
for key, value in zip(keys, values):
dates[key] = value
sortedDates = np.sort(dates.keys())
sortedDates = sortedDates[sortedDates > np.datetime64('2011-11-01')]
months = []
monthlySteamOrg = np.zeros((len(sortedDates) - 1))
for i in range(len(sortedDates) - 1):
begin = sortedDates[i]
end = sortedDates[i+1]
months.append(dates[sortedDates[i]])
monthlySteamOrg[i] = (np.round(steam.loc[end,:] - steam.loc[begin,:], 1))
monthlySteam = pd.DataFrame(data = monthlySteamOrg, index = months, columns = ['steam-LBS'])
# 867 LBS ~= 1MMBTU steam
fig,ax = plt.subplots(1, 1,figsize=(15,8))
#ax.set_axis_bgcolor('w')
#plt.plot(monthlySteam/867, label='Our data processing from hourly data')
plt.bar(np.arange(len(monthlySteam))-0.5, monthlySteam.values/867, label='Our data processing from hourly data')
plt.plot(monthlySteamFromFacility.loc[months,'steam'],'or', label='Facility data')
plt.xticks(np.arange(0,len(months)),months)
plt.xlabel('Month',fontsize=15)
plt.ylabel('Steam (MMBTU)',fontsize=15)
plt.xlim([0,len(months)])
plt.legend()
ax.set_xticklabels(months, rotation=40, fontsize=13)
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
plt.title('Comparison between our data processing and facilities - Steam',fontsize=20)
text = 'Match! Our processing method is valid.'
ax.annotate(text, xy = (9, 1500),
xytext = (5, 50), fontsize = 15,
textcoords = 'offset points', ha = 'center', va = 'bottom')
plt.show()
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
steam.reindex(hourlyTimestamp, inplace = True)
# Just in case, in order to use diff method, timestamp has to be in asending order.
steam.sort_index(inplace = True)
hourlyEnergy = steam.diff(periods=1)
hourlySteam = pd.DataFrame(data = hourlyEnergy.values, index = hourlyEnergy.index, columns = ['steam-LBS'])
hourlySteam['startTime'] = hourlySteam.index
hourlySteam['endTime'] = hourlySteam.index + np.timedelta64(1,'h')
hourlySteam.loc[abs(hourlySteam['steam-LBS']) > 100000,'steam-LBS'] = np.nan
plt.figure()
fig = hourlySteam.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 17)
fig.set_axis_bgcolor('w')
plt.title('All the hourly steam data', fontsize = 16)
plt.ylabel('LBS')
plt.show()
hourlySteamWithoutNaN = hourlySteam.dropna(axis=0, how='any')
hourlySteam.to_excel('Data/hourlySteam.xlsx')
hourlySteamWithoutNaN.to_excel('Data/hourlySteamWithoutNaN.xlsx')
hourlySteam.head()
steam-LBS | startTime | endTime | |
---|---|---|---|
2011-07-01 01:00:00 | NaN | 2011-07-01 01:00:00 | 2011-07-01 02:00:00 |
2011-07-01 02:00:00 | 108.934462 | 2011-07-01 02:00:00 | 2011-07-01 03:00:00 |
2011-07-01 03:00:00 | 115.364583 | 2011-07-01 03:00:00 | 2011-07-01 04:00:00 |
2011-07-01 04:00:00 | 127.265625 | 2011-07-01 04:00:00 | 2011-07-01 05:00:00 |
2011-07-01 05:00:00 | 116.406250 | 2011-07-01 05:00:00 | 2011-07-01 06:00:00 |
dailyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'D')
steamReindexed = steam.reindex(dailyTimestamp, inplace = False)
steamReindexed.sort_index(inplace = True)
dailyEnergy = steamReindexed.diff(periods=1)['Gund Condensate FlowTotal - LBS']
dailySteam = pd.DataFrame(data = dailyEnergy.values, index = steamReindexed.index - np.timedelta64(1,'D'), columns = ['steam-LBS'])
dailySteam['startDay'] = dailySteam.index
dailySteam['endDay'] = dailySteam.index + np.timedelta64(1,'D')
plt.figure()
fig = dailySteam.plot(fontsize = 15, figsize = (15, 6))
plt.tick_params(which=u'major', reset=False, axis = 'y', labelsize = 15)
fig.set_axis_bgcolor('w')
plt.title('All the daily steam data', fontsize = 16)
plt.ylabel('LBS')
plt.show()
dailySteamWithoutNaN = dailyChilledWater.dropna(axis=0, how='any')
dailySteam.to_excel('Data/dailySteam.xlsx')
dailySteamWithoutNaN.to_excel('Data/dailySteamWithoutNaN.xlsx')
dailySteam.head()
steam-LBS | startDay | endDay | |
---|---|---|---|
2011-06-30 | NaN | 2011-06-30 | 2011-07-01 |
2011-07-01 | NaN | 2011-07-01 | 2011-07-02 |
2011-07-02 | 3250.651042 | 2011-07-02 | 2011-07-03 |
2011-07-03 | 3271.276042 | 2011-07-03 | 2011-07-04 |
2011-07-04 | 3236.718750 | 2011-07-04 | 2011-07-05 |
天气数据
原始数据
天气有两个数据源。
- 2014年: GSD(设计研究生院)楼顶上的天气。
- 2012 & 2013年: 从位于Cambridge, MA的气象站购买天气数据。请注意,2012年和2013年的数据来自不同的气象站。
这是经过一些清洗后的原始天气数据,包括单位转换。间隔是5分钟。
weather2014 = pd.read_excel('Data/weather-2014.xlsx')
weather2014.head()
Datetime | T-C | RH-% | Tdew-C | windDirection | windSpeed-m/s | pressure-mbar | solarRadiation-W/m2 | |
---|---|---|---|---|---|---|---|---|
0 | 2014-01-01 00:00:00 | -5.02 | 53.2 | -13.07 | 256 | 4.5 | 1020.8 | 1 |
1 | 2014-01-01 00:05:00 | -5.14 | 54.3 | -12.93 | 257 | 4.0 | 1020.7 | 1 |
2 | 2014-01-01 00:10:00 | -5.08 | 53.4 | -13.08 | 258 | 3.5 | 1021.1 | 1 |
3 | 2014-01-01 00:15:00 | -5.17 | 52.6 | -13.35 | 257 | 2.8 | 1021.0 | 1 |
4 | 2014-01-01 00:20:00 | -5.23 | 52.9 | -13.33 | 248 | 2.5 | 1021.2 | 1 |
通过重采样方法转换为每小时一次。
这是2014年每小时的天气数据。
weather2014 = weather2014.set_index('Datetime')
weather2014 = weather2014.resample('H')
weather2014.head()
T-C | RH-% | Tdew-C | windDirection | windSpeed-m/s | pressure-mbar | solarRadiation-W/m2 | |
---|---|---|---|---|---|---|---|
Datetime | |||||||
2014-01-01 00:00:00 | -5.281667 | 52.858333 | -13.393333 | 253.500000 | 2.775000 | 1021.158333 | 1 |
2014-01-01 01:00:00 | -5.725000 | 51.650000 | -14.090833 | 253.000000 | 2.350000 | 1021.700000 | 1 |
2014-01-01 02:00:00 | -6.002500 | 50.766667 | -14.555833 | 239.250000 | 1.133333 | 1022.708333 | 1 |
2014-01-01 03:00:00 | -6.320833 | 49.675000 | -15.114167 | 234.333333 | 1.191667 | 1023.233333 | 1 |
2014-01-01 04:00:00 | -6.535833 | 49.708333 | -15.305833 | 227.333333 | 1.633333 | 1023.925000 | 1 |
这里是2013年和2014年的原始天气数据,经过一些清洗,包括单位转换。可以看出,Datetime格式不正确。
weather2012and2013 = pd.read_excel('Data/weather-2012-2013.xlsx')
weather2012and2013.head()
Datetime | RH-% | windDirection | solarRadiation-W/m2 | T-C | Tdew-C | pressure-mbar | windSpeed-m/s | |
---|---|---|---|---|---|---|---|---|
0 | 2012-01-01-01 | 87 | 310 | 0 | 4 | 1.9 | 1004 | 4.166670 |
1 | 2012-01-01-02 | 87 | 280 | 0 | 4 | 1.9 | 1005 | 4.166670 |
2 | 2012-01-01-03 | 81 | 270 | 0 | 5 | 1.9 | 1006 | 4.166670 |
3 | 2012-01-01-04 | 76 | 290 | 0 | 6 | 1.9 | 1007 | 4.722226 |
4 | 2012-01-01-05 | 87 | 280 | 0 | 4 | 1.9 | 1007 | 3.055558 |
这是清洗后每小时的数据。
weather2012and2013['Datetime'] = pd.to_datetime(weather2012and2013['Datetime'], format='%Y-%m-%d-%H')
weather2012and2013 = weather2012and2013.set_index('Datetime')
weather2012and2013.head()
RH-% | windDirection | solarRadiation-W/m2 | T-C | Tdew-C | pressure-mbar | windSpeed-m/s | |
---|---|---|---|---|---|---|---|
Datetime | |||||||
2012-01-01 01:00:00 | 87 | 310 | 0 | 4 | 1.9 | 1004 | 4.166670 |
2012-01-01 02:00:00 | 87 | 280 | 0 | 4 | 1.9 | 1005 | 4.166670 |
2012-01-01 03:00:00 | 81 | 270 | 0 | 5 | 1.9 | 1006 | 4.166670 |
2012-01-01 04:00:00 | 76 | 290 | 0 | 6 | 1.9 | 1007 | 4.722226 |
2012-01-01 05:00:00 | 87 | 280 | 0 | 4 | 1.9 | 1007 | 3.055558 |
每小时天气数据
合并两个文件,增加更多的功能,包括冷度,热度,湿度比和除湿。
这是每小时的天气数据。
# 合并两个天气文件
hourlyWeather = weather2014.append(weather2012and2013)
hourlyWeather.index.name = None
hourlyWeather.sort_index(inplace = True)
# 增加更多的特征
# 将相对湿度转换为相对湿度
Mw=18.0160 # 水的分子量
Md=28.9660 # 干燥空气的分子量
R = 8.31432E3 # 气体常数
Rd = R/Md # 干燥空气的比气体常数
Rv = R/Mw # 蒸汽的比气体常数
Lv = 2.5e6 # 水蒸气凝结放热[J kg-1]
eps = Mw/Md
# 饱和压力
def esat(T):
''' 给定空气温度(单位[K]),得到给定压力(单位[Pa]) '''
from numpy import log10
TK = 273.15
e1 = 101325.0
logTTK = log10(T/TK)
esat = e1*10**(10.79586*(1-TK/T)-5.02808*logTTK+ 1.50474*1e-4*(1.-10**(-8.29692*(T/TK-1)))+ 0.42873*1e-3*(10**(4.76955*(1-TK/T))-1)-2.2195983)
return esat
def rh2sh(RH,p,T):
'''用途:将相对湿度(无单位)换算成比湿度(湿度比)[kg/kg]'''
es = esat(T)
W = Mw/Md*RH*es/(p-RH*es)
return W/(1.+W)
p = hourlyWeather['pressure-mbar'] * 100
RH = hourlyWeather['RH-%'] / 100
T = hourlyWeather['T-C'] + 273.15
w = rh2sh(RH,p,T)
hourlyWeather['humidityRatio-kg/kg'] = w
hourlyWeather['coolingDegrees'] = hourlyWeather['T-C'] - 12
hourlyWeather.loc[hourlyWeather['coolingDegrees'] < 0, 'coolingDegrees'] = 0
hourlyWeather['heatingDegrees'] = 15 - hourlyWeather['T-C']
hourlyWeather.loc[hourlyWeather['heatingDegrees'] < 0, 'heatingDegrees'] = 0
hourlyWeather['dehumidification'] = hourlyWeather['humidityRatio-kg/kg'] - 0.00886
hourlyWeather.loc[hourlyWeather['dehumidification'] < 0, 'dehumidification'] = 0
#hourlyWeather.to_excel('Data/hourlyWeather.xlsx')
hourlyWeather.head()
RH-% | T-C | Tdew-C | pressure-mbar | solarRadiation-W/m2 | windDirection | windSpeed-m/s | humidityRatio-kg/kg | coolingDegrees | heatingDegrees | dehumidification | |
---|---|---|---|---|---|---|---|---|---|---|---|
2012-01-01 01:00:00 | 87 | 4 | 1.9 | 1004 | 0 | 310 | 4.166670 | 0.004396 | 0 | 11 | 0 |
2012-01-01 02:00:00 | 87 | 4 | 1.9 | 1005 | 0 | 280 | 4.166670 | 0.004391 | 0 | 11 | 0 |
2012-01-01 03:00:00 | 81 | 5 | 1.9 | 1006 | 0 | 270 | 4.166670 | 0.004380 | 0 | 10 | 0 |
2012-01-01 04:00:00 | 76 | 6 | 1.9 | 1007 | 0 | 290 | 4.722226 | 0.004401 | 0 | 9 | 0 |
2012-01-01 05:00:00 | 87 | 4 | 1.9 | 1007 | 0 | 280 | 3.055558 | 0.004382 | 0 | 11 | 0 |
plt.figure()
fig = hourlyWeather.plot(y = 'T-C', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All hourly temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)')
plt.show()
plt.figure()
fig = hourlyWeather.plot(y = 'solarRadiation-W/m2', figsize = (15, 6))
fig.set_axis_bgcolor('w')
plt.title('All hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()
plt.figure()
fig = hourlyWeather['2014-10'].plot(y = 'T-C', figsize = (15, 6), marker = 'o')
fig.set_axis_bgcolor('w')
plt.title('Selected hourly temperture',fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)',fontsize = 13)
plt.show()
plt.figure()
fig = hourlyWeather['2014-10'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()
每天的天气数据
dailyWeather = hourlyWeather.resample('D')
#dailyWeather.to_excel('Data/dailyWeather.xlsx')
dailyWeather.head()
RH-% | T-C | Tdew-C | pressure-mbar | solarRadiation-W/m2 | windDirection | windSpeed-m/s | humidityRatio-kg/kg | coolingDegrees | heatingDegrees | dehumidification | |
---|---|---|---|---|---|---|---|---|---|---|---|
2012-01-01 | 76.652174 | 7.173913 | 3.073913 | 1004.956522 | 95.260870 | 236.086957 | 4.118361 | 0.004796 | 0 | 7.826087 | 0 |
2012-01-02 | 55.958333 | 5.833333 | -2.937500 | 994.625000 | 87.333333 | 253.750000 | 5.914357 | 0.003415 | 0 | 9.166667 | 0 |
2012-01-03 | 42.500000 | -3.208333 | -12.975000 | 1002.125000 | 95.708333 | 302.916667 | 6.250005 | 0.001327 | 0 | 18.208333 | 0 |
2012-01-04 | 41.541667 | -7.083333 | -16.958333 | 1008.250000 | 98.750000 | 286.666667 | 5.127319 | 0.000890 | 0 | 22.083333 | 0 |
2012-01-05 | 46.916667 | -0.583333 | -9.866667 | 1002.041667 | 90.750000 | 258.333333 | 5.162041 | 0.001746 | 0 | 15.583333 | 0 |
plt.figure()
fig = dailyWeather.plot(y = 'T-C', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('All daily temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)', fontsize = 13)
plt.show()
plt.figure()
fig = dailyWeather['2014'].plot(y = 'T-C', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected daily temperture', fontsize = 16)
plt.ylabel(r'Temperature ($\circ$C)', fontsize = 13)
plt.show()
plt.figure()
fig = dailyWeather['2014'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker ='o')
fig.set_axis_bgcolor('w')
plt.title('Selected daily solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 14)
plt.show()
与占用有关的特征
这是一个0到1之间的数。0表示无人居住,1表示正常居住。这是一个基于假期、周末和学校学术日历的估计。
holidays = pd.read_excel('Data/holidays.xlsx')
holidays.head()
startDate | endDate | value | |
---|---|---|---|
0 | 2011-07-01 | 2011-09-06 | 0.5 |
1 | 2011-10-10 | 2011-10-11 | 0.6 |
2 | 2011-11-24 | 2011-11-28 | 0.2 |
3 | 2011-12-22 | 2011-12-24 | 0.1 |
4 | 2011-12-24 | 2012-01-02 | 0.0 |
hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'H')
occupancy = np.ones(len(hourlyTimestamp))
hourlyOccupancy = pd.DataFrame(data = occupancy, index = hourlyTimestamp, columns = ['occupancy'])
Saturdays = hourlyOccupancy.index.weekday == 5
Sundays = hourlyOccupancy.index.weekday == 6
hourlyOccupancy.loc[Saturdays, 'occupancy'] = 0.5
hourlyOccupancy.loc[Sundays, 'occupancy'] = 0.5
for i in range(len(holidays)):
timestamp = pd.date_range(start = holidays.loc[i, 'startDate'], end = holidays.loc[i, 'endDate'], freq = 'H')
hourlyOccupancy.loc[timestamp, 'occupancy'] = holidays.loc[i, 'value']
#hourlyHolidays['Datetime'] = pd.to_datetime(hourlyHolidays['Datetime'], format="%Y-%m-%d %H:%M:%S")
hourlyOccupancy['cosHour'] = np.cos((hourlyOccupancy.index.hour - 3) * 2 * np.pi / 24)
dailyOccupancy = hourlyOccupancy.resample('D')
dailyOccupancy.drop('cosHour', axis = 1, inplace = True)
合并能源消耗数据与天气和占用特征
hourlyElectricityWithFeatures = hourlyElectricity.join(hourlyWeather, how = 'inner')
hourlyElectricityWithFeatures = hourlyElectricityWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyElectricityWithFeatures.to_excel('Data/hourlyElectricityWithFeatures.xlsx')
hourlyElectricityWithFeatures.head()
electricity-kWh | startTime | endTime | RH-% | T-C | Tdew-C | pressure-mbar | solarRadiation-W/m2 | windDirection | windSpeed-m/s | humidityRatio-kg/kg | coolingDegrees | heatingDegrees | dehumidification | occupancy | cosHour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2012-01-01 01:00:00 | 111.479277 | 2012-01-01 01:00:00 | 2012-01-01 02:00:00 | 87 | 4 | 1.9 | 1004 | 0 | 310 | 4.166670 | 0.004396 | 0 | 11 | 0 | 0 | 0.866025 |
2012-01-01 02:00:00 | 117.989395 | 2012-01-01 02:00:00 | 2012-01-01 03:00:00 | 87 | 4 | 1.9 | 1005 | 0 | 280 | 4.166670 | 0.004391 | 0 | 11 | 0 | 0 | 0.965926 |
2012-01-01 03:00:00 | 119.010131 | 2012-01-01 03:00:00 | 2012-01-01 04:00:00 | 81 | 5 | 1.9 | 1006 | 0 | 270 | 4.166670 | 0.004380 | 0 | 10 | 0 | 0 | 1.000000 |
2012-01-01 04:00:00 | 116.005587 | 2012-01-01 04:00:00 | 2012-01-01 05:00:00 | 76 | 6 | 1.9 | 1007 | 0 | 290 | 4.722226 | 0.004401 | 0 | 9 | 0 | 0 | 0.965926 |
2012-01-01 05:00:00 | 111.132977 | 2012-01-01 05:00:00 | 2012-01-01 06:00:00 | 87 | 4 | 1.9 | 1007 | 0 | 280 | 3.055558 | 0.004382 | 0 | 11 | 0 | 0 | 0.866025 |
hourlyChilledWaterWithFeatures = hourlyChilledWater.join(hourlyWeather, how = 'inner')
hourlyChilledWaterWithFeatures = hourlyChilledWaterWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyChilledWaterWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyChilledWaterWithFeatures.to_excel('Data/hourlyChilledWaterWithFeatures.xlsx')
hourlySteamWithFeatures = hourlySteam.join(hourlyWeather, how = 'inner')
hourlySteamWithFeatures = hourlySteamWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlySteamWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlySteamWithFeatures.to_excel('Data/hourlySteamWithFeatures.xlsx')
dailyElectricityWithFeatures = dailyElectricity.join(dailyWeather, how = 'inner')
dailyElectricityWithFeatures = dailyElectricityWithFeatures.join(dailyOccupancy, how = 'inner')
dailyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
dailyElectricityWithFeatures.to_excel('Data/dailyElectricityWithFeatures.xlsx')
dailyChilledWaterWithFeatures = dailyChilledWater.join(dailyWeather, how = 'inner')
dailyChilledWaterWithFeatures = dailyChilledWaterWithFeatures.join(dailyOccupancy, how = 'inner')
dailyChilledWaterWithFeatures.dropna(axis=0, how='any', inplace = True)
dailyChilledWaterWithFeatures.to_excel('Data/dailyChilledWaterWithFeatures.xlsx')
dailySteamWithFeatures = dailySteam.join(dailyWeather, how = 'inner')
dailySteamWithFeatures = dailySteamWithFeatures.join(dailyOccupancy, how = 'inner')
dailySteamWithFeatures.dropna(axis=0, how='any', inplace = True)
dailySteamWithFeatures.to_excel('Data/dailySteamWithFeatures.xlsx')
特性说明
命名法(按字母顺序)
coolingDegrees: 如果T-C - 12>0,则为T-C-12,否则为0。假设当室外温度低于12C时,不需要冷却,这对许多建筑来说是真实的。这对于每天的预测是有用的,因为每小时的平均冷却度比每小时的平均温度要好。
cosHour: cos(hourOfDay⋅2
翻译——1_Project Overview, Data Wrangling and Exploratory Analysis-checkpoint的更多相关文章
- Learning notes | Data Analysis: 1.2 data wrangling
| Data Wrangling | # Sort all the data into one file files = ['BeijingPM20100101_20151231.csv','Chen ...
- 《Data Structures and Algorithm Analysis in C》学习与刷题笔记
<Data Structures and Algorithm Analysis in C>学习与刷题笔记 为什么要学习DSAAC? 某个月黑风高的夜晚,下班的我走在黯淡无光.冷清无人的冲之 ...
- 翻译-In-Stream Big Data Processing 流式大数据处理
相当长一段时间以来,大数据社区已经普遍认识到了批量数据处理的不足.很多应用都对实时查询和流式处理产生了迫切需求.最近几年,在这个理念的推动下,催生出了一系列解决方案,Twitter Storm,Yah ...
- [翻译]MapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large Clusters MapReduce:面向大型集群的简化数据处理 摘要 MapReduce既是一种编程模型 ...
- Zore copy(翻译《Efficient data transfer through zero copy》)
原文:https://www.ibm.com/developerworks/library/j-zerocopy/ <Efficient data transfer through zero c ...
- 翻译:load data infile(已提交到MariaDB官方手册)
本文为mariadb官方手册:LOAD DATA INFILE的译文. 原文:https://mariadb.com/kb/en/load-data-infile/ 我提交到MariaDB官方手册的译 ...
- 【一】H.264/MPEG-4 Part 10 White Paper 翻译之 Overview of H.264
翻译版权所有,转载请注明出处~ xzrch@2018.09.14 ------------------------------------------------------------------- ...
- 翻译——2_Linear Regression and Support Vector Regression
续上篇 1_Project Overview, Data Wrangling and Exploratory Analysis 使用不同的机器学习方法进行预测 线性回归 在这本笔记本中,将训练一个线性 ...
- 探索性数据分析(Exploratory Data Analysis,EDA)
探索性数据分析(Exploratory Data Analysis,EDA)主要的工作是:对数据进行清洗,对数据进行描述(描述统计量,图表),查看数据的分布,比较数据之间的关系,培养对数据的直觉,对数 ...
随机推荐
- sourcetree安装以及跳过sourcetree注册登录 - git仓库管理工具桌面版
腾讯软件下载:https://pc.qq.com/detail/17/detail_23237.html 官网下载:https://www.sourcetreeapp.com/ 下载完直接安装 ...
- jQuery文档加载事件
$(document).ready(handler) $().ready(handler) (this is not recommended) $(handler) 相当于: $(document). ...
- SpringBoot 系列教程之编程式事务使用姿势介绍篇
SpringBoot 系列教程之编程式事务使用姿势介绍篇 前面介绍的几篇事务的博文,主要是利用@Transactional注解的声明式使用姿势,其好处在于使用简单,侵入性低,可辨识性高(一看就知道使用 ...
- 吴裕雄--天生自然JAVA SPRING框架开发学习笔记:Spring体系结构详解
Spring 框架采用分层架构,根据不同的功能被划分成了多个模块,这些模块大体可分为 Data Access/Integration.Web.AOP.Aspects.Messaging.Instrum ...
- CLR .net windows对win32 core抽象的新用处
断断续续 花了一周的时间,把.net clr的一些知识看完了(确切的说是 一个段落),总体的感觉就是,ms把win32 core创建进程线程.虚拟地址.内存隔离的思想又重用了一遍,有所不同的是这次的场 ...
- PAT 2012 冬
A Shortest Distance 题意相当于一个环,找出两点间从不同方向得到的距离中的最小值. #include <cstdio> #include <iostream> ...
- 关于Oracle中job定时器(通过create_job创建的)配置示例
begin dbms_scheduler.create_job(job_name => 'JOB_BASIC_STATISTIC', job_type => 'STORED_PROCEDU ...
- java多线程之volatile关键字
public class ThreadVolatile extends Thread { public boolean flag=true; @Override public void run() { ...
- go语言小练习——给定英语文章统计单词数量
给定一篇英语文章,要求统计出所有单词的个数,并按一定次序输出.思路是利用go语言的map类型,以每个单词作为关键字存储数量信息,代码实现如下: package main import ( " ...
- cmd定时自动弹窗命令
at 17:00 /e:m,t,w,th,f,s,su msg * 弹窗文字