Prediction Using Different Machine Learning Methods

Previous: 2_Linear Regression and Support Vector Regression

Gaussian Process Regression

    # Python 2 notebook. gaussian_process.GaussianProcess and sklearn.cross_validation
    # were removed in scikit-learn 0.20, so an older release is required.
    from __future__ import division  # must come before any other statement
    %matplotlib inline
    import requests
    from StringIO import StringIO
    import numpy as np
    import pandas as pd  # pandas
    import matplotlib.pyplot as plt  # module for plotting
    import datetime as dt  # module for manipulating dates and times
    import numpy.linalg as lin  # module for performing linear algebra operations
    import matplotlib
    import sklearn.decomposition
    import sklearn.metrics
    from sklearn import gaussian_process
    from sklearn import cross_validation
    pd.options.display.mpl_style = 'default'

Read in the data from the preprocessing step

    # Read in data from Preprocessing results
    hourlyElectricityWithFeatures = pd.read_excel('Data/hourlyElectricityWithFeatures.xlsx')
    hourlyChilledWaterWithFeatures = pd.read_excel('Data/hourlyChilledWaterWithFeatures.xlsx')
    hourlySteamWithFeatures = pd.read_excel('Data/hourlySteamWithFeatures.xlsx')
    dailyElectricityWithFeatures = pd.read_excel('Data/dailyElectricityWithFeatures.xlsx')
    dailyChilledWaterWithFeatures = pd.read_excel('Data/dailyChilledWaterWithFeatures.xlsx')
    dailySteamWithFeatures = pd.read_excel('Data/dailySteamWithFeatures.xlsx')

    # An example of Dataframe
    dailyChilledWaterWithFeatures.head()
                chilledWater-TonDays  startDay    endDay      RH-%       T-C        Tdew-C      pressure-mbar  solarRadiation-W/m2  windDirection  windSpeed-m/s  humidityRatio-kg/kg  coolingDegrees  heatingDegrees  dehumidification  occupancy
    2012-01-01  0.961857              2012-01-01  2012-01-02  76.652174  7.173913   3.073913    1004.956522    95.260870            236.086957     4.118361       0.004796             0               7.826087        0                 0.0
    2012-01-02  0.981725              2012-01-02  2012-01-03  55.958333  5.833333   -2.937500   994.625000     87.333333            253.750000     5.914357       0.003415             0               9.166667        0                 0.3
    2012-01-03  1.003672              2012-01-03  2012-01-04  42.500000  -3.208333  -12.975000  1002.125000    95.708333            302.916667     6.250005       0.001327             0               18.208333       0                 0.3
    2012-01-04  1.483192              2012-01-04  2012-01-05  41.541667  -7.083333  -16.958333  1008.250000    98.750000            286.666667     5.127319       0.000890             0               22.083333       0                 0.3
    2012-01-05  3.465091              2012-01-05  2012-01-06  46.916667  -0.583333  -9.866667   1002.041667    90.750000            258.333333     5.162041       0.001746             0               15.583333       0                 0.3

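Above is a sample of our data.

Daily prediction:

Daily electricity

Get the training/validation and test sets. The DataFrame shows the features and the target.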

    def addDailyTimeFeatures(df):
        # Calendar features derived from the DatetimeIndex
        df['weekday'] = df.index.weekday
        df['day'] = df.index.dayofyear
        df['week'] = df.index.weekofyear
        return df

    dailyElectricityWithFeatures = addDailyTimeFeatures(dailyElectricityWithFeatures)
    df = dailyElectricityWithFeatures[['weekday', 'day', 'week',
                                       'occupancy', 'electricity-kWh']]
    #df.to_excel('Data/trainSet.xlsx')
    trainSet = df['2012-01':'2013-06']
    testSet_dailyElectricity = df['2013-07':'2014-10']
    #normalizer = np.max(trainSet)
    #trainSet = trainSet / normalizer
    #testSet = testSet / normalizer

    # The target is the last column (index 4); all other columns are features
    trainX_dailyElectricity = trainSet.values[:,0:-1]
    trainY_dailyElectricity = trainSet.values[:,4]
    testX_dailyElectricity = testSet_dailyElectricity.values[:,0:-1]
    testY_dailyElectricity = testSet_dailyElectricity.values[:,4]
    trainSet.head()
                weekday  day  week  occupancy  electricity-kWh
    2012-01-01  6        1    52    0.0        2800.244977
    2012-01-02  0        2    1     0.3        3168.974047
    2012-01-03  1        3    1     0.3        5194.533376
    2012-01-04  2        4    1     0.3        5354.861935
    2012-01-05  3        5    1     0.3        5496.223993

Above are the features used to predict daily electricity.
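The three calendar features can be checked by hand. The sketch below is a minimal, pandas-free illustration; it assumes that `weekofyear` follows the ISO week number, which matches the table above, where 2012-01-01 falls in week 52. The helper name `daily_time_features` is hypothetical.

```python
import datetime as dt

def daily_time_features(day):
    """Return (weekday, day of year, ISO week) for a datetime.date."""
    weekday = day.weekday()                 # Monday = 0 ... Sunday = 6
    day_of_year = day.timetuple().tm_yday   # 1 ... 366
    iso_week = day.isocalendar()[1]         # ISO 8601 week number
    return weekday, day_of_year, iso_week

print(daily_time_features(dt.date(2012, 1, 1)))  # (6, 1, 52)
print(daily_time_features(dt.date(2012, 1, 2)))  # (0, 2, 1)
```

A Sunday on 1 January belongs to the last ISO week of the previous year, which is why the first row shows week 52 rather than week 1.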

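Cross-validation to obtain the input parameters theta and nugget

Some testing was done behind the scenes. I first ran a coarse grid search over the parameters, which is not shown here. That gave the parameter ranges for the finer grid search shown below.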

    def crossValidation_all(theta, nugget, nfold, trainX, trainY):
        # One row of fold scores per (theta, nugget) combination
        scores = np.zeros((len(nugget) * len(theta), nfold))
        labels = ["" for x in range(len(nugget) * len(theta))]
        k = 0
        for j in range(len(theta)):
            for i in range(len(nugget)):
                gp = gaussian_process.GaussianProcess(theta0 = theta[j], nugget = nugget[i])
                scores[k, :] = cross_validation.cross_val_score(gp,
                                                                trainX, trainY,
                                                                scoring='r2', cv = nfold)
                labels[k] = str(theta[j]) + '|' + str(nugget[i])
                k = k + 1
        plt.figure(figsize=(20,8))
        plt.boxplot(scores.T, sym='b+', labels = labels, whis = 0.5)
        plt.ylim([0,1])
        plt.title('R2 score as a function of theta and nugget')
        plt.ylabel('R2 Score')
        plt.xlabel('Choice of theta | nugget')
        plt.show()

    theta = np.arange(1, 8, 2)
    nfold = 10
    nugget = np.arange(0.01, 0.2, 0.03)
    crossValidation_all(theta, nugget, nfold, trainX_dailyElectricity, trainY_dailyElectricity)

I chose theta = 3 and nugget = 0.04, which gives the best median prediction accuracy.
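To make the two parameters concrete: the legacy `GaussianProcess` model uses a squared-exponential correlation by default, in which theta controls how quickly the correlation between two inputs decays with their distance, and the nugget is added to the diagonal of the correlation matrix to account for noise. The sketch below is a hand-written illustration of that idea with a single shared theta, not the library's code. The names `sq_exp_corr` and `corr_matrix` are hypothetical, and the inputs are made up and assumed to be already standardized.

```python
import math

def sq_exp_corr(a, b, theta):
    """Squared-exponential correlation exp(-theta * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-theta * sq_dist)

def corr_matrix(points, theta, nugget):
    """Correlation matrix with the nugget added on the diagonal."""
    n = len(points)
    return [[sq_exp_corr(points[i], points[j], theta) + (nugget if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]

points = [(0.0,), (0.5,), (2.0,)]   # made-up, already-standardized inputs

# Correlation between the first two points for several theta values
for theta in (1, 3, 7):
    print(theta, round(sq_exp_corr(points[0], points[1], theta), 3))

R = corr_matrix(points, theta=3, nugget=0.04)
print(R[0][0])   # diagonal entry: 1 + nugget
```

A larger theta makes the correlation die off over a shorter distance, so the fit follows local detail more closely; a larger nugget lets the model treat more of the variation as noise instead of interpolating every training point.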

Predict, compute accuracy and visualize

    def predictAll(theta, nugget, trainX, trainY, testX, testY, testSet, title):
        gp = gaussian_process.GaussianProcess(theta0=theta, nugget=nugget)
        gp.fit(trainX, trainY)
        # eval_MSE=True also returns the predictive variance at each test point
        predictedY, MSE = gp.predict(testX, eval_MSE = True)
        sigma = np.sqrt(MSE)
        results = testSet.copy()
        results['predictedY'] = predictedY
        results['sigma'] = sigma
        print("Train score R2: {}".format(gp.score(trainX, trainY)))
        print("Test score R2: {}".format(sklearn.metrics.r2_score(testY, predictedY)))
        # Scatter plot of predicted against observed values, with the y = x line
        plt.figure(figsize = (9,8))
        plt.scatter(testY, predictedY)
        plt.plot([min(testY), max(testY)], [min(testY), max(testY)], 'r')
        plt.xlim([min(testY), max(testY)])
        plt.ylim([min(testY), max(testY)])
        plt.title('Predicted vs. observed: ' + title)
        plt.xlabel('Observed')
        plt.ylabel('Predicted')
        plt.show()
        return gp, results

    gp_dailyElectricity, results_dailyElectricity = predictAll(3, 0.04,
                                                               trainX_dailyElectricity,
                                                               trainY_dailyElectricity,
                                                               testX_dailyElectricity,
                                                               testY_dailyElectricity,
                                                               testSet_dailyElectricity,
                                                               'Daily Electricity')

    Train score R2: 0.922109831389
    Test score R2: 0.822408541698
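For reference, the R2 score reported here is the coefficient of determination, 1 - SS_res / SS_tot. A minimal sketch with made-up numbers follows; the function name `r2` is illustrative only.

```python
def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / float(len(observed))
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [3.0, 5.0, 7.0, 9.0]              # made-up values
print(r2(observed, observed))                # 1.0 for a perfect prediction
print(r2(observed, [6.0] * 4))               # 0.0 when always predicting the mean
print(r2(observed, [3.5, 4.5, 7.5, 8.5]))    # close to 1 for small errors
```

A score of 1 means perfect prediction, and 0 means the model does no better than always predicting the mean of the observations.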

    def plotGP(testY, predictedY, sigma):
        fig = plt.figure(figsize = (20,6))
        plt.plot(testY, 'r.', markersize=10, label=u'Observations')
        plt.plot(predictedY, 'b-', label=u'Prediction')
        x = range(len(testY))
        # Shade predictedY +/- 1.96 * sigma, the 95% interval of a Gaussian
        plt.fill(np.concatenate([x, x[::-1]]),
                 np.concatenate([predictedY - 1.9600 * sigma,
                                (predictedY + 1.9600 * sigma)[::-1]]),
                 alpha=.5, fc='b', ec='None', label='95% confidence interval')

    subset = results_dailyElectricity['2013-12':'2014-06']
    testY = subset['electricity-kWh']
    predictedY = subset['predictedY']
    sigma = subset['sigma']
    plotGP(testY, predictedY, sigma)
    plt.ylabel('Electricity (kWh)', fontsize = 13)
    plt.title('Gaussian Process Regression: Daily Electricity Prediction', fontsize = 17)
    plt.legend(loc='upper right')
    plt.xlim([0, len(testY)])
    plt.ylim([1000,8000])
    # Label every 10th day on the x axis
    xTickLabels = pd.DataFrame(data = subset.index[np.arange(0,len(subset.index),10)], columns=['datetime'])
    xTickLabels['date'] = xTickLabels['datetime'].apply(lambda x: x.strftime('%Y-%m-%d'))
    ax = plt.gca()
    ax.set_xticks(np.arange(0, len(subset), 10))
    ax.set_xticklabels(labels = xTickLabels['date'], fontsize = 13, rotation = 90)
    plt.show()

Above is a visualization of part of the predictions.
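The shaded band in the plot is predictedY ± 1.96 · sigma, the central 95% interval of a Gaussian predictive distribution. One way to sanity-check such a band, which is not part of the original analysis, is to count how many observations fall inside it: if the predictive variances are well calibrated, the fraction should be close to 0.95. The sketch below uses made-up numbers, and `interval_coverage` is a hypothetical helper.

```python
def interval_coverage(observed, predicted, sigma, z=1.96):
    """Fraction of observations inside predicted +/- z * sigma."""
    inside = 0
    for obs, pred, sd in zip(observed, predicted, sigma):
        if pred - z * sd <= obs <= pred + z * sd:
            inside += 1
    return inside / float(len(observed))

# Made-up daily electricity values (kWh); the last one is far off
observed = [5200.0, 5400.0, 3100.0, 2900.0]
predicted = [5000.0, 5450.0, 3300.0, 3900.0]
sigma = [150.0, 150.0, 150.0, 150.0]
print(interval_coverage(observed, predicted, sigma))  # 0.75
```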

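The daily electricity prediction is quite successful.

Daily chilled water

Get the training/validation and test sets. The DataFrame shows the features and the target.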

    dailyChilledWaterWithFeatures = addDailyTimeFeatures(dailyChilledWaterWithFeatures)
    df = dailyChilledWaterWithFeatures[['weekday', 'day', 'week', 'occupancy',
                                        'coolingDegrees', 'T-C', 'humidityRatio-kg/kg',
                                        'dehumidification', 'chilledWater-TonDays']]
    #df.to_excel('Data/trainSet.xlsx')
    trainSet = df['2012-01':'2013-06']
    testSet_dailyChilledWater = df['2013-07':'2014-10']
    trainX_dailyChilledWater = trainSet.values[:,0:-1]
    trainY_dailyChilledWater = trainSet.values[:,8]
    testX_dailyChilledWater = testSet_dailyChilledWater.values[:,0:-1]
    testY_dailyChilledWater = testSet_dailyChilledWater.values[:,8]
    trainSet.head()
                weekday  day  week  occupancy  coolingDegrees  T-C        humidityRatio-kg/kg  dehumidification  chilledWater-TonDays
    2012-01-01  6        1    52    0.0        0               7.173913   0.004796             0                 0.961857
    2012-01-02  0        2    1     0.3        0               5.833333   0.003415             0                 0.981725
    2012-01-03  1        3    1     0.3        0               -3.208333  0.001327             0                 1.003672
    2012-01-04  2        4    1     0.3        0               -7.083333  0.000890             0                 1.483192
    2012-01-05  3        5    1     0.3        0               -0.583333  0.001746             0                 3.465091

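Above are the features used for daily chilled water prediction.

Again, cross-validation is needed for chilled water.

Cross-validation to obtain the input parameters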

    def crossValidation(theta, nugget, nfold, trainX, trainY):
        # One row of fold scores per candidate theta0; nugget is held fixed
        scores = np.zeros((len(theta), nfold))
        for i in range(len(theta)):
            gp = gaussian_process.GaussianProcess(theta0 = theta[i], nugget = nugget)
            scores[i, :] = cross_validation.cross_val_score(gp, trainX, trainY, scoring='r2', cv = nfold)
        plt.boxplot(scores.T, sym='b+', labels = theta, whis = 0.5)
        #plt.ylim(ylim)
        plt.title('R2 score as a function of theta0')
        plt.ylabel('R2 Score')
        plt.xlabel('Choice of theta0')
        plt.show()

    theta = np.arange(0.001, 0.1, 0.02)
    nugget = 0.05
    crossValidation(theta, nugget, 3, trainX_dailyChilledWater, trainY_dailyChilledWater)

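I chose theta = 0.021.

Cross-validation did not actually work well for chilled water. First, I had to use 3-fold cross-validation instead of 10-fold. Also, to keep things simple, from here on I only show the cross-validation results for the initial value of theta, theta0; I searched over nugget behind the scenes.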
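For readers unfamiliar with k-fold cross-validation: the training rows are split into k folds, and each fold is held out once while the model is fitted on the rest, giving k scores per parameter value. With 3 folds each model is fitted on two thirds of the data, against nine tenths with 10 folds, but only 3 fits are needed per candidate. The sketch below builds contiguous, unshuffled folds; `kfold_indices` is a hypothetical helper and only approximates what passing an integer `cv` is assumed to do for a regressor.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)   # spread the remainder
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 10 samples in 3 folds: held-out folds of size 4, 3 and 3
for train_idx, test_idx in kfold_indices(10, 3):
    print(len(train_idx), test_idx)
```

Because the rows here are ordered in time, contiguous folds hold out whole stretches of the calendar, which may be one reason the scores vary so much between folds for a strongly seasonal quantity such as chilled water.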

Predict, compute accuracy and visualize

    theta = 0.021
    nugget = 0.05

    # Predict
    gp, results_dailyChilledWater = predictAll(theta, nugget,
                                               trainX_dailyChilledWater,
                                               trainY_dailyChilledWater,
                                               testX_dailyChilledWater,
                                               testY_dailyChilledWater,
                                               testSet_dailyChilledWater,
                                               'Daily Chilled Water')

    # Visualize
    subset = results_dailyChilledWater['2014-08':'2014-10']
    testY = subset['chilledWater-TonDays']
    predictedY = subset['predictedY']
    sigma = subset['sigma']
    plotGP(testY, predictedY, sigma)
    plt.ylabel('Chilled water (ton-days)', fontsize = 13)
    plt.title('Gaussian Process Regression: Daily Chilled Water Prediction', fontsize = 17)
    plt.legend(loc='upper left')
    plt.xlim([0, len(testY)])
    #plt.ylim([1000,8000])
    xTickLabels = pd.DataFrame(data = subset.index[np.arange(0,len(subset.index),7)],
                               columns=['datetime'])
    xTickLabels['date'] = xTickLabels['datetime'].apply(lambda x: x.strftime('%Y-%m-%d'))
    ax = plt.gca()
    ax.set_xticks(np.arange(0, len(subset), 7))
    ax.set_xticklabels(labels = xTickLabels['date'], fontsize = 13, rotation = 90)
    plt.show()

    Train score R2: 0.91483040513
    Test score R2: 0.901408621289

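Above is a visualization of part of the predictions.

Again, the prediction is quite successful.

Add one more feature: predicted electricity

To improve the prediction accuracy, one more feature is added: the predicted electricity consumption. The predicted electricity is produced from the time features described in the electricity section above, so it acts as a description of the occupancy schedule. The electricity model needs the actual electricity consumption in the training period for fitting; its predictions are then used as a feature for the test period. This approach therefore still needs only time and weather information, plus historical electricity consumption, to predict future consumption.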
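The two-stage idea can be sketched as follows. This is a toy stand-in, not the Gaussian process models used here: stage one is replaced by a hypothetical weekday-average model so that the example stays self-contained, and all numbers are made up. The point is only the data flow: the stage-one model is fitted on observed electricity from the training period, and its predictions (never the observed test-period electricity) are appended as a feature column for stage two.

```python
def fit_weekday_mean(weekdays, electricity):
    """Stage 1 stand-in: mean electricity per weekday, learned from training rows."""
    totals, counts = {}, {}
    for wd, kwh in zip(weekdays, electricity):
        totals[wd] = totals.get(wd, 0.0) + kwh
        counts[wd] = counts.get(wd, 0) + 1
    return {wd: totals[wd] / counts[wd] for wd in totals}

def add_predicted_electricity(rows, weekday_model):
    """Append the stage-1 prediction to each feature row (weekday is column 0)."""
    return [row + [weekday_model[row[0]]] for row in rows]

# Training period: weekday and observed electricity (made-up values)
train_weekdays = [0, 1, 0, 1]
train_kwh = [5000.0, 5200.0, 5400.0, 5000.0]
model = fit_weekday_mean(train_weekdays, train_kwh)

# Test period: only time and weather features are known, e.g. [weekday, T-C]
test_rows = [[0, 21.5], [1, 24.0]]
print(add_predicted_electricity(test_rows, model))
```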

    # Predicted electricity from the daily electricity model becomes a new feature
    dailyChilledWaterWithFeatures['predictedElectricity'] = gp_dailyElectricity.predict(
        dailyChilledWaterWithFeatures[['weekday', 'day', 'week', 'occupancy']].values)
    df = dailyChilledWaterWithFeatures[['weekday', 'day', 'week',
                                        'occupancy',
                                        'coolingDegrees', 'T-C',
                                        'humidityRatio-kg/kg',
                                        'dehumidification',
                                        'predictedElectricity',
                                        'chilledWater-TonDays']]
    #df.to_excel('Data/trainSet.xlsx')
    trainSet = df['2012-01':'2013-06']
    testSet_dailyChilledWaterMoreFeatures = df['2013-07':'2014-10']
    trainX_dailyChilledWaterMoreFeatures = trainSet.values[:,0:-1]
    trainY_dailyChilledWaterMoreFeatures = trainSet.values[:,9]
    testX_dailyChilledWaterMoreFeatures = testSet_dailyChilledWaterMoreFeatures.values[:,0:-1]
    testY_dailyChilledWaterMoreFeatures = testSet_dailyChilledWaterMoreFeatures.values[:,9]
    trainSet.head()
                weekday  day  week  occupancy  coolingDegrees  T-C        humidityRatio-kg/kg  dehumidification  predictedElectricity  chilledWater-TonDays
    2012-01-01  6        1    52    0.0        0               7.173913   0.004796             0                 2883.784617           0.961857
    2012-01-02  0        2    1     0.3        0               5.833333   0.003415             0                 4231.128909           0.981725
    2012-01-03  1        3    1     0.3        0               -3.208333  0.001327             0                 5013.703046           1.003672
    2012-01-04  2        4    1     0.3        0               -7.083333  0.000890             0                 4985.929899           1.483192
    2012-01-05  3        5    1     0.3        0               -0.583333  0.001746             0                 5106.976841           3.465091

Cross-validation

    theta = np.arange(0.001, 0.15, 0.02)
    nugget = 0.05
    crossValidation(theta, nugget, 3, trainX_dailyChilledWaterMoreFeatures, trainY_dailyChilledWaterMoreFeatures)

I chose theta = 0.021.

Predict, compute accuracy and visualize

    # Predict
    gp, results_dailyChilledWaterMoreFeatures = predictAll(
        0.021, 0.05,
        trainX_dailyChilledWaterMoreFeatures,
        trainY_dailyChilledWaterMoreFeatures,
        testX_dailyChilledWaterMoreFeatures,
        testY_dailyChilledWaterMoreFeatures,
        testSet_dailyChilledWaterMoreFeatures,
        'Daily Chilled Water')

    # Visualize
    subset = results_dailyChilledWaterMoreFeatures['2014-08':'2014-10']
    testY = subset['chilledWater-TonDays']
    predictedY = subset['predictedY']
    sigma = subset['sigma']
    plotGP(testY, predictedY, sigma)
    plt.ylabel('Chilled water (ton-days)', fontsize = 13)
    plt.title('Gaussian Process Regression: Daily Chilled Water Prediction', fontsize = 17)
    plt.legend(loc='upper left')
    plt.xlim([0, len(testY)])
    #plt.ylim([1000,8000])
    xTickLabels = pd.DataFrame(data = subset.index[np.arange(0,len(subset.index),7)],
                               columns=['datetime'])
    xTickLabels['date'] = xTickLabels['datetime'].apply(lambda x: x.strftime('%Y-%m-%d'))
    ax = plt.gca()
    ax.set_xticks(np.arange(0, len(subset), 7))
    ax.set_xticklabels(labels = xTickLabels['date'], fontsize = 13, rotation = 90)
    plt.show()

    Train score R2: 0.937952834542
    Test score R2: 0.926978705167

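Above is a visualization of part of the predictions.

Judging by the R2 scores, accuracy did improve slightly: the test R2 rose from 0.901 to 0.927. I will therefore keep the predicted electricity as a feature.

Daily steam

Get the training/validation and test sets. The DataFrame shows the features and the target.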

    dailySteamWithFeatures = addDailyTimeFeatures(dailySteamWithFeatures)
    dailySteamWithFeatures['predictedElectricity'] = gp_dailyElectricity.predict(
        dailySteamWithFeatures[['weekday', 'day', 'week', 'occupancy']].values)
    df = dailySteamWithFeatures[['weekday', 'day', 'week', 'occupancy',
                                 'heatingDegrees', 'T-C',
                                 'humidityRatio-kg/kg', 'predictedElectricity', 'steam-LBS']]
    #df.to_excel('Data/trainSet.xlsx')
    trainSet = df['2012-01':'2013-06']
    testSet_dailySteam = df['2013-07':'2014-10']
    trainX_dailySteam = trainSet.values[:,0:-1]
    trainY_dailySteam = trainSet.values[:,8]
    testX_dailySteam = testSet_dailySteam.values[:,0:-1]
    testY_dailySteam = testSet_dailySteam.values[:,8]
    trainSet.head()
                weekday  day  week  occupancy  heatingDegrees  T-C        humidityRatio-kg/kg  predictedElectricity  steam-LBS
    2012-01-01  6        1    52    0.0        7.826087        7.173913   0.004796             2883.784617           17256.468099
    2012-01-02  0        2    1     0.3        9.166667        5.833333   0.003415             4231.128909           17078.440755
    2012-01-03  1        3    1     0.3        18.208333       -3.208333  0.001327             5013.703046           59997.969401
    2012-01-04  2        4    1     0.3        22.083333       -7.083333  0.000890             4985.929899           56104.878906
    2012-01-05  3        5    1     0.3        15.583333       -0.583333  0.001746             5106.976841           45231.708984

Cross-validation to obtain the input parameters

    theta = np.arange(0.06, 0.15, 0.01)
    nugget = 0.05
    crossValidation(theta, nugget, 3, trainX_dailySteam, trainY_dailySteam)

I chose theta = 0.1.

Predict, compute accuracy and visualize

    # Predict
    gp, results_dailySteam = predictAll(0.1, 0.05, trainX_dailySteam, trainY_dailySteam,
                                        testX_dailySteam, testY_dailySteam, testSet_dailySteam, 'Daily Steam')

    # Visualize
    subset = results_dailySteam['2013-10':'2014-02']
    testY = subset['steam-LBS']
    predictedY = subset['predictedY']
    sigma = subset['sigma']
    plotGP(testY, predictedY, sigma)
    plt.ylabel('Steam (LBS)', fontsize = 13)
    plt.title('Gaussian Process Regression: Daily Steam Prediction', fontsize = 17)
    plt.legend(loc='upper left')
    plt.xlim([0, len(testY)])
    #plt.ylim([1000,8000])
    xTickLabels = pd.DataFrame(data = subset.index[np.arange(0,len(subset.index),10)],
                               columns=['datetime'])
    xTickLabels['date'] = xTickLabels['datetime'].apply(lambda x: x.strftime('%Y-%m-%d'))
    ax = plt.gca()
    ax.set_xticks(np.arange(0, len(subset), 10))
    ax.set_xticklabels(labels = xTickLabels['date'], fontsize = 13, rotation = 90)
    plt.show()

    Train score R2: 0.96729657748
    Test score R2: 0.933120481633

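Above is a visualization of part of the predictions.

Hourly prediction

I use the same method to train and test the hourly models. Because the computational cost grows sharply with the number of data points, cross-validating on the full hourly dataset was not feasible within the time available for the project. I therefore used only a small subset of the training/validation data to choose the parameters, then trained and tested on the full dataset, leaving my computer running overnight.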
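A rough back-of-the-envelope estimate shows why. Exact Gaussian process fitting factorizes an n × n matrix, which takes on the order of n³ operations and n² memory. Assuming one row per day or per hour with no gaps in the training period (2012-01 through 2013-06), the sketch below compares the daily and hourly cases.

```python
import datetime as dt

# Days from 2012-01-01 through 2013-06-30 inclusive (assumes no missing rows)
train_days = (dt.date(2013, 7, 1) - dt.date(2012, 1, 1)).days
n_daily = train_days
n_hourly = train_days * 24

def dense_matrix_megabytes(n):
    """Memory for an n x n matrix of 8-byte floats, in MB."""
    return n * n * 8 / 1e6

print(n_daily, n_hourly)                    # 547 13128
print(dense_matrix_megabytes(n_daily))      # about 2.4 MB
print(dense_matrix_megabytes(n_hourly))     # about 1379 MB
print((n_hourly / float(n_daily)) ** 3)     # cubic cost ratio: 24**3 = 13824
```

Under these assumptions a single hourly fit costs roughly four orders of magnitude more than a daily fit, and cross-validation multiplies that by the number of folds and candidate parameters.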

Hourly electricity

Get the training/validation and test sets. Try a small sample first. The DataFrame shows the features and the target.

    def addHourlyTimeFeatures(df):
        df['hour'] = df.index.hour
        df['weekday'] = df.index.weekday
        df['day'] = df.index.dayofyear
        df['week'] = df.index.weekofyear
        return df

    hourlyElectricityWithFeatures = addHourlyTimeFeatures(hourlyElectricityWithFeatures)
    df_hourlyElectricity = hourlyElectricityWithFeatures[['hour', 'weekday', 'day',
                                                          'week', 'cosHour',
                                                          'occupancy', 'electricity-kWh']]
    #df_hourlyElectricity.to_excel('Data/trainSet_hourlyElectricity.xlsx')

    def setTrainTestSets(df, trainStart, trainEnd, testStart, testEnd, indY):
        # indY is the column index of the target, which must be the last column
        trainSet = df[trainStart : trainEnd]
        testSet = df[testStart : testEnd]
        trainX = trainSet.values[:,0:-1]
        trainY = trainSet.values[:,indY]
        testX = testSet.values[:,0:-1]
        testY = testSet.values[:,indY]
        return trainX, trainY, testX, testY, testSet

    trainStart = '2013-02'
    trainEnd = '2013-05'
    testStart = '2014-03'
    testEnd = '2014-04'
    df = df_hourlyElectricity
    trainX_hourlyElectricity, trainY_hourlyElectricity, \
    testX_hourlyElectricity, testY_hourlyElectricity, \
    testSet_hourlyElectricity = setTrainTestSets(df, trainStart,
                                                 trainEnd, testStart, testEnd, 6)
    df_hourlyElectricity.head()
                         hour  weekday  day  week  cosHour   occupancy  electricity-kWh
    2012-01-01 01:00:00  1     6        1    52    0.866025  0          111.479277
    2012-01-01 02:00:00  2     6        1    52    0.965926  0          117.989395
    2012-01-01 03:00:00  3     6        1    52    1.000000  0          119.010131
    2012-01-01 04:00:00  4     6        1    52    0.965926  0          116.005587
    2012-01-01 05:00:00  5     6        1    52    0.866025  0          111.132977
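The `cosHour` column comes from the preprocessing step, which is not shown in this part. Its values in the table are consistent with a 24-hour cosine that peaks at 03:00, that is cos(2π(hour − 3)/24); this formula is inferred from the printed values and is an assumption, not the documented definition. Encoding the hour this way gives the model a smooth daily cycle, whereas the raw `hour` feature jumps from 23 back to 0 at midnight.

```python
import math

def cos_hour(hour, peak_hour=3):
    """24-hour cosine encoding; peak_hour=3 is inferred from the table, not documented."""
    return math.cos(2 * math.pi * (hour - peak_hour) / 24.0)

for hour in (1, 2, 3, 4, 5):
    print(hour, round(cos_hour(hour), 6))
```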

Cross-validation to obtain the input parameters

    nugget = 0.008
    theta = np.arange(0.05, 0.5, 0.05)
    crossValidation(theta, nugget, 10, trainX_hourlyElectricity, trainY_hourlyElectricity)

Predict, compute accuracy and visualize

    gp_hourlyElectricity, results_hourlyElectricity = predictAll(0.1, 0.008,
                                                                 trainX_hourlyElectricity,
                                                                 trainY_hourlyElectricity,
                                                                 testX_hourlyElectricity,
                                                                 testY_hourlyElectricity,
                                                                 testSet_hourlyElectricity,
                                                                 'Hourly Electricity (Partial)')

    Train score R2: 0.957912601164
    Test score R2: 0.893873175566

    subset = results_hourlyElectricity['2014-03-08 23:00:00':'2014-03-15']
    testY = subset['electricity-kWh']
    predictedY = subset['predictedY']
    sigma = subset['sigma']
    plotGP(testY, predictedY, sigma)
    plt.ylabel('Electricity (kWh)', fontsize = 13)
    plt.title('Gaussian Process Regression: Hourly Electricity Prediction', fontsize = 17)
    plt.legend(loc='upper right')
    plt.xlim([0, len(testY)])
    #plt.ylim([1000,8000])
    xTickLabels = subset.index[np.arange(0,len(subset.index),6)]
    ax = plt.gca()
    ax.set_xticks(np.arange(0, len(subset), 6))
    ax.set_xticklabels(labels = xTickLabels, fontsize = 13, rotation = 90)
    plt.show()

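Above is a visualization of part of the predictions.

Train and test on all the hourly data

Final prediction accuracy for hourly electricity.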

    trainStart = '2012-01'
    trainEnd = '2013-06'
    testStart = '2013-07'
    testEnd = '2014-10'
    # Results of the full hourly run, saved earlier to an Excel file
    results_allHourlyElectricity = pd.read_excel('Data/results_allHourlyElectricity.xlsx')

    def plotR2(df, energyType, title):
        testY = df[energyType]
        predictedY = df['predictedY']
        print("Test score R2: {}".format(sklearn.metrics.r2_score(testY, predictedY)))
        plt.figure(figsize = (9,8))
        plt.scatter(testY, predictedY)
        plt.plot([min(testY), max(testY)], [min(testY), max(testY)], 'r')
        plt.xlim([min(testY), max(testY)])
        plt.ylim([min(testY), max(testY)])
        plt.title('Predicted vs. observed: ' + title)
        plt.xlabel('Observed')
        plt.ylabel('Predicted')
        plt.show()

    plotR2(results_allHourlyElectricity, 'electricity-kWh', 'All Hourly Electricity')

    Test score R2: 0.882986662109

Hourly chilled water

Get the training/validation and test sets. Try a small sample first. The DataFrame shows the features and the target.

    trainStart = '2013-08'
    trainEnd = '2013-10'
    testStart = '2014-08'
    testEnd = '2014-10'

    def getPredictedElectricity(trainStart, trainEnd, testStart, testEnd):
        trainX, trainY, testX, testY, testSet = setTrainTestSets(df_hourlyElectricity,
                                                                 trainStart, trainEnd,
                                                                 testStart, testEnd, 6)
        gp = gaussian_process.GaussianProcess(theta0 = 0.15, nugget = 0.008)
        gp.fit(trainX, trainY)
        trainSet = df_hourlyElectricity[trainStart : trainEnd]
        # Empty frame covering the training and test periods, filled in below
        predictedElectricity = pd.DataFrame(data = np.zeros(len(trainSet)),
                                            index = trainSet.index,
                                            columns = ['predictedElectricity'])
        predictedElectricity = \
            predictedElectricity.append(pd.DataFrame(data = np.zeros(len(testSet)),
                                                     index = testSet.index,
                                                     columns = ['predictedElectricity']))
        predictedElectricity.loc[trainStart:trainEnd,
                                 'predictedElectricity'] = gp.predict(trainX)
        predictedElectricity.loc[testStart:testEnd,
                                 'predictedElectricity'] = gp.predict(testX)
        return predictedElectricity

    predictedElectricity = getPredictedElectricity(trainStart, trainEnd, testStart, testEnd)
    hourlyChilledWaterWithMoreFeatures = \
        hourlyChilledWaterWithFeatures.join(predictedElectricity, how = 'inner')
    hourlyChilledWaterWithMoreFeatures = addHourlyTimeFeatures(hourlyChilledWaterWithMoreFeatures)
    df_hourlyChilledWater = hourlyChilledWaterWithMoreFeatures[['hour', 'cosHour',
                                                                'weekday', 'day', 'week',
                                                                'occupancy', 'T-C',
                                                                'humidityRatio-kg/kg',
                                                                'predictedElectricity',
                                                                'chilledWater-TonDays']]
    df = df_hourlyChilledWater
    trainX_hourlyChilledWater, \
    trainY_hourlyChilledWater, \
    testX_hourlyChilledWater, \
    testY_hourlyChilledWater, \
    testSet_hourlyChilledWater = setTrainTestSets(df, trainStart, trainEnd,
                                                  testStart, testEnd, 9)
    df_hourlyChilledWater.head()
                         hour  cosHour   weekday  day  week  occupancy  T-C   humidityRatio-kg/kg  predictedElectricity  chilledWater-TonDays
    2013-08-01 00:00:00  0     0.707107  3        213  31    0.5        17.7  0.010765             123.431454            0.200647
    2013-08-01 01:00:00  1     0.866025  3        213  31    0.5        17.8  0.012102             119.266616            0.183318
    2013-08-01 02:00:00  2     0.965926  3        213  31    0.5        17.7  0.012026             121.821887            0.183318
    2013-08-01 03:00:00  3     1.000000  3        213  31    0.5        17.6  0.011962             124.888675            0.183318
    2013-08-01 04:00:00  4     0.965926  3        213  31    0.5        17.7  0.011912             124.824413            0.183318

Cross-validation to obtain the input parameters

    nugget = 0.01
    theta = np.arange(0.001, 0.05, 0.01)
    crossValidation(theta, nugget, 5, trainX_hourlyChilledWater, trainY_hourlyChilledWater)

Predict, compute accuracy and visualize

    gp_hourlyChilledWater, results_hourlyChilledWater = predictAll(0.011, 0.01,
                                                                   trainX_hourlyChilledWater,
                                                                   trainY_hourlyChilledWater,
                                                                   testX_hourlyChilledWater,
                                                                   testY_hourlyChilledWater,
                                                                   testSet_hourlyChilledWater,
                                                                   'Hourly Chilled Water (Partial)')

    Train score R2: 0.914352370778
    Test score R2: 0.865202975683

    subset = results_hourlyChilledWater['2014-08-24':'2014-08-30']
    testY = subset['chilledWater-TonDays']
    predictedY = subset['predictedY']
    sigma = subset['sigma']
    plotGP(testY, predictedY, sigma)
    plt.ylabel('Chilled water (ton-days)', fontsize = 13)
    plt.title('Gaussian Process Regression: Hourly Chilled Water Prediction', fontsize = 17)
    plt.legend(loc='upper right')
    plt.xlim([0, len(testY)])
    #plt.ylim([1000,8000])
    xTickLabels = subset.index[np.arange(0,len(subset.index),6)]
    ax = plt.gca()
    ax.set_xticks(np.arange(0, len(subset), 6))
    ax.set_xticklabels(labels = xTickLabels, fontsize = 13, rotation = 90)
    plt.show()

Above is a visualization of part of the predictions.

Train and test on all the hourly data

Warning: this code may take a very long time to run.

    # This could take forever.
    trainStart = '2012-01'
    trainEnd = '2013-06'
    testStart = '2013-07'
    testEnd = '2014-10'
    predictedElectricity = getPredictedElectricity(trainStart, trainEnd, testStart, testEnd)

    # Chilled water
    hourlyChilledWaterWithMoreFeatures = hourlyChilledWaterWithFeatures.join(predictedElectricity, how = 'inner')
    hourlyChilledWaterWithMoreFeatures = addHourlyTimeFeatures(hourlyChilledWaterWithMoreFeatures)
    df_hourlyChilledWater = hourlyChilledWaterWithMoreFeatures[['hour', 'cosHour',
                                                                'weekday', 'day',
                                                                'week', 'occupancy', 'T-C',
                                                                'humidityRatio-kg/kg',
                                                                'predictedElectricity',
                                                                'chilledWater-TonDays']]
    df = df_hourlyChilledWater
    trainX_hourlyChilledWater, \
    trainY_hourlyChilledWater, \
    testX_hourlyChilledWater, \
    testY_hourlyChilledWater, \
    testSet_hourlyChilledWater = setTrainTestSets(df, trainStart, trainEnd,
                                                  testStart, testEnd, 9)
    # predictAll requires a plot title as its last argument
    gp_hourlyChilledWater, results_hourlyChilledWater = predictAll(0.011, 0.01,
                                                                   trainX_hourlyChilledWater,
                                                                   trainY_hourlyChilledWater,
                                                                   testX_hourlyChilledWater,
                                                                   testY_hourlyChilledWater,
                                                                   testSet_hourlyChilledWater,
                                                                   'All Hourly Chilled Water')
    results_hourlyChilledWater.to_excel('Data/results_allHourlyChilledWater.xlsx')

Final accuracy for hourly chilled water prediction.

    results_allHourlyChilledWater = pd.read_excel('Data/results_allHourlyChilledWater.xlsx')
    plotR2(results_allHourlyChilledWater, 'chilledWater-TonDays', 'All Hourly Chilled Water')

    Test score R2: 0.887195665411

Hourly steam

Get the training/validation and test sets. Try a small sample first. The DataFrame shows the features and the target.

    trainStart = '2012-01'
    trainEnd = '2012-03'
    testStart = '2014-01'
    testEnd = '2014-03'
    predictedElectricity = getPredictedElectricity(trainStart, trainEnd, testStart, testEnd)
    hourlySteamWithMoreFeatures = hourlySteamWithFeatures.join(predictedElectricity,
                                                               how = 'inner')
    hourlySteamWithMoreFeatures = addHourlyTimeFeatures(hourlySteamWithMoreFeatures)
    df_hourlySteam = hourlySteamWithMoreFeatures[['hour', 'cosHour', 'weekday',
                                                  'day', 'week', 'occupancy', 'T-C',
                                                  'humidityRatio-kg/kg',
                                                  'predictedElectricity', 'steam-LBS']]
    df = df_hourlySteam
    # Each continued line needs a trailing backslash for the tuple unpacking to parse
    trainX_hourlySteam, \
    trainY_hourlySteam, \
    testX_hourlySteam, \
    testY_hourlySteam, \
    testSet_hourlySteam = setTrainTestSets(df, trainStart, trainEnd, testStart, testEnd, 9)
    df_hourlySteam.head()
                         hour  cosHour   weekday  day  week  occupancy  T-C  humidityRatio-kg/kg  predictedElectricity  steam-LBS
    2012-01-01 01:00:00  1     0.866025  6        1    52    0          4    0.004396             117.502720            513.102214
    2012-01-01 02:00:00  2     0.965926  6        1    52    0          4    0.004391             114.720226            1353.371311
    2012-01-01 03:00:00  3     1.000000  6        1    52    0          5    0.004380             113.079503            1494.904514
    2012-01-01 04:00:00  4     0.965926  6        1    52    0          6    0.004401             112.428146            490.090061
    2012-01-01 05:00:00  5     0.866025  6        1    52    0          4    0.004382             113.892620            473.258464

Cross-validation to obtain the input parameters

    nugget = 0.01
    theta = np.arange(0.001, 0.014, 0.002)
    crossValidation(theta, nugget, 5, trainX_hourlySteam, trainY_hourlySteam)

Train and test on the hourly sample

    gp_hourlySteam, results_hourlySteam = predictAll(0.007, 0.01,
                                                     trainX_hourlySteam, trainY_hourlySteam,
                                                     testX_hourlySteam, testY_hourlySteam, testSet_hourlySteam, 'Hourly Steam (Partial)')

    Train score R2: 0.844826486064
    Test score R2: 0.570427290315

    subset = results_hourlySteam['2014-01-26':'2014-02-01']
    testY = subset['steam-LBS']
    predictedY = subset['predictedY']
    sigma = subset['sigma']
    plotGP(testY, predictedY, sigma)
    plt.ylabel('Steam (LBS)', fontsize = 13)
    plt.title('Gaussian Process Regression: Hourly Steam Prediction', fontsize = 17)
    plt.legend(loc='upper right')
    plt.xlim([0, len(testY)])
    xTickLabels = subset.index[np.arange(0,len(subset.index),6)]
    ax = plt.gca()
    ax.set_xticks(np.arange(0, len(subset), 6))
    ax.set_xticklabels(labels = xTickLabels, fontsize = 13, rotation = 90)
    plt.show()

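Above is a visualization of part of the predictions.

Train and test on all the hourly steam data

Warning: this code may take a very long time to run.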

    # Warning: this could take forever to execute the code.
    trainStart = '2012-01'
    trainEnd = '2013-06'
    testStart = '2013-07'
    testEnd = '2014-10'
    predictedElectricity = getPredictedElectricity(trainStart, trainEnd, testStart, testEnd)

    # Steam
    hourlySteamWithMoreFeatures = hourlySteamWithFeatures.join(predictedElectricity,
                                                               how = 'inner')
    hourlySteamWithMoreFeatures = addHourlyTimeFeatures(hourlySteamWithMoreFeatures)
    df_hourlySteam = hourlySteamWithMoreFeatures[['hour', 'cosHour', 'weekday',
                                                  'day', 'week', 'occupancy', 'T-C',
                                                  'humidityRatio-kg/kg',
                                                  'predictedElectricity', 'steam-LBS']]
    df = df_hourlySteam
    trainX_hourlySteam, trainY_hourlySteam, testX_hourlySteam, testY_hourlySteam, \
    testSet_hourlySteam = setTrainTestSets(df, trainStart, trainEnd,
                                           testStart, testEnd, 9)
    # predictAll requires a plot title as its last argument
    gp_hourlySteam, results_hourlySteam = predictAll(0.007, 0.01,
                                                     trainX_hourlySteam, trainY_hourlySteam,
                                                     testX_hourlySteam, testY_hourlySteam,
                                                     testSet_hourlySteam,
                                                     'All Hourly Steam')
    results_hourlySteam.to_excel('Data/results_allHourlySteam.xlsx')

Final accuracy for hourly steam prediction.

    results_allHourlySteam = pd.read_excel('Data/results_allHourlySteam.xlsx')
    plotR2(results_allHourlySteam, 'steam-LBS', 'All Hourly Steam')

    Test score R2: 0.84405417838

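The hourly predictions are not as good as the daily ones, but the R2 values are still around 0.85.

Summary

Overall, Gaussian process regression performs very well. Its advantage is that it provides an uncertainty range for each prediction. Its drawback is that it is computationally very expensive on large datasets.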
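Both points can be seen in a bare-bones version of the underlying computation. The sketch below is a simplified zero-mean, unit-variance Gaussian process with a squared-exponential correlation and a nugget, written with NumPy. It is not the scikit-learn implementation, which among other things also fits a regression trend and normalizes the data, and `gp_fit_predict` is a hypothetical name. The predictive standard deviation is what produces the uncertainty band, and the Cholesky factorization of the n × n matrix is the step whose cost grows as n³.

```python
import numpy as np

def sq_exp(A, B, theta):
    """Squared-exponential correlation between the rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-theta * sq_dist)

def gp_fit_predict(trainX, trainY, testX, theta, nugget):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    n = len(trainX)
    K = sq_exp(trainX, trainX, theta) + nugget * np.eye(n)
    L = np.linalg.cholesky(K)                        # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, trainY))
    Ks = sq_exp(testX, trainX, theta)
    mean = Ks.dot(alpha)
    v = np.linalg.solve(L, Ks.T)
    var = 1.0 - (v ** 2).sum(axis=0)                 # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

# Made-up 1-D data: predict at a training location and far away from the data
trainX = np.array([[0.0], [1.0], [2.0]])
trainY = np.array([0.0, 1.0, 0.5])
testX = np.array([[1.0], [5.0]])
mean, sigma = gp_fit_predict(trainX, trainY, testX, theta=1.0, nugget=0.01)
print(mean, sigma)   # sigma is small near the data and close to 1 far from it
```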

Next: 4_Random Forests and K-Nearest Neighbours
