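[SVM] Kaggle: Australian Weather Prediction

Project goal

Atmospheric motion is extremely complex, many factors influence the weather, and our ability to model the atmosphere is limited, so weather forecasting remains difficult. In practice, each forecast requires the forecaster to weigh many factors together and predict individual meteorological elements such as temperature and precipitation. This project trains a binary classifier that predicts, from a set of weather features, whether it will rain in a city (the RainTomorrow label).

Data description

The dataset contains roughly 150,000 daily observations from weather stations across Australia; this project randomly samples 10,000 of them. The features are: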

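Feature	Meaning
Date	Date of the observation
Location	Name of the weather station that recorded the observation
MinTemp	Minimum temperature (°C)
MaxTemp	Maximum temperature (°C)
Rainfall	Rainfall recorded for the day (mm)
Evaporation	Class A pan evaporation in the 24 hours to 9 am (mm)
Sunshine	Hours of bright sunshine during the day
WindGustDir	Direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed	Speed of the strongest wind gust in the 24 hours to midnight (km/h)
WindDir9am	Wind direction at 9 am
WindDir3pm	Wind direction at 3 pm
WindSpeed9am	Wind speed averaged over the 10 minutes before 9 am (km/h)
WindSpeed3pm	Wind speed averaged over the 10 minutes before 3 pm (km/h)
Humidity9am	Humidity at 9 am (percent)
Humidity3pm	Humidity at 3 pm (percent)
Pressure9am	Atmospheric pressure reduced to mean sea level at 9 am (hPa)
Pressure3pm	Atmospheric pressure reduced to mean sea level at 3 pm (hPa)
Cloud9am	Fraction of the sky obscured by cloud at 9 am; 0 means a completely clear sky and 8 means completely overcast
Cloud3pm	Fraction of the sky obscured by cloud at 3 pm, on the same 0 to 8 scale
Temp9am	Temperature at 9 am (°C)
Temp3pm	Temperature at 3 pm (°C)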

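Project workflow

- Handle missing values and drop features irrelevant to the prediction

- Draw a random sample

- Encode the categorical variables

- Handle outliers

- Standardize the numeric features

- Train the model

- Predict with the model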

Project code (Jupyter)

import pandas as pd

import numpy as np

Load and explore the data

weather = pd.read_csv("weather.csv", index_col=0)

weather.head()

weather.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 142193 entries, 0 to 142192

Data columns (total 20 columns):

 #   Column         Non-Null Count   Dtype

---  ------         --------------   -----

 0   MinTemp        141556 non-null  float64

 1   MaxTemp        141871 non-null  float64

 2   Rainfall       140787 non-null  float64

 3   Evaporation    81350 non-null   float64

 4   Sunshine       74377 non-null   float64

 5   WindGustDir    132863 non-null  object

 6   WindGustSpeed  132923 non-null  float64

 7   WindDir9am     132180 non-null  object

 8   WindDir3pm     138415 non-null  object

 9   WindSpeed9am   140845 non-null  float64

 10  WindSpeed3pm   139563 non-null  float64

 11  Humidity9am    140419 non-null  float64

 12  Humidity3pm    138583 non-null  float64

 13  Pressure9am    128179 non-null  float64

 14  Pressure3pm    128212 non-null  float64

 15  Cloud9am       88536 non-null   float64

 16  Cloud3pm       85099 non-null   float64

 17  Temp9am        141289 non-null  float64

 18  Temp3pm        139467 non-null  float64

 19  RainTomorrow   142193 non-null  object

dtypes: float64(16), object(4)

memory usage: 22.8+ MB

Drop features irrelevant to the prediction

weather.drop(["Date", "Location"],inplace=True, axis=1)

Drop rows with missing values and reset the index

weather.dropna(inplace=True)

weather.index = range(len(weather))
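
The workflow lists a random-sampling step, and the code that follows works on a frame named weather_sample, but the sampling call itself does not appear in the notebook. A minimal sketch of how such a step could look, run here on toy data; the seed, the toy frame, and the names weather_demo and weather_sample_demo are assumptions, not the author's code:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned `weather` frame (hypothetical data).
rng = np.random.default_rng(0)
weather_demo = pd.DataFrame({
    "MinTemp": rng.normal(12, 6, size=500),
    "RainTomorrow": rng.choice(["No", "Yes"], size=500, p=[0.78, 0.22]),
})

# Draw a fixed-size random sample without replacement.
# random_state is an assumed seed for reproducibility.
sample_size = 100  # the project itself samples 10,000 rows
weather_sample_demo = (
    weather_demo.sample(n=sample_size, random_state=42)
    .reset_index(drop=True)  # index 0..n-1, so later pd.concat calls align row by row
)

print(weather_sample_demo.shape)
```

On the real data the equivalent call would presumably be weather.sample(n=10000, random_state=...).reset_index(drop=True), assigned to weather_sample. Resetting the index matters because the later pd.concat calls align rows by index label.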

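1. WindGustDir, WindDir9am, and WindDir3pm are nominal (unordered categorical) variables: OneHotEncoder
2. Cloud9am and Cloud3pm are ordinal (ordered categorical) variables: OrdinalEncoder
3. RainTomorrow is the label: LabelEncoder

For simplicity, of the three wind-direction features WindGustDir, WindDir9am, and WindDir3pm, only the first (the direction of the strongest gust) is kept.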

weather_sample.drop(["WindDir9am", "WindDir3pm"], inplace=True, axis=1)

Encode the categorical variables

from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder,LabelEncoder

print(np.unique(weather_sample["RainTomorrow"]))

print(np.unique(weather_sample["WindGustDir"]))

print(np.unique(weather_sample["Cloud9am"]))

print(np.unique(weather_sample["Cloud3pm"]))

['No' 'Yes']

['E' 'ENE' 'ESE' 'N' 'NE' 'NNE' 'NNW' 'NW' 'S' 'SE' 'SSE' 'SSW' 'SW' 'W'

 'WNW' 'WSW']

[0. 1. 2. 3. 4. 5. 6. 7. 8.]

[0. 1. 2. 3. 4. 5. 6. 7. 8.]

# Check for class imbalance; it is fairly mild

weather_sample["RainTomorrow"].value_counts()

No     7750

Yes    2250

Name: RainTomorrow, dtype: int64

# Encode the label (No -> 0, Yes -> 1); assign the array directly to avoid index alignment

weather_sample["RainTomorrow"] = LabelEncoder().fit_transform(weather_sample["RainTomorrow"])

# Encode Cloud9am and Cloud3pm; both take the same ordered values 0 to 8, so one fitted encoder serves both

oe = OrdinalEncoder().fit(weather_sample["Cloud9am"].values.reshape(-1, 1))

weather_sample["Cloud9am"] = oe.transform(weather_sample["Cloud9am"].values.reshape(-1, 1)).ravel()

weather_sample["Cloud3pm"] = oe.transform(weather_sample["Cloud3pm"].values.reshape(-1, 1)).ravel()

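# One-hot encode WindGustDir

# The sparse= argument and get_feature_names() were removed in recent scikit-learn releases;

# .toarray() and get_feature_names_out() (scikit-learn >= 1.0) give the same dense result and column names

ohe = OneHotEncoder()

ohe.fit(weather_sample["WindGustDir"].values.reshape(-1, 1))

WindGustDir_df = pd.DataFrame(ohe.transform(weather_sample["WindGustDir"].values.reshape(-1, 1)).toarray(), columns=ohe.get_feature_names_out())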

WindGustDir_df.tail()

Merge the data

weather_sample_new = pd.concat([weather_sample,WindGustDir_df],axis=1)

weather_sample_new.drop(["WindGustDir"], inplace=True, axis=1)

weather_sample_new

Reorder the columns so that the numeric variables are separated from the categorical ones, which makes standardization easier

Cloud9am = weather_sample_new.iloc[:,12]

Cloud3pm = weather_sample_new.iloc[:,13]

weather_sample_new.drop(["Cloud9am"], inplace=True, axis=1)

weather_sample_new.drop(["Cloud3pm"], inplace=True, axis=1)

weather_sample_new["Cloud9am"] = Cloud9am

weather_sample_new["Cloud3pm"] = Cloud3pm

RainTomorrow = weather_sample_new["RainTomorrow"]

weather_sample_new.drop(["RainTomorrow"], inplace=True, axis=1)

weather_sample_new["RainTomorrow"] = RainTomorrow

weather_sample_new.head()
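
The reordering above picks columns by position (iloc[:,12] and iloc[:,13]), which silently breaks if the column layout changes. As an alternative, the same "move these columns to the end" idea can be sketched by name. The toy frame and the names demo, head_cols, and tail_cols are illustrative assumptions:

```python
import pandas as pd

# Toy frame with a few of the project's column names (hypothetical values).
demo = pd.DataFrame({
    "MinTemp": [10.0, 12.0],
    "Cloud9am": [1.0, 7.0],
    "Temp9am": [15.0, 18.0],
    "RainTomorrow": [0, 1],
    "x0_E": [1.0, 0.0],
})

# Columns to move to the end, in the order they should appear there.
tail_cols = ["Cloud9am", "RainTomorrow"]

# Everything else keeps its original relative order.
head_cols = [c for c in demo.columns if c not in tail_cols]

# Selecting with a list of names returns a new frame in that column order.
reordered = demo[head_cols + tail_cols]

print(list(reordered.columns))
```

Selecting by name keeps the label as the last column regardless of how many one-hot columns the encoder produced.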

Standardization is sensitive to outliers, so handle the outliers first

# Inspect the data for outliers

weather_sample_new.describe([0.01,0.99])

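Standardization applies only to the numeric variables, so separate them from the categorical ones

# Slice out the numeric variables and the categorical variables

# .copy() lets the numeric slice be modified in place without a SettingWithCopyWarning

weather_sample_mv = weather_sample_new.iloc[:,0:14].copy()

weather_sample_cv = weather_sample_new.iloc[:,14:33]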

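Handle outliers by capping (winsorizing)

## Cap outliers in the numeric variables, modifying df in place

def cap(df,quantile=[0.01,0.99]):

    for col in df:

        # Compute the lower and upper quantiles

        Q01,Q99 = df[col].quantile(quantile).values.tolist()

        # Replace values beyond the quantiles with the quantile values

        if Q01 > df[col].min():

            df.loc[df[col] < Q01, col] = Q01

        if Q99 < df[col].max():

            df.loc[df[col] > Q99, col] = Q99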

cap(weather_sample_mv)

weather_sample_mv.describe([0.01,0.99])
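
The same capping can also be expressed without an explicit loop. The sketch below uses pandas' DataFrame.clip with per-column quantile bounds on toy data. It is an alternative formulation rather than the notebook's method, and the frame and names (demo, lower, upper, capped) are assumptions:

```python
import numpy as np
import pandas as pd

# Toy numeric frame (hypothetical data): two columns on different scales.
demo = pd.DataFrame({
    "a": np.arange(101, dtype=float),
    "b": np.arange(101, dtype=float) * 10,
})

# Per-column 1st and 99th percentiles; each is a Series indexed by column name.
lower = demo.quantile(0.01)
upper = demo.quantile(0.99)

# axis=1 aligns the bounds with the columns, so every column is clipped
# to its own [1st percentile, 99th percentile] range.
capped = demo.clip(lower=lower, upper=upper, axis=1)

print(capped.agg(["min", "max"]))
```

Unlike cap(), clip returns a new frame and leaves the input untouched, so the result has to be assigned back.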

Standardize the data (zero mean, unit variance)

from sklearn.preprocessing import StandardScaler

weather_sample_mv = pd.DataFrame(StandardScaler().fit_transform(weather_sample_mv), columns=weather_sample_mv.columns)

weather_sample_mv
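
One caveat about the step above: the scaler is fit on all 10,000 rows, so when the data is later cross-validated, each validation fold has already influenced the scaling statistics. A common way to avoid this is to put the scaler inside a pipeline so it is refit on the training part of every fold. The sketch below shows the idea on synthetic data, with make_classification standing in for the weather features. It scales every column; scaling only the numeric ones would need something like a ColumnTransformer. All names here are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced binary data (hypothetical stand-in for X, y).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.78, 0.22], random_state=0
)

# The scaler is part of the estimator, so cross_val_score refits it
# on the training split of each fold before scoring the held-out split.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```

For a dataset of this size the difference in scores is usually small, but the pipeline form also makes it easier to apply the identical preprocessing to new data at prediction time.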

Re-merge the data

weather_sample = pd.concat([weather_sample_mv, weather_sample_cv], axis=1)

weather_sample.head()

Split features and label

X = weather_sample.iloc[:,:-1]

y = weather_sample.iloc[:,-1]

print(X.shape)

print(y.shape)

(10000, 32)

(10000,)

Build the model and cross-validate

from sklearn.svm import SVC

from sklearn.model_selection import cross_val_score

from sklearn.metrics import roc_auc_score, recall_score

for kernel in ["linear","poly","rbf"]:

    accuracy = cross_val_score(SVC(kernel=kernel), X, y, cv=5, scoring="accuracy").mean()

    print("{}:{}".format(kernel,accuracy))

linear:0.8564

poly:0.8532

rbf:0.8531000000000001
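
Accuracy alone can flatter the model here: 7,750 of the 10,000 sampled days are labeled No, so a classifier that always predicts No would already reach 0.775. The notebook imports recall_score and roc_auc_score but never uses them. The sketch below shows one way to report recall and ROC AUC alongside accuracy using cross_validate. It runs on synthetic data, so its numbers say nothing about the weather dataset, and the names X_demo, y_demo, and results are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic data with roughly the same class ratio as the sample (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.78, 0.22], random_state=0
)

metrics = ["accuracy", "recall", "roc_auc"]
results = {}

for kernel in ["linear", "poly", "rbf"]:
    # cross_validate accepts several scorers at once and returns one
    # array per metric under the key "test_<metric>".
    cv_out = cross_validate(SVC(kernel=kernel), X_demo, y_demo, cv=5, scoring=metrics)
    results[kernel] = {m: cv_out["test_" + m].mean() for m in metrics}

for kernel, scores in results.items():
    print(kernel, {m: round(v, 4) for m, v in scores.items()})
```

Recall on the positive class shows how many rainy days the model actually catches, which is the quantity accuracy hides when most days are dry.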

weather_sample.head()
