These past few days I signed up for Alibaba Tianchi's "bus route passenger flow prediction" competition, so I took the opportunity to revisit my old Kaggle Titanic practice-competition code and refresh some data-handling basics. The task is to predict whether each Titanic passenger survived, given two data tables: titanic_test.csv and titanic_train.csv. First, the field descriptions:

  • PassengerId -- A numerical id assigned to each passenger.
  • Survived -- Whether the passenger survived (1), or didn't (0). We'll be making predictions for this column.
  • Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).
  • Name -- the name of the passenger.
  • Sex -- The gender of the passenger -- male or female.
  • Age -- The age of the passenger. Fractional.
  • SibSp -- The number of siblings and spouses the passenger had on board.
  • Parch -- The number of parents and children the passenger had on board.
  • Ticket -- The ticket number of the passenger.
  • Fare -- How much the passenger paid for the ticket.
  • Cabin -- Which cabin the passenger was in.
  • Embarked -- Where the passenger boarded the Titanic.
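Before any cleaning, it helps to count the missing values per column. A minimal sketch, using a tiny hypothetical stand-in DataFrame with the same columns (the real code below reads the competition CSVs instead):

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for titanic_train.csv, same column names
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# Count missing values per column -- in the real data, Age, Cabin and
# Embarked are the usual offenders, which is why they get filled below.
missing = sample.isnull().sum()
print(missing)
```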

Here is the Python processing code:

import re
import operator
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif

train = pd.read_csv("titanic_train.csv", dtype={"Age": np.float64})
test = pd.read_csv("titanic_test.csv", dtype={"Age": np.float64})

print("\n\nTop of the training data:")
print(train.head())
print("\n\nSummary statistics of training data")
print(train.describe())
# train.to_csv('copy_of_the_training_data.csv', index=False)

# Fill missing ages with a sentinel value
train["Age"] = train["Age"].fillna(-1)
test["Age"] = test["Age"].fillna(-1)

# Encode Sex as an integer
train.loc[train["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1
test.loc[test["Sex"] == "female", "Sex"] = 1

# Fill the missing Embarked values with the most common port, then encode
print(train["Embarked"].unique())
train["Embarked"] = train["Embarked"].fillna("S")
test["Embarked"] = test["Embarked"].fillna("S")
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2

# Fill missing fares with the median
train["Fare"] = train["Fare"].fillna(train["Fare"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

# Generating a FamilySize column
train["FamilySize"] = train["SibSp"] + train["Parch"]
test["FamilySize"] = test["SibSp"] + test["Parch"]

train["NameLength"] = train["Name"].apply(lambda x: len(x))
test["NameLength"] = test["Name"].apply(lambda x: len(x))

def get_title(name):
    # Use a regular expression to search for a title. Titles always consist of
    # capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles.
train_titles = train["Name"].apply(get_title)
test_titles = test["Name"].apply(get_title)

# Map each title to an integer. Some titles are very rare, and are compressed
# into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6,
                 "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10,
                 "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 9}
for k, v in title_mapping.items():
    train_titles[train_titles == k] = v
    test_titles[test_titles == k] = v

# Add in the Title column.
train["Title"] = train_titles
test["Title"] = test_titles

# A dictionary mapping family name to id
family_id_mapping = {}

# A function to get the id given a row
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
train_family_ids = train.apply(get_family_id, axis=1)
test_family_ids = test.apply(get_family_id, axis=1)

# There are a lot of family ids, so we compress all of the families
# with fewer than 3 members into one code.
train_family_ids[train["FamilySize"] < 3] = -1
test_family_ids[test["FamilySize"] < 3] = -1
train["FamilyId"] = train_family_ids
test["FamilyId"] = test_family_ids

alg = AdaBoostClassifier()
# alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked",
              "FamilySize", "Title", "FamilyId"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(train[predictors], train["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Train on the four strongest features and predict on the test set
predictors = ["Pclass", "Sex", "Fare", "Title"]
x_train = train[predictors]
y_train = train["Survived"]
x_test = test[predictors]
alg.fit(x_train, y_train)
predictions = alg.predict(x_test)

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions
})
submission.to_csv('submission.csv', index=False)
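Before writing submission.csv it is worth estimating accuracy locally with cross-validation rather than relying on leaderboard submissions. A minimal sketch of cross_val_score on the four chosen predictors, using synthetic stand-in data (the column values and the label rule here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(1)

# Synthetic stand-in for train[predictors] and train["Survived"]
x = pd.DataFrame({
    "Pclass": rng.randint(1, 4, 200),
    "Sex": rng.randint(0, 2, 200),
    "Fare": rng.uniform(5, 100, 200),
    "Title": rng.randint(1, 11, 200),
})
# Make the label depend on Sex and Pclass so the fit is non-trivial
y = ((x["Sex"] == 1) | (x["Pclass"] == 1)).astype(int)

# 3-fold cross-validated accuracy; with the real data, swap in
# train[predictors] and train["Survived"]
scores = cross_val_score(AdaBoostClassifier(), x, y, cv=3)
print(scores.mean())
```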

As an aside, here is a summary of the str.format() string formatting used above:

# coding=utf-8
# The str.format() function

# '{}' positional placeholders
print('what\'s {},{}'.format('wrong', 'hong!'))

# Indexed placeholders like {0}, {1}
print('{},I\'m {},my qq is {}'.format('hello', 'hong', ''))
print('{},I\'m {},my E-mail is {}'.format('Hello', 'Hongten', 'hongtenzone@foxmail.com'))
print('{1},{0}'.format('hello', 'world'))

# Named '{name}' placeholders
print('Hi,{name},{message}'.format(name="hongten", message='how are you?'))

# Format control:
import math
print('The value of PI is approximately {0:.3f}.'.format(math.pi))

table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678}
for name, phone in table.items():
    print('{0:10}==>{1:10d}'.format(name, phone))
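On Python 3.6+, f-strings accept the same format specs inline, which is usually shorter than str.format(). A quick sketch using the same examples:

```python
import math

# Same format-spec syntax as str.format(), written directly in the literal
print(f'The value of PI is approximately {math.pi:.3f}.')

name, phone = 'Jack', 4098
# {name:10} left-pads the string to width 10; {phone:10d} right-aligns the int
line = f'{name:10}==>{phone:10d}'
print(line)
```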

And some notes from studying Python's regular-expression module:

# coding=utf-8
import re

# re.match() only matches at the beginning of the string
s = 'I like hongten! he is so cool!'
m = re.match(r'(\w+)\s', s)
if m:
    # group(0) is the whole match; parentheses () in the pattern mark
    # the groups to extract
    print(m.group(0))
else:
    print('not match')

# re.search() scans the whole string for the first match
text = "JGood is a handsome boy, he is cool, clever, and so on..."
m = re.search(r'\shan(ds)ome\s', text)
if m:
    print(m.groups())
else:
    print("No search")

# re.sub() replaces every match in the string
print(re.sub(r'\s+', '-', text))

# re.compile() pre-compiles a pattern for reuse; re.split and re.findall
# are also worth knowing
re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
print(re_telephone.match('010-8086').groups())
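The re.split and re.findall functions mentioned in the comment above work like this (a quick sketch on the same example text):

```python
import re

text = "JGood is a handsome boy, he is cool, clever, and so on..."

# re.findall returns every non-overlapping match as a list:
# here, all words starting with 'c'
words = re.findall(r'c\w+', text)
print(words)

# re.split splits the string on every match of the pattern:
# here, on each comma plus trailing whitespace
parts = re.split(r',\s*', text)
print(parts)
```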
