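I recently signed up for the Alibaba Tianchi "Bus Route Passenger Flow Prediction" competition, so I took the chance to go back over the code I had studied for Kaggle's Titanic practice competition and refresh my memory of some basic data processing. The task is to predict whether each passenger survived, based on the passenger information. Two data tables are provided: titanic_test.csv and titanic_train.csv. First, a description of the fields in the tables: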

  • PassengerId -- A numerical id assigned to each passenger.
  • Survived -- Whether the passenger survived (1), or didn't (0). We'll be making predictions for this column.
  • Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).
  • Name -- the name of the passenger.
  • Sex -- The gender of the passenger -- male or female.
  • Age -- The age of the passenger. Fractional.
  • SibSp -- The number of siblings and spouses the passenger had on board.
  • Parch -- The number of parents and children the passenger had on board.
  • Ticket -- The ticket number of the passenger.
  • Fare -- How much the passenger paid for the ticket.
  • Cabin -- Which cabin the passenger was in.
  • Embarked -- Where the passenger boarded the Titanic.
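
Before cleaning, it helps to know which of these fields contain gaps. The sketch below counts empty cells per column using only the standard library; the three rows are invented for illustration (they are not taken from titanic_train.csv), and `count_missing` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical rows in the same column layout as titanic_train.csv (values invented).
SAMPLE = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,,1,0,B2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,C3,7.92,,
"""


def count_missing(csv_text):
    """Return {column: number of empty cells} for a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == "":
                missing[name] += 1
    return missing


missing = count_missing(SAMPLE)
# Show only the columns that actually have gaps.
print({name: n for name, n in missing.items() if n})
```

With pandas loaded, `train.isnull().sum()` gives the same kind of per-column count on the real table.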

Here is the Python processing code:

    import operator
    import re

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    train = pd.read_csv("titanic_train.csv", dtype={"Age": np.float64})
    test = pd.read_csv("titanic_test.csv", dtype={"Age": np.float64})

    print("\n\nTop of the training data:")
    print(train.head())
    print("\n\nSummary statistics of training data")
    print(train.describe())
    # train.to_csv('copy_of_the_training_data.csv', index=False)

    print(train["Embarked"].unique())

    # Apply the same cleaning steps to both tables.
    for df in (train, test):
        # Missing ages are marked with -1.
        df["Age"] = df["Age"].fillna(-1)
        # Encode Sex as 0 (male) / 1 (female).
        df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
        # Missing ports are filled with "S", then encoded as integers.
        df["Embarked"] = df["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
        # Missing fares get the median fare of the same table.
        df["Fare"] = df["Fare"].fillna(df["Fare"].median())
        # Generating a familysize column
        df["FamilySize"] = df["SibSp"] + df["Parch"]
        df["NameLength"] = df["Name"].apply(len)


    def get_title(name):
        # Use a regular expression to search for a title. Titles always consist of
        # capital and lowercase letters, and end with a period.
        title_search = re.search(r' ([A-Za-z]+)\.', name)
        # If the title exists, extract and return it.
        if title_search:
            return title_search.group(1)
        return ""


    # Map each title to an integer. Some titles are very rare, and are compressed
    # into the same codes as other titles.
    title_mapping = {
        "Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6,
        "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10,
        "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 9,
    }

    # Get all the titles and add in the title column.
    for df in (train, test):
        df["Title"] = df["Name"].apply(get_title).map(title_mapping)
    # print(test["Title"])

    # A dictionary mapping family name to id
    family_id_mapping = {}


    # A function to get the id given a row
    def get_family_id(row):
        # Find the last name by splitting on a comma
        last_name = row["Name"].split(",")[0]
        # Create the family id
        family_id = "{0}{1}".format(last_name, row["FamilySize"])
        # Look up the id in the mapping
        if family_id not in family_id_mapping:
            if len(family_id_mapping) == 0:
                current_id = 1
            else:
                # Get the maximum id from the mapping and add one to it if we don't have an id
                current_id = max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1
            family_id_mapping[family_id] = current_id
        return family_id_mapping[family_id]


    # Get the family ids with the apply method. There are a lot of family ids,
    # so we'll compress all of the families under 3 members into one code.
    for df in (train, test):
        family_ids = df.apply(get_family_id, axis=1)
        family_ids[df["FamilySize"] < 3] = -1
        df["FamilyId"] = family_ids

    alg = AdaBoostClassifier()
    # alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
    predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked",
                  "FamilySize", "Title", "FamilyId"]

    # Perform feature selection
    selector = SelectKBest(f_classif, k=5)
    selector.fit(train[predictors], train["Survived"])

    # Get the raw p-values for each feature, and transform from p-values into scores
    scores = -np.log10(selector.pvalues_)

    # Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
    plt.bar(range(len(predictors)), scores)
    plt.xticks(range(len(predictors)), predictors, rotation='vertical')
    plt.show()

    print("#######")
    predictors = ["Pclass", "Sex", "Fare", "Title"]

    x_train = train[predictors]
    y_train = train["Survived"]
    x_test = test[predictors]
    alg.fit(x_train, y_train)
    predictions = alg.predict(x_test)

    submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": predictions
    })
    submission.to_csv('submission.csv', index=False)
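
To see what the title extraction in the code above does on individual names, here is a small standalone sketch that uses only the re module. The sample names are made up for illustration, `TITLE_MAPPING` is a shortened copy of the mapping, and the fallback code 0 for unknown titles is my own assumption: the script above has no such fallback.

```python
import re

# Shortened copy of the title_mapping used in the script above.
TITLE_MAPPING = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Ms": 2}


def get_title(name):
    """Return the first run of letters that follows a space and ends with a period, or ''."""
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""


def title_code(name, default=0):
    """Map a name to its integer title code; `default` (an assumption) covers unknown titles."""
    return TITLE_MAPPING.get(get_title(name), default)


# Made-up names in the "Last, Title. First" layout of the Name column.
for name in ["Doe, Mr. John", "Roe, Mrs. Jane (Mary Smith)", "Poe, Xyz. Ann", "NoTitle"]:
    print(repr(get_title(name)), title_code(name))
```

Giving unknown titles an explicit code is one way to make sure the Title column stays purely numeric when the test set contains a title that the training set does not.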

While I'm at it, here is a summary of str.format(), the string formatting method used above:

    # coding=utf-8
    # The str.format() method

    # Using '{}' placeholders
    print('what\'s {},{}'.format('wrong', 'hong!'))

    # Positional placeholders: '{}' is filled in order, '{0}', '{1}' select by index
    print('{},I\'m {},my qq is {}'.format('hello', 'hong', ''))
    print('{},I\'m {},my E-mail is {}'.format('Hello', 'Hongten', 'hongtenzone@foxmail.com'))
    print('{1},{0}'.format('hello', 'world'))

    # Using '{name}' placeholders
    print('Hi,{name},{message}'.format(name="hongten", message='how are you?'))

    # Format control:
    import math
    print('The value of PI is approximately {0:.3f}.'.format(math.pi))

    table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678}
    for name, phone in table.items():
        print('{0:10}==>{1:10d}'.format(name, phone))
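
The part after the colon in a placeholder is the format spec, which controls width, alignment and precision. The sketch below spells out what '{0:10}==>{1:10d}' produces and shows the equivalent f-string form (available since Python 3.6); the variable names are just for illustration.

```python
import math

name, phone = 'Jack', 4098

# Within the given width, strings are left-aligned and integers right-aligned by default.
row = '{0:10}==>{1:10d}'.format(name, phone)
print(repr(row))

# '<', '>' and '^' set the alignment explicitly.
left = '{:<6}|'.format('ab')
right = '{:>6}|'.format('ab')
center = '{:^6}|'.format('ab')
print(left, right, center)

# The same format specs work inside f-strings.
f_row = f'{name:10}==>{phone:10d}'
pi_text = f'{math.pi:.3f}'
print(f_row == row, pi_text)
```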

And some notes from studying re, Python's regular expression module:

    # coding=utf-8
    import re

    # re.match()
    s = 'I like hongten! he is so cool!'
    m = re.match(r'(\w+)\s', s)
    if m:
        print(m.group(0))  # see also m.groups(): in a regex, () marks a group to extract
    else:
        print('not match')

    # re.search()
    text = "JGood is a handsome boy, he is cool, clever, and so on..."
    m = re.search(r'\shan(ds)ome\s', text)
    if m:
        print(m.groups())
    else:
        print("No search")

    # re.sub(): replace the matches in a string
    text = "JGood is a handsome boy, he is cool, clever, and so on..."
    print(re.sub(r'\s+', '-', text))

    # Other functions: re.split(), re.findall(), re.compile(). Example of re.compile():
    re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
    print(re_telephone.match('010-8086').groups())
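
The notes above name re.split and re.findall without showing them, so here is a minimal sketch of both. The input strings are made up for illustration.

```python
import re

# re.split(): split on a pattern instead of a fixed separator.
parts = re.split(r'[,\s]+', 'a, b  c,d')
print(parts)

# re.findall(): return every non-overlapping match as a list of strings.
numbers = re.findall(r'\d+', 'tel: 010-8086')
print(numbers)

# With groups in the pattern, findall returns one tuple of groups per match.
pairs = re.findall(r'(\d{3})-(\d{3,8})', '010-8086 and 021-12345')
print(pairs)
```

Unlike re.match, which only looks at the start of the string, re.findall scans the whole string, which is why the last pattern drops the ^ and $ anchors.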
