这两天报名参加了阿里天池的’公交线路客流预测‘赛,就顺便先把以前看的kaggle的titanic的训练赛代码在熟悉下数据的一些处理。题目根据titanic乘客的信息来预测乘客的生还情况。给了titanic_test.csv和titanic_train.csv两数据表。首先是表的一些字段说明:

  • PassengerId -- A numerical id assigned to each passenger.
  • Survived -- Whether the passenger survived (1), or didn't (0). We'll be making predictions for this column.
  • Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).
  • Name -- the name of the passenger.
  • Sex -- The gender of the passenger -- male or female.
  • Age -- The age of the passenger. Fractional.
  • SibSp -- The number of siblings and spouses the passenger had on board.
  • Parch -- The number of parents and children the passenger had on board.
  • Ticket -- The ticket number of the passenger.
  • Fare -- How much the passenger paid for the ticker.
  • Cabin -- Which cabin the passenger was in.
  • Embarked -- Where the passenger boarded the Titanic.

下面是python处理代码:

 from sklearn.ensemble import AdaBoostClassifier
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt train = pd.read_csv("titanic_train.csv", dtype={"Age": np.float64})
test = pd.read_csv("titanic_test.csv", dtype={"Age": np.float64} ) print("\n\nTop of the training data:")
print(train.head()) print("\n\nSummary statistics of training data")
print(train.describe()) #train.to_csv('copy_of_the_training_data.csv', index=False) train["Age"]=train["Age"].fillna(-1)
test["Age"]=test["Age"].fillna(-1) train.loc[train["Sex"]=="male","Sex"]=0
test.loc[test["Sex"]=="male","Sex"]=0
train.loc[train["Sex"]=="female","Sex"]=1
test.loc[test["Sex"]=="female","Sex"]=1 print(train["Embarked"].unique())
train["Embarked"]=train["Embarked"].fillna("S")
test["Embarked"]=test["Embarked"].fillna("S") train.loc[train["Embarked"]=="S","Embarked"]=0
train.loc[train["Embarked"]=="C","Embarked"]=1
train.loc[train["Embarked"]=="Q","Embarked"]=2 test.loc[test["Embarked"]=="S","Embarked"]=0
test.loc[test["Embarked"]=="C","Embarked"]=1
test.loc[test["Embarked"]=="Q","Embarked"]=2 train["Fare"]=train["Fare"].fillna(train["Fare"].median())
test["Fare"]=test["Fare"].fillna(test["Fare"].median()) #Generating a familysize column
train["FamilySize"]=train["SibSp"]+train["Parch"]
test["FamilySize"]=train["SibSp"]+test["Parch"] train["NameLength"]=train["Name"].apply(lambda x:len(x))
test["NameLength"]=test["Name"].apply(lambda x:len(x)) import re def get_title(name):
# Use a regular expression to search for a title. Titles always consist of capital and lowercase letters, and end with a period.
title_search = re.search(' ([A-Za-z]+)\.', name)
# If the title exists, extract and return it.
if title_search:
return title_search.group(1)
return "" # Get all the titles and print how often each one occurs.
train_titles = train["Name"].apply(get_title)
test_titles=test["Name"].apply(get_title) # Map each title to an integer. Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2,"Dona":9}
for k,v in title_mapping.items():
train_titles[train_titles == k] =v
test_titles[test_titles==k]=v # Add in the title column.
train["Title"] = train_titles
test["Title"]= test_titles #print (test["Title"]) import operator # A dictionary mapping family name to id
family_id_mapping = {} # A function to get the id given a row
def get_family_id(row):
# Find the last name by splitting on a comma
last_name = row["Name"].split(",")[0]
# Create the family id
family_id = "{0}{1}".format(last_name, row["FamilySize"])
# Look up the id in the mapping
if family_id not in family_id_mapping:
if len(family_id_mapping) == 0:
current_id = 1
else:
# Get the maximum id from the mapping and add one to it if we don't have an id
current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
family_id_mapping[family_id] = current_id
return family_id_mapping[family_id] # Get the family ids with the apply method
train_family_ids = train.apply(get_family_id, axis=1)
test_family_ids = test.apply(get_family_id,axis=1) # There are a lot of family ids, so we'll compress all of the families under 3 members into one code.
train_family_ids[train["FamilySize"] < 3] = -1
test_family_ids[test["FamilySize"]<3]=-1 train["FamilyId"] = train_family_ids
test["FamilyId"]=test_family_ids alg = AdaBoostClassifier()
#alg=RandomForestClassifier(random_state=1,n_estimators=150,min_samples_split=4,min_samples_leaf=2)
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"] # Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(train[predictors], train["Survived"]) # Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_) # Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)),scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show() print("#######")
predictors = ["Pclass", "Sex", "Fare","Title"] x_train = train[predictors]
y_train = train['Survived']
x_test=test[predictors]
alg.fit(x_train,y_train)
predictions = alg.predict(x_test) submission = pd.DataFrame({
"PassengerId": test["PassengerId"],
"Survived": predictions
}) submission.to_csv('submission.csv', index=False)

顺便总结一上上面有用到的格式化字符串的输出结果str.format():

 #coding=utf-8
#str.format() 函数 #使用‘{}’占位符
print('what\'s {},{}'.format('wrong','hong!')) #使用{0},{1}形式的占位符
print ('{},I\'m {},my qq is {}'.format('hello','hong',''))
print('{},I\'m {},my E-mail is {}'.format('Hello','Hongten','hongtenzone@foxmail.com')) print ('{1},{0}'.format('hello','world')) #使用'{name}'形式的占位符
print('Hi,{name},{message}'.format(name="hongten",message='how are you?')) #格式控制:
import math
print('The value of PI is approximately {0:.3f}.'.format(math.pi)) table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678} for name,phone in table.items():
print('{0:10}==>{1:10d}'.format(name,phone))

还有就是python中正则表达式的模块学习:

#coding=utf-8
import re #re.match()
s='I like hongten! he is so cool!'
m=re.match(r'(\w+)\s',s)
if m:
print m.group(0) #print groups() 正则表达式中用 ()表示的是要提取的分组(Group)
else:
print 'not match' #re.seach()
text = "JGood is a handsome boy, he is cool, clever, and so on..."
m=re.search(r'\shan(ds)ome\s',text)
if m:
print m.groups()
else:
print "No serch" #re.sub 替换字符串的匹配项
import re
text = "JGood is a handsome boy, he is cool, clever, and so on..."
print re.sub(r'\s+', '-', text) #re.split #re.findall #re.compile
re_telephone=re.compile(r'^(\d{3})-(\d{3,8})$')
print re_telephone.match('010-8086').groups()

kaggle& titanic代码的更多相关文章

  1. kaggle Titanic心得

    Titanic是kaggle上一个练手的比赛,kaggle平台提供一部分人的特征,以及是否遇难,目的是预测另一部分人是否遇难.目前抽工作之余,断断续续弄了点,成绩为0.79426.在这个比赛过程中,接 ...

  2. Kaggle:Titanic: Machine Learning from Disaster

    一直想着抓取股票的变化,偶然的机会在看股票数据抓取的博客看到了kaggle,然后看了看里面的题,感觉挺新颖的,就试了试. 题目如图:给了一个train.csv,现在预测test.csv里面的Passa ...

  3. Kaggle Titanic补充篇

    1.关于年龄Age 除了利用平均数来填充,还可以利用正态分布得到一些随机数来填充,首先得到已知年龄的平均数mean和方差std,然后生成[ mean-std,  mean+std ]之间的随机数,然后 ...

  4. Kaggle Titanic solution 纯规则学习

    其实就是把train.csv拿出来看了看,找了找规律,调了调参数而已. 找到如下规律: 1.男的容易死,女的容易活 2.一等舱活,三等舱死 3.老人死,小孩活 4.兄弟姐妹多者死 5.票价高的活 6. ...

  5. kaggle Titanic

    # coding: utf-8 # In[19]: # 0.78468 # In[20]: import numpy as np import pandas as pd import warnings ...

  6. 逻辑回归应用之Kaggle泰坦尼克之灾(转)

    正文:14pt 代码:15px 1 初探数据 先看看我们的数据,长什么样吧.在Data下我们train.csv和test.csv两个文件,分别存着官方给的训练和测试数据. import pandas ...

  7. 机器学习案例学习【每周一例】之 Titanic: Machine Learning from Disaster

     下面一文章就总结几点关键: 1.要学会观察,尤其是输入数据的特征提取时,看各输入数据和输出的关系,用绘图看! 2.训练后,看测试数据和训练数据误差,确定是否过拟合还是欠拟合: 3.欠拟合的话,说明模 ...

  8. Kaggle 泰坦尼克

    入门kaggle,开始机器学习应用之旅. 参看一些入门的博客,感觉pandas,sklearn需要熟练掌握,同时也学到了一些很有用的tricks,包括数据分析和机器学习的知识点.下面记录一些有趣的数据 ...

  9. kaggle House_Price_XGBoost

    kaggle House_Price_final 代码 import numpy as np import pandas as pd from sklearn.ensemble import Rand ...

随机推荐

  1. Beta版本冲刺——day5

    No Bug 031402401鲍亮 031402402曹鑫杰 031402403常松 031402412林淋 031402418汪培侨 031402426许秋鑫 站立式会议 今日计划表 人员 工作 ...

  2. Memcached 数据缓存系统

    Memcached 数据缓存系统 常用命令及使用:http://www.cnblogs.com/wayne173/p/5652034.html Memcached是一个自由开源的,高性能,分布式内存对 ...

  3. [2014.01.27]wfTextImage 文字图像组件 1.6

    全新开发的文字转图像组件--wfTextImage,使用简单,功能强大,图像处理效果极佳.     将大段的文本内容转换成GIF图片.     有效防止文字内容被复制抄袭,有效保护文字资料.      ...

  4. resizable.js

    (function($){ var boundbar= { html:"<div class=\"boundbar\" style=\"overflow: ...

  5. SQL 导出表结构到Excel

    SQL 导出表结构到Excel SELECT 表名 then d.name else '' end, 表说明 then isnull(f.value,'') else '' end, 字段序号 = a ...

  6. SpringMVC实现查询功能

    1 web.xml <?xml version="1.0" encoding="UTF-8"?> <web-app xmlns:xsi=&qu ...

  7. <<面向模式的软件架构2-并发和联网对象模式>>读书笔记

    服务访问和配置模式 Wrapper Facade可以将有非对象API提供的函数和数据封装到面向对象的类接口中 就是把底层API再封装一次,让外部不用关心是调用哪个平台的API,不如锁,在不同的平台上可 ...

  8. ERROR: HHH000123: IllegalArgumentException in class: com.tt.hibernate.helloworld.News, setter method of property: date

    问题描述: Hibernate连接数据库时,报出如下错误: 十一月 29, 2016 3:08:31 下午 org.hibernate.tool.hbm2ddl.SchemaUpdate execut ...

  9. 关于语句#ifdef OS_GLOBALS #define OS_EXT #else #define OS_EXT extern #endif 的说明

    声明全局变量使用的技术——摘自uC/OS-II中文版 以下是如何定义全局 变量.众所周知,全局变量应该是得到内存分配且可以被其他模块通过C 语言中extern 关键字调用的变量.因此,必须在 .C 和 ...

  10. Convert Excel data to MDB file

    所需组件: microsoft ado ext. 2.8 for ddl and security 或者更新的组件. 添加: using ADOX;using System.Runtime.Inter ...