Kaggle & Titanic code
I recently signed up for Alibaba Tianchi's "Bus Route Passenger Flow Prediction" competition, so I took the opportunity to revisit the Kaggle Titanic tutorial competition code I had worked through before and get familiar again with some of the data processing. The task is to predict whether a passenger survived based on the passenger's information. Two data tables are provided: titanic_test.csv and titanic_train.csv. First, a description of the table fields:
PassengerId -- A numerical id assigned to each passenger.
Survived -- Whether the passenger survived (1) or didn't (0). We'll be making predictions for this column.
Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).
Name -- The name of the passenger.
Sex -- The gender of the passenger -- male or female.
Age -- The age of the passenger. Fractional.
SibSp -- The number of siblings and spouses the passenger had on board.
Parch -- The number of parents and children the passenger had on board.
Ticket -- The ticket number of the passenger.
Fare -- How much the passenger paid for the ticket.
Cabin -- Which cabin the passenger was in.
Embarked -- Where the passenger boarded the Titanic.
Below is the Python processing code:
from sklearn.ensemble import AdaBoostClassifier
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt

train = pd.read_csv("titanic_train.csv", dtype={"Age": np.float64})
test = pd.read_csv("titanic_test.csv", dtype={"Age": np.float64})

print("\n\nTop of the training data:")
print(train.head())
print("\n\nSummary statistics of training data")
print(train.describe())
#train.to_csv('copy_of_the_training_data.csv', index=False)

# Fill missing ages with a sentinel value
train["Age"] = train["Age"].fillna(-1)
test["Age"] = test["Age"].fillna(-1)

# Encode Sex as 0 (male) / 1 (female)
train.loc[train["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1
test.loc[test["Sex"] == "female", "Sex"] = 1

print(train["Embarked"].unique())

# Fill missing embarkation ports with the most common value ("S") and encode them
train["Embarked"] = train["Embarked"].fillna("S")
test["Embarked"] = test["Embarked"].fillna("S")
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2

# Fill missing fares with the median
train["Fare"] = train["Fare"].fillna(train["Fare"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

# Generating a familysize column
train["FamilySize"] = train["SibSp"] + train["Parch"]
test["FamilySize"] = test["SibSp"] + test["Parch"]

# Length of the passenger's name as a feature
train["NameLength"] = train["Name"].apply(lambda x: len(x))
test["NameLength"] = test["Name"].apply(lambda x: len(x))

import re

def get_title(name):
    # Use a regular expression to search for a title. Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
train_titles = train["Name"].apply(get_title)
test_titles = test["Name"].apply(get_title)

# Map each title to an integer. Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 9}
for k, v in title_mapping.items():
    train_titles[train_titles == k] = v
    test_titles[test_titles == k] = v

# Add in the title column.
train["Title"] = train_titles
test["Title"] = test_titles
#print(test["Title"])

import operator

# A dictionary mapping family name to id
family_id_mapping = {}

# A function to get the id given a row
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it if we don't have an id
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
train_family_ids = train.apply(get_family_id, axis=1)
test_family_ids = test.apply(get_family_id, axis=1)

# There are a lot of family ids, so we'll compress all of the families under 3 members into one code.
train_family_ids[train["FamilySize"] < 3] = -1
test_family_ids[test["FamilySize"] < 3] = -1
train["FamilyId"] = train_family_ids
test["FamilyId"] = test_family_ids

alg = AdaBoostClassifier()
#alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(train[predictors], train["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

print("#######")

# Keep only the strongest features and fit the classifier
predictors = ["Pclass", "Sex", "Fare", "Title"]
x_train = train[predictors]
y_train = train['Survived']
x_test = test[predictors]
alg.fit(x_train, y_train)
predictions = alg.predict(x_test)

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions
})
submission.to_csv('submission.csv', index=False)
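The script above writes submission.csv without checking the model locally first. As a quick sanity check, here is a minimal sketch (my own addition, not part of the original script) that estimates accuracy with cross-validation on the training data; it assumes alg, x_train and y_train are defined as above, and uses sklearn.model_selection (older sklearn versions expose cross_val_score from sklearn.cross_validation instead):

from sklearn.model_selection import cross_val_score

# Rough local estimate of accuracy before submitting.
# Assumes alg, x_train and y_train are defined exactly as in the script above.
cv_scores = cross_val_score(alg, x_train, y_train, cv=3)
print("Cross-validation accuracy: {0:.3f}".format(cv_scores.mean()))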
While I'm at it, a quick summary of the formatted string output with str.format() used above:
#coding=utf-8
# The str.format() function

# Using '{}' placeholders
print('what\'s {},{}'.format('wrong', 'hong!'))

# Using {0}, {1} positional placeholders
print('{},I\'m {},my qq is {}'.format('hello', 'hong', ''))
print('{},I\'m {},my E-mail is {}'.format('Hello', 'Hongten', 'hongtenzone@foxmail.com'))
print('{1},{0}'.format('hello', 'world'))

# Using '{name}' keyword placeholders
print('Hi,{name},{message}'.format(name="hongten", message='how are you?'))

# Format control:
import math
print('The value of PI is approximately {0:.3f}.'.format(math.pi))

table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678}
for name, phone in table.items():
    print('{0:10}==>{1:10d}'.format(name, phone))
There are also some notes from studying Python's regular expression module:
#coding=utf-8
import re

# re.match()
s = 'I like hongten! he is so cool!'
m = re.match(r'(\w+)\s', s)
if m:
    # In a regular expression, () marks the group(s) to be extracted
    print(m.group(0))
else:
    print('not match')

# re.search()
text = "JGood is a handsome boy, he is cool, clever, and so on..."
m = re.search(r'\shan(ds)ome\s', text)
if m:
    print(m.groups())
else:
    print("No search")

# re.sub -- replace the matched parts of a string
text = "JGood is a handsome boy, he is cool, clever, and so on..."
print(re.sub(r'\s+', '-', text))

# re.split  # re.findall  # re.compile
re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
print(re_telephone.match('010-8086').groups())
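re.split and re.findall are only named in the comment above without examples, so here is a quick sketch of both (my own addition, with made-up sample strings just for illustration):

# re.split -- split a string wherever the pattern matches
print(re.split(r'[\s,]+', 'a, b  c,d'))                 # ['a', 'b', 'c', 'd']

# re.findall -- return all non-overlapping matches as a list
print(re.findall(r'\d+', 'tel: 010-8086, qq: 12345'))   # ['010', '8086', '12345']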