These past few days I signed up for Alibaba Tianchi's "bus route passenger flow prediction" competition, so I took the opportunity to revisit my old Kaggle Titanic practice-competition code and refresh some data-handling basics. The task is to predict whether each Titanic passenger survived, given two data tables: titanic_test.csv and titanic_train.csv. First, the field descriptions:

  • PassengerId -- A numerical id assigned to each passenger.
  • Survived -- Whether the passenger survived (1), or didn't (0). We'll be making predictions for this column.
  • Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).
  • Name -- the name of the passenger.
  • Sex -- The gender of the passenger -- male or female.
  • Age -- The age of the passenger. Fractional.
  • SibSp -- The number of siblings and spouses the passenger had on board.
  • Parch -- The number of parents and children the passenger had on board.
  • Ticket -- The ticket number of the passenger.
  • Fare -- How much the passenger paid for the ticket.
  • Cabin -- Which cabin the passenger was in.
  • Embarked -- Where the passenger boarded the Titanic.
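Before any cleaning, it helps to count the missing values per column. A minimal sketch, using a tiny hypothetical stand-in DataFrame with the same columns (the real code below reads the competition CSVs instead):

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for titanic_train.csv, same column names
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# Count missing values per column -- in the real data, Age, Cabin and
# Embarked are the usual offenders, which is why they get filled below.
missing = sample.isnull().sum()
print(missing)
```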

Here is the Python processing code:

import re
import operator
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif

train = pd.read_csv("titanic_train.csv", dtype={"Age": np.float64})
test = pd.read_csv("titanic_test.csv", dtype={"Age": np.float64})

print("\n\nTop of the training data:")
print(train.head())
print("\n\nSummary statistics of training data")
print(train.describe())
# train.to_csv('copy_of_the_training_data.csv', index=False)

# Fill missing ages with a sentinel value
train["Age"] = train["Age"].fillna(-1)
test["Age"] = test["Age"].fillna(-1)

# Encode Sex as an integer
train.loc[train["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1
test.loc[test["Sex"] == "female", "Sex"] = 1

# Fill the missing Embarked values with the most common port, then encode
print(train["Embarked"].unique())
train["Embarked"] = train["Embarked"].fillna("S")
test["Embarked"] = test["Embarked"].fillna("S")
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2

# Fill missing fares with the median
train["Fare"] = train["Fare"].fillna(train["Fare"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

# Generating a FamilySize column
train["FamilySize"] = train["SibSp"] + train["Parch"]
test["FamilySize"] = test["SibSp"] + test["Parch"]

train["NameLength"] = train["Name"].apply(lambda x: len(x))
test["NameLength"] = test["Name"].apply(lambda x: len(x))

def get_title(name):
    # Use a regular expression to search for a title. Titles always consist of
    # capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles.
train_titles = train["Name"].apply(get_title)
test_titles = test["Name"].apply(get_title)

# Map each title to an integer. Some titles are very rare, and are compressed
# into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6,
                 "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10,
                 "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 9}
for k, v in title_mapping.items():
    train_titles[train_titles == k] = v
    test_titles[test_titles == k] = v

# Add in the Title column.
train["Title"] = train_titles
test["Title"] = test_titles

# A dictionary mapping family name to id
family_id_mapping = {}

# A function to get the id given a row
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
train_family_ids = train.apply(get_family_id, axis=1)
test_family_ids = test.apply(get_family_id, axis=1)

# There are a lot of family ids, so we compress all of the families
# with fewer than 3 members into one code.
train_family_ids[train["FamilySize"] < 3] = -1
test_family_ids[test["FamilySize"] < 3] = -1
train["FamilyId"] = train_family_ids
test["FamilyId"] = test_family_ids

alg = AdaBoostClassifier()
# alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked",
              "FamilySize", "Title", "FamilyId"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(train[predictors], train["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Train on the four strongest features and predict on the test set
predictors = ["Pclass", "Sex", "Fare", "Title"]
x_train = train[predictors]
y_train = train["Survived"]
x_test = test[predictors]
alg.fit(x_train, y_train)
predictions = alg.predict(x_test)

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions
})
submission.to_csv('submission.csv', index=False)
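Before writing submission.csv it is worth estimating accuracy locally with cross-validation rather than relying on leaderboard submissions. A minimal sketch of cross_val_score on the four chosen predictors, using synthetic stand-in data (the column values and the label rule here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(1)

# Synthetic stand-in for train[predictors] and train["Survived"]
x = pd.DataFrame({
    "Pclass": rng.randint(1, 4, 200),
    "Sex": rng.randint(0, 2, 200),
    "Fare": rng.uniform(5, 100, 200),
    "Title": rng.randint(1, 11, 200),
})
# Make the label depend on Sex and Pclass so the fit is non-trivial
y = ((x["Sex"] == 1) | (x["Pclass"] == 1)).astype(int)

# 3-fold cross-validated accuracy; with the real data, swap in
# train[predictors] and train["Survived"]
scores = cross_val_score(AdaBoostClassifier(), x, y, cv=3)
print(scores.mean())
```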

As an aside, here is a summary of the str.format() string formatting used above:

# coding=utf-8
# The str.format() function

# '{}' positional placeholders
print('what\'s {},{}'.format('wrong', 'hong!'))

# Indexed placeholders like {0}, {1}
print('{},I\'m {},my qq is {}'.format('hello', 'hong', ''))
print('{},I\'m {},my E-mail is {}'.format('Hello', 'Hongten', 'hongtenzone@foxmail.com'))
print('{1},{0}'.format('hello', 'world'))

# Named '{name}' placeholders
print('Hi,{name},{message}'.format(name="hongten", message='how are you?'))

# Format control:
import math
print('The value of PI is approximately {0:.3f}.'.format(math.pi))

table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678}
for name, phone in table.items():
    print('{0:10}==>{1:10d}'.format(name, phone))
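On Python 3.6+, f-strings accept the same format specs inline, which is usually shorter than str.format(). A quick sketch using the same examples:

```python
import math

# Same format-spec syntax as str.format(), written directly in the literal
print(f'The value of PI is approximately {math.pi:.3f}.')

name, phone = 'Jack', 4098
# {name:10} left-pads the string to width 10; {phone:10d} right-aligns the int
line = f'{name:10}==>{phone:10d}'
print(line)
```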

And some notes from studying Python's regular-expression module:

# coding=utf-8
import re

# re.match() only matches at the beginning of the string
s = 'I like hongten! he is so cool!'
m = re.match(r'(\w+)\s', s)
if m:
    # group(0) is the whole match; parentheses () in the pattern mark
    # the groups to extract
    print(m.group(0))
else:
    print('not match')

# re.search() scans the whole string for the first match
text = "JGood is a handsome boy, he is cool, clever, and so on..."
m = re.search(r'\shan(ds)ome\s', text)
if m:
    print(m.groups())
else:
    print("No search")

# re.sub() replaces every match in the string
print(re.sub(r'\s+', '-', text))

# re.compile() pre-compiles a pattern for reuse; re.split and re.findall
# are also worth knowing
re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
print(re_telephone.match('010-8086').groups())
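The re.split and re.findall functions mentioned in the comment above work like this (a quick sketch on the same example text):

```python
import re

text = "JGood is a handsome boy, he is cool, clever, and so on..."

# re.findall returns every non-overlapping match as a list:
# here, all words starting with 'c'
words = re.findall(r'c\w+', text)
print(words)

# re.split splits the string on every match of the pattern:
# here, on each comma plus trailing whitespace
parts = re.split(r',\s*', text)
print(parts)
```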
