


1. How Models Work


This step of capturing patterns from data is called fitting or training the model.

The data used to fit the model is called the training data.

After the model has been fit, you can apply it to new data to predict prices of additional homes.
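
As a rough illustration of the fit-then-predict cycle, here is a minimal sketch using scikit-learn's `DecisionTreeRegressor`. The tiny table of homes and prices is invented for this example and is not part of the Melbourne dataset.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: [rooms, land size in m^2] -> sale price
train_features = [[2, 150], [3, 200], [3, 400], [4, 450], [5, 600]]
train_prices = [300_000, 420_000, 480_000, 610_000, 800_000]

# Fitting (training): the model captures patterns from the training data
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(train_features, train_prices)

# Applying the fitted model to homes it has not seen before
new_homes = [[3, 250], [4, 500]]
predicted_prices = toy_model.predict(new_homes)
print(predicted_prices)
```

Each predicted price falls within the range of prices seen during training, because a decision tree predicts from the training values stored in its leaves.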

2. Basic Data Exploration


    import pandas as pd
    melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
    melbourne_data = pd.read_csv(melbourne_file_path)
    # Summary statistics for each numeric column
    melbourne_data.describe()




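count: the number of non-missing values

mean: the average

std: the standard deviation, which measures how spread out the values are numerically

min, 25%, 50%, 75%, max: sort each column from lowest to highest. The smallest value is min. The 25% value sits a quarter of the way through the list: it is larger than 25% of the values and smaller than 75% of them. The 50% and 75% values are defined the same way, and max is the largest value.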
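
To see these statistics on something small enough to check by hand, the sketch below runs `describe()` on a made-up column with one missing entry. The `toy_data` frame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical column of room counts with one missing entry
toy_data = pd.DataFrame({"Rooms": [1, 2, 3, 4, 5, np.nan]})

summary = toy_data["Rooms"].describe()
print(summary)
```

The count here is 5 rather than 6 because the missing value is skipped, and the 25%, 50%, and 75% rows come out as 2, 3, and 4.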

3. Your First Machine Learning Model

  • Selecting Data for Modeling


  import pandas as pd

  melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
  melbourne_data = pd.read_csv(melbourne_file_path)


  • Selecting The Prediction Target

Method: use dot-notation to select the prediction target.

  y = melbourne_data.Price

  • Choosing "Features"


  melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

  X = melbourne_data[melbourne_features]
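
The two selection styles above can be tried on a small stand-in table. This is only a sketch; the `homes` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of the housing table
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1, 2, 2],
    "Price": [300_000, 450_000, 600_000],
})

# Dot-notation and bracket notation return the same Series
y_dot = homes.Price
y_bracket = homes["Price"]
print(y_dot.equals(y_bracket))

# A list of column names selects a DataFrame with just those columns
feature_names = ["Rooms", "Bathroom"]
X_toy = homes[feature_names]
print(type(y_dot).__name__, type(X_toy).__name__, X_toy.shape)
```

Dot-notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. Bracket notation works for any column name.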






  • Building Your Model





    •   Define:

        What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.

    •   Fit:

        Capture patterns from provided data. This is the heart of modeling.

    •   Predict:

        Just what it sounds like.

    •   Evaluate:

        Determine how accurate the model's predictions are.


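    from sklearn.tree import DecisionTreeRegressor

    # Define the model. Setting random_state gives the same result on every run.
    melbourne_model = DecisionTreeRegressor(random_state=1)

    # Fit the model
    melbourne_model.fit(X, y)

    # Predict prices for the first 5 homes
    print(X.head())
    print(melbourne_model.predict(X.head()))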

4. Model Validation


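Use Mean Absolute Error (MAE). The prediction error for each home is:

    error = actual - predicted

MAE takes the absolute value of each error and then averages those absolute errors.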
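
As a quick check of the error formula, the sketch below computes MAE by hand and compares it with scikit-learn's `mean_absolute_error`. The four actual and predicted prices are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices for four homes
actual = np.array([500_000, 650_000, 720_000, 810_000])
predicted = np.array([480_000, 700_000, 720_000, 790_000])

# error = actual - predicted, then average the absolute values
errors = actual - predicted
mae_by_hand = np.mean(np.abs(errors))

mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_by_hand, mae_sklearn)
```

The errors are 20000, -50000, 0, and 20000, so both calculations give 22500. Taking absolute values stops positive and negative errors from canceling out.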

In practice, we split the data into two parts: one used for training, called the training data, and one used for validation, called the validation data.

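    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # Split both features and target into training and validation data.
    # random_state makes the split reproducible.
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

    # Define and fit the model on the training data only
    melbourne_model = DecisionTreeRegressor()
    melbourne_model.fit(train_X, train_y)

    # Predict on the validation data and score the predictions
    val_predictions = melbourne_model.predict(val_X)
    print(mean_absolute_error(val_y, val_predictions))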
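
To show why the split matters, here is a small sketch that scores the same tree on the data it was trained on and on held-out data. The data is synthetic (price as a noisy function of size), an assumption made so the example runs without the Melbourne file.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing data: price depends on size plus noise
rng = np.random.default_rng(0)
size = rng.uniform(50, 300, size=400)
price = 3_000 * size + rng.normal(0, 40_000, size=400)
features = size.reshape(-1, 1)

toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    features, price, random_state=0
)

model = DecisionTreeRegressor(random_state=0)
model.fit(toy_train_X, toy_train_y)

# Score on the data the model saw, and on the data it did not see
in_sample_mae = mean_absolute_error(toy_train_y, model.predict(toy_train_X))
validation_mae = mean_absolute_error(toy_val_y, model.predict(toy_val_X))
print(in_sample_mae, validation_mae)
```

The in-sample score looks far better than the validation score because an unrestricted tree can memorize its training rows. Only the validation score says anything about new homes.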

5. Underfitting and Overfitting

  • overfitting:     A model matches the training data almost perfectly, but does poorly on validation and other new data.
  • underfitting:   A model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data.

The more leaves we allow the model to make, the further it moves from underfitting toward overfitting.

  from sklearn.metrics import mean_absolute_error
  from sklearn.tree import DecisionTreeRegressor

  def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
      # Fit a tree limited to max_leaf_nodes leaves and return its validation MAE
      model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
      model.fit(train_X, train_y)
      preds_val = model.predict(val_X)
      mae = mean_absolute_error(val_y, preds_val)
      return mae

  # Compare MAE for different values of max_leaf_nodes
  for max_leaf_nodes in [5, 50, 500, 5000]:
      my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
      print(max_leaf_nodes, my_mae)

In the end, MAE turns out to be lowest when max_leaf_nodes is 500. Next, we switch to a different kind of model.
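
Rather than reading the best tree size off the printed output, the comparison can be stored in a dictionary and the minimum picked in code. The sketch below does this on synthetic data, so the size it selects says nothing about the Melbourne result. The helper name `validation_mae` and all the data are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption for this sketch)
rng = np.random.default_rng(1)
toy_X = rng.uniform(0, 10, size=(600, 2))
toy_y = (
    50_000 * toy_X[:, 0]
    + 20_000 * np.sin(toy_X[:, 1])
    + rng.normal(0, 30_000, size=600)
)
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

def validation_mae(max_leaf_nodes):
    """Fit a tree capped at max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

# Map each candidate size to its validation MAE, then take the lowest
candidate_sizes = [5, 50, 500, 5000]
scores = {size: validation_mae(size) for size in candidate_sizes}
best_size = min(scores, key=scores.get)
print(scores)
print(best_size)
```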

6. Random Forests

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.

    from sklearn.ensemble import RandomForestRegressor

    from sklearn.metrics import mean_absolute_error


    forest_model = RandomForestRegressor(random_state=1)

    # Fit on the training data before predicting
    forest_model.fit(train_X, train_y)

    melb_preds = forest_model.predict(val_X)
    print(mean_absolute_error(val_y, melb_preds))


One of the best features of Random Forest models is that they generally work reasonably well even without parameter tuning.
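
The averaging behind a random forest's prediction can be checked directly. This sketch assumes scikit-learn's `RandomForestRegressor`, which exposes its fitted trees through the `estimators_` attribute. The data is synthetic, and `n_estimators=25` is an arbitrary choice to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for this sketch)
rng = np.random.default_rng(2)
toy_X = rng.uniform(0, 10, size=(200, 3))
toy_y = toy_X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(toy_X, toy_y)

# Ask every component tree for its own prediction, then average them
new_rows = toy_X[:5]
per_tree = np.array([tree.predict(new_rows) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's prediction matches the average of its trees
print(np.allclose(averaged, forest.predict(new_rows)))
```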

7. Machine Learning Competitions

  • Build a Random Forest model with all of your data
  • Read in the "test" data, which doesn't include values for the target. Predict home values in the test data with your Random Forest model.
  • Submit those predictions to the competition and see your score.
  • Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.
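
The steps above can be sketched roughly as follows. Everything specific here is an assumption: the data is generated in memory instead of read from competition files, the feature names are hypothetical, and the `Id` and `SalePrice` column names should be replaced with whatever the competition's sample submission specifies.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Stand-ins for the competition's training and test files (hypothetical data)
rng = np.random.default_rng(3)
feature_names = ["LotArea", "BedroomAbvGr"]  # hypothetical feature columns
home_data = pd.DataFrame({
    "LotArea": rng.uniform(2_000, 15_000, size=100),
    "BedroomAbvGr": rng.integers(1, 6, size=100),
})
home_data["SalePrice"] = 20 * home_data["LotArea"] + 15_000 * home_data["BedroomAbvGr"]
test_data = pd.DataFrame({
    "Id": np.arange(1, 11),
    "LotArea": rng.uniform(2_000, 15_000, size=10),
    "BedroomAbvGr": rng.integers(1, 6, size=10),
})

# 1. Fit on all labeled data (no validation split is held out this time)
full_model = RandomForestRegressor(random_state=1)
full_model.fit(home_data[feature_names], home_data["SalePrice"])

# 2. Predict for the test rows, which have no target column
test_preds = full_model.predict(test_data[feature_names])

# 3. Write the predictions to a two-column submission file
submission = pd.DataFrame({"Id": test_data["Id"], "SalePrice": test_preds})
submission_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(submission_path, index=False)
print(submission.head())
```

Training on all of the data is reasonable at this stage because the model choice was already made using validation data, and more training rows usually help.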

