• Introductory Overview
    • Regression Problems
    • Multivariate Adaptive Regression Splines
    • Model Selection and Pruning
    • Applications
• Technical Notes: The MARSplines Algorithm
• Technical Notes: The MARSplines Model

Introductory Overview

Multivariate Adaptive Regression Splines (MARSplines) is an implementation of techniques popularized by Friedman (1991) for solving regression-type problems (see also Multiple Regression), with the main purpose of predicting the values of a continuous dependent or outcome variable from a set of independent or predictor variables. A large number of methods are available for fitting models to continuous variables, such as linear regression [e.g., Multiple Regression, General Linear Model (GLM)], nonlinear regression (Generalized Linear/Nonlinear Models), regression trees (see Classification and Regression Trees), CHAID, Neural Networks, etc. (see also Hastie, Tibshirani, and Friedman, 2001, for an overview).

MARSplines is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARSplines constructs this relation from a set of coefficients and basis functions that are entirely "driven" from the regression data. In a sense, the method is based on the "divide and conquer" strategy, which partitions the input space into regions, each with its own regression equation. This makes MARSplines particularly suitable for problems with higher input dimensions (i.e., with more than 2 variables), where the curse of dimensionality would likely create problems for other techniques.

The MARSplines technique has become particularly popular in the area of data mining because it does not assume or impose any particular type or class of relationship (e.g., linear, logistic, etc.) between the predictor variables and the dependent (outcome) variable of interest. Instead, useful models (i.e., models that yield accurate predictions) can be derived even in situations where the relationship between the predictors and the dependent variables is non-monotone and difficult to approximate with parametric models. For more information about this technique and how it compares to other methods for nonlinear regression (or regression trees), see Hastie, Tibshirani, and Friedman (2001).

Regression Problems

Regression problems involve determining the relationship between a set of dependent variables (also called output, outcome, or response variables) and one or more independent variables (also known as input or predictor variables). The dependent variable is the one whose values you want to predict based on the values of the independent (predictor) variables. For instance, one might be interested in the number of car accidents on the roads, which can be caused by 1) bad weather and 2) drunk driving. In this case one might write, for example,

Number_of_Accidents = Some_Constant + 0.5*Bad_Weather + 2.0*Drunk_Driving

The variable Number of Accidents is the dependent variable that is thought to be caused by (among other variables) Bad Weather and Drunk Driving (hence the name dependent variable). Note that the independent variables are multiplied by factors, i.e., 0.5 and 2.0. These are known as regression coefficients. The larger these coefficients, the stronger the influence of the independent variables on the dependent variable. If the two predictors in this simple (fictitious) example were measured on the same scale (e.g., if the variables were standardized to a mean of 0.0 and standard deviation 1.0), then Drunk Driving could be inferred to contribute 4 times more to car accidents than Bad Weather. (If the variables are not measured on the same scale, then direct comparisons between these coefficients are not meaningful, and, usually, some other standardized measure of predictor "importance" is included in the results.)
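
As an illustration, the coefficients of such a linear model can be estimated by ordinary least squares. The following is a minimal Python sketch with made-up, standardized data (all variable names are our own):

    import numpy as np

    # Made-up, standardized data: the columns of X play the roles of
    # Bad_Weather and Drunk_Driving; y plays Number_of_Accidents.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 2))
    y = 10.0 + 0.5 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

    # Prepend an intercept column and solve the least-squares problem.
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print(coef)  # approximately [10.0, 0.5, 2.0]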

For additional details regarding these types of statistical models, refer to Multiple Regression or General Linear Models (GLM), as well as General Regression Models (GRM). In general, regression procedures are widely used in research in the social and natural sciences. Regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, educational researchers might want to learn what the best predictors of success in high school are. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of multiple social indicators best predict whether a new immigrant group will adapt and be absorbed into society.

Multivariate Adaptive Regression Splines

The car accident example we considered previously is a typical application for linear regression, where the response variable is hypothesized to depend linearly on the predictor variables. Linear regression also falls into the category of so-called parametric regression, which assumes that the nature of the relationships (but not the specific parameters) between the dependent and independent variables is known a priori (e.g., is linear). By contrast, nonparametric regression (see Nonparametrics) does not make any such assumption as to how the dependent variables are related to the predictors. Instead, it allows the regression function to be "driven" directly from data.

Multivariate Adaptive Regression Splines is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARSplines constructs this relation from a set of coefficients and so-called basis functions that are entirely determined from the regression data. You can think of the general "mechanism" by which the MARSplines algorithm operates as multiple piecewise linear regression (see Nonlinear Estimation), where each breakpoint (estimated from the data) defines the "region of application" for a particular (very simple) linear regression equation.

Basis functions. Specifically, MARSplines uses two-sided truncated functions of the form

(x - t)+ = max(0, x - t)        (t - x)+ = max(0, t - x)

as basis functions for linear or nonlinear expansion, which approximates the relationships between the response and predictor variables.

[Figure: the two mirror-image basis functions (t - x)+ and (x - t)+, plotted against x; adapted from Hastie et al., 2001, Figure 9.9.]

The parameter t is the knot of the basis functions (defining the "pieces" of the piecewise linear regression); these knots are also estimated from the data. The "+" signs next to the terms (t - x) and (x - t) denote that only positive results of the respective expressions are kept; otherwise the respective functions evaluate to zero, as the max(0, .) notation above makes explicit.
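
As a minimal sketch (the knot value t = 0.5 is just an example), this pair of basis functions can be computed as follows:

    import numpy as np

    def hinge_right(x, t):
        # (x - t)+ : zero to the left of the knot t, linear to the right
        return np.maximum(0.0, x - t)

    def hinge_left(x, t):
        # (t - x)+ : linear to the left of the knot t, zero to the right
        return np.maximum(0.0, t - x)

    x = np.linspace(0.0, 1.0, 5)   # [0., 0.25, 0.5, 0.75, 1.]
    print(hinge_left(x, 0.5))      # [0.5, 0.25, 0., 0., 0.]
    print(hinge_right(x, 0.5))     # [0., 0., 0., 0.25, 0.5]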

The MARSplines model. The basis functions, together with the model parameters (estimated via least squares estimation), are combined to produce the predictions given the inputs. The general MARSplines model equation (see Hastie et al., 2001, equation 9.19) is given as:

f(X) = β_0 + Σ_{m=1}^{M} β_m h_m(X)

where the summation is over the M nonconstant terms in the model (further details regarding the model are also provided in Technical Notes). To summarize, y is predicted as a function of the predictor variables X (and their interactions); this function consists of an intercept parameter (β_0) and the weighted (by β_m) sum of one or more basis functions h_m(X), of the kind illustrated earlier. You can also think of this model as "selecting" a weighted sum of basis functions from the set of (a large number of) basis functions that span all values of each predictor (i.e., that set would consist of one basis function, and parameter t, for each distinct value of each predictor variable). The MARSplines algorithm then searches over the space of all inputs and predictor values (knot locations t), as well as over interactions between variables. During this search, an increasingly larger number of basis functions is added to the model (selected from the set of possible basis functions) to maximize an overall least squares goodness-of-fit criterion. As a result of these operations, MARSplines automatically determines the most important independent variables as well as the most significant interactions among them. The details of this algorithm are further described in Technical Notes, as well as in Hastie et al. (2001).
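
To make the equation concrete, here is a small sketch that evaluates a MARSplines-style model with two basis functions; the knots and coefficients are made up for illustration:

    import numpy as np

    def hinge(x, t, sign):
        # sign = +1 gives (x - t)+, sign = -1 gives (t - x)+
        return np.maximum(0.0, sign * (x - t))

    # Assumed (illustrative) model:
    # f(x) = b0 + b1*(x - 0.3)+ + b2*(0.6 - x)+
    b0, b1, b2 = 1.0, 2.5, -1.2

    def predict(x):
        return b0 + b1 * hinge(x, 0.3, +1) + b2 * hinge(x, 0.6, -1)

    print(predict(np.array([0.1, 0.5, 0.9])))  # [0.4, 1.38, 2.5]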

Categorical predictors. In practice, both continuous and categorical predictors can be used and will often yield useful results. However, the basic MARSplines algorithm assumes that the predictor variables are continuous in nature; for example, the knots computed by the program will usually not coincide with actual class codes found in the categorical predictors. For a detailed discussion of categorical predictor variables in MARSplines, see Friedman (1993).

Multiple dependent (outcome) variables. The MARSplines algorithm can be applied to multiple dependent (outcome) variables. In this case, the algorithm will determine a common set of basis functions in the predictors but estimate different coefficients for each dependent variable. This method of treating multiple outcome variables is not unlike some neural network architectures, where multiple outcome variables can be predicted from common neurons and hidden layers; in the case of MARSplines, multiple outcome variables are predicted from common basis functions, with different coefficients.
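
A brief sketch of this idea: once a common basis matrix has been built, least squares can estimate a separate coefficient column for each outcome in a single solve (the basis matrix below is random stand-in data):

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.standard_normal((50, 4))   # stand-in for a common basis matrix
    Y = rng.standard_normal((50, 3))   # three dependent (outcome) variables

    # lstsq accepts a matrix right-hand side: one coefficient column per
    # outcome, all sharing the same set of basis functions.
    coefs, *_ = np.linalg.lstsq(B, Y, rcond=None)
    print(coefs.shape)                 # (4, 3)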

MARSplines and classification problems. Because MARSplines can handle multiple dependent variables, it is easy to apply the algorithm to classification problems as well. First, code the classes in the categorical response variable into multiple indicator variables (e.g., 1 = observation belongs to class k, 0 = observation does not belong to class k); then apply the MARSplines algorithm to fit a model and compute predicted (continuous) values or scores; finally, for prediction, assign each case to the class for which the highest score is predicted (see also Hastie, Tibshirani, and Friedman, 2001, for a description of this procedure). Note that this type of application will yield heuristic classifications that may work very well in practice but are not based on a statistical model for deriving classification probabilities.
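
A minimal sketch of this indicator-coding scheme (the class labels and scores below are made up):

    import numpy as np

    # Code a categorical response into one indicator column per class.
    classes = np.array([0, 1, 2, 1])
    indicators = (classes[:, None] == np.arange(3)[None, :]).astype(float)
    # indicators[i, k] is 1.0 if case i belongs to class k, else 0.0

    # 'scores' stands in for the continuous MARSplines predictions,
    # one column per class indicator.
    scores = np.array([[0.9, 0.2, 0.1],
                       [0.1, 0.7, 0.4],
                       [0.2, 0.3, 0.8],
                       [0.4, 0.6, 0.2]])
    predicted_class = scores.argmax(axis=1)
    print(predicted_class)             # [0 1 2 1]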

Model Selection and Pruning

In general, nonparametric models are adaptive and can exhibit a high degree of flexibility that may ultimately result in overfitting if no measures are taken to counteract it. Although such models can achieve zero error on training data, they tend to perform poorly when presented with new observations or instances (i.e., they do not generalize well to the prediction of "new" cases). MARSplines, like most methods of this kind, tends to overfit the data as well. To combat this problem, MARSplines uses a pruning technique (similar to pruning in classification trees) to limit the complexity of the model by reducing the number of its basis functions.

MARSplines as a predictor (feature) selection method. This feature - the selection and pruning of basis functions - makes this method a very powerful tool for predictor selection. The MARSplines algorithm will pick only those basis functions (and those predictor variables) that make a "sizeable" contribution to the prediction (refer to Technical Notes for details).

Applications

Multivariate Adaptive Regression Splines have become very popular recently for finding predictive models for "difficult" data mining problems, i.e., when the predictor variables do not exhibit simple and/or monotone relationships to the dependent variable of interest. Alternative models or approaches that you can consider for such cases are CHAID, Classification and Regression Trees, or any of the many Neural Networks architectures available. Because of the specific manner in which MARSplines selects predictors (basis functions) for the model, it generally does "well" in situations where regression-tree models are also appropriate, i.e., where hierarchically organized successive splits on the predictor variables yield good (accurate) predictions. In fact, instead of considering this technique as a generalization of multiple regression (as presented in this introduction), you can consider MARSplines a generalization of regression trees, in which the "hard" binary splits are replaced by "smooth" basis functions. Refer to Hastie, Tibshirani, and Friedman (2001) for additional details.


Technical Notes: The MARSplines Algorithm

Implementing MARSplines involves a two-step procedure that is applied successively until a desired model is found. In the first step, we build the model, i.e., increase its complexity by adding basis functions until a preset (user-defined) maximum level of complexity has been reached. Then we begin a backward procedure to remove the least significant basis functions from the model, i.e., those whose removal will lead to the smallest reduction in the (least squares) goodness of fit. The algorithm is implemented as follows (a code sketch appears after the steps):

  1. Start with the simplest model involving only the constant basis function.

  2. Search the space of basis functions, for each variable and for all possible knots, and add those which maximize a certain measure of goodness of fit (minimize prediction error).

  3. Step 2 is applied repeatedly until a model of pre-determined maximum complexity is derived.

  4. Finally, a pruning procedure is applied in which those basis functions that contribute least to the overall (least squares) goodness of fit are removed.
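
Below is a deliberately simplified Python sketch of these four steps. It omits interaction terms, searches knots exhaustively over the observed values, and prunes to a fixed final size rather than by a cross-validation criterion; all names are our own:

    import numpy as np

    def hinge(x, t, sign):
        # Two-sided truncated basis: (x - t)+ for sign=+1, (t - x)+ for sign=-1
        return np.maximum(0.0, sign * (x - t))

    def rss(B, y):
        # Residual sum of squares of the least-squares fit of y on columns of B
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        resid = y - B @ coef
        return float(resid @ resid)

    def mars_sketch(X, y, max_terms=7, final_terms=3):
        n, p = X.shape
        B = np.ones((n, 1))                    # step 1: constant basis only
        while B.shape[1] + 2 <= max_terms:     # step 3: grow to max complexity
            best = None
            for j in range(p):                 # step 2: search all variables
                for t in np.unique(X[:, j]):   # ... and all candidate knots
                    cand = np.column_stack(
                        [B, hinge(X[:, j], t, +1), hinge(X[:, j], t, -1)])
                    err = rss(cand, y)
                    if best is None or err < best[0]:
                        best = (err, j, t)
            _, j, t = best
            B = np.column_stack(
                [B, hinge(X[:, j], t, +1), hinge(X[:, j], t, -1)])
        while B.shape[1] > final_terms:        # step 4: backward pruning
            errs = [rss(np.delete(B, k, 1), y) for k in range(1, B.shape[1])]
            B = np.delete(B, 1 + int(np.argmin(errs)), 1)
        return B

    # Tiny illustration on made-up data
    rng = np.random.default_rng(2)
    X = rng.uniform(0.0, 1.0, (40, 2))
    y = np.maximum(0.0, X[:, 0] - 0.5) + 0.1 * rng.standard_normal(40)
    print(mars_sketch(X, y).shape)             # (40, 3)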


Technical Notes: The Multivariate Adaptive Regression Splines (MARSplines) Model

The MARSplines algorithm builds models from two-sided truncated functions of the predictors (x) of the form:

(x - t)+ = max(0, x - t)        (t - x)+ = max(0, t - x)

These serve as basis functions for a linear or nonlinear expansion that approximates some true underlying function f(x).

The MARSplines model for a dependent (outcome) variable y, and M terms H_{km}, can be summarized in the following equation:

y = f(x) = β_0 + Σ_{m=1}^{M} β_m H_{km}(x_{v(k,m)})

where the summation is over the M terms in the model, and β_0 and β_m are parameters of the model (along with the knots t for each basis function, which are also estimated from the data). Function H is defined as:

H_{km}(x_{v(k,m)}) = Π_{k=1}^{K} h_{km}

where x_{v(k,m)} is the predictor in the k'th factor of the m'th product, and each h_{km} is a basis function of the form shown above. For an order of interactions K = 1, the model is additive; for K = 2, the model is pairwise interactive.
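
A short sketch of such a product term with K = 2 (variable indices, knots, and signs below are illustrative):

    import numpy as np

    def hinge(x, t, sign):
        return np.maximum(0.0, sign * (x - t))

    def H(X, factors):
        # One term H_km: the product of K one-dimensional basis functions,
        # each applied to its own predictor x_v(k,m) with its own knot t.
        out = np.ones(X.shape[0])
        for v, t, sign in factors:
            out *= hinge(X[:, v], t, sign)
        return out

    X = np.array([[0.2, 0.8],
                  [0.7, 0.4]])
    # Pairwise interaction term (x0 - 0.5)+ * (0.6 - x1)+
    print(H(X, [(0, 0.5, +1), (1, 0.6, -1)]))  # [0.   0.04]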

During the forward stepwise phase, basis functions are added to the model up to a pre-determined maximum number, which should be considerably larger (at least twice as large) than the size of the optimal model (the best least-squares fit).

After implementing the forward stepwise selection of basis functions, a backward procedure is applied in which the model is pruned by removing the basis functions associated with the smallest increase in the (least squares) error. A least squares error function (the inverse of the goodness of fit) is computed. The so-called Generalized Cross Validation (GCV) error is a measure of the goodness of fit that takes into account not only the residual error but also the model complexity. It is given by

GCV = (1/N) Σ_{i=1}^{N} (y_i - f(x_i))² / (1 - C(M)/N)²

with

C(M) = d + c·M

where N is the number of cases in the data set, d is the effective degrees of freedom, which is equal to the number of independent basis functions, M is the number of knots selected by the forward procedure, and c is the penalty for adding a basis function (knot). Experiments have shown that the best value for c lies in the range 2 ≤ c ≤ 3 (see Hastie et al., 2001).
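
A short Python sketch of this criterion, under the same assumptions about the form of C(M) noted above (implementations differ in how they count effective parameters):

    import numpy as np

    def gcv(y, y_hat, d, n_knots, c=3.0):
        # Generalized Cross Validation error: a penalized mean squared error.
        # d = effective degrees of freedom (number of independent basis
        # functions), n_knots = knots chosen by the forward pass, and c is
        # the penalty per knot (commonly between 2 and 3); C = d + c*n_knots
        # is the assumed complexity term from the reconstruction above.
        N = len(y)
        C = d + c * n_knots
        resid = y - y_hat
        return float(resid @ resid) / N / (1.0 - C / N) ** 2

    rng = np.random.default_rng(3)
    y = rng.standard_normal(100)
    print(gcv(y, 0.9 * y, d=5, n_knots=4))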


References

Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1), 1-67.

Friedman, J. H. (1993). Fast MARS. Technical report, Stanford University, Department of Statistics.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
