Week 1 Machine Learning with Big Data KNIME - GUI based Spark MLlib - inside Spark CRISP-DM Week 2, Data Exploration There are generally two approaches: summary statistics and visualization. Summary statistics (mean, median, mode: the most frequent value). High kurtosis suggests that outliers are present. visuali…
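As a rough illustration of the summary statistics mentioned above, the sketch below computes the mean, median, mode, and excess kurtosis of a small made-up sample, with and without one extreme value. The data and the helper `excess_kurtosis` are assumptions for illustration only and do not come from the course.

```python
import statistics

def excess_kurtosis(values):
    """Population excess kurtosis: E[(x - mean)^4] / variance^2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0

# Hypothetical sample; the value 60 plays the role of an outlier.
clean = [10, 11, 12, 12, 13, 14, 15]
with_outlier = clean + [60]

summary = {
    "mean": statistics.mean(with_outlier),
    "median": statistics.median(with_outlier),
    "mode": statistics.mode(with_outlier),
}
print(summary)

# The sample containing the outlier has the larger kurtosis.
print(excess_kurtosis(clean), excess_kurtosis(with_outlier))
```

Note how the outlier pulls the mean well away from the median, while the median and mode barely move.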
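Week 3 Classification KNN: the basic idea is that samples with similar input values probably belong to the same class. Decision Tree Naive Bayes Week 4 Evaluating model Over-fitting How to avoid overfitting when training a Decision Tree: Pre-Pruning and Post-Pruning. Pre-pruning has two stopping conditions: 1. the number of records at a node falls below a threshold, e.g. fewer than 20; 2. the purity reaches a certain value, e.g. …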
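The two pre-pruning stopping conditions can be sketched as a single check applied to a node before it is split. This is a minimal sketch, not the course's implementation: the function names, the majority-class definition of purity, and the `purity_threshold=0.95` default are assumptions, since the excerpt is cut off before it gives a purity value.

```python
from collections import Counter

def purity(labels):
    """Fraction of records at a node that belong to the majority class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def should_stop_splitting(labels, min_records=20, purity_threshold=0.95):
    """Return True if either pre-pruning stopping condition holds.

    min_records follows the "fewer than 20" example in the text;
    purity_threshold is a hypothetical value chosen for illustration.
    """
    if len(labels) < min_records:
        return True  # condition 1: too few records at this node
    return purity(labels) >= purity_threshold  # condition 2: pure enough

small_node = ["yes"] * 6 + ["no"] * 4
mixed_node = ["yes"] * 50 + ["no"] * 50
nearly_pure_node = ["yes"] * 97 + ["no"] * 3

print(should_stop_splitting(small_node))
print(should_stop_splitting(mixed_node))
print(should_stop_splitting(nearly_pure_node))
```

In scikit-learn, `DecisionTreeClassifier` exposes related pre-pruning parameters such as `min_samples_split` and `max_depth`.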
In machine learning, is more data always better than better algorithms? No. There are times when more data helps, there are times when it doesn't. Probably one of the most famous quotes defending the power of data is that of Google's Research Directo…
In this lesson, we will learn how to train a Naive Bayes classifier and a Logistic Regression classifier - basic machine learning algorithms - on JSON text data, and classify it into categories. While this dataset is still considered a small dataset…
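A minimal sketch of the workflow this lesson describes, assuming scikit-learn is available: JSON records are parsed, turned into bag-of-words counts, and fed to a Naive Bayes and a Logistic Regression classifier. The four records, the field names `text` and `category`, and the query sentences are made up for illustration and are not the lesson's dataset.

```python
import json

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical JSON-lines data; the real dataset is not shown in the excerpt.
raw = """
{"text": "the team won the football match", "category": "sports"}
{"text": "the striker scored a late goal", "category": "sports"}
{"text": "new laptop ships with a faster processor", "category": "tech"}
{"text": "the phone update fixes a software bug", "category": "tech"}
"""
records = [json.loads(line) for line in raw.strip().splitlines()]
texts = [r["text"] for r in records]
labels = [r["category"] for r in records]

# Each pipeline converts raw text to word counts, then fits a classifier.
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
lr_model = make_pipeline(CountVectorizer(), LogisticRegression())
nb_model.fit(texts, labels)
lr_model.fit(texts, labels)

queries = ["football team scored a goal", "software update for the laptop"]
nb_predictions = list(nb_model.predict(queries))
lr_predictions = list(lr_model.predict(queries))
print(nb_predictions)
print(lr_predictions)
```

With a dataset this small the predictions only show that the pipeline runs end to end; they say nothing about real accuracy.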
/ 20220404 Week 1 - 2 / Chapter 1 - Introduction 1.1 Definition Arthur Samuel The field of study that gives computers the ability to learn without being explicitly programmed. Tom Mitchell A computer program is said to learn from experience E with re…
In the previous article, "Data Preparation by Pandas and Scikit-Learn", we discussed a series of steps in data preparation. Scikit-Learn provides the Pipeline class to help with such sequences of transformations. The Pipeline constructor take…
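As a small sketch of such a sequence of transformations, the example below chains a median imputer and a standard scaler in one `Pipeline`, assuming scikit-learn and NumPy are installed. The step names and the toy array are assumptions for illustration, not the article's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A Pipeline is built from a list of (name, estimator) pairs.
# The step names here are arbitrary labels chosen for this sketch.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("std_scaler", StandardScaler()),               # zero mean, unit variance
])

# Hypothetical numeric data with one missing value.
X = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 400.0],
    [4.0, 300.0],
])

# fit_transform runs every step in order on the data.
X_prepared = num_pipeline.fit_transform(X)
print(X_prepared)
```

Calling `fit_transform` on the pipeline fits and applies each step in order, so the scaler only ever sees the imputed data.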
In this article, we discuss some of the main steps in data preparation. Drop Labels First, we drop the labels from the training set. Here we use the drop() method from the Pandas library. housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for traini…
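The label-dropping step can be tried on a tiny stand-in DataFrame. This is a sketch assuming pandas is installed: the three rows and the feature columns are made up, and `housing_labels` is a name introduced here for the separated label column.

```python
import pandas as pd

# Hypothetical stand-in for the stratified training set.
strat_train_set = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6],
    "housing_median_age": [41, 21, 52],
    "median_house_value": [452600.0, 358500.0, 341300.0],
})

# drop() returns a new DataFrame; strat_train_set itself is not modified.
housing = strat_train_set.drop("median_house_value", axis=1)

# Keep the labels separately so they can be used as training targets.
housing_labels = strat_train_set["median_house_value"].copy()

print(housing.columns.tolist())
print(housing_labels.tolist())
```

Because `drop()` returns a copy, the original training set still contains the label column afterwards.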
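The figure below shows how four different algorithms perform on training sets of different sizes. As the amount of data grows, the performance of the algorithms converges; that is, even a relatively weak algorithm can perform well when the amount of data is very large. Why learning algorithms perform well with a lot of data: with a very large training set, overfitting is unlikely, so the variance is low; and if the logistic regression or linear regression model is given many parameters (or many layers), the bias is low as well. Taken together, this yields a high-performance learning algorithm.…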
Before you can plot anything, you need to specify which backend Matplotlib should use. The simplest option is to use Jupyter’s magic command %matplotlib inline. This tells Jupyter to set up Matplotlib so it uses Jupyter’s own backend. Scatter Plot ho…
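These are notes on the dimensionality reduction chapter of the Coursera online course "Machine Learning". 14. Dimensionality Reduction 14.1 Motivation I: Data Compression This subsection introduces the second type of unsupervised learning method, dimensionality reduction, which is used to compress data. Compression not only reduces the disk space the data occupies but also makes programs run faster. In the example shown in the figure below, suppose a dataset has many features (although the figure plots only 2 of them). x1 is measured in cm and x2 in inches, and both measure length…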
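The redundancy between a feature in cm and a feature in inches can be made concrete with a small sketch. It projects each 2-D point onto the single direction along which the data lies and then reconstructs it. This is a projection onto a direction known in advance from the unit conversion, not PCA, and the lengths and the rounding used to mimic measurement error are assumptions for illustration.

```python
import math

CM_PER_INCH = 2.54

# Hypothetical lengths; x2 is rounded to one decimal to mimic
# the small disagreement between two separate measurements.
lengths_cm = [10.0, 25.4, 40.0, 63.5, 100.0]
points = [(cm, round(cm / CM_PER_INCH, 1)) for cm in lengths_cm]

# Unit vector along the line x1 = 2.54 * x2 on which the data (almost) lies.
norm = math.hypot(CM_PER_INCH, 1.0)
u = (CM_PER_INCH / norm, 1.0 / norm)

def compress(point):
    """Map a 2-D point to one number: its coordinate along u."""
    return point[0] * u[0] + point[1] * u[1]

def reconstruct(z):
    """Map the 1-D value back to an approximate 2-D point."""
    return (z * u[0], z * u[1])

z_values = [compress(p) for p in points]
max_error = max(
    math.dist(p, reconstruct(z)) for p, z in zip(points, z_values)
)
print(z_values)
print(max_error)
```

Each point is stored as one number instead of two, and the reconstruction error stays on the order of the rounding noise.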