Study Notes on scikit-learn
scikit-learn: machine learning in Python — scikit-learn 0.20.0 documentation
- https://scikit-learn.org/stable/index.html
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
scikit-learn - Wikipedia
- https://en.wikipedia.org/wiki/Scikit-learn
- Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language.[3] It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Installing scikit-learn — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/install.html
Check scikit-learn version
- import sklearn
- print(sklearn.__version__)
A complete overview of the machine learning package scikit-learn - AI遇见机器学习 (WeChat article)
- https://mp.weixin.qq.com/s/RVPiCHl70uCqwqKINJ9Xtg
Classification
- Identifying the category to which an object belongs.
- Applications: Spam detection, Image recognition.
- Algorithms: SVM, nearest neighbors, random forest, …
1. Supervised learning — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
1.6. Nearest Neighbors — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification
- Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
- scikit-learn implements two different nearest neighbors classifiers:
  - KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.
  - RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
- Nearest Neighbors Classification — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py
- Sample usage of Nearest Neighbors classification. It will plot the decision boundaries for each class.
- sklearn.neighbors.KNeighborsClassifier — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
- class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
- Classifier implementing the k-nearest neighbors vote.
- fit(X, y): Fit the model using X as training data and y as target values.
- predict(X): Predict the class labels for the provided data.
- score(X, y, sample_weight=None): Returns the mean accuracy on the given test data and labels.
  - In multi-label classification, this is the subset accuracy, which is a harsh metric since you require for each sample that each label set be correctly predicted.
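A minimal sketch of chaining fit / predict / score together (the iris dataset, the test split and n_neighbors=5 below are illustrative choices, not taken from the docs above):

```python
# Sketch: train a k-NN classifier, predict labels, and get its mean accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # k nearest neighbors, majority vote
knn.fit(X_train, y_train)                  # stores the training instances
print(knn.predict(X_test[:5]))             # class labels for the first 5 queries
print(knn.score(X_test, y_test))           # mean accuracy on the test set
```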
Classifier comparison — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
- A comparison of several classifiers in scikit-learn on synthetic datasets. The point of this example is to illustrate the nature of decision boundaries of different classifiers. This should be taken with a grain of salt, as the intuition conveyed by these examples does not necessarily carry over to real datasets.
- sklearn.datasets.make_classification — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn-datasets-make-classification
- sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
- Generate a random n-class classification problem.
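A quick usage sketch (all parameter values below are illustrative, not recommendations):

```python
# Sketch: generate a toy 2-class problem with 4 features.
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=1, n_classes=2, random_state=0)
print(X.shape)  # (200, 4)
print(y[:10])   # labels are integers in {0, 1}
```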
- Decision boundary - Wikipedia
- https://en.wikipedia.org/wiki/Decision_boundary
- In a statistical-classification problem with two classes, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.
- A decision boundary is the region of a problem space in which the output label of a classifier is ambiguous.[1]
- If the decision surface is a hyperplane, then the classification problem is linear, and the classes are linearly separable.
- Decision boundaries are not always clear cut. That is, the transition from one class in the feature space to another is not discontinuous, but gradual. This effect is common in fuzzy logic based classification algorithms, where membership in one class or another is ambiguous.
- What is the difference between decision_function, predict_proba, and predict function for logistic regression problem? - Cross Validated
- https://stats.stackexchange.com/questions/329857/what-is-the-difference-between-decision-function-predict-proba-and-predict-fun
- What's the difference between predict_proba and decision_function in scikit-learn? - Stack Overflow
- https://stackoverflow.com/questions/36543137/whats-the-difference-between-predict-proba-and-decision-function-in-scikit-lear
- python - Scikit Learn SVC decision_function and predict - Stack Overflow
- https://stackoverflow.com/questions/20113206/scikit-learn-svc-decision-function-and-predict
- machine learning - Negative decision_function values - Stack Overflow
- https://stackoverflow.com/questions/46820154/negative-decision-function-values
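A rough sketch of how the three methods relate for binary logistic regression, the case the links above discuss (the toy data and model settings are assumptions for illustration):

```python
# Sketch: decision_function vs. predict_proba vs. predict for logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression(solver='lbfgs').fit(X, y)

scores = clf.decision_function(X[:3])  # signed distance to the separating hyperplane
probs = clf.predict_proba(X[:3])       # class probabilities via the logistic sigmoid
labels = clf.predict(X[:3])            # hard labels

# For binary logistic regression the three are linked:
# predict_proba[:, 1] == sigmoid(decision_function), predict == (score > 0)
print(np.allclose(probs[:, 1], 1 / (1 + np.exp(-scores))))  # True
print(np.array_equal(labels, (scores > 0).astype(int)))     # True
```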
Model selection
- Comparing, validating and choosing parameters and models.
- Goal: Improved accuracy via parameter tuning
- Modules: grid search, cross validation, metrics.
3. Model selection and evaluation — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/model_selection.html#model-selection
3.1. Cross-validation: evaluating estimator performance — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/cross_validation.html
- sklearn.model_selection.train_test_split — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- sklearn.model_selection.train_test_split(*arrays, **options)
- Split arrays or matrices into random train and test subsets.
- Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a one-liner.
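A minimal usage sketch (the shapes and random_state are arbitrary):

```python
# Sketch: split a 5-sample dataset into train and test subsets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = list(range(5))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape)  # (3, 2) (2, 2)
```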
3.3. Model evaluation: quantifying the quality of predictions — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/model_evaluation.html
- There are 3 different APIs for evaluating the quality of a model's predictions:
  - Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. This is not discussed on this page, but in each estimator's documentation.
  - Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section The scoring parameter: defining model evaluation rules.
  - Metric functions: The metrics module implements functions assessing prediction error for specific purposes. These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics.
- Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions.
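A small sketch contrasting the three APIs on one model (the iris dataset and the LogisticRegression settings are illustrative assumptions):

```python
# Sketch: the same accuracy computed three ways.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)

# 1. Estimator score method (default criterion, mean accuracy for classifiers)
print(clf.fit(X, y).score(X, y))

# 2. Scoring parameter, used by cross-validation tools
print(cross_val_score(clf, X, y, cv=5, scoring='accuracy'))

# 3. Metric function from the metrics module
print(accuracy_score(y, clf.predict(X)))
```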
- 3.3.2. Classification metrics
- 3.3.2.2. Accuracy score
- sklearn.metrics.accuracy_score — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
- sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
- Accuracy classification score.
- In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
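A minimal sketch of both normalize modes (toy labels):

```python
# Sketch: accuracy_score counts the fraction (or number) of exact matches.
from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]
print(accuracy_score(y_true, y_pred))                   # 0.5
print(accuracy_score(y_true, y_pred, normalize=False))  # 2 correct samples
```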
- 3.3.2.5. Confusion matrix
- sklearn.metrics.confusion_matrix — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
- sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
- Compute confusion matrix to evaluate the accuracy of a classification
- 3.3.2.6. Classification report
- The classification_report function builds a text report showing the main classification metrics.
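A minimal sketch (the labels and target_names are made up for illustration):

```python
# Sketch: per-class precision, recall and F1 in one text report.
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
print(classification_report(y_true, y_pred,
                            target_names=['class 0', 'class 1', 'class 2']))
```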
- 3.3.2.9. Precision, recall and F-measures
- https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures
- Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.
- sklearn.metrics.precision_score — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn-metrics-precision-score
- sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
- Compute the precision.
- The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
- The best value is 1 and the worst value is 0.
- sklearn.metrics.recall_score — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn-metrics-recall-score
- sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
- Compute the recall.
- The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
- The best value is 1 and the worst value is 0.
- The F-measure (Fβ and F1 measures) can be interpreted as a weighted harmonic mean of the precision and recall. An Fβ measure reaches its best value at 1 and its worst score at 0. With β=1, Fβ and F1 are equivalent, and the recall and the precision are equally important.
- sklearn.metrics.f1_score — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn-metrics-f1-score
- sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
- Compute the F1 score, also known as balanced F-score or F-measure.
- The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:
- F1 = 2 * (precision * recall) / (precision + recall)
- In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter.
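A small sketch tying the three metrics to the tp/fp/fn counts (toy labels, default binary average):

```python
# Sketch: precision, recall and F1 computed from the same toy predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 1]
# For the positive class (pos_label=1): tp=3, fp=1, fn=1
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 2 * 0.75 * 0.75 / (0.75 + 0.75) = 0.75
```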
- The precision_recall_curve function computes a precision-recall curve from the ground truth label and a score given by the classifier by varying a decision threshold.
- sklearn.metrics.precision_recall_curve — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve
- sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)
- The average_precision_score function computes the average precision (AP) from prediction scores. The value is between 0 and 1 and higher is better. AP is defined as
- AP = sum_n (R_n - R_{n-1}) * P_n
- where P_n and R_n are the precision and recall at the n-th threshold. With random predictions, the AP is the fraction of positive samples.
- sklearn.metrics.average_precision_score — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score
- sklearn.metrics.average_precision_score(y_true, y_score, average='macro', pos_label=1, sample_weight=None)
- Area under the precision-recall curve
- Precision-Recall — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
- Example of Precision-Recall metric to evaluate classifier output quality.
- Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned.
- The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).
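A minimal sketch from invented scores (the values are illustrative, not from any real classifier):

```python
# Sketch: precision-recall curve and its AP summary from raw scores.
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(precision)  # precision at each decision threshold
print(recall)     # recall at each decision threshold
print(average_precision_score(y_true, y_scores))  # single-number summary (AP)
```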
- 3.3.2.13. Receiver operating characteristic (ROC)
- https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc
- The function roc_curve computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia:
- "A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate."
- This function requires the true binary value and the target scores, which can either be probability estimates of the positive class, confidence values, or binary decisions.
- sklearn.metrics.roc_curve — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
- sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
- The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the ROC curve, the curve information is summarized in one number. For more information see the Wikipedia article on AUC.
- sklearn.metrics.roc_auc_score — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
- sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
- Receiver Operating Characteristic (ROC) — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
- Example of Receiver Operating Characteristic (ROC) metric to evaluate classifier output quality.
- ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.
- The “steepness” of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing the false positive rate.
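A minimal sketch with invented scores (the same toy values as the precision-recall sketch above):

```python
# Sketch: compute the ROC curve points and the AUC from raw scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)                         # points of the ROC curve
print(roc_auc_score(y_true, y_scores))  # 0.75 for these toy scores
```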
Preprocessing
- Feature extraction and normalization.
- Application: Transforming input data such as text for use with machine learning algorithms.
- Modules: preprocessing, feature extraction.
4.3. Preprocessing data — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
- 4.3.1.1. Scaling features to a range
- https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range
- An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.
- sklearn.preprocessing.MinMaxScaler — scikit-learn 0.20.3 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn-preprocessing-minmaxscaler
- class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
- Transforms features by scaling each feature to a given range.
- This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
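A minimal sketch with the default feature_range=(0, 1) (the sample matrix is illustrative):

```python
# Sketch: each column is rescaled independently as (x - min) / (max - min).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[-1., 2.], [-0.5, 6.], [0., 10.], [1., 18.]])
scaler = MinMaxScaler()
print(scaler.fit_transform(X_train))  # every column now spans [0, 1]
```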
4.4. Imputation of missing values — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/impute.html
- sklearn.impute.SimpleImputer — scikit-learn 0.20.2 documentation
- https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn-impute-simpleimputer
- class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)
- Imputation transformer for completing missing values.
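A minimal sketch of mean imputation (the toy matrix is made up):

```python
# Sketch: np.nan marks the missing entry; strategy='mean' fills it in.
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = [[1, 2], [np.nan, 3], [7, 6]]
print(imp.fit_transform(X))
# The NaN in column 0 is replaced by that column's mean: (1 + 7) / 2 = 4
```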
How to fix "ValueError: need more than 1 value to unpack" when calling tn, fp, fn, tp = confusion_matrix(y_actual, y_predict).ravel()?
- To force it to output both classes even when one of them is not predicted, use the labels parameter.
- tn, fp, fn, tp = confusion_matrix(y_actual, y_predict, labels=[0,1]).ravel()
- python - How to make sklearn.metrics.confusion_matrix() to always return TP, TN, FP, FN? - Stack Overflow
- https://stackoverflow.com/questions/46229965/how-to-make-sklearn-metrics-confusion-matrix-to-always-return-tp-tn-fp-fn
- y_actual, y_predict = [0,0,0,0], [0,0,0,0]
- confusion_matrix(y_actual, y_predict, labels=[0,1])
- >> array([[4, 0], [0, 0]])
- tn, fp, fn, tp = confusion_matrix(y_actual, y_predict, labels=[0,1]).ravel()
- >> tn=4, fp=0, fn=0, tp=0