


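  • Rank Features: rank single features and pairs of features to detect covariance
  • RadViz Visualizer: plot data points along axes arranged around a circle to detect separability
  • Parallel Coordinates: plot instances as lines across vertical axes to detect classes or clusters
  • PCA Projection: project higher dimensions into a visual space using PCA
  • Manifold Visualization: visualize high-dimensional data using manifold learning
  • Joint Plots: (a.k.a. jointplots) plot the 2D correlation between features and the target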

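Feature analysis visualizers implement the Transformer API from scikit-learn, which means they can be used as intermediate transform steps in a Pipeline (particularly a VisualPipeline). They are instantiated like any other transformer; calling fit and transform on them draws the instances. Finally, show is called to finalize and display the figure.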
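The fit, transform, show sequence can be sketched without Yellowbrick. The class below is a hypothetical stand-in (the name `RecordingVisualizer` and its attributes are not part of any library); it only illustrates why a visualizer can sit in the middle of a pipeline: `transform` returns the data unchanged, so the next step still receives its input.

```python
import numpy as np


class RecordingVisualizer:
    """Hypothetical stand-in for a feature visualizer (not a Yellowbrick class)."""

    def __init__(self):
        self.drawn_ = None    # shape of the data that would be drawn
        self.shown_ = False   # set to True once show() has been called

    def fit(self, X, y=None):
        # Learn whatever the plot needs; here, just the number of features.
        self.n_features_ = np.asarray(X).shape[1]
        return self           # returning self allows call chaining

    def transform(self, X):
        # "Draw" the instances, then pass the data through unchanged.
        X = np.asarray(X)
        self.drawn_ = X.shape
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def show(self):
        # A real visualizer would finalize titles and legends and render here.
        self.shown_ = True


def center(X):
    """A plain preprocessing step: subtract the column means."""
    return X - X.mean(axis=0)


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

viz = RecordingVisualizer()
Xt = viz.fit_transform(center(X))   # visualizer used as an intermediate step
downstream = Xt.sum(axis=0)         # a later step still receives the data
viz.show()
```

The real visualizers follow the same order of calls; only the drawing inside `transform` and the rendering inside `show` differ.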


# Feature Analysis Imports
# NOTE that all these are available for import directly from the ``yellowbrick.features`` module
from yellowbrick.features.rankd import Rank1D, Rank2D
from yellowbrick.features.radviz import RadViz
from yellowbrick.features.pcoords import ParallelCoordinates
from yellowbrick.features.jointplot import JointPlotVisualizer
from yellowbrick.features.pca import PCADecomposition
from yellowbrick.features.manifold import Manifold


# Display the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

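1 Rank Features

Rank1D and Rank2D evaluate single features or pairs of features using a variety of metrics that score them on a [-1, 1] or [0, 1] scale, which allows the features to be ranked. Rank1D draws the per-feature scores as a bar chart, while Rank2D visualizes the pairwise scores on a lower-left triangle heatmap so that patterns between pairs of features can be easily identified for downstream analysis. Rank1D and Rank2D are summarized below: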

Visualizer      Rank1D, Rank2D
Quick method    rank1d(), rank2d()
Models          General linear models
Workflow        Feature engineering and model selection
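To make the pairwise ranking concrete, here is a minimal NumPy sketch of a Pearson score matrix reduced to its lower-left triangle. The synthetic data and variable names are assumptions for illustration, and this is not the library's own code; the diagonal is masked too, since a feature's correlation with itself is always 1 and carries no information.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([
    a,                                       # feature 0
    2.0 * a + 0.1 * rng.normal(size=200),    # strongly correlated with feature 0
    rng.normal(size=200),                    # independent noise
])

# Pairwise Pearson correlation; every score lies in [-1, 1].
scores = np.corrcoef(X, rowvar=False)

# Mask the upper triangle and the diagonal, leaving the lower-left triangle
# that a triangular heatmap would display.
mask = np.triu(np.ones_like(scores, dtype=bool), k=0)
lower = np.where(mask, np.nan, scores)

print(np.round(lower, 2))
```

Each remaining cell is the score for one feature pair, so a strongly colored cell in the heatmap points at a pair worth inspecting.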


1.1 Using Rank1D


from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank1D

# Load the credit dataset
X, y = load_credit()
print(X.shape)  # (30000, 23)

# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(algorithm='shapiro')
# Fit the data to the visualizer
visualizer.fit(X, y)
# Transform the data
result = visualizer.transform(X)
# Finalize and render the figure
visualizer.show()
Note: with the 30,000 rows of the credit dataset, SciPy emits a UserWarning ("p-value may not be accurate for N > 5000.") when the Shapiro statistic is computed.

1.2 Rank 2D


from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank2D

# Load the credit dataset
X, y = load_credit()
# Instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(algorithm='pearson')
# Fit the data to the visualizer
visualizer.fit(X, y)
# Transform the data
result = visualizer.transform(X)
# Finalize and render the figure
visualizer.show()


from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank2D

# Load the credit dataset
X, y = load_credit()
# Instantiate the visualizer with the covariance ranking algorithm
visualizer = Rank2D(algorithm='covariance')
# Fit the data to the visualizer
visualizer.fit(X, y)
# Transform the data
result = visualizer.transform(X)
# Finalize and render the figure
visualizer.show()
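The two algorithms used above, 'pearson' and 'covariance', differ mainly in scale. The following NumPy sketch (synthetic data; the helper `cov_and_corr` is a made-up name) shows that covariance changes with the units of a feature, whereas Pearson correlation stays within [-1, 1] and is unaffected by rescaling.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + rng.normal(size=500)


def cov_and_corr(u, v):
    """Return (covariance, Pearson correlation) of two 1-D arrays."""
    return np.cov(u, v)[0, 1], np.corrcoef(u, v)[0, 1]


cov_a, corr_a = cov_and_corr(x, y)
# Express x in different units (for example metres -> millimetres).
cov_b, corr_b = cov_and_corr(1000.0 * x, y)

print(f"covariance:  {cov_a:.3f} -> {cov_b:.3f}")
print(f"correlation: {corr_a:.3f} -> {corr_b:.3f}")
```

For features measured on very different scales, a covariance heatmap therefore tends to be dominated by the largest-valued features, which is worth keeping in mind when reading it.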

1.3 Quick Methods


from matplotlib import pyplot as plt
from yellowbrick.datasets import load_concrete
from yellowbrick.features import rank1d, rank2d

# Load the concrete dataset
X, _ = load_concrete()
# Draw both rankings side by side on one figure
_, axes = plt.subplots(ncols=2, figsize=(8, 4))
rank1d(X, ax=axes[0], show=False)
rank2d(X, ax=axes[1], show=False)
plt.show()

2 RadViz Visualizer


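Data scientists use this method to detect separability between classes, e.g. is there an opportunity to learn from the feature set, or is there just too much noise?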
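The projection behind a RadViz plot can be sketched in a few lines of NumPy. This follows the standard RadViz formulation (features min-max normalized, one anchor per feature on the unit circle, each point drawn at the weighted average of the anchors); it is an illustration under those assumptions, not Yellowbrick's exact implementation, and `radviz_project` is a hypothetical helper name.

```python
import numpy as np


def radviz_project(X):
    """Project rows of X (n_samples, n_features) to 2-D RadViz coordinates."""
    X = np.asarray(X, dtype=float)
    # Min-max normalize each feature to [0, 1] so they act as comparable weights.
    span = X.max(axis=0) - X.min(axis=0)
    W = (X - X.min(axis=0)) / np.where(span == 0, 1.0, span)

    # One anchor per feature, evenly spaced around the unit circle.
    m = X.shape[1]
    angles = 2.0 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Each point is the weighted average of the anchors.
    totals = W.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0          # all-zero rows stay at the origin
    return (W @ anchors) / totals


X = np.array([
    [1.0, 0.0, 0.0, 0.0],   # dominated by feature 0
    [0.0, 1.0, 0.0, 0.0],   # dominated by feature 1
    [1.0, 1.0, 1.0, 1.0],   # all features equally large
])
points = radviz_project(X)
print(np.round(points, 3))
```

A row dominated by one feature lands on that feature's anchor, while a row with balanced features lands near the center. Classes that are separable in feature space tend to occupy different regions of the circle.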

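If your data contains rows with missing values (numpy.nan), those missing values will not be plotted. In other words, you may not get the entire picture of your data. RadViz raises a DataWarning to inform you of the percentage missing. If you do receive this warning, you may want to look at imputation strategies. The scikit-learn imputers (for example SimpleImputer, which replaced the older Imputer class) are a good place to start.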
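As a rough sketch of what to check before plotting, the NumPy snippet below measures the share of rows containing a missing value and then applies plain mean imputation. The small array is made up for illustration; mean imputation is only one possible strategy, and scikit-learn's `SimpleImputer(strategy="mean")` performs the equivalent column-wise replacement.

```python
import numpy as np

X = np.array([
    [21.0, 30.0, 400.0],
    [22.0, np.nan, 410.0],
    [np.nan, 31.0, 420.0],
    [23.0, 32.0, 430.0],
])

# Share of rows that contain at least one NaN.
missing_rows = np.isnan(X).any(axis=1)
pct_missing = 100.0 * missing_rows.mean()

# Mean imputation: replace each NaN with the mean of the observed values
# in the same column.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print(f"{pct_missing:.0f}% of rows contain missing values")
print(X_imputed)
```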

RadViz Visualizer details:

Visualizer      RadialVisualizer
Quick method    radviz()
Models          Classification, regression
Workflow        Feature engineering

from yellowbrick.datasets import load_occupancy
from yellowbrick.features import RadViz

# Load the classification dataset
X, y = load_occupancy()
print(X.shape)  # (20560, 5)


2.1 Basic Usage


# Specify the target classes
classes = ["unoccupied", "occupied"]
# Instantiate the visualizer
visualizer = RadViz(classes=classes)
# Fit the data to the visualizer
visualizer.fit(X, y)
# Transform the data
visualizer.transform(X)
# Finalize and render the figure
visualizer.show()


2.2 Quick Method


from yellowbrick.features.radviz import radviz
from yellowbrick.datasets import load_occupancy

# Load the classification dataset
X, y = load_occupancy()
# Specify the target classes
classes = ["unoccupied", "occupied"]
# Create, fit, and draw the visualizer in one call
radviz(X, y, classes=classes)

3 Parallel Coordinates

Parallel Coordinates details:

Visualizer      ParallelCoordinates
Quick method    parallel_coordinates()
Models          Classification
Workflow        Feature analysis

3.1 Basic Usage

from yellowbrick.features import ParallelCoordinates
from yellowbrick.datasets import load_occupancy

# Load the occupancy classification dataset
X, y = load_occupancy()
# Specify the features of interest and the classes of the target
features = [
    "temperature", "relative humidity", "light", "CO2", "humidity"
]
classes = ["unoccupied", "occupied"]

# Instantiate the visualizer
# features: the features to visualize
# sample: how many instances to draw; an int is the maximum number of samples,
#         a float is the fraction of the data to display
# shuffle: whether the sample is drawn randomly
visualizer = ParallelCoordinates(
    classes=classes, features=features, sample=0.05, shuffle=True
)
# Fit and transform the data to the visualizer
result = visualizer.fit_transform(X, y)
# Finalize the title and axes, then display the visualization
visualizer.show()



from yellowbrick.features import ParallelCoordinates
from yellowbrick.datasets import load_occupancy

# Load the classification dataset
X, y = load_occupancy()
# Specify the features of interest and the classes of the target
features = [
    "temperature", "relative humidity", "light", "CO2", "humidity"
]
classes = ["unoccupied", "occupied"]

# Instantiate the visualizer; normalize='standard' standardizes each feature
visualizer = ParallelCoordinates(
    classes=classes, features=features,
    normalize='standard', sample=0.05, shuffle=True,
)
# Fit the visualizer and display it
result = visualizer.fit_transform(X, y)
visualizer.show()


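3.2 Faster Parallel Coordinates


  1. Use the sample=0.2 and shuffle=True parameters to shuffle and sample the dataset being drawn on the figure. The sample parameter performs a uniform random sample of the data, selecting the specified fraction.
  2. Use the fast=True parameter to enable "fast drawing mode".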
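The effect of the sampling parameters can be sketched with NumPy. The helper `subsample` below is hypothetical and rests on two assumptions: an int `sample` is a maximum row count while a float is a fraction, and without shuffling the sample is simply the leading rows. It is meant to show why shuffling matters, not to reproduce the library's internals.

```python
import numpy as np


def subsample(X, y, sample=1.0, shuffle=False, random_state=None):
    """Hypothetical helper: pick the rows a plot would draw."""
    n = len(X)
    # An int is a maximum count; a float is a fraction of the rows.
    n_samples = min(sample, n) if isinstance(sample, int) else int(n * sample)
    if shuffle:
        rng = np.random.default_rng(random_state)
        idx = rng.choice(n, size=n_samples, replace=False)  # uniform, no repeats
    else:
        idx = np.arange(n_samples)                          # leading rows only
    return X[idx], y[idx]


X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

X_head, _ = subsample(X, y, sample=0.2)
X_rand, y_rand = subsample(X, y, sample=0.2, shuffle=True, random_state=0)
print(X_head[:3, 0], X_rand[:3, 0])
```

If the dataset is ordered (for example by time or by class), taking only the leading rows can give a misleading picture, which is the reason to combine sample with shuffle=True.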


from yellowbrick.features import ParallelCoordinates
from yellowbrick.datasets import load_occupancy

# Load the classification dataset
X, y = load_occupancy()
# Specify the features of interest and the classes of the target
features = [
    "temperature", "relative humidity", "light", "CO2", "humidity"
]
classes = ["unoccupied", "occupied"]

# Instantiate the visualizer with fast drawing mode enabled
visualizer = ParallelCoordinates(
    classes=classes, features=features,
    normalize='standard', sample=0.05, fast=True, shuffle=True,
)
# Fit the visualizer and display it
result = visualizer.fit_transform(X, y)
visualizer.show()

3.3 Quick Method


from yellowbrick.features.pcoords import parallel_coordinates
from yellowbrick.datasets import load_occupancy

# Load the classification dataset
X, y = load_occupancy()
# Specify the features of interest and the classes of the target
features = [
    "temperature", "relative humidity", "light", "CO2", "humidity"
]
classes = ["unoccupied", "occupied"]
# Create, fit, and draw the visualizer in one call
visualizer = parallel_coordinates(
    X, y, classes=classes, features=features, sample=0.05, shuffle=True
)

4 PCA Projection

Visualizer      PCA
Quick method    pca_decomposition()
Models          Classification, regression
Workflow        Feature engineering/selection

4.1 Basic Usage

from yellowbrick.datasets import load_credit
from yellowbrick.features import PCA

# Specify the features of interest and the target
X, y = load_credit()
classes = ['account in default', 'current with bills']
# scale=True standardizes the features before the decomposition;
# the projection is two-dimensional by default
visualizer = PCA(scale=True, classes=classes)
result = visualizer.fit_transform(X, y)
visualizer.show()
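What scaling followed by a two-dimensional projection amounts to can be sketched with NumPy's SVD. This is a minimal illustration on synthetic data, assuming standardization to zero mean and unit variance; `pca_project` is a hypothetical helper, not the visualizer's own code.

```python
import numpy as np


def pca_project(X, n_components=2, scale=True):
    """Project X onto its first principal components using an SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # PCA always centers the data
    if scale:
        std = Xc.std(axis=0)
        Xc = Xc / np.where(std == 0, 1.0, std)   # unit variance per feature
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return Xc @ Vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
t = rng.normal(size=300)
# Three features on very different scales; the first two are nearly collinear.
X = np.column_stack([t, 1000.0 * t + rng.normal(size=300), rng.normal(size=300)])

Z, ratio = pca_project(X, n_components=2, scale=True)
_, ratio_raw = pca_project(X, n_components=2, scale=False)
print(np.round(ratio, 3), np.round(ratio_raw, 6))
```

Without scaling, the feature with the largest numeric range accounts for almost all of the variance and the projection mostly reflects that single feature, which is why scaling is usually enabled when features have different units.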


from yellowbrick.datasets import load_credit
from yellowbrick.features import PCA

X, y = load_credit()
classes = ['account in default', 'current with bills']
# projection sets the number of dimensions; only 2 and 3 are supported
visualizer = PCA(
    scale=True, projection=3, classes=classes
)
result = visualizer.fit_transform(X, y)
visualizer.show()

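4.2 Biplot


from yellowbrick.datasets import load_concrete
from yellowbrick.features import PCA

# Load the concrete dataset
X, y = load_concrete()
# proj_features=True also draws the feature directions on the projection
visualizer = PCA(scale=True, proj_features=True)
result = visualizer.fit_transform(X, y)
visualizer.show()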


from yellowbrick.datasets import load_concrete
from yellowbrick.features import PCA

X, y = load_concrete()
visualizer = PCA(scale=True, proj_features=True, projection=3)
result = visualizer.fit_transform(X, y)
visualizer.show()

4.3 Quick Method


from yellowbrick.datasets import load_credit
from yellowbrick.features import pca_decomposition

# Specify the features of interest and the target
X, y = load_credit()
classes = ['account in default', 'current with bills']
# Create, fit, and show the visualizer
pca_decomposition(
    X, y, scale=True, classes=classes
)

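5 Manifold Visualization

Put simply, manifold learning is a family of dimensionality reduction methods. For a detailed introduction, see the article "流形学习(manifold learning)综述" (a survey of manifold learning).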
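As a concrete picture of what such an embedding does, here is classical (Torgerson) multidimensional scaling in NumPy. It is chosen because it has a short closed form; note that it is a simpler variant, not the iterative SMACOF algorithm used by scikit-learn's MDS estimator. The helper names and the synthetic data are assumptions for illustration.

```python
import numpy as np


def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS from pairwise Euclidean distances."""
    X = np.asarray(X, dtype=float)
    # Squared pairwise distances between all instances.
    diff = X[:, None, :] - X[None, :, :]
    D2 = (diff**2).sum(axis=-1)

    # Double centering turns squared distances into a Gram (inner product) matrix.
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J

    # The top eigenvectors, scaled by sqrt(eigenvalue), give the embedding.
    vals, vecs = np.linalg.eigh(B)                    # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))


def pairwise(Z):
    """Euclidean distance matrix of the rows of Z."""
    return np.sqrt(((Z[:, None, :] - Z[None, :, :])**2).sum(axis=-1))


rng = np.random.default_rng(0)
flat = rng.normal(size=(30, 2))
# Place a 2-D point cloud in 5-D using orthonormal directions: distances are unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))
X = flat @ Q.T

Z = classical_mds(X, n_components=2)
print(np.abs(pairwise(Z) - pairwise(flat)).max())
```

Because the five-dimensional data here really lies on a two-dimensional plane, the embedding recovers all pairwise distances. Curved manifolds are where the nonlinear methods become necessary.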



Visualizer      Manifold
Quick method    manifold_embedding()
Models          Classification, regression
Workflow        Feature engineering


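Manifold      Description
"lle"         Locally Linear Embedding (LLE) uses many local linear decompositions to preserve globally non-linear structures.
"ltsa"        LTSA LLE: local tangent space alignment is similar to LLE in that it uses locality to preserve neighborhood distances.
"hessian"     Hessian LLE: an LLE regularization method that applies a Hessian-based quadratic form at each neighborhood.
"modified"    Modified LLE applies a regularization parameter to LLE.
"isomap"      Isomap seeks a lower-dimensional embedding that maintains geometric distances between each instance.
"mds"         MDS: multidimensional scaling uses similarity to plot points that are near to each other close together in the embedding.
"spectral"    Spectral Embedding: a discrete approximation of the low-dimensional manifold using a graph representation.
"tsne"        t-SNE: converts the similarity of points into probabilities, then uses those probabilities to create an embedding.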


5.1 Discrete Target


from sklearn.model_selection import train_test_split
from yellowbrick.features import Manifold
from yellowbrick.datasets import load_occupancy

# Load the classification dataset
X, y = load_occupancy()
classes = ["unoccupied", "occupied"]
# The full dataset is slow to embed, so keep a 10% subset
_, X, _, y = train_test_split(X, y, test_size=0.1, random_state=7)
print(X.shape)

# Instantiate the visualizer; manifold selects the embedding method
viz = Manifold(manifold="tsne", classes=classes)
# Fit the data to the visualizer
result = viz.fit_transform(X, y)
# Finalize and render the figure
viz.show()



from sklearn.pipeline import Pipeline
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.model_selection import train_test_split
from yellowbrick.features import Manifold
from yellowbrick.datasets import load_occupancy

# Load the classification dataset
X, y = load_occupancy()
classes = ["unoccupied", "occupied"]
# The full dataset is slow to embed, so keep a 10% subset
_, X, _, y = train_test_split(X, y, test_size=0.1, random_state=7)
print(X.shape)

# Create a pipeline
model = Pipeline([
    ("selectk", SelectKBest(k=3, score_func=f_classif)),
    ("viz", Manifold(manifold="tsne", classes=classes)),
])
# Fit the data to the model
result = model.fit_transform(X, y)
# Finalize and render the figure
model.named_steps['viz'].show()

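5.2 Continuous Target

For a regression target, or to specify color as a heat map of continuous values, specify target_type="continuous". Note that by default the parameter target_type="auto" is set, which determines whether the target is discrete or continuous by counting the number of unique values in y.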
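A unique-value heuristic of this kind can be sketched as follows. The function name `guess_target_type` and the threshold of 12 are assumptions made for illustration; the library's actual rule and cutoff may differ.

```python
import numpy as np


def guess_target_type(y, max_discrete=12):
    """Hypothetical heuristic: few unique values -> discrete, many -> continuous."""
    n_unique = len(np.unique(np.asarray(y)))
    return "discrete" if n_unique <= max_discrete else "continuous"


occupancy = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # class labels
strength = np.linspace(2.3, 82.6, 50)            # regression-style target

print(guess_target_type(occupancy))
print(guess_target_type(strength))
```

A regression target with only a handful of distinct values would be classified as discrete by such a rule, which is the case where passing target_type="continuous" explicitly is useful.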

from yellowbrick.features import Manifold
from yellowbrick.datasets import load_concrete

# Load the regression dataset
X, y = load_concrete()
# Many manifold algorithms are nearest-neighbor based; for those, n_neighbors
# sets the number of neighbors used in the embedding. If n_neighbors is not
# specified for such an embedding, it is set to 5 and a warning is issued.
# The parameter is ignored if the algorithm does not use nearest neighbors.
viz = Manifold(manifold="isomap", n_neighbors=10)
# Fit the data to the visualizer
result = viz.fit_transform(X, y)
# Finalize and render the figure
viz.show()

5.3 Quick Method


from yellowbrick.features import manifold_embedding
from yellowbrick.datasets import load_concrete

# Load the regression dataset
X, y = load_concrete()
# Create, fit, and draw the visualizer in one call
manifold_embedding(X, y, manifold="isomap", n_neighbors=10)

6 Joint Plots

Visualizer      JointPlot
Quick method    joint_plot()
Models          Classification, regression
Workflow        Feature engineering/selection

6.1 Basic Usage

from yellowbrick.datasets import load_concrete
from yellowbrick.features import JointPlotVisualizer

# Load the dataset
X, y = load_concrete()
# Instantiate the visualizer; columns names the feature to plot against the target
visualizer = JointPlotVisualizer(columns="cement")
# Fit and transform the data
result = visualizer.fit_transform(X, y)
# Finalize and render the figure
visualizer.show()


from yellowbrick.datasets import load_concrete
from yellowbrick.features import JointPlotVisualizer

# Load the dataset
X, y = load_concrete()
# Instantiate the visualizer; two column names plot one feature against the other
visualizer = JointPlotVisualizer(columns=["cement", "ash"])
# Fit and transform the data
result = visualizer.fit_transform(X, y)
# Finalize and render the figure
visualizer.show()


from yellowbrick.datasets import load_concrete
from yellowbrick.features import JointPlotVisualizer

# Load the dataset
X, y = load_concrete()
# kind sets how the points are drawn: "scatter" (the default) or "hexbin"
visualizer = JointPlotVisualizer(columns="cement", kind="hexbin")
# Fit and transform the data
result = visualizer.fit_transform(X, y)
# Finalize and render the figure
visualizer.show()

6.2 Quick Method


from yellowbrick.datasets import load_concrete
from yellowbrick.features import joint_plot

# Load the dataset
X, y = load_concrete()
# Create, fit, and draw the visualizer in one call
visualizer = joint_plot(X, y, columns="cement")

7 References








