

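  • t-distributed stochastic neighbor embedding (t-SNE)
  • Although it targets nonlinear dimensionality reduction of high-dimensional data, it is rarely used for that purpose, because it is better suited to visualization and to checking how well a model works
  • It keeps the distribution of the data in the low-dimensional space highly similar to its distribution in the original feature space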
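
The claim that the low-dimensional layout stays close to the original structure can be checked numerically. The following is a minimal sketch, not part of the original notes: the iris dataset, `random_state=0`, and the `trustworthiness` score with `n_neighbors=5` are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE, trustworthiness

# Iris is a stand-in dataset (an assumption; the notes do not name one here).
X = load_iris().data  # shape (150, 4)

# Embed the 4-dimensional samples into 2 dimensions.
# random_state is fixed only to make the run repeatable.
tsne = TSNE(n_components=2, learning_rate=200, random_state=0)
embedding = tsne.fit_transform(X)

# trustworthiness lies in [0, 1]; values near 1 mean that points that are
# neighbours in the embedding were also neighbours in the original space.
score = trustworthiness(X, embedding, n_neighbors=5)

print(embedding.shape)
print(f"trustworthiness: {score:.3f}")
print(f"final KL divergence: {tsne.kl_divergence_:.3f}")
```

A score close to 1 suggests that local neighbourhoods survived the embedding. The axes of a t-SNE plot have no direct meaning, so only the grouping of points should be read from it.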


1.1 Reproducing the demo

  # Import TSNE and matplotlib
  import matplotlib.pyplot as plt
  from sklearn.manifold import TSNE
  # Create a TSNE instance: model
  model = TSNE(learning_rate=200)
  # Apply fit_transform to samples: tsne_features
  # (samples is assumed to be a 2D array of measurements defined elsewhere)
  tsne_features = model.fit_transform(samples)
  # Select the 0th feature: xs
  xs = tsne_features[:, 0]
  # Select the 1st feature: ys
  ys = tsne_features[:, 1]
  # Scatter plot, coloring by variety_numbers
  # (variety_numbers is assumed to hold one integer label per sample)
  plt.scatter(xs, ys, c=variety_numbers)
  plt.show()



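  • When there are many feature variables, they are often multicollinear.
  • Principal component analysis (PCA) reduces the dimensionality of high-dimensional data by extracting its principal components.
  • PCA kills two birds with one stone:
    • it selects representative features, and
    • the resulting features are linearly uncorrelated with one another.
    • In short, the principal components are the best linear combinations of the original features, in the sense of capturing the most variance.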
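
Both properties in the list above can be verified on a small example. This is a sketch under stated assumptions: the data are synthetic (two of the three columns are made collinear on purpose), and the seed and sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): three features, two of them strongly collinear.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, 2.0 * a + 0.1 * b, b])

pca = PCA()
Z = pca.fit_transform(X)

# Each PCA feature is a linear combination of the centered original features;
# the rows of pca.components_ hold the combination weights.
Z_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(Z, Z_manual))

# The original features are correlated, the PCA features are not:
# the off-diagonal entries of the second matrix are numerically zero.
print(np.round(np.corrcoef(X, rowvar=False), 2))
print(np.round(np.cov(Z, rowvar=False), 6))
```

The diagonal of the second matrix holds the variance of each PCA feature, sorted from largest to smallest.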


2.1 Mathematical derivation


Making sense of principal component analysis, eigenvectors & eigenvalues
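
The link between PCA and eigenvectors/eigenvalues can be made concrete: the principal axes are the eigenvectors of the covariance matrix of the centered data, and the eigenvalues are the variances along those axes. The sketch below compares a manual eigen-decomposition with scikit-learn's `PCA`; the synthetic data and the seed are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical data: 300 samples, 3 features with different scales, then mixed.
X = rng.normal(size=(300, 3)) * np.array([3.0, 1.0, 0.3])
X = X @ rng.normal(size=(3, 3))

# Covariance matrix of the centered data, C = Xc^T Xc / (n - 1).
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (len(X) - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending
# order, so reorder to put the largest-variance direction first.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X)

# Eigenvalues equal the explained variances.
print(np.allclose(eigvals, pca.explained_variance_))
# Eigenvectors equal the principal axes up to sign.
print(np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))
```

The comparison uses absolute values because an eigenvector and its negation describe the same axis, so the two methods may disagree on sign.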



  from sklearn.decomposition import PCA

  # Perform the necessary imports
  import matplotlib.pyplot as plt
  from scipy.stats import pearsonr
  # Assign the 0th column of grains: width
  # (grains is assumed to be a 2D NumPy array defined elsewhere)
  width = grains[:, 0]
  # Assign the 1st column of grains: length
  length = grains[:, 1]
  # Scatter plot width vs length
  plt.scatter(width, length)
  plt.axis('equal')
  plt.show()
  # Calculate the Pearson correlation
  correlation, pvalue = pearsonr(width, length)
  # Display the correlation
  print(correlation)

  # Import PCA, plus the plotting and correlation helpers used below
  import matplotlib.pyplot as plt
  from scipy.stats import pearsonr
  from sklearn.decomposition import PCA
  # Create PCA instance: model
  model = PCA()
  # Apply the fit_transform method of model to grains: pca_features
  pca_features = model.fit_transform(grains)
  # Assign 0th column of pca_features: xs
  xs = pca_features[:, 0]
  # Assign 1st column of pca_features: ys
  ys = pca_features[:, 1]
  # Scatter plot xs vs ys
  plt.scatter(xs, ys)
  plt.axis('equal')
  plt.show()
  # Calculate the Pearson correlation of xs and ys
  correlation, pvalue = pearsonr(xs, ys)
  # Display the correlation (effectively zero: the PCA features are decorrelated)
  print(correlation)

  Output:
  2.5478751053409354e-17

2.3 Intrinsic dimension


2.3.1 Extracting the principal components



  # Perform the necessary imports
  from sklearn.decomposition import PCA
  from sklearn.preprocessing import StandardScaler
  from sklearn.pipeline import make_pipeline
  import matplotlib.pyplot as plt
  # Create scaler: scaler
  scaler = StandardScaler()
  # Create a PCA instance: pca
  pca = PCA()
  # Create pipeline: pipeline
  pipeline = make_pipeline(scaler, pca)
  # Fit the pipeline to 'samples'
  pipeline.fit(samples)
  # Plot the explained variances
  features = range(pca.n_components_)
  plt.bar(features, pca.explained_variance_)
  plt.xlabel('PCA feature')
  plt.ylabel('variance')
  plt.xticks(features)
  plt.show()
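
Besides reading the bar chart by eye, the intrinsic dimension can be estimated from the cumulative explained variance ratio. The sketch below is one possible approach, not the method of the original notes: the synthetic `samples` (two latent factors spread over five features), the `mixing` matrix, and the 0.95 threshold are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical data: 2 latent factors spread over 5 observed features,
# plus a small amount of noise.
latent = rng.normal(size=(400, 2))
mixing = np.array([[1.0, 0.5, -1.0, 2.0, 0.0],
                   [0.0, 1.5, 1.0, -0.5, 1.0]])
samples = latent @ mixing + 0.01 * rng.normal(size=(400, 5))

pca = PCA()
make_pipeline(StandardScaler(), pca).fit(samples)

# Cumulative share of variance explained by the first k PCA features.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches the threshold
# (0.95 is an arbitrary choice for this sketch).
threshold = 0.95
intrinsic_dim = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 3))
print(intrinsic_dim)
```

Passing a float to the constructor, as in `PCA(n_components=0.95)`, lets scikit-learn apply the same kind of variance threshold when fitting.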

2.3.2 Dimension reduction with PCA




  # Import TfidfVectorizer
  from sklearn.feature_extraction.text import TfidfVectorizer
  # The documents to vectorize
  documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
  # Create a TfidfVectorizer: tfidf
  tfidf = TfidfVectorizer()
  # Apply fit_transform to documents: csr_mat
  csr_mat = tfidf.fit_transform(documents)
  # Print result of toarray() method
  print(csr_mat.toarray())
  # Get the words: words (get_feature_names() was removed in scikit-learn 1.2)
  words = tfidf.get_feature_names_out().tolist()
  # Print words
  print(words)

  Output:
  [[0.51785612 0.         0.         0.68091856 0.51785612 0.        ]
   [0.         0.         0.51785612 0.         0.51785612 0.68091856]
   [0.51785612 0.68091856 0.51785612 0.         0.         0.        ]]
  ['cats', 'chase', 'dogs', 'meow', 'say', 'woof']
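
The tf-idf step produces a sparse matrix but does not yet reduce its dimension. One common way to do so is `TruncatedSVD`, swapped in here for PCA because it accepts sparse input directly and does not center the data. This is a sketch, not necessarily what the original notes intended; `n_components=2` and `random_state=0` are arbitrary choices.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']

# Sparse tf-idf matrix: 3 documents x 6 words.
csr_mat = TfidfVectorizer().fit_transform(documents)

# Reduce the 6 word columns to 2 components without densifying the matrix.
# n_components=2 is an arbitrary choice for this toy example.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(csr_mat)

print(csr_mat.shape)
print(reduced.shape)
```

Each row of `reduced` describes one document with two numbers instead of six word weights.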

