R TUTORIAL: VISUALIZING MULTIVARIATE RELATIONSHIPS IN LARGE DATASETS
In two previous blog posts I discussed some techniques for visualizing relationships involving two or three variables and a large number of cases. In this tutorial I will extend that discussion to show some techniques that can be used on large datasets and complex multivariate relationships involving three or more variables.
In this tutorial I will use the R package nmle which contains the dataset MathAchieve. Use the code below to install the package and load it into the R environment:
####################################################
#code for visual large dataset MathAchieve
#first show 3d scatterplot; then show tableplot variations
####################################################
install.packages(“nmle”) #install nmle package
library(nlme) #load the package into the R environment
####################################################
Once the package is installed take a look at the structure of the data set by using:
####################################################
attach(MathAchieve) #take a look at the structure of the dataset
str(MathAchieve)
####################################################
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and ‘data.frame’: 7185 obs. of 6 variables:
$ School : Ord.factor w/ 160 levels “8367”<“8854″<..: 59 59 59 59 59 59 59 59 59 59 …
$ Minority: Factor w/ 2 levels “No”,”Yes”: 1 1 1 1 1 1 1 1 1 1 …
$ Sex : Factor w/ 2 levels “Male”,”Female”: 2 2 1 1 1 1 2 1 2 1 …
$ SES : num -1.528 -0.588 -0.528 -0.668 -0.158 …
$ MathAch : num 5.88 19.71 20.35 8.78 17.9 …
$ MEANSES : num -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 -0.428 …
– attr(*, “formula”)=Class ‘formula’ language MathAch ~ SES | School
.. ..- attr(*, “.Environment”)=<environment: R_GlobalEnv>
– attr(*, “labels”)=List of 2
..$ y: chr “Mathematics Achievement score”
..$ x: chr “Socio-economic score”
– attr(*, “FUN”)=function (x)
..- attr(*, “source”)= chr “function (x) max(x, na.rm = TRUE)”
– attr(*, “order.groups”)= logi TRUE
>
As can be seen from the output shown above the MathAchievedataset consists of 7185 observations and six variables. Three of these variables are numeric and three are factors. This presents some difficulties when visualizing the data. With over 7000 cases a two-dimensional scatterplot showing bivariate correlations among the three numeric variables is of limited utility.
We can use a 3D scatterplot and a linear regression model to more clearly visualize and examine relationships among the three numeric variables. The variable SES is a vector measuring socio-economic status, MathAch is a numeric vector measuring mathematics achievment scores, and MEANSES is a vector measuring the mean SESfor the school attended by each student in the sample.
We can look at the correlation matrix of these 3 variables to get a sense of the relationships among the variables:
> ####################################################
> #do a correlation matrix with the 3 numeric vars;
> ###################################################
> data(“MathAchieve”)
> cor(as.matrix(MathAchieve[c(4,5,6)]), method=”pearson”)
SES MathAch MEANSES
SES 1.0000000 0.3607556 0.5306221
MathAch 0.3607556 1.0000000 0.3437221
MEANSES 0.5306221 0.3437221 1.0000000
>
In using the cor() function as seen above we can determine the variables used by specifying the column that each numeric variable is in as shown in the output from the str() function. The 3 numeric variables, for example, are in columns 4, 5, and 6 of the matrix.
As discussed in previous tutorials we can visualize the relationship among these three variable by using a 3D scatterplot. Use the code as seen below:
####################################################
#install.packages(“nlme”)
install.packages(“scatterplot3d”)
library(scatterplot3d)
library(nlme) #load nmle package
attach(MathAchieve) #MathAchive dataset is in environment
scatterplot3d(SES, MEANSES, MathAch, main=”Basic 3D Scatterplot”) #do the plot with default options
####################################################
The resulting plot is:
Even though the scatter plot lacks detail due to the large sample size it is still possible to see the moderate correlations shown in the correlation matrix by noting the shape and direction of the data points . A regression plane can be calculated and added to the plot using the following code:
scatterplot3d(SES, MEANSES, MathAch, main=”Basic 3D Scatterplot”) #do the plot with default options
####################################################
##use a linear regression model to plot a regression plane
#y=MathAchieve, SES, MEANSES are predictor variables
####################################################
model1=lm(MathAch ~ SES + MEANSES) ## generate a regression
#take a look at the regression output
summary(model1)
#run scatterplot again putting results in model
model <- scatterplot3d(SES, MEANSES, MathAch, main=”Basic 3D Scatterplot”) #do the plot with default options
#link the scatterplot and linear model using the plane3d function
model$plane3d(model1) ## link the 3d scatterplot in ‘model’ to the ‘plane3d’ option with ‘model1’ regression information
####################################################
The resulting output is seen below:
Call:
lm(formula = MathAch ~ SES + MEANSES)
Residuals:
Min 1Q Median 3Q Max
-20.4242 -4.6365 0.1403 4.8534 17.0496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.72590 0.07429 171.31 <2e-16 ***
SES 2.19115 0.11244 19.49 <2e-16 ***
MEANSES 3.52571 0.21190 16.64 <2e-16 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.296 on 7182 degrees of freedom
Multiple R-squared: 0.1624, Adjusted R-squared: 0.1622
F-statistic: 696.4 on 2 and 7182 DF, p-value: < 2.2e-16
and the plot with the plane is:
While the above analysis gives us useful information, it is limited by the mixture of numeric values and factors. A more detailed visual analysis that will allow the display and comparison of all six of the variables is possible by using the functions available in the R packageTableplots. This package was created to aid in the visualization and inspection of large datasets with multiple variables.
The MathAchieve contains a total of six variables and 7185 cases. TheTableplots package can be used with datasets larger than 10,000 observations and up to 12 or so variables. It can be used visualize relationships among variables using the same measurement scale or mixed measurement types.
To look at a comparisons of each data type and then view all 6 together begin with the following:
####################################################
attach(MathAchieve) #attach the dataset
#set up 3 data frames with numeric, factors, and mixed
####################################################
mathmix <- data.frame(SES,MathAch,MEANSES,School=factor(School),Minority=factor(Minority),Sex=factor(Sex)) #all 6 vars
mathfact <- data.frame(School=factor(School),Minority=factor(Minority),Sex=factor(Sex)) #3 factor vars
mathnum <- data.frame(SES,MathAch,MEANSES) #3 numeric vars
####################################################
To view a comparison of the 3 numeric variables use:
####################################################
require(tabplot) #load tabplot package
tableplot(mathnum) #generate a table plot with numeric vars only
####################################################
resulting in the following output:
To view only the 3 factor variables use:
####################################################
require(tabplot) #load tabplot package
tableplot(mathfact) #generate a table plot with factors only
####################################################
Resulting in:
To view and compare table plots of all six variables use:
####################################################
require(tabplot) #load tabplot package
tableplot(mathmix) #generate a table plot with all six variables
####################################################
Resulting in:
Using tableplots is useful in visualizing relationships among a set of variabes. The fact that comparisons can be made using mixed levels of measurement and very large sample sizes provides a tool that the researcher can use for initial exploratory data analysis.
The above visual table comparisons agree with the moderate correlation among the three numeric variables found in the correlation and regression models discussed above. It is also possible to add some additional interpretation by viewing and comparing the mix of both factor and numeric variables.
In this tutorial I have provided a very basic introduction to the use of table plots in visualizing data. Interested readers can find an abundance of information about Tableplot options and interpretations in the CRAN documentation.
In my next tutorial I will continue a discussion of methods to visualize large and complex datasets by looking at some techniques that allow exploration of very large datasets and up to 12 variables or more.
转自:https://dmwiig.net/2017/02/06/r-tutorial-visualizing-multivariate-relationships-in-large-datasets/
R TUTORIAL: VISUALIZING MULTIVARIATE RELATIONSHIPS IN LARGE DATASETS的更多相关文章
- THE R QGRAPH PACKAGE: USING R TO VISUALIZE COMPLEX RELATIONSHIPS AMONG VARIABLES IN A LARGE DATASET, PART ONE
The R qgraph Package: Using R to Visualize Complex Relationships Among Variables in a Large Dataset, ...
- Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization
factoextra is an R package making easy to extract and visualize the output of exploratory multivaria ...
- R tutorial
http://www.clemson.edu/economics/faculty/wilson/R-tutorial/Introduction.html https://www.youtube.com ...
- A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)
A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python) MACHINE LEARNING PYTHON ...
- How-to go parallel in R – basics + tips(转)
Today is a good day to start parallelizing your code. I’ve been using the parallel package since its ...
- The leaflet package for online mapping in R(转)
It has been possible for some years to launch a web map from within R. A number of packages for doin ...
- Toward Scalable Systems for Big Data Analytics: A Technology Tutorial (I - III)
ABSTRACT Recent technological advancement have led to a deluge of data from distinctive domains (e.g ...
- A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets(中英双语)
文章标题 A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets 且谈Apache Spark的API三剑客:RDD.Dat ...
- 多组学分析及可视化R包
最近打算开始写一个多组学(包括宏基因组/16S/转录组/蛋白组/代谢组)关联分析的R包,避免重复造轮子,在开始之前随便在网上调研了下目前已有的R包工具,部分罗列如下: 1. mixOmics 应该是在 ...
随机推荐
- Java面试题04-final关键字详解
Java面试题04-final关键字详解 本篇博客将会讨论java中final关键字的含义,以及final用在什么地方,感觉看书总会有一些模糊,而且解释的不是很清楚,在此做个总结,以备准备面试的时候查 ...
- mysql 免安装版 + sqlyog 安装 步骤 --- 发的有点晚
总有些朋友不会安装mysql,其实软件安装不是学习mysql的重点,基本上也就安装一次,工作后一般公司里也不会让你安装,如果非要安装,百度一下就行了.安装版本百度上有许多,下面就提供一个免安装版的步骤 ...
- java集合基础
集合概念与作用 1现实生活中把很多事物凑在一起就是集合.java中的集合类:是一种工具,就像是容器,存储任意数量的有共同属性的对象. 2在类的内部,对数据进行组织: 简单而快速的搜索大数量的条目 有的 ...
- maven私服搭建nexus/windows/linux(一)
为什么要搭建nexus私服,原因很简单,有些公司都不提供外网给项目组人员,因此就不能使用maven访问远程的仓库地址,还有就是公司内部开发的一些版本的jar包,如果没有私服需要一人拷贝一份然后再自己安 ...
- 在eclipse中使用Maven建web工程项目
在eclipse中使用Maven建web工程项目: 第一种方式: 右键新建maven工程,勾选创建一个简单工程 填入信息,注意打包方式要改为war 点击完成,创建完的工程目录如下: 项目中没有WEB- ...
- Vue.js动画在项目使用的两个示例
欢迎大家关注腾讯云技术社区-博客园官方主页,我们将持续在博客园为大家推荐技术精品文章哦~ 李萌,16年毕业,Web前端开发从业者,目前就职于腾讯,喜欢node.js.vue.js等技术,热爱新技术,热 ...
- 分享一本书<<谁都不敢欺负你>>
有些人,不管在工作还是生活上,总是被人欺负. 分享这本书给大家,能给大家带来正能量.你强大了,就没人敢欺负你. 有的时候,感到为什么倒霉的总是我?为什么我的命运是这样?为什么总欺负我? 也许有很多人会 ...
- Android系统--输入系统(七)Reader_Dispatcher线程启动分析
Android系统--输入系统(七)Reader_Dispatcher线程启动分析 1. Reader/Dispatcher的引入 对于输入系统来说,将会创建两个线程: Reader线程(读取事件) ...
- ABP入门系列(16)——通过webapi与系统进行交互
ABP入门系列目录--学习Abp框架之实操演练 源码路径:Github-LearningMpaAbp 1. 引言 上一节我们讲解了如何创建微信公众号模块,这一节我们就继续跟进,来讲一讲公众号模块如何与 ...
- 微软的STRIDE模型
微软的STRIDE模型: https://msdn.microsoft.com/en-us/library/ee823878(v=cs.20).aspx Spoofing identity. An e ...