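Machine Learning: Practical Case Studies (1): Using R
"The greatest strength of R is that it was developed by statisticians. The greatest weakness of R is... that it was developed by statisticians." (Bo Cowgill, Google)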
library(package) and require(package) both load the namespace of the package with name package and attach it on the search list. require is designed for use inside other functions; it returns FALSE and gives a warning (rather than an error as library() does by default) if the package does not exist. Both functions check and update the list of currently attached packages and do not reload a namespace which is already loaded.
> a<-require(tm)
Warning messages:
1: package ‘tm’ was built under R version 3.5.3
2: package ‘NLP’ was built under R version 3.5.2
> a
[1] TRUE
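The difference in failure behaviour described above can be checked with a package that is not installed. The following is a minimal sketch; the package name `notARealPackage12345` is a hypothetical placeholder assumed to be absent from your library.

```r
# Hypothetical package name, assumed NOT to be installed.
pkg <- "notARealPackage12345"

# require() returns FALSE (and warns) when the package cannot be loaded.
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))

# library() signals an error instead; tryCatch() converts it to a value.
result <- tryCatch({
  library(pkg, character.only = TRUE)
  "attached"
}, error = function(e) "error")

print(loaded)
print(result)
```

This is why `require` is convenient inside functions: its return value can be tested with a plain `if`, while `library` has to be wrapped in `tryCatch`.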
# Load libraries and data
library(ggplot2) # We'll use ggplot2 for all of our visualizations
library(plyr) # For data manipulation
library(scales) # We'll need to fix date formats in plots
ufo <- read.delim("ufo_awesome.tsv",
sep = "\t",
stringsAsFactors = FALSE,
header = FALSE,
na.strings = "")
- read.delim: Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.
Similarly, read.delim is for reading delimited files, defaulting to the TAB character for the delimiter.
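In this example every column holds strings, but all read.* functions convert strings to factors by default (in R versions before 4.0.0, such as the 3.5 series used here), so we set stringsAsFactors = FALSE to prevent the conversion. The first line of the file is not a header row, so header is also set to FALSE. Finally, the data contains many empty fields that we want to represent as R's special value NA; to do that, we explicitly declare the empty string as an NA marker via na.strings.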
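The effect of these three arguments can be seen on a tiny in-memory sample. This is a sketch only: the two records below are made-up stand-ins, not rows from ufo_awesome.tsv, and the `text` argument is used in place of a file name.

```r
# Two tab-separated records without a header row; the second field of the
# first record is empty. (Made-up sample data for illustration.)
raw <- "Iowa City, IA\t\nMilwaukee, WI\t2 min"

toy <- read.delim(text = raw,
                  sep = "\t",
                  stringsAsFactors = FALSE,
                  header = FALSE,
                  na.strings = "")

str(toy)          # columns get the default names V1, V2 and stay character
is.na(toy$V2[1])  # the empty field was read as NA
```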
# Inspect the data frame
- names: Functions to get or set the names of an object.
names(ufo) <- c("DateOccurred", "DateReported",
"Location", "ShortDescription",
"Duration", "LongDescription")
good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 |
nchar(ufo$DateReported) != 8,
FALSE,
TRUE)
length(which(!good.rows))
## [1] 688
ufo <- ufo[good.rows, ]
# Now we can convert the strings to Date objects and work with them properly
ufo$DateOccurred <- as.Date(ufo$DateOccurred, format = "%Y%m%d")
ufo$DateReported <- as.Date(ufo$DateReported, format = "%Y%m%d")
- ifelse(test, yes, no)
ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.
- nchar: nchar takes a character vector as an argument and returns a vector whose elements contain the sizes of the corresponding elements of x.
- nzchar is a fast way to find out if elements of a character vector are non-empty strings.
- as.Date: Functions to convert between character representations and objects of class "Date" representing calendar dates.
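To see how `nchar`, `ifelse`, and `as.Date` work together, here is a small sketch on a made-up vector of date strings (the values are illustrative and not taken from the UFO data).

```r
# Made-up date strings: the second one is malformed (not 8 characters).
dates <- c("19951009", "0000", "20100315")

# Flag entries whose length is exactly 8 characters.
good <- ifelse(nchar(dates) != 8, FALSE, TRUE)

# Convert only the well-formed strings to Date objects.
parsed <- as.Date(dates[good], format = "%Y%m%d")

print(good)
print(parsed)
```

Since the test is already logical, `nchar(dates) == 8` would give the same vector; the `ifelse` form is kept here to mirror the code in the text.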
get.location <- function(l) {
  split.location <- tryCatch(strsplit(l, ",")[[1]],
                             error = function(e) return(c(NA, NA)))
  clean.location <- gsub("^ ", "", split.location)
  if (length(clean.location) > 2) {
    return(c(NA, NA))
  } else {
    return(clean.location)
  }
}
# We use 'lapply' to return a list with [City, State] vector as each element
city.state <- lapply(ufo$Location, get.location)
# We use 'do.call' to collapse the list to an N-by-2 matrix
location.matrix <- do.call(rbind, city.state)
- lapply: lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
l <- "Iowa City, IA"
strsplit(l, ",")
## [[1]]
## [1] "Iowa City" " IA"
- strsplit: Split the elements of a character vector x into substrings according to the matches to substring split within them.
split.location <- tryCatch(strsplit(l, ",")[[1]], error = function(e) return(c(NA, NA)))
## [1] "Iowa City" " IA"
- tryCatch: These functions provide a mechanism for handling unusual conditions, including errors and warnings.
clean.location <- gsub("^ ","",split.location)
## [1] "Iowa City" "IA"
- do.call: constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
> head(location.matrix)
[,1] [,2]
[1,] "Iowa City" "IA"
[2,] "Milwaukee" "WI"
[3,] "Shelton" "WA"
[4,] "Columbia" "MO"
[5,] "Seattle" "WA"
[6,] "Brunswick County" "ND"
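The `do.call(rbind, ...)` idiom can be tried on a short hand-made list. This is a sketch with illustrative values; the third element imitates the `c(NA, NA)` that is returned for locations that cannot be parsed.

```r
# A list of length-2 vectors, like the output of lapply(..., get.location).
pairs <- list(c("Iowa City", "IA"),
              c("Milwaukee", "WI"),
              c(NA, NA))

# Equivalent to rbind(pairs[[1]], pairs[[2]], pairs[[3]]).
m <- do.call(rbind, pairs)

dim(m)
print(m)
```

Each list element becomes one row, so a list of N two-element vectors yields an N-by-2 character matrix.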
ufo <- transform(ufo,
USCity = location.matrix[, 1],
USState = location.matrix[, 2],
stringsAsFactors = FALSE)
ufo$USState <- state.abb[match(ufo$USState, state.abb)]
ufo.us <- subset(ufo, !is.na(USState))
- transform: transform is a generic function, which—at least currently—only does anything useful with data frames. transform.default converts its first argument to a data frame if possible and calls transform.data.frame.
- state.abb state.area state.center state.division state.name state.region state.x77: Data sets related to the 50 states of the United States of America.
- subset: Return subsets of vectors, matrices or data frames which meet conditions.
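The `state.abb[match(...)]` step keeps valid U.S. state abbreviations and turns everything else into NA, which `subset` then filters out. A minimal sketch on made-up codes ("ON" stands in for a non-U.S. entry):

```r
# Made-up second-column values; "ON" is not a U.S. state abbreviation.
codes <- c("IA", "ON", "WA", NA)

# match() returns the position in state.abb, or NA when there is no match;
# indexing state.abb with NA yields NA.
valid <- state.abb[match(codes, state.abb)]

print(valid)

# Keep only the rows that matched a U.S. state.
toy <- data.frame(code = valid, stringsAsFactors = FALSE)
toy.us <- subset(toy, !is.na(code))
print(nrow(toy.us))
```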
ufo.us <- subset(ufo.us, DateOccurred >= as.Date("1990-01-01"))
new.hist <- ggplot(ufo.us, aes(x = DateOccurred)) +
geom_histogram(aes(fill='white', color='red')) +
scale_fill_manual(values=c('white'='white'), guide="none") +
scale_color_manual(values=c('red'='red'), guide="none") +
scale_x_date(breaks = "50 years")
ggsave(plot = new.hist,
filename = "new_hist.bmp",
height = 6,
width = 8)
ufo.us$YearMonth <- strftime(ufo.us$DateOccurred, format = "%Y-%m")
sightings.counts <- ddply(ufo.us, .(USState,YearMonth), nrow)
- strftime: Functions to convert between character representations and objects of classes "POSIXlt" and "POSIXct" representing calendar dates and times.
- ddply: For each subset of a data frame, apply function then combine results into a data frame. To apply a function for each row, use adply with .margins set to 1.
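The grouping step can be sketched on a few made-up rows. So that the example runs without plyr, it swaps in base R's `aggregate` in place of `ddply`; the count column is named `V1` here only to resemble ddply's default output name.

```r
# Made-up sightings: three in Washington, one in Iowa.
toy <- data.frame(USState = c("WA", "WA", "IA", "WA"),
                  DateOccurred = as.Date(c("1995-10-09", "1995-10-21",
                                           "1995-10-02", "1996-01-05")),
                  stringsAsFactors = FALSE)

# Reduce each date to a "YYYY-MM" string.
toy$YearMonth <- strftime(toy$DateOccurred, format = "%Y-%m")

# Count rows per (state, year-month) pair by summing a column of ones.
counts <- aggregate(list(V1 = rep(1L, nrow(toy))),
                    by = list(USState = toy$USState,
                              YearMonth = toy$YearMonth),
                    FUN = sum)
print(counts)
```

As with `ddply(..., nrow)`, only combinations that actually occur appear in the result, which is why the text goes on to build a full state-by-month grid.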
date.range <- seq.Date(from = min(ufo.us$DateOccurred),
to = max(ufo.us$DateOccurred),
by = "month")
date.strings <- strftime(date.range, "%Y-%m")
states.dates <- lapply(state.abb, function(s) cbind(s, date.strings))
states.dates <- data.frame(do.call(rbind, states.dates),
stringsAsFactors = FALSE)
all.sightings <- merge(states.dates,
sightings.counts,
by.x = c("s", "date.strings"),
by.y = c("USState", "YearMonth"),
all = TRUE)
names(all.sightings) <- c("State", "YearMonth", "Sightings")
all.sightings$Sightings[is.na(all.sightings$Sightings)] <- 0
all.sightings$YearMonth <- as.Date(rep(date.range, length(state.abb)))
all.sightings$State <- as.factor(all.sightings$State)
- merge: Merge two data frames by common columns or row names, or do other versions of database join operations.
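The merge with `all = TRUE` followed by the NA replacement can be illustrated on a tiny grid. This is a sketch with made-up values; the column names mirror the ones used in the text.

```r
# Full grid of every (state, month) combination we want in the result.
grid <- data.frame(s = c("IA", "IA", "WA", "WA"),
                   date.strings = c("1995-10", "1995-11", "1995-10", "1995-11"),
                   stringsAsFactors = FALSE)

# Observed counts: only two of the four combinations have sightings.
counts <- data.frame(USState = c("IA", "WA"),
                     YearMonth = c("1995-10", "1995-11"),
                     V1 = c(1L, 3L),
                     stringsAsFactors = FALSE)

# all = TRUE keeps grid rows without a match; their count is NA.
full <- merge(grid, counts,
              by.x = c("s", "date.strings"),
              by.y = c("USState", "YearMonth"),
              all = TRUE)

# A missing count means no sightings were recorded, so replace NA with 0.
full$V1[is.na(full$V1)] <- 0
print(full)
```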
state.plot <- ggplot(all.sightings, aes(x = YearMonth,y = Sightings)) +
geom_line(aes(color = "darkblue")) +
facet_wrap(~State, nrow = 10, ncol = 5) +
theme_bw() +
scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") +
scale_x_date(breaks = "5 years", labels = date_format('%Y')) +
xlab("Years") +
ylab("Number of Sightings") +
ggtitle("Number of UFO sightings by Month-Year and U.S. State (1990-2010)")
# Save the plot as a BMP file
ggsave(plot = state.plot,
filename = "ufo_sightings.bmp",
width = 14,
height = 8.5)
We can also create a new graph where the number of sightings is normalized by the state population.
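One possible way to do the normalization is sketched below. It rests on a loud assumption: the population figures come from the built-in `state.x77` matrix, whose Population column is in thousands and dates from the 1970s, so it is only a rough denominator for 1990-2010 sightings. A current census table would be a better source. The `toy` data frame and the `PerMillion` column name are hypothetical stand-ins.

```r
# ASSUMPTION: decades-old population estimates (in thousands) from state.x77.
# Its rows follow the same order as state.abb.
pop <- data.frame(State = state.abb,
                  Population = state.x77[, "Population"] * 1000,
                  stringsAsFactors = FALSE)

# Made-up stand-in for a few rows of all.sightings.
toy <- data.frame(State = c("WA", "WA", "IA"),
                  Sightings = c(10, 4, 2),
                  stringsAsFactors = FALSE)

# Attach each state's population, then express sightings per million people.
toy <- merge(toy, pop, by = "State")
toy$PerMillion <- toy$Sightings / toy$Population * 1e6
print(toy)
```

Plotting `PerMillion` instead of `Sightings` on the y axis would then make large and small states easier to compare.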
