Source of these notes:
http://wiki.stdout.org/rcookbook/
Indexing into a data structure
Problem
You want to get part of a data structure.
Solution
Elements from a vector, matrix, or data frame can be extracted using numeric indexing, or by using a boolean vector of the appropriate length.
Many of the examples below show multiple ways of doing the same thing.
Indexing with numbers and names
With a vector:
# A sample vector
v <- c(1,4,4,3,2,2,3)
v[c(2,3,4)]
v[2:4]
# 4 4 3
v[c(2,4,3)]
# 4 3 4
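The heading mentions names, but the vector example above only uses numbers. As a minimal sketch, assuming a hypothetical named vector nv (not part of the original notes), indexing by name looks like this:

```r
# Hypothetical named vector; the names a-d are illustrative only
nv <- c(a = 1, b = 4, c = 4, d = 3)

# One element by name
nv["b"]
# b
# 4

# Several elements, returned in the order requested
nv[c("d", "a")]
# d a
# 3 1

# Names of the elements that satisfy a condition
names(nv)[nv > 3]
# "b" "c"
```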
With a data frame:
# Create a sample data frame
# stringsAsFactors=TRUE keeps sex as a factor on R >= 4.0, matching the output below
data <- read.table(header=TRUE, stringsAsFactors=TRUE, text='
subject sex size
1 M 7
2 F 6
3 F 9
4 M 11
')
# Get the element at row 1, column 3
data[1,3]
data[1,"size"]
# 7
# Get rows 1 and 2, and all columns
data[1:2, ]
data[c(1,2), ]
# subject sex size
# 1 M 7
# 2 F 6
# Get rows 1 and 2, and only column 2
data[1:2, 2]
data[c(1,2), 2]
# [1] M F
# Levels: F M
# Get rows 1 and 2, and only the columns named "sex" and "size"
data[1:2, c("sex","size")]
data[c(1,2), c(2,3)]
# sex size
# M 7
# F 6
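The Solution mentions matrices, which follow the same [row, column] pattern as data frames. The sketch below uses a hypothetical matrix m with made-up row and column names. It also shows drop = FALSE, which keeps a single-column result as a matrix instead of collapsing it to a vector.

```r
# Hypothetical 2x3 matrix, filled column by column
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Element at row 1, column 3
m[1, 3]
# 5

# The same idea with names
m["r2", "b"]
# 4

# A single column is simplified to a (named) vector by default
m[, 2]
# r1 r2
#  3  4

# drop = FALSE keeps the result as a 2x1 matrix
dim(m[, 2, drop = FALSE])
# 2 1
```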
Indexing with a boolean vector
With the vector v from above:
v > 2
# FALSE TRUE TRUE TRUE FALSE FALSE TRUE
v[v>2]
v[ c(F,T,T,T,F,F,T)]
# 4 4 3 3
With the data frame from above:
# A boolean vector
data$subject < 3
# TRUE TRUE FALSE FALSE
data[data$subject < 3, ]
data[c(TRUE,TRUE,FALSE,FALSE), ]
# subject sex size
# 1 M 7
# 2 F 6
# It is also possible to get the numeric indices of the TRUEs
which(data$subject < 3)
# 1 2
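One practical difference between a boolean index and which() shows up when the data contain NA. The following is a small sketch with a hypothetical vector s: the boolean index keeps an NA placeholder for the unknown comparison, while which() returns only the positions that are TRUE.

```r
# Hypothetical vector with a missing value
s <- c(5, NA, 1, 8)

s > 2
# TRUE NA FALSE TRUE

# Boolean indexing carries the NA through
s[s > 2]
# 5 NA 8

# which() keeps only the TRUE positions
which(s > 2)
# 1 4
s[which(s > 2)]
# 5 8
```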
Negative indexing
Unlike in some other programming languages, when you use negative numbers for indexing in R, it doesn't mean to index backward from the end. Instead, it means to drop the element at that index, counting the usual way, from the beginning.
# Here's the vector again.
v
# 1 4 4 3 2 2 3
# Drop the first element
v[-1]
# 4 4 3 2 2 3
# Drop first three
v[-1:-3]
# 3 2 2 3
# Drop just the last element
v[-length(v)]
# 1 4 4 3 2 2
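A few related idioms may be worth knowing; this is a sketch rather than part of the original recipe. head() with a negative count drops elements from the end, and combining a minus sign with which() can silently return an empty vector when nothing matches, so a negated boolean index is usually the safer choice.

```r
v <- c(1,4,4,3,2,2,3)

# Same as v[-1:-3]
v[-(1:3)]
# 3 2 2 3

# Drop the last two elements
head(v, -2)
# 1 4 4 3 2

# Pitfall: no element is > 10, so which() returns an empty index
# and the result is empty instead of the whole vector
v[-which(v > 10)]
# numeric(0)

# A negated boolean index behaves as intended
v[!(v > 10)]
# 1 4 4 3 2 2 3
```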
Getting a subset of a data structure
Problem
You want to get a subset of the elements of a vector, matrix, or data frame.
Solution
To get a subset based on some conditional criterion, the subset() function or indexing using square brackets can be used. In the examples here, both ways are shown.
# A sample vector
v <- c(1,4,4,3,2,2,3)
subset(v, v<3)
v[v<3]
# 1 2 2
# Another vector
t <- c("small", "small", "large", "medium")
# Remove "small" entries
subset(t, t!="small")
t[t!="small"]
# "large" "medium"
One important difference between the two methods is that you can assign values to elements with square bracket indexing, but you cannot with subset().
v[v<3] <- 9
# 9 4 4 3 9 9 3
subset(v, v<3) <- 9
# Error in subset(v, v < 3) <- 9 : could not find function "subset<-"
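If you want the modified values without changing the original vector, base R's replace() is one option. The sketch below re-creates v so that it runs on its own; the name w is illustrative.

```r
v <- c(1,4,4,3,2,2,3)

# replace() returns a modified copy
w <- replace(v, v < 3, 9)
w
# 9 4 4 3 9 9 3

# The original is untouched
v
# 1 4 4 3 2 2 3
```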
With data frames:
# A sample data frame
data <- read.table(header=TRUE, stringsAsFactors=TRUE, text='
subject sex size
1 M 7
2 F 6
3 F 9
4 M 11
')
subset(data, subject < 3)
data[data$subject < 3, ]
# subject sex size
# 1 M 7
# 2 F 6
# Subset of particular rows and columns
subset(data, subject < 3, select = -subject)
subset(data, subject < 3, select = c(sex,size))
subset(data, subject < 3, select = sex:size)
data[data$subject < 3, c("sex","size")]
# sex size
# M 7
# F 6
# Logical AND of two conditions
subset(data, subject < 3 & sex=="M")
data[data$subject < 3 & data$sex=="M", ]
# subject sex size
# 1 M 7
# Logical OR of two conditions
subset(data, subject < 3 | sex=="M")
data[data$subject < 3 | data$sex=="M", ]
# subject sex size
# 1 M 7
# 2 F 6
# 4 M 11
# Condition based on transformed data
subset(data, log2(size)>3 )
data[log2(data$size) > 3, ]
# subject sex size
# 3 F 9
# 4 M 11
# Subset if elements are in another vector
subset(data, subject %in% c(1,3))
data[data$subject %in% c(1,3), ]
# subject sex size
# 1 M 7
# 3 F 9
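The two methods also differ when the condition evaluates to NA. The sketch below uses a hypothetical data frame d with one missing size: subset() drops the row whose condition is NA, while square bracket indexing returns an extra row filled with NA.

```r
# Hypothetical data frame with a missing value
d <- data.frame(id = 1:4, size = c(7, NA, 9, 11))

# subset() treats an NA condition as FALSE
kept_subset <- subset(d, size > 8)
nrow(kept_subset)
# 2

# Bracket indexing keeps a row of NAs for the unknown comparison
kept_bracket <- d[d$size > 8, ]
nrow(kept_bracket)
# 3

# Wrapping the condition in which() gives the same rows as subset()
d[which(d$size > 8), ]$id
# 3 4
```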
Making a vector filled with values
Problem
You want to create a vector with values already filled in.
Solution
rep(1, 50)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [39] 1 1 1 1 1 1 1 1 1 1 1 1
rep(F, 20)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
rep(1:5, 4)
# 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(1:5, each=4)
# 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
# Use it on a factor
rep(factor(LETTERS[1:3]), 5)
# A B C A B C A B C A B C A B C
# Levels: A B C
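rep() also accepts a vector of counts and a target length, and base R has constructors for vectors filled with a type's default value. A short sketch of these variants:

```r
# Repeat each element a different number of times
rep(1:3, times = c(3, 2, 1))
# 1 1 1 2 2 3

# Recycle values until the result has the requested length
rep(c("a", "b"), length.out = 5)
# "a" "b" "a" "b" "a"

# Vectors pre-filled with the default value of a type
numeric(3)
# 0 0 0
logical(2)
# FALSE FALSE
```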
Information about variables
Problem
You want to find information about variables.
Solution
Here are some sample variables to work with in the examples below:
x <- 6
n <- 1:4
let <- LETTERS[1:4]
df <- data.frame(n, let, stringsAsFactors=TRUE)
Information about existence
# List currently defined variables
ls()
# "df" "let" "n" "x"
# Check if a variable named "x" exists
exists("x")
# TRUE
# Check if "y" exists
exists("y")
# FALSE
# Delete variable x
rm(x)
x
# Error: object "x" not found
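The same functions work on a specific environment, which avoids touching variables in your workspace. This is a sketch that uses a hypothetical environment e; inherits = FALSE restricts the check to that environment only.

```r
# Hypothetical scratch environment
e <- new.env()
assign("x", 6, envir = e)

exists("x", envir = e, inherits = FALSE)
# TRUE

# Remove x from that environment only
rm("x", envir = e)

exists("x", envir = e, inherits = FALSE)
# FALSE
ls(e)
# character(0)
```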
Information about size/structure
# Get information about structure
str(n)
# int [1:4] 1 2 3 4
str(df)
# 'data.frame': 4 obs. of 2 variables:
# $ n : int 1 2 3 4
# $ let: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# Get the length of a vector
length(n)
# 4
# Length probably doesn't give us what we want here:
length(df)
# 2
# Number of rows
nrow(df)
# 4
# Number of columns
ncol(df)
# 2
# Get the number of rows and columns
dim(df)
# 4 2
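length(df) returns 2 because a data frame is stored as a list of columns. The sketch below re-creates the sample variables so it runs on its own, and adds a few other base R inspection functions.

```r
n <- 1:4
let <- LETTERS[1:4]
df <- data.frame(n, let)

# A data frame is a list with one element per column
is.list(df)
# TRUE
length(df) == ncol(df)
# TRUE

# Column names
names(df)
# "n" "let"

# Class and underlying storage type
class(df)
# "data.frame"
typeof(n)
# "integer"
```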
Working with NULL, NA, and NaN
Problem
You want to properly handle NULL, NA, or NaN values.
Solution
Sometimes your data will include NULL, NA, or NaN. These work somewhat differently from "normal" values, and may require explicit testing.
Here are some examples of comparisons with these values:
x <- NULL
x > 5
# logical(0)
y <- NA
y > 5
# NA
z <- NaN
z > 5
# NA
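These results matter most inside if(), which stops with an error when the condition is NA or has length zero. One common guard is isTRUE(), sketched below with the same values; the labels in the strings are illustrative.

```r
x <- NULL
y <- NA

# if (y > 5) ... would stop with an error because the condition is NA.
# isTRUE() is TRUE only for a single TRUE value, so it is safe here.
isTRUE(y > 5)
# FALSE
isTRUE(x > 5)
# FALSE

result <- if (isTRUE(y > 5)) "big" else "not big, or unknown"
result
# "not big, or unknown"
```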
Here's how to test whether a variable has one of these values:
is.null(x)
# TRUE
is.na(y)
# TRUE
is.nan(z)
# TRUE
Note that NULL is different from the other two. NULL means that there is no value, while NA and NaN mean that there is some value, although one that is perhaps not usable. Here's an illustration of the difference:
# Is y null?
is.null(y)
# FALSE
# Is x NA?
is.na(x)
# logical(0)
# Warning message:
# In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
In the first case, it checks if y is NULL, and the answer is no. In the second case, it tries to check if x is NA, but there is no value to be checked.
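NA and NaN are also not interchangeable in these tests: is.na() is TRUE for both, while is.nan() is TRUE only for NaN. A short sketch with a hypothetical vector vals:

```r
is.na(NaN)
# TRUE
is.nan(NA)
# FALSE

# Hypothetical vector containing both kinds of value
vals <- c(1, NA, NaN)
is.na(vals)
# FALSE TRUE TRUE
is.nan(vals)
# FALSE FALSE TRUE
```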
Ignoring "bad" values in vector summary functions
If you run functions like mean() or sum() on a vector containing NA or NaN, they will return NA or NaN. That result is generally unhelpful, though it does alert you to the presence of the bad value. Many of these functions take the flag na.rm, which tells them to ignore these values.
vy <- c(1, 2, 3, NA, 5)
# 1 2 3 NA 5
mean(vy)
# NA
mean(vy, na.rm=TRUE)
# 2.75
vz <- c(1, 2, 3, NaN, 5)
# 1 2 3 NaN 5
sum(vz)
# NaN
sum(vz, na.rm=TRUE)
# 11
# NULL isn't a problem, because it doesn't exist
vx <- c(1, 2, 3, NULL, 5)
# 1 2 3 5
sum(vx)
# 11
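Before dropping bad values with na.rm, it can be useful to know how many there are. A minimal sketch that re-creates vy and counts its missing entries, then shows na.rm on another summary function:

```r
vy <- c(1, 2, 3, NA, 5)

# Is there any missing value at all?
anyNA(vy)
# TRUE

# How many values are missing?
sum(is.na(vy))
# 1

# na.rm works the same way for other summary functions
max(vy)
# NA
max(vy, na.rm=TRUE)
# 5
```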
Removing bad values from a vector
These values can be removed from a vector by filtering using is.na() or is.nan().
vy
# 1 2 3 NA 5
vy[ !is.na(vy) ]
# 1 2 3 5
vz
# 1 2 3 NaN 5
vz[ !is.nan(vz) ]
# 1 2 3 5
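Base R also provides na.omit(), and complete.cases() does the equivalent job for the rows of a data frame. The sketch below uses a hypothetical data frame d. Note that na.omit() attaches bookkeeping attributes to its result, which as.vector() strips.

```r
vy <- c(1, 2, 3, NA, 5)

# na.omit() drops NA (and NaN) entries
as.vector(na.omit(vy))
# 1 2 3 5

# Hypothetical data frame with one incomplete row
d <- data.frame(id = 1:3, size = c(7, NA, 9))

# Keep only the rows without any missing value
complete_rows <- d[complete.cases(d), ]
complete_rows$id
# 1 3
```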
Notes
There are also the infinite numerical values Inf and -Inf, and the associated functions is.finite() and is.infinite().
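As a brief sketch of how these behave, using a hypothetical vector w: is.finite() is FALSE for NA and NaN as well as for the infinities, while is.infinite() is TRUE only for Inf and -Inf.

```r
# Dividing a non-zero number by zero gives an infinity
1/0
# Inf
-1/0
# -Inf

# Hypothetical vector mixing ordinary, infinite, and missing values
w <- c(1, Inf, -Inf, NA, NaN)
is.finite(w)
# TRUE FALSE FALSE FALSE FALSE
is.infinite(w)
# FALSE TRUE TRUE FALSE FALSE

# Keep only the ordinary values
w[is.finite(w)]
# 1
```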