Source of these notes:

http://wiki.stdout.org/rcookbook/

 

Indexing into a data structure

Problem

You want to get part of a data structure.

Solution

Elements from a vector, matrix, or data frame can be extracted using numeric indexing, by name, or by using a boolean vector of the appropriate length.

In many of the examples below, there are multiple ways of doing the same thing.

Indexing with numbers and names

With a vector:

# A sample vector

v <- c(1,4,4,3,2,2,3)

 

v[c(2,3,4)]

v[2:4]

# 4 4 3

 

v[c(2,4,3)]

# 4 3 4

With a data frame:

# Create a sample data frame

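# stringsAsFactors=TRUE makes sex a factor, matching the output shown below

# (it is not the default in R 4.0 and later)

data <- read.table(header=TRUE, stringsAsFactors=TRUE, text='

 subject sex size

       1   M    7

       2   F    6

       3   F    9

       4   M   11

 ')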

 

# Get the element at row 1, column 3

data[1,3]

data[1,"size"]

# 7

 

# Get rows 1 and 2, and all columns

data[1:2, ]  

data[c(1,2), ]

# subject sex size

#       1   M    7

#       2   F    6

 

# Get rows 1 and 2, and only column 2

data[1:2, 2]

data[c(1,2), 2]

# [1] M F

# Levels: F M

 

# Get rows 1 and 2, and only the columns named "sex" and "size"

data[1:2, c("sex","size")]

data[c(1,2), c(2,3)]

# sex size

#   M    7

#   F    6
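The same bracket syntax applies to a matrix, which the examples above do not cover. The following is a minimal sketch using a hypothetical matrix m that is not part of the original notes. drop=FALSE is the standard argument to [ that keeps the result two-dimensional.

```r
# A hypothetical sample matrix (filled column by column)
m <- matrix(1:6, nrow=2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Element at row 1, column 2
m[1, 2]
# 3

# All of column 2 (the result is simplified to a plain vector)
m[, 2]
# 3 4

# Keep the result as a one-column matrix
m[, 2, drop=FALSE]
#      [,1]
# [1,]    3
# [2,]    4
```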

 

Indexing with a boolean vector

With the vector v from above:

v > 2

# FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE

 

v[v>2]

v[ c(F,T,T,T,F,F,T)]

# 4 4 3 3

With the data frame from above:

# A boolean vector  

data$subject < 3

# TRUE  TRUE FALSE FALSE

 

data[data$subject < 3, ]

data[c(TRUE,TRUE,FALSE,FALSE), ]

# subject sex size

#       1   M    7

#       2   F    6   

 

# It is also possible to get the numeric indices of the TRUEs

which(data$subject < 3)

# 1 2
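One case where which() and a boolean index give different results is when the condition contains NA. A small sketch, using a hypothetical vector w that is not from the original notes:

```r
# Hypothetical vector with a missing value
w <- c(5, NA, 1, 4)

w > 2
# TRUE    NA FALSE  TRUE

# An NA in the boolean index produces an NA in the result
w[w > 2]
# 5 NA  4

# which() returns only the positions that are TRUE, so the NA is dropped
which(w > 2)
# 1 4
w[which(w > 2)]
# 5 4
```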

Negative indexing

Unlike in some other programming languages, when you use negative numbers for indexing in R, it doesn't mean to index backward from the end. Instead, it means to drop the element at that index, counting the usual way, from the beginning.

# Here's the vector again.

v

# 1 4 4 3 2 2 3

 

# Drop the first element

v[-1]

# 4 4 3 2 2 3

 

# Drop first three

v[-1:-3]

# 3 2 2 3

 

# Drop just the last element

v[-length(v)]

# 1 4 4 3 2 2
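Negative indices can also be given as a vector, and they cannot be mixed with positive ones. A brief sketch, re-creating v so the block runs on its own; head() with a negative n is a base R alternative for dropping elements from the end, and the variable res is only there to capture the error.

```r
v <- c(1,4,4,3,2,2,3)

# Drop the first and third elements
v[-c(1,3)]
# 4 3 2 2 3

# Drop the last two elements
head(v, -2)
# 1 4 4 3 2

# Mixing positive and negative indices is an error
res <- try(v[c(-1, 2)], silent=TRUE)
inherits(res, "try-error")
# TRUE
```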

 

Getting a subset of a data structure

Problem

You want to get a subset of the elements of a vector, matrix, or data frame.

Solution

To get a subset based on some conditional criterion, the subset() function or indexing using square brackets can be used. In the examples here, both ways are shown.

# A sample vector

v <- c(1,4,4,3,2,2,3)

 

subset(v, v<3)

v[v<3]

# 1 2 2

 

# Another vector

t <- c("small", "small", "large", "medium")

 

# Remove "small" entries

subset(t, t!="small")

t[t!="small"]

# "large"  "medium"

One important difference between the two methods is that you can assign values to elements with square bracket indexing, but you cannot with subset().

v[v<3] <- 9

# 9 4 4 3 9 9 3

 

subset(v, v<3) <- 9

# Error in subset(v, v < 3) <- 9 : could not find function "subset<-"

With data frames:

# A sample data frame

data <- read.table(header=TRUE, text='

 subject sex size

       1   M    7

       2   F    6

       3   F    9

       4   M   11

 ')

 

subset(data, subject < 3)

data[data$subject < 3, ]

# subject sex size

#       1   M    7

#       2   F    6

 

# Subset of particular rows and columns

subset(data, subject < 3, select = -subject)

subset(data, subject < 3, select = c(sex,size))

subset(data, subject < 3, select = sex:size)

data[data$subject < 3, c("sex","size")]

# sex size

#   M    7

#   F    6

 

# Logical AND of two conditions

subset(data, subject < 3  &  sex=="M")

data[data$subject < 3  &  data$sex=="M", ]

# subject sex size

#       1   M    7

 

# Logical OR of two conditions

subset(data, subject < 3  |  sex=="M")

data[data$subject < 3  |  data$sex=="M", ]

# subject sex size

#       1   M    7

#       2   F    6

#       4   M   11

 

# Condition based on transformed data

subset(data, log2(size)>3 )

data[log2(data$size) > 3, ]

# subject sex size

#       3   F    9

#       4   M   11

 

# Subset if elements are in another vector

subset(data, subject %in% c(1,3))

data[data$subject %in% c(1,3), ]

# subject sex size

#       1   M    7

#       3   F    9
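The two methods also differ when the condition evaluates to NA: subset() treats NA as FALSE, while square brackets return a row of NAs. A minimal sketch with a hypothetical data frame d (not part of the original notes):

```r
# Hypothetical data frame with a missing size
d <- data.frame(id=1:3, size=c(7, NA, 9))

# subset() treats NA in the condition as FALSE
subset(d, size > 6)
#   id size
# 1  1    7
# 3  3    9

# Square brackets return a row of NAs for each NA in the condition
d[d$size > 6, ]
#    id size
# 1   1    7
# NA NA   NA
# 3   3    9
```

If you want the square-bracket form to skip those rows, wrapping the condition in which() is one option.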

 

 

Making a vector filled with values

Problem

You want to create a vector with values already filled in.

Solution

rep(1, 50)
#  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [39] 1 1 1 1 1 1 1 1 1 1 1 1
 
rep(F, 20)
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 
rep(1:5, 4)
# 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
 
rep(1:5, each=4)
# 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
 
# Use it on a factor
rep(factor(LETTERS[1:3]), 5)
# A B C A B C A B C A B C A B C
# Levels: A B C
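A few related base R calls may also be useful here. This is a short sketch, not an exhaustive list: rep() accepts a vector for times and a length.out argument, seq() produces evenly spaced values, and numeric() produces a zero-filled vector.

```r
# Repeat each element a different number of times
rep(c("a","b"), times=c(2,3))
# "a" "a" "b" "b" "b"

# Repeat the pattern until the result has 7 elements
rep(1:3, length.out=7)
# 1 2 3 1 2 3 1

# Evenly spaced values
seq(0, 1, by=0.25)
# 0.00 0.25 0.50 0.75 1.00

# A zero-filled numeric vector of length 5
numeric(5)
# 0 0 0 0 0
```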

 

 

Information about variables

Problem

You want to find information about variables.

Solution

Here are some sample variables to work with in the examples below:

x <- 6
n <- 1:4
let <- LETTERS[1:4]
df <- data.frame(n, let)

Information about existence

# List currently defined variables
ls()
#  "df"  "let" "n"   "x"  
 
# Check if a variable named "x" exists
exists("x")
#  TRUE
 
# Check if "y" exists
exists("y")
#  FALSE
 
# Delete variable x
rm(x)
x
# Error: object "x" not found

Information about size/structure

# Get information about structure
str(n)
#  int [1:4] 1 2 3 4
 
str(df)
# 'data.frame': 4 obs. of  2 variables:
#  $ n  : int  1 2 3 4
#  $ let: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
 
# Get the length of a vector
length(n)
#  4
 
# Length probably doesn't give us what we want here:
length(df)
#  2
 
# Number of rows
nrow(df)
#  4
 
# Number of columns
ncol(df)
#  2
 
# Get the number of rows and columns
dim(df)
#  4 2
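Besides size, it is often useful to know what kind of object a variable holds. A small sketch using the base R functions class() and names(), re-creating the sample variables so the block runs on its own:

```r
n <- 1:4
let <- LETTERS[1:4]
df <- data.frame(n, let)

# Get the class of an object
class(n)
# "integer"
class(df)
# "data.frame"

# Get the column names of a data frame
names(df)
# "n"   "let"
```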

 

 

Working with NULL, NA, and NaN

Problem

You want to properly handle NULL, NA, or NaN values.

Solution

Sometimes your data will include NULL, NA, or NaN. These work somewhat differently from "normal" values, and may require explicit testing.

Here are some examples of comparisons with these values:

x <- NULL
x > 5
# logical(0)
 
y <- NA
y > 5
# NA
 
z <- NaN
z > 5
# NA

Here's how to test whether a variable has one of these values:

is.null(x)
# TRUE
 
is.na(y)
# TRUE
 
is.nan(z)
# TRUE

Note that NULL is different from the other two. NULL means that there is no value, while NA and NaN mean that there is some value, although one that is perhaps not usable. Here's an illustration of the difference:

# Is y null?
is.null(y)
# FALSE
 
# Is x NA?
is.na(x)
# logical(0)
# Warning message:
# In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

In the first case, it checks if y is NULL, and the answer is no. In the second case, it tries to check if x is NA, but there is no value to be checked.
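One way to handle both cases is to test for NULL before testing for NA. A minimal sketch; the helper name is_missing is hypothetical and not part of base R, and it only considers single values:

```r
# Hypothetical helper: TRUE if x is NULL, or a single NA or NaN value
is_missing <- function(x) {
  # || short-circuits, so is.na() is never called on NULL
  is.null(x) || (length(x) == 1 && is.na(x))
}

is_missing(NULL)
# TRUE
is_missing(NA)
# TRUE
is_missing(NaN)
# TRUE
is_missing(5)
# FALSE
```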

Ignoring "bad" values in vector summary functions

If you run functions like mean() or sum() on a vector containing NA or NaN, they will return NA or NaN. This is generally unhelpful, though it does alert you to the presence of the bad value. Many of these functions take the flag na.rm, which tells them to ignore these values.

vy <- c(1, 2, 3, NA, 5)
# 1  2  3 NA  5
mean(vy)
# NA
mean(vy, na.rm=TRUE)
# 2.75
 
vz <- c(1, 2, 3, NaN, 5)
# 1   2   3 NaN   5
sum(vz)
# NaN
sum(vz, na.rm=TRUE)
# 11
 
# NULL isn't a problem, because it doesn't exist
vx <- c(1, 2, 3, NULL, 5)
# 1 2 3 5
sum(vx)
# 11

Removing bad values from a vector

These values can be removed from a vector by filtering using is.na() or is.nan().

vy
# 1  2  3 NA  5
vy[ !is.na(vy) ]
# 1  2  3  5
 
vz
# 1   2   3 NaN   5
vz[ !is.nan(vz) ]
# 1  2  3  5
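Note that the two tests are not symmetric: is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN. A short sketch with a hypothetical vector vw that contains both:

```r
# Hypothetical vector containing both NA and NaN
vw <- c(1, NA, 3, NaN, 5)

is.na(vw)
# FALSE  TRUE FALSE  TRUE FALSE
is.nan(vw)
# FALSE FALSE FALSE  TRUE FALSE

# Filtering on is.na() removes both kinds of value
vw[ !is.na(vw) ]
# 1 3 5
```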

Notes

There are also the infinite numerical values Inf and -Inf, and the associated functions is.finite() and is.infinite().
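A brief sketch of how these behave, using a hypothetical vector vi. Note that NA and NaN are neither finite nor infinite.

```r
1/0
# Inf
-1/0
# -Inf

# Hypothetical vector mixing ordinary, infinite, and missing values
vi <- c(1, Inf, -Inf, NA, NaN)

is.finite(vi)
# TRUE FALSE FALSE FALSE FALSE
is.infinite(vi)
# FALSE  TRUE  TRUE FALSE FALSE
```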