Base R provides a rich set of built-in functions for data manipulation, statistical calculations, and programming. The apply family of functions is particularly powerful for applying operations across different dimensions of data structures. This chapter covers essential base R functions and the apply family in detail.
4 Common Utility Functions
Base R includes numerous utility functions for statistical calculations, data generation, and manipulation.
4.1 Statistical Functions
# Create sample data
x <- 1:10
print(x)
[1] 1 2 3 4 5 6 7 8 9 10
# Basic statistical functions
sum(x) # Sum of all elements
[1] 55
mean(x) # Arithmetic mean
[1] 5.5
median(x) # Median value
[1] 5.5
sd(x) # Standard deviation
[1] 3.02765
var(x) # Variance
[1] 9.166667
min(x) # Minimum value
[1] 1
max(x) # Maximum value
[1] 10
range(x) # Returns min and max
[1] 1 10
length(x) # Number of elements
[1] 10
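Most of the statistical functions above return NA when the input contains a missing value. The sketch below, using a hypothetical vector y that is not part of the chapter's data, shows how the na.rm argument changes that behavior.

```r
# Hypothetical vector with one missing value
y <- c(4, 8, NA, 15)

mean_with_na <- mean(y)                 # NA: the missing value propagates
mean_no_na   <- mean(y, na.rm = TRUE)   # 9: mean of 4, 8, and 15
sum_no_na    <- sum(y, na.rm = TRUE)    # 27
n_missing    <- sum(is.na(y))           # 1: number of missing values

print(c(mean_no_na = mean_no_na, sum_no_na = sum_no_na, n_missing = n_missing))
```

Checking sum(is.na(y)) first makes it explicit how many values are being dropped before na.rm = TRUE hides them.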
4.2 Quantile Function
The quantile() function calculates sample quantiles corresponding to given probabilities:
x <- 1:10
quantile(x) # Default probabilities (0%, 25%, 50%, 75%, 100%)
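As a small sketch of how the probabilities can be customized, the probs argument of quantile() accepts any vector of values between 0 and 1. The variable names q_default and q_custom are illustrative only.

```r
x <- 1:10

# Default call: minimum, quartiles, and maximum
q_default <- quantile(x)

# Custom probabilities: 10th and 90th percentiles
q_custom <- quantile(x, probs = c(0.1, 0.9))

print(q_default)
print(q_custom)
```

The result is a named numeric vector, so individual quantiles can be extracted by name, for example q_custom["90%"].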
# Different ways to create sequences
seq(0, 1, by = 0.1) # From 0 to 1, increment by 0.1
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, 1, length.out = 11) # From 0 to 1, 11 equally spaced values
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 5, to = 50, by = 5) # From 5 to 50, increment by 5
[1] 5 10 15 20 25 30 35 40 45 50
seq_along(letters[1:5]) # Sequence along the length of an object
[1] 1 2 3 4 5
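One reason to prefer seq_along() and its sibling seq_len() over the 1:n idiom is the behavior for empty input. The sketch below uses a hypothetical length n of zero to show the difference.

```r
n <- 0

colon_seq <- 1:n         # c(1, 0): counts down, usually not what was intended
safe_seq  <- seq_len(n)  # integer(0): an empty sequence

# A loop over seq_len(n) runs zero times when n is 0
iterations <- 0
for (i in seq_len(n)) {
  iterations <- iterations + 1
}

print(colon_seq)
print(length(safe_seq))
print(iterations)
```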
4.3.2 rep() - Repeat Values
The rep() function repeats values:
# Different repetition patterns
rep(5, times = 3) # Repeat 5 three times
[1] 5 5 5
rep(c(1, 2), times = 3) # Repeat vector three times
[1] 1 2 1 2 1 2
rep(c(1, 2), each = 3) # Repeat each element three times
[1] 1 1 1 2 2 2
rep(1:3, length.out = 10) # Repeat to reach specified length
[1] 1 2 3 1 2 3 1 2 3 1
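The times and each arguments can also be combined, and times accepts one count per element. A brief sketch with illustrative variable names:

```r
# One repeat count per element: "a" twice, "b" three times
per_element <- rep(c("a", "b"), times = c(2, 3))

# each is applied first, then the result is repeated times times
combined <- rep(1:2, each = 2, times = 2)

print(per_element)  # "a" "a" "b" "b" "b"
print(combined)     # 1 1 2 2 1 1 2 2
```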
5 The Apply Family of Functions
The apply family provides efficient ways to apply functions across different dimensions of data structures. These functions are alternatives to writing explicit loops.
5.1 apply() - For Matrices and Arrays
The apply() function applies a function over the margins of an array or matrix.
Syntax: apply(X, MARGIN, FUN, ...)
- X: Array or matrix
- MARGIN: 1 = rows, 2 = columns, c(1, 2) = both
- FUN: Function to apply
- ...: Additional arguments to FUN
# Create a sample matrix
m <- matrix(1:12, nrow = 3, ncol = 4)
print(m)
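A minimal sketch of the three MARGIN settings, using the same 3x4 matrix. The result names row_sums, col_sums, and doubled are illustrative.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

# MARGIN = 1: apply the function to each row
row_sums <- apply(m, 1, sum)   # 22 26 30

# MARGIN = 2: apply the function to each column
col_sums <- apply(m, 2, sum)   # 6 15 24 33

# MARGIN = c(1, 2): apply the function to every individual cell
doubled <- apply(m, c(1, 2), function(v) v * 2)

print(row_sums)
print(col_sums)
print(doubled)
```

For simple sums and means, rowSums(), colSums(), rowMeans(), and colMeans() give the same results and are usually faster than apply().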
lapply(lst, function(x) x[x > 8]) # Filter elements > 8
$a
integer(0)
$b
[1] 9 10
$c
[1] 11 12 13 14 15
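The filter call above relies on a list named lst. As a self-contained sketch, assuming lst is the same three-element list defined in the sapply() section, lapply() always returns a list with one element per input element, even when each result is a single number:

```r
# Assumed to match the list used elsewhere in this chapter
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# One result per element, always wrapped in a list
sums <- lapply(lst, sum)

# Names of the input carry over to the output
print(names(sums))
print(sums$b)      # 40
print(class(sums)) # "list"
```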
5.3 sapply() - Simplified lapply()
The sapply() function is similar to lapply() but attempts to simplify the result to a vector or matrix when possible.
# Same data as above
lst <- list(a = 1:5, b = 6:10, c = 11:15)
# sapply returns simplified results
sapply(lst, sum) # Returns named vector instead of list
a b c
15 40 65
sapply(lst, mean) # Returns named vector
a b c
3 8 13
sapply(lst, range) # Returns matrix when each result has same length
a b c
[1,] 1 6 11
[2,] 5 10 15
# Compare lapply vs sapply
result_lapply <- lapply(lst, mean)
result_sapply <- sapply(lst, mean)
class(result_lapply) # List
[1] "list"
class(result_sapply) # Named numeric vector
[1] "numeric"
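Simplification only happens "when possible". The sketch below shows the fallback: when the results have different lengths, sapply() quietly returns a list, which is the main reason its return type can be hard to predict. Variable names are illustrative.

```r
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# Results have lengths 0, 2, and 5, so no simplification is possible
filtered <- sapply(lst, function(x) x[x > 8])
print(class(filtered))   # "list"

# Results all have length 2, so they are bound into a 2 x 3 matrix
first_two <- sapply(lst, function(x) x[1:2])
print(dim(first_two))    # 2 3
```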
5.4 vapply() - Type-Safe sapply()
The vapply() function is like sapply() but requires an explicit template (FUN.VALUE) for the type and length of each result, which makes it safer and sometimes faster.
Syntax: vapply(X, FUN, FUN.VALUE, ...)
# vapply with explicit return type specification
lst <- list(a = 1:5, b = 6:10, c = 11:15)
# Specify that each function call returns a single numeric value
vapply(lst, mean, FUN.VALUE = numeric(1))
a b c
3 8 13
# Specify that each function call returns two numeric values
vapply(lst, range, FUN.VALUE = numeric(2))
a b c
[1,] 1 6 11
[2,] 5 10 15
# This would throw an error if the function doesn't return the expected type
# vapply(lst, mean, FUN.VALUE = character(1)) # Error!
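To see the type check in action without stopping a script, the failing call can be wrapped in tryCatch(). The sketch below also compares the two functions on empty input, where the difference in predictability is most visible. Variable names are illustrative.

```r
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# Capture the error message instead of halting
msg <- tryCatch(
  vapply(lst, mean, FUN.VALUE = character(1)),
  error = function(e) conditionMessage(e)
)
print(msg)

# On empty input, sapply() returns an empty list ...
empty <- list()
sapply_empty <- sapply(empty, length)

# ... while vapply() still returns the declared type
vapply_empty <- vapply(empty, length, FUN.VALUE = integer(1))

print(class(sapply_empty))  # "list"
print(class(vapply_empty))  # "integer"
```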
5.5 mapply() - Multivariate Apply
The mapply() function applies a function to multiple arguments element-wise.
# Apply function to multiple vectors element-wise
vec1 <- 1:3
vec2 <- 10:12
vec3 <- 100:102
# Add corresponding elements from two vectors
mapply(sum, vec1, vec2)
[1] 11 13 15
# Equivalent to: c(sum(1, 10), sum(2, 11), sum(3, 12))
# Add corresponding elements from three vectors
mapply(sum, vec1, vec2, vec3)
[1] 111 114 117
# Using mapply with custom functions
mapply(function(x, y) x^y, vec1, c(2, 3, 4))
[1] 1 8 81
# mapply with lists
list1 <- list(a = 1:3, b = 4:6)
list2 <- list(x = 7:9, y = 10:12)
mapply(function(a, b) sum(a) + sum(b), list1, list2)
a b
30 48
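Two further mapply() behaviors are worth a short sketch: arguments that should stay constant across calls go in MoreArgs, and results of unequal length come back as a list rather than a vector. The scale parameter and variable names below are hypothetical.

```r
vec1 <- 1:3
vec2 <- 10:12

# scale is passed unchanged to every call via MoreArgs
scaled <- mapply(
  function(x, y, scale) (x + y) * scale,
  vec1, vec2,
  MoreArgs = list(scale = 2)
)
print(scaled)  # 22 26 30

# Results of different lengths cannot be simplified, so a list is returned
ragged <- mapply(rep, 1:3, 3:1)
print(ragged)
```

Setting SIMPLIFY = FALSE forces a list in every case, which keeps the return type predictable.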
5.6 tapply() - Apply by Groups
The tapply() function applies a function to subsets of a vector based on grouping factors.
Syntax: tapply(X, INDEX, FUN, ...)
# Sample data with grouping
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
groups <- c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B")
# Apply function by groups
tapply(values, groups, mean) # Mean by group
A B
5 6
tapply(values, groups, sum) # Sum by group
A B
25 30
tapply(values, groups, length) # Count by group
A B
5 5
# Using tapply with built-in datasets
tapply(mtcars$mpg, mtcars$cyl, mean) # Mean mpg by cylinder count
4 6 8
26.66364 19.74286 15.10000
tapply(mtcars$hp, mtcars$gear, median) # Median horsepower by gear count
3 4 5
180 94 175
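INDEX can also be a list of several grouping vectors, in which case tapply() returns a table with one dimension per factor. A small sketch, where the second grouping vector half is a hypothetical addition to the sample data:

```r
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
groups <- c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B")
half   <- rep(c("low", "high"), each = 5)  # hypothetical second factor

# Rows follow groups, columns follow half (levels sorted alphabetically)
by_both <- tapply(values, list(groups, half), sum)
print(by_both)

# Individual cells can be looked up by name
print(by_both["A", "low"])   # 1 + 3 + 5 = 9
print(by_both["B", "high"])  # 6 + 8 + 10 = 24
```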
6 Data Subsetting and Indexing
R provides powerful subsetting capabilities for extracting specific elements from data structures.
6.1 Vector Subsetting
# Create sample vector
x <- c(10, 20, 30, 40, 50)
names(x) <- letters[1:5]
# Subsetting by position
x[1] # First element
a
10
x[c(1, 3, 5)] # Elements 1, 3, and 5
a c e
10 30 50
x[-2] # All except second element
a c d e
10 30 40 50
x[1:3] # First three elements
a b c
10 20 30
# Subsetting by name
x["a"] # Element named 'a'
a
10
x[c("a", "c")] # Elements named 'a' and 'c'
a c
10 30
# Logical subsetting
x[x > 25] # Elements greater than 25
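Logical subsetting works by building a TRUE/FALSE vector of the same length as x and keeping the TRUE positions. The sketch below rebuilds the sample vector and shows a combined condition plus which(), which converts the logical vector into positions. Variable names are illustrative.

```r
x <- c(10, 20, 30, 40, 50)
names(x) <- letters[1:5]

# The comparison itself yields a logical vector
is_big <- x > 25
print(is_big)

# Keep the elements where the condition is TRUE
big <- x[is_big]                 # c = 30, d = 40, e = 50

# Conditions can be combined with & (and) and | (or)
middle <- x[x > 15 & x < 45]     # b = 20, c = 30, d = 40

# which() returns the positions of the TRUE values
positions <- which(x > 25)       # 3, 4, 5

print(big)
print(middle)
print(positions)
```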
# Define custom functions for use with apply family
# Function to calculate coefficient of variation (as a percentage)
cv <- function(x) {
  sd(x) / mean(x) * 100
}
# Function to calculate the width of the range (max minus min)
my_range <- function(x) {
  max(x) - min(x)
}
# Apply custom functions (rnorm() output differs on every run unless a seed is set)
sample_data <- matrix(rnorm(20), nrow = 4)
apply(sample_data, 2, cv) # CV for each column
apply(sample_data, 1, my_range) # Range for each row
[1] 2.393392 3.768018 2.023416 1.693215
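Custom functions can take extra parameters, which apply() forwards through its ... argument. The sketch below uses a hypothetical variant of my_range() with an na.rm parameter and a small fixed matrix so the results are reproducible.

```r
# Hypothetical variant of my_range() that can ignore missing values
my_range_na <- function(x, na.rm = FALSE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

# Fixed data: row 1 is (1, 3, 5), row 2 is (NA, 4, 6)
scores <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

# Without na.rm, the row containing NA yields NA
strict <- apply(scores, 1, my_range_na)

# na.rm = TRUE is forwarded to my_range_na() for every row
lenient <- apply(scores, 1, my_range_na, na.rm = TRUE)

print(strict)   # 4 NA
print(lenient)  # 4 2
```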
7.2 Nested Apply Operations
# Create nested list structure
nested_list <- list(
  group1 = list(a = 1:5, b = 6:10),
  group2 = list(c = 11:15, d = 16:20)
)
# Apply function to nested structure
lapply(nested_list, function(group) {
  sapply(group, mean)
})
$group1
a b
3 8
$group2
c d
13 18
# Alternative using nested apply
lapply(nested_list, function(group) {
  lapply(group, summary)
})
$group1
$group1$a
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 2 3 3 4 5
$group1$b
Min. 1st Qu. Median Mean 3rd Qu. Max.
6 7 8 8 9 10
$group2
$group2$c
Min. 1st Qu. Median Mean 3rd Qu. Max.
11 12 13 13 14 15
$group2$d
Min. 1st Qu. Median Mean 3rd Qu. Max.
16 17 18 18 19 20
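As a possible alternative to nesting lapply() calls by hand, base R also offers rapply(), which applies a function recursively to the leaves of a nested list. This is a sketch of that approach, not the method used above; the name flat_means is illustrative.

```r
nested_list <- list(
  group1 = list(a = 1:5, b = 6:10),
  group2 = list(c = 11:15, d = 16:20)
)

# how = "unlist" flattens the result into a named vector
flat_means <- rapply(nested_list, mean, how = "unlist")
print(flat_means)

# how = "list" keeps the original nesting
nested_means <- rapply(nested_list, mean, how = "list")
print(nested_means$group2$d)  # 18
```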
8 Performance Considerations
8.1 Vectorization vs Apply vs Loops
# Create large dataset for performance comparison
n <- 10000
x <- rnorm(n)
m <- matrix(rnorm(n * 100), nrow = n)
# Timing different approaches (conceptual - actual timing may vary)
# Vectorized (fastest)
result1 <- colMeans(m)
# Apply family (fast)
result2 <- apply(m, 2, mean)
# Explicit loop (slowest)
result3 <- numeric(ncol(m))
for (i in seq_len(ncol(m))) {
  result3[i] <- mean(m[, i])
}
# All results should be equivalent
all.equal(result1, result2)
[1] TRUE
all.equal(result1, result3)
[1] TRUE
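The ranking in the comments above can be checked on a given machine with system.time(). The sketch below uses a smaller matrix than the example so it runs quickly; timings vary between runs and systems, so no expected numbers are shown.

```r
set.seed(42)  # fixed seed so the data is reproducible
m <- matrix(rnorm(1000 * 100), nrow = 1000)

# Time each approach; the assignment inside system.time() is kept
t_vec   <- system.time(r1 <- colMeans(m))
t_apply <- system.time(r2 <- apply(m, 2, mean))
t_loop  <- system.time({
  r3 <- numeric(ncol(m))
  for (i in seq_len(ncol(m))) {
    r3[i] <- mean(m[, i])
  }
})

# Elapsed wall-clock seconds for each approach
timings <- c(
  vectorized = t_vec[["elapsed"]],
  apply      = t_apply[["elapsed"]],
  loop       = t_loop[["elapsed"]]
)
print(timings)
```

A single run on a small matrix is noisy, so repeating the measurement several times gives a more reliable comparison.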
9 Practical Examples
9.1 Data Analysis with Base R Functions
# Using mtcars dataset for practical examples
data(mtcars)
head(mtcars)
# Summary statistics by group using tapply
tapply(mtcars$mpg, mtcars$cyl, summary)
$`4`
Min. 1st Qu. Median Mean 3rd Qu. Max.
21.40 22.80 26.00 26.66 30.40 33.90
$`6`
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.80 18.65 19.70 19.74 21.00 21.40
$`8`
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 14.40 15.20 15.10 16.25 19.20
mpg cyl disp hp drat wt qsec
mean 20.090625 6.187500 230.7219 146.68750 3.5965625 3.2172500 17.848750
sd 6.026948 1.785922 123.9387 68.56287 0.5346787 0.9784574 1.786943
vs am gear carb
mean 0.4375000 0.4062500 3.6875000 2.8125
sd 0.5040161 0.4989909 0.7378041 1.6152
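The call that produced the mean and sd table above is not shown. One way to obtain a table of that shape, offered here as an assumption rather than the original code, is to have sapply() return a named two-element vector for each column of mtcars:

```r
# Assumed reconstruction: one mean and one sd per column of mtcars
col_stats <- sapply(mtcars, function(col) {
  c(mean = mean(col), sd = sd(col))
})

# The result is a 2 x 11 matrix with "mean" and "sd" as row names
print(dim(col_stats))
print(col_stats[, c("mpg", "hp")])
```

Because every call returns a vector of length 2, sapply() can simplify the results into a matrix with one column per variable.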
10 Exercises
10.1 Exercise 1: Matrix Operations
Create a 4x6 matrix of random numbers and use apply() to:
1. Calculate the maximum value in each row
2. Calculate the standard deviation of each column
3. Find the median of each row
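One possible solution sketch for this exercise is shown below. The seed value and variable names are arbitrary choices, and the numbers will differ with other seeds.

```r
set.seed(123)  # arbitrary seed so the random matrix is reproducible
mat <- matrix(rnorm(24), nrow = 4, ncol = 6)

# 1. Maximum value in each row (MARGIN = 1)
row_max <- apply(mat, 1, max)

# 2. Standard deviation of each column (MARGIN = 2)
col_sd <- apply(mat, 2, sd)

# 3. Median of each row (MARGIN = 1)
row_median <- apply(mat, 1, median)

print(row_max)
print(col_sd)
print(row_median)
```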
Using the built-in iris dataset:
1. Filter rows where Sepal.Length > 6
2. Calculate mean values for each numeric column by Species
3. Find the species with the highest average Petal.Width
# Solution template:
#
# # 1. Filter
# filtered_iris <- iris[iris$Sepal.Length > 6, ]
#
# # 2. Mean by species
# numeric_vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
# species_means <- sapply(numeric_vars, function(var) {
#   tapply(iris[[var]], iris$Species, mean)
# })
#
# # 3. Species with highest average Petal.Width
# petal_means <- tapply(iris$Petal.Width, iris$Species, mean)
# names(petal_means)[which.max(petal_means)]
11 Summary
Base R functions and the apply family provide powerful tools for data manipulation and analysis:
Utility functions like sum(), mean(), seq(), and rep() form the foundation of data analysis
apply() works on matrices and arrays across specified dimensions
lapply() applies functions to lists, returning lists
sapply() simplifies lapply results when possible
vapply() provides type-safe apply operations
mapply() handles multiple arguments element-wise
tapply() applies functions by groups
Subsetting allows precise data extraction using various indexing methods
Understanding these functions is crucial for efficient R programming and forms the basis for more advanced data manipulation techniques.