2  Base R Functions & Apply Family

3 Introduction

Base R provides a rich set of built-in functions for data manipulation, statistical calculations, and programming. The apply family of functions is particularly powerful for applying operations across different dimensions of data structures. This chapter covers essential base R functions and the apply family in detail.

4 Common Utility Functions

Base R includes numerous utility functions for statistical calculations, data generation, and manipulation.

4.1 Statistical Functions

# Create sample data
x <- 1:10
print(x)
 [1]  1  2  3  4  5  6  7  8  9 10
# Basic statistical functions
sum(x)        # Sum of all elements
[1] 55
mean(x)       # Arithmetic mean
[1] 5.5
median(x)     # Median value
[1] 5.5
sd(x)         # Standard deviation
[1] 3.02765
var(x)        # Variance
[1] 9.166667
min(x)        # Minimum value
[1] 1
max(x)        # Maximum value
[1] 10
range(x)      # Returns min and max
[1]  1 10
length(x)     # Number of elements
[1] 10
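
Most of these summary functions return NA as soon as the input contains a missing value. The sketch below uses a made-up vector y (not part of the example above) to show how the na.rm argument changes the result.

```r
# Hypothetical vector with one missing value (illustration only)
y <- c(4, 8, NA, 10)

mean_with_na <- mean(y)                # NA: the missing value propagates
mean_no_na   <- mean(y, na.rm = TRUE)  # NA is dropped before averaging
sum_no_na    <- sum(y, na.rm = TRUE)   # sum of the three observed values
n_missing    <- sum(is.na(y))          # number of missing values
```

Counting the missing values first, as in the last line, is a cheap way to check how much data na.rm = TRUE would silently discard.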

4.2 Quantile Function

The quantile() function calculates sample quantiles corresponding to given probabilities:

x <- 1:10
quantile(x)                    # Default quartiles (0%, 25%, 50%, 75%, 100%)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 
quantile(x, probs = 0.9)       # 90th percentile
90% 
9.1 
quantile(x, probs = c(0.1, 0.5, 0.9))  # Custom percentiles
10% 50% 90% 
1.9 5.5 9.1 
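
The interquartile range is the distance between the 25% and 75% quantiles, so it can be derived from quantile() directly. The sketch below checks this against the built-in IQR() function; the variable names are chosen only for illustration.

```r
x <- 1:10

# IQR computed by hand from the two quartiles
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q["75%"] - q["25%"])

# Built-in equivalent
iqr_builtin <- IQR(x)

# names = FALSE returns a plain numeric value without the "90%" label
q_plain <- quantile(x, probs = 0.9, names = FALSE)
```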

4.3 Sequence and Repetition Functions

4.3.1 seq() - Generate Sequences

The seq() function creates sequences of numbers:

# Different ways to create sequences
seq(0, 1, by = 0.1)           # From 0 to 1, increment by 0.1
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, 1, length.out = 11)    # From 0 to 1, 11 equally spaced values
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 5, to = 50, by = 5) # From 5 to 50, increment by 5
 [1]  5 10 15 20 25 30 35 40 45 50
seq_along(letters[1:5])        # Sequence along the length of an object
[1] 1 2 3 4 5
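
One reason to prefer seq_len() and seq_along() over the 1:n shortcut is their behavior for empty input. A minimal sketch, assuming a length n that happens to be zero:

```r
n <- 0

bad_index  <- 1:n         # counts down: 1, 0
safe_index <- seq_len(n)  # empty integer vector

# A loop over seq_len(n) runs zero times, as intended
iterations <- 0
for (i in seq_len(n)) iterations <- iterations + 1

# seq_along() behaves the same way for an empty object
empty_index <- seq_along(character(0))
```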

4.3.2 rep() - Repeat Values

The rep() function repeats values:

# Different repetition patterns
rep(5, times = 3)              # Repeat 5 three times
[1] 5 5 5
rep(c(1, 2), times = 3)        # Repeat vector three times
[1] 1 2 1 2 1 2
rep(c(1, 2), each = 3)         # Repeat each element three times
[1] 1 1 1 2 2 2
rep(1:3, length.out = 10)      # Repeat to reach specified length
 [1] 1 2 3 1 2 3 1 2 3 1
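
The times argument also accepts one count per element, and each can be combined with times. The labels in this sketch are invented for illustration.

```r
# One repetition count per element
labels <- rep(c("ctrl", "trt"), times = c(2, 3))
# "ctrl" "ctrl" "trt" "trt" "trt"

# each is applied first, then the whole result is repeated
combo <- rep(1:2, each = 2, times = 2)
# 1 1 2 2 1 1 2 2
```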

5 The Apply Family of Functions

The apply family provides concise ways to apply a function across the elements or dimensions of a data structure. Each function replaces an explicit loop with a single call, which is usually easier to read, though not necessarily faster than a well-written loop.

5.1 apply() - For Matrices and Arrays

The apply() function applies a function over the margins of an array or matrix.

Syntax: apply(X, MARGIN, FUN, ...)

  • X: Array or matrix
  • MARGIN: 1 = rows, 2 = columns, c(1, 2) = both
  • FUN: Function to apply
  • ...: Additional arguments to FUN

# Create a sample matrix
m <- matrix(1:12, nrow = 3, ncol = 4)
print(m)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Apply functions across rows (MARGIN = 1)
apply(m, 1, sum)      # Row sums
[1] 22 26 30
apply(m, 1, mean)     # Row means
[1] 5.5 6.5 7.5
apply(m, 1, max)      # Row maxima
[1] 10 11 12
# Apply functions across columns (MARGIN = 2)
apply(m, 2, sum)      # Column sums
[1]  6 15 24 33
apply(m, 2, mean)     # Column means
[1]  2  5  8 11
apply(m, 2, sd)       # Column standard deviations
[1] 1 1 1 1
# Using built-in optimized functions (when available)
rowSums(m)            # Equivalent to apply(m, 1, sum) but faster
[1] 22 26 30
colSums(m)            # Equivalent to apply(m, 2, sum) but faster
[1]  6 15 24 33
rowMeans(m)           # Equivalent to apply(m, 1, mean) but faster
[1] 5.5 6.5 7.5
colMeans(m)           # Equivalent to apply(m, 2, mean) but faster
[1]  2  5  8 11
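
When FUN returns more than one value, apply() stores each result as a column, so applying over rows yields a result that looks transposed. The sketch below also shows MARGIN = c(1, 2), which calls the function once per cell.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

# range() returns two values per row, so the result is 2 x 3:
# one column for each row of m
row_ranges <- apply(m, 1, range)

# MARGIN = c(1, 2) visits every cell; the result keeps the shape of m
doubled <- apply(m, c(1, 2), function(v) v * 2)
```

For simple arithmetic such as doubling, the vectorized expression m * 2 gives the same result without apply().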

5.2 lapply() - For Lists and Vectors

The lapply() function applies a function to each element of a list or vector and returns a list.

Syntax: lapply(X, FUN, ...)

# Create sample data
lst <- list(a = 1:5, b = 6:10, c = 11:15)
print(lst)
$a
[1] 1 2 3 4 5

$b
[1]  6  7  8  9 10

$c
[1] 11 12 13 14 15
# Apply function to each list element
lapply(lst, sum)      # Sum of each element
$a
[1] 15

$b
[1] 40

$c
[1] 65
lapply(lst, mean)     # Mean of each element
$a
[1] 3

$b
[1] 8

$c
[1] 13
lapply(lst, length)   # Length of each element
$a
[1] 5

$b
[1] 5

$c
[1] 5
# Using lapply with custom functions
lapply(lst, function(x) x^2)              # Square each element
$a
[1]  1  4  9 16 25

$b
[1]  36  49  64  81 100

$c
[1] 121 144 169 196 225
lapply(lst, function(x) x[x > 8])         # Filter elements > 8
$a
integer(0)

$b
[1]  9 10

$c
[1] 11 12 13 14 15
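
Extra arguments after FUN are passed on to every call. If the function also needs the element names, one common pattern is to iterate over the indices instead of the elements. A minimal sketch:

```r
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# n = 2 is forwarded to head() for each element
first_two <- lapply(lst, head, n = 2)

# Iterate over positions to reach both the name and the value
labelled <- lapply(seq_along(lst), function(i) {
  paste(names(lst)[i], sum(lst[[i]]))
})
```

Note that the second result is an unnamed list, because the input to lapply() is the index vector rather than lst itself.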

5.3 sapply() - Simplified lapply()

The sapply() function is similar to lapply() but attempts to simplify the result to a vector or matrix when possible.

# Same data as above
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# sapply returns simplified results
sapply(lst, sum)      # Returns named vector instead of list
 a  b  c 
15 40 65 
sapply(lst, mean)     # Returns named vector
 a  b  c 
 3  8 13 
sapply(lst, range)    # Returns matrix when each result has same length
     a  b  c
[1,] 1  6 11
[2,] 5 10 15
# Compare lapply vs sapply
result_lapply <- lapply(lst, mean)
result_sapply <- sapply(lst, mean)
class(result_lapply)  # List
[1] "list"
class(result_sapply)  # Named numeric vector
[1] "numeric"
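
Because simplification depends on the results, the return type of sapply() can change with the data. The sketch below shows three shapes produced by the same function family; this unpredictability is the main reason to be careful with sapply() inside functions.

```r
lst <- list(a = 1:5, b = 6:10, c = 11:15)

same_len <- sapply(lst, range)                 # equal lengths: matrix
diff_len <- sapply(lst, function(x) x[x > 8])  # lengths 0, 2, 5: stays a list
empty_in <- sapply(list(), length)             # empty input: list(), not integer(0)
```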

5.4 vapply() - Type-Safe sapply()

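The vapply() function is like sapply() but requires an explicit template for the return value (FUN.VALUE), which makes it safer and sometimes faster. If any result does not match the template's type and length, vapply() stops with an error instead of silently changing the output shape.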

Syntax: vapply(X, FUN, FUN.VALUE, ...)

# vapply with explicit return type specification
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# Specify that each function call returns a single numeric value
vapply(lst, mean, FUN.VALUE = numeric(1))
 a  b  c 
 3  8 13 
# Specify that each function call returns two numeric values
vapply(lst, range, FUN.VALUE = numeric(2))
     a  b  c
[1,] 1  6 11
[2,] 5 10 15
# This call fails because mean() returns a numeric value, not a character string
# vapply(lst, mean, FUN.VALUE = character(1))  # Error!
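
The failing call can be observed safely by wrapping it in tryCatch(). The sketch below also shows that vapply() returns a correctly typed empty vector for empty input; the message string is made up for illustration.

```r
lst <- list(a = 1:5, b = 6:10, c = 11:15)

# mean() returns a double, so a character(1) template is rejected
outcome <- tryCatch(
  vapply(lst, mean, FUN.VALUE = character(1)),
  error = function(e) "type mismatch caught"
)

# Empty input still yields the declared type
empty_result <- vapply(list(), length, FUN.VALUE = integer(1))
```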

5.5 mapply() - Multivariate Apply

The mapply() function applies a function to multiple arguments element-wise.

Syntax: mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)

# Apply function to multiple vectors element-wise
vec1 <- 1:3
vec2 <- 10:12
vec3 <- 100:102

# Add corresponding elements from two vectors
mapply(sum, vec1, vec2)
[1] 11 13 15
# Equivalent to: c(sum(1, 10), sum(2, 11), sum(3, 12))

# Add corresponding elements from three vectors
mapply(sum, vec1, vec2, vec3)
[1] 111 114 117
# Using mapply with custom functions
mapply(function(x, y) x^y, vec1, c(2, 3, 4))
[1]  1  8 81
# mapply with lists
list1 <- list(a = 1:3, b = 4:6)
list2 <- list(x = 7:9, y = 10:12)
mapply(function(a, b) sum(a) + sum(b), list1, list2)
 a  b 
30 48 
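
The two remaining arguments in the syntax line can be sketched as follows: MoreArgs supplies values that stay fixed across all calls, and SIMPLIFY = FALSE keeps the result as a list. The scale parameter is a hypothetical name used only here.

```r
vec1 <- 1:3
vec2 <- 10:12

# scale is the same for every call, so it goes in MoreArgs
scaled <- mapply(function(x, y, scale) (x + y) * scale,
                 vec1, vec2, MoreArgs = list(scale = 2))
# 22 26 30

# Results of different lengths cannot be simplified: a list is returned
ragged <- mapply(rep, 1:3, 3:1)

# Force a list even when simplification would be possible
as_list <- mapply(function(x, y) x + y, vec1, vec2, SIMPLIFY = FALSE)
```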

5.6 tapply() - Apply by Groups

The tapply() function applies a function to subsets of a vector based on grouping factors.

Syntax: tapply(X, INDEX, FUN, ...)

# Sample data with grouping
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
groups <- c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B")

# Apply function by groups
tapply(values, groups, mean)    # Mean by group
A B 
5 6 
tapply(values, groups, sum)     # Sum by group
 A  B 
25 30 
tapply(values, groups, length)  # Count by group
A B 
5 5 
# Using tapply with built-in datasets
tapply(mtcars$mpg, mtcars$cyl, mean)    # Mean mpg by cylinder count
       4        6        8 
26.66364 19.74286 15.10000 
tapply(mtcars$hp, mtcars$gear, median)  # Median horsepower by gear count
  3   4   5 
180  94 175 
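
INDEX can also be a list of several grouping vectors, in which case tapply() returns a table with one dimension per factor. A minimal sketch with mtcars, grouping by cylinder count and transmission type:

```r
# Mean mpg for every combination of cyl (rows) and am (columns)
mpg_by_cyl_am <- tapply(
  mtcars$mpg,
  list(cyl = mtcars$cyl, am = mtcars$am),
  mean
)

# Individual cells are addressed by the group labels
mpg_6cyl_automatic <- mpg_by_cyl_am["6", "0"]
```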

6 Data Subsetting and Indexing

R provides powerful subsetting capabilities for extracting specific elements from data structures.

6.1 Vector Subsetting

# Create sample vector
x <- c(10, 20, 30, 40, 50)
names(x) <- letters[1:5]

# Subsetting by position
x[1]              # First element
 a 
10 
x[c(1, 3, 5)]     # Elements 1, 3, and 5
 a  c  e 
10 30 50 
x[-2]             # All except second element
 a  c  d  e 
10 30 40 50 
x[1:3]            # First three elements
 a  b  c 
10 20 30 
# Subsetting by name
x["a"]            # Element named 'a'
 a 
10 
x[c("a", "c")]    # Elements named 'a' and 'c'
 a  c 
10 30 
# Logical subsetting
x[x > 25]         # Elements greater than 25
 c  d  e 
30 40 50 
x[x %in% c(20, 40)]  # Elements equal to 20 or 40
 b  d 
20 40 
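
Logical subsetting needs some care when the vector contains missing values, because a comparison with NA yields NA and the NA is carried into the result. The sketch below uses an invented vector y and shows which() as one way to drop such entries.

```r
# Hypothetical vector with a missing value
y <- c(a = 5, b = NA, c = 15)

with_na <- y[y > 8]         # the NA comparison produces an NA entry
no_na   <- y[which(y > 8)]  # which() keeps only positions that are TRUE
```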

6.2 Data Frame Subsetting

# Create sample data frame
df <- data.frame(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  age = c(25, 30, 35, 28, 32),
  salary = c(50000, 60000, 70000, 55000, 65000)
)
print(df)
  id    name age salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   Diana  28  55000
5  5     Eve  32  65000
# Subsetting columns
df$name           # Single column by name (returns vector)
[1] "Alice"   "Bob"     "Charlie" "Diana"   "Eve"    
df["name"]        # Single column by name (returns data frame)
     name
1   Alice
2     Bob
3 Charlie
4   Diana
5     Eve
df[c("name", "age")]  # Multiple columns
     name age
1   Alice  25
2     Bob  30
3 Charlie  35
4   Diana  28
5     Eve  32
# Subsetting rows
df[1, ]           # First row
  id  name age salary
1  1 Alice  25  50000
df[c(1, 3), ]     # Rows 1 and 3
  id    name age salary
1  1   Alice  25  50000
3  3 Charlie  35  70000
df[df$age > 30, ] # Rows where age > 30
  id    name age salary
3  3 Charlie  35  70000
5  5     Eve  32  65000
# Subsetting specific elements
df[1, "name"]     # Specific cell
[1] "Alice"
df[2, 3]          # Row 2, column 3
[1] 30
df[df$salary > 60000, c("name", "salary")]  # Conditional row and column selection
     name salary
3 Charlie  70000
5     Eve  65000
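
Selecting a single column with the [row, column] form drops the result to a vector unless drop = FALSE is given. The sketch below rebuilds the same data frame and contrasts the variants, along with [[ ]] and subset() as alternatives.

```r
df <- data.frame(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  age = c(25, 30, 35, 28, 32),
  salary = c(50000, 60000, 70000, 55000, 65000)
)

as_vector  <- df[, "name"]                # dropped to a character vector
as_frame   <- df[, "name", drop = FALSE]  # stays a one-column data frame
via_double <- df[["age"]]                 # same as df$age

# subset() evaluates the condition inside the data frame
older <- subset(df, age > 30, select = c(name, salary))
```

subset() is convenient interactively; inside functions, the explicit df[condition, columns] form is generally the more predictable choice.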

6.3 Matrix Subsetting

# Create sample matrix
m <- matrix(1:20, nrow = 4, ncol = 5)
rownames(m) <- paste0("row", 1:4)
colnames(m) <- paste0("col", 1:5)
print(m)
     col1 col2 col3 col4 col5
row1    1    5    9   13   17
row2    2    6   10   14   18
row3    3    7   11   15   19
row4    4    8   12   16   20
# Subsetting by position
m[1, 2]           # Element in row 1, column 2
[1] 5
m[1, ]            # Entire first row
col1 col2 col3 col4 col5 
   1    5    9   13   17 
m[, 3]            # Entire third column
row1 row2 row3 row4 
   9   10   11   12 
m[1:2, 3:4]       # Submatrix: rows 1-2, columns 3-4
     col3 col4
row1    9   13
row2   10   14
# Subsetting by name
m["row1", "col3"] # Element by row and column names
[1] 9
m[c("row1", "row3"), ] # Specific rows by name
     col1 col2 col3 col4 col5
row1    1    5    9   13   17
row3    3    7   11   15   19
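
As with data frames, extracting a single row or column from a matrix drops it to a vector by default. A minimal sketch of drop = FALSE and of logical subsetting on the same matrix:

```r
m <- matrix(1:20, nrow = 4, ncol = 5,
            dimnames = list(paste0("row", 1:4), paste0("col", 1:5)))

row_vec <- m[1, ]                # named vector, dimensions are lost
row_mat <- m[1, , drop = FALSE]  # still a 1 x 5 matrix

# A logical condition returns the matching cells as a plain vector
big <- m[m > 15]
```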

7 Advanced Function Applications

7.1 Custom Functions with Apply

# Define custom functions for use with apply family
# Function to calculate coefficient of variation
cv <- function(x) {
  sd(x) / mean(x) * 100
}

# Function to calculate range
my_range <- function(x) {
  max(x) - min(x)
}

# Apply custom functions
# Note: rnorm() is called without set.seed(), so the values below differ on every run
sample_data <- matrix(rnorm(20), nrow = 4)
apply(sample_data, 2, cv)        # CV for each column
[1] -533.98116  809.70423  -94.42305 -682.97227   56.59814
apply(sample_data, 1, my_range)  # Range for each row
[1] 2.393392 3.768018 2.023416 1.693215
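
The coefficient of variation is only meaningful for data with a clearly positive mean; standard normal draws have a mean near zero, which is why the CV values above are erratic and partly negative. The sketch below applies the same cv() function to positive data instead. The seed and the distribution parameters are arbitrary choices for illustration.

```r
cv <- function(x) {
  sd(x) / mean(x) * 100
}

# Columns of mtcars with strictly positive values
cv_mtcars <- sapply(mtcars[c("mpg", "hp", "wt")], cv)

# Simulated data centered well away from zero, with a fixed seed
set.seed(42)
sample_data <- matrix(rnorm(20, mean = 50, sd = 5), nrow = 4)
cv_cols <- apply(sample_data, 2, cv)
```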

7.2 Nested Apply Operations

# Create nested list structure
nested_list <- list(
  group1 = list(a = 1:5, b = 6:10),
  group2 = list(c = 11:15, d = 16:20)
)

# Apply function to nested structure
lapply(nested_list, function(group) {
  sapply(group, mean)
})
$group1
a b 
3 8 

$group2
 c  d 
13 18 
# Alternative: nested lapply() calls that keep the full summary for each element
lapply(nested_list, function(group) {
  lapply(group, summary)
})
$group1
$group1$a
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       2       3       3       4       5 

$group1$b
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      6       7       8       8       9      10 


$group2
$group2$c
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     11      12      13      13      14      15 

$group2$d
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     16      17      18      18      19      20 
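
For nested lists like this one, base R also offers rapply(), which applies a function recursively to every leaf. A minimal sketch on the same structure:

```r
nested_list <- list(
  group1 = list(a = 1:5, b = 6:10),
  group2 = list(c = 11:15, d = 16:20)
)

# how = "list" keeps the nesting
means_nested <- rapply(nested_list, mean, how = "list")

# how = "unlist" flattens the result into a named vector
means_flat <- rapply(nested_list, mean, how = "unlist")
```

The flattened names are built from the path to each leaf, for example group1.a.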

8 Performance Considerations

8.1 Vectorization vs Apply vs Loops

# Create large dataset for performance comparison
n <- 10000
x <- rnorm(n)
m <- matrix(rnorm(n * 100), nrow = n)

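# Timing different approaches (conceptual - actual timing may vary)
# Vectorized (fastest: colMeans() runs in compiled code)
result1 <- colMeans(m)

# Apply family (apply() loops at the R level; usually close to a for loop in speed)
result2 <- apply(m, 2, mean)

# Explicit loop with a preallocated result (usually comparable to apply())
result3 <- numeric(ncol(m))
for (i in seq_len(ncol(m))) {
  result3[i] <- mean(m[, i])
}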

# All results should be equivalent
all.equal(result1, result2)
[1] TRUE
all.equal(result1, result3)
[1] TRUE
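
To measure rather than assume the differences, each approach can be wrapped in system.time(). This is a rough sketch: a single run is noisy, the numbers depend on the machine, and the seed is an arbitrary choice.

```r
set.seed(1)
m <- matrix(rnorm(10000 * 100), nrow = 10000)

t_vec   <- system.time(r1 <- colMeans(m))
t_apply <- system.time(r2 <- apply(m, 2, mean))
t_loop  <- system.time({
  r3 <- numeric(ncol(m))
  for (i in seq_len(ncol(m))) r3[i] <- mean(m[, i])
})

# Elapsed seconds for each approach
timings <- c(
  colMeans = t_vec[["elapsed"]],
  apply    = t_apply[["elapsed"]],
  loop     = t_loop[["elapsed"]]
)
print(timings)
```

For more reliable comparisons, repeating each measurement several times and comparing the medians is a common approach.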

9 Practical Examples

9.1 Data Analysis with Base R Functions

# Using mtcars dataset for practical examples
data(mtcars)
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Summary statistics by group using tapply
tapply(mtcars$mpg, mtcars$cyl, summary)
$`4`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.40   22.80   26.00   26.66   30.40   33.90 

$`6`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.80   18.65   19.70   19.74   21.00   21.40 

$`8`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   14.40   15.20   15.10   16.25   19.20 
# Multiple statistics using sapply
stats_list <- list(
  mean_mpg = mean(mtcars$mpg),
  median_hp = median(mtcars$hp),
  sd_wt = sd(mtcars$wt)
)
sapply(stats_list, round, digits = 2)
 mean_mpg median_hp     sd_wt 
    20.09    123.00      0.98 
# Apply functions to multiple columns
numeric_cols <- sapply(mtcars, is.numeric)
sapply(mtcars[numeric_cols], function(x) c(mean = mean(x), sd = sd(x)))
           mpg      cyl     disp        hp      drat        wt      qsec
mean 20.090625 6.187500 230.7219 146.68750 3.5965625 3.2172500 17.848750
sd    6.026948 1.785922 123.9387  68.56287 0.5346787 0.9784574  1.786943
            vs        am      gear   carb
mean 0.4375000 0.4062500 3.6875000 2.8125
sd   0.5040161 0.4989909 0.7378041 1.6152
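
When the grouped result should be a data frame rather than a named vector or list, aggregate() is a base R alternative to tapply(). A minimal sketch using the formula interface:

```r
# Mean mpg and hp for each cylinder count, returned as a data frame
agg <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
print(agg)
```

Each row of the result corresponds to one group, with the grouping variable as the first column.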

10 Exercises

10.1 Exercise 1: Matrix Operations

Create a 4x6 matrix of random numbers and use apply() to:

  1. Calculate the maximum value in each row
  2. Calculate the standard deviation of each column
  3. Find the median of each row

# Solution template: 
# set.seed(123)
# m <- matrix(rnorm(24), nrow = 4)
# apply(m, 1, max)      # Row maxima
# apply(m, 2, sd)       # Column standard deviations  
# apply(m, 1, median)   # Row medians

10.2 Exercise 2: IQR Calculation

Write a base R snippet to compute the Interquartile Range (IQR) for each numeric column of the mtcars dataset.

# Solution template: 
# sapply(mtcars, IQR)
# # or
# apply(mtcars, 2, IQR)

10.3 Exercise 3: lapply vs sapply Comparison

Create a list with mixed data types and compare the behavior of lapply() vs sapply().

# Solution template: 
# mixed_list <- list(
#   numbers = 1:5,
#   characters = letters[1:3],
#   logicals = c(TRUE, FALSE, TRUE),
#   matrix = matrix(1:4, nrow = 2)
# )
# 
# lapply(mixed_list, class)  # Returns list
# sapply(mixed_list, class)  # Returns a list in R >= 4.0, where class() of a matrix is c("matrix", "array")
# 
# lapply(mixed_list, length) # Returns list
# sapply(mixed_list, length) # Returns named integer vector

10.4 Exercise 4: Custom Function with mapply

Use mapply() to calculate the area of rectangles given vectors of lengths and widths.

# Solution template:
# lengths <- c(2, 4, 6, 8)
# widths <- c(3, 5, 7, 9)
# mapply(function(l, w) l * w, lengths, widths)
# # or simply: 
# mapply(`*`, lengths, widths)

10.5 Exercise 5: Data Filtering and Aggregation

Using the built-in iris dataset:

  1. Filter rows where Sepal.Length > 6
  2. Calculate mean values for each numeric column by Species
  3. Find the species with the highest average Petal.Width

# Solution template:
# # 1. Filter
# filtered_iris <- iris[iris$Sepal.Length > 6, ]
# 
# # 2. Mean by species
# numeric_vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
# species_means <- sapply(numeric_vars, function(var) {
#   tapply(iris[[var]], iris$Species, mean)
# })
# 
# # 3. Species with highest average Petal.Width
# petal_means <- tapply(iris$Petal.Width, iris$Species, mean)
# names(petal_means)[which.max(petal_means)]

11 Summary

Base R functions and the apply family provide powerful tools for data manipulation and analysis:

  • Utility functions like sum(), mean(), seq(), and rep() form the foundation of data analysis
  • apply() works on matrices and arrays across specified dimensions
  • lapply() applies functions to lists, returning lists
  • sapply() simplifies lapply results when possible
  • vapply() provides type-safe apply operations
  • mapply() handles multiple arguments element-wise
  • tapply() applies functions by groups
  • Subsetting allows precise data extraction using various indexing methods

Understanding these functions is crucial for efficient R programming and forms the basis for more advanced data manipulation techniques.