3 Data Types & Data Structures

4 Introduction

This comprehensive guide covers R’s data types and structures with detailed explanations suitable for all skill levels. We’ll explore not just how to use each structure, but why and when to use them effectively.

4.1 Learning Objectives

By the end of this chapter, you will be able to:

Beginner: Understand basic data types and create simple data structures
Intermediate: Master complex data manipulation and understand memory concepts
Advanced: Optimize performance and implement advanced object-oriented concepts

4.2 R’s Memory Model: Foundation Concepts

R stores all data as objects in memory. Understanding this is crucial for efficient programming.

Key Memory Concepts

Copy-on-modify: R copies an object when it is modified while more than one name refers to it
Reference counting: R tracks how many variables point to an object
Garbage collection: Automatic memory cleanup of unused objects
Object attributes: Metadata attached to R objects
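
The effect of copy-on-modify can be seen from the values alone. The following is a minimal sketch with illustrative variable names (not taken from the chapter); it shows the visible behaviour, not the internal memory addresses.

```r
# Two names initially refer to the same vector
original <- c(1, 2, 3)
alias <- original

# Modifying through one name leaves the other untouched,
# because R copies the vector before changing it
alias[1] <- 100

print(original)  # still 1 2 3
print(alias)     # 100 2 3
```

Because of this behaviour, passing a vector to a function and modifying it there does not change the caller's copy.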

5 Atomic Vectors: The Building Blocks

5.1 Understanding Atomic Vectors

Atomic vectors are the fundamental building blocks of R. Every other data structure is built upon them.

5.1.1 The Six Atomic Types

# 1. Logical (boolean)
logical_vec <- c(TRUE, FALSE, TRUE, NA)
cat("Logical:", typeof(logical_vec), "\n")

Logical: logical

# 2. Integer (whole numbers with L suffix)
integer_vec <- c(1L, 2L, 3L, NA_integer_)
cat("Integer:", typeof(integer_vec), "\n")

Integer: integer

# 3. Double (real numbers)
double_vec <- c(1.5, 2.7, 3.14159, NA_real_)
cat("Double:", typeof(double_vec), "\n")

Double: double

# 4. Character (strings)
character_vec <- c("hello", "world", "R", NA_character_)
cat("Character:", typeof(character_vec), "\n")

Character: character

# 5. Complex (complex numbers)
complex_vec <- c(1+2i, 3+4i, 5+0i, NA_complex_)
cat("Complex:", typeof(complex_vec), "\n")

Complex: complex

# 6. Raw (bytes - rarely used in typical analysis)
raw_vec <- charToRaw("Hello")
cat("Raw:", typeof(raw_vec), "\n")

Raw: raw
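
A common source of confusion is which type a literal value gets. The sketch below (with made-up example values) checks a few literals with typeof() and then collects the types of several values in one step.

```r
# Numeric literals are double unless marked with the L suffix
typeof(1)      # "double"
typeof(1L)     # "integer"
typeof(1:3)    # "integer" (the colon operator yields integers)
typeof(1i)     # "complex"

# Collect the type of several example values in one named vector
examples <- list(flag = TRUE, count = 1L, ratio = 1, label = "a")
example_types <- vapply(examples, typeof, character(1))
print(example_types)
```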

5.1.2 Vector Properties and Inspection

# Create a sample vector for analysis
sample_vec <- c(10, 20, 30, 40, 50)

# Essential properties every R user should know
length(sample_vec)    # Number of elements

[1] 5

typeof(sample_vec)    # Internal storage type

[1] "double"

class(sample_vec)     # Object class (what R thinks it is)

[1] "numeric"

mode(sample_vec)      # Mode (legacy; reports both integer and double as "numeric")

[1] "numeric"

# Check what kind of object we have
is.vector(sample_vec)    # Is it a vector?

[1] TRUE

is.atomic(sample_vec)    # Is it atomic (not a list)?

[1] TRUE

is.numeric(sample_vec)   # Is it numeric?

[1] TRUE

is.double(sample_vec)    # Specifically double precision?

[1] TRUE

Theory: Atomic vectors are homogeneous: all elements must have the same type. When you mix types, R coerces them to a common type.

5.1.3 Type Coercion Hierarchy

# R's coercion hierarchy:  logical < integer < double < complex < character
# Mixing types forces coercion to the "highest" type

# Logical to integer
logical_to_int <- c(TRUE, FALSE, 1L)
cat("Result type:", typeof(logical_to_int), "\n")

Result type: integer

print(logical_to_int)

[1] 1 0 1

# Integer to double  
int_to_double <- c(1L, 2.5)
cat("Result type:", typeof(int_to_double), "\n")

Result type: double

print(int_to_double)

[1] 1.0 2.5

# Double to character (everything becomes character)
mixed_all <- c(TRUE, 1L, 2.5, "hello")
cat("Result type:", typeof(mixed_all), "\n")

Result type: character

print(mixed_all)

[1] "TRUE"  "1"     "2.5"   "hello"

Coercion Warning

Automatic coercion can lead to unexpected results. Always check your data types when debugging!
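
One way to avoid surprises is to convert explicitly with the as.*() functions. The sketch below uses illustrative values; the warning from the failed conversion is suppressed only so that the example runs quietly.

```r
# Explicit conversion makes the intended type visible
as.integer("42")            # 42L
as.numeric(c("1.5", "2"))   # 1.5 2.0
as.character(3.14)          # "3.14"

# Text that cannot be parsed becomes NA (R also emits a warning,
# suppressed here so the example runs quietly)
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)         # TRUE

# Logical values count as 0/1 in arithmetic
passed <- c(TRUE, FALSE, TRUE, TRUE)
n_passed <- sum(passed)     # 3
```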

5.1.4 Special Values in R

# R has several special values that beginners often find confusing
special_vals <- c(
  normal = 42,
  missing = NA,           # Not Available (missing value)
  not_number = NaN,       # Not a Number (0/0)
  positive_inf = Inf,     # Positive infinity (1/0)
  negative_inf = -Inf,    # Negative infinity (-1/0)
  null_length = NULL      # NULL has length 0, so c() drops it
)

# Functions to test for special values
test_vec <- c(1, NA, NaN, Inf, -Inf)
cat("Original vector:\n")

Original vector:

print(test_vec)

[1]    1   NA  NaN  Inf -Inf

cat("\nTesting for special values:\n")


Testing for special values:

cat("is.na():", is.na(test_vec), "\n")         # TRUE for NA and NaN

is.na(): FALSE TRUE TRUE FALSE FALSE

cat("is.nan():", is.nan(test_vec), "\n")       # TRUE only for NaN

is.nan(): FALSE FALSE TRUE FALSE FALSE

cat("is.infinite():", is.infinite(test_vec), "\n") # TRUE for ±Inf

is.infinite(): FALSE FALSE FALSE TRUE TRUE

cat("is.finite():", is.finite(test_vec), "\n")     # FALSE for NA, NaN, ±Inf

is.finite(): TRUE FALSE FALSE FALSE FALSE
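
Missing values also propagate through calculations, which is usually where they are first noticed. A short sketch with made-up readings:

```r
readings <- c(4, NA, 6)

# Most summaries return NA if any input is missing
mean(readings)                  # NA

# na.rm = TRUE drops missing values first
clean_mean <- mean(readings, na.rm = TRUE)   # 5

# Comparing with == yields NA, so use is.na() to find missing values
NA == NA                        # NA, not TRUE
missing_positions <- which(is.na(readings))  # 2
```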

5.2 Vector Creation Methods

5.2.1 Basic Creation Functions

# Method 1: c() function (combine/concatenate)
basic_vec <- c(1, 2, 3, 4, 5)

# Method 2: Colon operator (integer sequences)
seq_colon <- 1:10
reverse_seq <- 10:1

# Method 3: seq() function (more flexible sequences)
seq_by_step <- seq(from = 0, to = 100, by = 10)
seq_by_length <- seq(from = 0, to = 1, length.out = 11)
seq_along_other <- seq_along(c("a", "b", "c", "d"))

cat("Basic vector:", basic_vec, "\n")

Basic vector: 1 2 3 4 5

cat("Colon sequence:", seq_colon, "\n")

Colon sequence: 1 2 3 4 5 6 7 8 9 10

cat("Sequence by step:", seq_by_step, "\n")

Sequence by step: 0 10 20 30 40 50 60 70 80 90 100

cat("Sequence by length:", seq_by_length, "\n")

Sequence by length: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
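
When the end point of a sequence comes from data, the colon operator can misbehave for a length of zero. The sketch below, with a hypothetical n, contrasts it with seq_len().

```r
n <- 0

# The colon operator counts down when the end is below the start
1:n          # 1 0  (two elements, probably not intended)

# seq_len() returns an empty vector for n = 0
seq_len(n)   # integer(0)

# So a loop over seq_len(n) runs zero times, as expected
iterations <- 0
for (i in seq_len(n)) iterations <- iterations + 1
```

The same reasoning applies to seq_along(), which is the safer choice when looping over the positions of a possibly empty vector.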

5.2.2 Advanced Creation Patterns

# rep() function - repetition patterns
rep_times <- rep(c(1, 2, 3), times = 3)        # Repeat entire vector
rep_each <- rep(c(1, 2, 3), each = 3)          # Repeat each element
rep_length_out <- rep(c(1, 2), length.out = 7) # Repeat to specific length

cat("rep times:", rep_times, "\n")

rep times: 1 2 3 1 2 3 1 2 3

cat("rep each:", rep_each, "\n")

rep each: 1 1 1 2 2 2 3 3 3

cat("rep length.out:", rep_length_out, "\n")

rep length.out: 1 2 1 2 1 2 1

# Creating named vectors
named_vec <- c(first = 1, second = 2, third = 3)
print(named_vec)

 first second  third 
     1      2      3

# Alternative way to add names
numbers <- 1:5
names(numbers) <- c("one", "two", "three", "four", "five")
print(numbers)

  one   two three  four  five 
    1     2     3     4     5

# Using paste0() for systematic naming
systematic_names <- paste0("item_", 1:5)
cat("Systematic names:", systematic_names, "\n")

Systematic names: item_1 item_2 item_3 item_4 item_5

5.3 Vector Indexing and Subsetting

R vectors can be indexed in four ways: by positive integers, by negative integers, by logical vectors, and by name.

5.3.1 Positive Integer Indexing

# Create a sample vector with names
fruits <- c("apple", "banana", "cherry", "date", "elderberry", "fig")
names(fruits) <- paste0("fruit_", 1:6)
print(fruits)

     fruit_1      fruit_2      fruit_3      fruit_4      fruit_5      fruit_6 
     "apple"     "banana"     "cherry"       "date" "elderberry"        "fig"

# Single element access
cat("First fruit:", fruits[1], "\n")

First fruit: apple

cat("Third fruit:", fruits[3], "\n")

Third fruit: cherry

# Multiple specific elements
selected_fruits <- fruits[c(1, 3, 5)]
cat("Selected fruits:", selected_fruits, "\n")

Selected fruits: apple cherry elderberry

# Range selection (slicing)
fruit_range <- fruits[2:4]
cat("Fruit range (2-4):", fruit_range, "\n")

Fruit range (2-4): banana cherry date

# Using sequences for complex patterns
odd_positions <- fruits[seq(1, length(fruits), by = 2)]
cat("Odd positions:", odd_positions, "\n")

Odd positions: apple cherry elderberry

5.3.2 Negative Integer Indexing (Exclusion)

# Negative indexing excludes elements
all_but_first <- fruits[-1]
cat("All except first:", all_but_first, "\n")

All except first: banana cherry date elderberry fig

# Exclude multiple elements
exclude_multiple <- fruits[-c(2, 4, 6)]
cat("Excluding positions 2,4,6:", exclude_multiple, "\n")

Excluding positions 2,4,6: apple cherry elderberry

# Exclude ranges
exclude_range <- fruits[-(2:4)]
cat("Excluding range 2-4:", exclude_range, "\n")

Excluding range 2-4: apple elderberry fig

5.3.3 Logical Indexing (Conditional Selection)

# Create data for logical indexing examples
prices <- c(1.50, 3.20, 0.80, 4.50, 2.10, 6.00)
names(prices) <- fruits

# Simple logical conditions
expensive_items <- prices > 3.00
cat("Expensive items (>3.00):", prices[expensive_items], "\n")

Expensive items (>3.00): 3.2 4.5 6

# Multiple conditions with logical operators
moderate_prices <- prices >= 1.00 & prices <= 4.00
cat("Moderate prices (1-4):", prices[moderate_prices], "\n")

Moderate prices (1-4): 1.5 3.2 2.1

# Using %in% operator for membership testing
target_fruits <- c("apple", "cherry", "fig")
selected_by_name <- prices[names(prices) %in% target_fruits]
cat("Selected by name:", selected_by_name, "\n")

Selected by name: 1.5 0.8 6

# Complex logical expressions
complex_condition <- (prices > 2.00) | (nchar(names(prices)) > 10)
cat("Complex condition:", prices[complex_condition], "\n")

Complex condition: 3.2 4.5 2.1 6
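
Logical indexing also works on the left-hand side of an assignment, which is a common way to clean data. The sketch below assumes a hypothetical sentinel value of -999 marking invalid readings.

```r
temps <- c(21.5, -999, 19.0, -999, 23.2)   # -999 is a hypothetical sentinel

# Replace every element that matches the condition
temps[temps == -999] <- NA

# Count and summarise the remaining valid values
n_valid <- sum(!is.na(temps))            # 3
max_valid <- max(temps, na.rm = TRUE)    # 23.2
```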

5.3.4 Named Indexing

# prices is named by fruit (see names(prices) <- fruits above),
# so index it with the fruit names, not the "fruit_N" labels

# Access by single name
apple_price <- prices["apple"]
cat("Apple price:", apple_price, "\n")

Apple price: 1.5

# Access by multiple names
fruit_selection <- prices[c("apple", "cherry", "fig")]
cat("Selected fruit prices:", fruit_selection, "\n")

Selected fruit prices: 1.5 0.8 6

# Dynamic name creation: build the "fruit_N" labels, look up the
# matching fruit names, then use those names to index prices
dynamic_names <- fruits[paste0("fruit_", c(1, 3, 5))]
dynamic_selection <- prices[dynamic_names]
cat("Dynamic selection:", dynamic_selection, "\n")

Dynamic selection: 1.5 0.8 2.1
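
Indexing by a name that does not exist returns NA rather than raising an error, which can hide a mismatch between the names you expect and the names a vector actually has. A small sketch with made-up data:

```r
stock <- c(apple = 12, banana = 0, cherry = 7)   # illustrative data

# A name that is not present yields NA instead of an error
stock["mango"]                      # NA

# Check membership first when a missing name should be handled explicitly
wanted <- c("apple", "mango")
known <- wanted[wanted %in% names(stock)]
stock[known]                        # apple = 12

# [[ ]] is stricter: it raises an error for an unknown name
result <- tryCatch(stock[["mango"]], error = function(e) "error")
```

Checking names(x) is a quick first step whenever named indexing returns unexpected NA values.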

5.4 Vector Operations and Arithmetic

5.4.1 Scalar Operations (Broadcasting)

# Create sample data
values <- c(10, 20, 30, 40, 50)
scalar <- 5

# Arithmetic operations with scalars
cat("Original values:", values, "\n")

Original values: 10 20 30 40 50

cat("Add 5:", values + scalar, "\n")

Add 5: 15 25 35 45 55

cat("Multiply by 5:", values * scalar, "\n")

Multiply by 5: 50 100 150 200 250

cat("Divide by 5:", values / scalar, "\n")

Divide by 5: 2 4 6 8 10

cat("Power of 2:", values^2, "\n")

Power of 2: 100 400 900 1600 2500

cat("Modulus 3:", values %% 3, "\n")

Modulus 3: 1 2 0 1 2

cat("Integer division by 3:", values %/% 3, "\n")

Integer division by 3: 3 6 10 13 16

5.4.2 Vector-to-Vector Operations

# Element-wise operations between vectors of same length
vec1 <- c(1, 2, 3, 4, 5)
vec2 <- c(10, 20, 30, 40, 50)

cat("Vector 1:", vec1, "\n")

Vector 1: 1 2 3 4 5

cat("Vector 2:", vec2, "\n")

Vector 2: 10 20 30 40 50

cat("Addition:", vec1 + vec2, "\n")

Addition: 11 22 33 44 55

cat("Multiplication:", vec1 * vec2, "\n")

Multiplication: 10 40 90 160 250

cat("Division:", vec2 / vec1, "\n")

Division: 10 10 10 10 10

# Vector recycling with different lengths
short_vec <- c(1, 2)
long_vec <- c(10, 20, 30, 40, 50, 60)

cat("\nShort vector:", short_vec, "\n")


Short vector: 1 2

cat("Long vector:", long_vec, "\n")

Long vector: 10 20 30 40 50 60

cat("Addition (with recycling):", short_vec + long_vec, "\n")

Addition (with recycling): 11 22 31 42 51 62

Vector Recycling

When operating on vectors of different lengths, R recycles the shorter vector to match the longer one. This is powerful but can lead to unexpected results if not understood properly. R issues a warning only when the longer length is not a multiple of the shorter one.
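
A short sketch of recycling when the lengths do and do not divide evenly. The values are illustrative, and the handler only records the warning so the example can report it.

```r
# Lengths 2 and 6: recycled silently because 6 is a multiple of 2
c(1, 2) + c(10, 20, 30, 40, 50, 60)

# Lengths 2 and 3: still recycled, but R issues a warning
warned <- FALSE
partial <- withCallingHandlers(
  c(10, 20, 30) + c(1, 2),
  warning = function(w) {
    warned <<- TRUE               # record that a warning occurred
    invokeRestart("muffleWarning")
  }
)
print(partial)   # 11 22 31
print(warned)    # TRUE
```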

5.4.3 Mathematical Functions

# Mathematical functions work element-wise on vectors
x <- c(1, 4, 9, 16, 25)

cat("Original values:", x, "\n")

Original values: 1 4 9 16 25

cat("Square root:", sqrt(x), "\n")

Square root: 1 2 3 4 5

cat("Natural log:", log(x), "\n")

Natural log: 0 1.386294 2.197225 2.772589 3.218876

cat("Log base 10:", log10(x), "\n")

Log base 10: 0 0.60206 0.9542425 1.20412 1.39794

cat("Exponential:", exp(c(0, 1, 2)), "\n")

Exponential: 1 2.718282 7.389056

# Trigonometric functions
angles <- c(0, pi/4, pi/2, pi)
cat("Angles (radians):", round(angles, 3), "\n")

Angles (radians): 0 0.785 1.571 3.142

cat("Sine:", round(sin(angles), 3), "\n")

Sine: 0 0.707 1 0

cat("Cosine:", round(cos(angles), 3), "\n")

Cosine: 1 0.707 0 -1

# Rounding functions
messy_numbers <- c(3.14159, 2.71828, 1.41421)
cat("Original:", messy_numbers, "\n")

Original: 3.14159 2.71828 1.41421

cat("Round to 2 places:", round(messy_numbers, 2), "\n")

Round to 2 places: 3.14 2.72 1.41

cat("Ceiling:", ceiling(messy_numbers), "\n")

Ceiling: 4 3 2

cat("Floor:", floor(messy_numbers), "\n")

Floor: 3 2 1

cat("Truncate:", trunc(messy_numbers), "\n")

Truncate: 3 2 1
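
With positive inputs, floor() and trunc() give the same answers, so the difference between the rounding functions only shows up with negative numbers and exact halves. A small sketch using illustrative values:

```r
halves <- c(-1.5, -0.5, 0.5, 1.5, 2.5)

floor(halves)     # -2 -1  0  1  2  (toward negative infinity)
ceiling(halves)   # -1  0  1  2  3  (toward positive infinity)
trunc(halves)     # -1  0  0  1  2  (toward zero)

# round() sends exact halves to the nearest even integer
round(halves)     # -2  0  0  2  2

# signif() keeps significant digits rather than decimal places
signif(123456, 2) # 120000
```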

5.5 Set Operations and Comparisons

5.5.1 Set Operations

# Create sets for demonstration
set_a <- c("apple", "banana", "cherry", "date")
set_b <- c("cherry", "date", "elderberry", "fig")

cat("Set A:", set_a, "\n")

Set A: apple banana cherry date

cat("Set B:", set_b, "\n")

Set B: cherry date elderberry fig

# Core set operations
union_result <- union(set_a, set_b)
cat("Union (A ∪ B):", union_result, "\n")

Union (A ∪ B): apple banana cherry date elderberry fig

intersection_result <- intersect(set_a, set_b)
cat("Intersection (A ∩ B):", intersection_result, "\n")

Intersection (A ∩ B): cherry date

difference_result <- setdiff(set_a, set_b)
cat("Difference (A - B):", difference_result, "\n")

Difference (A - B): apple banana

difference_reverse <- setdiff(set_b, set_a)
cat("Difference (B - A):", difference_reverse, "\n")

Difference (B - A): elderberry fig

# Set equality and membership
cat("Are sets equal?", setequal(set_a, set_b), "\n")

Are sets equal? FALSE

cat("Is 'apple' in set A?", "apple" %in% set_a, "\n")

Is 'apple' in set A? TRUE

cat("Is 'grape' in set A?", "grape" %in% set_a, "\n")

Is 'grape' in set A? FALSE
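
The set functions treat their inputs as sets, so repeated values are discarded. A brief sketch with made-up baskets:

```r
basket_1 <- c("apple", "apple", "banana")
basket_2 <- c("banana", "cherry", "cherry")

union(basket_1, basket_2)       # "apple" "banana" "cherry"
intersect(basket_1, basket_2)   # "banana"

# setequal() ignores both order and repetition
setequal(c("a", "b", "b"), c("b", "a"))   # TRUE

# Use table() when the counts matter
table(basket_1)
```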

5.5.2 Advanced Comparison Operations

# Comparison vectors
scores_1 <- c(85, 92, 78, 96, 88)
scores_2 <- c(88, 90, 78, 94, 91)

cat("Scores 1:", scores_1, "\n")

Scores 1: 85 92 78 96 88

cat("Scores 2:", scores_2, "\n")

Scores 2: 88 90 78 94 91

# Element-wise comparisons
cat("Equal elements:", scores_1 == scores_2, "\n")

Equal elements: FALSE FALSE TRUE FALSE FALSE

cat("Score 1 greater:", scores_1 > scores_2, "\n")

Score 1 greater: FALSE TRUE FALSE TRUE FALSE

cat("Score difference >= 5:", abs(scores_1 - scores_2) >= 5, "\n")

Score difference >= 5: FALSE FALSE FALSE FALSE FALSE

# Aggregated logical operations
cat("Any score1 > score2? ", any(scores_1 > scores_2), "\n")

Any score1 > score2?  TRUE

cat("All scores > 75?", all(scores_1 > 75), "\n")

All scores > 75? TRUE

cat("How many scores1 > 85?", sum(scores_1 > 85), "\n")

How many scores1 > 85? 3

# Finding positions where conditions are true
high_scores <- which(scores_1 > 90)
cat("Positions with scores > 90:", high_scores, "\n")

Positions with scores > 90: 2 4

# Finding maximum and minimum positions
max_position <- which.max(scores_1)
min_position <- which.min(scores_1)
cat("Highest score position:", max_position, "value:", scores_1[max_position], "\n")

Highest score position: 4 value: 96

cat("Lowest score position:", min_position, "value:", scores_1[min_position], "\n")

Lowest score position: 3 value: 78
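
The == comparisons above are safe for whole-number scores, but doubles produced by arithmetic may differ in the last bits. The following sketch shows the usual workaround; the tolerance of 1e-8 is an arbitrary choice for illustration.

```r
total <- 0.1 + 0.2

total == 0.3                     # FALSE: binary floating point is inexact
isTRUE(all.equal(total, 0.3))    # TRUE: equal within a small tolerance

# An element-wise version with an explicit tolerance (value chosen for illustration)
measured <- c(0.3, 0.1 + 0.2, 0.31)
close_enough <- abs(measured - 0.3) < 1e-8
print(close_enough)              # TRUE TRUE FALSE
```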

6 Factors: Categorical Data Excellence

6.1 Understanding Factors

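Factors are R’s way of handling categorical data. They are stored as integer codes with an associated set of labels (levels), which gives them a fixed set of allowed values and makes them directly usable in statistical models. For long vectors with few distinct values they can also be compact, although for a short vector like the one below the attributes make the factor larger than the character vector.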

6.1.1 Basic Factor Creation

# Create a simple factor from character data
colors <- c("red", "blue", "green", "red", "blue", "green", "red")
color_factor <- factor(colors)

print(color_factor)

[1] red   blue  green red   blue  green red  
Levels: blue green red

cat("Levels:", levels(color_factor), "\n")

Levels: blue green red

cat("Number of levels:", nlevels(color_factor), "\n")

Number of levels: 3

# Internal representation
cat("Internal structure (integer codes):", as.integer(color_factor), "\n")

Internal structure (integer codes): 3 1 2 3 1 2 3

cat("Storage type:", typeof(color_factor), "\n")

Storage type: integer

cat("Memory usage comparison:\n")

Memory usage comparison:

cat("  Character:", object.size(colors), "bytes\n")

  Character: 280 bytes

cat("  Factor:", object.size(color_factor), "bytes\n")

  Factor: 664 bytes
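
Because a factor stores integer codes, converting a factor of number-like labels with as.integer() returns the codes rather than the numbers. A minimal sketch with hypothetical dose values:

```r
# Numbers that arrived as text and were turned into a factor
doses <- factor(c("10", "50", "10", "100"))

levels(doses)                        # "10" "100" "50" (sorted as text)
as.integer(doses)                    # 1 3 1 2  (codes, not the doses)

# Convert through character to recover the numeric values
dose_values <- as.numeric(as.character(doses))
print(dose_values)                   # 10 50 10 100
```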

6.1.2 Controlling Factor Levels

# Explicitly set levels and their order
sizes <- c("small", "large", "medium", "small", "medium", "large")

# Without specifying levels (alphabetical order)
size_factor_auto <- factor(sizes)
cat("Automatic levels:", levels(size_factor_auto), "\n")

Automatic levels: large medium small

# With explicitly ordered levels
size_factor_ordered <- factor(sizes, 
                             levels = c("small", "medium", "large"))
cat("Custom levels:", levels(size_factor_ordered), "\n")

Custom levels: small medium large

# Ordered factors (with ranking)
size_factor_ranked <- factor(sizes, 
                           levels = c("small", "medium", "large"),
                           ordered = TRUE)
# Note: cat() prints the underlying integer codes, not the labels
cat("Ordered factor:", size_factor_ranked, "\n")

Ordered factor: 1 3 2 1 2 3

cat("Is ordered? ", is.ordered(size_factor_ranked), "\n")

Is ordered?  TRUE

# Comparisons work with ordered factors
cat("small < medium? ", size_factor_ranked[1] < size_factor_ranked[3], "\n")

small < medium?  TRUE
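
Ordered factors can also be compared against a level name and summarised with range functions. A short sketch using an illustrative vector of shirt sizes:

```r
shirt_sizes <- factor(
  c("small", "large", "medium", "small", "large"),
  levels = c("small", "medium", "large"),
  ordered = TRUE
)

# Comparison against a level name respects the declared order
at_least_medium <- shirt_sizes >= "medium"
print(at_least_medium)           # FALSE TRUE TRUE FALSE TRUE

# Range functions are defined for ordered factors
as.character(max(shirt_sizes))   # "large"

# table() reports counts in level order
table(shirt_sizes)
```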

6.1.3 Factor Level Manipulation

# Create a factor for manipulation
ratings <- factor(c("good", "bad", "excellent", "good", "fair", "bad"))
print(ratings)

[1] good      bad       excellent good      fair      bad      
Levels: bad excellent fair good

# View current levels
cat("Original levels:", levels(ratings), "\n")

Original levels: bad excellent fair good

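# Reorder levels logically
# (use factor(); assigning to levels() would rename the levels and change the data)
ratings <- factor(ratings, levels = c("bad", "fair", "good", "excellent"))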
cat("Reordered levels:", levels(ratings), "\n")

Reordered levels: bad fair good excellent

# Add new levels (useful before adding new data)
levels(ratings) <- c(levels(ratings), "outstanding")
cat("After adding level:", levels(ratings), "\n")

After adding level: bad fair good excellent outstanding

# Remove unused levels
ratings_subset <- ratings[ratings != "bad"]
cat("Before droplevels:", levels(ratings_subset), "\n")

Before droplevels: bad fair good excellent outstanding

ratings_clean <- droplevels(ratings_subset)
cat("After droplevels:", levels(ratings_clean), "\n")

After droplevels: fair good excellent
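
The difference between reordering and relabelling levels is easy to miss, because both leave the same vector of levels behind. The sketch below uses a small made-up factor to show what happens to the data in each case.

```r
grades <- factor(c("low", "high", "mid", "low"))
levels(grades)                       # "high" "low" "mid"

# Reordering: the values stay the same, only the level order changes
reordered <- factor(grades, levels = c("low", "mid", "high"))
as.character(reordered)              # "low" "high" "mid" "low"

# Assigning to levels() renames in place: "high" becomes "low", and so on
relabelled <- grades
levels(relabelled) <- c("low", "mid", "high")
as.character(relabelled)             # "mid" "low" "high" "mid"
```

Appending a new level with levels(x) <- c(levels(x), "new") is safe, since the existing labels keep their positions.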

6.1.4 Advanced Factor Operations with forcats

library(forcats)

# Create sample survey data
responses <- factor(c("Agree", "Disagree", "Strongly Agree", "Neutral", 
                     "Agree", "Strongly Disagree", "Agree", "Neutral",
                     "Disagree", "Strongly Agree"))

# Reorder by frequency
freq_ordered <- fct_infreq(responses)
cat("By frequency:\n")

By frequency:

print(table(freq_ordered))

freq_ordered
            Agree          Disagree           Neutral    Strongly Agree 
                3                 2                 2                 2 
Strongly Disagree 
                1

# Reorder by another variable
scores <- c(4, 2, 5, 3, 4, 1, 4, 3, 2, 5)
score_ordered <- fct_reorder(responses, scores, mean)
cat("Ordered by mean score:\n")

Ordered by mean score:

print(tapply(scores, score_ordered, mean))

Strongly Disagree          Disagree           Neutral             Agree 
                1                 2                 3                 4 
   Strongly Agree 
                5

# Collapse rare levels
collapsed <- fct_lump_min(responses, min = 2)
cat("After collapsing rare levels:\n")

After collapsing rare levels:

print(table(collapsed))

collapsed
         Agree       Disagree        Neutral Strongly Agree          Other 
             3              2              2              2              1

# Recode levels
recoded <- fct_recode(responses,
  "Positive" = "Agree",
  "Very Positive" = "Strongly Agree", 
  "Negative" = "Disagree",
  "Very Negative" = "Strongly Disagree"
)
print(table(recoded))

recoded
     Positive      Negative       Neutral Very Positive Very Negative 
            3             2             2             2             1

7 Lists: The Swiss Army Knife

7.1 Understanding Lists

Lists are recursive data structures - they can contain any R object, including other lists. This makes them incredibly flexible for complex data structures.

7.1.1 Basic List Creation and Structure

# Create a heterogeneous list
student_record <- list(
  name = "Alice Johnson",
  student_id = 12345,
  grades = c(85, 92, 78, 96, 88),
  courses = c("Math", "Science", "English", "History", "Art"),
  is_honors = TRUE,
  graduation_date = as.Date("2024-06-15")
)

print(student_record)

$name
[1] "Alice Johnson"

$student_id
[1] 12345

$grades
[1] 85 92 78 96 88

$courses
[1] "Math"    "Science" "English" "History" "Art"    

$is_honors
[1] TRUE

$graduation_date
[1] "2024-06-15"

# Examine list structure
cat("\nList structure:\n")


List structure:

str(student_record)

List of 6
 $ name           : chr "Alice Johnson"
 $ student_id     : num 12345
 $ grades         : num [1:5] 85 92 78 96 88
 $ courses        : chr [1:5] "Math" "Science" "English" "History" ...
 $ is_honors      : logi TRUE
 $ graduation_date: Date[1:1], format: "2024-06-15"

# List properties
cat("Length (number of elements):", length(student_record), "\n")

Length (number of elements): 6

cat("Names of elements:", names(student_record), "\n")

Names of elements: name student_id grades courses is_honors graduation_date

cat("Is it a list?", is.list(student_record), "\n")

Is it a list? TRUE

7.1.2 List Indexing Methods

Understanding the difference between [, [[, and $ is crucial for list manipulation.

# Method 1: Single bracket [ ] - returns a list
subset_list <- student_record[1]  # Returns list containing first element
cat("Using [1] - class:", class(subset_list), "\n")

Using [1] - class: list

print(subset_list)

$name
[1] "Alice Johnson"

# Multiple elements with single bracket
multiple_elements <- student_record[c("name", "student_id")]
print(multiple_elements)

$name
[1] "Alice Johnson"

$student_id
[1] 12345

# Method 2: Double bracket [[ ]] - returns the actual element
actual_name <- student_record[[1]]  # Returns the character vector
cat("Using [[1]] - class:", class(actual_name), "\n")

Using [[1]] - class: character

print(actual_name)

[1] "Alice Johnson"

# Access by name with double bracket
grades <- student_record[["grades"]]
cat("Grades class:", class(grades), "\n")

Grades class: numeric

print(grades)

[1] 85 92 78 96 88

# Method 3: Dollar sign $ - convenient for named elements
student_name <- student_record$name
cat("Using $ - class:", class(student_name), "\n")

Using $ - class: character

print(student_name)

[1] "Alice Johnson"

List Indexing Memory Aid

my_list[1] → Returns a list (like a box containing the item)
my_list[[1]] → Returns the actual item (takes item out of box)
my_list$name → Convenient shortcut for my_list[["name"]]
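
There is one behavioural difference between $ and [[ ]] worth knowing: $ matches name prefixes, while [[ ]] requires the exact name by default. A minimal sketch with an illustrative settings list:

```r
settings <- list(threshold = 0.5, verbose = TRUE)   # illustrative list

# $ accepts an unambiguous prefix of a name
settings$thresh            # 0.5

# [[ ]] requires the exact name and returns NULL otherwise
settings[["thresh"]]       # NULL
settings[["threshold"]]    # 0.5

# [[ ]] also works with a name stored in a variable; $ does not
key <- "verbose"
settings[[key]]            # TRUE
```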

7.1.3 List Modification and Manipulation

student_record <- list(
  name = "Alice Johnson",
  student_id = 12345,
  grades = c(85, 92, 78, 96, 88),
  courses = c("Math", "Science", "English", "History", "Art"),
  is_honors = TRUE,
  graduation_date = as.Date("2024-06-15")
)
# Start with a copy for modification
student_copy <- student_record

# Add new elements
student_copy$gpa <- mean(student_copy$grades) / 100 * 4  # Simple GPA calculation
student_copy$advisor <- "Dr. Smith"
student_copy$extracurricular <- c("Drama Club", "Science Olympics")

# Modify existing elements
student_copy$grades[1] <- 90  # Change first grade
student_copy$is_honors <- FALSE

# Remove elements (set to NULL)
student_copy$advisor <- NULL

cat("Modified list structure:\n")

Modified list structure:

str(student_copy)

List of 8
 $ name           : chr "Alice Johnson"
 $ student_id     : num 12345
 $ grades         : num [1:5] 90 92 78 96 88
 $ courses        : chr [1:5] "Math" "Science" "English" "History" ...
 $ is_honors      : logi FALSE
 $ graduation_date: Date[1:1], format: "2024-06-15"
 $ gpa            : num 3.51
 $ extracurricular: chr [1:2] "Drama Club" "Science Olympics"

# Adding nested structure
student_copy$contact_info <- list(
  email = "alice.johnson@school.edu",
  phone = "555-0123",
  address = list(
    street = "123 Main St",
    city = "Academic City",
    zip = "12345"
  )
)

cat("After adding nested structure:\n")

After adding nested structure:

str(student_copy, max.level = 2)  # Limit depth for readability

List of 9
 $ name           : chr "Alice Johnson"
 $ student_id     : num 12345
 $ grades         : num [1:5] 90 92 78 96 88
 $ courses        : chr [1:5] "Math" "Science" "English" "History" ...
 $ is_honors      : logi FALSE
 $ graduation_date: Date[1:1], format: "2024-06-15"
 $ gpa            : num 3.51
 $ extracurricular: chr [1:2] "Drama Club" "Science Olympics"
 $ contact_info   :List of 3
  ..$ email  : chr "alice.johnson@school.edu"
  ..$ phone  : chr "555-0123"
  ..$ address:List of 3
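
Once a list is nested, elements are reached by chaining the indexing operators. The sketch below uses a small stand-alone list (the names are illustrative) and also shows how removing an element differs from storing NULL.

```r
profile <- list(
  name = "Alice",
  contact = list(
    phone = "555-0123",
    address = list(city = "Academic City", zip = "12345")
  )
)

# Chain $ (or [[ ]]) to walk down the levels
profile$contact$address$city                 # "Academic City"
profile[["contact"]][["address"]][["zip"]]   # "12345"

# A character vector inside [[ ]] indexes recursively
profile[[c("contact", "address", "city")]]   # "Academic City"

# Assigning NULL removes an element; wrap it in list() to store NULL itself
profile$contact$phone <- NULL
profile["nickname"] <- list(NULL)
names(profile)                               # "name" "contact" "nickname"
```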

7.1.4 List Apply Functions

The apply family of functions is essential for working with lists efficiently.

# Create a list of numeric vectors for demonstration
data_list <- list(
  dataset_1 = c(23, 45, 67, 89, 12),
  dataset_2 = c(34, 56, 78, 90, 23, 45),
  dataset_3 = c(12, 34, 56, 78, 90, 12, 34)
)

# lapply - returns a list
means_list <- lapply(data_list, mean)
cat("lapply result (list):\n")

lapply result (list):

print(means_list)

$dataset_1
[1] 47.2

$dataset_2
[1] 54.33333

$dataset_3
[1] 45.14286

# sapply - simplifies to vector when possible
means_vector <- sapply(data_list, mean)
cat("sapply result (vector):", means_vector, "\n")

sapply result (vector): 47.2 54.33333 45.14286

# mapply - multivariate apply (pairs each dataset with its weight)
weights <- c(0.3, 0.5, 0.2)
weighted_sums <- mapply(function(x, w) sum(x * w), data_list, weights)
cat("Weighted sums:", weighted_sums, "\n")

Weighted sums: 70.8 163 63.2

# vapply - like sapply but with type checking (safer)
lengths_safe <- vapply(data_list, length, integer(1))
cat("Lengths (vapply):", lengths_safe, "\n")

Lengths (vapply): 5 6 7
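
The type checking in vapply() matters most at the edges, for example with empty input or results of unexpected length. A short sketch with made-up inputs:

```r
empty_input <- list()

# sapply() cannot simplify an empty result and returns a list
class(sapply(empty_input, length))                 # "list"

# vapply() returns the declared type even for empty input
class(vapply(empty_input, length, integer(1)))     # "integer"

# vapply() stops with an error when a result has the wrong shape
ragged <- list(a = 1:2, b = 1:3)
outcome <- tryCatch(
  vapply(ragged, range, numeric(1)),   # range() returns two values, not one
  error = function(e) "error"
)
print(outcome)                                     # "error"
```

For code that runs unattended, failing early in this way is usually preferable to receiving a result of an unexpected type.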

7.1.5 Advanced List Operations with purrr

library(purrr)

# Modern functional programming with purrr
# map family - consistent and predictable
data_list <- list(
  dataset_1 = c(23, 45, 67, 89, 12),
  dataset_2 = c(34, 56, 78, 90, 23, 45),
  dataset_3 = c(12, 34, 56, 78, 90, 12, 34)
)
# map returns a list
summary_stats <- map(data_list, summary)
print(summary_stats[[1]])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    23.0    45.0    47.2    67.0    89.0

# map_dbl returns numeric vector
means_purrr <- map_dbl(data_list, mean)
cat("Means with map_dbl:", means_purrr, "\n")

Means with map_dbl: 47.2 54.33333 45.14286

# map_chr returns character vector  
length_descriptions <- map_chr(data_list, ~ paste("Length:", length(.x)))
cat("Descriptions:", length_descriptions, "\n")

Descriptions: Length: 5 Length: 6 Length: 7

# map2 for iterating over two inputs in parallel (here a list and a vector)
multipliers <- c(2, 3, 4)
scaled_data <- map2(data_list, multipliers, ~ .x * .y)
cat("First scaled dataset:", scaled_data[[1]], "\n")

First scaled dataset: 46 90 134 178 24

# Complex transformations with map
complex_stats <- data_list |>
  map(~ list(
    mean = mean(.x),
    sd = sd(.x),
    min = min(.x),
    max = max(.x),
    n = length(.x)
  ))

print(complex_stats$dataset_1)

$mean
[1] 47.2

$sd
[1] 31.49921

$min
[1] 12

$max
[1] 89

$n
[1] 5

8 Matrices: Linear Algebra Powerhouse

8.1 Understanding Matrices

Matrices are 2-dimensional, homogeneous data structures. They’re essentially vectors with a dimension attribute, making them perfect for mathematical operations.

8.1.1 Matrix Creation Methods

# Method 1: matrix() function
basic_matrix <- matrix(1:12, nrow = 3, ncol = 4)
cat("Basic matrix (filled by columns):\n")

Basic matrix (filled by columns):

print(basic_matrix)

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

# Fill by rows instead
by_rows <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
cat("Matrix filled by rows:\n")

Matrix filled by rows:

print(by_rows)

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

# Method 2: Combining vectors
row_1 <- c(1, 2, 3, 4)
row_2 <- c(5, 6, 7, 8)
row_3 <- c(9, 10, 11, 12)

# Row binding
matrix_rbind <- rbind(row_1, row_2, row_3)
cat("Matrix from row binding:\n")

Matrix from row binding:

print(matrix_rbind)

      [,1] [,2] [,3] [,4]
row_1    1    2    3    4
row_2    5    6    7    8
row_3    9   10   11   12

# Column binding
col_matrix <- cbind(c(1, 5, 9), c(2, 6, 10), c(3, 7, 11), c(4, 8, 12))
cat("Matrix from column binding:\n")

Matrix from column binding:

print(col_matrix)

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

# Method 3: Adding dimensions to a vector
vector_to_matrix <- 1:12
dim(vector_to_matrix) <- c(3, 4)
cat("Vector converted to matrix:\n")

Vector converted to matrix:

print(vector_to_matrix)

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
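
Method 3 works because a matrix is a vector with a dim attribute. The following sketch, with an illustrative variable name, inspects that attribute and shows how single-number indexing follows the column-wise storage order.

```r
cells <- matrix(1:12, nrow = 3, ncol = 4)

# The only thing that distinguishes it from a plain vector is dim
attributes(cells)           # $dim: 3 4

# Elements are stored column by column, so a single index counts down columns
cells[5]                    # 5
cells[2, 2]                 # 5 (row 2, column 2 is the fifth stored element)

# Removing dim gives back the underlying vector
flat <- cells
dim(flat) <- NULL
is.matrix(flat)             # FALSE
```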

8.1.2 Matrix Properties and Attributes

# Create a named matrix for demonstration
sales_matrix <- matrix(
  c(100, 150, 120, 200, 
    110, 160, 130, 210,
    105, 155, 125, 195),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(
    quarters = c("Q1", "Q2", "Q3"),
    products = c("Product_A", "Product_B", "Product_C", "Product_D")
  )
)

print(sales_matrix)

        products
quarters Product_A Product_B Product_C Product_D
      Q1       100       150       120       200
      Q2       110       160       130       210
      Q3       105       155       125       195

# Matrix properties
cat("Dimensions:", dim(sales_matrix), "\n")

Dimensions: 3 4

cat("Number of rows:", nrow(sales_matrix), "\n")

Number of rows: 3

cat("Number of columns:", ncol(sales_matrix), "\n")

Number of columns: 4

cat("Total elements:", length(sales_matrix), "\n")

Total elements: 12

# Dimension names
cat("Row names:", rownames(sales_matrix), "\n")

Row names: Q1 Q2 Q3

cat("Column names:", colnames(sales_matrix), "\n")

Column names: Product_A Product_B Product_C Product_D

# Matrix type information
cat("Class:", class(sales_matrix), "\n")

Class: matrix array

cat("Type:", typeof(sales_matrix), "\n")

Type: double

cat("Is matrix?", is.matrix(sales_matrix), "\n")

Is matrix? TRUE

cat("Is array?", is.array(sales_matrix), "\n")  # Matrices are also arrays

Is array? TRUE

8.1.3 Matrix Indexing and Subsetting

# Single element access
cat("Element [2,3]:", sales_matrix[2, 3], "\n")

Element [2,3]: 130

cat("Q2 Product_C:", sales_matrix["Q2", "Product_C"], "\n")

Q2 Product_C: 130

# Row access (returns vector)
q1_sales <- sales_matrix[1, ]
cat("Q1 sales:", q1_sales, "\n")

Q1 sales: 100 150 120 200

cat("Class of row:", class(q1_sales), "\n")  # Note: becomes vector

Class of row: numeric

# Column access
product_a_sales <- sales_matrix[, 1]
cat("Product A sales:", product_a_sales, "\n")

Product A sales: 100 110 105

# Multiple rows/columns (returns matrix)
multi_quarters <- sales_matrix[1:2, ]
cat("Q1-Q2 sales:\n")

Q1-Q2 sales:

print(multi_quarters)

        products
quarters Product_A Product_B Product_C Product_D
      Q1       100       150       120       200
      Q2       110       160       130       210

multi_products <- sales_matrix[, 2:4]
cat("Products B-D:\n")

Products B-D:

print(multi_products)

        products
quarters Product_B Product_C Product_D
      Q1       150       120       200
      Q2       160       130       210
      Q3       155       125       195

# Submatrix
submatrix <- sales_matrix[1:2, 2:3]
cat("Submatrix Q1-Q2, Products B-C:\n")

Submatrix Q1-Q2, Products B-C:

print(submatrix)

        products
quarters Product_B Product_C
      Q1       150       120
      Q2       160       130

# Logical indexing
high_sales <- sales_matrix > 150
cat("High sales (>150):\n")

High sales (>150):

print(sales_matrix[high_sales])  # Returns as vector

[1] 160 155 200 210 195
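
Selecting a single row or column drops the result to a vector, which can break code that expects a matrix. The sketch below, using a small made-up matrix, shows drop = FALSE and a way to recover the positions of matching cells.

```r
scores <- matrix(1:6, nrow = 2,
                 dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Selecting one row drops the dimensions by default
is.matrix(scores[1, ])                  # FALSE

# drop = FALSE keeps a 1 x 3 matrix
one_row <- scores[1, , drop = FALSE]
dim(one_row)                            # 1 3

# which(arr.ind = TRUE) reports row/column positions of matching cells
hits <- which(scores > 4, arr.ind = TRUE)
print(hits)
```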

8.1.4 Matrix Arithmetic Operations

# Create matrices for operations
A <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2)

cat("Matrix A:\n"); print(A)

Matrix A:

     [,1] [,2]
[1,]    1    3
[2,]    2    4

cat("Matrix B:\n"); print(B)

Matrix B:

     [,1] [,2]
[1,]    5    7
[2,]    6    8

# Element-wise operations
cat("A + B (element-wise addition):\n")

A + B (element-wise addition):

print(A + B)

     [,1] [,2]
[1,]    6   10
[2,]    8   12

cat("A * B (element-wise multiplication):\n")

A * B (element-wise multiplication):

print(A * B)

     [,1] [,2]
[1,]    5   21
[2,]   12   32

# Matrix multiplication (linear algebra)
cat("A %*% B (matrix multiplication):\n")

A %*% B (matrix multiplication):

print(A %*% B)

     [,1] [,2]
[1,]   23   31
[2,]   34   46

# Scalar operations
cat("A * 3 (scalar multiplication):\n")

A * 3 (scalar multiplication):

print(A * 3)

     [,1] [,2]
[1,]    3    9
[2,]    6   12

# Element-wise powers (^ squares each entry; the matrix square is A %*% A)
cat("A^2 (element-wise squaring):\n")

A^2 (element-wise squaring):

print(A^2)

     [,1] [,2]
[1,]    1    9
[2,]    4   16
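
Because `*` and `^` work element by element, they are easy to confuse with their linear algebra counterparts. The short sketch below rebuilds the same `A` as above and contrasts the two; the variable names `elementwise_sq` and `matrix_sq` are chosen here for illustration:

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)   # same values as A above

elementwise_sq <- A^2       # squares each entry individually
matrix_sq <- A %*% A        # matrix product of A with itself

print(elementwise_sq)
print(matrix_sq)

# crossprod(A) is a shortcut for t(A) %*% A
all.equal(crossprod(A), t(A) %*% A)
```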

8.1.5 Advanced Matrix Operations

# Create a square matrix for advanced operations
square_matrix <- matrix(c(4, 7, 2, 6), nrow = 2, ncol = 2)
cat("Square matrix:\n"); print(square_matrix)

Square matrix:

     [,1] [,2]
[1,]    4    2
[2,]    7    6

# Linear algebra operations
cat("Transpose:\n"); print(t(square_matrix))

Transpose:

     [,1] [,2]
[1,]    4    7
[2,]    2    6

cat("Determinant:", det(square_matrix), "\n")

Determinant: 10

# Matrix inverse
inverse_matrix <- solve(square_matrix)
cat("Inverse matrix:\n"); print(inverse_matrix)

Inverse matrix:

     [,1] [,2]
[1,]  0.6 -0.2
[2,] -0.7  0.4

# Verify inverse: A %*% A^(-1) = I (matrix product, not element-wise *)
identity_check <- square_matrix %*% inverse_matrix
cat("A %*% A^(-1) (should be identity):\n")

A %*% A^(-1) (should be identity):

print(round(identity_check, 10))  # Round to handle floating point errors

     [,1] [,2]
[1,]    1    0
[2,]    0    1

# Eigenvalues and eigenvectors
eigen_result <- eigen(square_matrix)
cat("Eigenvalues:", eigen_result$values, "\n")

Eigenvalues: 8.872983 1.127017

cat("Eigenvectors:\n"); print(eigen_result$vectors)

Eigenvectors:

           [,1]       [,2]
[1,] -0.3796908 -0.5713345
[2,] -0.9251135  0.8207173

# Matrix decomposition
qr_decomp <- qr(square_matrix)
cat("QR decomposition Q:\n"); print(qr.Q(qr_decomp))

QR decomposition Q:

           [,1]       [,2]
[1,] -0.4961389 -0.8682431
[2,] -0.8682431  0.4961389

cat("QR decomposition R:\n"); print(qr.R(qr_decomp))

QR decomposition R:

          [,1]      [,2]
[1,] -8.062258 -6.201737
[2,]  0.000000  1.240347
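
When the inverse is only needed to solve a linear system, `solve(A, b)` can be called with a right-hand side directly. The sketch below assumes a hypothetical right-hand side `b` and also checks that the QR factors multiply back to the original matrix:

```r
A <- matrix(c(4, 7, 2, 6), nrow = 2)   # same values as square_matrix above
b <- c(2, 1)                           # hypothetical right-hand side

x_direct <- solve(A, b)          # solves A x = b without forming the inverse
x_via_inv <- solve(A) %*% b      # same solution through the explicit inverse

all.equal(x_direct, as.vector(x_via_inv))

# The QR factors should reproduce A (up to floating point error)
qr_A <- qr(A)
all.equal(qr.Q(qr_A) %*% qr.R(qr_A), A)
```

Solving directly is generally preferred over computing an explicit inverse, since it does less work and tends to be numerically more stable.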

8.1.6 Matrix Apply Operations

# Create a larger matrix for demonstration
test_matrix <- matrix(rnorm(20, mean = 10, sd = 3), nrow = 4, ncol = 5)
rownames(test_matrix) <- paste("Row", 1:4)
colnames(test_matrix) <- paste("Col", 1:5)

cat("Test matrix:\n")

Test matrix:

print(round(test_matrix, 2))

      Col 1 Col 2 Col 3 Col 4 Col 5
Row 1  9.81 10.26 16.38 11.74 11.75
Row 2  8.34  4.48 12.53  6.07 16.25
Row 3  8.73 13.14 13.60  9.74  4.97
Row 4 14.09  7.82  9.99  9.66  5.32

# Apply functions across margins
# MARGIN = 1 for rows, MARGIN = 2 for columns

# Row operations
row_sums <- apply(test_matrix, 1, sum)
row_means <- apply(test_matrix, 1, mean)
row_ranges <- apply(test_matrix, 1, function(x) max(x) - min(x))

cat("Row sums:", round(row_sums, 2), "\n")

Row sums: 59.95 47.68 50.18 46.88

cat("Row means:", round(row_means, 2), "\n")

Row means: 11.99 9.54 10.04 9.38

cat("Row ranges:", round(row_ranges, 2), "\n")

Row ranges: 6.57 11.76 8.63 8.77

# Column operations  
col_sums <- apply(test_matrix, 2, sum)
col_sds <- apply(test_matrix, 2, sd)

cat("Column sums:", round(col_sums, 2), "\n")

Column sums: 40.97 35.71 52.5 37.21 38.29

cat("Column standard deviations:", round(col_sds, 2), "\n")

Column standard deviations: 2.64 3.67 2.65 2.36 5.43

# Built-in functions (faster than apply for simple operations)
cat("rowSums() result:", round(rowSums(test_matrix), 2), "\n")

rowSums() result: 59.95 47.68 50.18 46.88

cat("colMeans() result:", round(colMeans(test_matrix), 2), "\n")

colMeans() result: 10.24 8.93 13.13 9.3 9.57
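
The matrix above is drawn with `rnorm()` without a seed, so re-running the code produces different numbers. A reproducible sketch, assuming an arbitrary seed of 42, that also uses `sweep()` to centre each column:

```r
set.seed(42)  # arbitrary seed so the random draws are repeatable
m <- matrix(rnorm(20, mean = 10, sd = 3), nrow = 4, ncol = 5)

# apply() and the specialised helpers give the same results
same_row_sums <- isTRUE(all.equal(apply(m, 1, sum), rowSums(m)))
same_col_means <- isTRUE(all.equal(apply(m, 2, mean), colMeans(m)))

# sweep() subtracts each column's mean from every entry in that column
centered <- sweep(m, 2, colMeans(m))
round(colMeans(centered), 10)   # zero up to floating point error
```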

9 Data Frames: The Data Analysis Workhorse

9.1 Understanding Data Frames

Data frames are R’s primary structure for tabular data analysis. Like a matrix, a data frame is rectangular, but each column can hold a different data type, which makes it a natural fit for real-world datasets.
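
Under the hood, a data frame is a list of equal-length vectors, one per column. A minimal sketch with a hypothetical two-column data frame makes this visible:

```r
# Hypothetical toy data frame (illustration only)
df <- data.frame(id = 1:3, name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

typeof(df)           # the underlying storage type is "list"
is.list(df)
length(df)           # counts columns, not rows
sapply(df, typeof)   # each column keeps its own atomic type
```

This is why list tools such as `lapply()` and `[[` work column by column on data frames.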

9.1.1 Data Frame Creation

# Basic data frame creation
employee_df <- data.frame(
  employee_id = 1:6,
  first_name = c("Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"),
  last_name = c("Johnson", "Smith", "Brown", "Davis", "Wilson", "Miller"),
  department = c("IT", "Sales", "IT", "HR", "Sales", "IT"),
  salary = c(75000, 65000, 80000, 70000, 68000, 82000),
  start_date = as.Date(c("2020-01-15", "2019-06-01", "2021-03-10", 
                        "2020-08-20", "2019-11-05", "2022-02-01")),
  is_manager = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  performance_rating = factor(c("Excellent", "Good", "Excellent", 
                               "Good", "Fair", "Excellent"),
                             levels = c("Poor", "Fair", "Good", "Excellent"),
                             ordered = TRUE),
  stringsAsFactors = FALSE  # Keep strings as characters
)

print(employee_df)

  employee_id first_name last_name department salary start_date is_manager
1           1      Alice   Johnson         IT  75000 2020-01-15      FALSE
2           2        Bob     Smith      Sales  65000 2019-06-01       TRUE
3           3    Charlie     Brown         IT  80000 2021-03-10      FALSE
4           4      Diana     Davis         HR  70000 2020-08-20       TRUE
5           5        Eve    Wilson      Sales  68000 2019-11-05      FALSE
6           6      Frank    Miller         IT  82000 2022-02-01       TRUE
  performance_rating
1          Excellent
2               Good
3          Excellent
4               Good
5               Fair
6          Excellent

9.1.2 Data Frame Properties and Inspection

# Basic properties
cat("Dimensions:", dim(employee_df), "\n")

Dimensions: 6 8

cat("Number of rows:", nrow(employee_df), "\n")

Number of rows: 6

cat("Number of columns:", ncol(employee_df), "\n")

Number of columns: 8

cat("Column names:", names(employee_df), "\n")

Column names: employee_id first_name last_name department salary start_date is_manager performance_rating

# Data structure overview
cat("Data frame structure:\n")

Data frame structure:

str(employee_df)

'data.frame':   6 obs. of  8 variables:
 $ employee_id       : int  1 2 3 4 5 6
 $ first_name        : chr  "Alice" "Bob" "Charlie" "Diana" ...
 $ last_name         : chr  "Johnson" "Smith" "Brown" "Davis" ...
 $ department        : chr  "IT" "Sales" "IT" "HR" ...
 $ salary            : num  75000 65000 80000 70000 68000 82000
 $ start_date        : Date, format: "2020-01-15" "2019-06-01" ...
 $ is_manager        : logi  FALSE TRUE FALSE TRUE FALSE TRUE
 $ performance_rating: Ord.factor w/ 4 levels "Poor"<"Fair"<..: 4 3 4 3 2 4

# Summary statistics
cat("Summary statistics:\n")

Summary statistics:

summary(employee_df)

  employee_id    first_name         last_name          department       
 Min.   :1.00   Length:6           Length:6           Length:6          
 1st Qu.:2.25   Class :character   Class :character   Class :character  
 Median :3.50   Mode  :character   Mode  :character   Mode  :character  
 Mean   :3.50                                                           
 3rd Qu.:4.75                                                           
 Max.   :6.00                                                           
     salary        start_date         is_manager      performance_rating
 Min.   :65000   Min.   :2019-06-01   Mode :logical   Poor     :0       
 1st Qu.:68500   1st Qu.:2019-11-22   FALSE:3         Fair     :1       
 Median :72500   Median :2020-05-03   TRUE :3         Good     :2       
 Mean   :73333   Mean   :2020-07-14                   Excellent:3       
 3rd Qu.:78750   3rd Qu.:2021-01-18                                     
 Max.   :82000   Max.   :2022-02-01

# Data types of each column
column_types <- sapply(employee_df, class)
cat("Column types:\n")

Column types:

print(column_types)

$employee_id
[1] "integer"

$first_name
[1] "character"

$last_name
[1] "character"

$department
[1] "character"

$salary
[1] "numeric"

$start_date
[1] "Date"

$is_manager
[1] "logical"

$performance_rating
[1] "ordered" "factor"

# Check for missing values
missing_values <- colSums(is.na(employee_df))
cat("Missing values per column:\n")

Missing values per column:

print(missing_values)

       employee_id         first_name          last_name         department 
                 0                  0                  0                  0 
            salary         start_date         is_manager performance_rating 
                 0                  0                  0                  0
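
Note that `sapply(employee_df, class)` above returned a list rather than a character vector, because the ordered factor column has two classes. One possible way to get exactly one string per column is `vapply()` with the classes collapsed; the sketch below uses a hypothetical two-column data frame:

```r
# Hypothetical data frame with one ordered factor column
df <- data.frame(
  id = 1:2,
  rating = factor(c("Good", "Fair"), levels = c("Fair", "Good"), ordered = TRUE)
)

# sapply() cannot simplify because 'rating' has two classes
class_list <- sapply(df, class)
is.list(class_list)

# vapply() guarantees a character vector with one entry per column
class_chr <- vapply(df, function(col) paste(class(col), collapse = "/"),
                    character(1))
print(class_chr)
```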

9.1.3 Data Frame Indexing and Access

# Column access methods
cat("First names (using $):", employee_df$first_name, "\n")

First names (using $): Alice Bob Charlie Diana Eve Frank

cat("First names (using [[]]):", employee_df[["first_name"]], "\n")

First names (using [[]]): Alice Bob Charlie Diana Eve Frank

cat("Class of extracted column:", class(employee_df$salary), "\n")

Class of extracted column: numeric

# Single bracket returns data frame
first_name_df <- employee_df["first_name"]
cat("Class of [column]:", class(first_name_df), "\n")

Class of [column]: data.frame

# Multiple column selection
name_salary <- employee_df[c("first_name", "last_name", "salary")]
cat("Multiple columns:\n")

Multiple columns:

print(head(name_salary, 3))

  first_name last_name salary
1      Alice   Johnson  75000
2        Bob     Smith  65000
3    Charlie     Brown  80000

# Row access
first_employee <- employee_df[1, ]
cat("First employee:\n")

First employee:

print(first_employee)

  employee_id first_name last_name department salary start_date is_manager
1           1      Alice   Johnson         IT  75000 2020-01-15      FALSE
  performance_rating
1          Excellent

# Specific cell access
alice_salary <- employee_df[1, "salary"]
cat("Alice's salary:", alice_salary, "\n")

Alice's salary: 75000

# Range selection
first_three <- employee_df[1:3, c("first_name", "department", "salary")]
cat("First three employees (selected columns):\n")

First three employees (selected columns):

print(first_three)

  first_name department salary
1      Alice         IT  75000
2        Bob      Sales  65000
3    Charlie         IT  80000
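
Two access details deserve a closer look: `df[, "col"]` simplifies to a vector unless `drop = FALSE` is given, and `$` matches column names partially while `[[` does not by default. A small sketch on a hypothetical two-row data frame:

```r
# Hypothetical two-row data frame (illustration only)
df <- data.frame(first_name = c("Alice", "Bob"), salary = c(75000, 65000))

class(df[, "salary"])                 # simplified to a numeric vector
class(df[, "salary", drop = FALSE])   # stays a data frame

# `$` accepts an unambiguous prefix of a column name
partial <- df$sal

# `[[` requires the exact name and returns NULL otherwise
exact <- df[["sal"]]
is.null(exact)
```

Because partial matching can silently hide typos, `[[` is often the safer choice in scripts and functions.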

9.1.4 Data Frame Filtering and Subsetting

# Simple filtering
it_employees <- employee_df[employee_df$department == "IT", ]
cat("IT employees:\n")

IT employees:

print(it_employees[, c("first_name", "last_name", "salary")])

  first_name last_name salary
1      Alice   Johnson  75000
3    Charlie     Brown  80000
6      Frank    Miller  82000

# Multiple conditions with logical operators
high_earners_it <- employee_df[employee_df$department == "IT" & 
                              employee_df$salary > 75000, ]
cat("High-earning IT employees:\n")

High-earning IT employees:

print(high_earners_it[, c("first_name", "salary")])

  first_name salary
3    Charlie  80000
6      Frank  82000

# Using subset() function (more readable)
managers_subset <- subset(employee_df, 
                         is_manager == TRUE,
                         select = c("first_name", "last_name", "department", "salary"))
cat("Managers (using subset):\n")

Managers (using subset):

print(managers_subset)

  first_name last_name department salary
2        Bob     Smith      Sales  65000
4      Diana     Davis         HR  70000
6      Frank    Miller         IT  82000

# Complex filtering with %in%
target_departments <- c("IT", "Sales")
it_or_sales <- employee_df[employee_df$department %in% target_departments, ]
cat("IT or Sales employees:\n")

IT or Sales employees:

print(it_or_sales[, c("first_name", "department", "salary")])

  first_name department salary
1      Alice         IT  75000
2        Bob      Sales  65000
3    Charlie         IT  80000
5        Eve      Sales  68000
6      Frank         IT  82000

# Filtering with date conditions
recent_hires <- employee_df[employee_df$start_date > as.Date("2020-01-01"), ]
cat("Recent hires (after 2020-01-01):\n")

Recent hires (after 2020-01-01):

print(recent_hires[, c("first_name", "start_date")])

  first_name start_date
1      Alice 2020-01-15
3    Charlie 2021-03-10
4      Diana 2020-08-20
6      Frank 2022-02-01
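
The filters above work cleanly because `employee_df` has no missing values. When the condition evaluates to `NA`, logical indexing returns a row filled with `NA` instead of dropping it. A minimal sketch with hypothetical data:

```r
# Hypothetical data with one missing salary
df <- data.frame(name = c("A", "B", "C"), salary = c(80000, NA, 60000))

# Logical indexing: the NA condition produces an all-NA row
n_logical <- nrow(df[df$salary > 70000, ])

# which() keeps only positions where the condition is TRUE
n_which <- nrow(df[which(df$salary > 70000), ])

# subset() also drops rows where the condition is NA
n_subset <- nrow(subset(df, salary > 70000))

c(logical = n_logical, which = n_which, subset = n_subset)
```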

9.1.5 Data Frame Modification

# Create a working copy
df_work <- employee_df

# Add new columns
df_work$full_name <- paste(df_work$first_name, df_work$last_name)
df_work$years_employed <- as.numeric(Sys.Date() - df_work$start_date) / 365.25
df_work$salary_category <- ifelse(df_work$salary > 75000, "High", "Standard")

# Conditional column creation
df_work$bonus_eligible <- with(df_work, 
  (years_employed > 2) & (performance_rating %in% c("Good", "Excellent"))
)

# Calculate bonus amount based on conditions
df_work$bonus <- ifelse(df_work$bonus_eligible,
                       df_work$salary * 0.1,  # 10% bonus
                       0)

cat("Modified data frame with new columns:\n")

Modified data frame with new columns:

print(df_work[, c("full_name", "years_employed", "salary_category", "bonus")])

      full_name years_employed salary_category bonus
1 Alice Johnson       5.921971        Standard  7500
2     Bob Smith       6.546201        Standard  6500
3 Charlie Brown       4.772074            High  8000
4   Diana Davis       5.325120        Standard  7000
5    Eve Wilson       6.116359        Standard     0
6  Frank Miller       3.874059            High  8200

# Modify existing values
df_work$salary[df_work$first_name == "Alice"] <- 78000  # Give Alice a raise
df_work$is_manager[df_work$first_name == "Charlie"] <- TRUE  # Promote Charlie

# Add new rows
new_employee <- data.frame(
  employee_id = 7L,  # Integer, so the column type is not changed by rbind()
  first_name = "Grace",
  last_name = "Lee",
  department = "HR",
  salary = 72000,
  start_date = as.Date("2023-01-15"),
  is_manager = FALSE,
  performance_rating = factor("Good", levels = levels(df_work$performance_rating),
                              ordered = TRUE),  # Match the ordered factor in df_work
  full_name = "Grace Lee",
  years_employed = 1.0,
  salary_category = "Standard",
  bonus_eligible = FALSE,
  bonus = 0
)

df_work <- rbind(df_work, new_employee)
cat("After adding new employee:", nrow(df_work), "total employees\n")

After adding new employee: 7 total employees
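
Since `years_employed` is computed from `Sys.Date()`, the values printed above change every day the code is run. For reproducible results, one option is to compute tenure against a fixed reference date. The sketch below assumes a hypothetical cut-off date `as_of`:

```r
start_date <- as.Date(c("2020-01-15", "2022-02-01"))
as_of <- as.Date("2024-01-01")   # hypothetical fixed reference date

# difftime() with explicit units avoids relying on automatic unit selection
days_employed <- as.numeric(difftime(as_of, start_date, units = "days"))
years_employed <- days_employed / 365.25
round(years_employed, 2)
```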

9.1.6 Advanced Data Frame Operations

# Using dplyr for modern data manipulation
library(dplyr)

# Chain the steps with the native pipe |> (available from R 4.1)
summary_stats <- employee_df |>
  mutate(
    full_name = paste(first_name, last_name),
    years_employed = as.numeric(Sys.Date() - start_date) / 365.25,
    salary_k = salary / 1000  # Convert to thousands
  ) |>
  filter(department %in% c("IT", "Sales")) |>
  group_by(department, is_manager) |>
  summarise(
    count = n(),
    avg_salary = mean(salary),
    avg_years = mean(years_employed),
    .groups = "drop"
  ) |>
  arrange(department, desc(is_manager))

cat("Summary statistics by department and management status:\n")

Summary statistics by department and management status:

print(summary_stats)

# A tibble: 4 × 5
  department is_manager count avg_salary avg_years
  <chr>      <lgl>      <int>      <dbl>     <dbl>
1 IT         TRUE           1      82000      3.87
2 IT         FALSE          2      77500      5.35
3 Sales      TRUE           1      65000      6.55
4 Sales      FALSE          1      68000      6.12

# Derive categorical columns with case_when (conditions are checked in order)
employee_df_enhanced <- employee_df |>
  mutate(
    experience_level = case_when(
      as.numeric(Sys.Date() - start_date) / 365.25 < 2 ~ "Junior",
      as.numeric(Sys.Date() - start_date) / 365.25 < 4 ~ "Mid-level", 
      TRUE ~ "Senior"
    ),
    compensation_tier = case_when(
      salary < 70000 ~ "Tier 1",
      salary < 80000 ~ "Tier 2",
      TRUE ~ "Tier 3"
    )
  )

cat("Enhanced employee data with calculated fields:\n")

Enhanced employee data with calculated fields:

print(employee_df_enhanced |> 
       select(first_name, department, experience_level, compensation_tier))

  first_name department experience_level compensation_tier
1      Alice         IT           Senior            Tier 2
2        Bob      Sales           Senior            Tier 1
3    Charlie         IT           Senior            Tier 3
4      Diana         HR           Senior            Tier 2
5        Eve      Sales           Senior            Tier 1
6      Frank         IT        Mid-level            Tier 3
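
For purely numeric thresholds such as the compensation tiers, base R's `cut()` is a possible alternative to `case_when()`. The sketch below reuses the salary values from `employee_df`; setting `right = FALSE` makes the intervals closed on the left, which mirrors the `<` comparisons above:

```r
salary <- c(75000, 65000, 80000, 70000, 68000, 82000)  # salaries from employee_df

tier <- cut(salary,
            breaks = c(-Inf, 70000, 80000, Inf),
            labels = c("Tier 1", "Tier 2", "Tier 3"),
            right = FALSE)   # intervals are [lower, upper)

as.character(tier)
```

Unlike `case_when()`, `cut()` returns a factor, so the tiers carry an explicit level order.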

9.1.7 Data Frame Aggregation and Grouping

# Base R aggregation
dept_summary <- aggregate(salary ~ department, data = employee_df, 
                         FUN = function(x) c(mean = mean(x), 
                                           median = median(x), 
                                           count = length(x)))
cat("Department salary summary:\n")

Department salary summary:

print(dept_summary)

  department salary.mean salary.median salary.count
1         HR       70000         70000            1
2         IT       79000         80000            3
3      Sales       66500         66500            2

# Multiple grouping variables
dept_mgr_summary <- aggregate(salary ~ department + is_manager, 
                             data = employee_df, 
                             FUN = mean)
cat("Average salary by department and management status:\n")

Average salary by department and management status:

print(dept_mgr_summary)

  department is_manager salary
1         IT      FALSE  77500
2      Sales      FALSE  68000
3         HR       TRUE  70000
4         IT       TRUE  82000
5      Sales       TRUE  65000

# Using by() function for more complex operations
performance_by_dept <- by(employee_df$performance_rating, 
                         employee_df$department, 
                         table)
cat("Performance ratings by department:\n")

Performance ratings by department:

print(performance_by_dept)

employee_df$department: HR

     Poor      Fair      Good Excellent 
        0         0         1         0 
------------------------------------------------------------ 
employee_df$department: IT

     Poor      Fair      Good Excellent 
        0         0         0         3 
------------------------------------------------------------ 
employee_df$department: Sales

     Poor      Fair      Good Excellent 
        0         1         1         0

# Table operations for cross-tabulation
dept_performance_table <- table(employee_df$department, 
                               employee_df$performance_rating)
cat("Department vs Performance cross-table:\n")

Department vs Performance cross-table:

print(dept_performance_table)

       
        Poor Fair Good Excellent
  HR       0    0    1         0
  IT       0    0    0         3
  Sales    0    1    1         0

# Proportional tables
dept_performance_prop <- prop.table(dept_performance_table, margin = 1)
cat("Performance distribution within each department:\n")

Performance distribution within each department:

print(round(dept_performance_prop, 3))

       
        Poor Fair Good Excellent
  HR     0.0  0.0  1.0       0.0
  IT     0.0  0.0  0.0       1.0
  Sales  0.0  0.5  0.5       0.0
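
One subtlety of the first `aggregate()` call: when `FUN` returns several values, the result stores them in a single matrix column, even though it prints like separate columns. A sketch with a hypothetical three-row data frame shows this and one way to flatten it:

```r
# Hypothetical subset of the employee data
df <- data.frame(department = c("IT", "IT", "HR"),
                 salary = c(75000, 80000, 70000))

agg <- aggregate(salary ~ department, data = df,
                 FUN = function(x) c(mean = mean(x), count = length(x)))

ncol(agg)               # 'salary' is one matrix column
is.matrix(agg$salary)

# do.call(data.frame, ...) expands the matrix into ordinary columns
agg_flat <- do.call(data.frame, agg)
names(agg_flat)
```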

10 Arrays: Multi-Dimensional Data

10.1 Understanding Arrays

Arrays extend matrices to more than two dimensions. They’re useful for storing multi-dimensional scientific or business data.

10.1.1 Array Creation and Structure

# Create a 3-dimensional array (think: multiple matrices stacked)
# Dimensions: 4 quarters × 3 products × 2 years
sales_array <- array(
  data = c(100, 120, 110, 130,    # Q1-Q4 for Product A, Year 1
           150, 160, 140, 170,    # Q1-Q4 for Product B, Year 1  
           200, 210, 190, 220,    # Q1-Q4 for Product C, Year 1
           110, 130, 120, 140,    # Q1-Q4 for Product A, Year 2
           160, 170, 150, 180,    # Q1-Q4 for Product B, Year 2
           210, 220, 200, 230),   # Q1-Q4 for Product C, Year 2
  dim = c(4, 3, 2),  # 4 quarters, 3 products, 2 years
  dimnames = list(
    quarters = paste0("Q", 1:4),
    products = c("Product_A", "Product_B", "Product_C"), 
    years = c("2022", "2023")
  )
)

print(sales_array)

, , years = 2022

        products
quarters Product_A Product_B Product_C
      Q1       100       150       200
      Q2       120       160       210
      Q3       110       140       190
      Q4       130       170       220

, , years = 2023

        products
quarters Product_A Product_B Product_C
      Q1       110       160       210
      Q2       130       170       220
      Q3       120       150       200
      Q4       140       180       230

# Array properties
cat("Array dimensions:", dim(sales_array), "\n")

Array dimensions: 4 3 2

cat("Dimension names:\n")

Dimension names:

print(dimnames(sales_array))

$quarters
[1] "Q1" "Q2" "Q3" "Q4"

$products
[1] "Product_A" "Product_B" "Product_C"

$years
[1] "2022" "2023"

cat("Total elements:", length(sales_array), "\n")

Total elements: 24
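
The order of the `data` values above follows from how R stores arrays: the first index varies fastest, then the second, then the third (column-major order). A small sketch, using an array whose values equal their storage positions, checks the position formula for one element:

```r
arr <- array(1:24, dim = c(4, 3, 2))   # each value equals its storage position

# Element [i, j, k] sits at linear position i + (j - 1) * 4 + (k - 1) * 4 * 3
i <- 2; j <- 3; k <- 2
linear_pos <- i + (j - 1) * 4 + (k - 1) * 4 * 3

arr[i, j, k] == arr[linear_pos]
```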

10.1.2 Array Indexing and Slicing

# Single element access
q2_productb_2022 <- sales_array["Q2", "Product_B", "2022"]
cat("Q2 Product B 2022 sales:", q2_productb_2022, "\n")

Q2 Product B 2022 sales: 160

# Slice operations
# All quarters and products for 2022
sales_2022 <- sales_array[, , "2022"]
cat("All 2022 sales:\n"); print(sales_2022)

All 2022 sales:

        products
quarters Product_A Product_B Product_C
      Q1       100       150       200
      Q2       120       160       210
      Q3       110       140       190
      Q4       130       170       220

# Product A sales across all quarters and years
product_a_sales <- sales_array[, "Product_A", ]
cat("Product A sales across time:\n"); print(product_a_sales)

Product A sales across time:

        years
quarters 2022 2023
      Q1  100  110
      Q2  120  130
      Q3  110  120
      Q4  130  140

# Q1 sales for all products and years
q1_sales <- sales_array["Q1", , ]
cat("Q1 sales across products and years:\n"); print(q1_sales)

Q1 sales across products and years:

           years
products    2022 2023
  Product_A  100  110
  Product_B  150  160
  Product_C  200  210

# Multiple quarters for specific product and year
q1_q2_productc_2023 <- sales_array[c("Q1", "Q2"), "Product_C", "2023"]
cat("Q1-Q2 Product C 2023:", q1_q2_productc_2023, "\n")

Q1-Q2 Product C 2023: 210 220

10.1.3 Array Operations with Apply

# Apply operations across different dimensions
# MARGIN names the dimension(s) to keep; all other dimensions are collapsed
# MARGIN = 1: one result per quarter
# MARGIN = 2: one result per product
# MARGIN = 3: one result per year

# Total sales by quarter (sum across products and years)
quarterly_totals <- apply(sales_array, 1, sum)
cat("Quarterly totals:", quarterly_totals, "\n")

Quarterly totals: 930 1010 910 1070

# Total sales by product (sum across quarters and years)
product_totals <- apply(sales_array, 2, sum)
cat("Product totals:", product_totals, "\n")

Product totals: 960 1280 1680

# Total sales by year (sum across quarters and products)
yearly_totals <- apply(sales_array, 3, sum)
cat("Yearly totals:", yearly_totals, "\n")

Yearly totals: 1900 2020

# Average sales by quarter and product (across years)
avg_by_quarter_product <- apply(sales_array, c(1, 2), mean)
cat("Average sales by quarter and product (across years):\n")

Average sales by quarter and product (across years):

print(avg_by_quarter_product)

        products
quarters Product_A Product_B Product_C
      Q1       105       155       205
      Q2       125       165       215
      Q3       115       145       195
      Q4       135       175       225

# Growth rates between years for each quarter-product combination
growth_rates <- apply(sales_array, c(1, 2), function(x) (x[2] - x[1]) / x[1] * 100)
cat("Growth rates (2022 to 2023) by quarter and product:\n")

Growth rates (2022 to 2023) by quarter and product:

print(round(growth_rates, 1))

        products
quarters Product_A Product_B Product_C
      Q1      10.0       6.7       5.0
      Q2       8.3       6.2       4.8
      Q3       9.1       7.1       5.3
      Q4       7.7       5.9       4.5
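
With only two years, the growth calculation can also be written as arithmetic on two slices, which avoids calling a function once per cell. A sketch on a small hypothetical 2 × 2 × 2 array compares both approaches:

```r
# Hypothetical array: 2 quarters x 2 products x 2 years
arr <- array(c(100, 120, 150, 160,    # year 1
               110, 130, 160, 170),   # year 2
             dim = c(2, 2, 2))

growth_apply <- apply(arr, c(1, 2), function(x) (x[2] - x[1]) / x[1] * 100)
growth_slices <- (arr[, , 2] - arr[, , 1]) / arr[, , 1] * 100

all.equal(growth_apply, growth_slices)
```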

11 Advanced Data Structures and Concepts

11.1 Environments: R’s Scoping System

# Environments store object-name bindings
current_env <- environment()
global_env <- .GlobalEnv

# Create a new environment
my_env <- new.env()
my_env$x <- 10
my_env$y <- 20
my_env$calculate <- function(a, b) a + b

# List objects in environment
cat("Objects in my_env:", ls(my_env), "\n")

Objects in my_env: calculate x y

# Access environment objects
cat("x from my_env:", my_env$x, "\n")

x from my_env: 10

cat("Calculation result:", my_env$calculate(my_env$x, my_env$y), "\n")

Calculation result: 30

# Environment hierarchy
cat("Current environment:", environmentName(current_env), "\n")

Current environment: R_GlobalEnv

cat("Parent environment:", environmentName(parent.env(current_env)), "\n")

Parent environment: package:dplyr

# Search path for R objects
search_path <- search()
cat("R search path:\n")

R search path:

print(search_path)

 [1] ".GlobalEnv"        "package:dplyr"     "package:purrr"    
 [4] "package:forcats"   "package:stats"     "package:graphics" 
 [7] "package:grDevices" "package:utils"     "package:datasets" 
[10] "package:methods"   "Autoloads"         "package:base"
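
Unlike lists and vectors, environments are not copied when modified: every variable that refers to an environment sees the same bindings. The sketch below contrasts the two behaviours with a pair of hypothetical helper functions:

```r
# Hypothetical helpers: each adds 1 to a 'count' element
increment_list <- function(l) { l$count <- l$count + 1; invisible(l) }
increment_env  <- function(e) { e$count <- e$count + 1; invisible(e) }

counter_list <- list(count = 0)
counter_env <- new.env()
counter_env$count <- 0

increment_list(counter_list)   # changes a local copy only
increment_env(counter_env)     # changes the shared environment in place

counter_list$count
counter_env$count
```

This reference behaviour is what makes environments useful for mutable state, and also why they should be passed around with care.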

11.2 Object-Oriented Programming: S3 Classes

# Create an S3 object (simple object-oriented programming)
create_bank_account <- function(owner, balance = 0) {
  account <- list(
    owner = owner,
    balance = balance,
    transactions = data.frame(
      date = as.Date(character(0)),
      type = character(0),
      amount = numeric(0),
      stringsAsFactors = FALSE
    )
  )
  class(account) <- "bank_account"
  return(account)
}

# Create methods for our class
# print() methods accept `...` to match the generic and return the object invisibly
print.bank_account <- function(x, ...) {
  cat("Bank Account\n")
  cat("Owner:", x$owner, "\n")
  cat("Balance:  $", x$balance, "\n")
  cat("Transactions:", nrow(x$transactions), "\n")
  invisible(x)
}

# Generics so that deposit() and withdraw() dispatch on the object's class
deposit <- function(account, amount, ...) UseMethod("deposit")
withdraw <- function(account, amount, ...) UseMethod("withdraw")

deposit.bank_account <- function(account, amount, ...) {
  if (amount <= 0) {
    stop("Deposit amount must be positive")
  }
  account$balance <- account$balance + amount
  new_transaction <- data.frame(
    date = Sys.Date(),
    type = "deposit",
    amount = amount
  )
  account$transactions <- rbind(account$transactions, new_transaction)
  return(account)
}

withdraw.bank_account <- function(account, amount, ...) {
  if (amount <= 0) {
    stop("Withdrawal amount must be positive")
  }
  if (amount > account$balance) {
    stop("Insufficient funds")
  }
  account$balance <- account$balance - amount
  new_transaction <- data.frame(
    date = Sys.Date(),
    type = "withdrawal", 
    amount = -amount
  )
  account$transactions <- rbind(account$transactions, new_transaction)
  return(account)
}

# Use our S3 class
my_account <- create_bank_account("John Doe", 1000)
print(my_account)

Bank Account
Owner: John Doe 
Balance:  $ 1000 
Transactions: 0

my_account <- deposit(my_account, 500)   # Dispatches to deposit.bank_account
my_account <- withdraw(my_account, 200)  # Dispatches to withdraw.bank_account
print(my_account)

Bank Account
Owner: John Doe 
Balance:  $ 1300 
Transactions: 2
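
S3 classes can also inherit from one another through the class vector. The sketch below is separate from the `bank_account` example: the constructor `new_account()`, the generic `describe()`, and the classes `account` and `savings` are all hypothetical names chosen for illustration.

```r
# Hypothetical constructor: extra classes are placed before the base class
new_account <- function(owner, balance = 0, class = character()) {
  structure(list(owner = owner, balance = balance),
            class = c(class, "account"))
}

# A generic plus one method per class
describe <- function(x, ...) UseMethod("describe")

describe.account <- function(x, ...) {
  paste0(x$owner, ": ", x$balance)
}

describe.savings <- function(x, ...) {
  # NextMethod() calls the method of the next class in the class vector
  paste0(NextMethod(), " (savings)")
}

acct <- new_account("Ann", 100)
sav <- new_account("Bob", 250, class = "savings")

describe(acct)
describe(sav)
inherits(sav, "account")
```

Dispatch walks the class vector from left to right, so the `savings` method runs first and delegates to the `account` method.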

12 Performance Optimization and Best Practices

12.1 Memory Efficiency and Performance

12.1.1 Pre-allocation vs Growing Objects

# Demonstrate the performance difference
n <- 10000

# BAD: Growing objects (very slow)
system.time({
  bad_vector <- numeric(0)
  for (i in 1:n) {
    bad_vector <- c(bad_vector, i^2)
  }
})

   user  system elapsed 
  0.111   0.015   0.126

# GOOD: Pre-allocation (much faster)
system.time({
  good_vector <- numeric(n)
  for (i in 1:n) {
    good_vector[i] <- i^2
  }
})

   user  system elapsed 
  0.003   0.000   0.002

# BEST: Vectorized operations (fastest)
system.time({
  best_vector <- (1:n)^2
})

   user  system elapsed 
      0       0       0

# Verify all methods give same result
cat("All methods equivalent:", 
    identical(bad_vector, good_vector) && identical(good_vector, best_vector), "\n")

All methods equivalent: TRUE
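
When a computation cannot be vectorized directly, `vapply()` is one possible middle ground: it allocates the result up front and checks that every call returns the declared type. A sketch with a smaller `n` than above, chosen only to keep the example quick:

```r
n <- 1000

# Explicit pre-allocation; seq_len(n) is also safe when n is 0
squares_loop <- numeric(n)
for (i in seq_len(n)) {
  squares_loop[i] <- i^2
}

# vapply() declares that each call returns exactly one numeric value
squares_vapply <- vapply(seq_len(n), function(i) i^2, numeric(1))

identical(squares_loop, squares_vapply)
```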

12.1.2 Data Structure Selection for Performance

library(microbenchmark)

# Create test data
n_rows <- 1000
test_matrix <- matrix(rnorm(n_rows * 10), nrow = n_rows)
test_df <- as.data.frame(test_matrix)

# Compare performance for column operations
performance_test <- microbenchmark(
  matrix_colsums = colSums(test_matrix),
  df_apply = apply(test_df, 2, sum),
  df_sapply = sapply(test_df, sum),
  times = 100
)

print(performance_test)

Unit: microseconds
           expr     min       lq      mean   median       uq     max neval
 matrix_colsums  11.400  12.0295  13.31633  12.7575  14.0065  25.938   100
       df_apply 147.240 155.8440 181.98039 164.7900 208.6390 254.757   100
      df_sapply  23.237  24.7990  27.22466  26.2545  28.5240  66.505   100

# Memory usage comparison
cat("Matrix memory:", as.numeric(object.size(test_matrix)), "bytes\n")

Matrix memory: 80216 bytes

cat("Data frame memory:", as.numeric(object.size(test_df)), "bytes\n")

Data frame memory: 81904 bytes
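
Part of the cost of `apply()` on a data frame is that the data frame is first converted to a matrix. That conversion also forces a single type, which the following sketch illustrates with two hypothetical data frames:

```r
# Hypothetical data frames: one mixed, one purely numeric
mixed_df <- data.frame(x = 1:2, label = c("a", "b"))
numeric_df <- data.frame(x = 1:2, y = c(0.5, 1.5))

typeof(as.matrix(mixed_df))     # everything becomes character
typeof(as.matrix(numeric_df))   # numeric columns share one numeric type

# On all-numeric data, apply() and colSums() agree
all.equal(apply(numeric_df, 2, sum), colSums(numeric_df))
```

For column-wise work on data frames, `sapply()`, `vapply()`, or `colSums()` avoid surprises from this coercion.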

12.1.3 Factors vs Characters for Memory

# Compare memory usage: factors vs characters for categorical data
n_observations <- 100000
categories <- c("Category_A", "Category_B", "Category_C", "Category_D")

# Create character vector
char_vector <- sample(categories, n_observations, replace = TRUE)

# Create factor
factor_vector <- factor(char_vector)

# Memory comparison
char_size <- as.numeric(object.size(char_vector))
factor_size <- as.numeric(object.size(factor_vector))

cat("Character vector memory:", char_size, "bytes\n")

Character vector memory: 800304 bytes

cat("Factor memory:", factor_size, "bytes\n")

Factor memory: 400720 bytes

cat("Factor saves:", round((char_size - factor_size) / char_size * 100, 1), "% memory\n")

Factor saves: 49.9 % memory

# Performance comparison for operations
cat("\nPerformance comparison for table operations:\n")


Performance comparison for table operations:

performance_categorical <- microbenchmark(
  character_table = table(char_vector),
  factor_table = table(factor_vector),
  times = 50
)
print(performance_categorical)

Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval
 character_table 3.063221 3.113143 3.689165 3.138797 4.567353 6.940484    50
    factor_table 1.389336 1.405627 1.630792 1.416196 1.435122 3.761770    50
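
The memory saving comes from how factors are stored: an integer code per observation plus a single copy of each label. A minimal sketch on a hypothetical four-element factor:

```r
f <- factor(c("b", "a", "b", "c"))   # hypothetical small factor

typeof(f)        # stored as integer codes
as.integer(f)    # codes index into the (sorted) levels
levels(f)        # each label is stored once

# The original labels can be rebuilt from codes and levels
rebuilt <- levels(f)[as.integer(f)]
identical(rebuilt, as.character(f))
```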

12.2 Best Practices Summary

12.2.1 Data Structure Selection Guide

# Function that prints a data structure recommendation guide
recommend_structure <- function() {
  cat("Data Structure Recommendation Guide\n")
  cat("===================================\n\n")
  
  recommendations <- list(
    "Homogeneous numeric data, mathematical operations" = "Matrix or Numeric Vector",
    "Heterogeneous tabular data" = "Data Frame or Tibble", 
    "Categorical data with known levels" = "Factor",
    "Hierarchical or nested data" = "List",
    "Mixed-type collection of objects" = "List",
    "Multi-dimensional scientific data" = "Array",
    "Simple sequence of same-type values" = "Atomic Vector"
  )
  
  for (i in seq_along(recommendations)) {
    cat("•", names(recommendations)[i], "\n")
    cat("  →", recommendations[[i]], "\n\n")
  }
}

recommend_structure()

Data Structure Recommendation Guide
===================================

• Homogeneous numeric data, mathematical operations 
  → Matrix or Numeric Vector 

• Heterogeneous tabular data 
  → Data Frame or Tibble 

• Categorical data with known levels 
  → Factor 

• Hierarchical or nested data 
  → List 

• Mixed-type collection of objects 
  → List 

• Multi-dimensional scientific data 
  → Array 

• Simple sequence of same-type values 
  → Atomic Vector
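The guide above is descriptive only. As a rough sketch of how the same rules could be applied to actual data, the hypothetical helper below (`suggest_structure` is not part of base R or of this chapter's code) inspects a list of candidate columns and returns one of the labels from the guide. The decision rules are simplifying assumptions, not a complete classifier.

```r
# Hypothetical helper: suggest a structure for a list of candidate columns
suggest_structure <- function(columns) {
  stopifnot(is.list(columns), length(columns) > 0)

  same_length <- length(unique(lengths(columns))) == 1
  all_atomic  <- all(vapply(columns, is.atomic, logical(1)))

  # Nested or ragged data does not fit a rectangular structure
  if (!all_atomic || !same_length) return("List")

  # A single atomic column is just a vector
  if (length(columns) == 1) return("Atomic Vector")

  # is.numeric() is FALSE for factors and dates, so those fall through
  all_numeric <- all(vapply(columns, is.numeric, logical(1)))
  if (all_numeric) "Matrix" else "Data Frame or Tibble"
}

suggest_structure(list(1:3, c(1.5, 2, 3)))             # "Matrix"
suggest_structure(list(id = 1:3, name = c("a", "b", "c")))  # "Data Frame or Tibble"
suggest_structure(list(1:3, 1:2))                      # "List"
```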

12.2.2 2. Common Pitfalls and Solutions

# Common pitfall demonstrations and solutions

cat("PITFALL 1: Automatic type coercion\n")

PITFALL 1: Automatic type coercion

mixed_data <- c(1, 2, 3, "four")
cat("Result type:", typeof(mixed_data), "\n")

Result type: character

cat("Solution: Use lists for mixed types\n\n")

Solution: Use lists for mixed types
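To make the coercion rule concrete, the sketch below shows how `c()` promotes its arguments to the most flexible type present (logical, then integer, then double, then character), and how a list keeps each element's own type. The variable `mixed_list` is introduced here for illustration only.

```r
# c() coerces everything to the most flexible type present
typeof(c(TRUE, 1L))     # "integer"
typeof(c(1L, 2.5))      # "double"
typeof(c(1, "four"))    # "character"

# A list stores each element with its own type
mixed_list <- list(1, 2, 3, "four")
element_types <- vapply(mixed_list, typeof, character(1))
element_types           # "double" "double" "double" "character"

# Numeric work can then be restricted to the numeric elements
numeric_part <- unlist(Filter(is.numeric, mixed_list))
sum(numeric_part)       # 6
```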

cat("PITFALL 2: Factor level surprises\n")

PITFALL 2: Factor level surprises

original_factor <- factor(c("low", "medium", "high"))
cat("Original levels:", levels(original_factor), "\n")

Original levels: high low medium

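# Levels default to alphabetical order, not the order in which values appear.
# Appending a value that is not an existing level does not extend the factor:
# new_factor <- c(original_factor, "very_high")  # Result is no longer a factor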
cat("Solution: Update levels before adding new data\n\n")

Solution: Update levels before adding new data
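A minimal sketch of the suggested fix, using a hypothetical `ratings` factor: set the level order explicitly at creation, and extend the levels before introducing a new value. Assigning a value that is not a known level produces `NA` with a warning, which is suppressed here only to keep the example quiet.

```r
# Set the level order explicitly instead of relying on alphabetical sorting
ratings <- factor(c("low", "medium", "high"),
                  levels = c("low", "medium", "high"))
levels(ratings)                 # "low" "medium" "high"

# Assigning an unknown level silently becomes NA (R also emits a warning)
attempt <- ratings
suppressWarnings(attempt[1] <- "very_high")
is.na(attempt[1])               # TRUE

# Safer: rebuild the factor with the extended set of levels
new_levels <- c(levels(ratings), "very_high")
extended <- factor(c(as.character(ratings), "very_high"), levels = new_levels)
levels(extended)                # "low" "medium" "high" "very_high"
```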

cat("PITFALL 3: Ignoring missing values\n")

PITFALL 3: Ignoring missing values

data_with_na <- c(1, 2, NA, 4, 5)
cat("Mean without na.rm:", mean(data_with_na), "\n")

Mean without na.rm: NA

cat("Mean with na.rm=TRUE:", mean(data_with_na, na.rm = TRUE), "\n\n")

Mean with na.rm=TRUE: 3
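Before dropping missing values it usually helps to know how many there are and where they sit. The sketch below reuses the same five-element vector and relies only on base R functions; the name `clean_values` is introduced for illustration.

```r
data_with_na <- c(1, 2, NA, 4, 5)

anyNA(data_with_na)             # TRUE: at least one missing value
sum(is.na(data_with_na))        # 1: how many are missing
which(is.na(data_with_na))      # 3: position of the missing value

# Dropping NAs explicitly gives the same mean as na.rm = TRUE
clean_values <- data_with_na[!is.na(data_with_na)]
mean(clean_values)              # 3
```

Inspecting first makes it a deliberate choice whether to remove, impute, or investigate the missing entries.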

cat("PITFALL 4: Matrix vs data frame confusion\n")

PITFALL 4: Matrix vs data frame confusion

test_matrix <- matrix(1:6, nrow = 2)
test_df <- data.frame(x = 1:2, y = 3:4, z = 5:6)
cat("Matrix column access returns:", class(test_matrix[, 1]), "\n")

Matrix column access returns: integer

cat("Data frame column access returns:", class(test_df[, 1]), "\n")

Data frame column access returns: integer

cat("Solution: Use drop=FALSE to maintain structure\n")

Solution: Use drop=FALSE to maintain structure

cat("Matrix with drop=FALSE:", class(test_matrix[, 1, drop = FALSE]), "\n")

Matrix with drop=FALSE: matrix array
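The same `drop = FALSE` argument applies to data frames. A short sketch, rebuilding the two test objects so the block runs on its own, of the subsetting forms that keep or lose the two-dimensional structure:

```r
test_matrix <- matrix(1:6, nrow = 2)
test_df <- data.frame(x = 1:2, y = 3:4, z = 5:6)

# Forms that simplify to a plain vector
class(test_df[, 1])                   # "integer"
class(test_df[["x"]])                 # "integer"

# Forms that keep the data frame structure
class(test_df[, 1, drop = FALSE])     # "data.frame"
class(test_df[1])                     # "data.frame"

# For matrices, drop = FALSE keeps both dimensions, even for a single row
dim(test_matrix[1, , drop = FALSE])   # 1 3
```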

12.2.3 3. Debugging and Validation Functions

# Utility functions for data structure debugging
debug_structure <- function(obj, name = deparse(substitute(obj))) {
  cat("=== Debug Info for", name, "===\n")
  cat("Class:", class(obj), "\n")
  cat("Type:", typeof(obj), "\n")
  cat("Length/Dimensions:", 
      if(is.null(dim(obj))) length(obj) else paste(dim(obj), collapse = "×"), "\n")
  
  if (is.factor(obj)) {
    cat("Factor levels:", nlevels(obj), "\n")
    cat("Ordered:", is.ordered(obj), "\n")
  }
  
  if (any(is.na(obj))) {
    cat("Missing values:", sum(is.na(obj)), "\n")
  }
  
  cat("Memory usage:", as.numeric(object.size(obj)), "bytes\n")
  cat("========================\n\n")
}

# Validate data frame structure
validate_dataframe <- function(df, required_cols = NULL) {
  cat("Data Frame Validation\n")
  cat("====================\n")
  
  # Basic structure
  cat("Dimensions:", paste(dim(df), collapse = " × "), "\n")
  cat("Complete cases:", sum(complete.cases(df)), "/", nrow(df), "\n")
  
  # Check required columns
  if (!is.null(required_cols)) {
    missing_cols <- required_cols[!required_cols %in% names(df)]
    if (length(missing_cols) > 0) {
      cat("⚠️  Missing required columns:", paste(missing_cols, collapse = ", "), "\n")
    } else {
      cat("✅ All required columns present\n")
    }
  }
  
  # Data type summary
  cat("\nColumn types:\n")
  col_types <- vapply(df, function(x) paste(class(x), collapse = "/"), character(1))
  for (i in seq_along(col_types)) {
    cat("  ", names(col_types)[i], ":", col_types[i], "\n")
  }
  
  cat("====================\n\n")
}

# Example usage
debug_structure(employee_df)

=== Debug Info for employee_df ===
Class: data.frame 
Type: list 
Length/Dimensions: 6×8 
Memory usage: 3752 bytes
========================

validate_dataframe(employee_df, required_cols = c("employee_id", "first_name", "salary"))

Data Frame Validation
====================
Dimensions: 6 × 8 
Complete cases: 6 / 6 
✅ All required columns present

Column types:
   employee_id : integer 
   first_name : character 
   last_name : character 
   department : character 
   salary : numeric 
   start_date : Date 
   is_manager : logical 
   performance_rating : ordered/factor 
====================
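The validator above only prints a report, so calling code cannot act on the outcome. As one possible variation, the hypothetical `check_dataframe()` below returns the findings as a list and stops early when the input is not a data frame. The function name, the returned fields, and the small `staff` data frame are all assumptions made for this sketch.

```r
# Hypothetical variant: return validation results instead of printing them
check_dataframe <- function(df, required_cols = character()) {
  if (!is.data.frame(df)) {
    stop("`df` must be a data frame, not ", class(df)[1], call. = FALSE)
  }

  missing_cols <- setdiff(required_cols, names(df))

  list(
    ok           = length(missing_cols) == 0,
    missing_cols = missing_cols,
    n_rows       = nrow(df),
    n_complete   = sum(complete.cases(df))
  )
}

# Illustrative data with one missing salary and no first_name column
staff <- data.frame(employee_id = 1:3, salary = c(50000, NA, 62000))
result <- check_dataframe(staff, required_cols = c("employee_id", "first_name"))

result$ok             # FALSE
result$missing_cols   # "first_name"
result$n_complete     # 2
```

Returning a list lets a pipeline branch on `result$ok` or log `result$missing_cols`, while the printing version remains useful for interactive inspection.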

13 Summary and Quick Reference

13.1 Data Structure Comparison Table

Structure	Dimensions	Homogeneous	Best Use Cases	Memory Efficiency
Vector	1D	✅ Yes	Mathematical operations, sequences	⭐⭐⭐⭐⭐
Factor	1D	✅ Yes	Categorical data, statistical modeling	⭐⭐⭐⭐⭐
Matrix	2D	✅ Yes	Linear algebra, image data	⭐⭐⭐⭐
Array	ND	✅ Yes	Multi-dimensional scientific data	⭐⭐⭐⭐
List	1D	❌ No	Complex nested structures	⭐⭐⭐
Data Frame	2D	❌ No	Data analysis, mixed-type tables	⭐⭐⭐

13.2 Key Takeaways

13.2.1 For Beginners

Start with vectors and data frames - they cover most everyday use cases
Understand indexing - mastering $, [], and [[]] is crucial
Use factors for categorical data - with many repeated values they can save memory and speed up operations such as table()

13.2.2 For Intermediate Users

Leverage lists for complex data - they allow nesting and mixed types
Master apply functions - they enable efficient operations across data structures
Choose the right structure - consider data type, operations, and memory needs

13.2.3 For Advanced Users

Optimize performance - pre-allocate objects and prefer vectorized operations
Utilize S3 classes - for custom object-oriented programming
Debug and validate - use custom functions to ensure data integrity and structure
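Several of the takeaways above can be sketched in a few lines of base R. The example below uses a made-up `scores` list to contrast the three indexing operators, applies a function across it with `vapply()`, and compares a pre-allocated loop with its vectorized equivalent. It is an illustration of the ideas, not a benchmark.

```r
# Indexing: [ ] keeps the container, [[ ]] and $ extract the element
scores <- list(math = c(90, 85), art = c(70, 95))
class(scores["math"])      # "list"
class(scores[["math"]])    # "numeric"
identical(scores$math, scores[["math"]])

# Apply family: vapply() states the expected result type up front
subject_means <- vapply(scores, mean, numeric(1))
subject_means              # math 87.5, art 82.5

# Pre-allocation: size the result once instead of growing it in the loop
n <- 1000
squares_loop <- numeric(n)
for (i in seq_len(n)) {
  squares_loop[i] <- i^2
}

# Vectorized equivalent: one call, no explicit loop
squares_vec <- seq_len(n)^2
all.equal(squares_loop, squares_vec)
```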