6  Reading SAS Datasets (+ Cleaning)

library(haven)
library(dplyr)

library(labelled)

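We first try to read a SAS dataset from disk (here the SDTM Demographics domain, DM). If the file is not present, we synthesize a small example with the same core variables so that the rest of the chapter still runs.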

dm_path <- "data/sdtm/dm.sas7bdat"

if (file.exists(dm_path)) {
  dm <- read_sas(dm_path)
} else {
  dm <- tibble::tibble(
    STUDYID = "XYZ123",
    USUBJID = sprintf("XYZ-%03d", 1:10),
    ARM = rep(c("Placebo","Active"), length.out=10),
    AGE = c(55, 62, 47, 50, 71, 66, 45, 59, 53, 68),
    SEX = rep(c("M","F"), length.out=10)
  )
  message("Synthesized `dm` since data/sdtm/dm.sas7bdat was not found.")
}
Synthesized `dm` since data/sdtm/dm.sas7bdat was not found.
str(dm)
tibble [10 × 5] (S3: tbl_df/tbl/data.frame)
 $ STUDYID: chr [1:10] "XYZ123" "XYZ123" "XYZ123" "XYZ123" ...
 $ USUBJID: chr [1:10] "XYZ-001" "XYZ-002" "XYZ-003" "XYZ-004" ...
 $ ARM    : chr [1:10] "Placebo" "Active" "Placebo" "Active" ...
 $ AGE    : num [1:10] 55 62 47 50 71 66 45 59 53 68
 $ SEX    : chr [1:10] "M" "F" "M" "F" ...

6.1 Handling Labels & Missing Values

# Example: Convert blank strings "" to NA for character columns
convert_blanks_to_na <- function(x) {
  if (is.character(x)) x[x == ""] <- NA_character_
  x
}
dm <- dm |> mutate(across(where(is.character), convert_blanks_to_na))
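
To see what the helper does and does not change, here is a minimal sketch on made-up vectors (`sex_demo` and `age_demo` are hypothetical and not part of `dm`). The helper is repeated so the snippet runs on its own. Only exactly empty strings are converted; a whitespace-only value is left as is, and non-character input passes through untouched.

```r
# Same helper as above, repeated so this snippet is self-contained
convert_blanks_to_na <- function(x) {
  if (is.character(x)) x[x == ""] <- NA_character_
  x
}

# Hypothetical inputs: an empty string, a whitespace-only string, and an existing NA
sex_demo <- c("M", "", " ", NA)
age_demo <- c(55, 62)

sex_clean <- convert_blanks_to_na(sex_demo)
age_clean <- convert_blanks_to_na(age_demo)

print(sex_clean)          # the "" entry becomes NA; " " is kept
print(is.na(sex_clean))   # which positions are now missing
print(age_clean)          # numeric input is returned unchanged
```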

6.2 Labelled to Factor (if needed)

# Value-labelled columns created by haven have class "haven_labelled";
# is.labelled() detects them
if (is.labelled(dm$SEX)) {
  dm <- dm |> mutate(SEX = to_factor(SEX))
}
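
The synthesized `dm` has no value labels, so the conversion above does nothing for it. As a rough illustration of what it would do, the sketch below builds a hypothetical value-labelled vector (`sex_raw` is made up for this example) and converts it with `to_factor()`. It assumes that value labels are present, which for SAS data typically requires a format catalog to be read alongside the dataset.

```r
library(haven)
library(labelled)

# Hypothetical value-labelled vector: codes "M"/"F" with descriptive labels
sex_raw <- haven::labelled(
  c("M", "F", "F", "M"),
  labels = c(Male = "M", Female = "F")
)

print(class(sex_raw))        # includes "haven_labelled"
print(is.labelled(sex_raw))  # detects the labelled class

# By default to_factor() uses the value labels as factor levels
sex_f <- labelled::to_factor(sex_raw)
print(sex_f)
```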

6.3 Common Cleaning

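# Derive an age group. right = FALSE makes the intervals [a, b), so the
# bins match their labels: <50, 50 to 64, 65 and older (whole-year ages)
dm <- dm |>
  mutate(
    AGEGR1 = cut(AGE, breaks = c(-Inf, 50, 65, Inf),
                 labels = c("<50", "50-64", "65+"),
                 right = FALSE)
  )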
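
How `cut()` assigns ages that sit exactly on a break depends on its `right` argument, which defaults to `TRUE` (intervals closed on the right). The sketch below uses a made-up `ages` vector to compare both settings on the breaks used for `AGEGR1`. No labels are supplied, so the interval notation stays visible in the output.

```r
# Hypothetical ages chosen to sit on and around the breaks 50 and 65
ages <- c(49, 50, 64, 65, 66)
brks <- c(-Inf, 50, 65, Inf)

# Default: intervals are (a, b], so 50 lands in the first bin
right_closed <- cut(ages, breaks = brks)

# right = FALSE: intervals are [a, b), so 50 lands in the second bin
left_closed <- cut(ages, breaks = brks, right = FALSE)

print(data.frame(ages, right_closed, left_closed))
```

Checking a few boundary values like this before assigning labels helps keep labels such as "<50" consistent with the bins they describe.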

7 Exercises

  1. Read another SAS dataset (e.g., sv.sas7bdat) if available. If not, create a synthetic tibble.
  2. Write a function that trims leading and trailing whitespace in all character columns.
  3. Make a clean factor for ARM with levels Placebo < Active.