--- title: "fauxnaif" author: "Alexander Rossell Hayes" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: df_print: "tibble" vignette: > %\VignetteIndexEntry{fauxnaif} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( message = FALSE, collapse = TRUE, comment = "#>" ) ``` # Getting started To demonstrate the basic functionality of `fauxnaif`, let's first load the package and an example dataset. ```{r} library(fauxnaif) library(magrittr) fauxnaif::faux_census ``` We can see the example dataset in full above. The data is a small section of census-like information. This dataset needs a lot of cleaning. Other tools like `dplyr` and `tidyr` would likely be needed to really analyze this data, but we'll focus on the aspects that can be handled by `fauxnaif`. # The most basic case First, let's look at the simplest issue in this dataset: income. ```{r} faux_census$income ``` Printing the vector of incomes, one value stands out: while most respondents' have values in the tens to hundreds of thousands, two respondents have incomes of 9999999. It's common for datasets you receive from other sources to use an unrealistically high value (often a string of 9s) to indicate `NA`. We can clean this using `na_if_in()`. ```{r} na_if_in(faux_census$income, 9999999) ``` The new variable has `NA`s in the place of those strings of 9s. As an alternative, we can use the `magrittr` pipe (`%>%`) to pass an input into `na_if_in()`: ```{r} faux_census$income %>% na_if_in(9999999) ``` This produces the same result. This task could have been completed using the version of `na_if_in()` included in the `dplyr` package. However, moving forward we will use more advanced functionality of `fauxnaif`. # Replacing multiple values Let's now examine the age variable: ```{r} faux_census$age ``` In this case, we see two improbable values: 557 and 2 (assuming this is a survey of adults). Using `dplyr`, this would have to be addressed using two steps: ```{r} faux_census$age %>% dplyr::na_if(557) %>% dplyr::na_if(2) ``` But using `fauxnaif` we can simplify this to a single step: ```{r} faux_census$age %>% na_if_in(557, 2) ``` # Specifying values to keep rather than values to discard In the above example, we were able to examine our dataset and select the values that were unrealistic. In real-life analyses, we often can't look at each observation one by one to find unrealistic values, but we often do know the range of realistic values. Using `na_if_not()`, we can specify which values are realistic and discard those that are not. Returning to the age variable, let's replace values with `NA` if they are *not* between 18 (the minimum age we expect to enter the survey) and 122 (the world record for the oldest person). ```{r} faux_census$age %>% na_if_not(18:122) ``` This has the same effect as specifying the unrealistic values directly, but no longer requires you to directly examine each observation. # Replacing values using formulas Another way to approach this problem is to use a formula to specify the range of acceptable values. This is particularly useful when dealing with non-integer values, where the colon operator (`:`) will not work: ```{r} 23 %in% 18:122 ``` but ```{r} 23.5 %in% 18:122 ``` Formulas in `fauxnaif` are based on the formula syntax used in `rlang` and `purrr`. They are introduced with a tilde (`~`) and indicate each observation with a dot (`.`). To clean the age variable, we can use two formulas. One will replace values less than 18 and another will replace values greater than 122: ```{r} faux_census$age %>% na_if_in(~ . < 18, ~ . > 122) ``` Or we can use the `between()` function from `dplyr`: ```{r} library(dplyr) faux_census$age %>% na_if_in(~ !between(., 18, 122)) ``` ## Using formulas for non-numeric variables Formulas are not only useful when dealing with numeric variables. While it's straightforward to use relational operators to specify replacements in numeric variables, we can also use more complex formulas to handle other data types. Let's take a look at the religion variable: ```{r} faux_census$religion ``` While there are a few things we might want to clean in this variable, one clear issue is the respondent who did not answer the question but instead used the space to give an opinion: "Religion is the opiate of the people". We could use the most basic form of `na_if_in()` to simply remove this answer: ```{r} faux_census$religion %>% na_if_in("Religion is the opiate of the people") ``` But in a larger analysis, we may prefer to have a simple rule for excluding answers. Perhaps we decide that answers longer than 25 characters are unlikely to be genuine. In that case, we can use a formula operating on the number of characters (`nchar(.)`) in a response: ```{r} faux_census$religion %>% na_if_in(~ nchar(.) > 25) ``` # Replacing values in data frames Often in data analysis, we prefer to work within a single data frame than operating on individual vectors. `fauxnaif` is built to handle this use case. A simple solution is to use `na_if_in()` or `na_if_not()` within `dplyr`'s `mutate()` function. ```{r} library(dplyr) faux_census %>% mutate(income = na_if_in(income, 9999999)) ``` # Replacing values in multiple columns Sometimes, the same replacement function can be used in multiple columns. Here, the respondent who didn't give a real answer to the religion question seemed to do the same with the gender and race questions. You can specify multiple columns using `dplyr`'s `across()` is you would like to make replacements based on the same criteria: ```{r} faux_census %>% mutate(across(c(religion, gender, race), na_if_in, ~ nchar(.) > 25)) ``` ## Replacing values using a predicate function Rather than specifying columns manually, we can also select columns using a predicate function with `dplyr`'s `where()`. For example, we may want to remove strings of 9s in any numeric column: ```{r} faux_census %>% mutate(across(where(is.numeric), na_if_in, ~ grepl("999", .))) ``` ## Replacing values in all columns While this replacement was intended for three specific columns, no variable contains a legitimate answer longer than 25 characters. In this case, rather than specifying the variable of interest, we can simply use `dplyr`'s `everything()` to make the replacement in all columns: ```{r} faux_census %>% mutate(across(everything(), na_if_in, ~ nchar(.) > 25)) ``` # Putting it all together In a data analysis pipeline, we can combine several steps to produce a usable dataset. Combining our interval check for age, our check for strings of 9s in numeric variables, and our check for long responses in character variables, we can yield much cleaner data: ```{r} faux_census %>% mutate( age = na_if_not(age, 18:122), across(where(is.numeric), na_if_in, ~ grepl("999", .)), across(everything(), na_if_in, ~ nchar(.) > 25) ) ```