To demonstrate the basic functionality of fauxnaif, let’s first load
the package and an example dataset.
library(fauxnaif)
library(magrittr)
fauxnaif::faux_census
#> state gender age race income
#> 1 CA female 80 Native American 28000
#> 2 NY Woman 89 Latino 148800
#> 3 CA Female 48 White 479000
#> 4 TX Male 63 latinx 85000
#> 5 PA Male 47 asian 41900
#> 6 TX Gender is a social construct 57 Race is a social construct 9999999
#> 7 Canada Male 49 white 149000
#> 8 TX Female 50 White 98800
#> 9 NY f 557 white 90750
#> 10 WA F 33 White 45010
#> 11 TX Male 30 White 127000
#> 12 OH Non-binary 42 Caucasian 21600
#> 13 NC Female 22 African American 74200
#> 14 LA Male 2 White 61000
#> 15 LA Female 28 Black 20000
#> 16 CA male 34 Asian American 77400
#> 17 TN M 64 white 9999999
#> 18 FL Female 68 white 47100
#> 19 OH Male 39 black 23800
#> 20 NH male 73 Hispanic 33200
#> religion
#> 1 Christian
#> 2 Spiritual not religious
#> 3 Catholic
#> 4 christian
#> 5 Baptist
#> 6 Religion is the opiate of the people
#> 7 methodist
#> 8 Lutheran
#> 9 Agnostic
#> 10 Jewish
#> 11 none
#> 12 Roman Catholic
#> 13 atheist
#> 14 Christian
#> 15 Not religious
#> 16 Christian
#> 17 Nothing
#> 18 None
#> 19 baptist
#> 20 Christian
We can see the example dataset in full above. The data is a small
section of census-like information. This dataset needs a lot of
cleaning. Other tools like dplyr and tidyr would likely be needed to
really analyze this data, but we’ll focus on the aspects that can be
handled by fauxnaif.
First, let’s look at the simplest issue in this dataset: income.
faux_census$income
#> [1] 28000 148800 479000 85000 41900 9999999 149000 98800 90750
#> [10] 45010 127000 21600 74200 61000 20000 77400 9999999 47100
#> [19] 23800 33200
When we print the vector of incomes, one value stands out: while most
respondents have values in the tens to hundreds of thousands, two
respondents have incomes of 9999999. It’s common for datasets you
receive from other sources to use an unrealistically high value (often a
string of 9s) to indicate a missing value. We can replace these values
with NA using na_if_in().
na_if_in(faux_census$income, 9999999)
#> [1] 28000 148800 479000 85000 41900 NA 149000 98800 90750 45010
#> [11] 127000 21600 74200 61000 20000 77400 NA 47100 23800 33200
The new variable has NAs in the place of those strings of 9s.
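For intuition, the same replacement can be sketched in base R. This only illustrates the effect and is not meant to show how fauxnaif is implemented; the income vector is copied from the output above.

```r
# Incomes copied from the printed vector above
income <- c(28000, 148800, 479000, 85000, 41900, 9999999, 149000, 98800,
            90750, 45010, 127000, 21600, 74200, 61000, 20000, 77400,
            9999999, 47100, 23800, 33200)

# Replace every element equal to the sentinel value with NA
cleaned <- replace(income, income %in% 9999999, NA)

# Positions that are now missing
which(is.na(cleaned))
#> [1]  6 17
```

Either way, the result is a new vector. The original data is unchanged until the result is assigned back to it.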
As an alternative, we can use the magrittr pipe (%>%) to pass an input
into na_if_in():
faux_census$income %>% na_if_in(9999999)
#> [1] 28000 148800 479000 85000 41900 NA 149000 98800 90750 45010
#> [11] 127000 21600 74200 61000 20000 77400 NA 47100 23800 33200
This produces the same result.
This task could also have been completed using the na_if() function
included in the dplyr package. However, moving forward we will use more
advanced functionality of fauxnaif.
Let’s now examine the age variable:
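Inspecting the column would presumably look something like the following. The range shown is taken from the age column of the full dataset printed above.

```r
library(fauxnaif)

# Print the age column of the example dataset
faux_census$age

# The smallest and largest ages are the two suspicious values
range(faux_census$age)
#> [1]   2 557
```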
In this case, we see two improbable values: 557 and 2 (assuming this
is a survey of adults). Using dplyr, this would have to be addressed in
two steps:
faux_census$age %>% dplyr::na_if(557) %>% dplyr::na_if(2)
#> [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
But using fauxnaif we can simplify this to a single step:
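Since na_if_in() accepts several values to replace in one call, the single step would presumably look something like this sketch (the name cleaned_age is only for illustration):

```r
library(fauxnaif)

# Replace both improbable ages in one call
cleaned_age <- na_if_in(faux_census$age, 557, 2)

cleaned_age
#>  [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
```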
In the above example, we were able to examine our dataset and select
the values that were unrealistic. In real-life analyses, we often can’t
look at each observation one by one to find unrealistic values, but we
often do know the range of realistic values. Using na_if_not(), we can
specify which values are realistic and discard those that are not.
Returning to the age variable, let’s replace values with NA if they are
not between 18 (the minimum age we expect to enter the survey) and 122
(the world record for the oldest person).
faux_census$age %>% na_if_not(18:122)
#> [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
This has the same effect as specifying the unrealistic values directly, but no longer requires you to directly examine each observation.
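Another way to approach this problem is to use a formula to specify
the range of acceptable values. This is particularly useful when dealing
with non-integer values, where the colon operator (:) will not work.
18:122 contains only whole numbers, so a value such as 18.5 does not
match any of its elements even though it lies within the range.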
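A small base R illustration of this limitation (the value 18.5 is made up for the example):

```r
# The colon operator generates whole numbers only, so 18.5 is not a member
18.5 %in% 18:122
#> [1] FALSE

# A relational test covers every value in the interval
18.5 >= 18 & 18.5 <= 122
#> [1] TRUE
```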
Formulas in fauxnaif are based on the formula syntax used in rlang and
purrr. They are introduced with a tilde (~) and indicate each
observation with a dot (.).
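To see what such a formula means, it can be converted into an ordinary function with rlang's as_function(). This sketches the general rlang mechanism; that fauxnaif uses exactly this call internally is an assumption, and the name too_young is made up for the example.

```r
library(rlang)

# Turn the formula into a function; `.` stands for the input vector
too_young <- as_function(~ . < 18)

# The function returns one logical value per observation
too_young(c(2, 30, 557))
#> [1]  TRUE FALSE FALSE
```

Observations for which the formula evaluates to TRUE are the ones na_if_in() would replace with NA.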
To clean the age variable, we can use two formulas. One will replace values less than 18 and another will replace values greater than 122:
faux_census$age %>% na_if_in(~ . < 18, ~ . > 122)
#> [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
Or we can use the between() function from dplyr:
library(dplyr)
faux_census$age %>% na_if_in(~ !between(., 18, 122))
#> [1] 80 89 48 63 47 57 49 50 NA 33 30 42 22 NA 28 34 64 68 39 73
Formulas are not only useful when dealing with numeric variables. While it’s straightforward to use relational operators to specify replacements in numeric variables, we can also use more complex formulas to handle other data types.
Let’s take a look at the religion variable:
faux_census$religion
#> [1] "Christian"
#> [2] "Spiritual not religious"
#> [3] "Catholic"
#> [4] "christian"
#> [5] "Baptist"
#> [6] "Religion is the opiate of the people"
#> [7] "methodist"
#> [8] "Lutheran"
#> [9] "Agnostic"
#> [10] "Jewish"
#> [11] "none"
#> [12] "Roman Catholic"
#> [13] "atheist"
#> [14] "Christian"
#> [15] "Not religious"
#> [16] "Christian"
#> [17] "Nothing"
#> [18] "None"
#> [19] "baptist"
#> [20] "Christian"
While there are a few things we might want to clean in this variable, one clear issue is the respondent who did not answer the question but instead used the space to give an opinion: “Religion is the opiate of the people”.
We could use the most basic form of na_if_in() to simply remove this
answer:
faux_census$religion %>% na_if_in("Religion is the opiate of the people")
#> [1] "Christian" "Spiritual not religious"
#> [3] "Catholic" "christian"
#> [5] "Baptist" NA
#> [7] "methodist" "Lutheran"
#> [9] "Agnostic" "Jewish"
#> [11] "none" "Roman Catholic"
#> [13] "atheist" "Christian"
#> [15] "Not religious" "Christian"
#> [17] "Nothing" "None"
#> [19] "baptist" "Christian"
But in a larger analysis, we may prefer to have a simple rule for
excluding answers. Perhaps we decide that answers longer than 25
characters are unlikely to be genuine. In that case, we can use a
formula operating on the number of characters (nchar(.)) in a response:
faux_census$religion %>% na_if_in(~ nchar(.) > 25)
#> [1] "Christian" "Spiritual not religious"
#> [3] "Catholic" "christian"
#> [5] "Baptist" NA
#> [7] "methodist" "Lutheran"
#> [9] "Agnostic" "Jewish"
#> [11] "none" "Roman Catholic"
#> [13] "atheist" "Christian"
#> [15] "Not religious" "Christian"
#> [17] "Nothing" "None"
#> [19] "baptist" "Christian"
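To check that a cutoff of 25 characters separates the opinion from the genuine answers, we can count characters directly. This base R sketch uses three responses copied from the religion column; the name answers is only for illustration.

```r
# Three responses copied from the religion column
answers <- c("Spiritual not religious",
             "Roman Catholic",
             "Religion is the opiate of the people")

# Number of characters in each response
nchar(answers)
#> [1] 23 14 36

# Only the last response exceeds the cutoff
nchar(answers) > 25
#> [1] FALSE FALSE  TRUE
```

The longest genuine answer, "Spiritual not religious", has 23 characters, so the threshold leaves little margin and would need revisiting for other data.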
Often in data analysis, we prefer to work within a single data frame
rather than operate on individual vectors. fauxnaif is built to handle
this use case.
A simple solution is to use na_if_in() or na_if_not() within dplyr’s
mutate() function.
library(dplyr)
faux_census %>% mutate(income = na_if_in(income, 9999999))
#> # A tibble: 20 × 6
#> state gender age race income religion
#> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 CA female 80 Native American 28000 Christi…
#> 2 NY Woman 89 Latino 148800 Spiritu…
#> 3 CA Female 48 White 479000 Catholic
#> 4 TX Male 63 latinx 85000 christi…
#> 5 PA Male 47 asian 41900 Baptist
#> 6 TX Gender is a social construct 57 Race is a social c… NA Religio…
#> 7 Canada Male 49 white 149000 methodi…
#> 8 TX Female 50 White 98800 Lutheran
#> 9 NY f 557 white 90750 Agnostic
#> 10 WA F 33 White 45010 Jewish
#> 11 TX Male 30 White 127000 none
#> 12 OH Non-binary 42 Caucasian 21600 Roman C…
#> 13 NC Female 22 African American 74200 atheist
#> 14 LA Male 2 White 61000 Christi…
#> 15 LA Female 28 Black 20000 Not rel…
#> 16 CA male 34 Asian American 77400 Christi…
#> 17 TN M 64 white NA Nothing
#> 18 FL Female 68 white 47100 None
#> 19 OH Male 39 black 23800 baptist
#> 20 NH male 73 Hispanic 33200 Christi…
Sometimes, the same replacement function can be used in multiple
columns. Here, the respondent who didn’t give a real answer to the
religion question seemed to do the same with the gender and race
questions. You can specify multiple columns using dplyr’s across() if
you would like to make replacements based on the same criteria:
faux_census %>%
  mutate(across(c(religion, gender, race), \(x) na_if_in(x, ~ nchar(.) > 25)))
#> # A tibble: 20 × 6
#> state gender age race income religion
#> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 CA female 80 Native American 28000 Christian
#> 2 NY Woman 89 Latino 148800 Spiritual not religious
#> 3 CA Female 48 White 479000 Catholic
#> 4 TX Male 63 latinx 85000 christian
#> 5 PA Male 47 asian 41900 Baptist
#> 6 TX <NA> 57 <NA> 9999999 <NA>
#> 7 Canada Male 49 white 149000 methodist
#> 8 TX Female 50 White 98800 Lutheran
#> 9 NY f 557 white 90750 Agnostic
#> 10 WA F 33 White 45010 Jewish
#> 11 TX Male 30 White 127000 none
#> 12 OH Non-binary 42 Caucasian 21600 Roman Catholic
#> 13 NC Female 22 African American 74200 atheist
#> 14 LA Male 2 White 61000 Christian
#> 15 LA Female 28 Black 20000 Not religious
#> 16 CA male 34 Asian American 77400 Christian
#> 17 TN M 64 white 9999999 Nothing
#> 18 FL Female 68 white 47100 None
#> 19 OH Male 39 black 23800 baptist
#> 20 NH male 73 Hispanic 33200 Christian
Rather than specifying columns manually, we can also select columns
using a predicate function with dplyr’s where().
For example, we may want to remove strings of 9s in any numeric column:
faux_census %>%
  mutate(across(where(is.numeric), \(x) na_if_in(x, ~ grepl("999", .))))
#> # A tibble: 20 × 6
#> state gender age race income religion
#> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 CA female 80 Native American 28000 Christi…
#> 2 NY Woman 89 Latino 148800 Spiritu…
#> 3 CA Female 48 White 479000 Catholic
#> 4 TX Male 63 latinx 85000 christi…
#> 5 PA Male 47 asian 41900 Baptist
#> 6 TX Gender is a social construct 57 Race is a social c… NA Religio…
#> 7 Canada Male 49 white 149000 methodi…
#> 8 TX Female 50 White 98800 Lutheran
#> 9 NY f 557 white 90750 Agnostic
#> 10 WA F 33 White 45010 Jewish
#> 11 TX Male 30 White 127000 none
#> 12 OH Non-binary 42 Caucasian 21600 Roman C…
#> 13 NC Female 22 African American 74200 atheist
#> 14 LA Male 2 White 61000 Christi…
#> 15 LA Female 28 Black 20000 Not rel…
#> 16 CA male 34 Asian American 77400 Christi…
#> 17 TN M 64 white NA Nothing
#> 18 FL Female 68 white 47100 None
#> 19 OH Male 39 black 23800 baptist
#> 20 NH male 73 Hispanic 33200 Christi…
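One caveat about the pattern: grepl() converts numbers to character and matches "999" anywhere in the value, so a legitimate number that happens to contain three 9s would also be replaced. The sketch below shows this with a made-up value (39990) and a stricter pattern; both are assumptions for illustration and do not appear in the dataset.

```r
# 39990 is a made-up legitimate income used only for this illustration
incomes <- c(28000, 9999999, 39990)

# grepl() coerces the numbers to character before matching
grepl("999", incomes)
#> [1] FALSE  TRUE  TRUE

# A stricter pattern that matches only values made up entirely of 9s
grepl("^9+$", incomes)
#> [1] FALSE  TRUE FALSE
```

For the example dataset both patterns flag the same two incomes, so the simpler one is enough here.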
While the replacement of long responses was intended for three specific
columns, no variable contains a legitimate answer longer than 25
characters. In this case, rather than specifying the variables of
interest, we can simply use dplyr’s everything() to make the replacement
in all columns:
faux_census %>%
  mutate(across(everything(), \(x) na_if_in(x, ~ nchar(.) > 25)))
#> # A tibble: 20 × 6
#> state gender age race income religion
#> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 CA female 80 Native American 28000 Christian
#> 2 NY Woman 89 Latino 148800 Spiritual not religious
#> 3 CA Female 48 White 479000 Catholic
#> 4 TX Male 63 latinx 85000 christian
#> 5 PA Male 47 asian 41900 Baptist
#> 6 TX <NA> 57 <NA> 9999999 <NA>
#> 7 Canada Male 49 white 149000 methodist
#> 8 TX Female 50 White 98800 Lutheran
#> 9 NY f 557 white 90750 Agnostic
#> 10 WA F 33 White 45010 Jewish
#> 11 TX Male 30 White 127000 none
#> 12 OH Non-binary 42 Caucasian 21600 Roman Catholic
#> 13 NC Female 22 African American 74200 atheist
#> 14 LA Male 2 White 61000 Christian
#> 15 LA Female 28 Black 20000 Not religious
#> 16 CA male 34 Asian American 77400 Christian
#> 17 TN M 64 white 9999999 Nothing
#> 18 FL Female 68 white 47100 None
#> 19 OH Male 39 black 23800 baptist
#> 20 NH male 73 Hispanic 33200 Christian
In a data analysis pipeline, we can combine several steps to produce a usable dataset. Combining our interval check for age, our check for strings of 9s in numeric variables, and our check for long responses in all variables, we can produce much cleaner data:
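faux_census %>%
  mutate(
    # Keep only ages within the plausible range
    age = na_if_not(age, 18:122),
    # Replace strings of 9s in numeric columns
    across(where(is.numeric), \(x) na_if_in(x, ~ grepl("999", .))),
    # Replace overly long responses in all columns
    across(everything(), \(x) na_if_in(x, ~ nchar(.) > 25))
  )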
#> # A tibble: 20 × 6
#> state gender age race income religion
#> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 CA female 80 Native American 28000 Christian
#> 2 NY Woman 89 Latino 148800 Spiritual not religious
#> 3 CA Female 48 White 479000 Catholic
#> 4 TX Male 63 latinx 85000 christian
#> 5 PA Male 47 asian 41900 Baptist
#> 6 TX <NA> 57 <NA> NA <NA>
#> 7 Canada Male 49 white 149000 methodist
#> 8 TX Female 50 White 98800 Lutheran
#> 9 NY f NA white 90750 Agnostic
#> 10 WA F 33 White 45010 Jewish
#> 11 TX Male 30 White 127000 none
#> 12 OH Non-binary 42 Caucasian 21600 Roman Catholic
#> 13 NC Female 22 African American 74200 atheist
#> 14 LA Male NA White 61000 Christian
#> 15 LA Female 28 Black 20000 Not religious
#> 16 CA male 34 Asian American 77400 Christian
#> 17 TN M 64 white NA Nothing
#> 18 FL Female 68 white 47100 None
#> 19 OH Male 39 black 23800 baptist
#> 20 NH male 73 Hispanic 33200 Christian
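As a quick check on a pipeline like this, we can count how many values were replaced in each column. This is a sketch rather than part of the package's workflow: the name cleaned is made up, and the anonymous-function form of across() assumes R 4.1 or later.

```r
library(dplyr)
library(fauxnaif)

# Store the cleaned data instead of only printing it
cleaned <- faux_census %>%
  mutate(
    age = na_if_not(age, 18:122),
    across(where(is.numeric), \(x) na_if_in(x, ~ grepl("999", .))),
    across(everything(), \(x) na_if_in(x, ~ nchar(.) > 25))
  )

# Count the values replaced with NA in each column
colSums(is.na(cleaned))
```

Going by the table above, this should report one NA each in gender, race, and religion, two each in age and income, and none in state.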