A variable label is a human-readable description of a variable. R supports rather long variable names, and these names can even contain spaces and punctuation, but short variable names make coding easier. A variable label can give a nice, long description of the variable, which makes it easier to remember what each variable name refers to.
Value labels are similar to variable labels, but they describe the values a variable can take. Labeling values means we don’t have to remember whether 1=Extremely poor and 7=Excellent or vice versa. We can easily get a dataset description and a summary of its variables with the info function.
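As a minimal sketch of that last point, info can be called directly on a data.frame. Here it is applied to the built-in mtcars dataset; the variable name dataset_info is chosen only for this illustration.

```r
library(expss)

data(mtcars)

# description and summary statistics for the variables of mtcars
dataset_info = info(mtcars)
print(dataset_info)
```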
The usual way to connect numeric data to labels in R is factor variables. However, factors lack important features that value labels provide. Factors only allow integers to be mapped to text labels, these integers have to be consecutive counts starting at 1, and every value needs to be labelled. Also, we can’t calculate means or other numeric statistics on factors.
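The limitations of factors can be seen in a few lines of base R. This is only an illustrative sketch; the codes and label texts are made up for the example.

```r
# numeric codes from a survey question
codes = c(-1, 0, 1, 1, 0, 1, 1, -1)

# a factor keeps the text labels but replaces the codes with 1, 2, 3
f = factor(codes,
           levels = c(-1, 0, 1),
           labels = c("Detractors", "Neutralists", "Promoters"))
factor_codes = as.integer(f)

# the original codes can be averaged, the factor cannot
mean_of_codes = mean(codes)
# mean() on a factor returns NA with a warning, which is suppressed here
mean_of_factor = suppressWarnings(mean(f))
```

The original codes -1, 0 and 1 are no longer recoverable from as.integer(f), and the mean is available only for the plain numeric vector.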
With labels we can manipulate short variable names and codes when we analyze our data but in the resulting tables and graphs we will see human-readable text.
It is easy to store labels as variable attributes in R, but most R functions cannot use them or even drop them. The expss package integrates value labels support into base R functions and into functions from other packages. Every function that internally converts a variable to a factor will utilize labels. Labels are preserved during variable subsetting and concatenation. Additionally, there is a function (use_labels) which greatly simplifies variable label usage.
See examples below.
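To illustrate why plain attributes are fragile, the following base R sketch (deliberately without expss loaded) stores a description in an attribute and then subsets the vector. The attribute name and its text are arbitrary choices for the example.

```r
x = c(10, 20, 30)
attr(x, "label") = "Some measurement"

label_before = attr(x, "label")
# base subsetting of a plain vector keeps the values but not the attribute
label_after = attr(x[1:2], "label")
```

After subsetting, label_after is NULL, so the description would have to be restored by hand.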
First, apply value and variable labels to the dataset:
library(expss)
data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      cyl = "Number of cylinders",
                      disp = "Displacement (cu.in.)",
                      hp = "Gross horsepower",
                      drat = "Rear axle ratio",
                      wt = "Weight (1000 lbs)",
                      qsec = "1/4 mile time",
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual" = 1),
                      gear = "Number of forward gears",
                      carb = "Number of carburetors"
)
In addition to apply_labels we have SPSS-style var_lab and val_lab functions:
nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
-1 Detractors
0 Neutralists
1 Promoters
")
We can read, add or remove existing labels:
## [1] "Net promoter score"
## Detractors Neutralists Promoters
## -1 0 1
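The two results above have the shape of what the getter forms of var_lab and val_lab return. A minimal, self-contained sketch follows; the names nps_var_label and nps_val_labels are introduced only for this illustration.

```r
library(expss)

nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
    -1 Detractors
    0 Neutralists
    1 Promoters
")

# getter forms: the same function names, used without assignment
nps_var_label = var_lab(nps)
nps_val_labels = val_lab(nps)

print(nps_var_label)
print(nps_val_labels)
```

The value labels come back as a named numeric vector, with the label texts as names and the codes as values.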
# add new labels
add_val_lab(nps) = num_lab("
98 Other
99 Hard to say
")
# remove label by value
# %d% - diff, %n_d% - names diff
val_lab(nps) = val_lab(nps) %d% 98
# or, remove value by name
val_lab(nps) = val_lab(nps) %n_d% "Other"
Additionally, there are some utility functions. They can be applied to a single variable as well as to an entire dataset.
## LABEL: Net promoter score
## VALUES:
## -1, 0, 1, 1, 0, 1, 1, -1
## VALUES:
## -1, 0, 1, 1, 0, 1, 1, -1
## VALUE LABELS:
## -1 Detractors
## 0 Neutralists
## 1 Promoters
## 99 Hard to say
## [1] -1 0 1 1 0 1 1 -1
## LABEL: Net promoter score
## VALUES:
## -1, 0, 1, 1, 0, 1, 1, -1
## VALUE LABELS:
## -1 Detractors
## 0 Neutralists
## 1 Promoters
## LABEL: Net promoter score
## VALUES:
## -1, 0, 1, 1, 0, 1, 1, -1
## VALUE LABELS:
## -1 -1 Detractors
## 0 0 Neutralists
## 1 1 Promoters
## 99 99 Hard to say
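The five listings above appear to correspond, in order, to the expss utility functions drop_val_labs, drop_var_labs, unlab, drop_unused_labels and prepend_values. The sketch below applies each of them to the same vector; the result names are chosen for this illustration.

```r
library(expss)

nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
    -1 Detractors
    0 Neutralists
    1 Promoters
    99 Hard to say
")

no_val_labs = drop_val_labs(nps)       # keeps only the variable label
no_var_lab  = drop_var_labs(nps)       # keeps only the value labels
plain       = unlab(nps)               # removes both kinds of labels
used_only   = drop_unused_labels(nps)  # drops "Hard to say": 99 never occurs
with_codes  = prepend_values(nps)      # puts each code in front of its label

print(with_codes)
```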
There is also a prepend_names function, but it can be applied only to a data.frame.
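A small sketch of prepend_names on a labelled data.frame. Only two variables are labelled here to keep the example short, and the exact format of the resulting label is not asserted.

```r
library(expss)

data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      wt = "Weight (1000 lbs)")

# variable names are put in front of the variable labels
mtcars_named = prepend_names(mtcars)
print(var_lab(mtcars_named$mpg))
```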
Base table and plotting with value labels:
## vs
## am V-engine Straight engine
## Automatic 12 7
## Manual 6 7
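A cross-tabulation like the one above can be obtained with base table on the labelled variables. The sketch below labels only vs and am so that it is self-contained; the barplot call is one possible way to plot the same table.

```r
library(expss)

data(mtcars)
mtcars = apply_labels(mtcars,
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual" = 1))

# base table picks up the value labels as row and column names
am_by_vs = with(mtcars, table(am, vs))
print(am_by_vs)

# grouped bars, with the row labels used in the legend
barplot(am_by_vs, beside = TRUE, legend.text = TRUE)
```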
There is a special function for variable label support: use_labels. For now, variable label support is available only for expressions that are evaluated inside a data.frame.
## Engine
## Transmission V-engine Straight engine
## Automatic 12 7
## Manual 6 7
##
## Call:
## lm(formula = `Miles/(US) gallon` ~ `Weight (1000 lbs)` + `Gross horsepower` +
## `1/4 mile time`)
##
## Residuals:
## LABEL: Miles/(US) gallon
## VALUES:
## -3.8591, -1.6418, -0.4636, 1.194, 5.6092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.61053 8.41993 3.279 0.00278 **
## `Weight (1000 lbs)` -4.35880 0.75270 -5.791 3.22e-06 ***
## `Gross horsepower` -0.01782 0.01498 -1.190 0.24418
## `1/4 mile time` 0.51083 0.43922 1.163 0.25463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.578 on 28 degrees of freedom
## Multiple R-squared: 0.8348, Adjusted R-squared: 0.8171
## F-statistic: 47.15 on 3 and 28 DF, p-value: 4.506e-11
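The table and the regression summary above show variable labels in place of variable names. A sketch of calls that produce results of this kind with use_labels follows; the labels repeat those applied earlier, and the names labelled_table and fit are chosen for the example.

```r
library(expss)

data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      hp = "Gross horsepower",
                      wt = "Weight (1000 lbs)",
                      qsec = "1/4 mile time",
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual" = 1))

# table with variable labels as dimension names
labelled_table = use_labels(mtcars, table(am, vs))
print(labelled_table)

# linear regression: the coefficients are named by variable labels
fit = use_labels(mtcars, lm(mpg ~ wt + hp + qsec))
print(summary(fit))
```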
And, finally, ggplot2 graphics with variable and value labels. Note that with ggplot2 version 3.2.0 and higher you need to explicitly convert labelled variables to factors in the facet_grid formula:
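A sketch of such a plot, assuming that the '..data' shortcut inside use_labels refers to the whole data.frame, as described in the expss documentation. The choice of aesthetics is only an illustration.

```r
library(expss)
library(ggplot2, warn.conflicts = FALSE)

data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      wt = "Weight (1000 lbs)",
                      qsec = "1/4 mile time",
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual" = 1))

p = use_labels(mtcars, {
    # '..data' stands for the whole data.frame inside the expression
    ggplot(..data) +
        geom_point(aes(y = mpg, x = wt, color = qsec)) +
        # labelled variables are converted to factors explicitly
        facet_grid(factor(am) ~ factor(vs))
})
```

Calling print(p) draws the panels, with axis titles taken from the variable labels.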
We have an option for extreme value labels support: expss_enable_value_labels_support_extreme(). With this option, factor/as.factor will take empty levels into account. However, unique will give a weird result for labelled variables: labels without values will be added to the unique values. That’s why it is recommended to turn this option off immediately after use. See examples.
We have the label ‘Hard to say’ for which there are no values in nps:
nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
-1 Detractors
0 Neutralists
1 Promoters
99 Hard to say
")
Here we disable labels support and get results without labels:
## nps
## -1 0 1
## 2 2 4
## [1] -1 0 1
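The two results above are consistent with calling table and unique after expss_disable_value_labels_support(). A self-contained sketch follows; the result names are chosen for the example, and the default mode is restored at the end.

```r
library(expss)

nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
    -1 Detractors
    0 Neutralists
    1 Promoters
    99 Hard to say
")

# switch value labels support off: results use the raw codes
expss_disable_value_labels_support()
plain_table = table(nps)
plain_unique = unique(nps)
print(plain_table)
print(plain_unique)

# switch the default support back on so that later code is unaffected
expss_enable_value_labels_support()
```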
Results with default value labels support - three labels are here but “Hard to say” is absent.
expss_enable_value_labels_support()
# table with labels, but there is no "Hard to say" label
table(nps)
## nps
## Detractors Neutralists Promoters
## 2 2 4
## LABEL: Net promoter score
## VALUES:
## -1, 0, 1
## VALUE LABELS:
## -1 Detractors
## 0 Neutralists
## 1 Promoters
## 99 Hard to say
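The second listing above has the shape of a unique(nps) result under the default support: three distinct values, with all four value labels still attached. A sketch that reproduces both results, with result names chosen for the example:

```r
library(expss)

nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
    -1 Detractors
    0 Neutralists
    1 Promoters
    99 Hard to say
")

expss_enable_value_labels_support()

# labels replace the codes, but the unused label does not appear
default_table = table(nps)
# three distinct values; the value labels stay attached to the result
default_unique = unique(nps)

print(default_table)
print(default_unique)
```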
And now extreme value labels support: we see “Hard to say” with zero counts. Note the weird unique result.
## nps
## Detractors Neutralists Promoters Hard to say
## 2 2 4 0
## LABEL: Net promoter score
## VALUES:
## -1, 0, 1, 99
## VALUE LABELS:
## -1 Detractors
## 0 Neutralists
## 1 Promoters
## 99 Hard to say
Return immediately to defaults to avoid issues:
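One way to make sure the extreme mode never stays on by accident is to wrap it in a small helper. The function with_extreme_labels below is hypothetical (it is not part of expss), and it assumes that the default support mode is the one to return to.

```r
library(expss)

nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
    -1 Detractors
    0 Neutralists
    1 Promoters
    99 Hard to say
")

# hypothetical helper: evaluate 'expr' in extreme mode, then restore the default
with_extreme_labels = function(expr) {
    expss_enable_value_labels_support_extreme()
    # assumption: the default mode was active before the call
    on.exit(expss_enable_value_labels_support(), add = TRUE)
    force(expr)
}

extreme_table = with_extreme_labels(table(nps))    # includes "Hard to say" with a zero count
extreme_unique = with_extreme_labels(unique(nps))  # also contains the unused code 99
default_unique = unique(nps)                       # back to the default behaviour

print(extreme_table)
```

Because the restore call sits in on.exit, the default mode comes back even if the wrapped expression fails.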
There are special methods for subsetting and concatenating labelled variables. These methods preserve labels during common operations, so we don’t need to restore labels on a subsetted or sorted data.frame.
mtcars with labels:
## 'data.frame': 32 obs. of 11 variables:
## $ mpg :Class 'labelled' num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## .. .. LABEL: Miles/(US) gallon
## $ cyl :Class 'labelled' num 6 6 4 6 8 6 8 4 4 6 ...
## .. .. LABEL: Number of cylinders
## $ disp:Class 'labelled' num 160 160 108 258 360 ...
## .. .. LABEL: Displacement (cu.in.)
## $ hp :Class 'labelled' num 110 110 93 110 175 105 245 62 95 123 ...
## .. .. LABEL: Gross horsepower
## $ drat:Class 'labelled' num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## .. .. LABEL: Rear axle ratio
## $ wt :Class 'labelled' num 2.62 2.88 2.32 3.21 3.44 ...
## .. .. LABEL: Weight (1000 lbs)
## $ qsec:Class 'labelled' num 16.5 17 18.6 19.4 17 ...
## .. .. LABEL: 1/4 mile time
## $ vs :Class 'labelled' num 0 0 1 1 0 1 0 1 1 1 ...
## .. .. LABEL: Engine
## .. .. VALUE LABELS [1:2]: 0=V-engine, 1=Straight engine
## $ am :Class 'labelled' num 1 1 1 0 0 0 0 0 0 0 ...
## .. .. LABEL: Transmission
## .. .. VALUE LABELS [1:2]: 0=Automatic, 1=Manual
## $ gear:Class 'labelled' num 4 4 4 3 3 3 3 4 4 4 ...
## .. .. LABEL: Number of forward gears
## $ carb:Class 'labelled' num 4 4 1 1 2 1 4 2 2 4 ...
## .. .. LABEL: Number of carburetors
Make a subset of the data.frame:
Labels are here, nothing is lost:
## 'data.frame': 10 obs. of 11 variables:
## $ mpg :Class 'labelled' num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
## .. .. LABEL: Miles/(US) gallon
## $ cyl :Class 'labelled' num 6 6 4 6 8 6 8 4 4 6
## .. .. LABEL: Number of cylinders
## $ disp:Class 'labelled' num 160 160 108 258 360 ...
## .. .. LABEL: Displacement (cu.in.)
## $ hp :Class 'labelled' num 110 110 93 110 175 105 245 62 95 123
## .. .. LABEL: Gross horsepower
## $ drat:Class 'labelled' num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92
## .. .. LABEL: Rear axle ratio
## $ wt :Class 'labelled' num 2.62 2.88 2.32 3.21 3.44 ...
## .. .. LABEL: Weight (1000 lbs)
## $ qsec:Class 'labelled' num 16.5 17 18.6 19.4 17 ...
## .. .. LABEL: 1/4 mile time
## $ vs :Class 'labelled' num 0 0 1 1 0 1 0 1 1 1
## .. .. LABEL: Engine
## .. .. VALUE LABELS [1:2]: 0=V-engine, 1=Straight engine
## $ am :Class 'labelled' num 1 1 1 0 0 0 0 0 0 0
## .. .. LABEL: Transmission
## .. .. VALUE LABELS [1:2]: 0=Automatic, 1=Manual
## $ gear:Class 'labelled' num 4 4 4 3 3 3 3 4 4 4
## .. .. LABEL: Number of forward gears
## $ carb:Class 'labelled' num 4 4 1 1 2 1 4 2 2 4
## .. .. LABEL: Number of carburetors
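The listing above has 10 observations, which is consistent with taking the first ten rows. The sketch below assumes exactly that, labels only two variables to stay short, and also shows concatenation with c.

```r
library(expss)

data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual" = 1))

# assumption: the subset is the first ten rows
mtcars_subset = mtcars[1:10, ]
str(mtcars_subset)

# concatenating labelled vectors keeps the labels as well
am_twice = c(mtcars$am, mtcars$am)
```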
To use expss with haven, you need to load expss strictly after haven (or any other package that implements a ‘labelled’ class) to avoid conflicts. It is also better to use read_spss with an explicit package specification: haven::read_spss. See example below. The haven package doesn’t set the ‘labelled’ class for variables which have a variable label but don’t have value labels. This leads to labels being lost during subsetting and other operations. We have a special function to fix this: add_labelled_class. Apply it to a dataset loaded by haven.
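A sketch of this workflow. To stay self-contained it first writes a tiny SPSS file to a temporary location with haven::write_sav; the file, the variable name score and its label are hypothetical stand-ins for a real dataset.

```r
# load order matters: haven first, expss second
library(haven)
library(expss)

# hypothetical example file, created only for this sketch
sav_file = tempfile(fileext = ".sav")
example_data = data.frame(score = c(1, 2, 3))
attr(example_data$score, "label") = "Test score"
haven::write_sav(example_data, sav_file)

# explicit package specification for the reader
spss_data = haven::read_spss(sav_file)

# add the missing 'labelled' class to variables that have labels
spss_data = add_labelled_class(spss_data)
```

After add_labelled_class, a variable that has only a variable label should carry the ‘labelled’ class, so the subsetting methods described earlier apply to it.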