Title: | Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics |
---|---|
Description: | Package computes and displays tables with support for 'SPSS'-style labels, multiple and nested banners, weights, multiple-response variables and significance testing. There are facilities for nice output of tables in 'knitr', 'Shiny', '*.xlsx' files, R and 'Jupyter' notebooks. Methods for labelled variables add value labels support to base R functions and to some functions from other packages. Additionally, the package brings popular data transformation functions from 'SPSS' Statistics and 'Excel': 'RECODE', 'COUNT', 'COUNTIF', 'VLOOKUP' and etc. These functions are very useful for data processing in marketing research surveys. Package intended to help people to move data processing from 'Excel' and 'SPSS' to R. |
Authors: | Gregory Demin [aut, cre], Sebastian Jeworutzki [ctb] , Dan Chaltiel [ctb], John Williams [ctb], Tom Elliott [ctb] |
Maintainer: | Gregory Demin <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.11.6 |
Built: | 2025-01-04 04:42:12 UTC |
Source: | https://github.com/gdemin/expss |
add_columns
inspired by MATCH FILES (Add
variables...) from SPSS Statistics. It works similar to SQL left join but
number of cases in the left part always remain the same. If there are
duplicated keys in the y
then error will be raised by default.
add_columns(x, y, by = NULL, ignore_duplicates = FALSE, ...)
add_columns(x, y, by = NULL, ignore_duplicates = FALSE, ...)
x |
data.frame to be joined with |
y |
data.frame. |
by |
character vector or NULL(default) or 1. Names of common variables
in the |
ignore_duplicates |
logical Should we ignore duplicates in the |
... |
arguments for further methods |
data.frame
# example for 'add_columns' from base 'merge' authors = data.frame( surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"), nationality = c("US", "Australia", "US", "UK", "Australia"), deceased = c("yes", rep("no", 4)) ) books = data.frame( surname = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"), title = c("Exploratory Data Analysis", "Modern Applied Statistics ...", "LISP-STAT", "Spatial Statistics", "Stochastic Simulation", "Interactive Data Analysis", "An Introduction to R") ) add_columns(books, authors)
# example for 'add_columns' from base 'merge' authors = data.frame( surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"), nationality = c("US", "Australia", "US", "UK", "Australia"), deceased = c("yes", rep("no", 4)) ) books = data.frame( surname = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"), title = c("Exploratory Data Analysis", "Modern Applied Statistics ...", "LISP-STAT", "Spatial Statistics", "Stochastic Simulation", "Interactive Data Analysis", "An Introduction to R") ) add_columns(books, authors)
Take a sequence of vector, matrix or data-frame arguments and combine by rows. Contrary to rbind it handles non-matching column names. There is also special method for the results of cro/cro_fun/tables/fre.
add_rows(...) ## S3 method for class 'data.frame' add_rows(..., nomatch_columns = c("add", "drop", "stop"))
add_rows(...) ## S3 method for class 'data.frame' add_rows(..., nomatch_columns = c("add", "drop", "stop"))
... |
data.frame/matrix/table for binding |
nomatch_columns |
action if there are non-matching columns between
data.frames. Possible values are |
See rbind, cro, cro_fun, fre, tables
a = data.frame(x = 1:5, y = 6:10) b = data.frame(y = 6:10, z = 11:15) add_rows(a, b) # x, y, z add_rows(a, b, nomatch_columns = "drop") # y
a = data.frame(x = 1:5, y = 6:10) b = data.frame(y = 6:10, z = 11:15) add_rows(a, b) # x, y, z add_rows(a, b, nomatch_columns = "drop") # y
apply_labels
tries automatically detect what is variable label and
what are value labels. See also var_lab and val_lab.
apply_labels(data, ...)
apply_labels(data, ...)
data |
data.frame/list |
... |
named arguments or lists. Name of argument is a variable name in
the |
data
with applied labels
data(mtcars) mtcars = apply_labels(mtcars, vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual ") ) # 'table' from base R table(mtcars$vs, mtcars$am) # more sophisticated crosstable cross_cases(mtcars, vs, am) # the same but with list argument list_arg = list( vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual ") ) mtcars = apply_labels(mtcars, list_arg)
data(mtcars) mtcars = apply_labels(mtcars, vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual ") ) # 'table' from base R table(mtcars$vs, mtcars$am) # more sophisticated crosstable cross_cases(mtcars, vs, am) # the same but with list argument list_arg = list( vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual ") ) mtcars = apply_labels(mtcars, list_arg)
This function converts a etable
object to a huxtable
.
The huxtable-package
needs to be installed to use this function.
as_huxtable.etable(x, ...) as_hux.etable(x, ...)
as_huxtable.etable(x, ...) as_hux.etable(x, ...)
x |
etable. Table to convert to a huxtable. |
... |
arguments passed on to huxtable. |
huxtable
allows to export formated tables to LaTeX, HTML, Microsoft Word,
Microsoft Excel, Microsoft Powerpoint, RTF and Markdown.
Tables in knitr or rmarkdown documents of type LaTeX or Word are converted by default.
## Not run: library(huxtable) data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) tab = mtcars %>% tab_cols(total(), am %nest% vs) %>% tab_cells(mpg, hp) %>% tab_stat_mean() %>% tab_cells(cyl) %>% tab_stat_cpct() %>% tab_pivot() %>% set_caption("Table 1. Some variables from mtcars dataset.") ht = as_huxtable(tab) ht ## End(Not run)
## Not run: library(huxtable) data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) tab = mtcars %>% tab_cols(total(), am %nest% vs) %>% tab_cells(mpg, hp) %>% tab_stat_mean() %>% tab_cells(cyl) %>% tab_stat_cpct() %>% tab_pivot() %>% set_caption("Table 1. Some variables from mtcars dataset.") ht = as_huxtable(tab) ht ## End(Not run)
Convert dichotomy data.frame/matrix to data.frame with category encoding
as.category(x, prefix = NULL, counted_value = 1, compress = FALSE) is.category(x)
as.category(x, prefix = NULL, counted_value = 1, compress = FALSE) is.category(x)
x |
Dichotomy data.frame/matrix (usually with 0,1 coding). |
prefix |
If is not NULL then column names will be added in the form prefix+column number. |
counted_value |
Vector. Values that will be considered as indicator of category presence. By default it equals to 1. |
compress |
Logical. Should we drop columns with all NA? FALSE by default. TRUE significantly decreases performance of the function. |
data.frame of class category
with numeric values
that correspond to column numbers of counted values. Column names of x or
variable labels are added as value labels.
as.dichotomy
for reverse conversion, mrset,
mdset for usage multiple-response variables with tables.
set.seed(123) # Let's imagine it's matrix of consumed products dichotomy_matrix = matrix(sample(0:1,40,replace = TRUE,prob=c(.6,.4)),nrow=10) colnames(dichotomy_matrix) = c("Milk","Sugar","Tea","Coffee") as.category(dichotomy_matrix, compress = TRUE) # compressed version category_encoding = as.category(dichotomy_matrix) # should be TRUE identical(val_lab(category_encoding), c(Milk = 1L, Sugar = 2L, Tea = 3L, Coffee = 4L)) all(as.dichotomy(category_encoding, use_na = FALSE) == dichotomy_matrix) # with prefix as.category(dichotomy_matrix, prefix = "products_") # data.frame with variable labels dichotomy_dataframe = as.data.frame(dichotomy_matrix) colnames(dichotomy_dataframe) = paste0("product_", 1:4) var_lab(dichotomy_dataframe[[1]]) = "Milk" var_lab(dichotomy_dataframe[[2]]) = "Sugar" var_lab(dichotomy_dataframe[[3]]) = "Tea" var_lab(dichotomy_dataframe[[4]]) = "Coffee" as.category(dichotomy_dataframe, prefix = "products_")
set.seed(123) # Let's imagine it's matrix of consumed products dichotomy_matrix = matrix(sample(0:1,40,replace = TRUE,prob=c(.6,.4)),nrow=10) colnames(dichotomy_matrix) = c("Milk","Sugar","Tea","Coffee") as.category(dichotomy_matrix, compress = TRUE) # compressed version category_encoding = as.category(dichotomy_matrix) # should be TRUE identical(val_lab(category_encoding), c(Milk = 1L, Sugar = 2L, Tea = 3L, Coffee = 4L)) all(as.dichotomy(category_encoding, use_na = FALSE) == dichotomy_matrix) # with prefix as.category(dichotomy_matrix, prefix = "products_") # data.frame with variable labels dichotomy_dataframe = as.data.frame(dichotomy_matrix) colnames(dichotomy_dataframe) = paste0("product_", 1:4) var_lab(dichotomy_dataframe[[1]]) = "Milk" var_lab(dichotomy_dataframe[[2]]) = "Sugar" var_lab(dichotomy_dataframe[[3]]) = "Tea" var_lab(dichotomy_dataframe[[4]]) = "Coffee" as.category(dichotomy_dataframe, prefix = "products_")
This is method for rendering results of tables/fre/cro
in Shiny. DT
package should be installed for this
feature (install.packages('DT')
). For detailed description of function
and its arguments see datatable.
as.datatable_widget(data, ...) ## S3 method for class 'etable' as.datatable_widget( data, ..., repeat_row_labels = FALSE, show_row_numbers = FALSE, digits = get_expss_digits() ) ## S3 method for class 'with_caption' as.datatable_widget( data, ..., repeat_row_labels = FALSE, show_row_numbers = FALSE, digits = get_expss_digits() )
as.datatable_widget(data, ...) ## S3 method for class 'etable' as.datatable_widget( data, ..., repeat_row_labels = FALSE, show_row_numbers = FALSE, digits = get_expss_digits() ) ## S3 method for class 'with_caption' as.datatable_widget( data, ..., repeat_row_labels = FALSE, show_row_numbers = FALSE, digits = get_expss_digits() )
data |
|
... |
further parameters for datatable |
repeat_row_labels |
logical Should we repeat duplicated row labels in the every row? Default is FALSE. |
show_row_numbers |
logical Default is FALSE. |
digits |
integer By default, all numeric columns are rounded to one digit after
decimal separator. Also you can set this argument by option 'expss.digits'
- for example, |
Object of class datatable
htmlTable for knitting
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) mtcars_table = mtcars %>% tab_cols(total(), am %nest% vs) %>% tab_cells(mpg, hp) %>% tab_stat_mean() %>% tab_cells(cyl) %>% tab_stat_cpct() %>% tab_pivot() %>% set_caption("Table 1. Some variables from mtcars dataset.") library(shiny) shinyApp( ui = fluidPage(fluidRow(column(12, DT::dataTableOutput('tbl')))), server = function(input, output) { output$tbl = DT::renderDataTable( as.datatable_widget(mtcars_table) ) } ) ## End(Not run)
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) mtcars_table = mtcars %>% tab_cols(total(), am %nest% vs) %>% tab_cells(mpg, hp) %>% tab_stat_mean() %>% tab_cells(cyl) %>% tab_stat_cpct() %>% tab_pivot() %>% set_caption("Table 1. Some variables from mtcars dataset.") library(shiny) shinyApp( ui = fluidPage(fluidRow(column(12, DT::dataTableOutput('tbl')))), server = function(input, output) { output$tbl = DT::renderDataTable( as.datatable_widget(mtcars_table) ) } ) ## End(Not run)
This function converts variable/multiple response variable (vector/matrix/data.frame) with category encoding into data.frame/matrix with dichotomy encoding (0/1) suited for most statistical analysis, e. g. clustering, factor analysis, linear regression and so on.
as.dichotomy
returns data.frame of class 'dichotomy' with 0, 1
and possibly NA.
dummy
returns matrix of class 'dichotomy' with 0, 1 and possibly NA.
dummy1
drops last column in dichotomy matrix. It is useful in many cases
because any column of such matrix usually is linear combinations of other columns.
as.dichotomy( x, prefix = "v", keep_unused = FALSE, use_na = TRUE, keep_values = NULL, keep_labels = NULL, drop_values = NULL, drop_labels = NULL, presence = 1, absence = 0 ) dummy( x, keep_unused = FALSE, use_na = TRUE, keep_values = NULL, keep_labels = NULL, drop_values = NULL, drop_labels = NULL, presence = 1, absence = 0 ) dummy1( x, keep_unused = FALSE, use_na = TRUE, keep_values = NULL, keep_labels = NULL, drop_values = NULL, drop_labels = NULL, presence = 1, absence = 0 ) is.dichotomy(x)
as.dichotomy( x, prefix = "v", keep_unused = FALSE, use_na = TRUE, keep_values = NULL, keep_labels = NULL, drop_values = NULL, drop_labels = NULL, presence = 1, absence = 0 ) dummy( x, keep_unused = FALSE, use_na = TRUE, keep_values = NULL, keep_labels = NULL, drop_values = NULL, drop_labels = NULL, presence = 1, absence = 0 ) dummy1( x, keep_unused = FALSE, use_na = TRUE, keep_values = NULL, keep_labels = NULL, drop_values = NULL, drop_labels = NULL, presence = 1, absence = 0 ) is.dichotomy(x)
x |
vector/factor/matrix/data.frame. |
prefix |
character. By default "v". |
keep_unused |
Logical. Should we create columns for unused value labels/factor levels? FALSE by default. |
use_na |
Logical. Should we use NA for rows with all NA or use 0's instead. TRUE by default. |
keep_values |
Numeric/character. Values that should be kept. By default all values will be kept. |
keep_labels |
Numeric/character. Labels/levels that should be kept. By default all labels/levels will be kept. |
drop_values |
Numeric/character. Values that should be dropped. By default all values will be kept. Ignored if keep_values/keep_labels are provided. |
drop_labels |
Numeric/character. Labels/levels that should be dropped. By default all labels/levels will be kept. Ignored if keep_values/keep_labels are provided. |
presence |
numeric value which will code presence of the level. By
default it is 1. Note that all tables functions need that |
absence |
numeric value which will code absence of the level. By default
it is 0. Note that all tables functions need that |
as.dichotomy
returns data.frame of class dichotomy
with 0,1. Columns of this data.frame have variable labels according to
value labels of original data. If label doesn't exist for particular value
then this value will be used as variable label. dummy
returns matrix
of class dichotomy
. Column names of this matrix are value labels of
original data.
as.category
for reverse conversion, mrset,
mdset for usage multiple-response variables with tables.
data.table::setDTthreads(2) # toy example # brands - multiple response question # Which brands do you use during last three months? set.seed(123) brands = as.sheet(t(replicate(20,sample(c(1:5,NA),4,replace = FALSE)))) # score - evaluation of tested product score = sample(-1:1,20,replace = TRUE) var_lab(brands) = "Used brands" val_lab(brands) = autonum(" Brand A Brand B Brand C Brand D Brand E ") var_lab(score) = "Evaluation of tested brand" val_lab(score) = make_labels(" -1 Dislike it 0 So-so 1 Like it ") cro_cpct(as.dichotomy(brands), score) # the same as cro_cpct(mrset(brands), score) # customer segmentation by used brands kmeans(dummy(brands), 3) # model of influence of used brands on evaluation of tested product summary(lm(score ~ dummy(brands))) # prefixed data.frame as.dichotomy(brands, prefix = "brand_")
data.table::setDTthreads(2) # toy example # brands - multiple response question # Which brands do you use during last three months? set.seed(123) brands = as.sheet(t(replicate(20,sample(c(1:5,NA),4,replace = FALSE)))) # score - evaluation of tested product score = sample(-1:1,20,replace = TRUE) var_lab(brands) = "Used brands" val_lab(brands) = autonum(" Brand A Brand B Brand C Brand D Brand E ") var_lab(score) = "Evaluation of tested brand" val_lab(score) = make_labels(" -1 Dislike it 0 So-so 1 Like it ") cro_cpct(as.dichotomy(brands), score) # the same as cro_cpct(mrset(brands), score) # customer segmentation by used brands kmeans(dummy(brands), 3) # model of influence of used brands on evaluation of tested product summary(lm(score ~ dummy(brands))) # prefixed data.frame as.dichotomy(brands, prefix = "brand_")
If x
is data.frame
then as.etable
just adds
etable
to class
attribute of x
. If x
is matrix
then it will be converted to data.frame.
as.etable(x, rownames_as_row_labels = NULL) is.etable(x)
as.etable(x, rownames_as_row_labels = NULL) is.etable(x)
x |
data.frame/matrix |
rownames_as_row_labels |
logical. If it is TRUE than rownames of
|
object of class etable
data(mtcars) etable_mtcars = as.etable(mtcars) is.etable(etable_mtcars) #TRUE etable_mtcars #another 'print' method is used cor(mtcars) %>% as.etable()
data(mtcars) etable_mtcars = as.etable(mtcars) is.etable(etable_mtcars) #TRUE etable_mtcars #another 'print' method is used cor(mtcars) %>% as.etable()
Recode vector into numeric vector with value labels
as.labelled(x, label = NULL) is.labelled(x)
as.labelled(x, label = NULL) is.labelled(x)
x |
numeric vector/character vector/factor |
label |
optional variable label |
numeric vector with value labels
character_vector = c("one", "two", "two", "three") as.labelled(character_vector, label = "Numbers")
character_vector = c("one", "two", "two", "three") as.labelled(character_vector, label = "Numbers")
compare_proportions
calculates p-values (via z-test) for comparison
between each proportion in the prop1
and prop2
. Results are calculated
with the same formula as in prop.test without continuity
correction.
compare_means
calculates p-values (via t-test) for comparison between
each mean in the mean1
and mean2
. Results are calculated on the
aggregated statistics (means, std. devs, N) with the same formula as in
t.test.
These functions mainly intended for usage inside significance_cpct and
significance_means.
compare_proportions(prop1, prop2, base1, base2, common_base = 0) compare_means( mean1, mean2, sd1, sd2, base1, base2, common_base = 0, var_equal = FALSE )
compare_proportions(prop1, prop2, base1, base2, common_base = 0) compare_means( mean1, mean2, sd1, sd2, base1, base2, common_base = 0, var_equal = FALSE )
prop1 |
a numeric vector of proportions in the group 1. Values should be between 0 and 1 |
prop2 |
a numeric vector of proportions in the group 2. Values should be between 0 and 1 |
base1 |
a numeric vector for |
base2 |
a numeric vector for |
common_base |
numeric. Number of cases that belong to both values in the
first and the second argument. It can occur in the case of overlapping
samples. Calculations are made according to algorithm in IBM SPSS Statistics
Algorithms v20, p. 263. Note that with these adjustments t-tests between
means are made with equal variance assumed (as with |
mean1 |
a numeric vector of the means in the first group. |
mean2 |
a numeric vector of the means in the second group. |
sd1 |
a numeric vector of the standard deviations in the first group. Values should be non-negative. |
sd2 |
a numeric vector of the standard deviations in the second group. Values should be non-negative. |
var_equal |
a logical variable indicating whether to treat the variances in the groups as being equal. For details see t.test. |
numeric vector with p-values
significance_cpct, significance_means, prop.test, t.test
# proportions data(mtcars) counts = table(mtcars$am, mtcars$vs) props = prop.table(counts) compare_proportions(props[,1], props[,2], colSums(counts)[1], colSums(counts)[1]) # means t.test(mpg ~ am, data = mtcars)$p.value # the same result with(mtcars, compare_means( mean(mpg[am==0]), mean(mpg[am==1]), sd(mpg[am==0]), sd(mpg[am==1]), length(mpg[am==0]), length(mpg[am==1]) ))
# proportions data(mtcars) counts = table(mtcars$am, mtcars$vs) props = prop.table(counts) compare_proportions(props[,1], props[,2], colSums(counts)[1], colSums(counts)[1]) # means t.test(mpg ~ am, data = mtcars)$p.value # the same result with(mtcars, compare_means( mean(mpg[am==0]), mean(mpg[am==1]), sd(mpg[am==0]), sd(mpg[am==1]), length(mpg[am==0]), length(mpg[am==1]) ))
compute
evaluates expression expr
in the context of data.frame
data
and return original data possibly modified.
calculate
evaluates expression expr
in the context of
data.frame data
and return value of the evaluated expression. Function use_labels
is shortcut for calculate
with
argument use_labels
set to TRUE
. When use_labels
is TRUE
there is a special shortcut for entire data.frame - ..data
.
do_if
modifies only rows for which cond
equals to
TRUE. Other rows remain unchanged. Newly created variables also will have
values only in rows for which cond
have TRUE. There will be NA's in
other rows. This function tries to mimic SPSS "DO IF(). ... END IF."
statement.
Full-featured %to%
is available in the expressions for addressing
range of variables.
There is a special constant .N
which equals to number of cases in
data
for usage in expression inside compute
/calculate
.
Inside do_if
.N
gives number of rows which will be affected by
expressions. For parametrization (variable substitution) see .. or
examples. Sometimes it is useful to create new empty variable inside compute.
You can use .new_var
function for this task. This function creates
variable of length .N
filled with NA. See examples.
modify
is an alias for compute
, modify_if
is
an alias for do_if
and calc
is an alias for calculate
.
compute(data, ...) modify(data, ...) do_if(data, cond, ...) modify_if(data, cond, ...) calculate(data, expr, use_labels = FALSE) use_labels(data, expr) calc(data, expr, use_labels = FALSE)
compute(data, ...) modify(data, ...) do_if(data, cond, ...) modify_if(data, cond, ...) calculate(data, expr, use_labels = FALSE) use_labels(data, expr) calc(data, expr, use_labels = FALSE)
data |
data.frame/list of data.frames. If |
... |
expressions that should be evaluated in the context of data.frame
|
cond |
logical vector or expression. Expression will be evaluated in the context of the data. |
expr |
expression that should be evaluated in the context of data.frame |
use_labels |
logical. Experimental feature. If it equals to |
compute
and do_if
functions return modified
data.frame/list of modified data.frames, calculate
returns value of
the evaluated expression/list of values.
dfs = data.frame( test = 1:5, a = rep(10, 5), b_1 = rep(11, 5), b_2 = rep(12, 5), b_3 = rep(13, 5), b_4 = rep(14, 5), b_5 = rep(15, 5) ) # compute sum of b* variables and attach it to 'dfs' let(dfs, b_total = sum_row(b_1 %to% b_5), b_total = set_var_lab(b_total, "Sum of b"), random_numbers = runif(.N) # .N usage ) %>% print() # calculate sum of b* variables and return it query(dfs, sum_row(b_1 %to% b_5)) # set values to existing/new variables let(dfs, columns('new_b{1:5}') := b_1 %to% b_5 ) %>% print() # conditional modification let_if(dfs, test %in% 2:4, a = a + 1, b_total = sum_row(b_1 %to% b_5), random_numbers = runif(.N) # .N usage ) %>% print() # variable substitution name1 = "a" name2 = "new_var" let(dfs, (name2) := get(name1)*2 ) %>% print() # 'use_labels' examples. Utilization of labels in base R. data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) use_labels(mtcars, table(am, vs)) ## Not run: use_labels(mtcars, plot(mpg, hp)) ## End(Not run) mtcars %>% use_labels(lm(mpg ~ disp + hp + wt)) %>% summary()
dfs = data.frame( test = 1:5, a = rep(10, 5), b_1 = rep(11, 5), b_2 = rep(12, 5), b_3 = rep(13, 5), b_4 = rep(14, 5), b_5 = rep(15, 5) ) # compute sum of b* variables and attach it to 'dfs' let(dfs, b_total = sum_row(b_1 %to% b_5), b_total = set_var_lab(b_total, "Sum of b"), random_numbers = runif(.N) # .N usage ) %>% print() # calculate sum of b* variables and return it query(dfs, sum_row(b_1 %to% b_5)) # set values to existing/new variables let(dfs, columns('new_b{1:5}') := b_1 %to% b_5 ) %>% print() # conditional modification let_if(dfs, test %in% 2:4, a = a + 1, b_total = sum_row(b_1 %to% b_5), random_numbers = runif(.N) # .N usage ) %>% print() # variable substitution name1 = "a" name2 = "new_var" let(dfs, (name2) := get(name1)*2 ) %>% print() # 'use_labels' examples. Utilization of labels in base R. data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) use_labels(mtcars, table(am, vs)) ## Not run: use_labels(mtcars, plot(mpg, hp)) ## End(Not run) mtcars %>% use_labels(lm(mpg ~ disp + hp + wt)) %>% summary()
These functions calculate count/sum/average/etc. on values that meet a
criterion that you specify. apply_if_*
apply custom functions. There
are different flavors of these functions: *_if
work on entire
dataset/matrix/vector, *_row_if
works on each row and *_col_if
works on each column.
count_if(criterion, ...) count_row_if(criterion, ...) count_col_if(criterion, ...) has(x, criterion) x %row_in% criterion x %has% criterion x %col_in% criterion sum_if(criterion, ..., data = NULL) sum_row_if(criterion, ..., data = NULL) sum_col_if(criterion, ..., data = NULL) mean_if(criterion, ..., data = NULL) mean_row_if(criterion, ..., data = NULL) mean_col_if(criterion, ..., data = NULL) sd_if(criterion, ..., data = NULL) sd_row_if(criterion, ..., data = NULL) sd_col_if(criterion, ..., data = NULL) median_if(criterion, ..., data = NULL) median_row_if(criterion, ..., data = NULL) median_col_if(criterion, ..., data = NULL) max_if(criterion, ..., data = NULL) max_row_if(criterion, ..., data = NULL) max_col_if(criterion, ..., data = NULL) min_if(criterion, ..., data = NULL) min_row_if(criterion, ..., data = NULL) min_col_if(criterion, ..., data = NULL) apply_row_if(fun, criterion, ..., data = NULL) apply_col_if(fun, criterion, ..., data = NULL)
count_if(criterion, ...) count_row_if(criterion, ...) count_col_if(criterion, ...) has(x, criterion) x %row_in% criterion x %has% criterion x %col_in% criterion sum_if(criterion, ..., data = NULL) sum_row_if(criterion, ..., data = NULL) sum_col_if(criterion, ..., data = NULL) mean_if(criterion, ..., data = NULL) mean_row_if(criterion, ..., data = NULL) mean_col_if(criterion, ..., data = NULL) sd_if(criterion, ..., data = NULL) sd_row_if(criterion, ..., data = NULL) sd_col_if(criterion, ..., data = NULL) median_if(criterion, ..., data = NULL) median_row_if(criterion, ..., data = NULL) median_col_if(criterion, ..., data = NULL) max_if(criterion, ..., data = NULL) max_row_if(criterion, ..., data = NULL) max_col_if(criterion, ..., data = NULL) min_if(criterion, ..., data = NULL) min_row_if(criterion, ..., data = NULL) min_col_if(criterion, ..., data = NULL) apply_row_if(fun, criterion, ..., data = NULL) apply_col_if(fun, criterion, ..., data = NULL)
criterion |
Vector with counted values or function. See details and examples. |
... |
Data on which criterion will be applied. Vector, matrix, data.frame, list. |
x |
Data on which criterion will be applied. Vector, matrix, data.frame, list. |
data |
Data on which function will be applied. Doesn't applicable to
|
fun |
Custom function that will be applied based on criterion. |
Possible type for criterion argument:
vector/single value All values in ...
which equal to the
elements of vector in the criteria will be used as function fun
argument.
function Values for which function gives TRUE will be used as
function fun
argument. There are some special functions for
convenience (e. g. gt(5)
is equivalent ">5" in spreadsheet) - see
criteria.
count*
and %in*%
never returns NA's. Other functions remove
NA's before calculations (as na.rm = TRUE
in base R functions).
Function criterion should return logical vector of same size and shape as its
argument. This function will be applied to each column of supplied data and
TRUE results will be used. There is asymmetrical behavior in *_row_if
and *_col_if
for function criterion: in both cases function criterion
will be applied columnwise.
*_if
return single value (vector of length 1).
*_row_if
returns vector for each row of supplied arguments.
*_col_if
returns vector for each column of supplied arguments.
%row_in%
/%col_in%
return logical vector - indicator of
presence of criterion in each row/column. %has%
is an alias for
%row_in%
.
set.seed(123) sheet1 = as.sheet( matrix(sample(c(1:10,NA), 30, replace = TRUE), 10) ) result = let(sheet1, # count 8 exact = count_row_if(8, V1, V2, V3), # count values greater than 8 greater = count_row_if(gt(8), V1, V2, V3), # count integer values between 5 and 8, e. g. 5, 6, 7, 8 integer_range = count_row_if(5:8, V1, V2, V3), # count values between 5 and 8 range = count_row_if(5 %thru% 8, V1, V2, V3), # count NA na = count_row_if(is.na, V1, V2, V3), # count not-NA not_na = count_row_if(not_na, V1, V2, V3), # are there any 5 in each row? has_five = cbind(V1, V2, V3) %row_in% 5 ) print(result) mean_row_if(6, sheet1$V1, data = sheet1) median_row_if(gt(2), sheet1$V1, sheet1$V2, sheet1$V3) sd_row_if(5 %thru% 8, sheet1$V1, sheet1$V2, sheet1$V3) if_na(sheet1) = 5 # replace NA # custom apply apply_col_if(prod, gt(2), sheet1$V1, data = sheet1) # product of all elements by columns apply_row_if(prod, gt(2), sheet1$V1, data = sheet1) # product of all elements by rows # Examples borrowed from Microsoft Excel help for COUNTIF sheet1 = text_to_columns( " a b apples 32 oranges 54 peaches 75 apples 86 " ) count_if("apples", sheet1$a) # 2 count_if("apples", sheet1) # 2 with(sheet1, count_if("apples", a, b)) # 2 count_if(gt(55), sheet1$b) # greater than 55 = 2 count_if(ne(75), sheet1$b) # not equal 75 = 3 count_if(ge(32), sheet1$b) # greater than or equal 32 = 4 count_if(gt(32) & lt(86), sheet1$b) # 2 # count only integer values between 33 and 85 count_if(33:85, sheet1$b) # 2 # values with letters count_if(regex("^[A-z]+$"), sheet1) # 4 # values that started on 'a' count_if(regex("^a"), sheet1) # 2 # count_row_if count_row_if(regex("^a"), sheet1) # c(1,0,0,1) sheet1 %row_in% 'apples' # c(TRUE,FALSE,FALSE,TRUE) # Some of Microsoft Excel examples for SUMIF/AVERAGEIF/etc sheet1 = text_to_columns( " property_value commission data 100000 7000 250000 200000 14000 300000 21000 400000 28000 " ) # Sum of commision for property value greater than 160000 with(sheet1, sum_if(gt(160000), property_value, data = commission)) # 63000 # Sum of property value greater than 160000 with(sheet1, sum_if(gt(160000), property_value)) # 900000 # Sum of commision for property value equals to 300000 with(sheet1, sum_if(300000, property_value, data = commission)) # 21000 # Sum of commision for property value greater than first value of data with(sheet1, sum_if(gt(data[1]), property_value, data = commission)) # 49000 sheet1 = text_to_columns( " category food sales Vegetables Tomatoes 2300 Vegetables Celery 5500 Fruits Oranges 800 NA Butter 400 Vegetables Carrots 4200 Fruits Apples 1200 " ) # Sum of sales for Fruits with(sheet1, sum_if("Fruits", category, data = sales)) # 2000 # Sum of sales for Vegetables with(sheet1, sum_if("Vegetables", category, data = sales)) # 12000 # Sum of sales for food which is ending on 'es' with(sheet1, sum_if(perl("es$"), food, data = sales)) # 4300 # Sum of sales for empty category with(sheet1, sum_if(NA, category, data = sales)) # 400 sheet1 = text_to_columns( " property_value commission data 100000 7000 250000 200000 14000 300000 21000 400000 28000 " ) # Commision average for comission less than 23000 with(sheet1, mean_if(lt(23000), commission)) # 14000 # Property value average for property value less than 95000 with(sheet1, mean_if(lt(95000), property_value)) # NaN # Commision average for property value greater than 250000 with(sheet1, mean_if(gt(250000), property_value, data = commission)) # 24500 sheet1 = text_to_columns( ' region profits East 45678 West 23789 North -4789 "South (New Office)" 0 MidWest 9678 ', quote = '"' ) # Mean profits for 'west' regions with(sheet1, mean_if(contains("West"), region, data = profits)) # 16733.5 # Mean profits for regions wich doesn't contain New Office with(sheet1, mean_if(not(contains("New Office")), region, data = profits)) # 18589 sheet1 = text_to_columns( " grade weight 89 1 93 2 96 2 85 3 91 1 88 1 " ) # Minimum grade for weight equals to 1 with(sheet1, min_if(1, weight, data = grade)) # 88 # Maximum grade for weight equals to 1 with(sheet1, max_if(1, weight, data = grade)) #91 # Example with offset sheet1 = text_to_columns( " weight grade 10 b 11 a 100 a 111 b 1 a 1 a " ) with(sheet1, min_if("a", grade[2:5], data = weight[1:4])) # 10
set.seed(123) sheet1 = as.sheet( matrix(sample(c(1:10,NA), 30, replace = TRUE), 10) ) result = let(sheet1, # count 8 exact = count_row_if(8, V1, V2, V3), # count values greater than 8 greater = count_row_if(gt(8), V1, V2, V3), # count integer values between 5 and 8, e. g. 5, 6, 7, 8 integer_range = count_row_if(5:8, V1, V2, V3), # count values between 5 and 8 range = count_row_if(5 %thru% 8, V1, V2, V3), # count NA na = count_row_if(is.na, V1, V2, V3), # count not-NA not_na = count_row_if(not_na, V1, V2, V3), # are there any 5 in each row? has_five = cbind(V1, V2, V3) %row_in% 5 ) print(result) mean_row_if(6, sheet1$V1, data = sheet1) median_row_if(gt(2), sheet1$V1, sheet1$V2, sheet1$V3) sd_row_if(5 %thru% 8, sheet1$V1, sheet1$V2, sheet1$V3) if_na(sheet1) = 5 # replace NA # custom apply apply_col_if(prod, gt(2), sheet1$V1, data = sheet1) # product of all elements by columns apply_row_if(prod, gt(2), sheet1$V1, data = sheet1) # product of all elements by rows # Examples borrowed from Microsoft Excel help for COUNTIF sheet1 = text_to_columns( " a b apples 32 oranges 54 peaches 75 apples 86 " ) count_if("apples", sheet1$a) # 2 count_if("apples", sheet1) # 2 with(sheet1, count_if("apples", a, b)) # 2 count_if(gt(55), sheet1$b) # greater than 55 = 2 count_if(ne(75), sheet1$b) # not equal 75 = 3 count_if(ge(32), sheet1$b) # greater than or equal 32 = 4 count_if(gt(32) & lt(86), sheet1$b) # 2 # count only integer values between 33 and 85 count_if(33:85, sheet1$b) # 2 # values with letters count_if(regex("^[A-z]+$"), sheet1) # 4 # values that started on 'a' count_if(regex("^a"), sheet1) # 2 # count_row_if count_row_if(regex("^a"), sheet1) # c(1,0,0,1) sheet1 %row_in% 'apples' # c(TRUE,FALSE,FALSE,TRUE) # Some of Microsoft Excel examples for SUMIF/AVERAGEIF/etc sheet1 = text_to_columns( " property_value commission data 100000 7000 250000 200000 14000 300000 21000 400000 28000 " ) # Sum of commision for property value greater than 160000 with(sheet1, sum_if(gt(160000), property_value, data = commission)) # 63000 # Sum of property value greater than 160000 with(sheet1, sum_if(gt(160000), property_value)) # 900000 # Sum of commision for property value equals to 300000 with(sheet1, sum_if(300000, property_value, data = commission)) # 21000 # Sum of commision for property value greater than first value of data with(sheet1, sum_if(gt(data[1]), property_value, data = commission)) # 49000 sheet1 = text_to_columns( " category food sales Vegetables Tomatoes 2300 Vegetables Celery 5500 Fruits Oranges 800 NA Butter 400 Vegetables Carrots 4200 Fruits Apples 1200 " ) # Sum of sales for Fruits with(sheet1, sum_if("Fruits", category, data = sales)) # 2000 # Sum of sales for Vegetables with(sheet1, sum_if("Vegetables", category, data = sales)) # 12000 # Sum of sales for food which is ending on 'es' with(sheet1, sum_if(perl("es$"), food, data = sales)) # 4300 # Sum of sales for empty category with(sheet1, sum_if(NA, category, data = sales)) # 400 sheet1 = text_to_columns( " property_value commission data 100000 7000 250000 200000 14000 300000 21000 400000 28000 " ) # Commision average for comission less than 23000 with(sheet1, mean_if(lt(23000), commission)) # 14000 # Property value average for property value less than 95000 with(sheet1, mean_if(lt(95000), property_value)) # NaN # Commision average for property value greater than 250000 with(sheet1, mean_if(gt(250000), property_value, data = commission)) # 24500 sheet1 = text_to_columns( ' region profits East 45678 West 23789 North -4789 "South (New Office)" 0 MidWest 9678 ', quote = '"' ) # Mean profits for 'west' regions with(sheet1, mean_if(contains("West"), region, data = profits)) # 16733.5 # Mean profits for regions wich doesn't contain New Office with(sheet1, mean_if(not(contains("New Office")), region, data = profits)) # 18589 sheet1 = text_to_columns( " grade weight 89 1 93 2 96 2 85 3 91 1 88 1 " ) # Minimum grade for weight equals to 1 with(sheet1, min_if(1, weight, data = grade)) # 88 # Maximum grade for weight equals to 1 with(sheet1, max_if(1, weight, data = grade)) #91 # Example with offset sheet1 = text_to_columns( " weight grade 10 b 11 a 100 a 111 b 1 a 1 a " ) with(sheet1, min_if("a", grade[2:5], data = weight[1:4])) # 10
Produce criteria which could be used in the different situations - see
'recode', 'na_if', 'count_if', 'match_row',
'%i%' and etc. For example, 'greater(5)'
returns function
which tests whether its argument greater than five. 'fixed("apple")'
returns function which tests whether its argument contains "apple". For
criteria logical operations (|, &, !, xor) are defined, e. g. you can write
something like: 'greater(5) | equals(1)'
.
List of functions:
comparison criteria - 'equals'
, 'greater'
and etc. return
functions which compare its argument against value.
'thru'
checks whether a value is inside interval.
'thru(0,1)'
is equivalent to 'x>=0 & x<=1'
'%thru%'
is infix version of 'thru'
, e. g. '0
%thru% 1'
'is_max'
and 'is_min'
return TRUE where vector value is
equals to maximum or minimum.
'contains'
searches for the pattern in the strings. By default,
it works with fixed patterns rather than regular expressions. For details
about its arguments see grepl
'like'
searches for the Excel-style pattern in the strings. You
can use wildcards: '*' means any number of symbols, '?' means single symbol.
Case insensitive.
'fixed'
alias for contains.
'perl'
such as 'contains'
but the pattern is perl-compatible
regular expression ('perl = TRUE'
). For details see grepl
'regex'
use POSIX 1003.2 extended regular expressions
('fixed = FALSE'
). For details see grepl
'has_label'
searches values which have supplied label(-s). We
can used criteria as an argument for 'has_label'.
'to'
returns function which gives TRUE for all elements of
vector before the first occurrence of 'x'
and for 'x'
.
'from'
returns function which gives TRUE for all elements of
vector after the first occurrence of 'x'
and for 'x'
.
'not_na'
returns TRUE for all non-NA vector elements.
'other'
returns TRUE for all vector elements. It is intended
for usage with 'recode'
.
'items'
returns TRUE for the vector elements with the given
sequential numbers.
'and'
, 'or'
, 'not'
are spreadsheet-style boolean functions.
Shortcuts for comparison criteria:
'equals' - 'eq'
'not_equals' - 'neq'
, 'ne'
'greater' - 'gt'
'greater_or_equal' - 'gte'
, 'ge'
'less' - 'lt'
'less_or_equal' - 'lte'
, 'le'
as.criterion(crit) is.criterion(x) equals(x) not_equals(x) less(x) less_or_equal(x) greater(x) greater_or_equal(x) thru(lower, upper) lower %thru% upper when(x) is_max(x) is_min(x) contains( pattern, ignore.case = FALSE, perl = FALSE, fixed = TRUE, useBytes = FALSE ) like(pattern) fixed( pattern, ignore.case = FALSE, perl = FALSE, fixed = TRUE, useBytes = FALSE ) perl( pattern, ignore.case = FALSE, perl = TRUE, fixed = FALSE, useBytes = FALSE ) regex( pattern, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE ) has_label(x) from(x) to(x) items(...) not_na(x) is_na(x) other(x) and(...) or(...) not(x)
as.criterion(crit) is.criterion(x) equals(x) not_equals(x) less(x) less_or_equal(x) greater(x) greater_or_equal(x) thru(lower, upper) lower %thru% upper when(x) is_max(x) is_min(x) contains( pattern, ignore.case = FALSE, perl = FALSE, fixed = TRUE, useBytes = FALSE ) like(pattern) fixed( pattern, ignore.case = FALSE, perl = FALSE, fixed = TRUE, useBytes = FALSE ) perl( pattern, ignore.case = FALSE, perl = TRUE, fixed = FALSE, useBytes = FALSE ) regex( pattern, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE ) has_label(x) from(x) to(x) items(...) not_na(x) is_na(x) other(x) and(...) or(...) not(x)
crit |
vector of values/function which returns logical or logical vector. It will be converted to function of class criterion. |
x |
vector |
lower |
vector/single value - lower bound of interval |
upper |
vector/single value - upper bound of interval |
pattern |
character string containing a regular expression (or character
string for |
ignore.case |
logical see grepl |
perl |
logical see grepl |
fixed |
logical see grepl |
useBytes |
logical see grepl |
... |
numeric indexes of desired items for items, logical vectors or criteria for boolean functions. |
function of class 'criterion' which tests its argument against condition and return logical value
recode, count_if, match_row, na_if, %i%
# operations on vector, '%d%' means 'diff' 1:6 %d% greater(4) # 1:4 1:6 %d% (1 | greater(4)) # 2:4 # '%i%' means 'intersect 1:6 %i% (is_min() | is_max()) # 1, 6 # with Excel-style boolean operators 1:6 %i% or(is_min(), is_max()) # 1, 6 letters %i% (contains("a") | contains("z")) # a, z letters %i% perl("a|z") # a, z letters %i% from("w") # w, x, y, z letters %i% to("c") # a, b, c letters %i% (from("b") & to("e")) # b, d, e c(1, 2, NA, 3) %i% not_na() # c(1, 2, 3) # examples with count_if df1 = data.frame( a=c("apples", "oranges", "peaches", "apples"), b = c(32, 54, 75, 86) ) count_if(greater(55), df1$b) # greater than 55 = 2 count_if(not_equals(75), df1$b) # not equals 75 = 3 count_if(greater(32) & less(86), df1$b) # greater than 32 and less than 86 = 2 count_if(and(greater(32), less(86)), df1$b) # the same result # infix version count_if(35 %thru% 80, df1$b) # greater than or equals to 35 and less than or equals to 80 = 2 # values that started on 'a' count_if(like("a*"), df1) # 2 # the same with Perl-style regular expression count_if(perl("^a"), df1) # 2 # count_row_if count_row_if(perl("^a"), df1) # c(1,0,0,1) # examples with 'n_intersect' and 'n_diff' data(iris) iris %>% n_intersect(to("Petal.Width")) # all columns up to 'Species' # 'Sepal.Length', 'Sepal.Width' will be left iris %>% n_diff(from("Petal.Length")) # except first column iris %n_d% items(1) # 'recode' examples qvar = c(1:20, 97, NA, NA) recode(qvar, 1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, 11 %thru% hi ~ 3, other ~ 0) # the same result recode(qvar, 1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, greater_or_equal(11) ~ 3, other ~ 0)
# operations on vector, '%d%' means 'diff' 1:6 %d% greater(4) # 1:4 1:6 %d% (1 | greater(4)) # 2:4 # '%i%' means 'intersect 1:6 %i% (is_min() | is_max()) # 1, 6 # with Excel-style boolean operators 1:6 %i% or(is_min(), is_max()) # 1, 6 letters %i% (contains("a") | contains("z")) # a, z letters %i% perl("a|z") # a, z letters %i% from("w") # w, x, y, z letters %i% to("c") # a, b, c letters %i% (from("b") & to("e")) # b, d, e c(1, 2, NA, 3) %i% not_na() # c(1, 2, 3) # examples with count_if df1 = data.frame( a=c("apples", "oranges", "peaches", "apples"), b = c(32, 54, 75, 86) ) count_if(greater(55), df1$b) # greater than 55 = 2 count_if(not_equals(75), df1$b) # not equals 75 = 3 count_if(greater(32) & less(86), df1$b) # greater than 32 and less than 86 = 2 count_if(and(greater(32), less(86)), df1$b) # the same result # infix version count_if(35 %thru% 80, df1$b) # greater than or equals to 35 and less than or equals to 80 = 2 # values that started on 'a' count_if(like("a*"), df1) # 2 # the same with Perl-style regular expression count_if(perl("^a"), df1) # 2 # count_row_if count_row_if(perl("^a"), df1) # c(1,0,0,1) # examples with 'n_intersect' and 'n_diff' data(iris) iris %>% n_intersect(to("Petal.Width")) # all columns up to 'Species' # 'Sepal.Length', 'Sepal.Width' will be left iris %>% n_diff(from("Petal.Length")) # except first column iris %n_d% items(1) # 'recode' examples qvar = c(1:20, 97, NA, NA) recode(qvar, 1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, 11 %thru% hi ~ 3, other ~ 0) # the same result recode(qvar, 1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, greater_or_equal(11) ~ 3, other ~ 0)
cross_cases
build a contingency table of the counts.
cross_cpct
, cross_cpct_responses
build a contingency table
of the column percent. These functions give different results only for
multiple response variables. For cross_cpct
base of percent is number
of valid cases. Case is considered as valid if it has at least one non-NA
value. So for multiple response variables sum of percent may be greater than
100. For cross_cpct_responses
base of percent is number of valid
responses. Multiple response variables can have several responses for single
case. Sum of percent of cross_cpct_responses
always equals to 100%.
cross_rpct
build a contingency table of the row percent. Base
for percent is number of valid cases.
cross_tpct
build a contingency table of the table percent. Base
for percent is number of valid cases.
cross_*
functions evaluate their arguments
in the context of the first argument data
.
cro_*
functions use standard evaluation, e. g 'cro(mtcars$am, mtcars$vs)'.
total
auxiliary function - creates variables with 1 for valid
case of its argument x
and NA in opposite case.
You can combine tables with add_rows and merge.etable. For
sorting table see tab_sort_asc.
To provide multiple-response variables as arguments use mrset for
multiples with category encoding and mdset for multiples with
dichotomy (dummy) encoding. To compute statistics with nested
variables/banners use nest. For more sophisticated interface with
modern piping via magrittr
see tables.
cross_cases( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cross_cpct( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cross_rpct( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cross_tpct( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cross_cpct_responses( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_responses", total_row_position = c("below", "above", "none") ) cro( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_cases( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_cpct( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_rpct( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_tpct( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_cpct_responses( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_responses", total_row_position = c("below", "above", "none") ) total(x = 1, label = "#Total")
cross_cases( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cross_cpct( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cross_rpct( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cross_tpct( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cross_cpct_responses( data, cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_responses", total_row_position = c("below", "above", "none") ) cro( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_cases( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_cpct( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_rpct( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_tpct( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none") ) cro_cpct_responses( cell_vars, col_vars = total(), row_vars = NULL, weight = NULL, subgroup = NULL, total_label = NULL, total_statistic = "u_responses", total_row_position = c("below", "above", "none") ) total(x = 1, label = "#Total")
data |
data.frame in which context all other arguments will be evaluated
(for |
cell_vars |
vector/data.frame/list. Variables on which percentage/cases will be computed. Use mrset/mdset for multiple-response variables. |
col_vars |
vector/data.frame/list. Variables which breaks table by columns. Use mrset/mdset for multiple-response variables. |
row_vars |
vector/data.frame/list. Variables which breaks table by rows. Use mrset/mdset for multiple-response variables. |
weight |
numeric vector. Optional cases weights. Cases with NA's, negative and zero weights are removed before calculations. |
subgroup |
logical vector. You can specify subgroup on which table will be computed. |
total_label |
By default "#Total". You can provide several names - each name for each total statistics. |
total_statistic |
By default it is "u_cases" (unweighted cases). Possible values are "u_cases", "u_responses", "u_cpct", "u_rpct", "u_tpct", "w_cases", "w_responses", "w_cpct", "w_rpct", "w_tpct". "u_" means unweighted statistics and "w_" means weighted statistics. |
total_row_position |
Position of total row in the resulting table. Can be one of "below", "above", "none". |
x |
vector/data.frame of class 'category'/'dichotomy'. |
label |
character. Label for total variable. |
object of class 'etable'. Basically it's a data.frame but class is needed for custom methods.
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) cross_cases(mtcars, am, vs) cro(mtcars$am, mtcars$vs) # the same result # column percent with multiple banners cross_cpct(mtcars, cyl, list(total(), vs, am)) # nested banner cross_cpct(mtcars, cyl, list(total(), vs %nest% am)) # stacked variables cross_cases(mtcars, list(cyl, carb), list(total(), vs %nest% am)) # nested variables cross_cpct(mtcars, am %nest% cyl, list(total(), vs)) # row variables cross_cpct(mtcars, cyl, list(total(), vs), row_vars = am) # several totals above table cross_cpct(mtcars, cyl, list(total(), vs), row_vars = am, total_row_position = "above", total_label = c("number of cases", "row %"), total_statistic = c("u_cases", "u_rpct") ) # multiple-choice variable # brands - multiple response question # Which brands do you use during last three months? set.seed(123) brands = data.table(t(replicate(20,sample(c(1:5,NA),4,replace = FALSE)))) %>% setNames(paste0("brand_", 1:4)) # score - evaluation of tested product brands = brands %>% let( score = sample(-1:1,.N,replace = TRUE) ) %>% apply_labels( brand_1 = "Used brands", brand_1 = num_lab(" 1 Brand A 2 Brand B 3 Brand C 4 Brand D 5 Brand E "), score = "Evaluation of tested brand", score = num_lab(" -1 Dislike it 0 So-so 1 Like it ") ) cross_cpct(brands, mrset(brand_1 %to% brand_4), list(total(), score)) # responses cross_cpct_responses(brands, mrset(brand_1 %to% brand_4), list(total(), score)) ## End(Not run)
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) cross_cases(mtcars, am, vs) cro(mtcars$am, mtcars$vs) # the same result # column percent with multiple banners cross_cpct(mtcars, cyl, list(total(), vs, am)) # nested banner cross_cpct(mtcars, cyl, list(total(), vs %nest% am)) # stacked variables cross_cases(mtcars, list(cyl, carb), list(total(), vs %nest% am)) # nested variables cross_cpct(mtcars, am %nest% cyl, list(total(), vs)) # row variables cross_cpct(mtcars, cyl, list(total(), vs), row_vars = am) # several totals above table cross_cpct(mtcars, cyl, list(total(), vs), row_vars = am, total_row_position = "above", total_label = c("number of cases", "row %"), total_statistic = c("u_cases", "u_rpct") ) # multiple-choice variable # brands - multiple response question # Which brands do you use during last three months? set.seed(123) brands = data.table(t(replicate(20,sample(c(1:5,NA),4,replace = FALSE)))) %>% setNames(paste0("brand_", 1:4)) # score - evaluation of tested product brands = brands %>% let( score = sample(-1:1,.N,replace = TRUE) ) %>% apply_labels( brand_1 = "Used brands", brand_1 = num_lab(" 1 Brand A 2 Brand B 3 Brand C 4 Brand D 5 Brand E "), score = "Evaluation of tested brand", score = num_lab(" -1 Dislike it 0 So-so 1 Like it ") ) cross_cpct(brands, mrset(brand_1 %to% brand_4), list(total(), score)) # responses cross_cpct_responses(brands, mrset(brand_1 %to% brand_4), list(total(), score)) ## End(Not run)
cross_mean
, cross_sum
, cross_median
calculate
mean/sum/median by groups. NA's are always omitted.
cross_mean_sd_n
calculates mean, standard deviation and N
simultaneously. Mainly intended for usage with significance_means.
cross_pearson
, cross_spearman
calculate correlation of
first variable in each data.frame in cell_vars
with other variables.
NA's are removed pairwise.
cross_fun
, cross_fun_df
return table with custom summary
statistics defined by fun
argument. NA's treatment depends on your
fun
behavior. To use weight you should have formal weight
argument in fun
and some logic for its processing inside. Several
functions with weight support are provided - see w_mean.
cross_fun
applies fun
on each variable in cell_vars
separately, cross_fun_df
gives to fun
each data.frame in
cell_vars
as a whole. So cross_fun(iris[, -5], iris$Species, fun =
mean)
gives the same result as cross_fun_df(iris[, -5], iris$Species,
fun = colMeans)
. For cross_fun_df
names of cell_vars
will
converted to labels if they are available before the fun
will be applied.
Generally it is recommended that fun
will always return object of the
same form. Row names/vector names of fun
result will appear in the row
labels of the table and column names/names of list will appear in the column
labels. If your fun
returns data.frame/matrix/list with element named
'row_labels' then this element will be used as row labels. And it will have
precedence over rownames.
cross_*
are evaluate their arguments
in the context of the first argument data
.
cro_*
functions use standard evaluation, e. g 'cro(mtcars$am, mtcars$vs)'.
combine_functions
is auxiliary function for combining several
functions into one function for usage with cro_fun
/cro_fun_df
.
Names of arguments will be used as statistic labels. By default, results of
each function are combined with c. But you can provide your own method
function with method
argument. It will be applied as in the expression
do.call(method, list_of_functions_results)
. Particular useful method
is list
. When it used then statistic labels will appear in the column
labels. See examples. Also you may be interested in data.frame
,
rbind
, cbind
methods.
cross_fun( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, fun, ..., unsafe = FALSE ) cross_fun_df( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, fun, ..., unsafe = FALSE ) cross_mean( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cross_mean_sd_n( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, weighted_valid_n = FALSE, labels = NULL ) cross_sum( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cross_median( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cross_pearson( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cross_spearman( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_fun( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, fun, ..., unsafe = FALSE ) cro_fun_df( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, fun, ..., unsafe = FALSE ) cro_mean( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_mean_sd_n( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, weighted_valid_n = FALSE, labels = NULL ) cro_sum( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_median( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_pearson( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_spearman( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) combine_functions(..., method = c)
cross_fun( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, fun, ..., unsafe = FALSE ) cross_fun_df( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, fun, ..., unsafe = FALSE ) cross_mean( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cross_mean_sd_n( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, weighted_valid_n = FALSE, labels = NULL ) cross_sum( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cross_median( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cross_pearson( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cross_spearman( data, cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_fun( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, fun, ..., unsafe = FALSE ) cro_fun_df( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, fun, ..., unsafe = FALSE ) cro_mean( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_mean_sd_n( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL, weighted_valid_n = FALSE, labels = NULL ) cro_sum( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_median( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_pearson( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) cro_spearman( cell_vars, col_vars = total(), row_vars = total(label = ""), weight = NULL, subgroup = NULL ) combine_functions(..., method = c)
data |
data.frame in which context all other arguments will be evaluated
(for |
cell_vars |
vector/data.frame/list. Variables on which summary function will be computed. |
col_vars |
vector/data.frame/list. Variables which breaks table by columns. Use mrset/mdset for multiple-response variables. |
row_vars |
vector/data.frame/list. Variables which breaks table by rows. Use mrset/mdset for multiple-response variables. |
weight |
numeric vector. Optional cases weights. Cases with NA's, negative and zero weights are removed before calculations. |
subgroup |
logical vector. You can specify subgroup on which table will be computed. |
fun |
custom summary function. Generally it is recommended that
|
... |
further arguments for |
unsafe |
logical/character If not FALSE than |
weighted_valid_n |
logical. Should we show weighted valid N in
|
labels |
character vector of length 3. Labels for mean, standard
deviation and valid N in |
method |
function which will combine results of multiple functions in
|
object of class 'etable'. Basically it's a data.frame but class is needed for custom methods.
tables, fre, cross_cases.
data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # Simple example - there is special shortcut for it - 'cross_mean' cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = vs, fun = mean) # The same example with 'subgroup' cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = vs, subgroup = vs == 0, fun = mean) # 'combine_functions' usage cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = vs, fun = combine_functions(Mean = mean, 'Std. dev.' = sd, 'Valid N' = valid_n) ) # 'combine_functions' usage - statistic labels in columns cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = vs, fun = combine_functions(Mean = mean, 'Std. dev.' = sd, 'Valid N' = valid_n, method = list ) ) # 'summary' function cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = list(total(), vs), fun = summary ) # comparison 'cross_fun' and 'cross_fun_df' cross_fun(mtcars, data.frame(mpg, disp, hp, wt, qsec), col_vars = am, fun = mean ) # same result cross_fun_df(mtcars, data.frame(mpg, disp, hp, wt, qsec), col_vars = am, fun = colMeans ) # usage for 'cross_fun_df' which is not possible for 'cross_fun' # linear regression by groups cross_fun_df(mtcars, data.frame(mpg, disp, hp, wt, qsec), col_vars = am, fun = function(x){ frm = reformulate(".", response = as.name(names(x)[1])) model = lm(frm, data = x) cbind('Coef.' = coef(model), confint(model) ) } )
data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # Simple example - there is special shortcut for it - 'cross_mean' cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = vs, fun = mean) # The same example with 'subgroup' cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = vs, subgroup = vs == 0, fun = mean) # 'combine_functions' usage cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = vs, fun = combine_functions(Mean = mean, 'Std. dev.' = sd, 'Valid N' = valid_n) ) # 'combine_functions' usage - statistic labels in columns cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = vs, fun = combine_functions(Mean = mean, 'Std. dev.' = sd, 'Valid N' = valid_n, method = list ) ) # 'summary' function cross_fun(mtcars, list(mpg, disp, hp, wt, qsec), col_vars = list(total(), am), row_vars = list(total(), vs), fun = summary ) # comparison 'cross_fun' and 'cross_fun_df' cross_fun(mtcars, data.frame(mpg, disp, hp, wt, qsec), col_vars = am, fun = mean ) # same result cross_fun_df(mtcars, data.frame(mpg, disp, hp, wt, qsec), col_vars = am, fun = colMeans ) # usage for 'cross_fun_df' which is not possible for 'cross_fun' # linear regression by groups cross_fun_df(mtcars, data.frame(mpg, disp, hp, wt, qsec), col_vars = am, fun = function(x){ frm = reformulate(".", response = as.name(names(x)[1])) model = lm(frm, data = x) cbind('Coef.' = coef(model), confint(model) ) } )
Repeats the same transformations on a specified set of variables/values
do_repeat(data, ...) as_is(...)
do_repeat(data, ...) as_is(...)
data |
data.frame/list. If |
... |
stand-in name(s) followed by equals sign and a vector/list of
replacement variables or values. They can be numeric/characters or
variables names. Names at the top-level can be unquoted (non-standard
evaluation). Quoted characters also considered as variables names. To avoid
this behavior use |
There is a special constant .N
which equals to number of
cases in data
for usage in expression inside do_repeat
. Also
there are a variables .item_num
which is equal to the current
iteration number and .item_value
which is named list with current
stand-in variables values.
transformed data.frame data
data(iris) scaled_iris = do_repeat(iris, i = Sepal.Length %to% Petal.Width, { i = scale(i) }) head(scaled_iris) # several stand-in names and standard evaluattion old_names = qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) new_names = paste0("scaled_", old_names) scaled_iris = do_repeat(iris, orig = ((old_names)), scaled = ((new_names)), { scaled = scale(orig) }) head(scaled_iris) # numerics new_df = data.frame(id = 1:20) # note the automatic creation of the sequence of variables new_df = do_repeat(new_df, item = i1 %to% i3, value = c(1, 2, 3), { item = value }) head(new_df) # the same result with internal variable '.item_num' new_df = data.frame(id = 1:20) new_df = do_repeat(new_df, item = i1 %to% i3, { item = .item_num }) head(new_df) # functions set.seed(123) new_df = data.frame(id = 1:20) new_df = do_repeat(new_df, item = c(i1, i2, i3), fun = c("rnorm", "runif", "rexp"), { item = fun(.N) }) head(new_df)
data(iris) scaled_iris = do_repeat(iris, i = Sepal.Length %to% Petal.Width, { i = scale(i) }) head(scaled_iris) # several stand-in names and standard evaluattion old_names = qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) new_names = paste0("scaled_", old_names) scaled_iris = do_repeat(iris, orig = ((old_names)), scaled = ((new_names)), { scaled = scale(orig) }) head(scaled_iris) # numerics new_df = data.frame(id = 1:20) # note the automatic creation of the sequence of variables new_df = do_repeat(new_df, item = i1 %to% i3, value = c(1, 2, 3), { item = value }) head(new_df) # the same result with internal variable '.item_num' new_df = data.frame(id = 1:20) new_df = do_repeat(new_df, item = i1 %to% i3, { item = .item_num }) head(new_df) # functions set.seed(123) new_df = data.frame(id = 1:20) new_df = do_repeat(new_df, item = c(i1, i2, i3), fun = c("rnorm", "runif", "rexp"), { item = fun(.N) }) head(new_df)
By default tables produced by functions tables, cross_cpct,
cross_fun and cross_fun_df are created with all
possible value labels. If values for this labels are absent in variable there
are NA's in rows and columns.
drop_empty_rows
/drop_empty_columns
are intended to remove
these empty rows/columns. drop_r
and drop_c
are the same
functions with shorter names. drop_rc
drops rows and columns
simultaneously.
drop_empty_rows(x, excluded_rows = NULL, excluded_columns = NULL) drop_empty_columns(x, excluded_rows = NULL, excluded_columns = NULL) drop_r(x, excluded_rows = NULL, excluded_columns = NULL) drop_c(x, excluded_rows = NULL, excluded_columns = NULL) drop_rc(x)
drop_empty_rows(x, excluded_rows = NULL, excluded_columns = NULL) drop_empty_columns(x, excluded_rows = NULL, excluded_columns = NULL) drop_r(x, excluded_rows = NULL, excluded_columns = NULL) drop_c(x, excluded_rows = NULL, excluded_columns = NULL) drop_rc(x)
x |
data.frame/etable(result of cro and etc.) |
excluded_rows |
character/logical/numeric rows which won't be dropped
and in which NAs won't be counted. If it is characters then they will be
considered as pattern/vector of patterns. Patterns will be matched with
Perl-style regular expression with values in the first column of |
excluded_columns |
logical/numeric/characters columns which won't be dropped and in which NAs won't be counted. By default for class 'etable' it is first column - column with labels in table. |
data.frame with removed rows/columns
data(mtcars) mtcars = apply_labels(mtcars, vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine 9 Other "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual 9 Other ") ) with_empty = cross_cases(mtcars, am, vs) drop_empty_rows(with_empty) drop_empty_columns(with_empty) drop_rc(with_empty)
data(mtcars) mtcars = apply_labels(mtcars, vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine 9 Other "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual 9 Other ") ) with_empty = cross_cases(mtcars, am, vs) drop_empty_rows(with_empty) drop_empty_columns(with_empty) drop_rc(with_empty)
'expss' package implements some popular functions from spreadsheets and SPSS Statistics software. Implementations are not complete copies of their originals. I try to make them consistent with other R functions. See examples in the vignette and in the help.
IF ifelse
AVERAGE mean_row
SUM sum_row
MIN min_row
MAX max_row
VLOOKUP vlookup
COUNTIF count_if
AVERAGEIF mean_row_if
SUMIF sum_row_if
MINIF min_row_if
MAXIF max_row_if
IFS ifs
IFNA if_na
MATCH match_row
INDEX index_row
PIVOT TABLES tables, cross_fun, cross_cpct
RECODE recode
COUNT count_row_if
VARIABLE LABELS var_lab
VALUE LABELS val_lab
ANY any_in_row
FREQUENCIES fre
CROSSTABS cro
CUSTOM TABLES tables
All options can be set with options(option.name = option.value)
or
with special functions (see below). You can get value of option with
getOption("option.name").
expss.digits
Number of digits after decimal separator which
will be shown for tables. This parameter is supported in the
as.datatable_widget, htmlTable.etable and print
methods. NULL
is default and means one digit. NA
means no
rounding. There is a convenience function for this option:
expss_digits
.
expss.round_half_to_even
Determines which rounding we will use.
Default rounding is as with R round
: "half to even". For rounding "half to largest"
set this option to FALSE. This parameter is supported in the as.datatable_widget,
htmlTable.etable, print
and in significance testing
methods. NULL
is default and means R default rounding. NA
means no
rounding. Parameter does not affect calculations, only table representation.
There is a convenience function for this option:
expss_round_half_to_even
.
expss.enable_value_labels_support
By default, all labelled
variables will use labels as labels for factor levels when factor is
called. So, any function which calls factor/as.factor will use
value labels. In details this option changes behavior of two methods for
class labelled
- as.character
and unique
- on which
factor
depends entirely. If you have compatibility problems set this
option to zero: options(expss.enable_value_labels_support = 0)
.
Additionally there is an option for extreme value labels support:
options(expss.enable_value_labels_support = 2)
. With this value
factor
/as.factor
will take into account empty levels. See
example. It is recommended to turn off this option immediately after usage
because unique.labelled
will give weird result. Labels without values
will be added to unique values. There are shortcuts for these options:
expss_enable_value_labels_support()
,
expss_enable_value_labels_support_extreme()
and
expss_disable_value_labels_support()
.
expss.output
By default tables are printed in the console.
You can change this behavior by setting this option. There are five possible
values: 'rnotebook'
, 'viewer'
, 'commented'
,
'raw'
or 'huxtable'
. First option is useful when you run your code in the R Notebook
- output will be rendered to nice HTML. The second option will render tables
to RStudio viewer. knitr
is supported automatically via
knit_print
method. 'commented'
prints default output to the
console with comment symbol (#
) at the beginning of the each line.
With comment symbol you can easily copy and paste your output into the
script. Option raw
disables any formatting and all tables are printed
as data.frames. Option huxtable
print output via the huxtable library.
Shortcuts for options: expss_output_default()
,
expss_output_raw()
, expss_output_viewer()
,
expss_output_commented()
, expss_output_rnotebook()
and
expss_output_huxtable()
.
expss_fix_encoding_on
/expss_fix_encoding_off
If you
expreience problems with character encoding in RStudio Viewer/RNotebooks under Windows
try expss_fix_encoding_on()
. In some cases, it can help.
expss_digits(digits = NULL) get_expss_digits() expss_round_half_to_even(round_half_to_even = TRUE) get_expss_rounding() expss_enable_value_labels_support() expss_enable_value_labels_support_extreme() expss_disable_value_labels_support() expss_output_default() expss_output_commented() expss_output_raw() expss_output_viewer() expss_output_rnotebook() expss_output_huxtable(...) expss_fix_encoding_on() expss_fix_encoding_off() expss_fre_stat_lab( label = c("Count", "Valid percent", "Percent", "Responses, %", "Cumulative responses, %") )
expss_digits(digits = NULL) get_expss_digits() expss_round_half_to_even(round_half_to_even = TRUE) get_expss_rounding() expss_enable_value_labels_support() expss_enable_value_labels_support_extreme() expss_disable_value_labels_support() expss_output_default() expss_output_commented() expss_output_raw() expss_output_viewer() expss_output_rnotebook() expss_output_huxtable(...) expss_fix_encoding_on() expss_fix_encoding_off() expss_fre_stat_lab( label = c("Count", "Valid percent", "Percent", "Responses, %", "Cumulative responses, %") )
digits |
integer. Number of digits after decimal point. |
round_half_to_even |
logical. Default is TRUE which means default R rounding: "half to even". For rounding "half to largest" set this option to FALSE. |
... |
list of parameters for |
label |
character vector of length 5. Default labels for fre. |
# example of different levels of value labels support my_scale = c(1, 2, 2, 2) # note that we have label 'Hard to say' for which there are no values in 'my_scale' val_lab(my_scale) = num_lab(" 1 Yes 2 No 3 Hard to say ") # disable labels support expss_disable_value_labels_support() table(my_scale) # there is no labels in the result unique(my_scale) # default value labels support expss_enable_value_labels_support() # table with labels but there are no label "Hard to say" table(my_scale) unique(my_scale) # extreme value labels support expss_enable_value_labels_support_extreme() # now we see "Hard to say" with zero counts table(my_scale) # weird 'unique'! There is a value 3 which is absent in 'my_scale' unique(my_scale) # return immediately to defaults to avoid issues expss_enable_value_labels_support()
# example of different levels of value labels support my_scale = c(1, 2, 2, 2) # note that we have label 'Hard to say' for which there are no values in 'my_scale' val_lab(my_scale) = num_lab(" 1 Yes 2 No 3 Hard to say ") # disable labels support expss_disable_value_labels_support() table(my_scale) # there is no labels in the result unique(my_scale) # default value labels support expss_enable_value_labels_support() # table with labels but there are no label "Hard to say" table(my_scale) unique(my_scale) # extreme value labels support expss_enable_value_labels_support_extreme() # now we see "Hard to say" with zero counts table(my_scale) # weird 'unique'! There is a value 3 which is absent in 'my_scale' unique(my_scale) # return immediately to defaults to avoid issues expss_enable_value_labels_support()
fctr
converts variable to factor. It force labels usage as factor
labels for labelled variables even if 'expss.enable_value_labels_support' set
to 0. For other types of variables base factor is called. Factor
levels are constructed as values labels. If label doesn't exist for
particular value then this value remain as is - so there is no information
lost. This levels look like as "Variable_label|Value label" if argument
prepend
set to TRUE.
fctr(x, ..., drop_unused_labels = FALSE, prepend_var_lab = TRUE)
fctr(x, ..., drop_unused_labels = FALSE, prepend_var_lab = TRUE)
x |
a vector of data with labels. |
... |
optional arguments for |
drop_unused_labels |
logical. Should we drop unused value labels? Default is FALSE. |
prepend_var_lab |
logical. Should we prepend variable label before value labels? Default is TRUE. |
an object of class factor. For details see base factor documentation.
values2labels, names2labels, val_lab,
var_lab. Materials for base functions: factor
,
as.factor
, ordered
,
as.ordered
data(mtcars) var_lab(mtcars$am) = "Transmission" val_lab(mtcars$am) = c(automatic = 0, manual=1) summary(lm(mpg ~ am, data = mtcars)) # no labels summary(lm(mpg ~ fctr(am), data = mtcars)) # with labels summary(lm(mpg ~ fctr(unvr(am)), data = mtcars)) # without variable label
data(mtcars) var_lab(mtcars$am) = "Transmission" val_lab(mtcars$am) = c(automatic = 0, manual=1) summary(lm(mpg ~ am, data = mtcars)) # no labels summary(lm(mpg ~ fctr(am), data = mtcars)) # with labels summary(lm(mpg ~ fctr(unvr(am)), data = mtcars)) # without variable label
fre
returns data.frame with six columns: labels or values, counts,
valid percent (excluding NA), percent (with NA), percent of responses(for
single-column x
it equals to valid percent) and cumulative percent of
responses.
fre( x, weight = NULL, drop_unused_labels = TRUE, prepend_var_lab = FALSE, stat_lab = getOption("expss.fre_stat_lab", c("Count", "Valid percent", "Percent", "Responses, %", "Cumulative responses, %")) )
fre( x, weight = NULL, drop_unused_labels = TRUE, prepend_var_lab = FALSE, stat_lab = getOption("expss.fre_stat_lab", c("Count", "Valid percent", "Percent", "Responses, %", "Cumulative responses, %")) )
x |
vector/data.frame/list. data.frames are considered as multiple
response variables. If |
weight |
numeric vector. Optional case weights. NA's and negative weights treated as zero weights. |
drop_unused_labels |
logical. Should we drop unused value labels? Default is TRUE. |
prepend_var_lab |
logical. Should we prepend variable label before value
labels? By default we will add variable labels to value labels only if
|
stat_lab |
character. Labels for the frequency columns. |
object of class 'etable'. Basically it's a data.frame but class is needed for custom methods.
data(mtcars) mtcars = mtcars %>% apply_labels( mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) fre(mtcars$vs) # stacked frequencies fre(list(mtcars$vs, mtcars$am)) # multiple-choice variable # brands - multiple response question # Which brands do you use during last three months? set.seed(123) brands = data.frame(t(replicate(20,sample(c(1:5,NA),4,replace = FALSE)))) # score - evaluation of tested product score = sample(-1:1,20,replace = TRUE) var_lab(brands) = "Used brands" val_lab(brands) = make_labels(" 1 Brand A 2 Brand B 3 Brand C 4 Brand D 5 Brand E ") var_lab(score) = "Evaluation of tested brand" val_lab(score) = make_labels(" -1 Dislike it 0 So-so 1 Like it ") fre(brands) # stacked frequencies fre(list(score, brands))
data(mtcars) mtcars = mtcars %>% apply_labels( mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) fre(mtcars$vs) # stacked frequencies fre(list(mtcars$vs, mtcars$am)) # multiple-choice variable # brands - multiple response question # Which brands do you use during last three months? set.seed(123) brands = data.frame(t(replicate(20,sample(c(1:5,NA),4,replace = FALSE)))) # score - evaluation of tested product score = sample(-1:1,20,replace = TRUE) var_lab(brands) = "Used brands" val_lab(brands) = make_labels(" 1 Brand A 2 Brand B 3 Brand C 4 Brand D 5 Brand E ") var_lab(score) = "Evaluation of tested brand" val_lab(score) = make_labels(" -1 Dislike it 0 So-so 1 Like it ") fre(brands) # stacked frequencies fre(list(score, brands))
This is method for rendering results of fre/cro/tables
in Shiny/RMarkdown/Jupyter notebooks and etc. For detailed description of
function and its arguments see htmlTable. You can pack your
tables in the list and render them all simultaneously. See examples. You may
be interested in expss_output_viewer()
for automatical rendering
tables in the RStudio viewer or expss_output_rnotebook()
for
rendering in the R notebooks. See expss.options. repr_html
is
method for rendering table in the Jupyter notebooks and knit_print
is
method for rendering table in the knitr
HTML-documents. Jupyter
notebooks and knitr
documents are supported automatically but in the R
notebooks it is needed to set output to notebook via
expss_output_rnotebook()
.
## S3 method for class 'etable' htmlTable( x, header = NULL, rnames = NULL, rowlabel = NULL, caption = NULL, tfoot = NULL, label = NULL, rgroup = NULL, n.rgroup = NULL, cgroup = NULL, n.cgroup = NULL, tspanner = NULL, n.tspanner = NULL, total = NULL, ctable = TRUE, compatibility = getOption("htmlTableCompat", "LibreOffice"), cspan.rgroup = "all", escape.html = FALSE, ..., digits = get_expss_digits(), row_groups = TRUE ) ## S3 method for class 'with_caption' htmlTable( x, header = NULL, rnames = NULL, rowlabel = NULL, caption = NULL, tfoot = NULL, label = NULL, rgroup = NULL, n.rgroup = NULL, cgroup = NULL, n.cgroup = NULL, tspanner = NULL, n.tspanner = NULL, total = NULL, ctable = TRUE, compatibility = getOption("htmlTableCompat", "LibreOffice"), cspan.rgroup = "all", escape.html = FALSE, ..., digits = get_expss_digits(), row_groups = TRUE ) ## S3 method for class 'list' htmlTable( x, header = NULL, rnames = NULL, rowlabel = NULL, caption = NULL, tfoot = NULL, label = NULL, rgroup = NULL, n.rgroup = NULL, cgroup = NULL, n.cgroup = NULL, tspanner = NULL, n.tspanner = NULL, total = NULL, ctable = TRUE, compatibility = getOption("htmlTableCompat", "LibreOffice"), cspan.rgroup = "all", escape.html = FALSE, ..., digits = get_expss_digits(), row_groups = TRUE, gap = "<br>" ) knit_print.etable(x, ..., digits = get_expss_digits(), escape.html = FALSE) knit_print.with_caption( x, ..., digits = get_expss_digits(), escape.html = FALSE ) repr_html.etable(obj, ..., digits = get_expss_digits(), escape.html = FALSE) repr_html.with_caption( obj, ..., digits = get_expss_digits(), escape.html = FALSE ) repr_text.etable(obj, ..., digits = get_expss_digits()) repr_text.with_caption(obj, ..., digits = get_expss_digits())
## S3 method for class 'etable' htmlTable( x, header = NULL, rnames = NULL, rowlabel = NULL, caption = NULL, tfoot = NULL, label = NULL, rgroup = NULL, n.rgroup = NULL, cgroup = NULL, n.cgroup = NULL, tspanner = NULL, n.tspanner = NULL, total = NULL, ctable = TRUE, compatibility = getOption("htmlTableCompat", "LibreOffice"), cspan.rgroup = "all", escape.html = FALSE, ..., digits = get_expss_digits(), row_groups = TRUE ) ## S3 method for class 'with_caption' htmlTable( x, header = NULL, rnames = NULL, rowlabel = NULL, caption = NULL, tfoot = NULL, label = NULL, rgroup = NULL, n.rgroup = NULL, cgroup = NULL, n.cgroup = NULL, tspanner = NULL, n.tspanner = NULL, total = NULL, ctable = TRUE, compatibility = getOption("htmlTableCompat", "LibreOffice"), cspan.rgroup = "all", escape.html = FALSE, ..., digits = get_expss_digits(), row_groups = TRUE ) ## S3 method for class 'list' htmlTable( x, header = NULL, rnames = NULL, rowlabel = NULL, caption = NULL, tfoot = NULL, label = NULL, rgroup = NULL, n.rgroup = NULL, cgroup = NULL, n.cgroup = NULL, tspanner = NULL, n.tspanner = NULL, total = NULL, ctable = TRUE, compatibility = getOption("htmlTableCompat", "LibreOffice"), cspan.rgroup = "all", escape.html = FALSE, ..., digits = get_expss_digits(), row_groups = TRUE, gap = "<br>" ) knit_print.etable(x, ..., digits = get_expss_digits(), escape.html = FALSE) knit_print.with_caption( x, ..., digits = get_expss_digits(), escape.html = FALSE ) repr_html.etable(obj, ..., digits = get_expss_digits(), escape.html = FALSE) repr_html.with_caption( obj, ..., digits = get_expss_digits(), escape.html = FALSE ) repr_text.etable(obj, ..., digits = get_expss_digits()) repr_text.with_caption(obj, ..., digits = get_expss_digits())
x |
a data object of class 'etable' - result of fre/cro and etc. |
header |
Ignored. |
rnames |
Ignored. |
rowlabel |
Ignored. |
caption |
See manual for htmlTable. |
tfoot |
See manual for htmlTable. |
label |
See manual for htmlTable. |
rgroup |
Ignored. |
n.rgroup |
Ignored. |
cgroup |
Ignored. |
n.cgroup |
Ignored. |
tspanner |
See manual for htmlTable. |
n.tspanner |
See manual for htmlTable. |
total |
See manual for htmlTable. |
ctable |
See manual for htmlTable. |
compatibility |
See manual for htmlTable. |
cspan.rgroup |
See manual for htmlTable. |
escape.html |
logical: should HTML characters be escaped? Defaults to FALSE. |
... |
further parameters for htmlTable. |
digits |
integer By default, all numeric columns are rounded to one digit after
decimal separator. Also you can set this argument by setting option 'expss.digits'
- for example, |
row_groups |
logical Should we create row groups? TRUE by default. |
gap |
character Separator between tables if we output list of tables. By default it is line break '<br>'. |
obj |
a data object of class 'etable' - result of fre/cro and etc. |
Returns a string of class htmlTable
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) expss_output_viewer() mtcars %>% tab_cols(total(), am %nest% vs) %>% tab_cells(mpg, hp) %>% tab_stat_mean() %>% tab_cells(cyl) %>% tab_stat_cpct() %>% tab_pivot() %>% set_caption("Table 1. Some variables from mtcars dataset.") # several tables in a list list( cross_cpct(mtcars, list(am, vs, cyl), list(total(), am)) %>% set_caption("Table 1. Percent."), cross_mean_sd_n(mtcars, list(mpg, hp, qsec), list(total(), am)) %>% set_caption("Table 2. Means.") ) %>% htmlTable() expss_output_default() ## End(Not run)
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) expss_output_viewer() mtcars %>% tab_cols(total(), am %nest% vs) %>% tab_cells(mpg, hp) %>% tab_stat_mean() %>% tab_cells(cyl) %>% tab_stat_cpct() %>% tab_pivot() %>% set_caption("Table 1. Some variables from mtcars dataset.") # several tables in a list list( cross_cpct(mtcars, list(am, vs, cyl), list(total(), am)) %>% set_caption("Table 1. Percent."), cross_mean_sd_n(mtcars, list(mpg, hp, qsec), list(total(), am)) %>% set_caption("Table 2. Means.") ) %>% htmlTable() expss_output_default() ## End(Not run)
if_na
replaces NA values in vector/data.frame/matrix/list with
supplied value. For single value argument label can be provided with
label
argument. If replacement value is vector then if_na
uses
for replacement values from appropriate positions. An opposite operation is na_if
.
na_if
replaces values with NA in
vector/data.frame/matrix/list. Another alias for this is mis_val
.
valid
returns logical vector which indicate the presence of at
least one not-NA value in row. For vector or single column data.frame result
is the same as with complete.cases. There is a special case
for data.frame of class dichotomy
. In this case result indicate the
presence of at least one 1 in a row.
if_na(x, value, label = NULL) if_na(x, label = NULL) <- value x %if_na% value na_if(x, value, with_labels = FALSE) na_if(x, with_labels = FALSE) <- value x %na_if% value mis_val(x, value, with_labels = FALSE) mis_val(x, with_labels = FALSE) <- value valid(x)
if_na(x, value, label = NULL) if_na(x, label = NULL) <- value x %if_na% value na_if(x, value, with_labels = FALSE) na_if(x, with_labels = FALSE) <- value x %na_if% value mis_val(x, value, with_labels = FALSE) mis_val(x, with_labels = FALSE) <- value valid(x)
x |
vector/matrix/data.frame/list |
value |
single value, vector of the same length as number of rows in
|
label |
a character of length 1. Label for |
with_labels |
logical. FALSE by default. Should we also remove labels of values which we recode to NA? |
An object of class character
of length 1.
object of the same form and class as x
. valid
returns logical vector.
# simple case a = c(NA, 2, 3, 4, NA) if_na(a, 99) # the same result a %if_na% 99 # with label a = c(NA, 2, 3, 4, NA) if_na(a, 99, label = "Hard to say") # in-place replacement. The same result: if_na(a, label = "Hard to say") = 99 a # c(99, 2, 3, 4, 99) # replacement with values from other variable a = c(NA, 2, 3, 4, NA) b = 1:5 if_na(a, b) # replacement with group means # make data.frame set.seed(123) group = sample(1:3, 30, replace = TRUE) param = runif(30) param[sample(30, 10)] = NA # place 10 NA's df = data.frame(group, param) # replace NA's with group means if_na(df$param) = window_fun(df$param, df$group, mean_col) df ###################### ### na_if examples ### ###################### a = c(1:5, 99) # 99 to NA na_if(a, 99) # c(1:5, NA) a %na_if% 99 # same result # values which greater than 4 to NA na_if(a, gt(4)) # c(1:4, NA, NA) # alias 'mis_val', with_labels = TRUE a = c(1, 1, 2, 2, 99) val_lab(a) = c(Yes = 1, No = 2, "Hard to say" = 99) mis_val(a, 99, with_labels = TRUE) set.seed(123) dfs = data.frame( a = c("bad value", "bad value", "good value", "good value", "good value"), b = runif(5) ) # rows with 'bad value' will be filled with NA # logical argument and recycling by columns na_if(dfs, dfs$a=="bad value") a = rnorm(50) # values greater than 1 or less than -1 will be set to NA # special functions usage na_if(a, lt(-1) | gt(1)) # values inside [-1, 1] to NA na_if(a, -1 %thru% 1)
# simple case a = c(NA, 2, 3, 4, NA) if_na(a, 99) # the same result a %if_na% 99 # with label a = c(NA, 2, 3, 4, NA) if_na(a, 99, label = "Hard to say") # in-place replacement. The same result: if_na(a, label = "Hard to say") = 99 a # c(99, 2, 3, 4, 99) # replacement with values from other variable a = c(NA, 2, 3, 4, NA) b = 1:5 if_na(a, b) # replacement with group means # make data.frame set.seed(123) group = sample(1:3, 30, replace = TRUE) param = runif(30) param[sample(30, 10)] = NA # place 10 NA's df = data.frame(group, param) # replace NA's with group means if_na(df$param) = window_fun(df$param, df$group, mean_col) df ###################### ### na_if examples ### ###################### a = c(1:5, 99) # 99 to NA na_if(a, 99) # c(1:5, NA) a %na_if% 99 # same result # values which greater than 4 to NA na_if(a, gt(4)) # c(1:4, NA, NA) # alias 'mis_val', with_labels = TRUE a = c(1, 1, 2, 2, 99) val_lab(a) = c(Yes = 1, No = 2, "Hard to say" = 99) mis_val(a, 99, with_labels = TRUE) set.seed(123) dfs = data.frame( a = c("bad value", "bad value", "good value", "good value", "good value"), b = runif(5) ) # rows with 'bad value' will be filled with NA # logical argument and recycling by columns na_if(dfs, dfs$a=="bad value") a = rnorm(50) # values greater than 1 or less than -1 will be set to NA # special functions usage na_if(a, lt(-1) | gt(1)) # values inside [-1, 1] to NA na_if(a, -1 %thru% 1)
info
returns data.frame with variables description and some summary
statistics. Resulting data.frame mainly intended to keep in front of eyes in
RStudio viewer or to be saved as csv to view in the spreadsheet software as
reference about working dataset.
info(x, stats = TRUE, frequencies = TRUE, max_levels = 10)
info(x, stats = TRUE, frequencies = TRUE, max_levels = 10)
x |
vector/factor/list/data.frame. |
stats |
Logical. Should we calculate summary for each variable? |
frequencies |
Logical. Should we calculate frequencies for each variable? This calculation can take significant amount of time for large datasets. |
max_levels |
Numeric. Maximum levels for using in frequency calculations. Levels above this value will convert to 'Other values'. |
data.frame with following columns: Name, Class, Length, NotNA, NA, Distincts, Label, ValueLabels, Min., 1st Qu., Median, Mean, 3rd Qu., Max., Frequency.
data(mtcars) var_lab(mtcars$am) = "Transmission" val_lab(mtcars$am) = c("Automatic"=0, "Manual"=1) info(mtcars, max_levels = 5)
data(mtcars) var_lab(mtcars$am) = "Transmission" val_lab(mtcars$am) = c("Automatic"=0, "Manual"=1) info(mtcars, max_levels = 5)
keep
selects variables/elements from data.frame by their names or by
criteria (see criteria). except
drops variables/elements from
data.frame by their names or by criteria. Names at the top-level can be
unquoted (non-standard evaluation). For standard evaluation of parameters you
can surround them by round brackets. See examples. Methods for list will apply
keep
/except
to each element of the list separately.
keep(data, ...) except(data, ...)
keep(data, ...) except(data, ...)
data |
data.frame/matrix/list |
... |
column names of type character/numeric or criteria/logical functions |
object of the same type as data
data(iris) columns(iris, Sepal.Length, Sepal.Width) columns(iris, -Species) columns(iris, Species, "^.") # move 'Species' to the first position columns(iris, -"^Petal") # remove columns which names start with 'Petal' columns(iris, -5) # remove fifth column data(mtcars) columns(mtcars, mpg:qsec) # keep columns from 'mpg' to 'qsec' columns(mtcars, mpg %to% qsec) # the same result # standard and non-standard evaluation many_vars = c("am", "vs", "cyl") columns(mtcars, many_vars) # character expansion dfs = data.frame( a = rep(10, 5), b_1 = rep(11, 5), b_2 = rep(12, 5), b_3 = rep(12, 5), b_4 = rep(14, 5), b_5 = rep(15, 5) ) i = 1:5 columns(dfs, b_1 %to% b_5) columns(dfs, "b_{i}") # the same result
data(iris) columns(iris, Sepal.Length, Sepal.Width) columns(iris, -Species) columns(iris, Species, "^.") # move 'Species' to the first position columns(iris, -"^Petal") # remove columns which names start with 'Petal' columns(iris, -5) # remove fifth column data(mtcars) columns(mtcars, mpg:qsec) # keep columns from 'mpg' to 'qsec' columns(mtcars, mpg %to% qsec) # the same result # standard and non-standard evaluation many_vars = c("am", "vs", "cyl") columns(mtcars, many_vars) # character expansion dfs = data.frame( a = rep(10, 5), b_1 = rep(11, 5), b_2 = rep(12, 5), b_3 = rep(12, 5), b_4 = rep(14, 5), b_5 = rep(15, 5) ) i = 1:5 columns(dfs, b_1 %to% b_5) columns(dfs, "b_{i}") # the same result
match
finds value in rows or columns. index
returns value by index
from row or column. One can use functions as criteria for match
. In
this case position of first value on which function equals to TRUE will be
returned. For convenience there are special predefined functions - see
criteria. If value is not found then NA will be returned.
match_row(criterion, ...) match_col(criterion, ...) index_row(index, ...) index_col(index, ...) value_row_if(criterion, ...) value_col_if(criterion, ...)
match_row(criterion, ...) match_col(criterion, ...) index_row(index, ...) index_col(index, ...) value_row_if(criterion, ...) value_col_if(criterion, ...)
criterion |
Vector of values to be matched, or function. |
... |
data. Vectors, matrixes, data.frames, lists. Shorter arguments will be recycled. |
index |
vector of positions in rows/columns from which values should be returned. |
vector with length equals to number of rows for *_row and equals to number of columns for *_col.
# toy data v1 = 1:3 v2 = 2:4 v3 = 7:5 # postions of 1,3,5 in rows match_row(c(1, 3, 5), v1, v2, v3) # 1:3 # postions of 1,3,5 in columns match_col(1, v1, v2, v3) # c(v1 = 1, v2 = NA, v3 = NA) # postion of first value greater than 2 ix = match_row(gt(2), v1, v2, v3) ix # c(3,2,1) # return values by result of previous 'match_row' index_row(ix, v1, v2, v3) # c(7,3,3) # the same actions with data.frame dfs = data.frame(v1, v2, v3) # postions of 1,3,5 in rows match_row(c(1, 3, 5), dfs) # 1:3 # postions of 1,3,5 in columns match_col(1, dfs) # c(v1 = 1, v2 = NA, v3 = NA) # postion of first value greater than 2 ix = match_row(gt(2), dfs) ix # c(3,2,1) # return values by result of previous 'match_row' index_row(ix, dfs) # c(7,3,3)
# toy data v1 = 1:3 v2 = 2:4 v3 = 7:5 # postions of 1,3,5 in rows match_row(c(1, 3, 5), v1, v2, v3) # 1:3 # postions of 1,3,5 in columns match_col(1, v1, v2, v3) # c(v1 = 1, v2 = NA, v3 = NA) # postion of first value greater than 2 ix = match_row(gt(2), v1, v2, v3) ix # c(3,2,1) # return values by result of previous 'match_row' index_row(ix, v1, v2, v3) # c(7,3,3) # the same actions with data.frame dfs = data.frame(v1, v2, v3) # postions of 1,3,5 in rows match_row(c(1, 3, 5), dfs) # 1:3 # postions of 1,3,5 in columns match_col(1, dfs) # c(v1 = 1, v2 = NA, v3 = NA) # postion of first value greater than 2 ix = match_row(gt(2), dfs) ix # c(3,2,1) # return values by result of previous 'match_row' index_row(ix, dfs) # c(7,3,3)
%merge%
is infix shortcut for base merge with
all.x = TRUE
and all.y = FALSE
(left join). There is also
special method for combining results of cross_*
and fre
. For them
all = TRUE
(full join). It allows make complex tables from simple
ones. See examples. Strange result is possible if one or two arguments have
duplicates in first column (column with labels).
## S3 method for class 'etable' merge( x, y, by = 1, by.x = by, by.y = by, all = TRUE, all.x = all, all.y = all, sort = FALSE, suffixes = c("", ""), incomparables = NULL, ... )
## S3 method for class 'etable' merge( x, y, by = 1, by.x = by, by.y = by, all = TRUE, all.x = all, all.y = all, sort = FALSE, suffixes = c("", ""), incomparables = NULL, ... )
x |
data.frame or results of |
y |
data.frame or results of |
by |
for 'etable' object default is 1 (first column). For details see merge |
by.x |
For details see merge |
by.y |
For details see merge |
all |
For details see merge |
all.x |
For details see merge |
all.y |
For details see merge |
sort |
For details see merge |
suffixes |
For details see merge |
incomparables |
For details see merge |
... |
arguments to be passed to or from methods. |
data.frame
fre, cross_cpct, cross_fun, merge
data.table::setDTthreads(2) data(mtcars) # apply labels mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "V/S", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission (0 = automatic, 1 = manual)", am = c(automatic = 0, manual = 1), gear = "Number of forward gears", carb = "Number of carburetors" ) # table by 'am' tab1 = cross_cpct(mtcars, gear, am) # table with percents tab2 = cross_cpct(mtcars, gear, vs) # combine tables tab1 %>% merge(tab2) # complex tables # table with counts counts = cross_cases(mtcars, list(vs, am, gear, carb), list("Count")) # table with percents percents = cross_cpct(mtcars, list(vs, am, gear, carb), list("Column, %")) # combine tables counts %>% merge(percents)
data.table::setDTthreads(2) data(mtcars) # apply labels mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "V/S", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission (0 = automatic, 1 = manual)", am = c(automatic = 0, manual = 1), gear = "Number of forward gears", carb = "Number of carburetors" ) # table by 'am' tab1 = cross_cpct(mtcars, gear, am) # table with percents tab2 = cross_cpct(mtcars, gear, vs) # combine tables tab1 %>% merge(tab2) # complex tables # table with counts counts = cross_cases(mtcars, list(vs, am, gear, carb), list("Count")) # table with percents percents = cross_cpct(mtcars, list(vs, am, gear, carb), list("Column, %")) # combine tables counts %>% merge(percents)
These functions are intended for usage with tables - tables,
cross_cpct, cross_fun. Result of mrset
is considered as
muliple-response set with category encoding and result of mdset
is
considered as multiple response set with dichotomy (dummy) encoding e. g.
with 0 or 1 in the each column. Each column in the dichotomy
is
indicator of absence or presence of particular feature. Both functions don't
convert its arguments to anything - it is supposed that arguments already
have appropriate encoding. For conversation see as.dichotomy or
as.category.
mrset_f
and mdset_f
select variables by fixed pattern. Fixed
pattern can be unquoted. For details see ..f.
mrset_p
and mdset_p
select variables for
multiple-responses by perl-style regular expresssion. For details see ..p.
mrset_t
and mdset_t
select variables by expanding text
arguments. For details see ..t and text_expand.
mrset(..., label = NULL) mdset(..., label = NULL) mrset_f(..., label = NULL) mdset_f(..., label = NULL) mrset_p(..., label = NULL) mdset_p(..., label = NULL) mrset_t(..., label = NULL) mdset_t(..., label = NULL)
mrset(..., label = NULL) mdset(..., label = NULL) mrset_f(..., label = NULL) mdset_f(..., label = NULL) mrset_p(..., label = NULL) mdset_p(..., label = NULL) mrset_t(..., label = NULL) mdset_t(..., label = NULL)
... |
variables |
label |
character optional label for multiple response set |
data.frame of class category
/dichotomy
data.table::setDTthreads(2) data(product_test) cross_cpct(product_test, mrset(a1_1 %to% a1_6)) # same result cross_cpct(product_test, mrset_f(a1_)) # same result cross_cpct(product_test, mrset_p("a1_")) # same result cross_cpct(product_test, mrset_t("a1_{1:6}"))
data.table::setDTthreads(2) data(product_test) cross_cpct(product_test, mrset(a1_1 %to% a1_6)) # same result cross_cpct(product_test, mrset_f(a1_)) # same result cross_cpct(product_test, mrset_p("a1_")) # same result cross_cpct(product_test, mrset_t("a1_{1:6}"))
Function is added to workaround strange bug with data.table (issue #10).
name_dots(...)
name_dots(...)
... |
arguments |
list
names2labels
replaces data.frame/list names with corresponding
variables labels. If there are no labels for some variables their names
remain unchanged. n2l
is just shortcut for names2labels
.
names2labels(x, exclude = NULL, keep_names = FALSE) n2l(x, exclude = NULL, keep_names = FALSE)
names2labels(x, exclude = NULL, keep_names = FALSE) n2l(x, exclude = NULL, keep_names = FALSE)
x |
data.frame/list. |
exclude |
logical/integer/character columns which names should be left unchanged. Only applicable to list/data.frame. |
keep_names |
logical. If TRUE original column names will be kept with labels. Only applicable to list/data.frame. |
Object of the same type as x but with variable labels instead of names.
values2labels, val_lab, var_lab
data(mtcars) mtcars = mtcars %>% apply_labels( mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # without original names # note: we exclude dependent variable 'mpg' from conversion to use its short name in formula summary(lm(mpg ~ ., data = names2labels(mtcars, exclude = "mpg"))) # with names summary(lm(mpg ~ ., data = names2labels(mtcars, exclude = "mpg", keep_names = TRUE)))
data(mtcars) mtcars = mtcars %>% apply_labels( mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # without original names # note: we exclude dependent variable 'mpg' from conversion to use its short name in formula summary(lm(mpg ~ ., data = names2labels(mtcars, exclude = "mpg"))) # with names summary(lm(mpg ~ ., data = names2labels(mtcars, exclude = "mpg", keep_names = TRUE)))
nest
mainly intended for usage with table functions such as
cro. See examples. %nest%
is infix version of this function.
You can apply nest
on multiple-response variables/list of variables
and data.frames.
nest(...) x %nest% y
nest(...) x %nest% y
... |
vectors/data.frames/lists |
x |
vector/data.frame/list |
y |
vector/data.frame/list |
vector/data.frame/list
See also interaction
data(mtcars) mtcars = apply_labels(mtcars, cyl = "Number of cylinders", vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual "), carb = "Number of carburetors" ) data.table::setDTthreads(2) # for running on CRAN cross_cases(mtcars, cyl, am %nest% vs) # list of variables cross_cases(mtcars, cyl, am %nest% list(vs, cyl)) # list of variables - multiple banners/multiple nesting cross_cases(mtcars, cyl, list(total(), list(am, vs) %nest% cyl)) # three variables cross_cases(mtcars, am %nest% vs %nest% carb, cyl) # the same with usual version cross_cases(mtcars, cyl, nest(am, vs)) # three variables cross_cases(mtcars, nest(am, vs, carb), cyl)
data(mtcars) mtcars = apply_labels(mtcars, cyl = "Number of cylinders", vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual "), carb = "Number of carburetors" ) data.table::setDTthreads(2) # for running on CRAN cross_cases(mtcars, cyl, am %nest% vs) # list of variables cross_cases(mtcars, cyl, am %nest% list(vs, cyl)) # list of variables - multiple banners/multiple nesting cross_cases(mtcars, cyl, list(total(), list(am, vs) %nest% cyl)) # three variables cross_cases(mtcars, am %nest% vs %nest% carb, cyl) # the same with usual version cross_cases(mtcars, cyl, nest(am, vs)) # three variables cross_cases(mtcars, nest(am, vs, carb), cyl)
'subtotal' adds subtotal to set of categories, 'net' replaces categories with their net value. If you provide named arguments then name will be used as label for subtotal. In other case labels will be automatically generated taking into account arguments 'new_label' and 'prefix'. Note that if you provide overlapping categories then net and subtotals will also be overlapping. 'subtotal' and 'net' are intended for usage with cro and friends. 'tab_subtotal_*' and 'tab_net_*' are intended for usage with custom tables - see tables. There are auxiliary functions 'hide' and 'unhide'. 'hide' is used with 'subtotal' when you need to leave only subtotal for some specific items. And 'unhide' is used with 'net' when you want to show items for some nets. See examples.
net( x, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last"), add = FALSE ) subtotal( x, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last"), add = TRUE ) tab_net_cells( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_net_cols( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_net_rows( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_subtotal_cells( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_subtotal_cols( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_subtotal_rows( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) hide(category) unhide(category)
net( x, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last"), add = FALSE ) subtotal( x, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last"), add = TRUE ) tab_net_cells( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_net_cols( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_net_rows( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_subtotal_cells( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_subtotal_cols( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) tab_subtotal_rows( data, ..., position = c("below", "above", "top", "bottom"), prefix = "TOTAL ", new_label = c("all", "range", "first", "last") ) hide(category) unhide(category)
x |
variable, list, data.frame or multiple response set |
... |
list of categories for grouping. It can be numeric vectors (for example, 1:2), ranges (for example, 4 greater(5)). If an argument is named then this name will be used as label for subtotal. |
position |
position of the subtotal or net relative to original categories. "below" by default. One of the "below", "above", "top", "bottom". "top" and "bottom" place nets and subtotals above or below all other categories. For nets "below" and "above" have no difference because original categories are removed. |
prefix |
character, "TOTAL " by default. It is a prefix to automatically created labels for nets and subtotals. |
new_label |
how we will combine original values for automatically generated subtotal labels. Possible values are "all", "range", "first", "last". "all" collapse all labels, "range" take only first and last label, |
add |
logical. Should we add subtotal to categories or replace categories with a net? |
data |
intermediate table. See tables. |
category |
category (numeric vectors, ranges, criteria) which you want to 'hide' or 'unhide'. |
multiple response set or list of the multiple response sets
data.table::setDTthreads(2) ol = c(1:7, 99) var_lab(ol) = "Liking" val_lab(ol) = num_lab(" 1 Disgusting 2 Very Poor 3 Poor 4 So-so 5 Good 6 Very good 7 Excellent 99 Hard to say ") cro(subtotal(ol, BOTTOM = 1:3, TOP = 6:7, position = "top")) # example with hide cro(subtotal(ol, TOP1 = hide(7), TOP2 = hide(6:7), TOP3 = 5:7, BOTTOM = 1:3, position = "top")) # autolabelling cro(subtotal(ol, 1:3, 6:7)) # replace original codes and another way of autolabelling cro(net(ol, 1:3, 6:7, new_label = "range", prefix = "NET ")) # unhide cro(net(ol, 1:3, unhide(6:7), new_label = "range", prefix = "NET ")) # character variable and criteria usage items = c("apple", "banana", "potato", "orange", "onion", "tomato", "pineapple") cro( subtotal(items, "TOTAL FRUITS" = like("*ap*") | like("*an*"), "TOTAL VEGETABLES" = like("*to*") | like("*on*"), position = "bottom") ) # 'tab_net_*' usage data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", gear = c( One = 1, Two = 2, Three = 3, Four = 4, Five = 5 ) ) mtcars %>% tab_cells(mpg) %>% tab_net_cells("Low mpg" = less(mean(mpg)), "High mpg" = greater_or_equal(mean(mpg))) %>% tab_cols(total(), am) %>% tab_stat_cases() %>% tab_pivot() mtcars %>% tab_cells(mpg) %>% tab_rows(gear) %>% tab_subtotal_rows(1:2, 3:4, "5 and more" = greater(4)) %>% tab_stat_mean() %>% tab_pivot()
data.table::setDTthreads(2) ol = c(1:7, 99) var_lab(ol) = "Liking" val_lab(ol) = num_lab(" 1 Disgusting 2 Very Poor 3 Poor 4 So-so 5 Good 6 Very good 7 Excellent 99 Hard to say ") cro(subtotal(ol, BOTTOM = 1:3, TOP = 6:7, position = "top")) # example with hide cro(subtotal(ol, TOP1 = hide(7), TOP2 = hide(6:7), TOP3 = 5:7, BOTTOM = 1:3, position = "top")) # autolabelling cro(subtotal(ol, 1:3, 6:7)) # replace original codes and another way of autolabelling cro(net(ol, 1:3, 6:7, new_label = "range", prefix = "NET ")) # unhide cro(net(ol, 1:3, unhide(6:7), new_label = "range", prefix = "NET ")) # character variable and criteria usage items = c("apple", "banana", "potato", "orange", "onion", "tomato", "pineapple") cro( subtotal(items, "TOTAL FRUITS" = like("*ap*") | like("*an*"), "TOTAL VEGETABLES" = like("*to*") | like("*on*"), position = "bottom") ) # 'tab_net_*' usage data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", gear = c( One = 1, Two = 2, Three = 3, Four = 4, Five = 5 ) ) mtcars %>% tab_cells(mpg) %>% tab_net_cells("Low mpg" = less(mean(mpg)), "High mpg" = greater_or_equal(mean(mpg))) %>% tab_cols(total(), am) %>% tab_stat_cases() %>% tab_pivot() mtcars %>% tab_cells(mpg) %>% tab_rows(gear) %>% tab_subtotal_rows(1:2, 3:4, "5 and more" = greater(4)) %>% tab_stat_mean() %>% tab_pivot()
These functions add values/variable names as prefixes to value/variable
labels. Functions which start with tab_
intended for usage inside
table creation sequences. See examples and tables. It is recommended
to use tab_prepend_*
at the start of sequence of tables creation. If
you use it in the middle of the sequence then previous statements will not be
affected.
prepend_values(x) prepend_names(x) prepend_all(x) tab_prepend_values(data) tab_prepend_names(data) tab_prepend_all(data)
prepend_values(x) prepend_names(x) prepend_all(x) tab_prepend_values(data) tab_prepend_names(data) tab_prepend_all(data)
x |
vector/data.frame. |
data |
data.frame/intermediate result of tables construction. See tables. |
original object with prepended names/values to labels
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # prepend names and 'cross_cpct' mtcars %>% prepend_names %>% cross_cpct(list(cyl, gear), list(total(), vs, am)) # prepend values to value labels mtcars %>% tab_prepend_values %>% tab_cols(total(), vs, am) %>% tab_cells(cyl, gear) %>% tab_stat_cpct() %>% tab_pivot() # prepend names and labels mtcars %>% tab_prepend_all %>% tab_cols(total(), vs, am) %>% tab_cells(cyl, gear) %>% tab_stat_cpct() %>% tab_pivot() # variable in rows without prefixes mtcars %>% tab_cells(cyl, gear) %>% tab_prepend_all %>% tab_cols(total(), vs, am) %>% tab_stat_cpct() %>% tab_pivot() ## End(Not run)
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # prepend names and 'cross_cpct' mtcars %>% prepend_names %>% cross_cpct(list(cyl, gear), list(total(), vs, am)) # prepend values to value labels mtcars %>% tab_prepend_values %>% tab_cols(total(), vs, am) %>% tab_cells(cyl, gear) %>% tab_stat_cpct() %>% tab_pivot() # prepend names and labels mtcars %>% tab_prepend_all %>% tab_cols(total(), vs, am) %>% tab_cells(cyl, gear) %>% tab_stat_cpct() %>% tab_pivot() # variable in rows without prefixes mtcars %>% tab_cells(cyl, gear) %>% tab_prepend_all %>% tab_cols(total(), vs, am) %>% tab_stat_cpct() %>% tab_pivot() ## End(Not run)
It is truncated dataset with data from product test of two samples of chocolate sweets. 150 respondents tested two kinds of sweets (codenames: VSX123 and SDF546). Sample was divided into two groups (cells) of 75 respondents in each group. In cell 1 product VSX123 was presented first and then SDF546. In cell 2 sweets were presented in reversed order. Questions about respondent impressions about first product are in the block A (and about second tested product in the block B). At the end of the questionnaire there is a question about preferences between sweets.
product_test
product_test
A data frame with 150 rows and 18 variables:
Respondent Id.
First tested product (cell number).
Age.
What did you like in these sweets? Multiple response. First tested product.
(continue) What did you like in these sweets? Multiple response. First tested product.
(continue) What did you like in these sweets? Multiple response. First tested product.
(continue) What did you like in these sweets? Multiple response. First tested product.
(continue) What did you like in these sweets? Multiple response. First tested product.
(continue) What did you like in these sweets? Multiple response. First tested product.
Overall liking. First tested product.
What did you like in these sweets? Multiple response. Second tested product.
(continue) What did you like in these sweets? Multiple response. Second tested product.
(continue) What did you like in these sweets? Multiple response. Second tested product.
(continue) What did you like in these sweets? Multiple response. Second tested product.
(continue) What did you like in these sweets? Multiple response. Second tested product.
(continue) What did you like in these sweets? Multiple response. Second tested product.
Overall liking. Second tested product.
Preferences.
prop
returns proportion to sum of entire x
. prop_col
returns proportion to sum of each column of x
. prop_row
returns
proportion to sum of each row of x
. Non-numeric columns in the
data.frame are ignored. NA's are also ignored.
prop(x) prop_col(x) prop_row(x)
prop(x) prop_col(x) prop_row(x)
x |
numeric vector/matrix/data.frame |
the same structure as x
but with proportions of original
values from sum of original values.
a = c(25, 25, NA) prop(a) # data.frame with non-numeric columns fac = factor(c("a", "b", "c")) char = c("a", "b", "c") dat = as.POSIXct("2016-09-27") a = sheet(fac, a = c(25, 25, NA), b = c(100, NA, 50), char, dat) prop(a) prop_row(a) prop_col(a) # the same as result as with 'prop.table' tbl = table(state.division, state.region) prop(tbl) prop_row(tbl) prop_col(tbl)
a = c(25, 25, NA) prop(a) # data.frame with non-numeric columns fac = factor(c("a", "b", "c")) char = c("a", "b", "c") dat = as.POSIXct("2016-09-27") a = sheet(fac, a = c(25, 25, NA), b = c(100, NA, 50), char, dat) prop(a) prop_row(a) prop_col(a) # the same as result as with 'prop.table' tbl = table(state.division, state.region) prop(tbl) prop_row(tbl) prop_col(tbl)
qc
It is often needed to address variables in the data.frame in the such
manner: dfs[ , c("var1", "var2", "var3")]
. qc
("quoted c") is a
shortcut for the such cases to reduce keystrokes. With qc
you can write:
dfs[ , qc(var1, var2, var3)]
.
qe
returns list of expression.
qc(...) qe(...)
qc(...) qe(...)
... |
unquoted names of variables in
|
Vector of characters or expressions
## qc qc(a, b, c) identical(qc(a, b, c), c("a", "b", "c")) mtcars[, qc(am, mpg, gear)] ## qe qe(mrset(a1 %to% a6), mrset(b1 %to% b6), mrset(c1 %to% c6))
## qc qc(a, b, c) identical(qc(a, b, c), c("a", "b", "c")) mtcars[, qc(am, mpg, gear)] ## qe qe(mrset(a1 %to% a6), mrset(b1 %to% b6), mrset(c1 %to% c6))
read_spss
reads data from a file stored in SPSS *.sav format. It
returns data.frame and never converts string variables to factors. Also it
prepares SPSS values/variables labels for working with
val_lab
/var_lab
functions. User-missings values are ignored.
read_spss
is simple wrapper around read.spss
function from
package foreign
.
read_spss(file, reencode = TRUE, use_missings = FALSE, ...) read_spss_to_list(file, reencode = TRUE, use_missings = FALSE, ...)
read_spss(file, reencode = TRUE, use_missings = FALSE, ...) read_spss_to_list(file, reencode = TRUE, use_missings = FALSE, ...)
file |
Character string: the name of the file or URL to read. |
reencode |
logical: should character strings be re-encoded to the current locale. The default is TRUE. NA means to do so in a UTF-8 locale, only. Alternatively, a character string specifying an encoding to assume for the file. |
use_missings |
logical: should information on user-defined missing values be used to set the corresponding values to NA? |
... |
further parameters for read.spss |
read_spss
returns data.frame.
read.spss in package foreign
, val_lab,
var_lab
## Not run: w = read_spss("project_123.sav") # to data.frame ## End(Not run)
## Not run: w = read_spss("project_123.sav") # to data.frame ## End(Not run)
recode
change, rearrange or consolidate the values of an existing
variable based on conditions. Design of this function inspired by RECODE from
SPSS. Sequence of recodings provided in the form of formulas. For example,
1:2 ~ 1 means that all 1's and 2's will be replaced with 1. Each value will be
recoded only once. In the assignment form recode(...) = ...
of this
function values which doesn't meet any condition remain unchanged. In case of
the usual form ... = recode(...)
values which doesn't meet any
condition will be replaced with NA. One can use values or more sophisticated
logical conditions and functions as a condition. There are several special
functions for usage as criteria - for details see criteria. Simple
common usage looks like: recode(x, 1:2 ~ -1, 3 ~ 0, 1:2 ~ 1, 99 ~ NA)
.
For more information, see details and examples.
The ifs
function checks whether one or more conditions are met and
returns a value that corresponds to the first TRUE condition. ifs
can
take the place of multiple nested ifelse
statements and is much
easier to read with multiple conditions. ifs
works in the same manner
as recode
- e. g. with formulas. But conditions
should be only logical and it doesn't operate on multicolumn objects.
recode( x, ..., with_labels = FALSE, new_label = c("all", "range", "first", "last") ) rec(x, ..., with_labels = TRUE, new_label = c("all", "range", "first", "last")) recode(x, with_labels = FALSE, new_label = c("all", "range", "first", "last")) <- value rec(x, with_labels = TRUE, new_label = c("all", "range", "first", "last")) <- value ifs(...) lo hi copy(x) from_to(from, to) values %into% names
recode( x, ..., with_labels = FALSE, new_label = c("all", "range", "first", "last") ) rec(x, ..., with_labels = TRUE, new_label = c("all", "range", "first", "last")) recode(x, with_labels = FALSE, new_label = c("all", "range", "first", "last")) <- value rec(x, with_labels = TRUE, new_label = c("all", "range", "first", "last")) <- value ifs(...) lo hi copy(x) from_to(from, to) values %into% names
x |
vector/matrix/data.frame/list |
... |
sequence of formulas which describe recodings. They are used when
|
with_labels |
logical. FALSE by default for 'recode' and TRUE for 'rec'. Should we also recode value labels with the same recodings as variable? |
new_label |
one of "all", "range", "first", or "last". If we recode value labels ('with_labels = TRUE') how we will combine labels for duplicated values? "all" will use all labels, "range" will use first and last labels. See examples. |
value |
list with formulas which describe recodings in assignment form
of function/ |
from |
list of conditions for values which should be recoded (in the same format as LHS of formulas). |
to |
list of values into which old values should be recoded (in the same format as RHS of formulas). |
values |
object(-s) which will be assigned to |
names |
name(-s) which will be given to |
An object of class numeric
of length 1.
An object of class numeric
of length 1.
Input conditions - possible values for left-hand side (LHS) of formula or
element of from
list:
vector/single value All values in x
which equal to elements of the
vector in LHS will be replaced with RHS.
function Values for which function gives TRUE will be replaced with RHS. There are some special functions for the convenience - see criteria.
single logical value TRUE
It means all other unrecoded values
(ELSE in SPSS RECODE). All other unrecoded values will be changed to RHS of the
formula or appropriate element of to
.
Output values - possible values for right-hand side (RHS) of formula or
element of to
list:
value replace elements of x
. This value will be
recycled across rows and columns of x
.
vector values of this vector will replace values in the corresponding
position in rows of x
. Vector will be recycled across columns of
x
.
function This function will be applied to values of x
which
satisfy recoding condition. There is a special auxiliary function copy
which just returns its argument. So, in the recode
it just copies old
value (COPY in SPSS RECODE). See examples.
%into%
tries to mimic SPSS 'INTO'. Values from left-hand side will
be assigned to right-hand side. You can use %to%
expression in the
RHS of %into%
. See examples.
lo
and hi
are shortcuts for -Inf
and Inf
. They
can be useful in expressions with %thru%
, e. g. 1 %thru%
hi
.
object of the same form as x
with recoded values
# examples from SPSS manual # RECODE V1 TO V3 (0=1) (1=0) (2, 3=-1) (9=9) (ELSE=SYSMIS) v1 = c(0, 1, 2, 3, 9, 10) recode(v1) = c(0 ~ 1, 1 ~ 0, 2:3 ~ -1, 9 ~ 9, TRUE ~ NA) v1 # RECODE QVAR(1 THRU 5=1)(6 THRU 10=2)(11 THRU HI=3)(ELSE=0). qvar = c(1:20, 97, NA, NA) recode(qvar, 1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, 11 %thru% hi ~ 3, TRUE ~ 0) # the same result recode(qvar, 1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, ge(11) ~ 3, TRUE ~ 0) # RECODE STRNGVAR ('A', 'B', 'C'='A')('D', 'E', 'F'='B')(ELSE=' '). strngvar = LETTERS recode(strngvar, c('A', 'B', 'C') ~ 'A', c('D', 'E', 'F') ~ 'B', TRUE ~ ' ') # recode in place. Note that we recode only first six letters recode(strngvar) = c(c('A', 'B', 'C') ~ 'A', c('D', 'E', 'F') ~ 'B') strngvar # RECODE AGE (MISSING=9) (18 THRU HI=1) (0 THRU 18=0) INTO VOTER. age = c(NA, 2:40, NA) voter = recode(age, NA ~ 9, 18 %thru% hi ~ 1, 0 %thru% 18 ~ 0) voter # the same result with '%into%' recode(age, NA ~ 9, 18 %thru% hi ~ 1, 0 %thru% 18 ~ 0) %into% voter2 voter2 # recode with adding labels voter = recode(age, "Refuse to answer" = NA ~ 9, "Vote" = 18 %thru% hi ~ 1, "Don't vote" = 0 %thru% 18 ~ 0) voter # recoding with labels ol = c(1:7, 99) var_lab(ol) = "Liking" val_lab(ol) = num_lab(" 1 Disgusting 2 Very Poor 3 Poor 4 So-so 5 Good 6 Very good 7 Excellent 99 Hard to say ") recode(ol, 1:3 ~ 1, 5:7 ~ 7, TRUE ~ copy, with_labels = TRUE) # 'rec' is a shortcut for recoding with labels. Same result: rec(ol, 1:3 ~ 1, 5:7 ~ 7, TRUE ~ copy) # another method of combining labels recode(ol, 1:3 ~ 1, 5:7 ~ 7, TRUE ~ copy, with_labels = TRUE, new_label = "range") # example with from/to notation # RECODE QVAR(1 THRU 5=1)(6 THRU 10=2)(11 THRU HI=3)(ELSE=0). list_from = list(1 %thru% 5, 6 %thru% 10, ge(11), TRUE) list_to = list(1, 2, 3, 0) recode(qvar, from_to(list_from, list_to)) list_from = list(NA, 18 %thru% hi, 0 %thru% 18) list_to = list("Refuse to answer" = 9, "Vote" = 1, "Don't vote" = 0) voter = recode(age, from_to(list_from, list_to)) voter # 'ifs' examples a = 1:5 b = 5:1 ifs(b>3 ~ 1) # c(1, 1, NA, NA, NA) ifs(b>3 ~ 1, TRUE ~ 3) # c(1, 1, 3, 3, 3) ifs(b>3 ~ 1, a>4 ~ 7, TRUE ~ 3) # c(1, 1, 3, 3, 7) ifs(b>3 ~ a, TRUE ~ 42) # c(1, 2, 42, 42, 42) # advanced usage #' # multiple assignment with '%into%' set.seed(123) x1 = runif(30) x2 = runif(30) x3 = runif(30) # note nessesary brackets around RHS of '%into%' recode(x1 %to% x3, gt(0.5) ~ 1, other ~ 0) %into% (x_rec_1 %to% x_rec_3) fre(x_rec_1) # the same operation with characters expansion i = 1:3 recode(x1 %to% x3, gt(0.5) ~ 1, other ~ 0) %into% text_expand('x_rec2_{i}') fre(x_rec2_1) # factor recoding a = factor(letters[1:4]) recode(a, "a" ~ "z", TRUE ~ copy) # we get factor # example with function in RHS data(iris) new_iris = recode(iris, is.numeric ~ scale, other ~ copy) str(new_iris) set.seed(123) a = rnorm(20) # if a<(-0.5) we change it to absolute value of a (abs function) recode(a, lt(-0.5) ~ abs, other ~ copy) # the same example with logical criteria recode(a, when(a<(-.5)) ~ abs, other ~ copy)
# examples from SPSS manual # RECODE V1 TO V3 (0=1) (1=0) (2, 3=-1) (9=9) (ELSE=SYSMIS) v1 = c(0, 1, 2, 3, 9, 10) recode(v1) = c(0 ~ 1, 1 ~ 0, 2:3 ~ -1, 9 ~ 9, TRUE ~ NA) v1 # RECODE QVAR(1 THRU 5=1)(6 THRU 10=2)(11 THRU HI=3)(ELSE=0). qvar = c(1:20, 97, NA, NA) recode(qvar, 1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, 11 %thru% hi ~ 3, TRUE ~ 0) # the same result recode(qvar, 1 %thru% 5 ~ 1, 6 %thru% 10 ~ 2, ge(11) ~ 3, TRUE ~ 0) # RECODE STRNGVAR ('A', 'B', 'C'='A')('D', 'E', 'F'='B')(ELSE=' '). strngvar = LETTERS recode(strngvar, c('A', 'B', 'C') ~ 'A', c('D', 'E', 'F') ~ 'B', TRUE ~ ' ') # recode in place. Note that we recode only first six letters recode(strngvar) = c(c('A', 'B', 'C') ~ 'A', c('D', 'E', 'F') ~ 'B') strngvar # RECODE AGE (MISSING=9) (18 THRU HI=1) (0 THRU 18=0) INTO VOTER. age = c(NA, 2:40, NA) voter = recode(age, NA ~ 9, 18 %thru% hi ~ 1, 0 %thru% 18 ~ 0) voter # the same result with '%into%' recode(age, NA ~ 9, 18 %thru% hi ~ 1, 0 %thru% 18 ~ 0) %into% voter2 voter2 # recode with adding labels voter = recode(age, "Refuse to answer" = NA ~ 9, "Vote" = 18 %thru% hi ~ 1, "Don't vote" = 0 %thru% 18 ~ 0) voter # recoding with labels ol = c(1:7, 99) var_lab(ol) = "Liking" val_lab(ol) = num_lab(" 1 Disgusting 2 Very Poor 3 Poor 4 So-so 5 Good 6 Very good 7 Excellent 99 Hard to say ") recode(ol, 1:3 ~ 1, 5:7 ~ 7, TRUE ~ copy, with_labels = TRUE) # 'rec' is a shortcut for recoding with labels. Same result: rec(ol, 1:3 ~ 1, 5:7 ~ 7, TRUE ~ copy) # another method of combining labels recode(ol, 1:3 ~ 1, 5:7 ~ 7, TRUE ~ copy, with_labels = TRUE, new_label = "range") # example with from/to notation # RECODE QVAR(1 THRU 5=1)(6 THRU 10=2)(11 THRU HI=3)(ELSE=0). list_from = list(1 %thru% 5, 6 %thru% 10, ge(11), TRUE) list_to = list(1, 2, 3, 0) recode(qvar, from_to(list_from, list_to)) list_from = list(NA, 18 %thru% hi, 0 %thru% 18) list_to = list("Refuse to answer" = 9, "Vote" = 1, "Don't vote" = 0) voter = recode(age, from_to(list_from, list_to)) voter # 'ifs' examples a = 1:5 b = 5:1 ifs(b>3 ~ 1) # c(1, 1, NA, NA, NA) ifs(b>3 ~ 1, TRUE ~ 3) # c(1, 1, 3, 3, 3) ifs(b>3 ~ 1, a>4 ~ 7, TRUE ~ 3) # c(1, 1, 3, 3, 7) ifs(b>3 ~ a, TRUE ~ 42) # c(1, 2, 42, 42, 42) # advanced usage #' # multiple assignment with '%into%' set.seed(123) x1 = runif(30) x2 = runif(30) x3 = runif(30) # note nessesary brackets around RHS of '%into%' recode(x1 %to% x3, gt(0.5) ~ 1, other ~ 0) %into% (x_rec_1 %to% x_rec_3) fre(x_rec_1) # the same operation with characters expansion i = 1:3 recode(x1 %to% x3, gt(0.5) ~ 1, other ~ 0) %into% text_expand('x_rec2_{i}') fre(x_rec2_1) # factor recoding a = factor(letters[1:4]) recode(a, "a" ~ "z", TRUE ~ copy) # we get factor # example with function in RHS data(iris) new_iris = recode(iris, is.numeric ~ scale, other ~ copy) str(new_iris) set.seed(123) a = rnorm(20) # if a<(-0.5) we change it to absolute value of a (abs function) recode(a, lt(-0.5) ~ abs, other ~ copy) # the same example with logical criteria recode(a, when(a<(-.5)) ~ abs, other ~ copy)
sheet
and as.sheet
are shortcuts to data.frame
and
as.data.frame
with stringsAsFactors = FALSE, check.names = FALSE.
sheet(...) as.sheet(x, ...)
sheet(...) as.sheet(x, ...)
... |
objects, possibly named |
x |
object to be coerced to data.frame |
data.frame/list
# see the difference df1 = data.frame(a = letters[1:3], "This is my long name" = 1:3) df2 = sheet(a = letters[1:3], "This is my long name" = 1:3) str(df1) str(df2)
# see the difference df1 = data.frame(a = letters[1:3], "This is my long name" = 1:3) df2 = sheet(a = letters[1:3], "This is my long name" = 1:3) str(df1) str(df2)
sort_asc
sorts in ascending order and sort_desc
sorts in
descending order.
sort_asc(data, ..., na.last = FALSE) sort_desc(data, ..., na.last = TRUE)
sort_asc(data, ..., na.last = FALSE) sort_desc(data, ..., na.last = TRUE)
data |
data.frame/matrix/vector |
... |
character/numeric or criteria/logical functions (see criteria). Column names/numbers for data.frame/matrix by which object will be sorted. Names at the top-level can be unquoted (non-standard evaluation). For standard evaluation of parameters you can surround them by round brackets. See examples. Ignored for vectors. |
na.last |
for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed. |
sorted data
data(mtcars) sort_asc(mtcars, mpg) sort_asc(mtcars, cyl, mpg) # by two column # same results with column nums sort_asc(mtcars, 1) sort_asc(mtcars, 2:1) # by two column sort_asc(mtcars, 2, 1) # by two column # call with parameter sorting_columns = c("cyl", "mpg") sort_asc(mtcars, (sorting_columns))
data(mtcars) sort_asc(mtcars, mpg) sort_asc(mtcars, cyl, mpg) # by two column # same results with column nums sort_asc(mtcars, 1) sort_asc(mtcars, 2:1) # by two column sort_asc(mtcars, 2, 1) # by two column # call with parameter sorting_columns = c("cyl", "mpg") sort_asc(mtcars, (sorting_columns))
Splits data.frame into list of data.frames that can be analyzed separately.
These data.frames are sets of cases that have the same values for the
specified split variables. Any missing values in split variables are dropped
together with the corresponding values of data
. split_off
works
with lists of data.frames or objects that can be coerced to data.frame and
assumed to have compatible structure. Resulting rows will be sorted in order
of the split variables.
split_by(data, ..., drop = TRUE) split_off(data, groups = NULL, rownames = NULL)
split_by(data, ..., drop = TRUE) split_off(data, groups = NULL, rownames = NULL)
data |
data.frame for |
... |
unquoted variables names (see keep) by which |
drop |
should we drop combination of levels with zero observation? TRUE by default. |
groups |
character If it is not |
rownames |
character If it is not |
split_by
returns list of data.frames/split_off
returns data.frame
# usage of 'groups', 'rownames' data(mtcars) # add labels to dataset mtcars %>% apply_labels(mpg = "Miles/(US) gallon", disp = "Displacement (cu.in.)", wt = "Weight", hp = "Gross horsepower", vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual ") ) %>% split_by(am, vs) %>% to_list({ res = lm(mpg ~ hp + disp + wt, data = .x) cbind(Coef. = coef(res), confint(res)) }) %>% split_off(groups = TRUE, rownames = "variable")
# usage of 'groups', 'rownames' data(mtcars) # add labels to dataset mtcars %>% apply_labels(mpg = "Miles/(US) gallon", disp = "Displacement (cu.in.)", wt = "Weight", hp = "Gross horsepower", vs = "Engine", vs = num_lab(" 0 V-engine 1 Straight engine "), am = "Transmission", am = num_lab(" 0 Automatic 1 Manual ") ) %>% split_by(am, vs) %>% to_list({ res = lm(mpg ~ hp + disp + wt, data = .x) cbind(Coef. = coef(res), confint(res)) }) %>% split_off(groups = TRUE, rownames = "variable")
split_labels
/split_columns
are auxiliary functions for
post-processing tables resulted from cro/cro_fun and etc. In
these tables all labels collapsed in the first column with "|" separator.
split_columns
split first column into multiple columns with separator
(split
argument). split_table_to_df
split first column of table
and column names. Result of this operation is data.frame with character
columns.
split_labels( x, remove_repeated = TRUE, split = "\\|", fixed = FALSE, perl = FALSE ) split_columns( data, columns = 1, remove_repeated = TRUE, split = "\\|", fixed = FALSE, perl = FALSE ) split_table_to_df( data, digits = get_expss_digits(), remove_repeated = TRUE, split = "\\|", fixed = FALSE, perl = FALSE ) make_subheadings(data, number_of_columns = 1)
split_labels( x, remove_repeated = TRUE, split = "\\|", fixed = FALSE, perl = FALSE ) split_columns( data, columns = 1, remove_repeated = TRUE, split = "\\|", fixed = FALSE, perl = FALSE ) split_table_to_df( data, digits = get_expss_digits(), remove_repeated = TRUE, split = "\\|", fixed = FALSE, perl = FALSE ) make_subheadings(data, number_of_columns = 1)
x |
character vector which will be split |
remove_repeated |
logical. Default is |
split |
character vector (or object which can be coerced to such)
containing regular expression(s) (unless |
fixed |
logical. If TRUE match split exactly, otherwise use regular
expressions. Has priority over |
perl |
logical. Should Perl-compatible regexps be used? |
data |
data.frame vector which will be split |
columns |
character/numeric/logical columns in the data.frame
|
digits |
numeric. How many digits after decimal point should be left in
|
number_of_columns |
integer. Number of columns from row labels which will be used as subheadings in table. |
split_labels
returns character matrix, split_columns
returns
data.frame with columns replaced by possibly multiple columns with split
labels. split_table_to_df
returns data.frame with character columns.
data.table::setDTthreads(2) data(mtcars) # apply labels mtcars = apply_labels(mtcars, cyl = "Number of cylinders", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # all row labels in the first column tabl = mtcars %>% cross_cpct(list(cyl, gear, carb), list(total(), vs, am)) tabl # without subheadings make_subheadings(tabl) # with subheadings split_labels(tabl[[1]]) split_labels(colnames(tabl)) # replace first column with new columns split_columns(tabl) # remove repeated split_columns(tabl, remove_repeated = FALSE) split_columns(tabl) split_table_to_df(tabl) split_table_to_df(tabl)
data.table::setDTthreads(2) data(mtcars) # apply labels mtcars = apply_labels(mtcars, cyl = "Number of cylinders", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # all row labels in the first column tabl = mtcars %>% cross_cpct(list(cyl, gear, carb), list(total(), vs, am)) tabl # without subheadings make_subheadings(tabl) # with subheadings split_labels(tabl[[1]]) split_labels(colnames(tabl)) # replace first column with new columns split_columns(tabl) # remove repeated split_columns(tabl, remove_repeated = FALSE) split_columns(tabl) split_table_to_df(tabl) split_table_to_df(tabl)
These functions are intended for usage inside let, and
let_if. sum/mean/sd/median/max/min by default omits NA. any_in_*
checks existence of any TRUE in each row/column. It is equivalent of
any applied to each row/column. all_in_*
is equivalent of
all applied to each row/column.
sum_row(..., na.rm = TRUE) sum_col(..., na.rm = TRUE) mean_row(..., na.rm = TRUE) mean_col(..., na.rm = TRUE) sd_row(..., na.rm = TRUE) sd_col(..., na.rm = TRUE) median_row(..., na.rm = TRUE) median_col(..., na.rm = TRUE) max_row(..., na.rm = TRUE) max_col(..., na.rm = TRUE) min_row(..., na.rm = TRUE) min_col(..., na.rm = TRUE) apply_row(fun, ...) apply_col(fun, ...) any_in_row(..., na.rm = TRUE) any_in_col(..., na.rm = TRUE) all_in_row(..., na.rm = TRUE) all_in_col(..., na.rm = TRUE)
sum_row(..., na.rm = TRUE) sum_col(..., na.rm = TRUE) mean_row(..., na.rm = TRUE) mean_col(..., na.rm = TRUE) sd_row(..., na.rm = TRUE) sd_col(..., na.rm = TRUE) median_row(..., na.rm = TRUE) median_col(..., na.rm = TRUE) max_row(..., na.rm = TRUE) max_col(..., na.rm = TRUE) min_row(..., na.rm = TRUE) min_col(..., na.rm = TRUE) apply_row(fun, ...) apply_col(fun, ...) any_in_row(..., na.rm = TRUE) any_in_col(..., na.rm = TRUE) all_in_row(..., na.rm = TRUE) all_in_col(..., na.rm = TRUE)
... |
data. Vectors, matrixes, data.frames, list. Shorter arguments will be recycled. |
na.rm |
logical. Contrary to the base 'sum' it is TRUE by default. Should missing values (including NaN) be removed? |
fun |
custom function that will be applied to ... |
All functions except apply_*
return numeric vector of length
equals the number of argument columns/rows. Value of apply_*
depends
on supplied fun
function.
%to%, count_if, sum_if, mean_if, median_if, sd_if, min_if, max_if
iris = iris %>% let( new_median = median_row(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), new_mean = mean_row(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ) dfs = data.frame( test = 1:5, aa = rep(10, 5), b_ = rep(20, 5), b_1 = rep(11, 5), b_2 = rep(12, 5), b_4 = rep(14, 5), b_5 = rep(15, 5) ) # calculate sum of b* variables dfs %>% let( b_total = sum_row(b_, b_1 %to% b_5) ) %>% print() # conditional modification dfs %>% let_if(test %in% 2:4, b_total = sum_row(b_, b_1 %to% b_5) ) %>% print()
iris = iris %>% let( new_median = median_row(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), new_mean = mean_row(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ) dfs = data.frame( test = 1:5, aa = rep(10, 5), b_ = rep(20, 5), b_1 = rep(11, 5), b_2 = rep(12, 5), b_4 = rep(14, 5), b_5 = rep(15, 5) ) # calculate sum of b* variables dfs %>% let( b_total = sum_row(b_, b_1 %to% b_5) ) %>% print() # conditional modification dfs %>% let_if(test %in% 2:4, b_total = sum_row(b_, b_1 %to% b_5) ) %>% print()
significance_cpct
conducts z-tests between column percent in
the result of cross_cpct. Results are calculated with the same formula
as in prop.test without continuity correction.
significance_means
conducts t-tests between column means in
the result of cross_mean_sd_n. Results are calculated with the same formula
as in t.test.
significance_cases
conducts chi-squared tests on the subtable of
table with counts in the result of cross_cases. Results are calculated
with the same formula as in chisq.test.
significance_cell_chisq
compute cell chi-square test on table
with column percent. The cell chi-square test looks at each table cell and
tests whether it is significantly different from its expected value in the
overall table. For example, if it is thought that variations in political
opinions might depend on the respondent's age, this test can be used to
detect which cells contribute significantly to that dependence. Unlike the
chi-square test (significance_cases
), which is carried out on a whole
set of rows and columns, the cell chi-square test is carried out
independently on each table cell. Although the significance level of the cell
chi-square test is accurate for any given cell, the cell tests cannot be used
instead of the chi-square test carried out on the overall table. Their
purpose is simply to point to the parts of the table where dependencies
between row and column categories may exist.
For significance_cpct
and significance_means
there are three
type of comparisons which can be conducted simultaneously (argument
compare_type
):
subtable
provide comparisons between all columns inside each
subtable.
previous_column
is a comparison of each column of the subtable
with the previous column. It is useful if columns are periods or survey
waves.
first_column
provides comparison the table first column with
all other columns in the table. adjusted_first_column
is also
comparison with the first column but with adjustment for common base. It is
useful if the first column is total column and other columns are subgroups of
this total. Adjustments are made according to algorithm in IBM SPSS
Statistics Algorithms v20, p. 263. Note that with these adjustments t-tests
between means are made with equal variance assumed (as with var_equal =
TRUE
).
By now there are no adjustments for multiple-response variables (results of mrset) in the table columns so significance tests are rather approximate for such cases. Also, there are functions for the significance testing in the sequence of custom tables calculations (see tables):
tab_last_sig_cpct
, tab_last_sig_means
and
tab_last_sig_cpct
make the same tests as their analogs mentioned
above. It is recommended to use them after appropriate statistic function:
tab_stat_cpct, tab_stat_mean_sd_n and tab_stat_cases.
tab_significance_options
With this function we can set
significance options for the entire custom table creation sequence.
tab_last_add_sig_labels
This function applies
add_sig_labels
to the last calculated table - it adds labels (letters
by default) for significance to columns header. It may be useful if you want
to combine a table with significance with table without it.
tab_last_round
This function rounds numeric columns in the
last calculated table to specified number of digits. It is sometimes
needed if you want to combine table with significance with table without it.
tab_significance_options( data, sig_level = 0.05, min_base = 2, delta_cpct = 0, delta_means = 0, correct = TRUE, compare_type = "subtable", bonferroni = FALSE, subtable_marks = "greater", inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), sig_labels_chisq = c("<", ">"), keep = c("percent", "cases", "means", "sd", "bases"), row_margin = c("auto", "sum_row", "first_column"), total_marker = "#", total_row = 1, digits = get_expss_digits(), na_as_zero = FALSE, var_equal = FALSE, mode = c("replace", "append"), as_spss = FALSE ) tab_last_sig_cpct( data, sig_level = 0.05, delta_cpct = 0, min_base = 2, compare_type = "subtable", bonferroni = FALSE, subtable_marks = c("greater", "both", "less"), inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), keep = c("percent", "bases"), na_as_zero = FALSE, total_marker = "#", total_row = 1, digits = get_expss_digits(), as_spss = FALSE, mode = c("replace", "append"), label = NULL ) tab_last_sig_means( data, sig_level = 0.05, delta_means = 0, min_base = 2, compare_type = "subtable", bonferroni = FALSE, subtable_marks = c("greater", "both", "less"), inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), keep = c("means", "sd", "bases"), var_equal = FALSE, digits = get_expss_digits(), mode = c("replace", "append"), label = NULL ) tab_last_sig_cases( data, sig_level = 0.05, min_base = 2, correct = TRUE, keep = c("cases", "bases"), total_marker = "#", total_row = 1, digits = get_expss_digits(), mode = c("replace", "append"), label = NULL ) tab_last_sig_cell_chisq( data, sig_level = 0.05, min_base = 2, subtable_marks = c("both", "greater", "less"), sig_labels_chisq = c("<", ">"), correct = TRUE, keep = c("percent", "bases", "none"), row_margin = c("auto", "sum_row", "first_column"), total_marker = "#", total_row = 1, total_column_marker = "#", digits = get_expss_digits(), mode = c("replace", "append"), label = NULL ) tab_last_round(data, digits = get_expss_digits()) tab_last_add_sig_labels(data, sig_labels = LETTERS) significance_cases( x, sig_level = 0.05, min_base = 2, correct = TRUE, keep = c("cases", "bases"), total_marker = "#", total_row = 1, digits = get_expss_digits() ) significance_cell_chisq( x, sig_level = 0.05, min_base = 2, subtable_marks = c("both", "greater", "less"), sig_labels_chisq = c("<", ">"), correct = TRUE, keep = c("percent", "bases", "none"), row_margin = c("auto", "sum_row", "first_column"), total_marker = "#", total_row = 1, total_column_marker = "#", digits = get_expss_digits() ) cell_chisq(cases_matrix, row_base, col_base, total_base, correct) significance_cpct( x, sig_level = 0.05, delta_cpct = 0, min_base = 2, compare_type = "subtable", bonferroni = FALSE, subtable_marks = c("greater", "both", "less"), inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), keep = c("percent", "bases"), na_as_zero = FALSE, total_marker = "#", total_row = 1, digits = get_expss_digits(), as_spss = FALSE ) add_sig_labels(x, sig_labels = LETTERS) significance_means( x, sig_level = 0.05, delta_means = 0, min_base = 2, compare_type = "subtable", bonferroni = FALSE, subtable_marks = c("greater", "both", "less"), inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), keep = c("means", "sd", "bases"), var_equal = FALSE, digits = get_expss_digits() )
tab_significance_options( data, sig_level = 0.05, min_base = 2, delta_cpct = 0, delta_means = 0, correct = TRUE, compare_type = "subtable", bonferroni = FALSE, subtable_marks = "greater", inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), sig_labels_chisq = c("<", ">"), keep = c("percent", "cases", "means", "sd", "bases"), row_margin = c("auto", "sum_row", "first_column"), total_marker = "#", total_row = 1, digits = get_expss_digits(), na_as_zero = FALSE, var_equal = FALSE, mode = c("replace", "append"), as_spss = FALSE ) tab_last_sig_cpct( data, sig_level = 0.05, delta_cpct = 0, min_base = 2, compare_type = "subtable", bonferroni = FALSE, subtable_marks = c("greater", "both", "less"), inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), keep = c("percent", "bases"), na_as_zero = FALSE, total_marker = "#", total_row = 1, digits = get_expss_digits(), as_spss = FALSE, mode = c("replace", "append"), label = NULL ) tab_last_sig_means( data, sig_level = 0.05, delta_means = 0, min_base = 2, compare_type = "subtable", bonferroni = FALSE, subtable_marks = c("greater", "both", "less"), inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), keep = c("means", "sd", "bases"), var_equal = FALSE, digits = get_expss_digits(), mode = c("replace", "append"), label = NULL ) tab_last_sig_cases( data, sig_level = 0.05, min_base = 2, correct = TRUE, keep = c("cases", "bases"), total_marker = "#", total_row = 1, digits = get_expss_digits(), mode = c("replace", "append"), label = NULL ) tab_last_sig_cell_chisq( data, sig_level = 0.05, min_base = 2, subtable_marks = c("both", "greater", "less"), sig_labels_chisq = c("<", ">"), correct = TRUE, keep = c("percent", "bases", "none"), row_margin = c("auto", "sum_row", "first_column"), total_marker = "#", total_row = 1, total_column_marker = "#", digits = get_expss_digits(), mode = c("replace", "append"), label = NULL ) tab_last_round(data, digits = get_expss_digits()) tab_last_add_sig_labels(data, sig_labels = LETTERS) significance_cases( x, sig_level = 0.05, min_base = 2, correct = TRUE, keep = c("cases", "bases"), total_marker = "#", total_row = 1, digits = get_expss_digits() ) significance_cell_chisq( x, sig_level = 0.05, min_base = 2, subtable_marks = c("both", "greater", "less"), sig_labels_chisq = c("<", ">"), correct = TRUE, keep = c("percent", "bases", "none"), row_margin = c("auto", "sum_row", "first_column"), total_marker = "#", total_row = 1, total_column_marker = "#", digits = get_expss_digits() ) cell_chisq(cases_matrix, row_base, col_base, total_base, correct) significance_cpct( x, sig_level = 0.05, delta_cpct = 0, min_base = 2, compare_type = "subtable", bonferroni = FALSE, subtable_marks = c("greater", "both", "less"), inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), keep = c("percent", "bases"), na_as_zero = FALSE, total_marker = "#", total_row = 1, digits = get_expss_digits(), as_spss = FALSE ) add_sig_labels(x, sig_labels = LETTERS) significance_means( x, sig_level = 0.05, delta_means = 0, min_base = 2, compare_type = "subtable", bonferroni = FALSE, subtable_marks = c("greater", "both", "less"), inequality_sign = "both" %in% subtable_marks, sig_labels = LETTERS, sig_labels_previous_column = c("v", "^"), sig_labels_first_column = c("-", "+"), keep = c("means", "sd", "bases"), var_equal = FALSE, digits = get_expss_digits() )
data |
data.frame/intermediate_table for |
sig_level |
numeric. Significance level - by default it equals to |
min_base |
numeric. Significance test will be conducted if both
columns have bases greater or equal to |
delta_cpct |
numeric. Minimal delta between percent for which we mark
significant differences (in percent points) - by default it equals to zero.
Note that, for example, for minimal 5 percent point difference
|
delta_means |
numeric. Minimal delta between means for which we mark significant differences - by default it equals to zero. |
correct |
logical indicating whether to apply continuity correction when
computing the test statistic for 2 by 2 tables. Only for
|
compare_type |
Type of compare between columns. By default, it is
|
bonferroni |
logical. |
subtable_marks |
character. One of "greater", "both" or "less". By
deafult we mark only values which are significantly greater than some other
columns. For |
inequality_sign |
logical. FALSE if |
sig_labels |
character vector. Labels for marking differences between columns of subtable. |
sig_labels_previous_column |
a character vector with two elements. Labels
for marking a difference with the previous column. First mark means 'lower' (by
default it is |
sig_labels_first_column |
a character vector with two elements. Labels
for marking a difference with the first column of the table. First mark means
'lower' (by default it is |
sig_labels_chisq |
a character vector with two labels
for marking a difference with row margin of the table. First mark means
'lower' (by default it is |
keep |
character. One or more from "percent", "cases", "means", "bases", "sd" or "none". This argument determines which statistics will remain in the table after significance marking. |
row_margin |
character. One of values "auto" (default), "sum_row", or
"first_column". If it is "auto" we try to find total column in the subtable
by |
total_marker |
character. Total rows mark in the table. "#" by default. |
total_row |
integer/character. In the case of the several totals per subtable it is a number or name of total row for the significance calculation. |
digits |
an integer indicating how much digits after decimal separator |
na_as_zero |
logical. |
var_equal |
a logical variable indicating whether to treat the two variances as being equal. For details see t.test. |
mode |
character. One of |
as_spss |
a logical. FALSE by default. If TRUE, proportions which are equal to zero or one will be ignored. Also will be ignored categories with bases less than 2. |
label |
character. Label for the statistic in the |
total_column_marker |
character. Mark for total columns in the subtables. "#" by default. |
x |
table (class |
cases_matrix |
numeric matrix with counts size R*C |
row_base |
numeric vector with row bases, length R |
col_base |
numeric vector with col bases, length C |
total_base |
numeric single value, total base |
tab_last_*
functions return objects of class
intermediate_table
. Use tab_pivot to get the final result -
etable
object. Other functions return etable
object with
significant differences.
cross_cpct, cross_cases, cross_mean_sd_n, tables, compare_proportions, compare_means, prop.test, t.test, chisq.test
data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) ## Not run: mtcars_table = cross_cpct(mtcars, list(cyl, gear), list(total(), vs, am) ) significance_cpct(mtcars_table) # comparison with first column significance_cpct(mtcars_table, compare_type = "first_column") # comparison with first column and inside subtable significance_cpct(mtcars_table, compare_type = c("first_column", "subtable")) # only significance marks significance_cpct(mtcars_table, keep = "none") # means mtcars_means = cross_mean_sd_n(mtcars, list(mpg, wt, hp), list(total(), vs, cyl)) ) significance_means(mtcars_means) # mark values which are less and greater significance_means(mtcars_means, subtable_marks = "both") # chi-squared test mtcars_cases = cross_cases(mtcars, list(cyl, gear), list(total(), vs, am) ) significance_cases(mtcars_cases) # cell chi-squared test # increase number of cases to avoid warning about chi-square approximation mtcars2 = add_rows(mtcars, mtcars, mtcars) tbl = cross_cpct(mtcars2, gear, am) significance_cell_chisq(tbl) # table with multiple variables tbl = cross_cpct(mtcars2, list(gear, cyl), list(total(), am, vs)) significance_cell_chisq(tbl, sig_level = .0001) # custom tables with significance mtcars %>% tab_significance_options(subtable_marks = "both") %>% tab_cells(mpg, hp) %>% tab_cols(total(), vs, am) %>% tab_stat_mean_sd_n() %>% tab_last_sig_means(keep = "means") %>% tab_cells(cyl, gear) %>% tab_stat_cpct() %>% tab_last_sig_cpct() %>% tab_pivot() # Overcomplicated examples - we move significance marks to # separate columns. Columns with statistics remain numeric mtcars %>% tab_significance_options(keep = "none", sig_labels = NULL, subtable_marks = "both", mode = "append") %>% tab_cols(total(), vs, am) %>% tab_cells(mpg, hp) %>% tab_stat_mean_sd_n() %>% tab_last_sig_means() %>% tab_last_hstack("inside_columns") %>% tab_cells(cyl, gear) %>% tab_stat_cpct() %>% tab_last_sig_cpct() %>% tab_last_hstack("inside_columns") %>% tab_pivot(stat_position = "inside_rows") %>% drop_empty_columns() ## End(Not run)
data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) ## Not run: mtcars_table = cross_cpct(mtcars, list(cyl, gear), list(total(), vs, am) ) significance_cpct(mtcars_table) # comparison with first column significance_cpct(mtcars_table, compare_type = "first_column") # comparison with first column and inside subtable significance_cpct(mtcars_table, compare_type = c("first_column", "subtable")) # only significance marks significance_cpct(mtcars_table, keep = "none") # means mtcars_means = cross_mean_sd_n(mtcars, list(mpg, wt, hp), list(total(), vs, cyl)) ) significance_means(mtcars_means) # mark values which are less and greater significance_means(mtcars_means, subtable_marks = "both") # chi-squared test mtcars_cases = cross_cases(mtcars, list(cyl, gear), list(total(), vs, am) ) significance_cases(mtcars_cases) # cell chi-squared test # increase number of cases to avoid warning about chi-square approximation mtcars2 = add_rows(mtcars, mtcars, mtcars) tbl = cross_cpct(mtcars2, gear, am) significance_cell_chisq(tbl) # table with multiple variables tbl = cross_cpct(mtcars2, list(gear, cyl), list(total(), am, vs)) significance_cell_chisq(tbl, sig_level = .0001) # custom tables with significance mtcars %>% tab_significance_options(subtable_marks = "both") %>% tab_cells(mpg, hp) %>% tab_cols(total(), vs, am) %>% tab_stat_mean_sd_n() %>% tab_last_sig_means(keep = "means") %>% tab_cells(cyl, gear) %>% tab_stat_cpct() %>% tab_last_sig_cpct() %>% tab_pivot() # Overcomplicated examples - we move significance marks to # separate columns. Columns with statistics remain numeric mtcars %>% tab_significance_options(keep = "none", sig_labels = NULL, subtable_marks = "both", mode = "append") %>% tab_cols(total(), vs, am) %>% tab_cells(mpg, hp) %>% tab_stat_mean_sd_n() %>% tab_last_sig_means() %>% tab_last_hstack("inside_columns") %>% tab_cells(cyl, gear) %>% tab_stat_cpct() %>% tab_last_sig_cpct() %>% tab_last_hstack("inside_columns") %>% tab_pivot(stat_position = "inside_rows") %>% drop_empty_columns() ## End(Not run)
tab_sort_asc
/tab_sort_desc
sort tables (usually result of
cro/tables) in ascending/descending order between specified
rows (by default, it is rows which contain '#' in the first column).
tab_sort_asc(x, ..., excluded_rows = "#", na.last = FALSE) tab_sort_desc(x, ..., excluded_rows = "#", na.last = TRUE)
tab_sort_asc(x, ..., excluded_rows = "#", na.last = FALSE) tab_sort_desc(x, ..., excluded_rows = "#", na.last = TRUE)
x |
data.frame |
... |
character/numeric or criteria/logical functions (see criteria). Column names/numbers for data.frame/matrix by which object will be sorted. Names at the top-level can be unquoted (non-standard evaluation). For standard evaluation of parameters you can surround them by round brackets. See examples. If this argument is missing then table will be sorted by second column. Usually second column is the first column with numbers in the table (there are row labels in the first column). |
excluded_rows |
character/logical/numeric rows which won't be sorted.
Rows of the table will be sorted between excluded rows. If it is characters
then they will be considered as pattern/vector of patterns. Patterns will
be matched with Perl-style regular expression with values in the first
column of |
na.last |
for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed. |
sorted table('etable')/data.frame
## Not run: data(mtcars) # apply labels mtcars = apply_labels(mtcars, cyl = "Number of cylinders", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # without sorting mtcars %>% cross_cpct(list(cyl, gear, carb), list("#total", vs, am)) # with sorting mtcars %>% cross_cpct(list(cyl, gear, carb), list("#total", vs, am)) %>% tab_sort_desc # sort by parameter sorting_column = "Engine|V-engine" mtcars %>% cross_cpct(list(cyl, gear, carb), list("#total", vs, am)) %>% tab_sort_desc((sorting_column)) ## End(Not run)
## Not run: data(mtcars) # apply labels mtcars = apply_labels(mtcars, cyl = "Number of cylinders", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # without sorting mtcars %>% cross_cpct(list(cyl, gear, carb), list("#total", vs, am)) # with sorting mtcars %>% cross_cpct(list(cyl, gear, carb), list("#total", vs, am)) %>% tab_sort_desc # sort by parameter sorting_column = "Engine|V-engine" mtcars %>% cross_cpct(list(cyl, gear, carb), list("#total", vs, am)) %>% tab_sort_desc((sorting_column)) ## End(Not run)
Table construction consists of at least of three functions chained with
magrittr
pipe operator. At first we need to
specify variables for which statistics will be computed with
tab_cells
. Secondary, we calculate statistics with one of
tab_stat_*
functions. And last, we finalize table creation with
tab_pivot
: dataset %>% tab_cells(variable) %>%
tab_stat_cases() %>% tab_pivot()
. After that we can optionally sort table
with tab_sort_asc, drop empty rows/columns with drop_rc and
transpose with tab_transpose
. Generally, table is just a data.frame so
we can use arbitrary operations on it. Statistic is always calculated with
the last cell, column/row variables, weight, missing values and subgroup. To
define new cell/column/row variables we can call appropriate function one more time.
tab_pivot
defines how we combine different statistics and where
statistic labels will appear - inside/outside rows/columns. See examples.
For significance testing see significance.
tab_cols(data, ...) tab_cells(data, ...) tab_rows(data, ...) tab_weight(data, weight = NULL) tab_mis_val(data, ...) tab_total_label(data, ...) tab_total_statistic(data, ...) tab_total_row_position(data, total_row_position = c("below", "above", "none")) tab_subgroup(data, subgroup = NULL) tab_row_label(data, ..., label = NULL) tab_stat_fun(data, ..., label = NULL, unsafe = FALSE) tab_stat_mean_sd_n( data, weighted_valid_n = FALSE, labels = c("Mean", "Std. dev.", ifelse(weighted_valid_n, "Valid N", "Unw. valid N")), label = NULL ) tab_stat_mean(data, label = "Mean") tab_stat_median(data, label = "Median") tab_stat_se(data, label = "S. E.") tab_stat_sum(data, label = "Sum") tab_stat_min(data, label = "Min.") tab_stat_max(data, label = "Max.") tab_stat_sd(data, label = "Std. dev.") tab_stat_valid_n(data, label = "Valid N") tab_stat_unweighted_valid_n(data, label = "Unw. valid N") tab_stat_fun_df(data, ..., label = NULL, unsafe = FALSE) tab_stat_cases( data, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none"), label = NULL ) tab_stat_cpct( data, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none"), label = NULL ) tab_stat_cpct_responses( data, total_label = NULL, total_statistic = "u_responses", total_row_position = c("below", "above", "none"), label = NULL ) tab_stat_tpct( data, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none"), label = NULL ) tab_stat_rpct( data, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none"), label = NULL ) tab_last_vstack( data, stat_position = c("outside_rows", "inside_rows"), stat_label = c("inside", "outside"), label = NULL ) tab_last_hstack( data, stat_position = c("outside_columns", "inside_columns"), stat_label = c("inside", "outside"), label = NULL ) tab_pivot( data, stat_position = c("outside_rows", "inside_rows", "outside_columns", "inside_columns"), stat_label = c("inside", "outside") ) tab_transpose(data) tab_caption(data, ...)
tab_cols(data, ...) tab_cells(data, ...) tab_rows(data, ...) tab_weight(data, weight = NULL) tab_mis_val(data, ...) tab_total_label(data, ...) tab_total_statistic(data, ...) tab_total_row_position(data, total_row_position = c("below", "above", "none")) tab_subgroup(data, subgroup = NULL) tab_row_label(data, ..., label = NULL) tab_stat_fun(data, ..., label = NULL, unsafe = FALSE) tab_stat_mean_sd_n( data, weighted_valid_n = FALSE, labels = c("Mean", "Std. dev.", ifelse(weighted_valid_n, "Valid N", "Unw. valid N")), label = NULL ) tab_stat_mean(data, label = "Mean") tab_stat_median(data, label = "Median") tab_stat_se(data, label = "S. E.") tab_stat_sum(data, label = "Sum") tab_stat_min(data, label = "Min.") tab_stat_max(data, label = "Max.") tab_stat_sd(data, label = "Std. dev.") tab_stat_valid_n(data, label = "Valid N") tab_stat_unweighted_valid_n(data, label = "Unw. valid N") tab_stat_fun_df(data, ..., label = NULL, unsafe = FALSE) tab_stat_cases( data, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none"), label = NULL ) tab_stat_cpct( data, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none"), label = NULL ) tab_stat_cpct_responses( data, total_label = NULL, total_statistic = "u_responses", total_row_position = c("below", "above", "none"), label = NULL ) tab_stat_tpct( data, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none"), label = NULL ) tab_stat_rpct( data, total_label = NULL, total_statistic = "u_cases", total_row_position = c("below", "above", "none"), label = NULL ) tab_last_vstack( data, stat_position = c("outside_rows", "inside_rows"), stat_label = c("inside", "outside"), label = NULL ) tab_last_hstack( data, stat_position = c("outside_columns", "inside_columns"), stat_label = c("inside", "outside"), label = NULL ) tab_pivot( data, stat_position = c("outside_rows", "inside_rows", "outside_columns", "inside_columns"), stat_label = c("inside", "outside") ) tab_transpose(data) tab_caption(data, ...)
data |
data.frame/intermediate_table |
... |
vector/data.frame/list. Variables for tables. Use mrset/mdset for multiple-response variables. |
weight |
numeric vector in |
total_row_position |
Position of total row in the resulting table. Can be one of "below", "above", "none". |
subgroup |
logical vector in |
label |
character. Label for the statistic in the |
unsafe |
logical If TRUE than |
weighted_valid_n |
logical. Sould we show weighted valid N in
|
labels |
character vector of length 3. Labels for mean, standard
deviation and valid N in |
total_label |
By default "#Total". You can provide several names - each name for each total statistics. |
total_statistic |
By default it is "u_cases" (unweighted cases). Possible values are "u_cases", "u_responses", "u_cpct", "u_rpct", "u_tpct", "w_cases", "w_responses", "w_cpct", "w_rpct", "w_tpct". "u_" means unweighted statistics and "w_" means weighted statistics. |
stat_position |
character one of the values |
stat_label |
character one of the values |
tab_cells
variables on which percentage/cases/summary
functions will be computed. Use mrset/mdset for
multiple-response variables.
tab_cols
optional variables which breaks table by
columns. Use mrset/mdset for
multiple-response variables.
tab_rows
optional variables which breaks table by rows. Use
mrset/mdset for multiple-response variables.
tab_weight
optional weight for the statistic.
tab_mis_val
optional missing values for the statistic. It will
be applied on variables specified by tab_cells
. It works in the same
manner as na_if.
tab_subgroup
optional logical vector/expression which specify
subset of data for table.
tab_row_label
Add to table empty row with specified row
labels. It is usefull for making section headings and etc.
tab_total_row_position
Default value for
total_row_position
argument in tab_stat_cases
and etc. Can be
one of "below", "above", "none".
tab_total_label
Default value for total_label
argument
in tab_stat_cases
and etc. You can provide several names - each name
for each total statistics.
tab_total_statistic
Default value for total_statistic
argument in tab_stat_cases
and etc. You can provide several values.
Possible values are "u_cases", "u_responses", "u_cpct", "u_rpct", "u_tpct",
"w_cases", "w_responses", "w_cpct", "w_rpct", "w_tpct". "u_" means unweighted
statistics and "w_" means weighted statistics.
tab_stat_fun
, tab_stat_fun_df
tab_stat_fun
applies function on each variable in cells separately, tab_stat_fun_df
gives to function each data.frame in cells as a whole
data.table with all names converted to variable labels (if
labels exists). So it is not recommended to rely on original variables names
in your fun
. For details see cross_fun. You can provide several
functions as arguments. They will be combined as with
combine_functions. So you can use method
argument. For details
see documentation for combine_functions.
tab_stat_cases
calculate counts.
tab_stat_cpct
, tab_stat_cpct_responses
calculate column
percent. These functions give different results only for multiple response
variables. For tab_stat_cpct
base of percent is number of valid cases.
Case is considered as valid if it has at least one non-NA value. So for
multiple response variables sum of percent may be greater than 100. For
tab_stat_cpct_responses
base of percent is number of valid responses.
Multiple response variables can have several responses for single case. Sum
of percent of tab_stat_cpct_responses
always equals to 100%.
tab_stat_rpct
calculate row percent. Base
for percent is number of valid cases.
tab_stat_tpct
calculate table percent. Base
for percent is number of valid cases.
tab_stat_mean
, tab_stat_median
, tab_stat_se
,
tab_stat_sum
, tab_stat_min
, tab_stat_max
,
tab_stat_sd
, tab_stat_valid_n
,
tab_stat_unweighted_valid_n
different summary statistics. NA's are
always omitted.
tab_pivot
finalize table creation and define how different
tab_stat_*
will be combined
tab_caption
set caption on the table. Should be used after the tab_pivot
.
tab_transpose
transpose final table after tab_pivot
or last
statistic.
All of these functions return object of class
intermediate_table
except tab_pivot
which returns final
result - object of class etable
. Basically it's a data.frame but
class is needed for custom methods.
fre, cross_cases, cross_fun, tab_sort_asc, drop_empty_rows, significance.
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # some examples from 'cro' # simple example - generally with 'cro' it can be made with less typing mtcars %>% tab_cells(cyl) %>% tab_cols(vs) %>% tab_stat_cpct() %>% tab_pivot() # split rows mtcars %>% tab_cells(cyl) %>% tab_cols(vs) %>% tab_rows(am) %>% tab_stat_cpct() %>% tab_pivot() # multiple banners mtcars %>% tab_cells(cyl) %>% tab_cols(total(), vs, am) %>% tab_stat_cpct() %>% tab_pivot() # nested banners mtcars %>% tab_cells(cyl) %>% tab_cols(total(), vs %nest% am) %>% tab_stat_cpct() %>% tab_pivot() # summary statistics mtcars %>% tab_cells(mpg, disp, hp, wt, qsec) %>% tab_cols(am) %>% tab_stat_fun(Mean = w_mean, "Std. dev." = w_sd, "Valid N" = w_n) %>% tab_pivot() # summary statistics - labels in columns mtcars %>% tab_cells(mpg, disp, hp, wt, qsec) %>% tab_cols(am) %>% tab_stat_fun(Mean = w_mean, "Std. dev." = w_sd, "Valid N" = w_n, method = list) %>% tab_pivot() # subgroup with droping empty columns mtcars %>% tab_subgroup(am == 0) %>% tab_cells(cyl) %>% tab_cols(total(), vs %nest% am) %>% tab_stat_cpct() %>% tab_pivot() %>% drop_empty_columns() # total position at the top of the table mtcars %>% tab_cells(cyl) %>% tab_cols(total(), vs) %>% tab_rows(am) %>% tab_stat_cpct(total_row_position = "above", total_label = c("number of cases", "row %"), total_statistic = c("u_cases", "u_rpct")) %>% tab_pivot() # this example cannot be made easily with 'cro' mtcars %>% tab_cells(am) %>% tab_cols(total(), vs) %>% tab_total_row_position("none") %>% tab_stat_cpct(label = "col %") %>% tab_stat_rpct(label = "row %") %>% tab_stat_tpct(label = "table %") %>% tab_pivot(stat_position = "inside_rows") # statistic labels inside columns mtcars %>% tab_cells(am) %>% tab_cols(total(), vs) %>% tab_total_row_position("none") %>% tab_stat_cpct(label = "col %") %>% tab_stat_rpct(label = "row %") %>% tab_stat_tpct(label = "table %") %>% tab_pivot(stat_position = "inside_columns") # stacked statistics mtcars %>% tab_cells(cyl) %>% tab_cols(total(), am) %>% tab_stat_mean() %>% tab_stat_se() %>% tab_stat_valid_n() %>% tab_stat_cpct() %>% tab_pivot() # stacked statistics with section headings mtcars %>% tab_cells(cyl) %>% tab_cols(total(), am) %>% tab_row_label("#Summary statistics") %>% tab_stat_mean() %>% tab_stat_se() %>% tab_stat_valid_n() %>% tab_row_label("#Column percent") %>% tab_stat_cpct() %>% tab_pivot() # stacked statistics with different variables mtcars %>% tab_cols(total(), am) %>% tab_cells(mpg, hp, qsec) %>% tab_stat_mean() %>% tab_cells(cyl, carb) %>% tab_stat_cpct() %>% tab_pivot() # stacked statistics - label position outside row labels mtcars %>% tab_cells(cyl) %>% tab_cols(total(), am) %>% tab_stat_mean() %>% tab_stat_se %>% tab_stat_valid_n() %>% tab_stat_cpct(label = "Col %") %>% tab_pivot(stat_label = "outside") # example from 'cross_fun_df' - linear regression by groups with sorting mtcars %>% tab_cells(sheet(mpg, disp, hp, wt, qsec)) %>% tab_cols(total(), am) %>% tab_stat_fun_df( function(x){ frm = reformulate(".", response = as.name(names(x)[1])) model = lm(frm, data = x) sheet('Coef.' = coef(model), confint(model) ) } ) %>% tab_pivot() %>% tab_sort_desc() # multiple-response variables and weight data(product_test) codeframe_likes = num_lab(" 1 Liked everything 2 Disliked everything 3 Chocolate 4 Appearance 5 Taste 6 Stuffing 7 Nuts 8 Consistency 98 Other 99 Hard to answer ") set.seed(1) product_test = product_test %>% let( # recode age by groups age_cat = recode(s2a, lo %thru% 25 ~ 1, lo %thru% hi ~ 2), wgt = runif(.N, 0.25, 4), wgt = wgt/sum(wgt)*.N ) %>% apply_labels( age_cat = "Age", age_cat = c("18 - 25" = 1, "26 - 35" = 2), a1_1 = "Likes. VSX123", b1_1 = "Likes. SDF456", a1_1 = codeframe_likes, b1_1 = codeframe_likes ) product_test %>% tab_cells(mrset(a1_1 %to% a1_6), mrset(b1_1 %to% b1_6)) %>% tab_cols(total(), age_cat) %>% tab_weight(wgt) %>% tab_stat_cpct() %>% tab_sort_desc() %>% tab_pivot() # trick to place cell variables labels inside columns # useful to compare two variables # '|' is needed to prevent automatic labels creation from argument # alternatively we can use list(...) to avoid this product_test %>% tab_cols(total(), age_cat) %>% tab_weight(wgt) %>% tab_cells("|" = unvr(mrset(a1_1 %to% a1_6))) %>% tab_stat_cpct(label = var_lab(a1_1)) %>% tab_cells("|" = unvr(mrset(b1_1 %to% b1_6))) %>% tab_stat_cpct(label = var_lab(b1_1)) %>% tab_pivot(stat_position = "inside_columns") # if you need standard evaluation, use 'vars' tables = mtcars %>% tab_cols(total(), am %nest% vs) for(each in c("mpg", "disp", "hp", "qsec")){ tables = tables %>% tab_cells(vars(each)) %>% tab_stat_fun(Mean = w_mean, "Std. dev." = w_sd, "Valid N" = w_n) } tables %>% tab_pivot() ## End(Not run)
## Not run: data(mtcars) mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (1000 lbs)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # some examples from 'cro' # simple example - generally with 'cro' it can be made with less typing mtcars %>% tab_cells(cyl) %>% tab_cols(vs) %>% tab_stat_cpct() %>% tab_pivot() # split rows mtcars %>% tab_cells(cyl) %>% tab_cols(vs) %>% tab_rows(am) %>% tab_stat_cpct() %>% tab_pivot() # multiple banners mtcars %>% tab_cells(cyl) %>% tab_cols(total(), vs, am) %>% tab_stat_cpct() %>% tab_pivot() # nested banners mtcars %>% tab_cells(cyl) %>% tab_cols(total(), vs %nest% am) %>% tab_stat_cpct() %>% tab_pivot() # summary statistics mtcars %>% tab_cells(mpg, disp, hp, wt, qsec) %>% tab_cols(am) %>% tab_stat_fun(Mean = w_mean, "Std. dev." = w_sd, "Valid N" = w_n) %>% tab_pivot() # summary statistics - labels in columns mtcars %>% tab_cells(mpg, disp, hp, wt, qsec) %>% tab_cols(am) %>% tab_stat_fun(Mean = w_mean, "Std. dev." = w_sd, "Valid N" = w_n, method = list) %>% tab_pivot() # subgroup with droping empty columns mtcars %>% tab_subgroup(am == 0) %>% tab_cells(cyl) %>% tab_cols(total(), vs %nest% am) %>% tab_stat_cpct() %>% tab_pivot() %>% drop_empty_columns() # total position at the top of the table mtcars %>% tab_cells(cyl) %>% tab_cols(total(), vs) %>% tab_rows(am) %>% tab_stat_cpct(total_row_position = "above", total_label = c("number of cases", "row %"), total_statistic = c("u_cases", "u_rpct")) %>% tab_pivot() # this example cannot be made easily with 'cro' mtcars %>% tab_cells(am) %>% tab_cols(total(), vs) %>% tab_total_row_position("none") %>% tab_stat_cpct(label = "col %") %>% tab_stat_rpct(label = "row %") %>% tab_stat_tpct(label = "table %") %>% tab_pivot(stat_position = "inside_rows") # statistic labels inside columns mtcars %>% tab_cells(am) %>% tab_cols(total(), vs) %>% tab_total_row_position("none") %>% tab_stat_cpct(label = "col %") %>% tab_stat_rpct(label = "row %") %>% tab_stat_tpct(label = "table %") %>% tab_pivot(stat_position = "inside_columns") # stacked statistics mtcars %>% tab_cells(cyl) %>% tab_cols(total(), am) %>% tab_stat_mean() %>% tab_stat_se() %>% tab_stat_valid_n() %>% tab_stat_cpct() %>% tab_pivot() # stacked statistics with section headings mtcars %>% tab_cells(cyl) %>% tab_cols(total(), am) %>% tab_row_label("#Summary statistics") %>% tab_stat_mean() %>% tab_stat_se() %>% tab_stat_valid_n() %>% tab_row_label("#Column percent") %>% tab_stat_cpct() %>% tab_pivot() # stacked statistics with different variables mtcars %>% tab_cols(total(), am) %>% tab_cells(mpg, hp, qsec) %>% tab_stat_mean() %>% tab_cells(cyl, carb) %>% tab_stat_cpct() %>% tab_pivot() # stacked statistics - label position outside row labels mtcars %>% tab_cells(cyl) %>% tab_cols(total(), am) %>% tab_stat_mean() %>% tab_stat_se %>% tab_stat_valid_n() %>% tab_stat_cpct(label = "Col %") %>% tab_pivot(stat_label = "outside") # example from 'cross_fun_df' - linear regression by groups with sorting mtcars %>% tab_cells(sheet(mpg, disp, hp, wt, qsec)) %>% tab_cols(total(), am) %>% tab_stat_fun_df( function(x){ frm = reformulate(".", response = as.name(names(x)[1])) model = lm(frm, data = x) sheet('Coef.' = coef(model), confint(model) ) } ) %>% tab_pivot() %>% tab_sort_desc() # multiple-response variables and weight data(product_test) codeframe_likes = num_lab(" 1 Liked everything 2 Disliked everything 3 Chocolate 4 Appearance 5 Taste 6 Stuffing 7 Nuts 8 Consistency 98 Other 99 Hard to answer ") set.seed(1) product_test = product_test %>% let( # recode age by groups age_cat = recode(s2a, lo %thru% 25 ~ 1, lo %thru% hi ~ 2), wgt = runif(.N, 0.25, 4), wgt = wgt/sum(wgt)*.N ) %>% apply_labels( age_cat = "Age", age_cat = c("18 - 25" = 1, "26 - 35" = 2), a1_1 = "Likes. VSX123", b1_1 = "Likes. SDF456", a1_1 = codeframe_likes, b1_1 = codeframe_likes ) product_test %>% tab_cells(mrset(a1_1 %to% a1_6), mrset(b1_1 %to% b1_6)) %>% tab_cols(total(), age_cat) %>% tab_weight(wgt) %>% tab_stat_cpct() %>% tab_sort_desc() %>% tab_pivot() # trick to place cell variables labels inside columns # useful to compare two variables # '|' is needed to prevent automatic labels creation from argument # alternatively we can use list(...) to avoid this product_test %>% tab_cols(total(), age_cat) %>% tab_weight(wgt) %>% tab_cells("|" = unvr(mrset(a1_1 %to% a1_6))) %>% tab_stat_cpct(label = var_lab(a1_1)) %>% tab_cells("|" = unvr(mrset(b1_1 %to% b1_6))) %>% tab_stat_cpct(label = var_lab(b1_1)) %>% tab_pivot(stat_position = "inside_columns") # if you need standard evaluation, use 'vars' tables = mtcars %>% tab_cols(total(), am %nest% vs) for(each in c("mpg", "disp", "hp", "qsec")){ tables = tables %>% tab_cells(vars(each)) %>% tab_stat_fun(Mean = w_mean, "Std. dev." = w_sd, "Valid N" = w_n) } tables %>% tab_pivot() ## End(Not run)
Convert delimited text lines to data.frame. Blank lines are always skipped, trailing whitespaces are trimmed. You can use comments with '#' inside your text. For details see read.table.
text_to_columns( text, header = TRUE, sep = "", quote = "", dec = ".", encoding = "unknown", ... ) text_to_columns_csv( text, header = TRUE, sep = ",", quote = "", dec = ".", encoding = "unknown", ... ) text_to_columns_csv2( text, header = TRUE, sep = ";", quote = "", dec = ",", encoding = "unknown", ... ) text_to_columns_tab( text, header = TRUE, sep = "\t", quote = "", dec = ".", encoding = "unknown", ... ) text_to_columns_tab2( text, header = TRUE, sep = "\t", quote = "", dec = ",", encoding = "unknown", ... )
text_to_columns( text, header = TRUE, sep = "", quote = "", dec = ".", encoding = "unknown", ... ) text_to_columns_csv( text, header = TRUE, sep = ",", quote = "", dec = ".", encoding = "unknown", ... ) text_to_columns_csv2( text, header = TRUE, sep = ";", quote = "", dec = ",", encoding = "unknown", ... ) text_to_columns_tab( text, header = TRUE, sep = "\t", quote = "", dec = ".", encoding = "unknown", ... ) text_to_columns_tab2( text, header = TRUE, sep = "\t", quote = "", dec = ",", encoding = "unknown", ... )
text |
character/vector of characters |
header |
a logical value indicating whether the |
sep |
the field separator character. Values on each line of the file are
separated by this character. If sep = "" (the default for |
quote |
the set of quoting characters. To disable quoting altogether, use quote = "". |
dec |
the character used in the file for decimal points. |
encoding |
encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8 (see read.table). |
... |
further parameters which will be passed to read.table. |
data.frame
text_to_columns(" # simple data.frame a b c 1 2.5 a 4 5.5 b 7 8.5 c ")
text_to_columns(" # simple data.frame a b c 1 2.5 a 4 5.5 b 7 8.5 c ")
unlab
returns variable x without variable labels and value labels
unlab(x) drop_all_labels(x)
unlab(x) drop_all_labels(x)
x |
Variable(s). Vector/data.frame/list. |
unlab
returns original variable x without variable label, value labels and class.
raw_var = rep(1:2,5) var_with_lab = set_var_lab(raw_var,"Income") val_lab(var_with_lab) = c("Low"=1,"High"=2) identical(raw_var,unlab(var_with_lab)) # should be TRUE
raw_var = rep(1:2,5) var_with_lab = set_var_lab(raw_var,"Income") val_lab(var_with_lab) = c("Low"=1,"High"=2) identical(raw_var,unlab(var_with_lab)) # should be TRUE
These functions set/get/drop value labels. Duplicated values are not allowed.
If argument x
is data.frame or list then labels applied to all
elements of data.frame/list. To drop value labels, use val_lab(var) <-
NULL
or unvl(var)
. make_labels
converts text from the form
that usually used in questionnaires to named vector. For variable labels see
var_lab. For working with entire data.frame see apply_labels.
val_lab
returns value labels or NULL if labels doesn't
exist.
val_lab<-
set value labels.
set_val_lab
returns variable with value labels.
add_val_lab<-
add value labels to already existing value labels.
unvl
drops value labels.
make_labels
makes named vector from text for usage as value labels.
num_lab
, lab_num
and autonum
are shortcuts for make_labels
with code_postion
'left', 'right' and 'autonum' accordingly.
val_lab(x) val_lab(x) <- value set_val_lab(x, value, add = FALSE) add_val_lab(x, value) add_val_lab(x) <- value unvl(x) drop_val_labs(x) make_labels(text, code_position = c("left", "right", "autonum")) drop_unused_labels(x) num_lab(text) lab_num(text) autonum(text)
val_lab(x) val_lab(x) <- value set_val_lab(x, value, add = FALSE) add_val_lab(x, value) add_val_lab(x) <- value unvl(x) drop_val_labs(x) make_labels(text, code_position = c("left", "right", "autonum")) drop_unused_labels(x) num_lab(text) lab_num(text) autonum(text)
x |
Variable(s). Vector/data.frame/list. |
value |
Named vector. Names of vector are labels for the appropriate values of variable x. |
add |
Logical. Should we add value labels to old labels or replace it? Deafult is FALSE - we completely replace old values. If TRUE new value labels will be combined with old value labels. |
text |
text that should be converted to named vector |
code_position |
Possible values "left", "right" - position of numeric code in
|
Value labels are stored in attribute "labels"
(attr(x,"labels")
). We set variable class to "labelled" for preserving
labels from dropping during some operations (such as c
and `[`
).
val_lab
return value labels (named vector). If labels doesn't
exist it return NULL . val_lab<-
and set_val_lab
return
variable (vector x) of class "labelled" with attribute "labels" which
contains value labels. make_labels
return named vector for usage as
value labels.
# toy example data.table::setDTthreads(2) set.seed(123) # score - evaluation of tested product score = sample(-1:1,20,replace = TRUE) var_lab(score) = "Evaluation of tested brand" val_lab(score) = c("Dislike it" = -1, "So-so" = 0, "Like it" = 1 ) # frequency of product scores fre(score) # brands - multiple response question # Which brands do you use during last three months? brands = as.sheet(t(replicate(20,sample(c(1:5,NA),4,replace = FALSE)))) var_lab(brands) = "Used brands" val_lab(brands) = make_labels(" 1 Brand A 2 Brand B 3 Brand C 4 Brand D 5 Brand E ") # percentage of used brands fre(brands) # percentage of brands within each score cro_cpct(brands, score) ## make labels from text copied from questionnaire age = c(1, 2, 1, 2) val_lab(age) = num_lab(" 1. 18 - 26 2. 27 - 35 ") # note support of value lables in base R table(age) # or, if in original codes is on the right side products = 1:8 val_lab(products) = lab_num(" Chocolate bars 1 Chocolate sweets (bulk) 2 Slab chocolate(packed) 3 Slab chocolate (bulk) 4 Boxed chocolate sweets 5 Marshmallow/pastilles in chocolate coating 6 Marmalade in chocolate coating 7 Other 8 ") table(products)
# toy example data.table::setDTthreads(2) set.seed(123) # score - evaluation of tested product score = sample(-1:1,20,replace = TRUE) var_lab(score) = "Evaluation of tested brand" val_lab(score) = c("Dislike it" = -1, "So-so" = 0, "Like it" = 1 ) # frequency of product scores fre(score) # brands - multiple response question # Which brands do you use during last three months? brands = as.sheet(t(replicate(20,sample(c(1:5,NA),4,replace = FALSE)))) var_lab(brands) = "Used brands" val_lab(brands) = make_labels(" 1 Brand A 2 Brand B 3 Brand C 4 Brand D 5 Brand E ") # percentage of used brands fre(brands) # percentage of brands within each score cro_cpct(brands, score) ## make labels from text copied from questionnaire age = c(1, 2, 1, 2) val_lab(age) = num_lab(" 1. 18 - 26 2. 27 - 35 ") # note support of value lables in base R table(age) # or, if in original codes is on the right side products = 1:8 val_lab(products) = lab_num(" Chocolate bars 1 Chocolate sweets (bulk) 2 Slab chocolate(packed) 3 Slab chocolate (bulk) 4 Boxed chocolate sweets 5 Marshmallow/pastilles in chocolate coating 6 Marmalade in chocolate coating 7 Other 8 ") table(products)
values2labels
replaces vector/matrix/data.frame/list values with
corresponding value labels. If there are no labels for some values they are
converted to characters in most cases. If there are no labels at all for
variable it remains unchanged. v2l
is just shortcut to values2labels
.
values2labels(x) v2l(x)
values2labels(x) v2l(x)
x |
vector/matrix/data.frame/list |
Object of the same form as x but with value labels instead of values.
names2labels, val_lab, var_lab
data(mtcars) var_lab(mtcars$mpg) = NULL val_lab(mtcars$am) = c(" automatic" = 0, " manual" = 1) summary(lm(mpg ~ ., data = values2labels(mtcars[,c("mpg","am")])))
data(mtcars) var_lab(mtcars$mpg) = NULL val_lab(mtcars$am) = c(" automatic" = 0, " manual" = 1) summary(lm(mpg ~ ., data = values2labels(mtcars[,c("mpg","am")])))
These functions set/get/drop variable labels. For value labels see val_lab. For working with entire data.frame see apply_labels.
var_lab
returns variable label or NULL if label doesn't
exist.
var_lab<-
set variable label.
set_var_lab
returns variable with label.
unvr
drops variable label.
add_labelled_class
Add missing 'labelled' class. This function
is needed when you load SPSS data with packages which in some cases don't set
'labelled' class for variables with labels. For example, haven
package
doesn't set 'labelled' class for variables which have variable label but
don't have value labels. Note that to use 'expss' with 'haven' you need to
load 'expss' strictly after 'haven' to avoid conflicts.
var_lab(x, default = NULL) var_lab(x) <- value set_var_lab(x, value) unvr(x) drop_var_labs(x) add_labelled_class( x, remove_classes = c("haven_labelled", "spss_labelled", "haven_labelled_spss", "vctrs_vctr") )
var_lab(x, default = NULL) var_lab(x) <- value set_var_lab(x, value) unvr(x) drop_var_labs(x) add_labelled_class( x, remove_classes = c("haven_labelled", "spss_labelled", "haven_labelled_spss", "vctrs_vctr") )
x |
Variable. In the most cases it is numeric vector. |
default |
A character scalar. What we want to get from 'var_lab' if there is no variable label. NULL by default. |
value |
A character scalar - label for the variable x. |
remove_classes |
A character vector of classes which should be removed
from the class attribute of the |
Variable label is stored in attribute "label" (attr(x,"label")
). For
preserving from dropping this attribute during some operations (such as c
)
variable class is set to "labelled". There are special methods of
subsetting and concatenation for this class. To drop variable label use
var_lab(var) <- NULL
or unvr(var)
.
var_lab
return variable label. If label doesn't exist it return
NULL . var_lab<-
and set_var_lab
return variable (vector x)
of class "labelled" with attribute "label" which equals submitted value.
data(mtcars) var_lab(mtcars$mpg) = "Miles/(US) gallon" var_lab(mtcars$cyl) = "Number of cylinders" var_lab(mtcars$disp) = "Displacement (cu.in.)" var_lab(mtcars$hp) = "Gross horsepower" var_lab(mtcars$drat) = "Rear axle ratio" var_lab(mtcars$wt) = "Weight (lb/1000)" var_lab(mtcars$qsec) = "1/4 mile time" var_lab(mtcars$vs) = "V/S" var_lab(mtcars$am) = "Transmission (0 = automatic, 1 = manual)" val_lab(mtcars$am) = c(automatic = 0, manual=1) var_lab(mtcars$gear) = "Number of forward gears" var_lab(mtcars$carb) = "Number of carburetors" fre(mtcars$am) cross_mean(mtcars, list(mpg, disp, hp, qsec), list(total(), am)) ## Not run: if(FALSE){ # to prevent execution # you need to load packages strictly in this order to avoid conflicts library(haven) library(expss) spss_data = haven::read_spss("spss_file.sav") # add missing 'labelled' class spss_data = add_labelled_class(spss_data) } ## End(Not run)
data(mtcars) var_lab(mtcars$mpg) = "Miles/(US) gallon" var_lab(mtcars$cyl) = "Number of cylinders" var_lab(mtcars$disp) = "Displacement (cu.in.)" var_lab(mtcars$hp) = "Gross horsepower" var_lab(mtcars$drat) = "Rear axle ratio" var_lab(mtcars$wt) = "Weight (lb/1000)" var_lab(mtcars$qsec) = "1/4 mile time" var_lab(mtcars$vs) = "V/S" var_lab(mtcars$am) = "Transmission (0 = automatic, 1 = manual)" val_lab(mtcars$am) = c(automatic = 0, manual=1) var_lab(mtcars$gear) = "Number of forward gears" var_lab(mtcars$carb) = "Number of carburetors" fre(mtcars$am) cross_mean(mtcars, list(mpg, disp, hp, qsec), list(total(), am)) ## Not run: if(FALSE){ # to prevent execution # you need to load packages strictly in this order to avoid conflicts library(haven) library(expss) spss_data = haven::read_spss("spss_file.sav") # add missing 'labelled' class spss_data = add_labelled_class(spss_data) } ## End(Not run)
vars
returns data.frame with all variables by their names or
by criteria (see criteria). There is no non-standard evaluation in
this function by design so use quotes for names of your variables. This
function is intended to get variables by parameter/criteria. The only
exception with non-standard evaluation is %to%
. You can use
%to%
inside vars
or independently.
..p
returns data.frame with all variables which names satisfy
supplied perl-style regular expression. Arguments for this function is quoted
characters. It is a shortcut for vars(perl(pattern))
.
..f
returns data.frame with all variables which names contain
supplied pattern. Arguments for this function can be unquoted. It is a
shortcut for vars(fixed(pattern))
.
..t
returns data.frame with variables which names are stored
in the supplied arguments. Expressions in characters in curly brackets are
expanded. See text_expand.
..[]
returns data.frame with all variables by their names or
by criteria (see criteria). Names at the top-level can be unquoted
(non-standard evaluation). For standard evaluation of parameters you can
surround them by round brackets. You can assign to this expression. If there
are several names inside square brackets then each element of list/data.frame
from right side will be assigned to appropriate name from left side. You can
use item1 %to% item2
notation to get/create sequence of variables.
If there are no arguments inside square brackets than from each item of RHS
will be created separate variable in the parent frame. In this case RHS
should be named list or data.frame.
..$name
sets/returns object which name is stored in the
variable name
. It is convenient wrapper around
get/assign functions.
%to%
returns range of variables between e1
and
e2
(similar to SPSS 'to').
indirect
/indirect_list
are aliases for
vars
/vars_list
.
Functions with word 'list' in name return lists of variables instead of dataframes.
vars(...) vars_list(...) indirect(...) indirect_list(...) e1 %to% e2 e1 %to_list% e2 .. ..f(...) ..p(...) ..t(...)
vars(...) vars_list(...) indirect(...) indirect_list(...) e1 %to% e2 e1 %to_list% e2 .. ..f(...) ..p(...) ..t(...)
... |
characters names of variables or criteria/logical functions |
e1 |
unquoted name of start variable (e. g. a_1) |
e2 |
unquoted name of start variable (e. g. a_5) |
An object of class parameter
of length 1.
data.frame/list with variables
All these functions except %n_d%
, %n_i%
preserve names of
vectors and don't remove duplicates.
%a%
a(ppends) second argument to the first argument. See also
append.
%u%
and v_union
u(nite) first and second arguments.
Remove elements from the second argument which exist in the first argument.
%d%
and v_diff
d(iff) second argument from the first
argument. Second argument could be a function which returns logical value. In
this case elements of the first argument which give TRUE will be removed.
%i%
and v_intersect
i(ntersect) first argument and
second argument. Second argument could be a function which returns logical
value. In this case elements of the first argument which give FALSE will be
removed.
%e%
and v_xor
e(xclusive OR). Returns elements that
contained only in one of arguments.
%r%
r(epeats) first argument second argument times. See
also rep.
%n_d%
and n_diff
n(ames) d(iff) - diff second argument
from names of first argument. Second argument could be a function which
returns logical value. In this case elements of the first argument which names
give TRUE will be removed.
%n_i%
and n_intersect
n(ames) i(ntersect) - intersect
names of the first argument with the second argument. Second argument could be a
function which returns logical value. In this case elements of the first argument
which names give FALSE will be removed.
For %d%
, %i%
, %n_d%
, %n_i%
one can use
criteria functions. See criteria for details.
e1 %a% e2 v_union(e1, e2) e1 %u% e2 v_diff(e1, e2) e1 %d% e2 v_intersect(e1, e2) e1 %i% e2 v_xor(e1, e2) e1 %e% e2 e1 %r% e2 n_intersect(e1, e2) e1 %n_i% e2 n_diff(e1, e2) e1 %n_d% e2
e1 %a% e2 v_union(e1, e2) e1 %u% e2 v_diff(e1, e2) e1 %d% e2 v_intersect(e1, e2) e1 %i% e2 v_xor(e1, e2) e1 %e% e2 e1 %r% e2 n_intersect(e1, e2) e1 %n_i% e2 n_diff(e1, e2) e1 %n_d% e2
e1 |
vector or data.frame, matrix, list for |
e2 |
vector or function for |
vector or data.frame, matrix, list for %n_d%
, %n_i%
)
1:4 %a% 5:6 # 1:6 1:4 %a% 4:5 # 1,2,3,4,4,5 1:4 %u% 4:5 # 1,2,3,4,5 1:6 %d% 5:6 # 1:4 # function as criterion 1:6 %d% greater(4) # 1:4 1:4 %i% 4:5 # 4 # with criteria functions letters %i% (contains("a") | contains("z")) # a, z letters %i% perl("[a-d]") # a,b,c,d 1:4 %e% 4:5 # 1, 2, 3, 5 1:2 %r% 2 # 1, 2, 1, 2 # %n_i%, %n_d% # remove column Species iris %n_d% "Species" # leave only columns which names start with "Sepal" iris %n_i% like("Sepal*") # leave column "Species" and columns which names start with "Sepal" iris %n_i% ("Species" | like("Sepal*")) iris %n_i% or("Species", like("Sepal*")) # same result
1:4 %a% 5:6 # 1:6 1:4 %a% 4:5 # 1,2,3,4,4,5 1:4 %u% 4:5 # 1,2,3,4,5 1:6 %d% 5:6 # 1:4 # function as criterion 1:6 %d% greater(4) # 1:4 1:4 %i% 4:5 # 4 # with criteria functions letters %i% (contains("a") | contains("z")) # a, z letters %i% perl("[a-d]") # a,b,c,d 1:4 %e% 4:5 # 1, 2, 3, 5 1:2 %r% 2 # 1, 2, 1, 2 # %n_i%, %n_d% # remove column Species iris %n_d% "Species" # leave only columns which names start with "Sepal" iris %n_i% like("Sepal*") # leave column "Species" and columns which names start with "Sepal" iris %n_i% ("Species" | like("Sepal*")) iris %n_i% or("Species", like("Sepal*")) # same result
w_mean
weighted mean of a numeric vector
w_sd
weighted sample standard deviation of a numeric vector
w_var
weighted sample variance of a numeric vector
w_se
weighted standard error of a numeric vector
w_median
weighted median of a numeric vector
w_mad
weighted mean absolute deviation from median of a numeric vector
w_sum
weighted sum of a numeric vector
w_n
weighted number of values of a numeric vector
w_cov
weighted covariance matrix of a numeric matrix/data.frame
w_cor
weighted Pearson correlation matrix of a numeric matrix/data.frame
w_pearson
shortcut for w_cor
. Weighted Pearson
correlation matrix of a numeric matrix/data.frame
w_spearman
weighted Spearman correlation matrix of a numeric matrix/data.frame
w_mean(x, weight = NULL, na.rm = TRUE) w_median(x, weight = NULL, na.rm = TRUE) w_var(x, weight = NULL, na.rm = TRUE) w_sd(x, weight = NULL, na.rm = TRUE) w_se(x, weight = NULL, na.rm = TRUE) w_mad(x, weight = NULL, na.rm = TRUE) w_sum(x, weight = NULL, na.rm = TRUE) w_n(x, weight = NULL, na.rm = TRUE) unweighted_valid_n(x, weight = NULL) valid_n(x, weight = NULL) w_max(x, weight = NULL, na.rm = TRUE) w_min(x, weight = NULL, na.rm = TRUE) w_cov(x, weight = NULL, use = c("pairwise.complete.obs", "complete.obs")) w_cor(x, weight = NULL, use = c("pairwise.complete.obs", "complete.obs")) w_pearson(x, weight = NULL, use = c("pairwise.complete.obs", "complete.obs")) w_spearman(x, weight = NULL, use = c("pairwise.complete.obs", "complete.obs"))
w_mean(x, weight = NULL, na.rm = TRUE) w_median(x, weight = NULL, na.rm = TRUE) w_var(x, weight = NULL, na.rm = TRUE) w_sd(x, weight = NULL, na.rm = TRUE) w_se(x, weight = NULL, na.rm = TRUE) w_mad(x, weight = NULL, na.rm = TRUE) w_sum(x, weight = NULL, na.rm = TRUE) w_n(x, weight = NULL, na.rm = TRUE) unweighted_valid_n(x, weight = NULL) valid_n(x, weight = NULL) w_max(x, weight = NULL, na.rm = TRUE) w_min(x, weight = NULL, na.rm = TRUE) w_cov(x, weight = NULL, use = c("pairwise.complete.obs", "complete.obs")) w_cor(x, weight = NULL, use = c("pairwise.complete.obs", "complete.obs")) w_pearson(x, weight = NULL, use = c("pairwise.complete.obs", "complete.obs")) w_spearman(x, weight = NULL, use = c("pairwise.complete.obs", "complete.obs"))
x |
a numeric vector (matrix/data.frame for correlations) containing the values whose weighted statistics is to be computed. |
weight |
a vector of weights to use for each element of x. Cases with
missing, zero or negative weights will be removed before calculations. If
|
na.rm |
a logical value indicating whether NA values should be stripped before the computation proceeds. Note that contrary to base R statistic functions the default value is TRUE (remove missing values). |
use |
|
If argument of correlation functions is data.frame with variable labels then
variables names will be replaced with labels. If this is undesirable behavior
use drop_var_labs function: w_cor(drop_var_labs(x))
. Weighted
Spearman correlation coefficients are calculated with weights rounded to nearest
integer. It gives the same result as in SPSS Statistics software. By
now this algorithm is not memory efficient.
a numeric value of length one/correlation matrix
data(mtcars) dfs = mtcars %>% columns(mpg, disp, hp, wt) with(dfs, w_mean(hp, weight = 1/wt)) # apply labels mtcars = mtcars %>% apply_labels( mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # weighted correlations with labels w_cor(dfs, weight = 1/dfs$wt) # without labels w_cor(drop_var_labs(dfs), weight = 1/dfs$wt)
data(mtcars) dfs = mtcars %>% columns(mpg, disp, hp, wt) with(dfs, w_mean(hp, weight = 1/wt)) # apply labels mtcars = mtcars %>% apply_labels( mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # weighted correlations with labels w_cor(dfs, weight = 1/dfs$wt) # without labels w_cor(drop_var_labs(dfs), weight = 1/dfs$wt)
This is a "brute force" weighting procedure. Each row of the dataset is replicated "case weight" times. If 'weight' is not integer it will be rounded to the nearest integer. So cases with weight less than 0.5 will be removed from the dataset. Such weighting is used in the several statistical procedures in the SPSS Statistic, e. g. for the Spearman correlation coefficient or GLM.
weight_by(data, weight = NULL)
weight_by(data, weight = NULL)
data |
data.frame, data.table or matrix. Dataset which will be weighted. |
weight |
unquoted column name of weights in 'data' or vector of weights. If it is NULL 'data' will be returned unchanged. |
'data' with each row replicated according to case weight.
data(state) # US states # convert matrix to data.table states = data.table(state.x77, keep.rownames = "State") # create weighted dataset states_weighted = states %>% let( # calculate 'weight' variable. weight = Population/100 ) %>% weight_by(weight) # Each row in the weighted dataset is represented proportionally to the population of the state nrow(states) # unweigthed number of cases nrow(states_weighted) # number of cases in the weighted dataset str(states_weighted)
data(state) # US states # convert matrix to data.table states = data.table(state.x77, keep.rownames = "State") # create weighted dataset states_weighted = states %>% let( # calculate 'weight' variable. weight = Population/100 ) %>% weight_by(weight) # Each row in the weighted dataset is represented proportionally to the population of the state nrow(states) # unweigthed number of cases nrow(states_weighted) # number of cases in the weighted dataset str(states_weighted)
For the data frame cond
will be evaluated in the data.frame's context.
So columns can be referred as variables in the expression (see the examples).
If data
is list then where
will be applied to each element of
the list. For other types (vector/matrix) there is no non-standard
evaluation. There is a special constant .N
which equals to number of
rows in data
for usage in cond
expression.
where(data, cond)
where(data, cond)
data |
data.frame/matrix/vector/list to be subsetted |
cond |
logical or numeric expression indicating elements or rows to
keep: missing values (NA) are taken as FALSE. If |
data.frame/matrix/vector/list which contains just selected rows.
# leave only 'setosa' rows(iris, Species == "setosa") # leave only first five rows rows(iris, 1:5) # example of .N usage. set.seed(42) train = rows(iris, sample(.N, 100)) str(train) set.seed(42) test = rows(iris, -sample(.N, 100)) str(test)
# leave only 'setosa' rows(iris, Species == "setosa") # leave only first five rows rows(iris, 1:5) # example of .N usage. set.seed(42) train = rows(iris, sample(.N, 100)) str(train) set.seed(42) test = rows(iris, -sample(.N, 100)) str(test)
This is faster version of ave. window_fun
applies function
to every subset of x
and return vector of the same length as x
.
window_fun(x, ...)
window_fun(x, ...)
x |
A vector |
... |
Grouping variables all of the same length as x or length 1 and function as last argument. |
vector of the same length as x
window_fun(1:3, mean) # no grouping -> grand mean attach(warpbreaks) window_fun(breaks, wool, mean) window_fun(breaks, tension, function(x) mean(x, trim = 0.1)) detach(warpbreaks)
window_fun(1:3, mean) # no grouping -> grand mean attach(warpbreaks) window_fun(breaks, wool, mean) window_fun(breaks, tension, function(x) mean(x, trim = 0.1)) detach(warpbreaks)
write_labelled_csv
and read_labelled_csv
writes csv
file with labels. By default labels are stored in the commented lines at the
beginning of the file before the data part. *_csv2
write and read data
with a semicolon separator and comma as decimal delimiter. *_tab/*_tab2
write and read data with 'tab' separator and "."/"," as decimal delimiter.
write_labelled_xlsx
and read_labelled_xlsx
write and
read labelled 'xlsx' format. It is a simple Excel file with data and labels on
separate sheets. It can help you with labelled data exchange in the
corporate environment.
write_labelled_fst
and read_labelled_fst
write and read
labelled data in the 'fst' format. See Fst Package.
Data and labels are stored in the separate files. With 'fst' format you can
read and write a huge amount of data very quickly.
write_labelled_spss write 'csv' file with SPSS syntax for reading it. You can use it for the data exchange with SPSS.
create_dictionary
and apply_dictionary
make data.frame
with dictionary, e. g. variable and value labels for each variable. See
format description in the 'Details' section.
write_labels
and write_labels_spss
Write R code and
SPSS syntax for labelling data. It allows to extract labels from *.sav files
that come without accompanying syntax.
old_write_labelled_csv
and old_read_labelled_csv
Read
and write labelled 'csv' in format of the 'expss' version before 0.9.0.
write_labelled_csv( x, filename, remove_new_lines = TRUE, single_file = TRUE, ... ) write_labelled_csv2( x, filename, remove_new_lines = TRUE, single_file = TRUE, ... ) write_labelled_tab( x, filename, remove_new_lines = TRUE, single_file = TRUE, ... ) write_labelled_tab2( x, filename, remove_new_lines = TRUE, single_file = TRUE, ... ) write_labelled_xlsx( x, filename, data_sheet = "data", dict_sheet = "dictionary", remove_repeated = FALSE, use_references = TRUE ) write_labelled_fst(x, filename, ...) read_labelled_csv(filename, undouble_quotes = TRUE, ...) read_labelled_csv2(filename, undouble_quotes = TRUE, ...) read_labelled_tab(filename, undouble_quotes = TRUE, ...) read_labelled_tab2(filename, undouble_quotes = TRUE, ...) read_labelled_xlsx(filename, data_sheet = 1, dict_sheet = "dictionary") read_labelled_fst(filename, ...) write_labelled_spss( x, filename, fileEncoding = "", remove_new_lines = TRUE, ... ) write_labels_spss(x, filename) write_labels(x, filename, fileEncoding = "") create_dictionary(x, remove_repeated = FALSE, use_references = TRUE) apply_dictionary(x, dict) old_write_labelled_csv( x, filename, fileEncoding = "", remove_new_lines = TRUE, ... ) old_read_labelled_csv(filename, fileEncoding = "", undouble_quotes = TRUE, ...)
write_labelled_csv( x, filename, remove_new_lines = TRUE, single_file = TRUE, ... ) write_labelled_csv2( x, filename, remove_new_lines = TRUE, single_file = TRUE, ... ) write_labelled_tab( x, filename, remove_new_lines = TRUE, single_file = TRUE, ... ) write_labelled_tab2( x, filename, remove_new_lines = TRUE, single_file = TRUE, ... ) write_labelled_xlsx( x, filename, data_sheet = "data", dict_sheet = "dictionary", remove_repeated = FALSE, use_references = TRUE ) write_labelled_fst(x, filename, ...) read_labelled_csv(filename, undouble_quotes = TRUE, ...) read_labelled_csv2(filename, undouble_quotes = TRUE, ...) read_labelled_tab(filename, undouble_quotes = TRUE, ...) read_labelled_tab2(filename, undouble_quotes = TRUE, ...) read_labelled_xlsx(filename, data_sheet = 1, dict_sheet = "dictionary") read_labelled_fst(filename, ...) write_labelled_spss( x, filename, fileEncoding = "", remove_new_lines = TRUE, ... ) write_labels_spss(x, filename) write_labels(x, filename, fileEncoding = "") create_dictionary(x, remove_repeated = FALSE, use_references = TRUE) apply_dictionary(x, dict) old_write_labelled_csv( x, filename, fileEncoding = "", remove_new_lines = TRUE, ... ) old_read_labelled_csv(filename, fileEncoding = "", undouble_quotes = TRUE, ...)
x |
data.frame to be written/data.frame whose labels to be written |
filename |
the name of the file which the data are to be read from/write to. |
remove_new_lines |
A logical indicating should we replace new lines with spaces in the character variables. TRUE by default. |
single_file |
logical. TRUE by default. Should we write labels into the same file as data? If it is FALSE dictionary will be written in the separate file. |
... |
additional arguments for fwrite/fread, e. g. column separator, decimal separator, encoding and etc. |
data_sheet |
character "data" by default. Where data will be placed in the '*.xlsx' file. |
dict_sheet |
character "dictionary" by default. Where dictionary will be placed in the '*.xlsx' file. |
remove_repeated |
logical. FALSE by default. If TRUE then we remove repeated variable names. It makes a dictionary to look nicer for humans but less convenient for usage. |
use_references |
logical. When TRUE (default) then if the variable has the same value labels as the previous variable, we use reference to this variable. It makes dictionary significantly more compact for datasets with many variables with the same value labels. |
undouble_quotes |
A logical indicating should we undouble quotes which were escaped by doubling. TRUE by default. Argument will be removed when data.table issue #1109 will be fixed. |
fileEncoding |
character string: if non-empty declares the encoding to be used on a file (not a connection) so the character data can be re-encoded as they are written. Used for writing dictionary. See file. |
dict |
data.frame with labels - a result of |
Dictionary is a data.frame with the following columns:
variable variable name in the data set. It can be omitted
(NA
). In this case name from the previous row will be taken.
value code for label in the column 'label'
.
label in most cases it is value label but its meaning can be changed
by the column 'meta'
.
meta if it is NA then we have value label in the 'label'
column. If it is 'varlab'
, then there is a variable label in the
'label'
column and column 'value'
is ignored. If it is
'reference'
, then there is a variable name in the 'label'
column and we use value labels from this variable, column 'value'
is
ignored.
Functions for writing invisibly return NULL. Functions for reading return labelled data.frame.
## Not run: data(mtcars) mtcars = mtcars %>% apply_labels( mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) write_labelled_csv(mtcars, "mtcars.csv") new_mtcars = read_labelled_csv("mtcars.csv") str(new_mtcars) # identically, for xlsx write_labelled_xlsx(mtcars, "mtcars.xlsx") new_mtcars = read_labelled_xlsx("mtcars.xlsx") str(new_mtcars) # to SPSS syntax write_labelled_spss(mtcars, "mtcars.csv") ## End(Not run)
## Not run: data(mtcars) mtcars = mtcars %>% apply_labels( mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c(automatic = 0, manual=1), gear = "Number of forward gears", carb = "Number of carburetors" ) write_labelled_csv(mtcars, "mtcars.csv") new_mtcars = read_labelled_csv("mtcars.csv") str(new_mtcars) # identically, for xlsx write_labelled_xlsx(mtcars, "mtcars.xlsx") new_mtcars = read_labelled_xlsx("mtcars.xlsx") str(new_mtcars) # to SPSS syntax write_labelled_spss(mtcars, "mtcars.csv") ## End(Not run)
Note that openxlsx
package is required for these functions. It can be
install by printing install.packages('openxlsx')
in the console. On
Windows system you also may need to
install rtools. You
can export several tables at once by combining them in a list. See examples.
If you need to write all tables to the single sheet you can use
xl_write_file
. It automatically creates workbook, worksheet and save
*.xlsx file for you.
xl_write(obj, wb, sheet, row = 1, col = 1, ...) xl_write_file(obj, filename, sheetname = "Tables", ...) ## Default S3 method: xl_write( obj, wb, sheet, row = 1, col = 1, rownames = FALSE, colnames = !is.atomic(obj), ... ) ## S3 method for class 'list' xl_write(obj, wb, sheet, row = 1, col = 1, gap = 1, ...) ## S3 method for class 'etable' xl_write( obj, wb, sheet, row = 1, col = 1, remove_repeated = c("all", "rows", "columns", "none"), format_table = TRUE, borders = list(borderColour = "black", borderStyle = "thin"), header_format = openxlsx::createStyle(fgFill = "#EBEBEB", halign = "left", wrapText = FALSE), main_format = openxlsx::createStyle(halign = "right", numFmt = format(0, nsmall = get_expss_digits())), row_labels_format = openxlsx::createStyle(halign = "left"), total_format = openxlsx::createStyle(fgFill = "#EBEBEB", border = "TopBottom", borderStyle = "thin", halign = "right", numFmt = "0"), total_row_labels_format = openxlsx::createStyle(fgFill = "#EBEBEB", border = "TopBottom", borderStyle = "thin", halign = "left"), top_left_corner_format = header_format, row_symbols_to_remove = NULL, col_symbols_to_remove = NULL, other_rows_formats = NULL, other_row_labels_formats = NULL, other_cols_formats = NULL, other_col_labels_formats = NULL, additional_cells_formats = NULL, ... ) ## S3 method for class 'with_caption' xl_write( obj, wb, sheet, row = 1, col = 1, remove_repeated = c("all", "rows", "columns", "none"), format_table = TRUE, borders = list(borderColour = "black", borderStyle = "thin"), header_format = openxlsx::createStyle(fgFill = "#EBEBEB", halign = "left", wrapText = FALSE), main_format = openxlsx::createStyle(halign = "right", numFmt = format(0, nsmall = get_expss_digits())), row_labels_format = openxlsx::createStyle(halign = "left"), total_format = openxlsx::createStyle(fgFill = "#EBEBEB", border = "TopBottom", borderStyle = "thin", halign = "right", numFmt = "0"), total_row_labels_format = openxlsx::createStyle(fgFill = "#EBEBEB", border = "TopBottom", borderStyle = "thin", halign = "left"), top_left_corner_format = header_format, row_symbols_to_remove = NULL, col_symbols_to_remove = NULL, other_rows_formats = NULL, other_row_labels_formats = NULL, other_cols_formats = NULL, other_col_labels_formats = NULL, additional_cells_formats = NULL, caption_format = openxlsx::createStyle(textDecoration = "bold", halign = "left"), ... )
xl_write(obj, wb, sheet, row = 1, col = 1, ...) xl_write_file(obj, filename, sheetname = "Tables", ...) ## Default S3 method: xl_write( obj, wb, sheet, row = 1, col = 1, rownames = FALSE, colnames = !is.atomic(obj), ... ) ## S3 method for class 'list' xl_write(obj, wb, sheet, row = 1, col = 1, gap = 1, ...) ## S3 method for class 'etable' xl_write( obj, wb, sheet, row = 1, col = 1, remove_repeated = c("all", "rows", "columns", "none"), format_table = TRUE, borders = list(borderColour = "black", borderStyle = "thin"), header_format = openxlsx::createStyle(fgFill = "#EBEBEB", halign = "left", wrapText = FALSE), main_format = openxlsx::createStyle(halign = "right", numFmt = format(0, nsmall = get_expss_digits())), row_labels_format = openxlsx::createStyle(halign = "left"), total_format = openxlsx::createStyle(fgFill = "#EBEBEB", border = "TopBottom", borderStyle = "thin", halign = "right", numFmt = "0"), total_row_labels_format = openxlsx::createStyle(fgFill = "#EBEBEB", border = "TopBottom", borderStyle = "thin", halign = "left"), top_left_corner_format = header_format, row_symbols_to_remove = NULL, col_symbols_to_remove = NULL, other_rows_formats = NULL, other_row_labels_formats = NULL, other_cols_formats = NULL, other_col_labels_formats = NULL, additional_cells_formats = NULL, ... ) ## S3 method for class 'with_caption' xl_write( obj, wb, sheet, row = 1, col = 1, remove_repeated = c("all", "rows", "columns", "none"), format_table = TRUE, borders = list(borderColour = "black", borderStyle = "thin"), header_format = openxlsx::createStyle(fgFill = "#EBEBEB", halign = "left", wrapText = FALSE), main_format = openxlsx::createStyle(halign = "right", numFmt = format(0, nsmall = get_expss_digits())), row_labels_format = openxlsx::createStyle(halign = "left"), total_format = openxlsx::createStyle(fgFill = "#EBEBEB", border = "TopBottom", borderStyle = "thin", halign = "right", numFmt = "0"), total_row_labels_format = openxlsx::createStyle(fgFill = "#EBEBEB", border = "TopBottom", borderStyle = "thin", halign = "left"), top_left_corner_format = header_format, row_symbols_to_remove = NULL, col_symbols_to_remove = NULL, other_rows_formats = NULL, other_row_labels_formats = NULL, other_cols_formats = NULL, other_col_labels_formats = NULL, additional_cells_formats = NULL, caption_format = openxlsx::createStyle(textDecoration = "bold", halign = "left"), ... )
obj |
|
wb |
xlsx workbook object, result of createWorkbook function. |
sheet |
character or numeric - worksheet name/number in the workbook |
row |
numeric - starting row for writing data |
col |
numeric - starting column for writing data |
... |
further arguments for |
filename |
A character string naming an xlsx file. For |
sheetname |
A character name for the worksheet. For |
rownames |
logical should we write data.frame row names? |
colnames |
logical should we write data.frame column names? |
gap |
integer. Number of rows between list elements. |
remove_repeated |
Should we remove duplicated row or column labels in the rows/columns of the etable? Possible values: "all", "rows", "columns", "none". |
format_table |
logical should we format table? If FALSE all format arguments will be ignored. |
borders |
list Style of the table borders. List with two named elements:
|
header_format |
table header format - result of the createStyle function. |
main_format |
result of the createStyle function. Format of the table main area except total rows. Total rows is rows which row labels contain '#'. |
row_labels_format |
result of the createStyle function. Format of the row labels area except total rows. Total rows is rows which row labels contain '#'. |
total_format |
result of the createStyle function. Format of the total rows in the table main area. Total rows is rows which row labels contain '#'. |
total_row_labels_format |
result of the createStyle function. Format of the total rows in the row labels area. Total rows is rows which row labels contain '#'. |
top_left_corner_format |
result of the createStyle function. |
row_symbols_to_remove |
character vector. Perl-style regular expressions for substrings which will be removed from row labels. |
col_symbols_to_remove |
character vector. Perl-style regular expressions for substrings which will be removed from column names. |
other_rows_formats |
named list. Names of the list are perl-style regular expression patterns, items of the list are results of the createStyle function. Rows in the main area which row labels contain pattern will be formatted according to the appropriate style. |
other_row_labels_formats |
named list. Names of the list are perl-style regular expression patterns, items of the list are results of the createStyle function. Rows in the row labels area which row labels contain pattern will be formatted according to the appropriate style. |
other_cols_formats |
named list. Names of the list are perl-style regular expression patterns, items of the list are results of the createStyle function. Columns in the main area which column labels contain pattern will be formatted according to the appropriate style. |
other_col_labels_formats |
named list. Names of the list are perl-style regular expression patterns, items of the list are results of the createStyle function. Columns in the header area which column labels contain pattern will be formatted according to the appropriate style. |
additional_cells_formats |
list Each item of the list is list which
consists of two elements. First element is two columns matrix or data.frame
with row number and column numbers in the main area of the table. Such
matrix can be produced with code |
caption_format |
result of the createStyle function. |
invisibly return vector with rows and columns (c(rows,
columns)
) occupied by outputted object.
## Not run: library(openxlsx) data(mtcars) # add labels to dataset mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # create table with caption mtcars_table = cross_cpct(mtcars, cell_vars = list(cyl, gear), col_vars = list(total(), am, vs) ) %>% set_caption("Table 1") wb = createWorkbook() sh = addWorksheet(wb, "Tables") # export table xl_write(mtcars_table, wb, sh) saveWorkbook(wb, "table1.xlsx", overwrite = TRUE) ## quick export xl_write_file(mtcars_table, "table1.xlsx") ## custom cells formatting wb = createWorkbook() sh = addWorksheet(wb, "Tables") # we want to mark cells which are greater than total column my_formatter = function(tbl){ greater_than_total = tbl[,-1]>tbl[[2]] which(greater_than_total, arr.ind = TRUE) } # export table xl_write(mtcars_table, wb, sh, additional_cells_formats = list( list(my_formatter, createStyle(textDecoration = "bold", fontColour = "blue")) ) ) saveWorkbook(wb, "table_with_additional_format.xlsx", overwrite = TRUE) ## automated report generation on multiple variables with the same banner banner = with(mtcars, list(total(), am, vs)) # create list of tables list_of_tables = lapply(mtcars, function(variable) { if(length(unique(variable))<7){ cro_cpct(variable, banner) %>% significance_cpct() } else { # if number of unique values greater than seven we calculate mean cro_mean_sd_n(variable, banner) %>% significance_means() } }) wb = createWorkbook() sh = addWorksheet(wb, "Tables") # export list of tables with additional formatting xl_write(list_of_tables, wb, sh, # remove '#' sign from totals col_symbols_to_remove = "#", row_symbols_to_remove = "#", # format total column as bold other_col_labels_formats = list("#" = createStyle(textDecoration = "bold")), other_cols_formats = list("#" = createStyle(textDecoration = "bold")), ) saveWorkbook(wb, "report.xlsx", overwrite = TRUE) ## End(Not run)
## Not run: library(openxlsx) data(mtcars) # add labels to dataset mtcars = apply_labels(mtcars, mpg = "Miles/(US) gallon", cyl = "Number of cylinders", disp = "Displacement (cu.in.)", hp = "Gross horsepower", drat = "Rear axle ratio", wt = "Weight (lb/1000)", qsec = "1/4 mile time", vs = "Engine", vs = c("V-engine" = 0, "Straight engine" = 1), am = "Transmission", am = c("Automatic" = 0, "Manual"=1), gear = "Number of forward gears", carb = "Number of carburetors" ) # create table with caption mtcars_table = cross_cpct(mtcars, cell_vars = list(cyl, gear), col_vars = list(total(), am, vs) ) %>% set_caption("Table 1") wb = createWorkbook() sh = addWorksheet(wb, "Tables") # export table xl_write(mtcars_table, wb, sh) saveWorkbook(wb, "table1.xlsx", overwrite = TRUE) ## quick export xl_write_file(mtcars_table, "table1.xlsx") ## custom cells formatting wb = createWorkbook() sh = addWorksheet(wb, "Tables") # we want to mark cells which are greater than total column my_formatter = function(tbl){ greater_than_total = tbl[,-1]>tbl[[2]] which(greater_than_total, arr.ind = TRUE) } # export table xl_write(mtcars_table, wb, sh, additional_cells_formats = list( list(my_formatter, createStyle(textDecoration = "bold", fontColour = "blue")) ) ) saveWorkbook(wb, "table_with_additional_format.xlsx", overwrite = TRUE) ## automated report generation on multiple variables with the same banner banner = with(mtcars, list(total(), am, vs)) # create list of tables list_of_tables = lapply(mtcars, function(variable) { if(length(unique(variable))<7){ cro_cpct(variable, banner) %>% significance_cpct() } else { # if number of unique values greater than seven we calculate mean cro_mean_sd_n(variable, banner) %>% significance_means() } }) wb = createWorkbook() sh = addWorksheet(wb, "Tables") # export list of tables with additional formatting xl_write(list_of_tables, wb, sh, # remove '#' sign from totals col_symbols_to_remove = "#", row_symbols_to_remove = "#", # format total column as bold other_col_labels_formats = list("#" = createStyle(textDecoration = "bold")), other_cols_formats = list("#" = createStyle(textDecoration = "bold")), ) saveWorkbook(wb, "report.xlsx", overwrite = TRUE) ## End(Not run)