R--制作Table1

在大多数已发表的文章中，都有一个“表1”，其中包含样本的描述性统计数据。例如，这可能包括连续变量的平均值和标准差，分类变量的频率和比例，也许还包括缺失值的数量。创建这样一个表的暴力方法是为每个感兴趣的变量计算每个统计数据，然后将结果复制并粘贴到表中。即使做了一次，你也会希望有一个更简单的方法！有许多可能的解决方案，但这里展示了一个快速且易于使用的解决方案——gtsummary包。今天我们以网络上一个示例数据进行绘制Table1。

Let's start!!!

案例

Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), numerically and visually examine the continuous variable hospitals_per_100k (number of hospitals per 100,000 persons) and the categorical variable CensusRegionName.

Task1:使用COVID数据集（COVID_20210908_rph.rData，），对连续变量hospitals_per_100k（每100000人的医院数量）和分类变量CensusRegionName进行数值和可视化检验。

#读取数据
load("covid_20210908_rmph.rData")

#Task1:
par(mfrow = c(1,2))
hist(covid$hospitals_per_100k, main = "hospitals_per_100k")
barplot(table(covid$CensusRegionName), main = "CensusRegionName")

Using the United Nations Human Development Data (unhdd2020.rmph.rData, see Appendix A.2), create a “Table 1” of descriptive statistics (mean and standard deviation for continuous variables, frequency and proportion for categorical variables), overall and by Human Development Index group (hdi_group).

hdi: Human Development Index (HDI)
life: Life expectancy at birth (years)
educ_expected: Expected years of schooling (years)
gii: Gender Inequality Index
urban: Urban population (%)

Task2:使用联合国人类发展数据（unhdd22020.rmph.rData），创建一个总体和按人类发展指数组（hdi_group）分类的描述性统计“表1”（连续变量的平均值和标准差，分类变量的频率和比例）。使用包括以上变量的完整案例分析。

#Task2:
load("unhdd2020.rmph.rData")

library(gtsummary)
str(unhdd)
unhdd %>% select(hdi, life, educ_expected, gii, urban, hdi_group) %>% 
  drop_na() %>% #选择完整病例
  tbl_summary(
    # The "by" variable
    by = hdi_group,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(hdi      ~ "continuous",
                life ~ "continuous",
                educ_expected   ~ "continuous",
                gii ~ "continuous",
                urban ~ "continuous"),
    label  = list(hdi      ~ "Hdi",
                  life ~ "Life",
                  educ_expected   ~ "Educ_expected",
                  urban ~ "Urban")
  ) %>%
  modify_header(
    label = "**Variable**",
    # The following adds the % to the column total label
    # <br> is the location of a line break
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by hdi_group") %>%
  bold_labels()  %>%
  # Include an "overall" column
  add_overall(
    last = FALSE,
    # The ** make it bold
    col_label = "**All participants**<br>N = {N}"
  )

具体参数如下：

参数设置

statistic: The default for continuous variables is the median and IQR. The default for categorical variables is the frequency and proportion. Below, this option is used to instead compute the mean and standard deviation for continuous variables (and the default for categorical variables is coded explicitly).#连续变量的默认值是中位数和IQR。分类变量的默认值是频率和比例。下面，此选项用于计算连续变量的平均值和标准差（分类变量的默认值是显式编码的）
digits: tbl_summary() guesses the number of digits to which to round. Use this option to set the number yourself. Since two statistics are specified in the statistic option, two numbers are specified here, one for each statistic. #tbl_summary（）猜测要舍入到的位数。使用此选项可以自行设置数字。由于在statistics选项中指定了两个统计信息，因此此处指定两个数字，每个统计信息一个。
type: tbl_summary() guesses whether a variable is continuous or categorical based on its distribution. Use this option to specify the types yourself, in particular if you need to override the default. #tbl_summary（）根据变量的分布来猜测变量是连续的还是分类的。使用此选项可以自己指定类型，特别是在需要覆盖默认值的情况下。
label: The default row labels are the variable names or labels (if the dataset has been labeled, for example, using the Hmisc library label() function). Use this option to change the row headers. #默认的行标签是变量名或标签（如果数据集已被标记，例如，使用Hmisc库标签（）函数）。使用此选项可以更改行标题。
modify_header: The default column header is “Characteristic”. Use this option to change the column header. Surrounding text by ** results in a bold font. #默认列标题为“Characteristic”。使用此选项可以更改列标题。用将文本包围会产生粗体。
modify_caption: Use this option to add a table caption.#使用此选项可以添加表格标题。
bold_labels: Use this option to display the row labels in a bold font.#使用此选项可以用粗体显示行标签。

Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), create a “Table 1” of descriptive statistics, overall and by the number of hospitals per 100,000 persons (hospitals_per_100k) (use a median split to create a binary version of this “by” variable). Describe the following variables in this table:

pop.usafacts: County population
cases.usafacts.20210908: COVID-19 cumulative cases as of 2021-09-08
deaths.usafacts.20210908: COVID-19 cumulative deaths as of 2021-09-08
MedianAge2010: median age of county in 2010
CensusRegionName: name of census region

Task3:使用COVID数据集（COVID_20210908_rph.rData），创建描述性统计的“表1”，总体上按每100000人的医院数量（hospitals_per_100k）计算（使用中值分割来创建此“by”变量的二进制版本）。请在此表中描述以上变量。 #Task3:

load("covid_20210908_rmph.rData")

library(gtsummary)

covid <- covid %>% mutate(by = if_else(hospitals_per_100k > median(hospitals_per_100k), ">median", "<=median"))

covid %>% select(pop.usafacts, cases.usafacts.20210908, deaths.usafacts.20210908, MedianAge2010, CensusRegionName, by) %>% 
  drop_na() %>% #选择完整病例
  tbl_summary(
    # Use the median split variable as the "by" variable
    by = by,
    statistic = list(all_continuous()  ~ "{median} ({p25}, {p75})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1))
  ) %>%
  modify_header(
    label = "**Variable**",
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by SBP") %>%
  bold_labels() %>% 
  add_overall(last = FALSE,
              col_label = "**All participants**<br>N = {N}")

输出表格

Table1 <- covid %>% select(pop.usafacts, cases.usafacts.20210908, deaths.usafacts.20210908, MedianAge2010, CensusRegionName, by) %>% 
  drop_na() %>% #选择完整病例
  tbl_summary(
    # Use the median split variable as the "by" variable
    by = by,
    statistic = list(all_continuous()  ~ "{median} ({p25}, {p75})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1))
  ) %>%
  modify_header(
    label = "**Variable**",
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by By") %>%
  bold_labels() %>% 
  add_overall(last = FALSE,
              col_label = "**All participants**<br>N = {N}")

Table1 %>% 
  as_flex_table() %>% 
  flextable::save_as_docx(path = "MyTable1.docx")

Word上显示效果还是不错的。。。

添加P值

通常，在一篇已发表的研究文章中，按结果或暴露情况显示描述性统计数据的表1包括p值，这些p值检验每个变量的零假设，即该变量在人群中的所有群体中具有相同的平均值（或中位数或比例）。

covid %>% select(pop.usafacts, cases.usafacts.20210908, deaths.usafacts.20210908, MedianAge2010, CensusRegionName, by) %>% 
  drop_na() %>% #选择完整病例
  tbl_summary(
    # Use the median split variable as the "by" variable
    by = by,
    statistic = list(all_continuous()  ~ "{median} ({p25}, {p75})",
                     all_categorical() ~ "{n}    ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1))
  ) %>%
  add_p(
    test = list(all_continuous()  ~ "wilcox.test",
                all_categorical() ~ "chisq.test"),
    pvalue_fun = function(x) style_pvalue(x, digits = 3)) %>% #添加P值
  modify_header(
    label = "**Variable**",
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by By") %>%
  bold_labels() %>% 
  add_overall(last = FALSE,
              col_label = "**All participants**<br>N = {N}")

制作回归分析结果

library(gtsummary)
mod1 <- glm(response ~ trt + age + grade, trial, family = binomial)

t1 <- tbl_regression(mod1, exponentiate = TRUE)

library(survival)

# build survival model table
t2 <- coxph(Surv(ttdeath, death) ~ trt + grade + age, trial) %>%
  tbl_regression(exponentiate = TRUE)

# merge tables 
tbl_merge_ex1 <-
  tbl_merge(
    tbls = list(t1, t2),
    tab_spanner = c("**Tumor Response**", "**Time to Death**")
  )

tbl_merge_ex1

OK!打完收工！

数据链接提取码：gsdp

提取码：gsdp

- END -