3 {discord}

3.1 Behavior Genetics

As a field, (human) behavior – or behavioral – genetics explores individual differences in psychological traits and characteristics that arise from genetic and environmental factors (Burt, Plaisance, and Hambrick 2019). As described in (S. Mason Garrison 2017), the field grew as an offshoot of psychology and psychiatry as our understanding of molecular genetics matured with the identification of DNA’s structure and sequencing – see (Watson and Crick 1953; Sanger, Nicklen, and Coulson 1977). Although considered the “gold standard” for inferring causation (Rubin 2008), conducting randomized experiments to tease apart individual traits from differences in genes and environment are not always possible. For both ethical and practical considerations, behavior geneticists often use quasi-experimental designs which control for potential confounds using a variety of statistical approaches (S. Mason Garrison and Joseph Lee Rodgers 2017).

In this thesis, I present work based on a quasi-experimental design using kinship modeling. Kin comparison designs distinguish within-family variance from between-family variance (Gary Chamberlain and Zvi Griliches 1975). The former is a measure of how family members differ from one another; the latter reflects sources that make family members similar to one another but different from other families (S. Mason Garrison and Joseph Lee Rodgers 2017). By partitioning these sources of variance, behavioral geneticists may greatly reduce confounds when testing causal hypotheses (Lahey and D’Onofrio 2010). To ease researchers in this goal, I have rewritten an R package, {discord}, to provide user-friendly functions for sibling-based quasi-experimental designs.2

3.2 Technical Aspects

At a high-level, the {discord} package implements a modified reciprocal standard dyad model (Kenny, Kashy, and Cook 2006) to facilitate kinship comparisons known as the discordant-kinship model. Consider the simplified case where a behavioral outcome, \(Y\), is predicted by one variable, \(X\). The discordant-kinship model relates the difference in the outcome, \(Y_{i\Delta}\), for the \(i\text{th}\) kinship pair, where \(\bar{Y}_i\) is the mean level of the outcome, \(\bar{X}_i\) is the mean level of the predictor, and \(X_{i\Delta}\) is the between-kin difference in the predictor.

\[ Y_{i\Delta} = \beta_0 + \beta_1 \bar{Y}_i + \beta_2 \bar{X}_i + \beta_3 X_{i\Delta} + \epsilon_i \]

This model partitions variance in line with the above discussion to support causal inference. Specifically, the within-family variance is described by \(Y_{\Delta}\) and \(X_{\Delta}\); between-family variance is captured by \(\bar{Y}\) and \(\bar{X}\) (S. Mason Garrison and Joseph Lee Rodgers 2017).

A non-significant association between \(Y_\Delta\) and \(X_\Delta\) suggests that the variables are not causally related and may have arisen from genetic covariance or shared-environmental factors. In contrast, a significant association may provide support for a causal relationship between variables depending on the relatedness of each kin pair. That is, the discordant-kinship model is applicable for any kin-pair: monozygotic twins who share 100% of their DNA; full-siblings who share 50%; half-siblings who share 25%; cousins who share 12.5%; etc. Thus, a significant relationship found with monozygotic twins would provide stronger support for a causal claim than the same relationship between cousins.

Following (S. Mason Garrison and Joseph Lee Rodgers 2017), I recommend interpreting significant associations as not disproving a causal relationship. Although this design controls for much (sibling) if not all (monozygotic twins) background heterogeneity, it is possible that a significant relationship between a phenotype and plausible covariates is possible due to non-shared environmental influences.

The next section illustrates how to perform a discordant-kinship regression using the {discord} package.

3.3 Vaccine willingness and socioeconomic status

3.3.1 Introduction

The following analysis is a pared-down version of previous work presented at the Behavior Genetics Association 50th Annual Meeting (Jonathan Trattner, Kennon Later, and S. Mason Garrison 2020). The original project was inspired by reports detailing health disparities amongst ethnic minorities during the Covid-19 pandemic (Hooper, Nápoles, and Pérez-Stable 2020). These were often attributed to differences in socioeconomic status (SES), pre-existing health conditions, and Covid-19 symptom severity (Ssentongo et al. 2020; Yang, Gui, and Xiong 2020). In line with the field of behavior genetics, any intervention to address these disparities must explicitly account for known gene-and-environmental confounds (Garrison and Rodgers 2019; Williams et al. 2020).

We thus aimed to identify the relationship between SES and vaccination willingness using a quasi-experimental design. Data for this analysis is from the 1979 National Longitudinal Survey of Youth (NLSY79), a nationally representative household probability sample sponsored by the U.S. Bureau of Labor Statistics and Department of Defense. Participants were surveyed annually from 1979-1994 at which point surveys occurred biennially. The data set is publicly available at https://www.nlsinfo.org/. The data include a biennial flu vaccine survey containing data from 2006-2016. We examined whether SES at age 40 was a significant predictor for vaccination rates using the discordant-kinship model.

As described in (Garrison and Rodgers 2019), SES was quantified using methodology from (Myrianthopoulos and French 1968). Individuals were given a mean quantile score based on their net family income, years of education, and occupation prestige. Missing data was imputed from nonmissing components, and higher scores correspond to higher SES.

The data for this analysis was downloaded with the NLS Investigator and can be found here. The SES at age 40 data can be found here. For clarity, and to emphasize the functionality of {discord}, the data has been pre-processed using this script. This discordant-kinship analysis is possible thanks to recent work that estimated relatedness for approximately 95% of the NLSY79 kin pairs (Rodgers et al. 2016). These kinship links are included in the {NlsyLinks} R package (Beasley et al. 2016) and are easily utilized with the {discord} package.

3.3.2 Data Cleaning

For this example, we will load the following packages.

# For easy data manipulation
library(dplyr)
# For kinship linkages
library(NlsyLinks)
# For discordant-kinship regression
library(discord)
# To clean data frame names
library(janitor)

After some pre-processing, we have a data frame containing subject identifiers, demographic information such as race and sex, and behavioral measurements such as flu vaccination rates and SES at age 40:

Using the kinship relationships included in the {NlsyLinks} package, we can create a data frame that lends itself to behavior-genetic analysis. For each kin pair, the function CreatePairLinksSingleEntered() takes a data set like the one above, a specification of the NLSY database and the kin’s relatedness, and the variables of interest. It returns a data frame where every row is a kin-pair and each column is a variable of interest with a suffix indicating to which individual the value corresponds.

For our example, we want to examine the relationship between flu vaccinations received between 2006-2016 and SES at age 40 between full siblings. I’m specifying the following variables from the above data set.

# Get kinship links for individuals with the following variables:
link_vars <- c("FLU_total", "FLU_2008", "FLU_2010", 
               "FLU_2012", "FLU_2014", "FLU_2016", 
               "S00_H40", "RACE", "SEX")

We now link the subjects by the specified variables using CreatePairLinksSingleEntered().

# Narrow NLSY Link Pairs 
link_pairs <- Links79PairExpanded %>%
  filter(RelationshipPath == "Gen1Housemates" & RFull == 0.5)

df_link <- CreatePairLinksSingleEntered(outcomeDataset = flu_ses_data,
                                        linksPairDataset = link_pairs,
                                        outcomeNames = link_vars)

We’ve saved this data frame as df_link. A subset of this data looks like this:

Notice that, with the exception of the first column indicating the specific pair, each column name has the suffix “_S1” and “_S2”. As mentioned above, these indicate to which sibling the column values correspond.

This data is almost ready for analysis, but we want to ensure that the data are representative of actual trends. The FLU_total column is simply a sum of the biennial survey responses. So for a given sibling-pair, one or both individuals may not have responded to the survey indicating their vaccination status. If that’s the case, we wish to exclude those siblings to reduce non-response bias. We can do this by examining the biennial responses and removing any rows that have NA.

# Take the linked data, group by the sibling pairs and
# count the number of responses for flu each year. If there is an NA, 
# then data is missing for one of the years, and we omit it.
consistent_kin <- df_link %>% 
  group_by(SubjectTag_S1, SubjectTag_S2) %>% 
  count(FLU_2008_S1, FLU_2010_S1, FLU_2012_S1, FLU_2014_S1, FLU_2016_S1,
        FLU_2008_S2, FLU_2010_S2, FLU_2012_S2, FLU_2014_S2, FLU_2016_S2) %>% 
  na.omit()

# Create the flu_modeling_data object with only consistent responders.
# Clean the column names with the {janitor} package.
flu_modeling_data <- semi_join(df_link, consistent_kin, 
                               by = c("SubjectTag_S1", "SubjectTag_S2")) %>%
  clean_names()

To be extra safe in our analysis, we specify that the sibling-pairs should be from unique households (i.e. remove households with more than one sibling-pair).

flu_modeling_data <- flu_modeling_data %>%
  group_by(extended_id) %>%
  slice_sample() %>%
  ungroup()

The data we will use for modeling now contains meta-information for each kin pair, including sex and race of each individual, flu vaccination status for the biennial survey between 2006-2016, and a total flu vaccination count for that period. The total vaccination count ranges from 0 - 5, where 0 indicates that the individual did not get a vaccine in any year between 2006-2016 and 5 indicates that an individual got at least 5 vaccines between 2006-2016. Though our data set has individual years, we are only interested in the total. A subset of the data we will use in our regression looks like this:

## # A tibble: 1,067 x 11
##    extended_id subject_tag_s1 subject_tag_s2 flu_total_s1 flu_total_s2 race_s1
##          <int>          <int>          <int>        <int>        <int>   <dbl>
##  1          17           1700           1800            0            0       0
##  2          29           2900           3000            2            0       0
##  3          37           3700           3800            1            5       0
##  4          40           4000           4100            2            0       0
##  5          58           5800           5900            5            0       0
##  6          61           6100           6200            3            4       0
##  7          67           6700           6800            4            4       0
##  8          74           7500           7600            0            0       0
##  9          83           8300           8400            0            3       1
## 10          85           8500           8600            0            0       1
## # … with 1,057 more rows, and 5 more variables: race_s2 <dbl>, sex_s1 <dbl>,
## #   sex_s2 <dbl>, s00_h40_s1 <dbl>, s00_h40_s2 <dbl>

3.3.3 Modeling and Interpretation

To perform the regression using the {discord} package, we supply the data frame and specify the outcome and predictors. It also requires a kinship pair id, extended_id in our case, as well as pair identifiers – the column name suffixes that identify to which kin a column’s values correspond (“_s1” and “_s2” in our case).3 Optional, though recommended, are columns containing sex and race information to control for as additional covariates. In our case, these columns are prefixed “race” and “sex”. Per the pre-processing script, these columns contain dummy variables where the default race is non-black, non-Hispanic and the default sex is female.

By entering this information into the discord_regression() function, we can run our model as such:

# Setting a seed for reproducibility
set.seed(18)
flu_model_output <- discord_regression(data = flu_modeling_data,
                                outcome = "flu_total",
                                predictors = "s00_h40",
                                id = "extended_id",
                                sex = "sex",
                                race = "race",
                                pair_identifiers = c("_s1", "_s2"))

The default output of discord_regression() is a tidy data frame containing the model metrics – courtesy of the broom package (Robinson, Hayes, and Couch 2021). In this example, our results are as follows:

Term Estimate Standard Error T Statistic P Value
(Intercept) 1.259 0.191 6.598 p<0.001
flu_total_mean 0.186 0.033 5.576 p<0.001
s00_h40_diff 0.006 0.002 2.822 p=0.005
s00_h40_mean 0.005 0.003 1.781 p=0.075
sex_1 -0.113 0.097 -1.173 p=0.241
race_1 0.021 0.101 0.210 p=0.834
sex_2 0.101 0.096 1.043 p=0.297

Looking at our output, we can think of the intercept as the average difference of the outcome between siblings, ignoring all other variables. In this case, it looks like the average sibling difference for two sisters of a non-minority ethnic background (the default sex and race values) is approximately 1.5. The term flu_total_mean is essentially an extra component of the intercept that captures some non-linear trends and allows the difference score to change as a function of the average predictors. Here, this is the mean socioeconomic status for the siblings, s00_h40_mean. We also accounted for sex and race, neither of which have a statistically significant effect on the differences in flu vaccine shots between siblings (different families) or within a sibling pair (same family).

The most important metric from the output, though, is the difference score, s00_h40_diff. In this example, it is statistically significant. To interpret it, we may say “the difference in socioeconomic status between siblings at age 40 is positively associated with the difference in the number of flu vaccinations received between 2006-2016.” This means that a sibling with 10% higher SES is expected to have 0.0564853 more flu shots.

The goal of performing a discordant-kinship regression is to see whether there is a significant difference in some behavioral measure while controlling for as much gene-and-environmental variance as possible. In this example, we found there is a significant difference in the number of flu shots a sibling received and their socioeconomic status. We cannot claim this relationship is causal. However, we cannot eliminate causality because there are statistically significant within- and between-family differences in our predictors and outcomes.

3.4 Conclusion

In its current implementation, the {discord} package encourages best practices for performing discordant-kinship regressions. For example, the main function has the default expectation that sex and race indicators will be supplied. These measures are both important covariates when testing for causality between familial background and psychological characteristics.

This, and other design choices, are crucial to facilitating transparent and reproducible results. To further support this, future releases of {discord} will provide improved documentation and allow for easier inspection of the underlying model implementation and results.

An example of this can be seen in the development version, which is hosted on GitHub. From commit d0189a3, {discord} has the option to return the complete linear model object, as opposed to an abridged summary via {broom}. This allows researchers to get a clearer understanding of model metrics such as degrees of freedom and \(R^2\).

Further, future versions of {discord} will explore hitherto undefined mathematical frameworks for implementing discordant-kinship regressions with non-Gaussian distributions. The example covered in the vignette, while illustrative of how to use {discord}, did not follow appropriate modeling assumptions. To more truthfully describe the relationship between socioeconomic status and flu vaccines received, we would need a distribution that is limited to integers, or counts, such as the Poisson.

In short, {discord} provides user-friendly functions for genetically-informed quasi experimental designs. As of May 11, 2021, the version of {discord} described in this thesis has been downloaded over 3800 times, which I think is incredible. Future work on this package will attempt to make it more transparent and extendable, while maintaining its ease-of-use.