Multinomial regression analysis of suicide data in R

Active Analytics Ltd: posted 21 Dec 2013 11:15 by Chibisi Chima-Okereke [ updated 22 Mar 2014 08:46 ]

Introduction

Forgive me father for I have sinned, it's been more than three months since my last confession. Yes it's been ages since I wrote a blog entry, I have a mixture of excuses but I won't bore you with them.

The aim of this blog is to outline how to carry out a basic multinomial regression analysis in R. In this blog I use the data.table package which is much hyped and with good reason, because it is awesome! I also use the multinom() function in the nnet package to carry out the regressions. Anyone who has tried carry out statistics on real data from a survey knows that it can be pretty ghastly; I hope that the methods outlined here will be useful to those people. No pretty pictures today I'm afraid!

Data Preparation

The data being used comes from the CDC and we will use the research written by Michael Wilson et al at Peer Corps Trust. The paper is free and can be downloaded here http://www.mdpi.com/1660-4601/9/11/4122. I won't spend a great deal of time explaining the analysis because there is already a paper on the methodology. Here I simply concentrate on the R code. I took the MS Access table and copied it into a CSV file. The data set is small, 1432 rows and 87 columns but the main issue here is handling less than perfect data.

First we load the data:

require(foreign)
require(data.table)
require(nnet)

#========  DATA PREP

# Loading the data

sFileName <- "SuicideData.csv"
sFilePath <- "C:/Users/.../"
dfSuicideData <- read.csv(paste(sFilePath, "/", sFileName, sep = ""))

I convert the data.frame to a data.table and overwrite the record column with row number.

dfSuicideData <- data.table(dfSuicideData)
dfSuicideData[, record:= 1:nrow(dfSuicideData)]

We need to create the response variable, which is a categorical variable with three factors. These three categories will be "NO" for no suicide tendency, "SI" for suicide ideation and SISP for suicide attempt/planning. These are derived from question 25 and 26 the "q25" and "q26" columns which. Question 25 asks the adolescents whether they seriously consered attempting suicide, and question 26 asked them whether they made a plan about how they would attempt suicide. The "NO" category answered no to both questions, the "SI" category answered "Yes" to question 25 and if they answered yes to question 26 they get assigned to SISP regardless of the previous answer.

There are many ways to capture this. My approach throughout was to use a set of conditional statements to apply the categories. So for the response, my categorizing function was:

require(foreign)
require(data.table)
require(nnet)

#========  DATA PREP

# Loading the data

sFileName <- "SuicideData.csv"
sFilePath <- "C:/Users/.../"
dfSuicideData <- read.csv(paste(sFilePath, "/", sFileName, sep = ""))

I convert the data.frame to a data.table and overwrite the record column with row number.

dfSuicideData <- data.table(dfSuicideData)
dfSuicideData[, record:= 1:nrow(dfSuicideData)]

We need to create the response variable, which is a categorical variable with three factors. These three categories will be "NO" for no suicide tendency, "SI" for suicide ideation and SISP for suicide attempt/planning. These are derived from question 25 and 26 the "q25" and "q26" columns which. Question 25 asks the adolescents whether they seriously consered attempting suicide, and question 26 asked them whether they made a plan about how they would attempt suicide. The "NO" category answered no to both questions, the "SI" category answered "Yes" to question 25 and if they answered yes to question 26 they get assigned to SISP regardless of the previous answer.

There are many ways to capture this. My approach throughout was to use a set of conditional statements to apply the categories. So for the response, my categorizing function was

AllocateItems <- function(Q25, Q26){
 
        # If both contain missing data
        if(is.na(Q25) & is.na(Q26))
        {
                sOutput <- NA_character_
        }else
        {
                # If both are no
                if(!is.na(Q25) & !is.na(Q26))
                {
                      if(Q25 == 2 & Q26 == 2)
                         sOutput <- "NO"
                }

                # if either of the items are present we assume it is correct with SISP override
                sOutput <- "NO" 
 
                if(!is.na(Q25))
                {
                                if(Q25 == 1)
                                sOutput <- "SI"
                }
 
                if(!is.na(Q26))
                {
                              if(Q26 == 1)
                                 sOutput <- "SISP"
                }
        }
 
        return(sOutput)
}

Then we apply it to the data.table to create a "Response" column.


# Creating response column
dfSuicideData[, Response := AllocateItems(q25, q26), by = record]

In the paper, the age is defined in years, so we simply add an appropriate constant to make this happen


# Age
dfSuicideData[, Age := q1 + 10, by = record]

Next we allocate sex, instead of creating a sex column with male or female categories, I create a "Sex.Male" column with "yes" or "no" as the categories. The function for allocating this is:


AllocateQ2 <- function(Q2)
{
    sOutput <- NA_character_
 
        if(!is.na(Q2)){
                if(Q2 == 1)
                {
                       sOutput <- "Yes"
                }else
                {
                 if(Q2 == 2)
                        sOutput <- "No"
                 }
        }
 
        return(sOutput)
}

Then we apply it to the table


# Sex
dfSuicideData[, Sex.Male := AllocateQ2(q2), by = record]

We apply the categories in the paper, apportioning them as it stipulates until we end up with the analysis data, the rest of the data prep reads very similar to the above so feel free to skip ahead to the Analysis part.

# Function for allocating Q6 Q22, Q23
AllocateQ6.22.23 <- function(Q6)
{
        sOutput <- NA_character_
 
        if(!is.na(Q6)){
                        if(Q6 %in% 1:3)
                {
                        sOutput <- "No"
                       }else
                {
                        if(Q6 %in% 4:5)
                          sOutput <- "Yes"
                 }
        }
 
        return(sOutput)
}

# Assigning to Q6
dfSuicideData[, Hungry := AllocateQ6.22.23(q6), by = record]

# Assign to Q23
dfSuicideData[, Worried := AllocateQ6.22.23(q23), by = record]

# Assign to Q22
dfSuicideData[, Lonely := AllocateQ6.22.23(q22), by = record]

# Allocating the binary
AllocateBinary <- function(iBin)
{
        sOutput <- NA_character_
 
        if(!is.na(iBin)){
                if(iBin == 1)
                {
                         sOutput <- "Yes"
                }else
                {
                 if(iBin == 2)
                        sOutput <- "No"
                 }
        }
        return(sOutput)
}

# Assign Sad
dfSuicideData[, Sad := AllocateBinary(q24), by = record]

# Assign Number of friends
dfSuicideData[, NumberOfFriends := q27-1, by = record]

# Assign Missing school
dfSuicideData[, MissSchool := AllocateQ6.22.23(q50), by = record]

# Function for allocating bullying days
AllocateQ20 <- function(Q20)
{
        sOutput <- NA_character_
 
        if(!is.na(Q20)){
                if(Q20 %in% 1:2)
               {
                        sOutput <- "No"
                }else
                {
                if(Q20 %in% 3:7)
                         sOutput <- "Yes"
                }
        }
 
         return(sOutput)
}

# Assigning bullying
dfSuicideData[, Bullied := AllocateQ20(q20), by = record]

# Function for assigning parental oversight
AllocateQ52.53.54 <- function(Q52)
{
        sOutput <- NA_character_
 
        if(!is.na(Q52)){
                if(Q52 %in% 1:3)
                {
                       sOutput <- "No"
                }else
                {
                 if(Q52 %in% 4:5)
                       sOutput <- "Yes"
                }
        }
 
        return(sOutput)
}

# Assigning parents checking homework
dfSuicideData[, CheckHomework := AllocateQ52.53.54(q52), by = record]

# Assigning understanding parents
dfSuicideData[, UnderStandingParents := AllocateQ52.53.54(q53), by = record]

# Assigning parental oversight on free time
dfSuicideData[, CheckPlaytime := AllocateQ52.53.54(q54), by = record]

# Function for assigning cigarettes and drug use
AllocateQ29 <- function(Q29)
{
        sOutput <- NA_character_
 
        if(!is.na(Q29)){
                if(Q29 == 1)
                {
                       sOutput <- "No"
                }else
                {
                if(Q29 >= 2)
                        sOutput <- "Yes"
                }
        }
        return(sOutput)
}

# Cigarette use
dfSuicideData[, CigaretteUse := AllocateQ29(q29), by = record]

# Drug use
dfSuicideData[, DrugUse := AllocateQ29(q39), by = record]

# Drug problems
dfSuicideData[, AlcholMisuse := AllocateQ29(q38), by = record]

# subsetting the data
sWantedColumns <- c("record", "Response", "Age", "Sex.Male", "Hungry", "Worried", 
"Lonely", "Sad", "NumberOfFriends", "MissSchool", "Bullied", "CheckHomework", 
"UnderStandingParents", "CheckPlaytime", "CigaretteUse", "DrugUse", "AlcholMisuse")

dfAnalysisData <- dfSuicideData[,sWantedColumns, with = FALSE]

Analysis

Now it is time to carry out the analysis. There are three tables in the paper, the first displays the distributions of the categories and the p-value for a one-way analysis - Chi-Squared for the categorical variables and F-test ANOVA for the Age variable. First I gather my "yes" and "no" categorical variables

sOneWayAnalysis <- sWantedColumns[!sWantedColumns %in% c("Response", "Age", "record", "NumberOfFriends")]

Then we write a function to grab the proportions of the yeses and the p-values of the ANOVA, then we use lapply to carry out the analysis ...

dfDistributions <- lapply(sOneWayAnalysis, function(sFactor = "MissSchool"){
        print(sFactor)
        sFormula <- paste("~ ", sFactor, " + Response", sep = "")
        dfProportions <- (prop.table(with(dfAnalysisData, xtabs(as.formula(sFormula))), 
    margin = 2)*100)["Yes",]
        dfProportions <- data.table(t(dfProportions))
        dfProportions[, P.Value := chisq.test(get(sFactor, dfAnalysisData), 
    get("Response", dfAnalysisData))$p.value]
        return(dfProportions)
})

dfDistributions <- do.call(rbind, dfDistributions)
dfDistributions <- data.table("Variable" = sOneWayAnalysis, dfDistributions)

We then separately carry out the analysis for the other groups

# For Age
dfAgeMean <- dfAnalysisData[!is.na(Response), mean(Age, na.rm = TRUE), by = Response]
dfAgeSD <- dfAnalysisData[!is.na(Response), sd(Age, na.rm = TRUE), by = Response]

dP.Value <- data.frame(summary(aov(Age ~ factor(Response), 
     data = dfAnalysisData))[[1]])[1,"Pr..F."]
dfAgeDistribution <- data.table(Variable = "Age", NO = dfAgeMean[Response == "NO", V1],
          SI = dfAgeMean[Response == "SI", V1], SISP = dfAgeMean[Response == "SISP", V1], 
     P.Value = dP.Value)

# Combining with main table
dfDistributions <- rbind(dfAgeDistribution, dfDistributions)

# For Friends
dfFriendData <- prop.table(with(dfAnalysisData, xtabs( ~ NumberOfFriends + Response)), 
     margin = 2)*100
oNameNames <- dimnames(dfFriendData)
names(oNameNames) <- NULL
dfFriendData <- data.frame(matrix(dfFriendData, nr = 4, dimnames = oNameNames))
dFriendsPValue <- with(dfAnalysisData, chisq.test(NumberOfFriends, Response)$p.value)
dfFriendData <- data.table(Variable = paste("NumberOfFriends = ", rownames(dfFriendData), 
      sep = ""), dfFriendData, P.Value = dFriendsPValue)

Below is the equivalent of the Table 1 in the paper

dfOutputTable1 <- rbind(dfDistributions, dfFriendData)
> dfOutputTable1
                Variable        NO       SI     SISP      P.Value
 1:                  Age 14.003788 13.48515 14.09283 1.189294e-03
 2:             Sex.Male 47.523810 58.58586 40.25424 7.609487e-03
 3:               Hungry 13.022352 25.00000 22.07792 6.946819e-05
 4:              Worried  6.362773 19.19192 27.11864 3.988193e-21
 5:               Lonely  7.939509 28.57143 26.77824 1.221959e-19
 6:                  Sad 23.432980 51.64835 60.69869 8.005775e-31
 7:           MissSchool  3.636364 13.18681 11.45374 1.422812e-07
 8:              Bullied 17.820658 41.33333 37.70492 1.094117e-11
 9:        CheckHomework 40.684794 41.75824 35.24229 2.955763e-01
10: UnderStandingParents 37.627812 31.03448 27.31278 9.359673e-03
11:        CheckPlaytime 40.570846 40.65934 31.98198 5.762363e-02
12:         CigaretteUse 11.976048 28.57143 27.10280 6.309457e-10
13:              DrugUse  9.772952 20.00000 25.11013 6.134090e-10
14:         AlcholMisuse 16.714976 33.68421 32.18884 3.943569e-09
15:  NumberOfFriends = 0  3.625954 13.82979 11.39241 4.625285e-08
16:  NumberOfFriends = 1 14.503817 15.95745 15.18987 4.625285e-08
17:  NumberOfFriends = 2 15.935115 18.08511 21.94093 4.625285e-08
18:  NumberOfFriends = 3 65.935115 52.12766 51.47679 4.625285e-08

The attentive will notice that these numbers are different from those in the paper. This is because the paper has some mistakes in the count data. Now for the multinomial regression part. Table 2 and three can be calculated with one function. First we need to ensure that the number of friends is a factor with the base level at 3, we do this using the relevel() function which is quite useful when you want to adjust the base level in a regression.

# We need to make sure that the regression treats the number of friends as a factor
dfAnalysisData[, NumberOfFriends := relevel(factor(NumberOfFriends), ref = "3")]

Now we write the function that will do the regression. The idea is to construct a formula that we pass to the multinom() function, and then get the summary. The "nnet" package version I used was "7.3-6" on R 2.15.2. The reason I say this is because there was no p-value in the output of the multinom() function. So I had to calculate it (which is actually straightforward). We can then calculate the relativities of the coefficients given in the model and the confidence intervals then output the final data.

getModelResult <- function(sFormula){
        oModel <- multinom(sFormula, data = dfAnalysisData)

        # Multinomial Model summaries
        dfModelCoef <- data.frame(t(summary(oModel)$coefficients))
        dfModelSE <- data.frame(t(summary(oModel)$standard.errors))
        # Calculating the p-value using wald statistic
        dfModelPValues <- data.frame(apply((dfModelCoef/dfModelSE)^2, 2, 
    pchisq, df = 1, lower.tail = FALSE))


 # The relativities
        dfMeanCoefs <- exp(dfModelCoef)
        dfUpperCoefs <- exp(dfModelCoef + qnorm(.975)*dfModelSE)
        dfLowerCoefs <- exp(dfModelCoef - qnorm(.975)*dfModelSE)

        # Confidence intervals
        names(dfLowerCoefs) <- paste("Lower.", names(dfLowerCoefs), sep = "")
        names(dfUpperCoefs) <- paste("Upper.", names(dfUpperCoefs), sep = "")
        names(dfModelPValues) <- paste("P.Value.", names(dfModelPValues), sep = "")

        # Output table
        dfModelOutput <- data.frame(dfMeanCoefs, dfLowerCoefs, dfUpperCoefs, dfModelPValues)
        return(dfModelOutput)
}

We can then generate table 2 ...

sFormula <- formula(paste(c("Response ~ Age", sOneWayAnalysis, "NumberOfFriends"), collapse = " + "))
dfOutputTable2 <- getModelResult(sFormula = sFormula)
> print(dfOutputTable2, digits = 3)
                           SI   SISP Lower.SI Lower.SISP Upper.SI Upper.SISP P.Value.SI P.Value.SISP
(Intercept)             0.332 0.0364   0.0125    0.00401     8.78      0.330   5.09e-01     3.22e-03
Age                     0.808 1.0345   0.6403    0.88948     1.02      1.203   7.10e-02     6.60e-01
Sex.MaleYes             1.818 0.7347   0.9374    0.46225     3.52      1.168   7.69e-02     1.92e-01
HungryYes               0.786 1.1993   0.3343    0.68677     1.85      2.094   5.82e-01     5.23e-01
WorriedYes              1.118 3.1397   0.4074    1.71937     3.07      5.733   8.29e-01     1.96e-04
LonelyYes               5.883 2.3604   2.6601    1.31230    13.01      4.246   1.21e-05     4.14e-03
SadYes                  2.154 3.3344   1.1167    2.15123     4.15      5.168   2.21e-02     7.21e-08
MissSchoolYes           2.020 1.4335   0.5505    0.53242     7.41      3.860   2.89e-01     4.76e-01
BulliedYes              1.580 1.7436   0.7973    1.07785     3.13      2.821   1.90e-01     2.35e-02
CheckHomeworkYes        1.248 1.0549   0.6361    0.66128     2.45      1.683   5.19e-01     8.22e-01
UnderStandingParentsYes 0.627 0.5899   0.2950    0.34922     1.33      0.996   2.24e-01     4.85e-02
CheckPlaytimeYes        1.337 1.0953   0.6735    0.67990     2.66      1.764   4.06e-01     7.08e-01
CigaretteUseYes         4.609 2.7042   2.0041    1.49007    10.60      4.907   3.23e-04     1.07e-03
DrugUseYes              0.380 1.4016   0.1107    0.69012     1.30      2.847   1.24e-01     3.50e-01
AlcholMisuseYes         1.674 1.7280   0.7850    1.04006     3.57      2.871   1.82e-01     3.47e-02
NumberOfFriends0        2.534 2.0221   0.9123    0.93589     7.04      4.369   7.44e-02     7.32e-02
NumberOfFriends1        0.693 1.2138   0.2257    0.65133     2.13      2.262   5.23e-01     5.42e-01
NumberOfFriends2        1.328 1.4589   0.5787    0.85197     3.05      2.498   5.04e-01     1.69e-01

and then table 3

sFormula <- formula(paste(c("Response ~", sOneWayAnalysis[sOneWayAnalysis != "Sex.Male"], "NumberOfFriends"), collapse = " + "))
dfOutputTable3 <- getModelResult(sFormula = sFormula)

> print(dfOutputTable3, digits = 3)
                            SI   SISP Lower.SI Lower.SISP Upper.SI Upper.SISP P.Value.SI P.Value.SISP
(Intercept)             0.0228 0.0514   0.0116     0.0323   0.0451     0.0818   1.25e-27     6.04e-36
HungryYes               0.8971 1.2019   0.3919     0.6963   2.0536     2.0745   7.97e-01     5.09e-01
WorriedYes              0.9863 3.1040   0.3615     1.7142   2.6907     5.6205   9.78e-01     1.85e-04
LonelyYes               5.3816 2.4364   2.4723     1.3664  11.7145     4.3445   2.23e-05     2.55e-03
SadYes                  2.0209 3.5325   1.0555     2.2905   3.8692     5.4480   3.38e-02     1.13e-08
MissSchoolYes           2.4909 1.2892   0.6857     0.4897   9.0492     3.3946   1.66e-01     6.07e-01
BulliedYes              1.7734 1.7128   0.9021     1.0675   3.4862     2.7483   9.67e-02     2.57e-02
CheckHomeworkYes        1.3807 0.9973   0.7132     0.6310   2.6731     1.5762   3.38e-01     9.91e-01
UnderStandingParentsYes 0.6245 0.6122   0.2955     0.3646   1.3195     1.0278   2.17e-01     6.34e-02
CheckPlaytimeYes        1.3129 1.0512   0.6643     0.6547   2.5948     1.6877   4.33e-01     8.36e-01
CigaretteUseYes         4.3902 2.7013   1.9834     1.5016   9.7177     4.8597   2.63e-04     9.11e-04
DrugUseYes              0.4292 1.3867   0.1288     0.6917   1.4303     2.7799   1.68e-01     3.57e-01
AlcholMisuseYes         1.6264 1.6739   0.7661     1.0138   3.4529     2.7639   2.05e-01     4.40e-02
NumberOfFriends0        2.3265 2.0328   0.8456     0.9436   6.4010     4.3794   1.02e-01     7.00e-02
NumberOfFriends1        0.6372 1.2684   0.2091     0.6825   1.9417     2.3575   4.28e-01     4.52e-01
NumberOfFriends2        1.1270 1.5588   0.5006     0.9213   2.5374     2.6376   7.73e-01     9.81e-02

Conclusion

Carrying out regression analysis, even simple ones like this is quite tricky, you can that see the amount of time spent preparing the data is disproportionate to the amount of time required to carry out the analysis. The model execution is one command multinom(), and the rest of the time is spent manipulating the data and outputs. Dealing with missing data can be tricky and is a whole research field onto itself, here I have basically passed the problem to R's default which is basically to ignore them. Sometimes it may be useful to assign NA values to a category in a categorical explanatory variable especially if there are proportionally a lot of missing values, this allows the missing group in each variable to be compared with others groups in the model output. There are many package that carry out imputation on data sets in R mice is a good place to start. I hope that this has been useful to the reader.

Thank you.

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.