Chapter 9

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ```{r, echo=F, message=F, warning=F} library(readr) library(openintro) library(here) library(tidyverse) library(xtable) library(broom) library(DAAG) library(Sleuth3) library(ROCR) data(allbacks) data(COL) ``` # Logistic regression ## Regression so far... \begin{itemize} \item Simple linear regression \begin{itemize} \item Relationship between numerical response and a numerical or categorical predictor \end{itemize} \pause \item Multiple regression \begin{itemize} \item Relationship between numerical response and multiple numerical and/or categorical predictors \end{itemize} \end{itemize} \pause What we haven't seen is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc.) or when the response is weird (categorical, count data, etc.) ## Odds Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).\\ For some event $E$, \[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1-P(E)}\] Similarly, if we are told the odds of E are $x$ to $y$ then \[\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)} \] which implies \[P(E) = x/(x+y),\quad P(E^c) = y/(x+y)\] ## Example - Donner Party In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested rote to the Sacramento Valley. Having reached its full size of 87 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 40 of the 87 members had died from famine and exposure to extreme cold. ## Example - Donner Party - Data \begin{center} \begin{tabular}{rlll} \hline & Age & Sex & Status \\ \hline 1 & 23.00 & Male & Died \\ 2 & 40.00 & Female & Survived \\ 3 & 40.00 & Male & Survived \\ 4 & 30.00 & Male & Died \\ 5 & 28.00 & Male & Died \\ $\vdots$ & ~~~$\vdots$ & ~~~$\vdots$ & ~~~$\vdots$ \\ 43 & 23.00 & Male & Survived \\ 44 & 24.00 & Male & Died \\ 45 & 25.00 & Female & Survived \\ \hline \end{tabular} \end{center} ## Example - Donner Party - EDA Status vs. Gender: \begin{center} \begin{tabular}{rrr} \hline & Male & Female \\ \hline Died & 20 & 5 \\ Survived & 10 & 10 \\ \hline \end{tabular} \end{center} \pause \vfill Status vs. Age: ```{r, echo=F, message=F, warning=F, fig.width=4, fig.height=2,fig.align='center'} donner = case2001 par(mar=c(2,4,2,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.25, cex.axis = 1.25) boxplot(donner$Age ~ donner$Status, xlab = "", ylab = "", cex.axis = 1.5) ``` ## Example - Donner Party It seems clear that both age and gender have an effect on someone's survival, how do we come up with a model that will let us explore this relationship? \pause Even if we set Died to 0 and Survived to 1, this isn't something we can transform our way out of - we need something more. \pause One way to think about the problem - we can treat Survived and Died as successes and failures arising from a binomial distribution where the probability of a success is given by a transformation of a linear model of the predictors. ## Generalized linear models It turns out that this is a very general way of addressing this type of problem in regression, and the resulting models are called generalized linear models (GLMs). Logistic regression is just one example of this type of model. \pause All generalized linear models have the following three characteristics: \begin{enumerate} \item A probability distribution describing the outcome variable \item A linear model \begin{itemize} \item $\eta = \beta_0+\beta_1 X_1 + \cdots + \beta_n X_n$ \end{itemize} \item A link function that relates the linear model to the parameter of the outcome distribution \begin{itemize} \item $g(p) = \eta$ or $p = g^{-1}(\eta)$ \end{itemize} \end{enumerate} ## Logistic Regression Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors. We assume a binomial distribution produced the outcome variable and we therefore want to model $p$ the probability of success for a given set of predictors. \pause To finish specifying the Logistic model we just need to establish a reasonable link function that connects $\eta$ to $p$. There are a variety of options but the most commonly used is the logit function. \[logit(p) = \log\left(\frac{p}{1-p}\right),\text{ for $0\le p \le 1$}\] ## The logistic regression model The logit function takes a value between 0 and 1 and maps it to a value between $-\infty$ and $\infty$. \[g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\] The inverse logit function takes a value between $-\infty$ and $\infty$ and maps it to a value between 0 and 1. This formulation also has some use when it comes to interpreting the model as logit can be interpreted as the log odds of a success, more on this later. ## The logistic regression model The three GLM criteria give us: \begin{align*} y_i &\sim \text{Binom}(p_i)\\ \\ \eta &= \beta_0+\beta_1 x_1 + \cdots + \beta_n x_n\\ \\ \text{logit}(p) &= \eta \end{align*} From which we arrive at, \[ p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i})} \] ## Example - Donner Party - Model In R we fit a GLM in the same was as a linear model except using \texttt{glm} instead of \texttt{lm} and we must also specify the type of GLM to fit using the \texttt{family} argument. \tiny ```{r} g1=glm(Status~Age, data=donner, family=binomial) summary(g1) ``` ## Example - Donner Party - Prediction \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\ \hline (Intercept) & 1.8185 & 0.9994 & 1.82 & 0.0688 \\ Age & -0.0665 & 0.0322 & -2.06 & 0.0391 \\ \hline \end{tabular} \end{center} Model: \[\log\left(\frac{p}{1-p}\right) = 1.8185-0.0665\times \text{Age}\] \pause Odds / Probability of survival for a newborn (Age=0): \pause \begin{align*} \log\left(\frac{p}{1-p}\right) &= 1.8185-0.0665\times \alert{0}\\ \frac{p}{1-p} &= \exp(1.8185) = 6.16 \\ p &= 6.16/7.16 = 0.86 \end{align*} ## Example - Donner Party - Prediction Model: \small \[\log\left(\frac{p}{1-p}\right) = 1.8185-0.0665\times \text{Age}\] \normalsize Odds / Probability of survival for a 25 year old: \pause \small \begin{align*} \log\left(\frac{p}{1-p}\right) &= 1.8185-0.0665\times \alert{25}\\ \frac{p}{1-p} &= \exp(0.156) = 1.17 \\ p &= 1.17/2.17 = 0.539 \end{align*} \normalsize \pause Odds / Probability of survival for a 50 year old: \pause \small \begin{align*} \log\left(\frac{p}{1-p}\right) &= 1.8185-0.0665\times \alert{50}\\ \frac{p}{1-p} &= \exp(-1.5065) = 0.222 \\ p &= 0.222/1.222 = 0.181 \end{align*} ## Example - Donner Party - Prediction \scriptsize \[\log\left(\frac{p}{1-p}\right) = 1.8185-0.0665\times \text{Age}\] \normalsize ```{r, echo=F, message=F, warning=F, out.width="95%",fig.align='center'} g=glm(Status~Age+Sex, data=donner, family=binomial) par(mar=c(2,4,2,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.5, cex.axis = 1.25) plot(donner$Age,as.numeric(donner$Status)-1+0.01*((as.numeric(donner$Sex)-1)*2-1),xlab="Age",ylab="Status",xlim=c(0,80),pch=c(15,17)[donner$Sex], col = c(COL[1,2], COL[2,2]), cex = 2) legend("topright",c("Male","Female"),pch=c(15,17), col = c(COL[1,2], COL[2,2]), cex = 2, inset = 0.025) ``` ## Example - Donner Party - Prediction \scriptsize \[\log\left(\frac{p}{1-p}\right) = 1.8185-0.0665\times \text{Age}\] \normalsize ```{r, echo=F, message=F, warning=F, out.width="95%",fig.align='center'} par(mar=c(2,4,2,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.5, cex.axis = 1.25) plot(donner$Age,as.numeric(donner$Status)-1+0.01*((as.numeric(donner$Sex)-1)*2-1),xlab="Age",ylab="Status",xlim=c(0,80),pch=c(15,17)[donner$Sex], col = c(COL[1,2], COL[2,2]), cex = 2) x=0:80 p = predict(g1,newdata=data.frame(Age=x)) lines(x,exp(p)/(1+exp(p)), col = COL[4], lwd = 4) p1 = exp(p[1])/(1 + exp(p[1])) p2 = exp(p[26])/(1 + exp(p[26])) p3 = exp(p[51])/(1 + exp(p[51])) points(x = c(0,25,50), y = c(p1,p2,p3), col = COL[4], pch = 19, cex = 2) legend("topright",c("Male","Female"),pch=c(15,17), col = c(COL[1,2], COL[2,2]), cex = 2, inset = 0.025) ``` ## Example - Donner Party - Interpretation \scriptsize \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\ \hline (Intercept) & 1.8185 & 0.9994 & 1.82 & 0.0688 \\ Age & -0.0665 & 0.0322 & -2.06 & 0.0391 \\ \hline \end{tabular} \end{center} \normalsize Simple interpretation is only possible in terms of log odds and log odds ratios for intercept and slope terms. \textbf{Intercept}: The log odds of survival for a party member with an age of 0. From this we can calculate the odds or probability, but additional calculations are necessary. \textbf{Slope}: For a unit increase in age (being 1 year older) how much will the log odds ratio change, not particularly intuitive. More often then not we care only about sign and relative magnitude. ## Example - Donner Party - Interpretation - Slope \scriptsize \begin{align*} \log\left(\frac{p_1}{1-p_1}\right) &= 1.8185-0.0665 (x+1) \\ &= 1.8185-0.0665 x-0.0665 \\ \log\left(\frac{p_2}{1-p_2}\right) &= 1.8185-0.0665 x \\ \\ \log\left(\frac{p_1}{1-p_1}\right) - \log\left(\frac{p_2}{1-p_2}\right) &= -0.0665 \\ \log\left(\left. \frac{p_1}{1-p_1} \right/ \frac{p_2}{1-p_2} \right) &= -0.0665 \\ \left. \frac{p_1}{1-p_1} \right/ \frac{p_2}{1-p_2} &= \exp(-0.0665) = 0.94 \end{align*} ## Example - Donner Party - Age and Gender \tiny ```{r} g=glm(Status~Age+Sex, data=donner, family=binomial) summary(g) ``` \normalsize **Gender slope:** When the other predictors are held constant this is the log odds ratio between the given level (Female) and the reference level (Male). ## Example - Donner Party - Gender Models Just like MLR we can plug in gender to arrive at two status vs age models for men and women respectively. General model: \scriptsize \[\log\left(\frac{p_1}{1-p_1}\right) = 1.63312 + -0.07820\times\text{Age} + 1.59729\times\text{Sex}\] \normalsize Male model: \scriptsize \begin{align*} \log\left(\frac{p_1}{1-p_1}\right) &= 1.63312 + -0.07820\times\text{Age} + 1.59729\times\alert{0}\\ &= 1.63312 + -0.07820\times\text{Age} \end{align*} \normalsize Female model: \scriptsize \begin{align*} \log\left(\frac{p_1}{1-p_1}\right) &= 1.63312 + -0.07820\times\text{Age} + 1.59729\times\alert{1}\\ &= 3.23041 + -0.07820\times\text{Age} \end{align*} ## Example - Donner Party - Gender Models ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} par(mar=c(2,4,2,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.5, cex.axis = 1.25) plot(donner$Age,as.numeric(donner$Status)-1+0.01*((as.numeric(donner$Sex)-1)*2-1),xlab="Age",ylab="Status",xlim=c(0,80),pch=c(15,17)[donner$Sex], col = c(COL[1,2], COL[2,2]), cex = 2) x=0:80 p_male = predict(g,newdata=data.frame(Age=x,Sex="Male")) p_female = predict(g,newdata=data.frame(Age=x,Sex="Female")) lines(x,exp(p_male)/(1+exp(p_male)), col = COL[1], lwd = 4) lines(x,exp(p_female)/(1+exp(p_female)), col = COL[2], lwd = 2, lty = 4) legend("topright",c("Male","Female"),pch=c(15,17), col = c(COL[1,2], COL[2,2]), inset = 0.025, cex = 2) text(x = 45, y = 0.7, "Female", col = COL[2], cex = 2) text(x = 20, y = 0.3, "Male", col = COL[1], cex = 2) ``` ## Hypothesis test for the whole model \tiny ```{r} g=glm(Status~Age+Sex, data=donner, family=binomial) summary(g) ``` \footnotesize \noindent\rule{4cm}{0.4pt} \alert{Note:} The model output does not include any F-statistic, as a general rule there are not single model hypothesis tests for GLM models. ## Hypothesis tests for a coefficient \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\ \hline (Intercept) & 1.6331 & 1.1102 & 1.47 & 0.1413 \\ Age & -0.0782 & 0.0373 & -2.10 & 0.0359 \\ SexFemale & 1.5973 & 0.7555 & 2.11 & 0.0345 \\ \hline \end{tabular} \end{center} We are however still able to perform inference on individual coefficients, the basic setup is exactly the same as what we've seen before except we use a Z test. \noindent\rule{4cm}{0.4pt} \alert{Note:} The only tricky bit, which is way beyond the scope of this course, is how the standard error is calculated. ## Testing for the slope of Age \small \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\ \hline (Intercept) & 1.6331 & 1.1102 & 1.47 & 0.1413 \\ Age & \alert{-0.0782} & \textcolor{green}{0.0373} & \alert{-2.10} & \textcolor{blue}{0.0359} \\ SexFemale & 1.5973 & 0.7555 & 2.11 & 0.0345 \\ \hline \end{tabular} \end{center} \normalsize \pause \begin{align*} H_0:~& \beta_{age} = 0 \\ H_A:~& \beta_{age} \ne 0 \end{align*} \pause \begin{align*} Z &= \frac{\hat{\beta_{age}} - \beta_{age}}{SE_{age}} = \frac{\alert{-0.0782} - 0}{\textcolor{green}{0.0373}} = \alert{-2.10} \\ \\ \text{p-value} &= P(|Z| > \alert{2.10}) = P(Z>\alert{2.10})+P(Z<\alert{-2.10})\\ &= 2\times 0.0178 = \textcolor{blue}{0.0359} \end{align*} ## Confidence interval for age slope coefficient \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\ \hline (Intercept) & 1.6331 & 1.1102 & 1.47 & 0.1413 \\ Age & -0.0782 & 0.0373 & -2.10 & 0.0359 \\ SexFemale & 1.5973 & 0.7555 & 2.11 & 0.0345 \\ \hline \end{tabular} \end{center} Remember, the interpretation for a slope is the change in log odds ratio per unit change in the predictor. \pause Log odds ratio: \[ CI = PE \pm CV \times SE = -0.0782 \pm 1.96 \times 0.0373 = (-0.1513, -0.0051) \] \pause Odds ratio: \[ \exp(CI) = (\exp{-0.1513}, \exp{-0.0051}) = (0.8596 0.9949) \] ## Example - Birdkeeping and Lung Cancer A 1972 - 1981 health survey in The Hague, Netherlands, discovered an association between keeping pet birds and increased risk of lung cancer. To investigate birdkeeping as a risk factor, researchers conducted a case-control study of patients in 1985 at four hospitals in The Hague (population 450,000). They identified 49 cases of lung cancer among the patients who were registered with a general practice, who were age 65 or younger and who had resided in the city since 1965. They also selected 98 controls from a population of residents having the same general age structure. ## Example - Birdkeeping and Lung Cancer - Data \scriptsize \begin{tabular}{rllllrrr} \hline & \texttt{LC} & \texttt{FM} & \texttt{SS} & \texttt{BK} & \texttt{AG} & \texttt{YR} & \texttt{CD} \\ \hline 1 & LungCancer & Male & Low & Bird & 37.00 & 19.00 & 12.00 \\ 2 & LungCancer & Male & Low & Bird & 41.00 & 22.00 & 15.00 \\ 3 & LungCancer & Male & High & NoBird & 43.00 & 19.00 & 15.00 \\ \vdots & ~~~~~~\vdots & ~~~\vdots & ~~\vdots & ~~~\vdots & \vdots~~~ & \vdots~~~ & \vdots~~~ \\ 147 & NoCancer & Female & Low & NoBird & 65.00 & 7.00 & 2.00 \\ \hline \end{tabular} \small \begin{center} \begin{tabular}{ll} \texttt{LC} & Whether subject has lung cancer \\ \texttt{FM} & Sex of subject \\ \texttt{SS} & Socioeconomic status \\ \texttt{BK} & Indicator for birdkeeping \\ \texttt{AG} & Age of subject (years) \\ \texttt{YR} & Years of smoking prior to diagnosis or examination \\ \texttt{CD} & Average rate of smoking (cigarettes per day) \end{tabular} \end{center} \alert{Note:} NoCancer is the reference response (0 or failure), LungCancer is the non-reference response (1 or success) - this matters for interpretation. ## Example - Birdkeeping and Lung Cancer - EDA ```{r, echo=F, message=F, warning=F, out.width="80%",fig.align='center'} birds = case2002 birds$pch = NA birds$pch[birds$LC == "LungCancer" & birds$BK == "Bird"] = 17 birds$pch[birds$LC == "LungCancer" & birds$BK == "NoBird"] = 16 birds$pch[birds$LC == "NoCancer" & birds$BK == "Bird"] = 2 birds$pch[birds$LC == "NoCancer" & birds$BK == "NoBird"] = 1 birds$col[birds$LC == "LungCancer"] = COL[2,2] birds$col[birds$LC == "NoCancer"] = COL[1,2] # birds par(mar=c(2,4,2,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.5, cex.axis = 1.25) plot(YR~AG, data = birds, pch = birds$pch, col = birds$col, lwd = 2) ``` \begin{center} \begin{tabular}{r|cc} & Bird & No Bird \\ \hline Lung Cancer & \raisebox{0.5pt}{\tikz{\node[draw,scale=0.3,regular polygon, regular polygon sides=3,fill=green!10!green,rotate=0](){};}} & \tikz\draw[green,fill=green] (0,0) circle (.5ex); \\ No Lung Cancer & \raisebox{0.5pt}{\tikz{\node[draw,scale=0.3,regular polygon, regular polygon sides=3,color = blue,fill=none,rotate=0](){};}} & \tikz\draw[blue,fill=none] (0,0) circle (.5ex); \end{tabular} \end{center} ## Example - Birdkeeping and Lung Cancer - Model \tiny ```{r} summary(glm(LC ~ FM + SS + BK + AG + YR + CD, data=birds, family=binomial)) ``` ## Example - Birdkeeping and Lung Cancer - Interpretation \scriptsize \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\ \hline (Intercept) & -1.9374 & 1.8043 & -1.07 & 0.2829 \\ FMFemale & 0.5613 & 0.5312 & 1.06 & 0.2907 \\ SSHigh & 0.1054 & 0.4688 & 0.22 & 0.8221 \\ \alert{BKBird} & \alert{1.3626} & \alert{0.4113} & \alert{3.31} & \alert{0.0009} \\ AG & -0.0398 & 0.0355 & -1.12 & 0.2625 \\ \alert{YR} & \alert{0.0729} & \alert{0.0265} & \alert{2.75} & \alert{0.0059} \\ CD & 0.0260 & 0.0255 & 1.02 & 0.3081 \\ \hline \end{tabular} \end{center} \normalsize \pause Keeping all other predictors constant then, \begin{itemize} \pause \item The odds ratio of getting lung cancer for bird keepers vs non-bird keepers is $\exp(1.3626) = 3.91$. \pause \item The odds ratio of getting lung cancer for an additional year of smoking is $\exp(0.0729) = 1.08$. \end{itemize} ## What do the numbers not mean ... The most common mistake made when interpreting logistic regression is to treat an odds ratio as a ratio of probabilities.\pause Bird keepers are \underline{not} 4x more likely to develop lung cancer than non-bird keepers. \pause This is the difference between relative risk and an odds ratio. \[RR = \frac{P(\text{disease} | \text{exposed})}{P(\text{disease} | \text{unexposed})} \] \[OR = \frac{P(\text{disease} | \text{exposed}) / [1-P(\text{disease} | \text{exposed})]}{P(\text{disease} | \text{unexposed})/[1-P(\text{disease} | \text{unexposed})]} \] ## Back to the birds What is probability of lung cancer in a bird keeper if we knew that $P(\text{lung cancer}|\text{no birds}) = 0.05$? \begin{align*} OR &= \frac{P(\text{lung cancer} | \text{birds}) / [1-P(\text{lung cancer} | \text{birds})]}{P(\text{lung cancer} | \text{no birds})/[1-P(\text{lung cancer} | \text{no birds})]} \\ \\ &= \frac{P(\text{lung cancer} | \text{birds}) / [1-P(\text{lung cancer} | \text{birds})]}{0.05/[1-0.05]} = 3.91 \end{align*} \pause \[ P(\text{lung cancer} | \text{birds}) = \frac{3.91 \times \frac{0.05}{0.95}}{1+3.91 \times \frac{0.05}{0.95}} = 0.171\] \pause \[ RR = P(\text{lung cancer} | \text{birds}) / P(\text{lung cancer} | \text{no birds})\] \[= 0.171 / 0.05 = 3.41\] ## Bird OR Curve ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} # odds ratios p = (1:99)/100 f = function(p) (3.91 * p/(1-p))/(1+3.91 * p/(1-p)) par(mar=c(5, 4, 2, 2) + 0.1) plot(p,f(p),xlab="P(lung cancer | no birds)",ylab="P(lung cancer | birds)",type="l", cex.axis = 1.5, cex.lab = 1.5, lwd = 4) abline(a=0,b=1,col='lightgrey', lwd = 4) ``` ## Bird OR Curve ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} # odds ratios par(mar=c(5, 4, 2, 2) + 0.1) plot(p,f(p),xlab="P(lung cancer | no birds)",ylab="P(lung cancer | birds)",type="l", cex.axis = 1.5, cex.lab = 1.5, lwd = 4) abline(a=0,b=1,col='lightgrey', lwd = 4) points(0.05,f(0.05),col='red',pch=16, cex = 2) lines(c(0.05,0.05),c(0,f(0.05)),col='red', lwd = 4) ``` ## Bird OR Curve ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} # odds ratios par(mar=c(5, 4, 2, 2) + 0.1) plot(p,f(p),xlab="P(lung cancer | no birds)",ylab="P(lung cancer | birds)",type="l", cex.axis = 1.5, cex.lab = 1.5, lwd = 4) abline(a=0,b=1,col='lightgrey', lwd = 4) points(0.05,f(0.05),col='red',pch=16, cex = 2) lines(c(0.05,0.05),c(0,f(0.05)),col='red', lwd = 4) lines(c(-1,0.05),c(f(0.05),f(0.05)),col='red', lwd = 4) ``` ## Bird OR Curve ![](images/all_OR.pdf) ## (An old) Example - \underline{House} If you've ever watched the TV show \underline{House} on Fox, you know that Dr. House regularly states, "It's never lupus." Lupus is a medical phenomenon where antibodies that are supposed to attack foreign cells to prevent infections instead see plasma proteins as foreign bodies, leading to a high risk of blood clotting. It is believed that 2\% of the population suffer from this disease. The test for lupus is very accurate if the person actually has lupus, however is very inaccurate if the person does not. More specifically, the test is 98\% accurate if a person actually has the disease. The test is 74\% accurate if a person does not have the disease. Is Dr. House correct even if someone tests positive for Lupus? ## (An old) Example - \underline{House} \centering ![](tree_lupus.pdf){width="85%"} \begin{align*} P(\text{Lupus} | +) &= \frac{P(+,\text{Lupus})}{P(+,\text{Lupus})+P(+,\text{No Lupus})} \\ &= \frac{0.0196}{0.0196+0.2548} = 0.0714 \end{align*} ## Testing for lupus It turns out that testing for Lupus is actually quite complicated, a diagnosis usually relies on the outcome of multiple tests, often including: a complete blood count, an erythrocyte sedimentation rate, a kidney and liver assessment, a urinalysis, and or an antinuclear antibody (ANA) test. It is important to think about what is involved in each of these tests (e.g. deciding if complete blood count is high or low) and how each of the individual tests and related decisions plays a role in the overall decision of diagnosing a patient with lupus. ## Testing for lupus At some level we can view a diagnosis as a binary decision (lupus or no lupus) that involves the complex integration of various explanatory variables. The example does not give us any information about how a diagnosis is made, but what it does give us is just as important - the sensitivity and the specificity of the test. These values are critical for our understanding of what a positive or negative test result actually means. ## Sensitivity and specificity \textbf{Sensitivity} - measures a tests ability to identify positive results. \[P(\text{Test }+~|~\text{Conditon }+) = P(+ | \text{lupus}) = 0.98\] \textbf{Specificity} - measures a tests ability to identify negative results. \[P(\text{Test }-~|~\text{Condition }-) = P(- | \text{no lupus}) = 0.74\] \pause It is illustrative to think about the extreme cases - what is the sensitivity and specificity of a test that always returns a positive result? What about a test that always returns a negative result? ## Sensitivity and specificity \centering ![](images/SenSpec.pdf){width="75%"} \small \begin{align*} \textbf{Sensitivity} &= P(\text{Test }+~|~\text{Condition }+) = TP / (TP + FN) \\ \textbf{Specificity} &= P(\text{Test }-~|~\text{Condition }-) = TN / (FP + TN) \\ \textbf{False negative rate} (\beta) &= P(\text{Test }-~|~\text{Condition }+) = FN / (TP + FN) \\ \textbf{False positive rate} (\beta) &= P(\text{Test }+~|~\text{Condition }-) = FP / (FP + TN) \end{align*} \pause \begin{align*} \textbf{Sensitivity} &= 1 - \textbf{False negative rate} = \text{Power}\\ \textbf{Specificity} &= 1 - \textbf{False positive rate} \end{align*} ## So what? Clearly it is important to know the Sensitivity and Specificity of test (and or the false positive and false negative rates). Along with the incidence of the disease (e.g. $P(\text{lupus})$) these values are necessary to calculate important quantities like $P(\text{lupus} | + )$. Additionally, our brief foray into power analysis before the first midterm should also give you an idea about the trade offs that are inherent in minimizing false positive and false negative rates (increasing power required either increasing $\alpha$ or $n$). How should we use this information when we are trying to come up with a decision? ## Back to Spam In lab this week, we examined a data set of emails where we were interesting in identifying the spam messages. We examined different logistic regression models to evaluate how different predictors influenced the probability of a message being spam. These models can also be used to assign probabilities to incoming messages (this is equivalent to prediction in the case of SLR / MLR). However, if we were designing a spam filter this would only be half of the battle, we would also need to use these probabilities to make a decision about which emails get flagged as spam. \pause While not the only possible solution, we will consider a simple approach where we choose a threshold probability and any email that exceeds that probability is flagged as spam. ## Picking a threshold ![](images/spam1.pdf) \pause Lets see what happens if we pick out threshold to be \alert{0.75}. ## Picking a threshold ![](images/spam2.pdf) Lets see what happens if we pick out threshold to be \alert{0.75}. ## Picking a threshold ![](images/spam3.pdf) Lets see what happens if we pick out threshold to be \alert{0.75}. ## Picking a threshold ![](images/spam4.pdf) Lets see what happens if we pick out threshold to be \alert{0.75}. ## Picking a threshold ![](images/spam5.pdf) Lets see what happens if we pick out threshold to be \alert{0.75}. ## Consequences of picking a threshold For our data set picking a threshold of 0.75 gives us the following results: \begin{align*} FN = 340 &\qquad TP = 27 \\ TN = 3545 &\qquad FP = 9 \end{align*} \pause \alert{What are the sensitivity and specificity for this particular decision rule?} \pause \begin{align*} \text{Sensitivity} & = TP / (TP + FN) = 27 / (27+340) = 0.073 \\ \text{Specificity} & = TN / (FP + TN) = 3545 / (9+3545) = 0.997 \\ \end{align*} ## Trying other thresholds ![](images/spam5-1.pdf) \begin{center} \begin{tabular}{|c|ccccc|} \hline Threshold & 0.75 & 0.625 & 0.5 & 0.375 & 0.25 \\ \hline Sensitivity & 0.074 & & & & \\ Specificity & 0.997 & & & & \\ \hline \end{tabular} \end{center} ## Trying other thresholds ![](images/spam5-2.pdf) \begin{center} \begin{tabular}{|c|ccccc|} \hline Threshold & 0.75 & 0.625 & 0.5 & 0.375 & 0.25 \\ \hline Sensitivity & 0.074 & 0.106 & & & \\ Specificity & 0.997 & 0.995 & & & \\ \hline \end{tabular} \end{center} ## Trying other thresholds ![](images/spam5-3.pdf) \begin{center} \begin{tabular}{|c|ccccc|} \hline Threshold & 0.75 & 0.625 & 0.5 & 0.375 & 0.25 \\ \hline Sensitivity & 0.074 & 0.106 & 0.136 & & \\ Specificity & 0.997 & 0.995 & 0.995 & & \\ \hline \end{tabular} \end{center} ## Trying other thresholds ![](images/spam5-4.pdf) \begin{center} \begin{tabular}{|c|ccccc|} \hline Threshold & 0.75 & 0.625 & 0.5 & 0.375 & 0.25 \\ \hline Sensitivity & 0.074 & 0.106 & 0.136 & 0.305 & \\ Specificity & 0.997 & 0.995 & 0.995 & 0.963 & \\ \hline \end{tabular} \end{center} ## Trying other thresholds ![](images/spam5-5.pdf) \begin{center} \begin{tabular}{|c|ccccc|} \hline Threshold & 0.75 & 0.625 & 0.5 & 0.375 & 0.25 \\ \hline Sensitivity & 0.074 & 0.106 & 0.136 & 0.305 & 0.510 \\ Specificity & 0.997 & 0.995 & 0.995 & 0.963 & 0.936 \\ \hline \end{tabular} \end{center} ## Relationship between Sensitivity and Specificity \begin{center} \begin{tabular}{|c|ccccc|} \hline Threshold & 0.75 & 0.625 & 0.5 & 0.375 & 0.25 \\ \hline Sensitivity & 0.074 & 0.106 & 0.136 & 0.305 & 0.510 \\ Specificity & 0.997 & 0.995 & 0.995 & 0.963 & 0.936 \\ \hline \end{tabular} \end{center} \centering ![](images/sen_vs_spec1.pdf){width="65%"} ## Relationship between Sensitivity and Specificity \begin{center} \begin{tabular}{|c|ccccc|} \hline Threshold & 0.75 & 0.625 & 0.5 & 0.375 & 0.25 \\ \hline Sensitivity & 0.074 & 0.106 & 0.136 & 0.305 & 0.510 \\ Specificity & 0.997 & 0.995 & 0.995 & 0.963 & 0.936 \\ \hline \end{tabular} \end{center} \centering ![](images/sen_vs_spec2.pdf){width="65%"} ## Relationship between Sensitivity and Specificity \begin{center} \begin{tabular}{|c|ccccc|} \hline Threshold & 0.75 & 0.625 & 0.5 & 0.375 & 0.25 \\ \hline Sensitivity & 0.074 & 0.106 & 0.136 & 0.305 & 0.510 \\ Specificity & 0.997 & 0.995 & 0.995 & 0.963 & 0.936 \\ \hline \end{tabular} \end{center} \centering ![](images/sen_vs_spec3.pdf){width="65%"} ## Receiver operating characteristic (ROC) curve ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} par(mar=c(5, 4, 2, 2) + 0.1) g_full = glm(spam ~ ., data=email, family=binomial) pred = prediction(g_full$fitted.values, email$spam) perf = performance(pred,"tpr","fpr") plot(perf, lwd = 4, cex.lab = 1.5) abline(a=0,b=1,col="lightgrey", lwd = 4) ``` ## Receiver operating characteristic (ROC) curve Why do we care about ROC curves? \begin{itemize} \item Shows the trade off in sensitivity and specificity for all possible thresholds. \item Straight forward to compare performance vs. chance. \item Can use the area under the curve (AUC) as an assessment of the predictive ability of a model. \end{itemize} ## Refining the Spam model \tiny ```{r} g_refined = glm(spam ~ to_multiple+cc+image+attach+winner+password+line_breaks+format+re_subj+urgent_subj+exclaim_mess, data=email, family=binomial) summary(g_refined) ``` ## Comparing models ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} pred_full = prediction( g_full$fitted.values, email$spam) pred_refined = prediction( g_refined$fitted.values, email$spam) aucs = round(c(performance(pred_full,"auc")@y.values[[1]], performance(pred_refined,"auc")@y.values[[1]]),3) par(mar=c(5, 4, 2, 2) + 0.1, cex.lab = 1.5) plot(performance(pred_full,"tpr","fpr"), lwd = 4) plot(performance(pred_refined,"tpr","fpr"),add=TRUE, col=COL[1], lwd = 4) legend("bottomright",paste0(c("Full","Refined"), " (AUC: ",aucs,")"),col=c("black",COL[1]), lwd = 2, lty=1, cex = 1.5) abline(a=0,b=1,col="lightgrey", lwd = 4) ``` ## Utility Functions There are many other reasonable quantitative approaches we can use to decide on what is the "best" threshold. If you've taken an economics course you have probably heard of the idea of utility functions, we can assign costs and benefits to each of the possible outcomes and use those to calculate a utility for each circumstance. ## Utility function for our spam filter To write down a utility function for a spam filter we need to consider the costs / benefits of each out. \renewcommand\arraystretch{1.5} \begin{center} \begin{tabular}{l|c} Outcome & Utility \\ \hline True Positive & 1 \\ True Negative & \\ False Positive & \\ False Negative & \\ \end{tabular} \end{center} ## Utility function for our spam filter To write down a utility function for a spam filter we need to consider the costs / benefits of each out. \renewcommand\arraystretch{1.5} \begin{center} \begin{tabular}{l|c} Outcome & Utility \\ \hline True Positive & 1 \\ True Negative & 1 \\ False Positive & \\ False Negative & \\ \end{tabular} \end{center} ## Utility function for our spam filter To write down a utility function for a spam filter we need to consider the costs / benefits of each out. \renewcommand\arraystretch{1.5} \begin{center} \begin{tabular}{l|c} Outcome & Utility \\ \hline True Positive & 1 \\ True Negative & 1 \\ False Positive & -50 \\ False Negative & \\ \end{tabular} \end{center} ## Utility function for our spam filter To write down a utility function for a spam filter we need to consider the costs / benefits of each out. \renewcommand\arraystretch{1.5} \begin{center} \begin{tabular}{l|c} Outcome & Utility \\ \hline True Positive & 1 \\ True Negative & 1 \\ False Positive & -50 \\ False Negative & -5 \\ \end{tabular} \end{center} ## Utility function for our spam filter To write down a utility function for a spam filter we need to consider the costs / benefits of each out. \renewcommand\arraystretch{1.5} \begin{center} \begin{tabular}{l|c} Outcome & Utility \\ \hline True Positive & 1 \\ True Negative & 1 \\ False Positive & -50 \\ False Negative & -5 \\ \end{tabular} \end{center} \[U(p) = TP(p) + TN(p) - 50 \times FP(p) - 5 \times FN(p)\] ## Utility for the 0.75 threshold For the email data set picking a threshold of 0.75 gives us the following results: \begin{align*} FN = 340 &\qquad TP = 27 \\ TN = 3545 &\qquad FP = 9 \end{align*} \pause \begin{align*} U(p) &= TP(p) + TN(p) - 50 \times FP(p) - 5 \times FN(p) \\ &= 27+3545-50 \times 9-5 \times 340 = 1422 \end{align*} \pause Not useful by itself, but allows us to compare with other thresholds. ## Utility curve ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} U = c() p = c() for(thres in seq(0,1,0.01)) { if (thres %in% c(0,1)) next t = table(email$spam, g_full$fitted.values>thres) FN = t[2,1] FP = t[1,2] TP = t[2,2] TN = t[1,1] U = c(U, TP + TN - 50 * FP - 5 * FN) p = c(p,thres) } par(mar=c(5, 4, 2, 2) + 0.1) plot(U ~ p,type='l', lwd = 4, cex.lab = 1.5) points(0.75,1422,col=COL[4],cex = 2, lwd = 3) ``` ## Utility curve (zoom) ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} par(mar=c(5, 4, 2, 2) + 0.1) plot(U[p>0.6] ~ p[p>0.6],type='l', lwd = 4, cex.lab = 1.5) points(0.7535,1422,col=COL[4],cex = 2, lwd = 3) ``` ## Utility curve (zoom) ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} par(mar=c(5, 4, 2, 2) + 0.1) plot(U[p>0.6] ~ p[p>0.6],type='l', lwd = 4, cex.lab = 1.5) points(0.7535,1422,col=COL[4],cex = 2, lwd = 3) i=which.max(U) points(p[i],U[i],col=COL[1],pch=16,cex = 2, lwd = 3) ``` ## Maximum Utility ![](images/utility4.pdf)

Introduction to Probability & Statistics