<!DOCTYPE html>
<html>
<head>
<title>Linear Regression With R</title>
<meta charset="utf-8">
<meta name="Description" content="R Language Tutorials for Advanced Statistics">
<meta name="Keywords" content="R, Tutorial, Machine learning, Statistics, Data Mining, Analytics, Data science, Linear Regression, Logistic Regression, Time series, Forecasting">
<meta name="Distribution" content="Global">
<meta name="Author" content="Selva Prabhakaran">
<meta name="Robots" content="index, follow">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="/screenshots/iconb-64.png" type="image/x-icon" />
<link href="www/bootstrap.min.css" rel="stylesheet">
<link href="www/highlight.css" rel="stylesheet">
<link href='https://fonts.googleapis.com/css?family=Inconsolata:400,700'
rel='stylesheet' type='text/css'>
<!-- Color Script -->
<style type="text/css">
a {
color: #3F73D8;
}
li {
line-height: 1.65;
}
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
</style>
<!-- Add Google search -->
<script language="Javascript" type="text/javascript">
function my_search_google()
{
var query = document.getElementById("my-google-search").value;
window.open("http://google.com/search?q=" + query
+ "%20site:" + "http://r-statistics.co");
}
</script>
</head>
<body>
<div class="container">
<div class="masthead">
<!--
<ul class="nav nav-pills pull-right">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Table of contents<b class="caret"></b>
</a>
<ul class="dropdown-menu pull-right" role="menu">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</ul>
-->
<ul class="nav nav-pills pull-right">
<div class="input-group">
<form onsubmit="my_search_google()">
<input type="text" class="form-control" id="my-google-search" placeholder="Search..">
<form>
</div><!-- /input-group -->
</ul><!-- /.col-lg-6 -->
<h3 class="muted"><a href="/">r-statistics.co</a><small> by Selva Prabhakaran</small></h3>
<hr>
</div>
<div class="row">
<div class="col-xs-12 col-sm-3" id="nav">
<div class="well">
<li>
<ul class="list-unstyled">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</div>
<div class="well">
<p>Stay up-to-date. <a href="https://docs.google.com/forms/d/1xkMYkLNFU9U39Dd8S_2JC0p8B5t6_Yq6zUQjanQQJpY/viewform">Subscribe!</a></p>
<p><a href="https://docs.google.com/forms/d/13GrkCFcNa-TOIllQghsz2SIEbc-YqY9eJX02B19l5Ow/viewform">Chat!</a></p>
</div>
<h4>Contents</h4>
<ul class="list-unstyled" id="toc"></ul>
<!--
<hr>
<p><a href="/contribute.html">How to contribute</a></p>
<p><a class="btn btn-primary" href="">Edit this page</a></p>
-->
</div>
<div id="content" class="col-xs-12 col-sm-8 pull-right">
<h1>Linear Regression</h1>
<blockquote>
<p>Linear regression is used to predict the value of an outcome variable <span class="math inline"><em>Y</em></span> based on one or more input predictor variables <span class="math inline"><em>X</em></span>. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that, we can use this formula to estimate the value of the response <span class="math inline"><em>Y</em></span>, when only the predictors (<span class="math inline"><em>X</em><em>s</em></span>) values are known.</p>
</blockquote>
<h2>Introduction</h2>
<p>The aim of linear regression is to model a continuous variable <span class="math inline"><em>Y</em></span> as a mathematical function of one or more <span class="math inline"><em>X</em></span> variable(s), so that we can use this regression model to predict the <span class="math inline"><em>Y</em></span> when only the <span class="math inline"><em>X</em></span> is known. This mathematical equation can be generalized as follows:</p>
<p><br /><span class="math display"><em>Y</em> = <em>β</em><sub>1</sub> + <em>β</em><sub>2</sub><em>X</em> + <em>ϵ</em></span><br /></p>
<p>where, <span class="math inline"><em>β</em><sub>1</sub></span> is the intercept and <span class="math inline"><em>β</em><sub>2</sub></span> is the slope. Collectively, they are called <em>regression coefficients</em>. <span class="math inline"><em>ϵ</em></span> is the error term, the part of <span class="math inline"><em>Y</em></span> the regression model is unable to explain.</p>
<p><img src='screenshots/linear-regression-small.png' width='393' height='352' /></p>
<h2>Example Problem</h2>
<p>For this analysis, we will use the <em>cars</em> dataset that comes with R by default. <code>cars</code> is a standard built-in dataset that makes it convenient to demonstrate linear regression in a simple and easy-to-understand fashion. You can access this dataset simply by typing <code>cars</code> in your R console. You will find that it consists of 50 observations (rows) and 2 variables (columns) – <code>dist</code> and <code>speed</code>. Let's print out the first six observations here.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(cars) <span class="co"># display the first 6 observations</span>
<span class="co">#> speed dist</span>
<span class="co">#> 1 4 2</span>
<span class="co">#> 2 4 10</span>
<span class="co">#> 3 7 4</span>
<span class="co">#> 4 7 22</span>
<span class="co">#> 5 8 16</span>
<span class="co">#> 6 9 10</span></code></pre></div>
<p>Before we begin building the regression model, it is a good practice to analyze and understand the variables. The graphical analysis and correlation study below will help with this.</p>
<h2>Graphical Analysis</h2>
<p>The aim of this exercise is to build a simple regression model that we can use to predict Distance (dist) by establishing a statistically significant linear relationship with Speed (speed). But before jumping in to the syntax, let's try to understand these variables graphically. Typically, for each of the independent variables (predictors), the following plots are drawn to visualize their behavior:</p>
<ol style="list-style-type: decimal">
<li><strong>Scatter plot</strong>: Visualize the linear relationship between the predictor and response</li>
<li><strong>Box plot</strong>: To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions as they can easily affect the direction/slope of the line of best fit.</li>
<li><strong>Density plot</strong>: To see the distribution of the predictor variable. Ideally, a close to normal distribution (a bell shaped curve), without being skewed to the left or right is preferred. Let us see how to make each one of them.</li>
</ol>
<h3>Scatter Plot</h3>
<p>Scatter plots can help visualize any linear relationships between the dependent (response) variable and independent (predictor) variables. Ideally, if you have multiple predictor variables, a scatter plot is drawn for each one of them against the response, along with the line of best fit, as seen below.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">scatter.smooth</span>(<span class="dt">x=</span>cars$speed, <span class="dt">y=</span>cars$dist, <span class="dt">main=</span><span class="st">"Dist ~ Speed"</span>) <span class="co"># scatterplot</span></code></pre></div>
<p><img src='screenshots/dist-speed-scatterplot.png' width='528' height='371' /></p>
<p>The scatter plot along with the smoothing line above suggests a linearly increasing relationship between the ‘dist’ and ‘speed’ variables. This is a good thing, because one of the underlying assumptions of linear regression is that the relationship between the response and predictor variables is linear and additive.</p>
<h3>BoxPlot – Check for outliers</h3>
<p>Generally, any datapoint that lies more than 1.5 times the interquartile range (<span class="math inline">1.5 * <em>I</em><em>Q</em><em>R</em></span>) below the 25th percentile or above the 75th percentile is considered an outlier, where the IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">par</span>(<span class="dt">mfrow=</span><span class="kw">c</span>(<span class="dv">1</span>, <span class="dv">2</span>)) <span class="co"># divide graph area in 2 columns</span>
<span class="kw">boxplot</span>(cars$speed, <span class="dt">main=</span><span class="st">"Speed"</span>, <span class="dt">sub=</span><span class="kw">paste</span>(<span class="st">"Outlier rows: "</span>, <span class="kw">boxplot.stats</span>(cars$speed)$out)) <span class="co"># box plot for 'speed'</span>
<span class="kw">boxplot</span>(cars$dist, <span class="dt">main=</span><span class="st">"Distance"</span>, <span class="dt">sub=</span><span class="kw">paste</span>(<span class="st">"Outlier rows: "</span>, <span class="kw">boxplot.stats</span>(cars$dist)$out)) <span class="co"># box plot for 'distance'</span></code></pre></div>
<p><img src='screenshots/boxplot.png' width='528' height='289' /></p>
<h3>Density plot – Check if the response variable is close to normality</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(e1071)
<span class="kw">par</span>(<span class="dt">mfrow=</span><span class="kw">c</span>(<span class="dv">1</span>, <span class="dv">2</span>)) <span class="co"># divide graph area in 2 columns</span>
<span class="kw">plot</span>(<span class="kw">density</span>(cars$speed), <span class="dt">main=</span><span class="st">"Density Plot: Speed"</span>, <span class="dt">ylab=</span><span class="st">"Frequency"</span>, <span class="dt">sub=</span><span class="kw">paste</span>(<span class="st">"Skewness:"</span>, <span class="kw">round</span>(e1071::<span class="kw">skewness</span>(cars$speed), <span class="dv">2</span>))) <span class="co"># density plot for 'speed'</span>
<span class="kw">polygon</span>(<span class="kw">density</span>(cars$speed), <span class="dt">col=</span><span class="st">"red"</span>)
<span class="kw">plot</span>(<span class="kw">density</span>(cars$dist), <span class="dt">main=</span><span class="st">"Density Plot: Distance"</span>, <span class="dt">ylab=</span><span class="st">"Frequency"</span>, <span class="dt">sub=</span><span class="kw">paste</span>(<span class="st">"Skewness:"</span>, <span class="kw">round</span>(e1071::<span class="kw">skewness</span>(cars$dist), <span class="dv">2</span>))) <span class="co"># density plot for 'dist'</span>
<span class="kw">polygon</span>(<span class="kw">density</span>(cars$dist), <span class="dt">col=</span><span class="st">"red"</span>)</code></pre></div>
<p><img src='screenshots/density-plot.png' width='528' height='289' /></p>
<h2>Correlation</h2>
<p>Correlation is a statistical measure that suggests the level of linear dependence between two variables that occur in a pair – just like what we have here in speed and dist. Correlation can take values between -1 and +1. If, for every instance where speed increases, the distance also increases along with it, then there is a high positive correlation between them and the correlation will be closer to 1. The opposite is true for an inverse relationship, in which case the correlation between the variables will be close to -1.</p>
<p>A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation in the response variable (<span class="math inline"><em>Y</em></span>) is unexplained by the predictor (<span class="math inline"><em>X</em></span>), in which case we should probably look for better explanatory variables.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cor</span>(cars$speed, cars$dist) <span class="co"># calculate correlation between speed and distance </span>
<span class="co">#> [1] 0.8068949</span></code></pre></div>
<h2>Build Linear Model</h2>
<p>Now that we have seen the linear relationship pictorially in the scatter plot and by computing the correlation, let's see the syntax for building the linear model. The function used for building linear models is <code>lm()</code>. The <code>lm()</code> function takes in two main arguments: 1. Formula 2. Data. The data is typically a data.frame and the formula is an object of class <code>formula</code>. But the most common convention is to write out the formula directly in place of the argument, as written below.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">linearMod <-<span class="st"> </span><span class="kw">lm</span>(dist ~<span class="st"> </span>speed, <span class="dt">data=</span>cars) <span class="co"># build linear regression model on full data</span>
<span class="kw">print</span>(linearMod)
<span class="co">#> Call:</span>
<span class="co">#> lm(formula = dist ~ speed, data = cars)</span>
<span class="co">#> </span>
<span class="co">#> Coefficients:</span>
<span class="co">#> (Intercept) speed </span>
<span class="co">#> -17.579 3.932</span></code></pre></div>
<p>Now that we have built the linear model, we have also established the relationship between the predictor and response in the form of a mathematical formula for Distance (dist) as a function of speed. In the above output, you can notice that the ‘Coefficients’ part has two components: <em>Intercept</em>: -17.579, <em>speed</em>: 3.932. These are also called the beta coefficients. In other words, <strong><br /><span class="math display"><em>d</em><em>i</em><em>s</em><em>t</em> = <em>I</em><em>n</em><em>t</em><em>e</em><em>r</em><em>c</em><em>e</em><em>p</em><em>t</em> + (<em>β</em> ∗ <em>s</em><em>p</em><em>e</em><em>e</em><em>d</em>)</span><br /></strong> => dist = −17.579 + 3.932∗speed</p>
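<p>As a quick check of this formula, the fitted coefficients can be extracted with <code>coef()</code> and used to compute a prediction by hand; the sketch below (using an illustrative speed of 10) should agree with <code>predict()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">cf <- coef(linearMod)  # named vector with (Intercept) and speed
cf[["(Intercept)"]] + cf[["speed"]] * 10  # dist = -17.579 + 3.932*10, approx 21.74
predict(linearMod, newdata=data.frame(speed=10))  # same value via predict()</code></pre></div>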
<h2>Linear Regression Diagnostics</h2>
<p>Now the linear model is built and we have a formula that we can use to predict the dist value if a corresponding speed is known. Is this enough to actually use this model? NO! Before using a regression model, you have to ensure that it is statistically significant. How do you ensure this? Let's begin by printing the summary statistics for linearMod.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(linearMod) <span class="co"># model summary</span>
<span class="co">#> Call:</span>
<span class="co">#> lm(formula = dist ~ speed, data = cars)</span>
<span class="co">#> </span>
<span class="co">#> Residuals:</span>
<span class="co">#> Min 1Q Median 3Q Max </span>
<span class="co">#> -29.069 -9.525 -2.272 9.215 43.201 </span>
<span class="co">#> </span>
<span class="co">#> Coefficients:</span>
<span class="co">#> Estimate Std. Error t value Pr(>|t|) </span>
<span class="co">#> (Intercept) -17.5791 6.7584 -2.601 0.0123 * </span>
<span class="co">#> speed 3.9324 0.4155 9.464 1.49e-12 ***</span>
<span class="co">#> ---</span>
<span class="co">#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co">#> </span>
<span class="co">#> Residual standard error: 15.38 on 48 degrees of freedom</span>
<span class="co">#> Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438 </span>
<span class="co">#> F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12</span></code></pre></div>
<h2>The p Value: Checking for statistical significance</h2>
<p>The summary statistics above tell us a number of things. One of them is the model p-Value (in the last line) and the p-Values of the individual predictor variables (extreme right column under ‘Coefficients’). The p-Values are very important because we can consider a linear model to be statistically significant only when both these p-Values are less than the pre-determined statistical significance level, which is conventionally 0.05. This is visually interpreted by the significance stars at the end of the row. The more stars beside a variable’s p-Value, the more significant the variable.</p>
<h4>Null and alternate hypothesis</h4>
<p>When there is a p-value, there is a null and alternative hypothesis associated with it. In linear regression, the null hypothesis is that the coefficient associated with a variable is equal to zero. The alternative hypothesis is that the coefficient is not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).</p>
<h4>t-value</h4>
<p>We can interpret the t-value something like this: a larger <em>t-value</em> indicates that it is less likely that the observed coefficient could have arisen purely by chance if the true coefficient were zero. So, the higher the t-value, the better.</p>
<p><em>Pr(>|t|)</em> or <em>p-value</em> is the probability that you get a t-value as high or higher than the observed value when the Null Hypothesis (the <span class="math inline"><em>β</em></span> coefficient is equal to zero or that there is no relationship) is true. So if the <em>Pr(>|t|)</em> is low, the coefficients are significant (significantly different from zero). If the <em>Pr(>|t|)</em> is high, the coefficients are not significant.</p>
<p>What does this mean for us? When the p-Value is less than the significance level (< 0.05), we can safely reject the null hypothesis that the coefficient <em>β</em> of the predictor is zero. In our case, <code>linearMod</code>, both these p-Values are well below the 0.05 threshold, so we can conclude our model is indeed statistically significant.</p>
<p>It is absolutely important for the model to be statistically significant before we can go ahead and use it to predict (or estimate) the dependent variable; otherwise, the confidence in the predicted values from that model is reduced and they may be construed as an event of chance.</p>
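<p>The coefficient p-Values seen in the summary can also be pulled out programmatically instead of reading them off the printout; a minimal sketch:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">summary(linearMod)$coefficients[, "Pr(>|t|)"]  # p-Values for (Intercept) and speed</code></pre></div>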
<h2>How to calculate the t Statistic and p-Values?</h2>
<p>When the model coefficients and standard error are known, the formula for calculating the t Statistic and p-Value is as follows: <br /><span class="math display">$$t\text{-}Statistic = {\beta\text{-}coefficient \over Std. Error}$$</span><br /></p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">modelSummary <-<span class="st"> </span><span class="kw">summary</span>(linearMod) <span class="co"># capture model summary as an object</span>
modelCoeffs <-<span class="st"> </span>modelSummary$coefficients <span class="co"># model coefficients</span>
beta.estimate <-<span class="st"> </span>modelCoeffs[<span class="st">"speed"</span>, <span class="st">"Estimate"</span>] <span class="co"># get beta estimate for speed</span>
std.error <-<span class="st"> </span>modelCoeffs[<span class="st">"speed"</span>, <span class="st">"Std. Error"</span>] <span class="co"># get std.error for speed</span>
t_value <-<span class="st"> </span>beta.estimate/std.error <span class="co"># calc t statistic</span>
p_value <-<span class="st"> </span><span class="dv">2</span>*<span class="kw">pt</span>(-<span class="kw">abs</span>(t_value), <span class="dt">df=</span><span class="kw">nrow</span>(cars)-<span class="kw">ncol</span>(cars)) <span class="co"># calc p Value</span>
f_statistic <-<span class="st"> </span><span class="kw">summary</span>(linearMod)$fstatistic[<span class="dv">1</span>] <span class="co"># fstatistic</span>
f <-<span class="st"> </span><span class="kw">summary</span>(linearMod)$fstatistic <span class="co"># parameters for model p-value calc</span>
model_p <-<span class="st"> </span><span class="kw">pf</span>(f[<span class="dv">1</span>], f[<span class="dv">2</span>], f[<span class="dv">3</span>], <span class="dt">lower=</span><span class="ot">FALSE</span>)</code></pre></div>
<pre><code>## t Value: 9.46399</code></pre>
<pre><code>## p Value: 1.489836e-12</code></pre>
<pre><code>## Model F Statistic: 89.56711 1 48</code></pre>
<pre><code>## Model p-Value: 1.489836e-12</code></pre>
<h2>R-Squared and Adj R-Squared</h2>
<p>The actual information in a dataset is the total variation it contains, remember? What R-Squared tells us is the proportion of variation in the dependent (response) variable that has been explained by this model.</p>
<p><br /><span class="math display">$$ R^{2} = 1 - \frac{SSE}{SST}$$</span><br /></p>
<p>where, <span class="math inline"><em>S</em><em>S</em><em>E</em></span> is the <em>sum of squared errors</em> given by <span class="math inline">$SSE = \sum_{i}^{n} \left( y_{i} - \hat{y_{i}} \right) ^{2}$</span> and <span class="math inline">$SST = \sum_{i}^{n} \left( y_{i} - \bar{y} \right) ^{2}$</span> is the <em>total sum of squares</em>. Here, <span class="math inline">$\hat{y_{i}}$</span> is the fitted value for observation <span class="math inline"><em>i</em></span> and <span class="math inline">$\bar{y}$</span> is the mean of <span class="math inline"><em>Y</em></span>.</p>
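<p>We can verify this on <code>linearMod</code> by computing SSE from the residuals and SST from the response; the result should match the Multiple R-squared of 0.6511 in the model summary.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">SSE <- sum(residuals(linearMod)^2)  # sum of squared errors
SST <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
1 - SSE/SST  # R-squared, same as summary(linearMod)$r.squared</code></pre></div>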
<p>We don’t necessarily discard a model based on a low R-Squared value. It’s a better practice to look at the AIC and the prediction accuracy on a validation sample when deciding on the efficacy of a model.</p>
<p><strong>Now that's about R-Squared. What about adjusted R-Squared?</strong> As you add more <span class="math inline"><em>X</em></span> variables to your model, the R-Squared value of the new bigger model will always be greater than that of the smaller subset. This is because all the variables in the original model are also present in the super-set, so their contribution to explaining the dependent variable is retained, and whatever new variable we add can only add (if not significantly) to the variation that was already explained. This is where the adjusted R-Squared value comes to help. Adjusted R-Squared penalizes the R-Squared value for the number of terms (read: predictors) in your model. Therefore, when comparing nested models, it is a good practice to look at the adjusted R-Squared value over R-Squared.</p>
<p><br /><span class="math display">$$ R^{2}_{adj} = 1 - \frac{MSE}{MST}$$</span><br /></p>
<p>where, <span class="math inline"><em>M</em><em>S</em><em>E</em></span> is the <em>mean squared error</em> given by <span class="math inline">$MSE = \frac{SSE}{\left( n-q \right)}$</span> and <span class="math inline">$MST = \frac{SST}{\left( n-1 \right)}$</span> is the <em>mean squared total</em>, where <span class="math inline"><em>n</em></span> is the number of observations and <span class="math inline"><em>q</em></span> is the number of coefficients in the model.</p>
<p>Therefore, by moving around the numerators and denominators, the relationship between <span class="math inline"><em>R</em><sup>2</sup></span> and <span class="math inline"><em>R</em><sub><em>a</em><em>d</em><em>j</em></sub><sup>2</sup></span> becomes:</p>
<p><br /><span class="math display">$$R^{2}_{adj} = 1 - \left( \frac{\left( 1 - R^{2}\right) \left(n-1\right)}{n-q}\right)$$</span><br /></p>
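<p>We can verify this relationship numerically for <code>linearMod</code>, where n = 50 and q = 2 (intercept and slope): plugging the R-Squared of 0.6511 into the formula should give the Adjusted R-Squared of 0.6438 reported by the summary.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">r2 <- summary(linearMod)$r.squared  # 0.6511 for this model
n <- nrow(cars)  # number of observations
q <- length(coef(linearMod))  # number of coefficients
1 - ((1 - r2) * (n - 1)) / (n - q)  # matches summary(linearMod)$adj.r.squared</code></pre></div>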
<h2>Standard Error and F-Statistic</h2>
<p>Both the standard error and the F-statistic are measures of goodness of fit.</p>
<p><br /><span class="math display">$$Std. Error = \sqrt{MSE} = \sqrt{\frac{SSE}{n-q}}$$</span><br /></p>
<p><br /><span class="math display">$$F-statistic = \frac{MSR}{MSE}$$</span><br /></p>
<p>where, <span class="math inline"><em>n</em></span> is the number of observations, <span class="math inline"><em>q</em></span> is the number of coefficients and <span class="math inline"><em>M</em><em>S</em><em>R</em></span> is the <em>mean square regression</em>, calculated as,</p>
<p><br /><span class="math display">$$MSR=\frac{\sum_{i}^{n}\left( \hat{y_{i}} - \bar{y}\right)^{2}}{q-1} = \frac{SST - SSE}{q - 1}$$</span><br /></p>
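<p>These formulas can be checked against <code>linearMod</code>: recomputing MSE and MSR from the residuals should reproduce the residual standard error (15.38) and the F-statistic (89.57) reported in the model summary.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">n <- nrow(cars)  # 50 observations
q <- length(coef(linearMod))  # 2 coefficients
SSE <- sum(residuals(linearMod)^2)  # sum of squared errors
SST <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
sqrt(SSE / (n - q))  # Std. Error = sqrt(MSE), approx 15.38
((SST - SSE) / (q - 1)) / (SSE / (n - q))  # F-statistic = MSR/MSE, approx 89.57</code></pre></div>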
<h2>AIC and BIC</h2>
<p>Akaike’s information criterion - AIC (Akaike, 1974) - and the Bayesian information criterion - BIC (Schwarz, 1978) - are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.</p>
<p>The AIC is defined as:</p>
<p><br /><span class="math display"><em>A</em><em>I</em><em>C</em> = (−2) × <em>l</em><em>n</em>(<em>L</em>) + (2×<em>k</em>)</span><br /></p>
<p>where, k is the number of model parameters and the BIC is defined as:</p>
<p><br /><span class="math display"><em>B</em><em>I</em><em>C</em> = (−2) × <em>l</em><em>n</em>(<em>L</em>) + <em>k</em> × <em>l</em><em>n</em>(<em>n</em>)</span><br /></p>
<p>where, n is the sample size.</p>
<p>For model comparison, the model with the lowest AIC and BIC score is preferred.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">AIC</span>(linearMod) <span class="co"># AIC => 419.1569</span>
<span class="kw">BIC</span>(linearMod) <span class="co"># BIC => 424.8929</span></code></pre></div>
<h2>How to know if the model is best fit for your data?</h2>
<p>The most common metrics to look at while selecting the model are:</p>
<table>
<colgroup>
<col width="57%" />
<col width="42%" />
</colgroup>
<thead>
<tr class="header">
<th align="left">STATISTIC</th>
<th align="left">CRITERION</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">R-Squared</td>
<td align="left">Higher the better <em>(> 0.70)</em></td>
</tr>
<tr class="even">
<td align="left">Adj R-Squared</td>
<td align="left">Higher the better</td>
</tr>
<tr class="odd">
<td align="left">F-Statistic</td>
<td align="left">Higher the better</td>
</tr>
<tr class="even">
<td align="left">Std. Error</td>
<td align="left">Closer to zero the better</td>
</tr>
<tr class="odd">
<td align="left">t-statistic</td>
<td align="left">Should be greater 1.96 for p-value to be less than 0.05</td>
</tr>
<tr class="even">
<td align="left">AIC</td>
<td align="left">Lower the better</td>
</tr>
<tr class="odd">
<td align="left">BIC</td>
<td align="left">Lower the better</td>
</tr>
<tr class="even">
<td align="left">Mallows' Cp</td>
<td align="left">Should be close to the number of predictors in model</td>
</tr>
<tr class="odd">
<td align="left">MAPE (Mean absolute percentage error)</td>
<td align="left">Lower the better</td>
</tr>
<tr class="even">
<td align="left">MSE (Mean squared error)</td>
<td align="left">Lower the better</td>
</tr>
<tr class="odd">
<td align="left">Min_Max Accuracy => mean(min(actual, predicted)/max(actual, predicted))</td>
<td align="left">Higher the better</td>
</tr>
</tbody>
</table>
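<p>Most of the statistics in the table above can be read directly off a fitted <code>lm</code> object. A quick sketch, using the same <code>cars</code> data:</p>

```r
# Extract the common model-quality statistics from a fitted model.
mod <- lm(dist ~ speed, data = cars)
s   <- summary(mod)

s$r.squared                      # R-Squared: higher the better
s$adj.r.squared                  # Adj R-Squared: higher the better
s$fstatistic[["value"]]          # F-statistic: higher the better
s$coefficients[, "Std. Error"]   # std. errors: closer to zero the better
s$coefficients[, "t value"]      # t-statistics: |t| > 1.96 for p < 0.05
AIC(mod)                         # lower the better
BIC(mod)                         # lower the better
```

MAPE, MSE, and min-max accuracy are computed from predictions on held-out data, as shown in the prediction section that follows.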
<h2>Predicting Linear Models</h2>
<p>So far we have seen how to build a linear regression model using the whole dataset. If we build it that way, there is no way to tell how the model will perform with new data. So the preferred practice is to split your dataset into an 80:20 sample (training:test), build the model on the 80% sample, and then use that model to predict the dependent variable on the test data.</p>
<p>Doing it this way, we will have the model's predicted values for the 20% (test) data as well as the actuals (from the original dataset). By calculating accuracy measures (like min_max accuracy) and error rates (MAPE or MSE), we can find out the prediction accuracy of the model. Now, let's see how to actually do this.</p>
<h4>Step 1: Create the training (development) and test (validation) data samples from original data.</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Create Training and Test data -</span>
<span class="kw">set.seed</span>(<span class="dv">100</span>) <span class="co"># setting seed to reproduce results of random sampling</span>
trainingRowIndex <-<span class="st"> </span><span class="kw">sample</span>(<span class="dv">1</span>:<span class="kw">nrow</span>(cars), <span class="fl">0.8</span>*<span class="kw">nrow</span>(cars)) <span class="co"># row indices for training data</span>
trainingData <-<span class="st"> </span>cars[trainingRowIndex, ] <span class="co"># model training data</span>
testData <-<span class="st"> </span>cars[-trainingRowIndex, ] <span class="co"># test data</span></code></pre></div>
<h4>Step 2: Develop the model on the training data and use it to predict the distance on test data</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Build the model on training data -</span>
lmMod <-<span class="st"> </span><span class="kw">lm</span>(dist ~<span class="st"> </span>speed, <span class="dt">data=</span>trainingData) <span class="co"># build the model</span>
distPred <-<span class="st"> </span><span class="kw">predict</span>(lmMod, testData) <span class="co"># predict distance</span></code></pre></div>
<h4>Step 3: Review diagnostic measures.</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span> (lmMod) <span class="co"># model summary</span>
<span class="co">#> </span>
<span class="co">#> Call:</span>
<span class="co">#> lm(formula = dist ~ speed, data = trainingData)</span>
<span class="co">#> </span>
<span class="co">#> Residuals:</span>
<span class="co">#> Min 1Q Median 3Q Max </span>
<span class="co">#> -23.350 -10.771 -2.137 9.255 42.231 </span>
<span class="co">#> </span>
<span class="co">#> Coefficients:</span>
<span class="co">#> Estimate Std. Error t value Pr(>|t|) </span>
<span class="co">#> (Intercept) -22.657 7.999 -2.833 0.00735 ** </span>
<span class="co">#> speed 4.316 0.487 8.863 8.73e-11 ***</span>
<span class="co">#> ---</span>
<span class="co">#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co">#> </span>
<span class="co">#> Residual standard error: 15.84 on 38 degrees of freedom</span>
<span class="co">#> Multiple R-squared: 0.674, Adjusted R-squared: 0.6654 </span>
<span class="co">#> F-statistic: 78.56 on 1 and 38 DF, p-value: 8.734e-11</span>
<span class="kw">AIC</span> (lmMod) <span class="co"># Calculate akaike information criterion</span>
<span class="co">#> [1] 338.4489</span></code></pre></div>
<p>From the model summary, the model p-value and the predictor's p-value are less than the significance level, so we have a statistically significant model. Also, the R-Sq and Adj R-Sq are comparable to those of the original model built on the full data.</p>
<h4>Step 4: Calculate prediction accuracy and error rates</h4>
<p>A simple correlation between the actuals and predicted values can be used as a form of accuracy measure. A higher correlation accuracy implies that the actuals and predicted values have similar directional movement, i.e. when the actual values increase, the predicted values also increase, and vice versa.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">actuals_preds <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="kw">cbind</span>(<span class="dt">actuals=</span>testData$dist, <span class="dt">predicteds=</span>distPred)) <span class="co"># make actuals_predicteds dataframe.</span>
correlation_accuracy <-<span class="st"> </span><span class="kw">cor</span>(actuals_preds) <span class="co"># 82.7%</span>
<span class="kw">head</span>(actuals_preds)
<span class="co">#> actuals predicteds</span>
<span class="co">#> 1 2 -5.392776</span>
<span class="co">#> 4 22 7.555787</span>
<span class="co">#> 8 26 20.504349</span>
<span class="co">#> 20 26 37.769100</span>
<span class="co">#> 26 54 42.085287</span>
<span class="co">#> 31 50 50.717663</span></code></pre></div>
<p>Now let's calculate the Min Max accuracy and MAPE: <strong><br /><span class="math display">$$MinMaxAccuracy = mean \left( \frac{min\left(actuals, predicteds\right)}{max\left(actuals, predicteds \right)} \right)$$</span><br /></strong></p>
<p><strong><br /><span class="math display">$$MeanAbsolutePercentageError \ (MAPE) = mean\left( \frac{abs\left(predicteds−actuals\right)}{actuals}\right)$$</span><br /></strong></p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">min_max_accuracy <-<span class="st"> </span><span class="kw">mean</span>(<span class="kw">apply</span>(actuals_preds, <span class="dv">1</span>, min) /<span class="st"> </span><span class="kw">apply</span>(actuals_preds, <span class="dv">1</span>, max))
<span class="co"># => 58.42%, min_max accuracy</span>
mape <-<span class="st"> </span><span class="kw">mean</span>(<span class="kw">abs</span>((actuals_preds$predicteds -<span class="st"> </span>actuals_preds$actuals))/actuals_preds$actuals)
<span class="co"># => 48.38%, mean absolute percentage deviation</span></code></pre></div>
<h2>k- Fold Cross validation</h2>
<p>Suppose the model predicts satisfactorily on the 20% split (test data). Is that enough to believe that your model will perform equally well all the time? It is important to rigorously test the model's performance as much as possible. One way is to ensure that the model equation you have will perform well when it is 'built' on a different subset of training data and used to predict on the remaining data.</p>
<p>How do you do this? Split your data into 'k' mutually exclusive random sample portions. Keeping each portion as test data, we build the model on the remaining (k-1) portions and calculate the mean squared error of the predictions. This is done for each of the 'k' portions. Finally, the average of these mean squared errors (over the 'k' portions) is computed. We can use this metric to compare different linear models.</p>
<p>By doing this, we need to check two things:</p>
<ol style="list-style-type: decimal">
<li>If the model’s prediction accuracy isn’t varying too much for any one particular sample, and</li>
<li>If the lines of best fit don’t vary too much with respect to the slope and level.</li>
</ol>
<p>In other words, they should be parallel and as close to each other as possible. You can find a more detailed explanation for interpreting the cross validation charts when you learn about advanced linear model building.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(DAAG)
cvResults <-<span class="st"> </span><span class="kw">suppressWarnings</span>(<span class="kw">CVlm</span>(<span class="dt">df=</span>cars, <span class="dt">form.lm=</span>dist ~<span class="st"> </span>speed, <span class="dt">m=</span><span class="dv">5</span>, <span class="dt">dots=</span><span class="ot">FALSE</span>, <span class="dt">seed=</span><span class="dv">29</span>, <span class="dt">legend.pos=</span><span class="st">"topleft"</span>, <span class="dt">printit=</span><span class="ot">FALSE</span>, <span class="dt">main=</span><span class="st">"Small symbols are predicted values while bigger ones are actuals."</span>)); <span class="co"># performs the CV</span>
<span class="kw">attr</span>(cvResults, <span class="st">'ms'</span>) <span class="co"># => 251.2783 mean squared error</span></code></pre></div>
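<p>If you would rather not depend on the <code>DAAG</code> package, the k-fold procedure described above can be sketched in base R. This is a simplified stand-in for what <code>CVlm</code> does (fold assignment differs, so the resulting MSE will not match exactly), assuming the built-in <code>cars</code> data:</p>

```r
# Manual 5-fold cross validation of lm(dist ~ speed) on 'cars'.
set.seed(29)                                        # reproducible folds
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars)))  # random fold label per row

fold_mse <- sapply(1:k, function(i) {
  fit  <- lm(dist ~ speed, data = cars[folds != i, ])   # build on k-1 portions
  pred <- predict(fit, newdata = cars[folds == i, ])    # predict held-out portion
  mean((cars$dist[folds == i] - pred)^2)                # fold mean squared error
})

mean(fold_mse)  # cross-validated MSE, comparable across candidate models
```

The per-fold MSEs also let you check the first condition above: if one fold's error is far larger than the rest, the model's accuracy is sensitive to which sample it sees.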
<p>In the plot below, are the dashed lines parallel? And are the small and big symbols not over-dispersed for any one particular color?</p>
<p><img src='screenshots/cv-plot.png' width='528' height='528' /></p>
<h3>Where to go from here?</h3>
<p>We have covered the basic concepts of linear regression. Besides these, you need to understand that linear regression is based on certain underlying <a href="Assumptions-of-Linear-Regression.html">assumptions</a> that must be taken care of, especially when working with multiple <span class="math inline"><em>X</em><em>s</em></span>. Once you are familiar with those, the <a href="adv-regression-models.html">advanced regression models</a> will show you around the various special cases where a different form of regression would be more suitable.</p>
</div>
</div>
<div class="footer">
<hr>
<p>© 2016-17 Selva Prabhakaran. Powered by <a href="http://jekyllrb.com/">jekyll</a>,
<a href="http://yihui.name/knitr/">knitr</a>, and
<a href="http://johnmacfarlane.net/pandoc/">pandoc</a>.
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">Creative Commons License.</a>
</p>
</div>
</div> <!-- /container -->
<script src="//code.jquery.com/jquery.js"></script>
<script src="www/bootstrap.min.js"></script>
<script src="www/toc.js"></script>
<!-- MathJax Script -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- Google Analytics Code -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-69351797-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
body {
font-family: 'Helvetica Neue', Roboto, Arial, sans-serif;
font-size: 16px;
line-height: 27px;
font-weight: 400;
}
blockquote p {
line-height: 1.75;
color: #717171;
}
.well li{
line-height: 28px;
}
li.dropdown-header {
display: block;
padding: 0px;
font-size: 14px;
}
</style>
</body>
</html>