Linear Models

Stats

linear models

Published

May 3, 2025

Welcome to the first tutorial on data analysis!!! Today we are going to talk about one of the most flexible statistical methods: Linear models.

Let’s be clear, WE ARE NOT STATISTICIANS!!!!

We’ll be discussing linear models in a very accessible and practical manner. Our explanations might not align with the rigorous definitions statisticians are accustomed to, and for that, we apologize in advance! However, our aim is to provide a stepping stone for you to grasp the concept of linear models and similar analyses. Let’s get started!

Note

and only R

In the tutorials regarding stats, we will use R (and R-Studio as the IDE). The concepts and procedures are the same in other programming languages (e.g., Python, Julia). However, we choose R for a simple reason. R is just better at most statistical things!!

Deal with it!

What is a linear model?

A linear model is a simple statistical test that tries to find the best line that represent the relations between two variables ( or more).

What’s truly fascinating about linear models is their versatility. They start off incredibly simple, but their complexity can grow exponentially! This makes them a remarkably flexible tool in the world of data analysis.

Think of linear models as the foundation of a house.

You can start with a basic structure, but with some clever modifications (like mixed effect models, generalized linear models, quadratic models or additive models), you can build anything from a cozy cottage to a multi-story mansion.

In essence, linear models offer a perfect balance: they’re accessible enough for beginners to grasp, yet powerful enough to satisfy the needs of advanced researchers. As we dive deeper into this topic, you’ll see just how these seemingly simple tools can unlock complex insights in your data.

Hands-on

OK, enough chitchat - let’s start with a practical example. We’ll be working with a data set we created specifically for this tutorial. As mentioned, we’re going to begin with a very basic model, and in the upcoming tutorials, we’ll gradually increase both the complexity and accuracy of our approach.

So, if you notice something that doesn’t seem quite perfect at this stage, don’t worry! It’s all part of the plan. Our goal is to guide you step-by-step towards building the best model. Just remember, this process takes time!

Import data

You can download the data that we will use in this tutorial from here:

Once downloaded we need to import it in our R session. Here we read our csv and we print a small preview of it.

df = read.csv("../../resources/Stats/Dataset.csv")
head(df)

  Id    Event TrialN ReactionTime LookingTime SES
1  1 NoReward      1     344.1243   1139.8102 low
2  1 NoReward      2     361.9389   1156.4140 low
3  1 NoReward      3     406.3499   1073.5468 low
4  1 NoReward      4     395.2774   1106.5766 low
5  1 NoReward      5     382.1649    990.5572 low
6  1 NoReward      6     367.3000   1053.6605 low

You can see that the data is really simple! We have 4 columns:

Id column that tell us form which participant the data was collected
Event represent which condition we were in
TrialN the sequential number of trials in the experiment
ReactionTime how quickly participants reacted to the stimuli. We will ignore this variable for now
LookingTimethe variable of interest for today, the one that we want to model
SES represents socioeconomic status collected via questionnaire (high, medium, or low) for each participant - we’ll also ignore this for now

Long format

One important information that we need to keep in mind is that to run lm() (and most models!!) in R we need the data in a long format and not a wide format.

In long format, each row represents a single observation. Variables are organized in columns, with one column for the variable names and another for the values. This means that the column you want to model (in the example LookingTime) has 1 row for observation but the other columns usually have repeated entries ( e.g. Id , TrialN, Event)

Wide format, on the other hand, has each row representing a subject or group, with multiple columns for different variables or time-points. While this can be visually appealing for humans, it’s not optimal for our linear modeling needs.

Here a small comparison of the two formats:

Long Format

# A tibble: 8 × 4
  Id    Event    TrialN LookingTime
  <chr> <chr>     <int>       <dbl>
1 1     Reward        1        1450
2 1     Reward        2        1420
3 1     Reward        3        1390
4 1     NoReward      1        1380
5 1     NoReward      2        1350
6 1     NoReward      3        1320
7 2     Reward        1        1480
8 2     Reward        2        1460

Wide Format

# A tibble: 3 × 7
  Id    Reward_Trial1 Reward_Trial2 Reward_Trial3 NoReward_Trial1
  <chr>         <dbl>         <dbl>         <dbl>           <dbl>
1 1              1450          1420          1390            1380
2 2              1480          1460          1410            1390
3 3              1520          1490          1450            1410
# ℹ 2 more variables: NoReward_Trial2 <dbl>, NoReward_Trial3 <dbl>

If your data is currently in wide format, don’t worry! R provides tools like the tidyr package with functions such as pivot_longer() to easily convert your data from wide to long format.

Formula

To run models in R we usually use formulas! Sounds complex doesn’t it?!? Well it is not, let me guide you through it.

In R, model formulas follow a specific structure. On the left side of the formula, we place our dependent variable - the outcome we collected and we’re interested in studying. In this case, it’s the LookingTime column. Next, we use the tilde symbol ~. This tilde tells R that we want to predict the variable on the left using the variables on the right side of the formula. On the right side, we list the independent variables (predictors) we believe may influence our dependent variable. To test whether TrialN predicts LookingTime, we can use the formula:

LookingTime ~ TrialN. This basic structure allows us to examine a single predictor.

We can extend this model by adding another variable, such as Event, to see if it also predicts LookingTime:

LookingTime ~ TrialN + Event. This formulation tells the model to assess whether either TrialN and Event predicts LookingTime, treating them as independent predictors.

To examine the interaction between these variables, we use a slightly different syntax:

LookingTime ~ TrialN : Event. This instructs the model to evaluate whether the interaction between the two variables predicts LookingTime.

It’s important to note that using : only tests the interaction, not the individual effects of each variable. To include both main effects and their interaction, we can use the formula:

LookingTime ~ TrialN + Event + TrialN:Event.

R offers a shorthand for this complete model using the * operator. The formula:

LookingTime ~ TrialN * Event is equivalent to the longer version above, testing both main effects and the interaction in a more concise format.

These formulas are for simple linear models. Different types of models add small and different pieces to this basic structure. We will see in the next tutorial how to handle these “add-ons”. Now that we have seen how to make a proper formula let’s use it in our model!!

Run the model

OK, now we have our amazing data! Let’s run this Linear model.

It’s extremely simple. We will use the function lm() and we will pass our data df and the formula we just made together!!

After fitting the model we extract the summary of it. This is how we will get all the information we need.

mod_lm = lm(LookingTime ~ TrialN*Event, data = df)
summary(mod_lm)


Call:
lm(formula = LookingTime ~ TrialN * Event, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-772.0 -159.7   -6.1  190.5  607.1 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1390.1655    29.1769  47.646   <2e-16 ***
TrialN               -7.5998     2.9897  -2.542   0.0113 *  
EventReward          99.8279    39.3360   2.538   0.0114 *  
TrialN:EventReward    0.1332     3.7240   0.036   0.9715    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 253.2 on 656 degrees of freedom
  (140 observations deleted due to missingness)
Multiple R-squared:  0.05147,   Adjusted R-squared:  0.04713 
F-statistic: 11.87 on 3 and 656 DF,  p-value: 1.417e-07

Woohoo! This was super simple!! 🎉 We can use the output of the model to understand whether the variables are predicting LookingTime. The magic number we’re looking for is the p-value, hiding in the last column of the Coefficients section. If the p-value is below 0.05, we’ve got ourselves an effect! If it’s above, sadly we don’t. AND YES, EVEN IF IT’S 0.051!!! Rules are rules in the p-value game!

What goodies can we spot here? First, our Intercept is significant (don’t worry, we’ll decode this mysterious value later!). Even more exciting, we’ve got a significant effect of TrialN – our continuous variable – (p = 0.0113) and a significant effect of Event (p = 0.0114). Double win! However, the interaction between TrialN and Event is playing hard to get with that p = 0.9715. That’s nowhere near significant!

This is already pretty cool, right?!? But here’s the deal - when looking at model outputs, people often get hypnotized by p-values. However, there’s MUCH more to unpack in a model summary! The full story of our data is waiting to be discovered in the complete model output. Let’s dive in together and crack this statistical puzzle!

Interpret our Model

Here what the summary is telling us!!

Call: This section just reports the function call that we passed to lm(). Nothing fancy, just reminding us what we asked for!

Residuals: This section shows the leftovers of our model - the differences between what we observed and what our model predicted. Think of residuals as the “Hmm, that’s interesting” parts that our model couldn’t explain. We want these to be as small as possible! 🔍

Coefficients: This is where the real magic happens! This section displays:
- Estimates - The actual numbers that tell us how much our outcome changes when we tweak each predictor
  
  Note
  
  Estimates are often called beta coefficients and are represented using the Greek letter β (we will use this in some plots as it is easier to show)
- Standard Error - How confident we are about those estimates (smaller = more confident!)
- T-value - A measure of how many standard errors away from zero our estimate is
- p-value - The statistical measure that helps us determine if an effect is statistically significant

Model Fit: The final act of our statistical show! Here we find:
- Residual standard error: The average miss by our prediction line ( lower scores are better).
- Multiple R-Squared: The proportion of variance explained by our predictors. Ranges from 0 (no explanation) to 1 (perfect explanation).
- Adjusted R-squared: A modified version of R-squared that accounts for the number of predictors in the model. Useful for comparing models with different numbers of variables.
- F-statistics and p-value: Tests whether the model as a whole is statistically significant. If p < 0.05, the model performs significantly better than using just the intercept.

Estimates

The Estimate section is probably one of the most important parts of our model summary. While the other columns (t-values and p-values) are just numbers that tell us whether the predictor fits, the estimates tell us HOW they fit!!

Let’s go together through the most challenging information:

(Intercept)

The intercept often confuses people who approach linear models for the first time. What exactly is it? 🤔

The (Intercept) represents the reference levels where all our predictors (TrialN and Event) are 0. While TrialN is easy to understand when it’s 0 (TrialN 0 in our case would be the first trial), you may be scratching your head thinking…how can Event be 0? It’s a categorical variable, it can’t be 0!!!! TrialN == 0…sure….but Event??

You are absolutely right! When a model encounters a categorical variable, it cleverly selects the first level of such variable as the reference level. If you take another look at our model summary, you can see that there’s information for the Reward level of the Event variable but nothing about the NoReward level. This is because the NoReward level has been selected by the model as the reference level and is thus represented in the intercept value! 💡

So our intercept (1390.1) actually represents the predicted LookingTime when:

We’re on the first trial (TrialN = 0)
We’re in the NoReward condition

The Standard Error of the estimate (29.1) tells us the precision of the estimate.

Tip

Since the intercept has a significant p-value, it means that the estimate for the NoReward condition at trial 0 is actually significantly different from 0. In other words, our participants are definitely looking at something during the NoReward condition in the first trial – their LookingTime isn’t zero! This might seem obvious (of course they’re looking!), but statistically confirming this baseline is actually meaningful.

Ok, all this explanation is great…but it’s much easier to visualize these Estimates!

As you can see, the intercept is pretty straightforward—it gives us the estimate when everything is set to 0, both for continuous and categorical variables. The intercept is the foundation of your model, where all the predictors are at their baseline value (in this case, NoReward was cleverly selected as the reference or 0 level for the categorical variable).

Event

Awesome! Now that we’ve got the intercept down, let’s take a look at the rest of the model output. We’ll skip over the TrialN variable for now and focus on what’s happening with the Event.

At first, the results for Event [Reward] might look like they’re giving us the value of looking time for the Reward. Super easy, right?!?

Well… not exactly! 🤔

In linear models, each coefficient shows the difference in relation to the intercept (the 0 or the reference level), not the exact value of the Reward condition.

It sounds a bit confusing, but let’s break it down. If we want to understand what is the estimate for a Rewarding event we need to take the Intercept (1390.1) –as we mentioned that is actually the event NoReward – and then just simply add to it the Estimate for the Event [Reward] (99.8). So the model is telling us that a Rewarding event should be 1390.1 + 99.8 = 1489.9.

See? Not too bad! Let’s visualize it and make it even clearer!

TrialN

So, interpreting the coefficients for categorical variables wasn’t too tricky, right? But what about continuous variables like TrialN?

No worries, it’s actually pretty straightforward! The coefficient for a continuous variable represents the slope, or the incline, of the line for that variable.

In simpler terms, it shows how much the outcome (in this case, LookingTime) changes for each unit increase in the continuous variable (TrialN). So, in our case the coefficient for TrialN is -7.5, this means that for each unit increase in TrialN, the LookingTime is expected to decrease by 7.5 units (assuming all other variables stay the same).

Important

Remember!! This coefficient represents the effect of TrialN specifically when Event is at its reference level (NoReward). In other words, this -7.5 decrease in LookingTime per trial applies specifically to the NoReward condition!

Even easier..let’s plot again!

Interaction

Alright, now we’re getting to the final step! Let’s talk about the interaction between TrialN and Event! Now, we’re not just dealing with a single factor or continuous variable, but looking at how they interact with each other. Don’t worry—if you understood the previous steps, this will be a breeze!

We’ll take it step by step and look at the interaction in our model parameters. The interaction term between TrialN and Event [Reward] tells us how the relationship between TrialN and LookingTime changes when we switch from the reference Event (NoReward) to the Reward condition.

To put it simply:

For NoReward events, LookingTime decreases by 7.5 units per trial (that’s our main TrialN coefficient)
The interaction coefficient (0.13) tells us how this slope changes for Reward events

So for Reward events, the slope would be: -7.5 + 0.13 = -7.37 units per trial.

What does this mean in real terms? If the interaction coefficient is positive (like in our example), it means participants’ looking time decreases more slowly during Reward trials compared to NoReward trials – the estimate is slightly less negative.

Important

In this model the interaction effect is extremely small and not significant. This means that there is actually no difference in how the TrialN predicted Looking time in either NoReward or Reward! Looking time decreases at basically the same rate regardless of which event type we’re looking at!!!

While there is no significant difference, let’s plot the estimated effects for TrialN for both Reward and NoReward conditions to visualize this relationship!

[1] time                        est_performance_EventTrial 
[3] est_performance_Interaction
<0 rows> (or 0-length row.names)

Important

Do you notice any difference in how the slopes between the two conditions? Probably not! If you examine the model summary, you’ll see the interaction effect is very small and not statistically significant.

This tells us that TrialN predicts Looking time similarly in both NoReward and Reward conditions. In other words, looking time decreases at essentially the same rate regardless of which event type participants observed.

I hope that by now you got a decent understanding on how to interpret what a linear model is telling you!! Different kinds of models will add small pieces of information here and there, but the main information will still be there. Thus, if you got here, you are a step closer to becoming a stats genius!!! 🧠📊

But wait, we’re not done yet! Now it’s time to check whether our model actually meets the assumptions of linear regression. Remember those plots we generated earlier? Let’s make sure our statistical foundation is solid before we draw any final conclusions!

Centering Predictors

We’ve seen how both continuous and categorical predictors are set to 0 in the intercept, and then we interpret the model output based on the difference between these reference levels.

But here’s the thing - sometimes comparing to 0 for a continuous variable doesn’t make much sense! Imagine we’re talking about body weight as a predictor. A weight of 0 would be… well, pretty weird as a comparison level, right? 🤔 This is one of the main reasons why it’s often suggested to center continuous predictors. By centering (subtracting the mean from each value), we transform our values so they have a mean of 0! This means the intercept, being at 0, will actually reflect the mean of the predictor rather than some arbitrary or impossible value.

Z-scoring takes this one step further by not only centering the data (subtracting the mean) but also scaling it by dividing by the standard deviation. This standardizes your variables to have a mean of 0 and a standard deviation of 1. The big advantage? It puts all your predictors on the same scale, making their coefficients directly comparable in terms of effect size.

In practice, it’s mostly suggested that you just go straight to z-scoring your continuous predictors as a simple one-size-fits-all approach. It gives you all the benefits of centering plus standardized coefficients that are easier to compare across different predictors!

On top of making it easier to interpret your model results, it will also help your model fit!! We won’t dwell too much on the statistical reasons, but these transformations reduce multicollinearity, stabilize coefficient estimates (especially for interactions), and improve model convergence. Your model just works better and makes more intuitive sense! 🚀

Run the model

Ok, this sounds fair right? Before running a model we can just standardize the values of our predictors (in our case TrialN). Yeah..but how to? Our favorite way is to use the standardize() function from the the easystats library.

Note

Easystats

Easystats is a collection of R packages that includes tools dedicated to the post-processing of statistical models. It is made of all these packages: report, correlation, modelbased, bayestestR, effectsize, see, parameters, performance, insight, datawizard. We will extensively use all these package in our tutorials. The cool thing is that you can import all of them by just simply importing the collection Easystats with library(easystats).

In this tutorial here we will use the function from the packages datawizardand performance(in the Model checks sections).

datawizard allows to manipulate, clean, transform, and prepare your data
performance is a package to check model performance metrics.

So now we import easystats and we use the function standardize()

library(easystats)
df$StandardTrialN = standardize(df$TrialN)
head(df)

  Id    Event TrialN ReactionTime LookingTime SES StandardTrialN
1  1 NoReward      1     344.1243   1139.8102 low     -1.6464789
2  1 NoReward      2     361.9389   1156.4140 low     -1.4731654
3  1 NoReward      3     406.3499   1073.5468 low     -1.2998518
4  1 NoReward      4     395.2774   1106.5766 low     -1.1265382
5  1 NoReward      5     382.1649    990.5572 low     -0.9532246
6  1 NoReward      6     367.3000   1053.6605 low     -0.7799111

We have created like this a new column StandardTrialN with the standardized values of the TrialN and now we can run the model again

mod_lm = lm(LookingTime ~ StandardTrialN*Event, data = df)
summary(mod_lm)


Call:
lm(formula = LookingTime ~ StandardTrialN * Event, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-772.0 -159.7   -6.1  190.5  607.1 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1310.3674    16.9387  77.359  < 2e-16 ***
StandardTrialN              -43.8501    17.2505  -2.542   0.0113 *  
EventReward                 101.2269    21.1961   4.776 2.21e-06 ***
StandardTrialN:EventReward    0.7688    21.4872   0.036   0.9715    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 253.2 on 656 degrees of freedom
  (140 observations deleted due to missingness)
Multiple R-squared:  0.05147,   Adjusted R-squared:  0.04713 
F-statistic: 11.87 on 3 and 656 DF,  p-value: 1.417e-07

As you can see the estimates are different!!! Because as we mentioned the reference level and the scale are different now. However do not panic, even if some values are different the relations are the same, we still have negative estimate for the time variable (here StandardTrialN and before TrialN) and positive for the Reward condition. Also the p-values are the same!

So there you have it - standardizing your predictors makes your model both more interpretable and more robust, all while keeping the fundamental relationships in your data intact. It’s one of those simple steps that can make a big difference in your statistical journey! 🌟

Model checks

So now we have run our model and seen the summary… That’s great but how can we know that our model actually is ok?? Linear models, like most statistical techniques require some data assumption to be run. These assumption need to be met otherwise even if our model could be showing amazing results it won’t be valid.

Warning

As we mentioned at the beginning of our tutorial the model we have run in this tutorial is a very simple model that is not actully…correct!!

What are these assumptions?? Well they depend a lot on the model you are running. We won’t go into much details as there are very good website that explain them¹ ,², in this simple linear mode they are:

Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time-series data.
Homoscedasticity/Homogeneity of variance: The residuals have constant variance at every level of x.
Normality: The residuals of the model are normally distributed.

Again this is not a theoretical tutorial. So we won’t go into details as which are the assumptions (please read some of the link provided tho!!) but we will show you how to actually check these assumptions.

There is a super easy and convenient way we usually check these assumptions. Using the easystats library we can run the function check_model() to indeed check the model assumptions.

check_model(mod_lm)

Perfect all done!! We have a plot of the model assumptions and we can check if they are met!! But what do these plot represent? Here below we created a table that mirrors each plot with it explanation in it. These are brief and simple explanations. If you want to understand more about the check_model() function we suggest you to read the documentation about it and also the very nice vignette that the package provides.

Model Diagnostic Checks
Posterior predictive checks Compare simulated data from the fitted model to actual data. This reveals systematic discrepancies, helping assess if the model captures key data features and if the chosen distribution family is appropriate.	Linearity Checks if predictors have a linear relationship with the outcome. A straight, horizontal line suggests the model specification is appropriate. Curved or sloped lines indicate potential non-linear relationships, signaling the need for model adjustments.
Homoscedasticity/Homogeneity of variance Checks for homoscedasticity (constant variance). Residuals should spread evenly around a horizontal line across all predictor values. Uneven spread suggests variance inconsistencies, potentially requiring model adjustments.	Outliers Identifies influential observations using Cook’s distance. Points beyond the dashed lines are outliers that may excessively impact model estimates, warranting further investigation or potential removal.
Multi-Collinearity Evaluates predictor independence. High collinearity indicates redundant predictors or possible unobserved variables influencing multiple predictors, which can affect interpretation and model stability.	Normality of residuals Uses Q-Q plots to assess if residuals follow a normal distribution. Points should align with the reference line; deviations suggest the model poorly predicts certain outcome ranges, potentially violating normality assumptions.

Given these information about these plots…. don’t you think there is something that doesn’t add up???????? 🤔

Well some assumptions are not really met aren’t they!!! Look at those residuals!!! Not great…not really normal…not really straight!! 📉

Do not worry!! We have fit an extremely simple model in this tutorial… as you will see in the next tutorial this model is actually WRONG!!! Just do not worry…this model has served its purpose…….for now.

Tip

One of the awesome features of easystats is its broad support for various model types. What’s the big deal? Well, it means that the check_model() function adapts its checks based on the specific model you’re using! This flexibility makes it an incredibly powerful and user-friendly tool. Most of the linear and non-linear models (or at least most of them) can be fed into the check_model() function, allowing you to easily verify if it meets the necessary assumptions.

Keep in mind: Always be aware of which assumptions your model should satisfy. We’re not suggesting you use this function blindly! Instead, we’re showing you how to efficiently plot all relevant assumptions in one go. It’s simpler and quicker!!

Statistical tests

You’ve probably noticed that we’ve been relying on visual checks so far. In our view, this is often the best approach, as statistical tests for model assumptions can sometime be overly stringent. However, there may be situations where you need to provide statistical evidence to support your model assumptions. This often happens when a reviewer (let’s call them Reviewer 2, shall we?) insists on seeing numerical proof. Fortunately, easystats has got your back.

Here are some examples of what you can use:

check_normality(mod_lm)

Warning: Non-normality of residuals detected (p < .001).

To check the normality of our residuals and:

check_homogeneity(mod_lm)

OK: There is not clear evidence for different variances across groups (Bartlett Test, p = 0.986).

to check homoscedasticity/homogeneity of variance. Again you can find all the function in the performance package (part of the Easystats collection)

END

Well done reaching the end!!

This was a tough one!! But again, remember these are solid foundations that will come into use so many times in the future!!

Understanding linear models is like learning to ride a bike - it might feel wobbly at first, but once you get it, you’ll never forget! You now have the skills to interpret intercepts, coefficients, p-values, and interactions, plus you know how to check if your model is actually doing what it’s supposed to do.

--- title: "Linear Models" date: "5/03/2025" author-meta: "Tommaso Ghilardi" description-meta: "Learn what a linear model is, how to run it and how to check its assumptions with a simple tutorial" keywords-meta: "R, lm, Linear models, statistics, analysis, psychology, tutorial, experiment, DevStart, developmental science" categories: - Stats - R - linear models --- Welcome to the first tutorial on data analysis!!! Today we are going to talk about one of the most flexible statistical methods: Linear models. Let’s be clear, WE ARE NOT STATISTICIANS!!!! We’ll be discussing linear models in a very accessible and practical manner. Our explanations might not align with the rigorous definitions statisticians are accustomed to, and for that, we apologize in advance! However, our aim is to provide a stepping stone for you to grasp the concept of linear models and similar analyses. Let’s get started! ::: callout-note and only R [![](https://www.r-project.org/Rlogo.png){style="float:left ; margin-right:10px;margin-bottom:10px;" width="31" height="21"}](https://www.r-project.org/) In the tutorials regarding stats, we will use R (and R-Studio as the IDE). The concepts and procedures are the same in other programming languages (e.g., Python, Julia). However, we choose R for a simple reason. R is just better at most statistical things!! Deal with it! ::: # What is a linear model? A linear model is a simple statistical test that tries to find the best line that represent the relations between two variables ( or more). [![image found on towardsdatascience.com](https://miro.medium.com/v2/resize:fit:640/format:webp/1*MtuQBTW0-XbjA2RrP2z3Kw.gif){fig-align="center"}](https://towardsdatascience.com/linear-regression-5100fe32993a) What's truly fascinating about linear models is their versatility. They start off incredibly simple, but their complexity can grow exponentially! This makes them a remarkably flexible tool in the world of data analysis. Think of linear models as the foundation of a house. ![](/images/Stats/Lm/House.png){fig-align="right" style="float:left; margin-right:10px;margin-bottom:10px;" width="150"} You can start with a basic structure, but with some clever modifications (like mixed effect models, generalized linear models, quadratic models or additive models), you can build anything from a cozy cottage to a multi-story mansion. In essence, linear models offer a perfect balance: they're accessible enough for beginners to grasp, yet powerful enough to satisfy the needs of advanced researchers. As we dive deeper into this topic, you'll see just how these seemingly simple tools can unlock complex insights in your data. # Hands-on OK, enough chitchat - let's start with a practical example. We'll be working with a data set we created specifically for this tutorial. As mentioned, we're going to begin with a very basic model, and in the upcoming tutorials, we'll gradually increase both the complexity and accuracy of our approach. So, if you notice something that doesn't seem quite perfect at this stage, don't worry! It's all part of the plan. Our goal is to guide you step-by-step towards building the best model. Just remember, this process takes time! ## Import data You can download the data that we will use in this tutorial from here: {{< downloadthis ../../resources/Stats/Dataset.csv label="Dataset.csv" dname="Dataset" type="secondary" >}} Once downloaded we need to import it in our R session. Here we read our csv and we print a small preview of it. ```{r} df = read.csv("../../resources/Stats/Dataset.csv") head(df) ``` You can see that the data is really simple! We have 4 columns: - `Id` column that tell us form which participant the data was collected - `Event` represent which condition we were in - `TrialN` the sequential number of trials in the experiment - `ReactionTime` how quickly participants reacted to the stimuli. We will ignore this variable for now - `LookingTime`the variable of interest for today, the one that we want to model - `SES` represents socioeconomic status collected via questionnaire (high, medium, or low) for each participant - we'll also ignore this for now :::::: callout-important ## Long format One important information that we need to keep in mind is that to run `lm()` (and most models!!) in R we need the data in a long format and not a wide format. In long format, each row represents a single observation. Variables are organized in columns, with one column for the variable names and another for the values. This means that the column you want to model (in the example `LookingTime`) has 1 row for observation but the other columns usually have repeated entries ( e.g. `Id` , `TrialN`, `Event`) Wide format, on the other hand, has each row representing a subject or group, with multiple columns for different variables or time-points. While this can be visually appealing for humans, it's not optimal for our linear modeling needs. Here a small comparison of the two formats: ```{r} #| message: false #| warning: false #| include: false ##| layout: 1,1 library(tidyverse) # Create a small eyetracking dataset in long format long_data <- tibble( Id = rep(c("1", "2", "3"), each = 6), Event = rep(rep(c("Reward", "NoReward"), each = 3), 3), TrialN = rep(1:3, 6), LookingTime = c( # P01 - Reward condition 1450, 1420, 1390, # P01 - NoReward condition 1380, 1350, 1320, # P02 - Reward condition 1480, 1460, 1410, # P02 - NoReward condition 1390, 1370, 1340, # P03 - Reward condition 1520, 1490, 1450, # P03 - NoReward condition 1410, 1380, 1350 ) ) wide_data <- long_data %>% pivot_wider( id_cols = Id, names_from = c(Event, TrialN), values_from = LookingTime, names_sep = "_Trial" ) head(wide_data) head(long_data, 8) ``` ::::: columns ::: {.column width="35%"} ### Long Format ```{r} #| echo: false head(long_data, 8) ``` ::: ::: {.column width="59%"} ### Wide Format ```{r} #| echo: false wide_data ``` ::: ::::: If your data is currently in wide format, don't worry! R provides tools like the [tidyr](https://tidyr.tidyverse.org/index.html) package with functions such as [pivot_longer()](https://tidyr.tidyverse.org/reference/pivot_longer.html) to easily convert your data from wide to long format. :::::: ## Formula To run models in R we usually use formulas! Sounds complex doesn't it?!? Well it is not, let me guide you through it. In R, model formulas follow a specific structure. On the left side of the formula, we place our dependent variable - the outcome we collected and we're interested in studying. In this case, it's the `LookingTime` column. Next, we use the tilde symbol `~`. This tilde tells R that we want to predict the variable on the left using the variables on the right side of the formula. On the right side, we list the independent variables (predictors) we believe may influence our dependent variable. To test whether `TrialN` predicts LookingTime, we can use the formula: - `LookingTime ~ TrialN`. This basic structure allows us to examine a single predictor. We can extend this model by adding another variable, such as `Event`, to see if it also predicts LookingTime: - `LookingTime ~ TrialN + Event`. This formulation tells the model to assess whether either `TrialN` and `Event` predicts `LookingTime`, treating them as independent predictors. To examine the interaction between these variables, we use a slightly different syntax: - `LookingTime ~ TrialN : Event`. This instructs the model to evaluate whether the interaction between the two variables predicts `LookingTime`. It's important to note that using `:` only tests the interaction, not the individual effects of each variable. To include both main effects and their interaction, we can use the formula: - `LookingTime ~ TrialN + Event + TrialN:Event`. R offers a shorthand for this complete model using the `*` operator. The formula: - `LookingTime ~ TrialN * Event` is equivalent to the longer version above, testing both main effects and the interaction in a more concise format. These formulas are for simple linear models. Different types of models add small and different pieces to this basic structure. We will see in the next tutorial how to handle these "add-ons". Now that we have seen how to make a proper formula let's use it in our model!! ## Run the model OK, now we have our amazing data! Let's run this Linear model. It's extremely simple. We will use the function `lm()` and we will pass our data `df` and the formula we just made together!! After fitting the model we extract the summary of it. This is how we will get all the information we need. ```{r} #| label: RunLm #| message: false #| warning: false mod_lm = lm(LookingTime ~ TrialN*Event, data = df) summary(mod_lm) ``` Woohoo! This was super simple!! 🎉 We can use the output of the model to understand whether the variables are predicting LookingTime. The magic number we're looking for is the p-value, hiding in the last column of the **Coefficients section**. If the p-value is below 0.05, we've got ourselves an effect! If it's above, sadly we don't. [**AND YES, EVEN IF IT'S 0.051!!!**]{.underline} Rules are rules in the p-value game! What goodies can we spot here? First, our **Intercept** is significant (don't worry, we'll decode this mysterious value later!). Even more exciting, we've got a significant effect of **TrialN** – our continuous variable – (p = 0.0113) and a significant effect of **Event** (p = 0.0114). Double win! However, the interaction between TrialN and Event is playing hard to get with that p = 0.9715. That's nowhere near significant! This is already pretty cool, right?!? But here's the deal - when looking at model outputs, people often get hypnotized by p-values. **However, there's MUCH more to unpack in a model summary!** The full story of our data is waiting to be discovered in the complete model output. Let's dive in together and crack this statistical puzzle! # Interpret our Model Here what the summary is telling us!! - **Call**: This section just reports the function call that we passed to `lm()`. Nothing fancy, just reminding us what we asked for!  - **Residuals**: This section shows the leftovers of our model - the differences between what we observed and what our model predicted. Think of residuals as the "Hmm, that's interesting" parts that our model couldn't explain. We want these to be as small as possible! 🔍  - **Coefficients**: This is where the real magic happens! This section displays: - **Estimates** - The actual numbers that tell us how much our outcome changes when we tweak each predictor ::: callout-note Estimates are often called **beta coefficients** and are represented using the Greek letter β (we will use this in some plots as it is easier to show) ::: - **Standard Error** - How confident we are about those estimates (smaller = more confident!) - **T-value** - A measure of how many standard errors away from zero our estimate is - **p-value** - The statistical measure that helps us determine if an effect is statistically significant  - **Model Fit**: The final act of our statistical show! Here we find: - **Residual standard error:** The average miss by our prediction line ( lower scores are better). - **Multiple R-Squared:** The proportion of variance explained by our predictors. Ranges from 0 (no explanation) to 1 (perfect explanation). - **Adjusted R-squared:** A modified version of R-squared that accounts for the number of predictors in the model. Useful for comparing models with different numbers of variables. - **F-statistics and p-value:** Tests whether the model as a whole is statistically significant. If p \< 0.05, the model performs significantly better than using just the intercept. ## Estimates The Estimate section is probably one of the most important parts of our model summary. While the other columns (t-values and p-values) are just numbers that tell us whether the predictor fits, the estimates tell us HOW they fit!! Let's go together through the most challenging information: ### (Intercept) The intercept often confuses people who approach linear models for the first time. What exactly is it? 🤔 The (Intercept) represents the reference levels where all our predictors (`TrialN` and `Event`) are **0**. While `TrialN` is easy to understand when it's 0 (`TrialN` 0 in our case would be the first trial), you may be scratching your head thinking...how can `Event` be 0? It's a categorical variable, it can't be 0!!!! `TrialN == 0`...sure....but `Event`?? You are absolutely right! When a model encounters a categorical variable, it cleverly selects the first level of such variable as the reference level. If you take another look at our model summary, you can see that there's information for the **Reward** level of the `Event` variable but nothing about the **NoReward** level. This is because the **NoReward** level has been selected by the model as the reference level and is thus represented in the intercept value! 💡 So our intercept (1390.1) actually represents the predicted **LookingTime** when: 1. We're on the first trial (`TrialN` = 0) 2. We're in the **NoReward** condition The Standard Error of the estimate (29.1) tells us the precision of the estimate. ::: callout-tip Since the intercept has a significant p-value, it means that the estimate for the **NoReward** condition at trial 0 is actually significantly different from 0. In other words, our participants are definitely looking at something during the **NoReward** condition in the first trial – their **LookingTime** isn't zero! This might seem obvious (of course they're looking!), but statistically confirming this baseline is actually meaningful. ::: Ok, all this explanation is great...but it's much easier to visualize these Estimates! ```{r} #| echo: false #| message: false #| warning: false #| fig-dpi: 300 #| fig-height: 8 #| fig-width: 10 library(ggplot2) library(easystats) library(ggbreak) model_p = parameters(mod_lm) # Define intercept and slope for the NoReward intercept_brush <- model_p[1,2] intercept_brush_se <- model_p[1,3] # Create a named vector for colors color_map <- c("NoReward" = "darkred") ggplot() + geom_vline(xintercept = 0, lwd = 0.4, linetype = 'dashed') + geom_pointrange( aes(x = -0.15, y = intercept_brush, ymin = intercept_brush - intercept_brush_se, ymax = intercept_brush + intercept_brush_se, color = "NoReward"), lwd = 1.3 ) + geom_segment( aes(x = 0.15, xend = 0.15, y = 0, yend = intercept_brush), color = "black", lwd = 1.2, arrow = arrow(length = unit(0.1, "inches"), ends = "both") ) + annotate("text", x = 0.20, y = 1300, label = expression(beta), size = 4) + scale_color_manual( values = color_map) + labs(y = "Looking time", x = "Event number", color = '') + theme_modern(base_size = 20) + theme( legend.position = "bottom", axis.line.y.right = element_blank(), axis.ticks.y.right = element_blank(), axis.text.y.right = element_blank() ) + # Only set x-limits via scale_x_continuous scale_x_continuous( limits = c(-1, 3), breaks = c(-1, 0, 1, 2, 3), labels = c("", "0\n(Intercept)", "", "", "") ) + # Let ggbreak handle y-limits and the break scale_y_break( breaks = c(200, 1250), scale = 3, space = 1 )+ ylim(0, 1550) ``` As you can see, the intercept is pretty straightforward—it gives us the estimate when everything is set to 0, both for continuous and categorical variables. The intercept is the foundation of your model, where all the predictors are at their baseline value (in this case, **NoReward** was cleverly selected as the reference or 0 level for the categorical variable). ### Event Awesome! Now that we've got the intercept down, let's take a look at the rest of the model output. We'll skip over the **TrialN** variable for now and focus on what's happening with the **Event**. At first, the results for `Event [Reward]` might look like they're giving us the value of looking time for the **Reward**. Super easy, right?!? Well... not exactly! 🤔 **In linear models, each coefficient shows the difference in relation to the intercept (the 0 or the reference level), not the exact value of the Reward condition.** It sounds a bit confusing, but let's break it down. If we want to understand what is the estimate for a **Rewarding** event we need to take the **Intercept** (1390.1) –as we mentioned that is actually the event **NoReward** – and then just simply add to it the Estimate for the `Event [Reward]` (99.8). So the model is telling us that a Rewarding event should be 1390.1 + 99.8 = 1489.9. See? Not too bad! Let's visualize it and make it even clearer! ```{r} #| echo: false #| message: false #| warning: false #| fig-dpi: 300 #| fig-height: 8 #| fig-width: 10 # Define intercept and slope for the NoReward intercept_NoReward <- model_p[1,2] intercept_NoReward_se <- model_p[1,3] intercept_Reward <- model_p[3,2] + intercept_NoReward intercept_Reward_se <- model_p[3,3] color_map <- c("NoReward" = "darkred", "Reward" = 'darkblue') ggplot() + geom_vline(xintercept = 0, lwd = 0.4, linetype = 'dashed') + geom_pointrange(aes(x=-.15, y=intercept_NoReward, ymin= intercept_NoReward-intercept_NoReward_se, ymax= intercept_brush+intercept_NoReward_se, color = 'NoReward'), lwd=1.3)+ geom_pointrange(aes(x=.15, y=intercept_Reward, ymin= intercept_Reward-intercept_Reward_se, ymax= intercept_Reward+intercept_Reward_se, color = 'Reward'), lwd=1.3)+ geom_segment(aes(x = 0, xend = 0, y = intercept_NoReward, yend = intercept_Reward), color = "black", lwd = 1.2, arrow = arrow( length = unit(0.1, "inches"), ends = "both")) + annotate("text", x = -0.05, y = 1450, label = expression(beta), size = 4) + # Customize the color scale for the legend scale_color_manual( values = color_map) + coord_cartesian(xlim = c(-1,3), ylim = c(1200,1550))+ # Labels and theme labs(y = 'Looking time', x = 'Event number', color = '') + theme_classic(base_size = 20)+ theme(legend.position = 'bottom')+ # Customize x-axis labels scale_x_continuous( breaks = c(0), # The x-values where you want ticks labels = c("0\n(Intercept)") # Corresponding custom labels ) ``` ### TrialN So, interpreting the coefficients for categorical variables wasn't too tricky, right? But what about continuous variables like `TrialN`? No worries, it's actually pretty straightforward! The coefficient for a continuous variable represents the slope, or the incline, of the line for that variable. In simpler terms, it shows how much the outcome (in this case, `LookingTime`) changes for each unit increase in the continuous variable (`TrialN`). So, in our case the coefficient for `TrialN` is -7.5, this means that for each unit increase in `TrialN`, the LookingTime is expected to decrease by 7.5 units (assuming all other variables stay the same). ::: callout-important Remember!! This coefficient represents the effect of `TrialN` specifically when `Event` is at its reference level (**NoReward**). In other words, this -7.5 decrease in LookingTime per trial applies specifically to the **NoReward** condition! ::: Even easier..let's plot again! ```{r} #| echo: false #| message: false #| warning: false #| fig-dpi: 300 #| fig-height: 8 #| fig-width: 10 slope_EventTrial <- model_p[2,2] # Some hypothetical TrialN values time_values <- seq(-1, 20, by = 2) # Create a small data frame for the main (NoReward) line df_p <- data.frame( time = time_values, est_performance_EventTrial = intercept_NoReward + slope_EventTrial * time_values ) color_map <- c("NoReward" = "darkred", "Reward" = 'darkblue', "EventTrial" = '#e04d01') ggplot() + geom_vline(xintercept = 0, lwd =0.4,linetype = "dashed") + geom_line(data = df_p, aes(x = time, y = est_performance_EventTrial), color = 'darkred', size = 1) + # geom_pointrange(aes(x =0, y = intercept_brush, ymin = intercept_brush - intercept_brush_se,ymax = intercept_brush + intercept_brush_se, fill = 'NoReward' ), # color = 'black', shape = 21, # lwd = 1.3,fatten = 8) + geom_pointrange(aes(x=0, y=intercept_brush, ymin= intercept_brush-intercept_brush_se, ymax= intercept_brush+intercept_brush_se, color = 'NoReward'), lwd=1.3)+ annotate("text", x = 4.4, y = 1315, label = expression(beta), size = 4) + geom_path(aes(x = c(4, 4, 10), y = c(1350, 1300, 1300)), color = "black", lwd = 1.2, arrow = arrow( length = unit(0.1, "inches"), ends = "both"))+ # Customize the color scale for the legend scale_color_manual(values = color_map) + scale_fill_manual(values = color_map)+ # Set the limits for x and y axes coord_cartesian(xlim = c(-1,20), ylim = c(1200,1550))+ # Labels and theme labs(y = 'Looking time', x = 'Event number', color = '') + theme_classic(base_size = 20)+ theme(legend.position = 'bottom')+ # Customize x-axis labels scale_x_continuous( breaks = seq(0,20,4), # The x-values where you want ticks labels = c("0\n(Intercept)", as.character(seq(4,20,4))) # Corresponding custom labels ) ``` ### Interaction Alright, now we're getting to the final step! Let's talk about the interaction between `TrialN` and `Event`! Now, we're not just dealing with a single factor or continuous variable, but looking at how they interact with each other. Don't worry—if you understood the previous steps, this will be a breeze! We'll take it step by step and look at the interaction in our model parameters. The interaction term between `TrialN` and `Event [Reward]` tells us how the relationship between `TrialN` and `LookingTime` changes when we switch from the reference Event (**NoReward**) to the **Reward** condition. To put it simply: - For **NoReward** events, **LookingTime** decreases by 7.5 units per trial (that's our main `TrialN` coefficient) - The interaction coefficient (0.13) tells us how this slope changes for **Reward** events So for **Reward** events, the slope would be: -7.5 + 0.13 = -7.37 units per trial. What does this mean in real terms? If the interaction coefficient is positive (like in our example), it means participants' looking time decreases more slowly during **Reward** trials compared to **NoReward** trials – the estimate is slightly less negative. ::: callout-important In this model the interaction effect is extremely small and not significant. This means that there is actually no difference in how the TrialN predicted Looking time in either NoReward or Reward! Looking time decreases at basically the same rate regardless of which event type we're looking at!!! ::: While there is no significant difference, let's plot the estimated effects for **TrialN** for both Reward and **NoReward** conditions to visualize this relationship! ```{r} #| echo: false #| message: false #| warning: false #| fig-dpi: 300 #| fig-height: 8 #| fig-width: 10 # Define intercept and slope for the NoReward condition intercept_NoReward <- model_p[1,2] # Intercept for NoReward intercept_NoReward_se <- model_p[1,3] # Define intercept for Reward condition (Intercept + Reward Effect) intercept_Reward <- model_p[3,2] + intercept_NoReward intercept_Reward_se <- model_p[3,3] # Define slopes slope_EventTrial <- model_p[2,2] # Slope for Event Trial slope_Interaction <- model_p[4,2] # Interaction effect # Some hypothetical TrialN values time_values <- seq(-1, 20, by = .05) # index <- which.min(abs(time_values - (-0.15))) # Create a small data frame for plotting df_p <- data.frame( time = time_values, est_performance_EventTrial = intercept_NoReward + slope_EventTrial * time_values, est_performance_Interaction = intercept_Reward + (slope_EventTrial + slope_Interaction) * time_values ) df_p[df_p$time == -0.15,] color_map <- c("NoReward" = "darkred", "Reward" = 'darkblue') ggplot()+ geom_vline(xintercept = 0, lwd =0.4, linetype = 'dashed') + geom_line(data = df_p, aes(x = time, y = est_performance_EventTrial), color = "darkred", lwd = 1) + geom_line(data = df_p, aes(x = time, y = est_performance_Interaction), color = "darkblue", lwd = 1)+ geom_pointrange(aes(x =-.15, y = intercept_brush, ymin = intercept_brush - intercept_brush_se,ymax = intercept_brush + intercept_brush_se, color = 'NoReward' ), lwd = 1.3) + geom_pointrange(aes(x=.15, y=intercept_Reward, ymin= intercept_Reward-intercept_Reward_se, ymax= intercept_Reward+intercept_Reward_se, color = 'Reward'), lwd = 1.3)+ annotate("text", x = 4.4, y = 1315, label = expression(beta), size = 4) + geom_path(aes(x = c(4, 4, 10), y = c(1350, 1300, 1300)), color = "black", lwd = 1.2, arrow = arrow( length = unit(0.1, "inches"), ends = "both"))+ annotate("text", x = 6.4, y = 1415, label = expression(beta), size = 4) + geom_path(aes(x = c(6, 6, 11), y = c(1440, 1400, 1400)), color = "black", lwd = 1.2, arrow = arrow( length = unit(0.1, "inches"), ends = "both"))+ scale_color_manual( values = color_map) + coord_cartesian(xlim = c(-1,20), ylim = c(1200,1550))+ # Labels and theme labs(y = 'Looking time', x = 'Event number', color = '') + theme_classic(base_size = 20)+ theme(legend.position = 'bottom')+ # Customize x-axis labels scale_x_continuous( breaks = seq(0,20,4), # The x-values where you want ticks labels = c("0\n(Intercept)", as.character(seq(4,20,4))) # Corresponding custom labels ) ``` ::: callout-important Do you notice any difference in how the slopes between the two conditions? Probably not! If you examine the model summary, you'll see the interaction effect is very small and not statistically significant. This tells us that TrialN predicts Looking time similarly in both **NoReward** and Reward conditions. In other words, looking time decreases at essentially the same rate regardless of which event type participants observed. ::: I hope that by now you got a decent understanding on how to interpret what a linear model is telling you!! Different kinds of models will add small pieces of information here and there, but the main information will still be there. Thus, if you got here, you are a step closer to becoming a stats genius!!! 🧠📊 **But wait, we're not done yet!** Now it's time to check whether our model actually meets the assumptions of linear regression. Remember those plots we generated earlier? Let's make sure our statistical foundation is solid before we draw any final conclusions! # Centering Predictors We've seen how both continuous and categorical predictors are set to 0 in the intercept, and then we interpret the model output based on the difference between these reference levels. **But here's the thing** - sometimes comparing to 0 for a continuous variable doesn't make much sense! Imagine we're talking about body weight as a predictor. A weight of 0 would be... well, pretty weird as a comparison level, right? 🤔 This is one of the main reasons why it's often suggested to center continuous predictors. By centering (subtracting the mean from each value), we transform our values so they have a mean of 0! This means the intercept, being at 0, will actually reflect the mean of the predictor rather than some arbitrary or impossible value. Z-scoring takes this one step further by not only centering the data (subtracting the mean) but also scaling it by dividing by the standard deviation. This standardizes your variables to have a mean of 0 and a standard deviation of 1. The big advantage? It puts all your predictors on the same scale, making their coefficients directly comparable in terms of effect size. In practice, it's mostly suggested that you just go straight to z-scoring your continuous predictors as a simple one-size-fits-all approach. It gives you all the benefits of centering plus standardized coefficients that are easier to compare across different predictors! On top of making it easier to interpret your model results, it will also help your model fit!! We won't dwell too much on the statistical reasons, but these transformations reduce multicollinearity, stabilize coefficient estimates (especially for interactions), and improve model convergence. Your model just works better and makes more intuitive sense! 🚀 ## Run the model Ok, this sounds fair right? Before running a model we can just standardize the values of our predictors (in our case TrialN). Yeah..but how to? Our favorite way is to use the `standardize()` function from the the [easystats](https://easystats.github.io/easystats/) library. ::: callout-note [**Easystats**](https://easystats.github.io/easystats/) Easystats is a collection of R packages that includes tools dedicated to the post-processing of statistical models. It is made of all these packages: [**report**](https://easystats.github.io/report/index.html), [**correlation**](https://easystats.github.io/correlation/index.html), [**modelbased**](https://easystats.github.io/modelbased/index.html), [**bayestestR**](https://easystats.github.io/bayestestR/index.html), [**effectsize**](https://easystats.github.io/effectsize/index.html), [**see**](https://easystats.github.io/see/index.html), [**parameters**](https://easystats.github.io/parameters/index.html), [**performance**](https://easystats.github.io/performance/), [**insight**](https://easystats.github.io/insight/index.html), [**datawizard**](https://easystats.github.io/datawizard/index.html). We will extensively use all these package in our tutorials. The cool thing is that you can import all of them by just simply importing the collection **Easystats** with `library(easystats)`. In this tutorial here we will use the function from the packages [**datawizard**](https://easystats.github.io/datawizard/index.html)and [**performance**](https://easystats.github.io/performance/)(in the [Model checks] sections)**.** - [**datawizard**](https://easystats.github.io/datawizard/index.html) allows to manipulate, clean, transform, and prepare your data - [**performance**](https://easystats.github.io/performance/) is a package to check model performance metrics. ::: So now we import `easystats` and we use the function `standardize()` ```{r} #| message: false #| warning: false library(easystats) df$StandardTrialN = standardize(df$TrialN) head(df) ``` We have created like this a new column `StandardTrialN` with the standardized values of the TrialN and now we can run the model again ```{r} #| label: RunStandardLm #| message: false #| warning: false mod_lm = lm(LookingTime ~ StandardTrialN*Event, data = df) summary(mod_lm) ``` As you can see the estimates are different!!! Because as we mentioned the reference level and the scale are different now. However do not panic, even if some values are different the relations are the same, we still have negative estimate for the time variable (here StandardTrialN and before TrialN) and positive for the Reward condition. Also the p-values are the same! So there you have it - standardizing your predictors makes your model both more interpretable and more robust, all while keeping the fundamental relationships in your data intact. It's one of those simple steps that can make a big difference in your statistical journey! 🌟 # Model checks So now we have run our model and seen the summary... That's great but how can we know that our model actually is ok?? Linear models, like most statistical techniques require some data assumption to be run. These assumption need to be met otherwise even if our model could be showing amazing results it won't be valid. ::: callout-warning As we mentioned at the beginning of our tutorial the model we have run in this tutorial is a very simple model that is not actully...correct!! ::: What are these assumptions?? Well they depend a lot on the model you are running. We won't go into much details as there are very good website that explain them[1](https://bookdown.org/pingapang9/linear_models_bookdown/assumptions.html) ,[2](https://www.statology.org/linear-regression-assumptions/), in this simple linear mode they are: 1. **Linear relationship:** There exists a linear relationship between the independent variable, x, and the dependent variable, y. 2. **Independence:** The residuals are independent. In particular, there is no correlation between consecutive residuals in time-series data. 3. **Homoscedasticity/Homogeneity of variance:** The residuals have constant variance at every level of x. 4. **Normality:** The residuals of the model are normally distributed. Again this is not a theoretical tutorial. So we won't go into details as which are the assumptions (please read some of the link provided tho!!) but we will show you how to actually check these assumptions. There is a super easy and convenient way we usually check these assumptions. Using the [easystats](https://easystats.github.io/easystats/) library we can run the function [`check_model()`](https://easystats.github.io/performance/reference/check_model.html) to indeed check the model assumptions. ```{r} #| label: PlotCheck #| message: false #| warning: false #| fig-dpi: 300 #| fig-height: 11 #| fig-width: 14 check_model(mod_lm) ``` Perfect all done!! We have a plot of the model assumptions and we can check if they are met!! But what do these plot represent? Here below we created a table that mirrors each plot with it explanation in it. These are brief and simple explanations. If you want to understand more about the [`check_model()`](https://easystats.github.io/performance/reference/check_model.html) function we suggest you to read the documentation about it and also the [very nice vignette](https://easystats.github.io/performance/articles/check_model.html) that the package provides. ```{r} #| label: TableChecks #| echo: false #| message: false #| warning: false #| fig-height: 10 #| fig-width: 10 library(gt) library(gtExtras) library(glue) # Create a dataframe that resembles the content of the image data <- data.frame( Column1 = c( glue("**Posterior predictive checks** Compare simulated data from the fitted model to actual data. This reveals systematic discrepancies, helping assess if the model captures key data features and if the chosen distribution family is appropriate."), glue("**Homoscedasticity/Homogeneity of variance** Checks for homoscedasticity (constant variance). Residuals should spread evenly around a horizontal line across all predictor values. Uneven spread suggests variance inconsistencies, potentially requiring model adjustments."), glue("**Multi-Collinearity** Evaluates predictor independence. High collinearity indicates redundant predictors or possible unobserved variables influencing multiple predictors, which can affect interpretation and model stability.") ), Column2 = c( glue("**Linearity** Checks if predictors have a linear relationship with the outcome. A straight, horizontal line suggests the model specification is appropriate. Curved or sloped lines indicate potential non-linear relationships, signaling the need for model adjustments."), glue("**Outliers** Identifies influential observations using Cook's distance. Points beyond the dashed lines are outliers that may excessively impact model estimates, warranting further investigation or potential removal."), glue("**Normality of residuals** Uses Q-Q plots to assess if residuals follow a normal distribution. Points should align with the reference line; deviations suggest the model poorly predicts certain outcome ranges, potentially violating normality assumptions.") ) ) # Create the gt table, apply the custom theme, and add additional styling gt(data) %>% fmt_markdown(columns = everything()) %>% gt_theme_nytimes() %>% # Add a title to the table and ensure it is centered tab_header( title = md("**Model Diagnostic Checks**") ) %>% tab_style( style = cell_text(align = "center"), locations = cells_title(groups = "title") ) %>% # Hide the headers by setting empty labels cols_label( Column1 = "", Column2 = "" ) %>% # Add borders around all cells tab_style( style = cell_borders(sides = "all", color = "black", weight = px(1)), locations = cells_body() ) %>% # Add borders around the entire table tab_options( table.border.top.style = "solid", table.border.top.color = "black", table.border.top.width = px(1), table.border.bottom.style = "solid", table.border.bottom.color = "black", table.border.bottom.width = px(1), table.border.left.style = "solid", table.border.left.color = "black", table.border.left.width = px(1), table.border.right.style = "solid", table.border.right.color = "black", table.border.right.width = px(1), column_labels.hidden = TRUE ) ``` Given these information about these plots.... don't you think there is something that doesn't add up???????? 🤔 Well some assumptions are not really met aren't they!!! Look at those residuals!!! Not great...not really normal...not really straight!! 📉 Do not worry!! We have fit an extremely simple model in this tutorial... as you will see in the next tutorial this model is actually WRONG!!! Just do not worry...this model has served its purpose.......for now. ::: callout-tip One of the awesome features of `easystats` is its broad support for various model types. What's the big deal? Well, it means that the `check_model()` function adapts its checks based on the specific model you're using! This flexibility makes it an incredibly powerful and user-friendly tool. Most of the linear and non-linear models (or at least most of them) can be fed into the `check_model()` function, allowing you to easily verify if it meets the necessary assumptions. **Keep in mind: Always be aware of which assumptions your model should satisfy. We're not suggesting you use this function blindly! Instead, we're showing you how to efficiently plot all relevant assumptions in one go. It's simpler and quicker!!** ::: ## Statistical tests You've probably noticed that we've been relying on visual checks so far. In our view, this is often the best approach, as statistical tests for model assumptions can sometime be overly stringent. However, there may be situations where you need to provide statistical evidence to support your model assumptions. This often happens when a reviewer (let's call them Reviewer 2, shall we?) insists on seeing numerical proof. Fortunately, `easystats` has got your back. Here are some examples of what you can use: ```{r} #| label: StatNormality #| warning: false check_normality(mod_lm) ``` To check the normality of our residuals and: ```{r} #| label: StatHomogeinity #| warning: false check_homogeneity(mod_lm) ``` to check homoscedasticity/homogeneity of variance. Again you can find all the function in the **performance** package (part of the Easystats collection) # END Well done reaching the end!! This was a tough one!! But again, remember these are solid foundations that will come into use so many times in the future!! Understanding linear models is like learning to ride a bike - it might feel wobbly at first, but once you get it, you'll never forget! You now have the skills to interpret intercepts, coefficients, p-values, and interactions, plus you know how to check if your model is actually doing what it's supposed to do.