News Flash! Effect Size doesn't matter!
PGS-world is moving from anger/denial to the bargaining phase*
[Small changes made in response to comments 7/2/25]
I have argued for some time that the world of GWAS-based behavior genetics is facing an effect size crisis. The relationships between PGS [a polygenic score is a method for adding up the tiny effects of genetic variants to form a normally distributed variable that can be used to predict outcomes from genetic data] and their intended targets have never been exactly large, but two things have happened: (1) The original estimates of PGS validity, say around the time of EA3 (the third in a series of four GWAS investigating educational attainment, published around 2015), always came with a promise that once samples got really, really big we would be explaining a whole lot of variance. That has never happened. (2) In fact, as we have gotten better and better at removing non-genetic variance from polygenic scores, the effect sizes have only gone downward. In the recent paper by Tan et al. (written, it should be noted, by a team of PGS enthusiasts), the effect sizes finally converged, on zero. Nada.
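For readers who want a concrete picture of what "adding up tiny effects" means, here is a minimal sketch; the genotype counts and per-variant weights are made up for illustration, not drawn from any real GWAS:

# Illustrative only: fake genotypes, fake GWAS weights
genotypes <- matrix(rbinom(1000 * 5000, 2, .3), nrow = 1000)  # 0/1/2 allele counts
weights <- rnorm(5000, sd = .01)                              # tiny per-variant effects
PGS <- as.vector(scale(genotypes %*% weights))                # weighted sum, standardized
hist(PGS)                                                     # roughly normal, per the CLT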
Now a new post, by David Hugh-Jones, makes the surprising argument that none of this actually matters: small effect sizes are fine!
The basic argument, which represents a complete misunderstanding of regression, prediction, and analysis of variance, is this: Why waste time worrying about variance explained when the unstandardized regression coefficient estimated by a model stays the same whether the R2 is .01 or .80? Hugh-Jones tells us:
In other words, the size of a variable’s effect is unrelated to the amount of variation it explains. This is not news to statisticians.
I think this might have been news to Karl Pearson, who developed the mathematical statistics of the concept of r2. It might have been news to RA Fisher, who used Pearson’s work to develop the analysis of variance. It might have been news to the generations of mid-century behavior geneticists who referred to R2 as the “coefficient of determination.” It might be news to more recent theorists like Jacob Cohen and Paul Meehl, who placed the concept of effect size at the center of their prescriptions for meaningful behavioral science.
A bit hard to know where to start here. One thing at a time:
None of this post has anything in particular to do with polygenic scores. It is a discourse about the correlation coefficient as it is used to quantify the effectiveness of a PGS, along with a million other relationships in the behavioral sciences. The idea, and I hesitate to even type out such a ridiculous assertion, is that the magnitudes of correlations don’t matter, because the underlying unstandardized regression coefficient remains the same even when the correlation is tiny. So, for example, if the correlation between shoe size and IQ score is .01, it doesn’t matter, because the model still estimates the same number of IQ points per unit of shoe size. Once again, this is not a correction of interpretive standards in behavioral GWAS; it is a revision of almost the entirety of ANOVA-based behavioral science, which is uniformly founded on the idea that “variance explained” is a meaningful and indeed necessary measure of strength of relationship.
The argument has nothing to do with their example of a difference between an R2 of 1 and an R2 of .25. In fact it works for everything. Let’s do the same thing for an R2 close to zero. ChatGPT estimates that the scatterplots in the figures were computed using a regression equation of EA = 17.5 + .75*PGS. So let’s execute the following code:
# Simulate data where the true slope is .75 but the residual SD swamps it
PGS <- rnorm(50, mean = 0, sd = 1)
Educ <- 17.5 + .75 * PGS
Educ <- Educ + rnorm(50, sd = 10)   # big error variance drives R2 toward zero
cor(PGS, Educ)
plot(PGS, Educ)
Model <- lm(Educ ~ PGS)
summary(Model)
abline(Model)
Here is the result:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.8948 1.3519 12.497 <2e-16 ***
PGS 0.9861 1.5205 0.649 0.52
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.514 on 48 degrees of freedom
Multiple R-squared: 0.008687, Adjusted R-squared: -0.01197
F-statistic: 0.4206 on 1 and 48 DF, p-value: 0.5197
[Yes, I know that the scale of EA is unreasonable. It’s hard to squeeze enough error variance into the simulated data to get the R2 down while keeping the range of EA realistic. I’m working on it using a truncated normal. But the point is that the slope of the regression line remains the same, regardless of the fact that the two variables are now essentially unrelated.]
So now the model accounts for less than .01 of the variance in EA, but don’t worry! The unstandardized regression coefficient is still the same. In fact it is even higher (.9861) than the .75 that was used to simulate the data. Is Hugh-Jones really committed to the idea that none of this matters, that “the size of a variable’s effect is unrelated to the amount of variation it explains”? Would that still be true at R2 = .000001?
What the post is really forgetting about is precision of estimation. Notice in the output above that even though the slope of the regression line is reasonably close to what it is supposed to be, it isn’t significant; it is substantially less than its standard error of 1.5. A 95% confidence interval on the slope of the line would be something like -2.1 < b < 4.0, which is to say we know essentially nothing about it.
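You don’t have to compute that interval by hand; base R’s confint() reports it for the model fit above:

confint(Model)   # 95% CI for the intercept and the PGS slope

Let’s contrast that with what happens if we construct the data with R2 around .25.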
# Same true model, but with much less noise, so R2 comes out around .25
PGS <- rnorm(50, mean = 0, sd = 1)
Educ <- 17.5 + .75 * PGS
Educ <- Educ + rnorm(50, sd = 1.5)   # small error variance this time
cor(PGS, Educ)
plot(PGS, Educ)
Model <- lm(Educ ~ PGS)
summary(Model)
abline(Model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.6212 0.2211 79.693 < 2e-16 ***
PGS 0.9555 0.2172 4.398 6.04e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.528 on 48 degrees of freedom
Multiple R-squared: 0.2873, Adjusted R-squared: 0.2724
F-statistic: 19.35 on 1 and 48 DF, p-value: 6.036e-05
The plot looks different because the Y axis can now be more reasonable, but the slope still hasn’t changed.
The slope is still more or less correctly estimated at .95 (there is a lot of sampling error simulating data with n = 50, but I am trying to reproduce the OP’s work). But now the same slope is highly significant, with a 95% CI of something like .52 < b < 1.39. What changed? The R2 is what changed. The strength of the relationship between PGS and EA is a measure of the precision we have in estimating the unstandardized regression slope. This is no mere statistical fine point. It is telling us that if we did our little PGS experiment over again, in the first instance there is almost no chance we would get a similar estimate of the slope. In the example with a higher R2 we are actually learning something about the slope of the regression line in the population. More plainly, in the second case the PGS is doing something useful, and in the first case it is not.
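You can see this directly by replaying the two simulations many times. This check is mine, not anything from the original post:

# How much do the slope estimates bounce around under each noise level?
slopes <- function(noise_sd, reps = 1000) {
  replicate(reps, {
    PGS <- rnorm(50)
    Educ <- 17.5 + .75 * PGS + rnorm(50, sd = noise_sd)
    coef(lm(Educ ~ PGS))[["PGS"]]
  })
}
sd(slopes(10))    # around 1.4: the estimates are all over the place
sd(slopes(1.5))   # around .2: the estimates cluster tightly near .75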
I have an idea! Since all we want to do is come up with an estimate, no matter how imprecise, of the unstandardized regression slope, why mess around with those enormous PGS evaluation samples? The expectation for the regression slope is exactly the same at n = 10 as it is at n = 50 or n = 500,000. Just as with R2, the only difference is in the precision of the estimate, and who cares about that?
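The same demonstration works for sample size; this sketch (mine, not the OP’s) shows the slope holding steady while only its standard error changes:

# The expected slope is identical at any n; only the standard error shrinks
for (n in c(10, 50, 500000)) {
  PGS <- rnorm(n)
  Educ <- 17.5 + .75 * PGS + rnorm(n, sd = 1.5)
  print(summary(lm(Educ ~ PGS))$coefficients["PGS", c("Estimate", "Std. Error")])
}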
The dichotomized analysis is equally pointless. The OP is correct that none of this depends on whether the Y variable is continuous or dichotomous. The usual effect size in the case of a dichotomous outcome is Cohen’s d, i.e., the standardized difference in PGS between the two “groups”:
\(d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}\)

Cohen’s d is just another analysis of variance, a reparameterization of R2:

\(R^2 = \frac{d^2}{d^2 + 4}\)
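Plugging in Cohen’s conventional benchmarks (my arithmetic, not the OP’s) shows how little variance even a “large” d explains:

d <- c(.2, .5, .8)   # Cohen's "small", "medium", and "large" effects
d^2 / (d^2 + 4)      # implied R2 (equal group sizes): about .01, .06, .14

OMIGOD Decile Plots. I have been railing against these things for a long while.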
Think about that histogram Hugh-Jones posts:
It’s really just another scatterplot. The slope of the tops of the bars is that same unstandardized regression coefficient they keep estimating, but unlike a real scatterplot they have eliminated all the actual points around the regression line. They have eliminated the error in their prediction model by simply refusing to plot it. And once again, as I said many times on the old site, this has nothing to do with polygenic scores; you can do the exact same thing with any tiny correlation (see the sketch below). It is not an argument for interpreting PGS; it is an argument for not caring about weak prediction.
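It takes about five lines to manufacture one from near-noise; the numbers here are arbitrary, any weak correlation will do:

# A correlation of about .1 (R2 about .01), plotted two ways
set.seed(1)
x <- rnorm(10000)
y <- .1 * x + rnorm(10000)
decile <- cut(x, quantile(x, 0:10 / 10), include.lowest = TRUE, labels = 1:10)
plot(x, y, col = "grey")          # the real picture: a formless cloud
barplot(tapply(y, decile, mean))  # the decile plot: a tidy ascending staircase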
Finally, the real bottom line in all this is precision of prediction. The point of PGS models is to predict something. This is going to come as a shock, but prediction models with high R2 do a better job of predicting than prediction models with low R2! The best place to see this is in the prediction interval, the confidence interval around the prediction of Y you make on the basis of X, which I have already explained in some detail in an earlier post.
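Here is one quick way to see it, refitting the two simulations from above (a sketch of mine, with PGS = 0 standing in for a new person to be predicted):

# 95% prediction intervals for a new observation, under each model
set.seed(2)
PGS <- rnorm(50)
low  <- data.frame(PGS, Educ = 17.5 + .75 * PGS + rnorm(50, sd = 10))
high <- data.frame(PGS, Educ = 17.5 + .75 * PGS + rnorm(50, sd = 1.5))
predict(lm(Educ ~ PGS, low),  data.frame(PGS = 0), interval = "prediction")  # about +/- 20 years
predict(lm(Educ ~ PGS, high), data.frame(PGS = 0), interval = "prediction")  # about +/- 3 years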
Next: Does confounding matter?
* h/t Sasha Gusev
Nice job, Eric. It’s great that you can take this on. I just wish there were discussion beyond the pedantic statistical arguments, about the fact that individual differences in behavioral traits are probably not significantly influenced by genetic variation. There are direct scientific, philosophical, and psychological implications that get lost in dealing with this statistical obfuscation.
Unless I'm missing something, this feels like an uncharacteristically bad post. I had expected better from you.
Let's take a different example: exam scores. If you are taking an exam, then your final score will depend partly on how good you are at the subject the exam is testing for, and partly on how much effort you put in at the exam.
The fraction of variance explained by effort depends partly on the effect size b of effort, but also on the residual variance v in ability, and on the variance e in effort. We could expect the R2 to be something like \(R^2 = \frac{b^2 e}{b^2 e + v}\). Clearly, if v is big then the R2 will be low.
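A quick simulation bears this out (illustrative numbers only):

# score = b*effort + ability noise; check R2 = b^2*e/(b^2*e + v)
b <- .75; e <- 1; v <- 100
effort <- rnorm(1e5, sd = sqrt(e))
score <- b * effort + rnorm(1e5, sd = sqrt(v))
summary(lm(score ~ effort))$r.squared   # close to .5625/100.5625 = .0056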
In one of your simulations in the post, you seem to have set the parameters such that it's not uncommon to receive 35 years of education.