[R] A question on Statistics regarding regression

Sat Aug 24 20:05:12 CEST 2024

Hi,

I asked this question elsewhere but did not get any response, so I am
hoping to get some insight from the experts and statisticians here.

Let's say we are fitting a regression equation in which one explanatory
variable is categorical with 2 categories. In the sample, however, one
category accounts for 95% of the observations and the other for just
5%. That is, the categories are highly unbalanced.

Typically, the standard error (SE) of the coefficient estimate may be
inflated for such a highly unbalanced categorical explanatory variable.
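
To make the inflation concrete, here is a sketch for the simplest case.
It assumes a simple linear regression with a single 0/1 dummy x and
homoskedastic errors. The symbols sigma^2 (error variance), n1 and n0
(observations in the rare and the common category) and p = n1/n are
introduced here only for illustration.

```latex
% Assumed model: y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
% with Var(\varepsilon_i) = \sigma^2 and x_i \in \{0, 1\}
\[
\operatorname{Var}(\hat\beta_1 \mid x)
  = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2}
  = \frac{\sigma^2}{n\,p\,(1-p)}
  = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_0}\right),
\qquad p = \frac{n_1}{n},\quad n = n_0 + n_1 .
\]
```

With p = 0.05, p(1-p) = 0.0475, against 0.25 for a balanced split. So
for the same n and sigma the SE is larger by a factor of
sqrt(0.25/0.0475), roughly 2.3. Under these assumptions the expression
involves only the split observed in the sample.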

Such an unbalanced case may arise from 2 scenarios: 1) there is a flaw
in the sample, or it is just by chance that the second category has only
5% of the values in the sample; or 2) in the population itself, the
second category occurs very rarely, and the sample reflects this.

My question is: how would the SE be impacted in the above 2 cases? Will
the impact be the same, i.e. would we get an incorrect estimate of the
SE in both cases? If yes, is there any way to prove this analytically,
or perhaps by simulation?
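
One possible way to set up such a simulation is sketched below, in
Python purely for illustration (the same could be written in R). It
assumes a simple regression on one 0/1 dummy with normal, homoskedastic
errors. All function names and parameter values (n = 200, beta = 1,
sigma = 1) are hypothetical choices, not anything taken from a real
data set. Scenario 1 is approximated by fixing the sample split at
exactly 5% whatever the population share is. Scenario 2 draws each
observation's category with probability 0.05. A sampling flaw that also
distorts the response is not modelled here.

```python
import math
import random
import statistics


def fit_binary_ols(x, y):
    """OLS of y on an intercept and a 0/1 dummy x.

    Returns (slope, model-based SE of the slope). With a single dummy
    regressor the slope equals the difference in group means.
    """
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    n1, n0 = len(y1), len(y0)
    m1, m0 = statistics.fmean(y1), statistics.fmean(y0)
    rss = sum((v - m1) ** 2 for v in y1) + sum((v - m0) ** 2 for v in y0)
    s2 = rss / (n1 + n0 - 2)  # residual variance estimate
    return m1 - m0, math.sqrt(s2 * (1 / n1 + 1 / n0))


def fixed_split(rng, n, share=0.05):
    """Sample split fixed at `share` (stand-in for scenario 1)."""
    k = round(n * share)
    return [1] * k + [0] * (n - k)


def random_split(share):
    """Category drawn per observation with probability `share`."""
    def draw(rng, n):
        return [1 if rng.random() < share else 0 for _ in range(n)]
    return draw


def simulate(draw_x, n=200, reps=1000, beta=1.0, sigma=1.0, seed=1):
    """Return (empirical SD of slope estimates, mean model-based SE)."""
    rng = random.Random(seed)
    slopes, ses = [], []
    for _ in range(reps):
        x = draw_x(rng, n)
        n1 = sum(x)
        if n1 == 0 or n1 == n:
            continue  # slope is not estimable with an empty category
        y = [2.0 + beta * xi + rng.gauss(0.0, sigma) for xi in x]
        b, se = fit_binary_ols(x, y)
        slopes.append(b)
        ses.append(se)
    return statistics.stdev(slopes), statistics.fmean(ses)


if __name__ == "__main__":
    scenarios = [
        ("fixed 5% split", fixed_split),
        ("population share 5%", random_split(0.05)),
        ("population share 50%", random_split(0.5)),
    ]
    for label, draw in scenarios:
        sd, se = simulate(draw)
        print(f"{label}: empirical SD = {sd:.3f}, mean SE = {se:.3f}")
```

If the model-based SE is trustworthy in a given scenario, the two
numbers printed for it should be close to each other. Comparing the
unbalanced rows with the balanced one shows how much precision the
imbalance costs.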

My apologies, as this question is not directly R-related. However, I
just wanted to get some insight on the above statistical problem from
some of the great statisticians on this forum.

Thanks for your time.