programming-examples/r/Dummy variables in regression.r

2019-11-18 14:03:28 +01:00
# Goal: "Dummy variables" in regression.
# Suppose you have this data:
people = data.frame(
  age = c(21, 62, 54, 49, 52, 38),
  education = c("college", "school", "none", "school", "college", "none"),
  education.code = c(2, 1, 0, 1, 2, 0),
  stringsAsFactors = TRUE  # the default was TRUE before R 4.0.0 and is FALSE since
)
# Here people$education is a string categorical variable and
# people$education.code is the same thing, with a numerical coding system.
people
# Note the structure of the dataset --
str(people)
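# With stringsAsFactors = TRUE (the default before R 4.0.0), the strings
# supplied for `education' are stored as a factor; from R 4.0.0 on they stay
# character unless you ask otherwise. Either way, education.code is stored
# as a plain numeric column and not as a factor.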
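# A quick way to check the storage type of every column at once, as a
# compact alternative to reading the str() output (what you see for
# `education' depends on the stringsAsFactors setting described above):
sapply(people, class)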
# We want to do a dummy variable regression. Normally you would have:
# 1 Chosen college as the omitted category
# 2 Made a dummy for "none" named educationnone
# 3 Made a dummy for "school" named educationschool
# 4 Run a regression like lm(age ~ educationnone + educationschool, people)
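# For comparison, here is a minimal sketch of those manual steps.
# `people.manual' is a name made up for this sketch; the dummy columns are
# named educationnone and educationschool as in the steps above.
people.manual <- transform(people,
  educationnone   = as.numeric(education == "none"),    # 1 if "none", else 0
  educationschool = as.numeric(education == "school")   # 1 if "school", else 0
)
m.manual <- lm(age ~ educationnone + educationschool, people.manual)
m.manual
# The coefficients should agree with those from lm(age ~ education, people).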
# But this is R. Things are cool:
lm(age ~ education, people)
# ! :-)
# When you feed R an explanatory variable like education, it does all
# these steps automatically. (It chose college as the omitted category
# because that is the first factor level in alphabetical order.)
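# If you would rather omit a different category, one option is relevel().
# A small sketch that makes "none" the baseline instead (`m.none' is a name
# chosen here; factor() is applied first so that this also works when
# education is stored as character):
m.none <- lm(age ~ relevel(factor(education), ref = "none"), people)
m.none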
# If you use an integer coding, then the obvious thing goes wrong --
lm(age ~ education.code, people)
# because R treats education.code as a numeric explanatory variable and
# fits a single slope for it. So you need to:
lm(age ~ factor(education.code), people)
# (This time the omitted category is 0, i.e. "none", the first level of
# the factor.)
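# To see the difference, compare the coefficients of the two fits. With the
# numeric coding there is one slope, which forces the step from 0 to 1 to
# equal the step from 1 to 2; with factor() each non-omitted level gets its
# own dummy. (`b.numeric' and `b.factor' are names chosen for this sketch.)
b.numeric <- coef(lm(age ~ education.code, people))
b.factor  <- coef(lm(age ~ factor(education.code), people))
length(b.numeric)   # intercept + one slope
length(b.factor)    # intercept + one coefficient per non-omitted level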
# Alternatively, fix up the dataset --
people$education.code <- factor(people$education.code)
lm(age ~ education.code, people)
#
# Bottom line:
# Once the dataset has its categorical variables correctly represented as
# factors, which you can confirm with
str(people)
# OLS in R generates the dummy variables automatically, leaving one category out:
lm(age ~ education, people)
lm(age ~ education.code, people)
# But what if you want the X matrix?
m <- lm(age ~ education, people)
model.matrix(m)
# This is the design matrix that went into the regression m.
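# The same kind of matrix can also be built straight from a formula, without
# fitting a model first. A small sketch (`X' is just a name chosen here):
X <- model.matrix(~ education, people)
colnames(X)   # the intercept column plus one dummy per non-omitted category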