# Goal: "Dummy variables" in regression.

# Suppose you have this data:
people <- data.frame(
  age = c(21, 62, 54, 49, 52, 38),
  education = c("college", "school", "none", "school", "college", "none"),
  education.code = c(2, 1, 0, 1, 2, 0),
  stringsAsFactors = TRUE  # not the default in R >= 4.0.0; makes education a factor
)
# Here people$education is a string categorical variable and
# people$education.code is the same thing, with a numerical coding system.
people

# Note the structure of the dataset --
str(people)
# The strings supplied for `education` are stored as a factor (in
# R >= 4.0.0 this needs stringsAsFactors = TRUE in data.frame(); otherwise
# they stay character), but education.code is numeric and not a factor.


# We want to do a dummy variable regression. Normally you would have:
#   1 Chosen college as the omitted category
#   2 Made a dummy for "none" named educationnone
#   3 Made a dummy for "school" named educationschool
#   4 Run a regression like lm(age ~ educationnone + educationschool, people)
# But this is R. Things are cool:
lm(age ~ education, people)

# ! :-)
# When you feed R an explanatory variable like education, it does all
# these steps automatically. (It chose college as the omitted category,
# because college is the first level in alphabetical order.)
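
# For comparison, a minimal sketch of those steps done by hand. The names
# people.manual, none.dummy, school.dummy, m.manual and m.auto are made up
# for this illustration; the coefficients should match the automatic fit.
people.manual <- transform(people,
  none.dummy   = as.numeric(education == "none"),
  school.dummy = as.numeric(education == "school"))
m.manual <- lm(age ~ none.dummy + school.dummy, people.manual)
m.auto <- lm(age ~ education, people)
all.equal(unname(coef(m.manual)), unname(coef(m.auto)))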

# If you use an integer coding, then the obvious thing goes wrong --
lm(age ~ education.code, people)
# because R treats education.code as a numeric explanatory variable
# and fits a single slope for it. So you need to:

lm(age ~ factor(education.code), people)
# (it chose a different omitted category: code 0, i.e. "none")
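
# If you would rather pick the omitted category yourself, one option is
# relevel(). A sketch that uses "none" as the baseline (an arbitrary choice
# for illustration; m.relevel is a made-up name). factor() is applied first
# in case education is still a character vector:
m.relevel <- lm(age ~ relevel(factor(education), ref = "none"), people)
coef(m.relevel)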

# Alternatively, fix up the dataset --
people$education.code <- factor(people$education.code)
lm(age ~ education.code, people)
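
# To see what went wrong with the numeric coding: it fits one slope, which
# forces the step from 0 to 1 to equal the step from 1 to 2, while the
# factor version estimates one coefficient per non-omitted category.
# (code.num is a name made up for this sketch; the conversion works whether
# education.code is currently numeric or a factor.)
code.num <- as.numeric(as.character(people$education.code))
length(coef(lm(people$age ~ code.num)))          # intercept + one slope
length(coef(lm(people$age ~ factor(code.num))))  # intercept + two dummies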

#
# Bottom line:
# Once the dataset has categorical variables correctly represented as
# factors, as shown by
str(people)
# doing OLS in R automatically generates the dummy variables, leaving one
# category out:
lm(age ~ education, people)
lm(age ~ education.code, people)

# But what if you want the X matrix?
m <- lm(age ~ education, people)
model.matrix(m)
# This is the design matrix that went into the regression m.
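
# You can also build a design matrix straight from a formula, without
# fitting a model first. A minimal sketch (X and X.all are names made up
# here); dropping the intercept with "- 1" keeps a dummy for every category
# instead of leaving one out:
X <- model.matrix(~ education, people)
X.all <- model.matrix(~ education - 1, people)
colnames(X)
colnames(X.all)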