refineFactors.Rd
Refine two data frames x1
and x2
to be usable as a pair of
a train/test set pair in a modeling or classification task, such
that a model/classifier (e.g. a binomial GLM) can be trained on the
train set and can be then directly applied to the test set.
Note that had the sets not been refined this way, a model trained on
a train set could not have been directly applied to the test set
because a new factor level (not appearing in the train set) would
have appeared in it, giving no clue how to predict response using
such an unknown factor level.
refineFactors(x1, x2, unify = TRUE, dropSingular = TRUE, naLimit = Inf, k = 5, naLevelName = "(NA)", verbose = FALSE, debug = FALSE)
x1 | first data frame |
---|---|
x2 | second data frame |
unify | shall factors be unified? |
dropSingular | shall factors having only a single level be removed? |
naLimit | numeric columns containing more than 'naLimit' NA's
(in |
k | the number of intervals into which numeric columns having at least 'naLimit' NA's (in 'x1' and in 'x2') will be converted |
naLevelName | the name of the special factor level used to represent missing values |
verbose | report progress? |
debug | if TRUE, debugs will be printed. If numeric of value greater than 1, verbose debugs will be produced. |
Usually, the refinement consists of i) making the levels of factors in individual columns in each of the sets identical, and ii) removing columns containing factors of only a single level. This first task is achieved by removing rows in which appears a factor of level not appearing in the twin data frame, and dropping unused levels from factors.
A list of refined x1
and x2
.
# unify factor levels and remove constant factors: x<-data.frame(x = 1:6, y = c('a','b','c','b','c','d'), z = c('d','c','c','c','c','d')) x.train <- x[1:3, ] x.test <- x[4:6, ] print(x.train)#> x y z #> 1 1 a d #> 2 2 b c #> 3 3 c cprint(x.test)#> x y z #> 4 4 b c #> 5 5 c c #> 6 6 d drefineFactors(x.train, x.test)#> $x1 #> x y #> 2 2 b #> 3 3 c #> #> $x2 #> x y #> 4 4 b #> 5 5 c #># Note: 'x[1,]' and 'x2[3,]' dropped because it had no counterpart # in the twin data frame. # Note: 'x$z' dropped because after removal of 'x[1,]' and 'x2[3,]', # there was only a single factor level left, which was dropped, by default. # unify factor levels but keep constant factors: refineFactors(x.train, x.test, dropSingular = FALSE)#> $x1 #> x y z #> 2 2 b c #> 3 3 c c #> #> $x2 #> x y z #> 4 4 b c #> 5 5 c c #># Note: now 'x$z' is left # convert numeric columns with many NA's into a factor x<-data.frame(x = 1:10, y = c(NA,NA,1,2,3,NaN,NA,1,2,3), z = c(1,2,NA,4,5,1,2,NA,4,5)) x1 <- x[1:5, ] x2 <- x[6:10, ] print(x1)#> x y z #> 1 1 NA 1 #> 2 2 NA 2 #> 3 3 1 NA #> 4 4 2 4 #> 5 5 3 5print(x2)#> x y z #> 6 6 NaN 1 #> 7 7 NA 2 #> 8 8 1 NA #> 9 9 2 4 #> 10 10 3 5refineFactors(x1, x2, naLimit=2)#> $x1 #> x y z #> 1 1 (NA) 1 #> 2 2 (NA) 2 #> 4 4 (1.8,2.2] 4 #> 5 5 (2.6,3] 5 #> #> $x2 #> x y z #> 6 6 (NA) 1 #> 7 7 (NA) 2 #> 9 9 (1.8,2.2] 4 #> 10 10 (2.6,3] 5 #># add a special 'NA' level to factors with many NA's x<-data.frame(x = 1:10, y = factor(c(NA,NA,1,2,3,NA,NA,1,2,3)), z = factor(c(1,2,NA,4,5,1,2,NA,4,5))) x1 <- x[1:5, ] x2 <- x[6:10, ] print(x1)#> x y z #> 1 1 <NA> 1 #> 2 2 <NA> 2 #> 3 3 1 <NA> #> 4 4 2 4 #> 5 5 3 5print(x2)#> x y z #> 6 6 <NA> 1 #> 7 7 <NA> 2 #> 8 8 1 <NA> #> 9 9 2 4 #> 10 10 3 5refineFactors(x1, x2, naLimit=2)#> $x1 #> x y z #> 1 1 (NA) 1 #> 2 2 (NA) 2 #> 4 4 2 4 #> 5 5 3 5 #> #> $x2 #> x y z #> 6 6 (NA) 1 #> 7 7 (NA) 2 #> 9 9 2 4 #> 10 10 3 5 #># NaN in factors differs from NA - this would behave differently: # x<-data.frame(a = 1:10, b = factor(c(NA,NA,1,2,3,NA,NaN,1,2,3)), c=factor(c(1,2,NA,4,5,1,2,NA,4,5)))