Missing Values (2024)


Missing data are represented in R by NA (not available) and have their own semantics. The result of an operation that involves NA is usually NA as well. Many functions and procedures have a flag (call parameter) for handling NA, na.rm, which typically defaults to FALSE and has to be set to TRUE explicitly. With na.rm = TRUE, observations with NA are excluded from the respective calculation. This matches the standard behavior of many statistics packages, but it can lead to different samples being used in different calculations.
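
A minimal sketch of this behavior, using a made-up vector `x` that is not part of the data used elsewhere on this page: NA propagates through arithmetic and summary functions unless na.rm = TRUE is passed.

```r
# hypothetical example vector with one missing value
x <- c(1, NA, 3)

# NA propagates through arithmetic and through summary functions
x_plus_one  <- x + 1          # 2 NA 4
res_default <- mean(x)        # NA, because na.rm defaults to FALSE

# na.rm = TRUE drops the missing value before computing
res_narm <- mean(x, na.rm = TRUE)

# the number of values actually used differs from length(x)
n_used <- sum(!is.na(x))
```

Comparing `n_used` with `length(x)` is a quick way to see how many observations a calculation with na.rm = TRUE really rests on.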

For logical operations there is the function is.na() and its negation !is.na() (a logical test for non-missing).

# get a working base idea
ddf <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
is.na(ddf)

##          x     y
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE  TRUE

!is.na(ddf)

##         x     y
## [1,] TRUE  TRUE
## [2,] TRUE  TRUE
## [3,] TRUE FALSE

Often the number of missing values matters. Counting the TRUE values returned by the check for missing values helps here. Alternatively, we can use which().

# create sample dataframe
ddf <- data.frame(
  subj   = c(  1,   2,   3,   4,   5,   6,   7,   8,   9),
  uni    = c(  1,   1,   1,   2,   2,   2,   3,   3,   3),
  grade1 = c(1.0,  NA, 3.7, 1.3,  NA, 1.0, 3.3, 4.0,  NA),
  grade2 = c(4.0, 3.0, 1.3, 1.3, 1.0, 1.3, 2.7, 4.0, 3.3),
  grade3 = c(1.3,  NA, 2.7, 1.0, 1.3, 1.3, 2.3, 3.7, 3.0)
)
# we get a TRUE/FALSE matrix
is.na(ddf)

##       subj   uni grade1 grade2 grade3
## [1,] FALSE FALSE  FALSE  FALSE  FALSE
## [2,] FALSE FALSE   TRUE  FALSE   TRUE
## [3,] FALSE FALSE  FALSE  FALSE  FALSE
## [4,] FALSE FALSE  FALSE  FALSE  FALSE
## [5,] FALSE FALSE   TRUE  FALSE  FALSE
## [6,] FALSE FALSE  FALSE  FALSE  FALSE
## [7,] FALSE FALSE  FALSE  FALSE  FALSE
## [8,] FALSE FALSE  FALSE  FALSE  FALSE
## [9,] FALSE FALSE   TRUE  FALSE  FALSE

length(is.na(ddf))

## [1] 45

length(is.na(ddf)[is.na(ddf) == T])

## [1] 4

# or using which()
which(is.na(ddf))

## [1] 20 23 27 38

# and count it
length(which(is.na(ddf)))

## [1] 4

# how about missings in a column
length(which(is.na(ddf$grade1)))

## [1] 3

# or in a single row
length(which(is.na(ddf[2,])))

## [1] 2
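
The counts above can also be obtained more directly, because TRUE counts as 1 in arithmetic. The following sketch rebuilds the same sample values and uses only base R functions (sum(), colSums(), rowSums()); it is meant as an alternative, not as a replacement for the which() approach.

```r
# same sample values as in the data frame above
ddf <- data.frame(
  subj   = 1:9,
  uni    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  grade1 = c(1.0, NA, 3.7, 1.3, NA, 1.0, 3.3, 4.0, NA),
  grade2 = c(4.0, 3.0, 1.3, 1.3, 1.0, 1.3, 2.7, 4.0, 3.3),
  grade3 = c(1.3, NA, 2.7, 1.0, 1.3, 1.3, 2.3, 3.7, 3.0)
)

# TRUE counts as 1, so sum() counts the missing values directly
n_total <- sum(is.na(ddf))

# missing values per column and per row
n_by_col <- colSums(is.na(ddf))
n_by_row <- rowSums(is.na(ddf))
```

`n_by_col` answers "how many missings per variable", `n_by_row` answers "how many missings per observation".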

Which observations in a data frame have missing values?

# take care: variables in common have to be equal in both dataframes for all subjects
# example: gender of subj 2 is different in the two dataframes
# data of v1: three subjects, 3 vars
ddf.v1 <- data.frame(
  subj   = c(  1,   2,   3),
  gender = c('w', 'm', 'w'),
  age    = c( 20,  22,  27)
)
# data of v2, more subjects, two more vars, subj 2 has a typo in variable gender
ddf.v2 <- data.frame(
  subj   = c(  1,   2,   3,   4,   5),
  weight = c( 67,  85,  78,  66,  72),
  gender = c('w', 'x', 'w', 'w', 'm'),
  height = c(172, 185, 180, 165, 177)
)
# take care: gender of ddf.v2 replaces gender of ddf.v1
# variables, the dataframes have in common, have to appear in parameter `by`
# there still are missing data, so the problem of what to do with incomplete data persists
# solution 1: get only complete subjects after merge
merge(ddf.v1, ddf.v2, by=c("subj", "gender"))

##   subj gender age weight height
## 1    1      w  20     67    172
## 2    3      w  27     78    180

# solution 2: keep all rows and fill in NA where data are missing by using flag `all`
merge(ddf.v1, ddf.v2, by=c("subj", "gender"), all=T)

##   subj gender age weight height
## 1    1      w  20     67    172
## 2    2      m  22     NA     NA
## 3    2      x  NA     85    185
## 4    3      w  27     78    180
## 5    4      w  NA     66    165
## 6    5      m  NA     72    177

ddf <- merge(ddf.v1, ddf.v2, by=c("subj", "gender"), all=T)
# where are missings in ddf? how many?
apply(ddf, 1, function(x) length(which(is.na(x))))

## [1] 0 2 1 0 1 1

Exclude all observations with missing values from a data frame with cleaned <- na.omit(Data-Frame).

# get a working base idea
ddf <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
ddf

##   x  y
## 1 1  0
## 2 2 10
## 3 3 NA

# na.omit() deletes!
na.omit(ddf)

##   x  y
## 1 1  0
## 2 2 10

# search for the first missing value
ddf = read.csv("http://www.bodowinter.com/tutorial/politeness_data.csv")
# first occurrence of a missing in dataframe
which(is.na(ddf))

## [1] 375

# the index counts cells column by column, so taking it modulo the number of rows gives the row
which(is.na(ddf)) %% length(ddf[,1])

## [1] 39

# exclude all observations with missing data
ddf.cleaned <- na.omit(ddf) # remove the cases with missing values
# n of excluded:
cat(nrow(ddf) - nrow(ddf.cleaned), ' observations excluded')

## 1 observations excluded
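
The modulo step used above to turn a cell index into a row number can be avoided by asking which() for row and column positions directly. A small sketch with a made-up data frame `toy` (hypothetical, so that no download is needed):

```r
# made-up data frame with one missing value in row 3, column 2
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, NA, 40), d = c(5, 6, 7, 8))

# linear index: cells are counted column by column
idx <- which(is.na(toy))

# row and column position in one step
pos <- which(is.na(toy), arr.ind = TRUE)

# converting the linear index by hand; subtracting 1 first
# avoids getting 0 when the missing value sits in the last row
row_by_hand <- (idx - 1) %% nrow(toy) + 1
col_by_hand <- (idx - 1) %/% nrow(toy) + 1
```

`pos` is a matrix with the columns `row` and `col`, one line per missing value, which scales better than the modulo trick when there are several missings.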

Set all negative values of a data frame to NA.

ddf <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
ddf <- data.frame(lapply(ddf, function(x) { x[x < 0] <- NA; x }))
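
The example data above contain no negative values, so the effect is not visible there. The following sketch uses a made-up, all-numeric data frame `toy` (hypothetical) that does contain negatives, and shows logical-matrix indexing as a shorter alternative to lapply().

```r
# made-up data frame that actually contains negative values
toy <- data.frame(x = c(1, -2, 3), y = c(0, 10, -99))

# toy < 0 is a logical matrix; indexing with it replaces every negative cell
toy[toy < 0] <- NA

# count the values that were set to missing
n_missing <- sum(is.na(toy))
```

This pattern is handy for recoding numeric missing-value codes such as -99, assuming all columns involved are numeric.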

Which observations in a data frame have missing values?

ddf = read.csv("http://www.bodowinter.com/tutorial/politeness_data.csv")
## todo
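
One possible way to answer this question, sketched with a made-up data frame `toy` (hypothetical, not the politeness data): complete.cases() from base R flags rows without any NA, so its negation selects the incomplete observations.

```r
# made-up data frame; rows 2 and 4 contain missing values
toy <- data.frame(
  subj  = 1:5,
  score = c(3.1, NA, 2.7, 4.0, 1.9),
  group = c("a", "b", "a", NA, "b")
)

# complete.cases() is TRUE for rows without any NA
incomplete <- !complete.cases(toy)

# row numbers of the incomplete observations and the rows themselves
rows_with_na  <- which(incomplete)
incomplete_df <- toy[incomplete, ]
```

The same two lines should work on any data frame, including one read with read.csv().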

A few examples of using Hmisc::impute()

library(Hmisc)

## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

DF <- data.frame(age = c(10, 20, NA, 40), sex = c('male','female'))
# impute with mean value
DF$imputed_age <- with(DF, impute(age, mean))
# impute with random value
DF$imputed_age2 <- with(DF, impute(age, 'random'))
# impute with the median
with(DF, impute(age, median))

##   1   2   3   4 
##  10  20 20*  40

# impute with the minimum
with(DF, impute(age, min))

##   1   2   3   4 
##  10  20 10*  40

# impute with the maximum
with(DF, impute(age, max))

##   1   2   3   4 
##  10  20 40*  40
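
For comparison, single-value imputation can also be sketched in base R without Hmisc. This is only an illustration of the idea behind impute(age, mean) and impute(age, median), not a reimplementation of Hmisc::impute(); in particular, the asterisk marking of imputed values is replaced here by a separate logical vector.

```r
# same toy ages as above
age <- c(10, 20, NA, 40)

# remember which values are missing before filling them in
was_imputed <- is.na(age)

# replace NA by the mean of the observed values
age_mean <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)

# same idea with the median
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)
```

Keeping `was_imputed` alongside the filled-in variable makes it possible to check later how much a result depends on the imputed values.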

aregImpute()

require(Hmisc)
# read data
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/data/virt/v_bmi.txt")
# attach ddf for more comfort
attach(ddf)
# copy column c_phys_app
n_c_phys_app <- c_phys_app
# choose values to set to NA
to_erase <- sample(length(c_phys_app), 5)
# show observations to erase
to_erase

## [1] 28  2 12 26 18

# set selected values to missing
n_c_phys_app[to_erase] <- NA
# show new var with erased data
n_c_phys_app

##  [1]  2 NA  5  4  3  3  4  4  4  5  2 NA  3  2  1  3  2 NA  3  4  3  5  1
## [24]  2  1 NA  3 NA  5  2

# compare with original
cbind(c_phys_app, n_c_phys_app)

##       c_phys_app n_c_phys_app
##  [1,]          2            2
##  [2,]          3           NA
##  [3,]          5            5
##  [4,]          4            4
##  [5,]          3            3
##  [6,]          3            3
##  [7,]          4            4
##  [8,]          4            4
##  [9,]          4            4
## [10,]          5            5
## [11,]          2            2
## [12,]          4           NA
## [13,]          3            3
## [14,]          2            2
## [15,]          1            1
## [16,]          3            3
## [17,]          2            2
## [18,]          4           NA
## [19,]          3            3
## [20,]          4            4
## [21,]          3            3
## [22,]          5            5
## [23,]          1            1
## [24,]          2            2
## [25,]          1            1
## [26,]          4           NA
## [27,]          3            3
## [28,]          3           NA
## [29,]          5            5
## [30,]          2            2

# n_c_phys_app was created in the workspace, not inside ddf, so this returns NULL
ddf$n_c_phys_app

## NULL

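# note: the formula uses the original c_phys_app (no NA), not n_c_phys_app, so there is nothing to impute here
imputed_n_c_phys_app <- aregImpute(~ gender + height + weight + grade + c_phys_app + c_good_way + c_dress + c_bad_way + c_figure + filling_time, data=ddf, n.impute=10)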

## Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5 Iteration 6 Iteration 7 Iteration 8 Iteration 9 Iteration 10

imputed_n_c_phys_app

## 
## Multiple Imputation using Bootstrap and PMM
## 
## aregImpute(formula = ~gender + height + weight + grade + c_phys_app + 
##     c_good_way + c_dress + c_bad_way + c_figure + filling_time, 
##     data = ddf, n.impute = 10)
## 
## n: 30    p: 10   Imputations: 10     nk: 3 
## 
## Number of NAs:
##       gender       height       weight        grade   c_phys_app 
##            0            0            0            0            0 
##   c_good_way      c_dress    c_bad_way     c_figure filling_time 
##            0            0            0            0            0 
## 
##              type d.f.
## gender          l   NA
## height          s   NA
## weight          s   NA
## grade           s   NA
## c_phys_app      s   NA
## c_good_way      s   NA
## c_dress         s   NA
## c_bad_way       s   NA
## c_figure        s   NA
## filling_time    s   NA
## 
## Transformation of Target Variables Forced to be Linear
## 
## R-squares for Predicting Non-Missing Values for Each Variable
##  Using Last Imputations of Predictors
## named numeric(0)

ddf$n_c_phys_app

## NULL
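
The NULL results above arise because the modified copy was created as a free-standing vector after attach(), so it never became part of ddf, and the imputation call saw no missing values. A minimal base R sketch with made-up data (the column names `score` and `n_score` and the seed are hypothetical) of keeping the modified column inside the data frame, so that a later call with data=ddf could see the NAs:

```r
# made-up stand-in for the real data set
ddf <- data.frame(id = 1:10, score = c(2, 3, 5, 4, 3, 3, 4, 4, 4, 5))

# copy the column inside the data frame instead of into the workspace
ddf$n_score <- ddf$score

# choose values to set to NA (seed fixed only to make the sketch repeatable)
set.seed(42)
to_erase <- sample(nrow(ddf), 3)
ddf$n_score[to_erase] <- NA

# the new column is part of ddf and carries the missing values
n_missing <- sum(is.na(ddf$n_score))
col_exists <- !is.null(ddf$n_score)
```

With the column stored in ddf, an imputation formula would have to name the new column (here `n_score`) rather than the untouched original for any values to be imputed.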

From Baron & Li [http://www.psych.upenn.edu/~baron/rpsych/rpsych.html]

7.15 Imputation of missing data

Schafer and Graham (2002) provide a good review of methods for dealing with missing data. R provides many of the methods that they discuss. One method is multiple imputation, which is found in the Hmisc package. Each missing datum is inferred repeatedly from different samples of other variables, and the repeated inferences are used to determine the error. It turns out that this method works best with the ols() function from the Design package rather than with (the otherwise equivalent) lm() function. Here is an example, using the data set t1.

# todo:
#library(Hmisc)
#f <- aregImpute(~v1+v2+v3+v4, n.impute=20,
#                fweighted=.2, tlinear=T, data=t1)
#library(Design)
#fmp <- fit.mult.impute(v1~v2+v3, ols, f, data=t1)
#summary(fmp)

The first command (f) imputes missing values in all four variables, using the other three for each variable. The second command (fmp) estimates a regression model in which v1 is predicted from two of the remaining variables. A variable can be used (and should be used, if it is useful) for imputation even when it is not used in the model.