[datatable-help] Flagging duplicate (non-unique) values based on specifications

Fri Oct 4 17:57:56 CEST 2013

Hi,

I'm working with about 60 data sets which need to have duplicate
(non-unique) values removed. 

The data sets have 22 unique column names (the same for each data set):
 [1] "LakeID"                    "LakeName"                  "SourceVariableName"       
 [4] "SourceVariableDescription" "SourceFlags"               "LagosVariableID"          
 [7] "LagosVariableName"         "Value"                     "Units"                    
[10] "CensorCode"                "DetectionLimit"            "Date"                     
[13] "LabMethodName"             "LabMethodInfo"             "SampleType"               
[16] "SamplePosition"            "SampleDepth"               "MethodInfo"               
[19] "BasinType"                 "Subprogram"                "Comments"                 
[22] "Dup" 

I am interested in flagging observations that are duplicate (replicate)
values. I define an observation as NOT duplicate if its combination of
"LakeID", "LagosVariableID", "Value", "Date", "SamplePosition" and
"SampleDepth" is unique across rows.

Note that the "Dup" column is where I want to flag whether or not an
observation is a duplicate (NA = not duplicate, 1 = duplicate).
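
As an illustration of this rule, here is a minimal sketch on a made-up table. The toy values and the names `toy` and `key_cols` are assumptions for the example, not part of Final.Export. It marks the second and later occurrences of each combination with 1 and leaves all other rows as NA, without dropping or reordering any rows:

```r
library(data.table)

# Hypothetical stand-in for Final.Export, holding only the six defining columns
toy <- data.table(
  LakeID          = c(1, 1, 1, 2),
  LagosVariableID = c(10, 10, 10, 10),
  Value           = c(5.2, 5.2, 6.0, 5.2),
  Date            = as.Date(rep("2013-01-01", 4)),
  SamplePosition  = rep("epi", 4),
  SampleDepth     = c(1, 1, 1, 1)
)

# Columns whose combination defines a unique observation
key_cols <- c("LakeID", "LagosVariableID", "Value",
              "Date", "SamplePosition", "SampleDepth")

# duplicated() is TRUE for every row that repeats an earlier combination;
# Dup is added by reference, so the other columns are left untouched
toy[, Dup := ifelse(duplicated(toy, by = key_cols), 1, NA)]

print(toy)
```

In this toy table only the second row repeats an earlier combination, so it is the only row that gets Dup = 1.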

I have tried the following code, where Final.Export is the data set with the
22 columns listed above:

library(data.table)
#flag the unique (non-duplicate) values as NA
data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data1=data1[unique(data1[,key(data1),with=FALSE]),mult='first']
data1$Dup=NA
#flag the duplicate values as "1"
data2=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data2=data2[duplicated(data2[,key(data2),with=FALSE]),mult='first']
data2$Dup=1
#check to see if adds to total
(length(data1$Value))+((length(data2$Value)))
length(data2$Value)
length(Final.Export$Value) #adds up to total  
#bind the tables
Final.Export1=rbind(data1,data2,use.names=TRUE)    

The code works for flagging the duplicate observations. However, the values
for several of the variables in the original data frame "Final.Export" are
converted to NA in "Final.Export1".

Any ideas how to prevent that from happening?    
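
One way to narrow down which columns are affected is to compare the NA counts per column before and after the processing. The sketch below uses two small made-up tables, `before` and `after`, as hypothetical stand-ins for Final.Export and Final.Export1:

```r
library(data.table)

# Hypothetical stand-ins for Final.Export (before) and Final.Export1 (after)
before <- data.table(Value = c(1, 2, 3), Units = c("mg/L", "mg/L", "ug/L"))
after  <- data.table(Value = c(1, 2, 3), Units = c("mg/L", NA, NA))

# Count the NA entries in each column of both tables
na_before <- sapply(before, function(col) sum(is.na(col)))
na_after  <- sapply(after,  function(col) sum(is.na(col)))

# Positive numbers are columns that gained NA values during processing
print(na_after - na_before)
```

This comparison assumes both tables have the same columns in the same order.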

--
View this message in context: http://r.789695.n4.nabble.com/Flagging-duplicate-non-unique-values-based-on-specifications-tp4677610.html
Sent from the datatable-help mailing list archive at Nabble.com.