<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">It's more efficient to ask questions
      like this on Stack Overflow, please:<br>
      <a
href="http://stackoverflow.com/questions/tagged/data.table?sort=active&pagesize=50">http://stackoverflow.com/questions/tagged/data.table</a><br>
      You can edit the question there, and people can add or remove
      quick comments.<br>
      <br>
      In v1.8.10 on CRAN you can pass a 'by' argument to unique() and
      duplicated() (thanks to Steve).  That would simplify the question
      and make it easier to answer.<br>
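      <br>
      As a rough sketch of what that could look like (untested against
      the real data; the toy values below are made up, the names DT and
      cols are only for illustration, and it assumes a data.table
      version whose duplicated() accepts 'by'):<br>
      <pre>library(data.table)

# Hypothetical stand-in for Final.Export, with only the six identifying columns.
Final.Export = data.frame(
  LakeID          = c(1, 1, 2, 2),
  Date            = c("2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"),
  LagosVariableID = c(10, 10, 10, 10),
  SampleDepth     = c(0, 0, 0, 5),
  SamplePosition  = c("A", "A", "A", "A"),
  Value           = c(3.2, 3.2, 4.1, 4.1)
)

# Columns that together define whether an observation is a replicate.
cols = c("LakeID", "Date", "LagosVariableID",
         "SampleDepth", "SamplePosition", "Value")

DT = as.data.table(Final.Export)

# Start with every row flagged as not duplicate (NA) ...
DT[, Dup := NA_real_]

# ... then set Dup to 1 on the second and later occurrences of each combination.
DT[duplicated(DT, by = cols), Dup := 1]
</pre>
      Because the flag is assigned in place with :=, no rows are split
      off and bound back together, so the other columns are left
      untouched. Note that the first occurrence of each combination
      keeps NA; only the repeats get 1.<br>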
      <br>
      Matt<br>
      <br>
      On 04/10/13 16:57, limno.sam wrote:<br>
    </div>
    <blockquote cite="mid:1380902276566-4677610.post@n4.nabble.com"
      type="cite">
      <pre wrap="">Hi,

I'm working with about 60 data sets which need to have duplicate
(non-unique) values removed. 

The data sets have 22 unique column names (the same for each data set):
 [1] "LakeID"                    "LakeName"                  "SourceVariableName"       
 [4] "SourceVariableDescription" "SourceFlags"               "LagosVariableID"          
 [7] "LagosVariableName"         "Value"                     "Units"                    
[10] "CensorCode"                "DetectionLimit"            "Date"                     
[13] "LabMethodName"             "LabMethodInfo"             "SampleType"               
[16] "SamplePosition"            "SampleDepth"               "MethodInfo"               
[19] "BasinType"                 "Subprogram"                "Comments"                 
[22] "Dup" 

I am interested in flagging observations that are duplicate (replicate)
values. I define an observation as NOT duplicate when its combination of
"LakeID", "LagosVariableID", "Value", "Date", "SamplePosition" and
"SampleDepth" is unique across rows.

Note that the "Dup" column is where I want to flag whether or not an
observation is duplicate (NA= not duplicate, 1= duplicate)

I have tried the following code, where Final.Export = the data set with the
22 columns listed above:

library(data.table)
#flag the unique (non-duplicate) values as NA
data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data1=data1[unique(data1[,key(data1),with=FALSE]),mult='first']
data1$Dup=NA
#flag the duplicate values as "1"
data2=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data2=data2[duplicated(data2[,key(data2),with=FALSE]),mult='first']
data2$Dup=1
#check to see if adds to total
(length(data1$Value))+((length(data2$Value)))
length(data2$Value)
length(Final.Export$Value) #adds up to total  
#bind the tables
Final.Export1=rbind(data1,data2,use.names=TRUE)    

The code works for flagging the duplicate observations; however, the values
for several of the variables in the original data frame "Final.Export" are
converted to NA in "Final.Export1".

Any ideas how to prevent that from happening?    
  



--
View this message in context: <a class="moz-txt-link-freetext" href="http://r.789695.n4.nabble.com/Flagging-duplicate-non-unique-values-based-on-specifications-tp4677610.html">http://r.789695.n4.nabble.com/Flagging-duplicate-non-unique-values-based-on-specifications-tp4677610.html</a>

</pre>
    </blockquote>
    <br>
  </body>
</html>