<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">It's more efficient to ask questions

      like this on Stack Overflow, please:<br>


      <a href="http://stackoverflow.com/questions/tagged/data.table?sort=active&amp;pagesize=50">http://stackoverflow.com/questions/tagged/data.table</a><br>

      You can edit the question there, and people can add or remove

      quick comments.<br>

      <br>

      In v1.8.10 on CRAN you can pass 'by' to unique and duplicated

      (thanks to Steve).  This would simplify the question and make it

      easier to answer.<br>
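      <br>
      As a minimal sketch of that approach (not the poster's code): the toy table below is a hypothetical stand-in for Final.Export with only a few of its columns, and it assumes a recent data.table in which duplicated() accepts 'by'. It flags every repeat of a key combination in place, so no rows are split off and bound back together:<br>
      <pre>
```r
library(data.table)

# Hypothetical toy data standing in for Final.Export (made-up values,
# and only a handful of the 22 columns from the question).
Final.Export = data.frame(
  LakeID          = c(1, 1, 2, 2, 2),
  Date            = c("2013-01-01", "2013-01-01", "2013-01-01",
                      "2013-01-02", "2013-01-01"),
  LagosVariableID = c(10, 10, 10, 10, 10),
  SampleDepth     = c(0, 0, 0, 0, 0),
  SamplePosition  = c("epi", "epi", "epi", "epi", "epi"),
  Value           = c(5.1, 5.1, 3.2, 3.2, 3.2),
  Comments        = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Columns that define a duplicate, as listed in the question.
cols = c("LakeID", "Date", "LagosVariableID",
         "SampleDepth", "SamplePosition", "Value")

DT = as.data.table(Final.Export)

# duplicated() is TRUE for every row after the first one that has the
# same values in 'cols'; those rows get Dup = 1, all others get NA.
# Rows 2 and 5 repeat rows 1 and 3 here.
DT[, Dup := ifelse(duplicated(DT, by = cols), 1, NA)]

print(DT)
```
      </pre>
      Since the table is never subset and re-bound with rbind, the remaining columns are left exactly as they were.<br>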

      <br>

      Matt<br>

      <br>

      On 04/10/13 16:57, limno.sam wrote:<br>

    </div>

    <blockquote cite="mid:1380902276566-4677610.post@n4.nabble.com"

      type="cite">

      <pre wrap="">Hi,

I'm working with about 60 data sets which need to have duplicate

(non-unique) values removed. 

The data sets have 22 unique column names (the same for each data set):

 [1] "LakeID"                    "LakeName"                  "SourceVariableName"

 [4] "SourceVariableDescription" "SourceFlags"               "LagosVariableID"

 [7] "LagosVariableName"         "Value"                     "Units"

[10] "CensorCode"                "DetectionLimit"            "Date"

[13] "LabMethodName"             "LabMethodInfo"             "SampleType"

[16] "SamplePosition"            "SampleDepth"               "MethodInfo"

[19] "BasinType"                 "Subprogram"                "Comments"

[22] "Dup"

I am interested in flagging observations that are duplicate (replicate)

values. I define an observation as NOT duplicate when its combination of

"LakeID", "LagosVariableID", "Value", "Date", "SamplePosition" and "SampleDepth"

is unique across rows.

Note that the "Dup" column is where I want to flag whether or not an

observation is duplicate (NA= not duplicate, 1= duplicate)

I have tried the following code, where Final.Export= the data set with the 22

columns listed above:

library(data.table)

#flag the unique (non-duplicate) values as NA

data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))

data1=data1[unique(data1[,key(data1),with=FALSE]),mult='first']

data1$Dup=NA

#flag the duplicate values as "1"

data2=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))

data2=data2[duplicated(data2[,key(data2),with=FALSE]),mult='first']

data2$Dup=1

#check to see if adds to total

(length(data1$Value))+((length(data2$Value)))

length(data2$Value)

length(Final.Export$Value) #adds up to total  

#bind the tables

Final.Export1=rbind(data1,data2,use.names=TRUE)    

The code works for flagging the duplicate observations. However, the values

for several of the variables in the original data frame "Final.Export" are

converted to NA in "Final.Export1".

Any ideas how to prevent that from happening?    

--

View this message in context: <a class="moz-txt-link-freetext" href="http://r.789695.n4.nabble.com/Flagging-duplicate-non-unique-values-based-on-specifications-tp4677610.html">http://r.789695.n4.nabble.com/Flagging-duplicate-non-unique-values-based-on-specifications-tp4677610.html</a>

Sent from the datatable-help mailing list archive at Nabble.com.


</pre>

    </blockquote>

    <br>

  </body>

</html>