[datatable-help] Flagging duplicate (non-unique) values based on specifications

Matt Dowle mdowle at mdowle.plus.com
Fri Oct 4 18:29:09 CEST 2013


Please ask questions like this on Stack Overflow; it's more efficient:
http://stackoverflow.com/questions/tagged/data.table
You can edit the question there, and people can add or remove quick
comments.

In v1.8.10 on CRAN you can pass 'by' to unique() and duplicated() (thanks
to Steve).  This would simplify the question and make it easier to answer.
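
As a rough illustration of what passing 'by' to duplicated() could look like
for this question, here is a minimal sketch. It assumes data.table >= 1.8.10;
the toy Final.Export below and the name idcols are hypothetical stand-ins,
not the poster's real data. It flags rows in place, so nothing is split and
re-bound.

```r
library(data.table)

# Hypothetical toy stand-in for Final.Export: the six identifying
# columns from the question plus one extra column (Comments).
Final.Export <- data.frame(
  LakeID          = c(1, 1, 2, 2),
  Date            = as.Date(c("2013-01-01", "2013-01-01",
                              "2013-01-01", "2013-01-02")),
  LagosVariableID = c(10, 10, 10, 10),
  SampleDepth     = c(0, 0, 0, 0),
  SamplePosition  = c("SPECIFIED", "SPECIFIED", "SPECIFIED", "SPECIFIED"),
  Value           = c(5.2, 5.2, 5.2, 5.2),
  Comments        = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Columns that together define a unique observation (from the question).
idcols <- c("LakeID", "Date", "LagosVariableID",
            "SampleDepth", "SamplePosition", "Value")

DT <- as.data.table(Final.Export)

# Start with every row unflagged (NA), then set Dup to 1 on the second
# and later rows of each group of identical idcols values.
DT[, Dup := NA_real_]
DT[duplicated(DT, by = idcols), Dup := 1]

print(DT)
```

Only Dup is assigned here, so the other columns are left as they were. The
first occurrence in each group stays NA; only the repeats are flagged.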

Matt

On 04/10/13 16:57, limno.sam wrote:
> Hi,
>
> I'm working with about 60 data sets which need to have duplicate
> (non-unique) values removed.
>
> The data sets have 22 unique column names (the same for each data set):
>  [1] "LakeID"                    "LakeName"                  "SourceVariableName"
>  [4] "SourceVariableDescription" "SourceFlags"               "LagosVariableID"
>  [7] "LagosVariableName"         "Value"                     "Units"
> [10] "CensorCode"                "DetectionLimit"            "Date"
> [13] "LabMethodName"             "LabMethodInfo"             "SampleType"
> [16] "SamplePosition"            "SampleDepth"               "MethodInfo"
> [19] "BasinType"                 "Subprogram"                "Comments"
> [22] "Dup"
>
> I am interested in flagging observations that are duplicate (replicate)
> values. An observation is NOT a duplicate if its combination of "LakeID",
> "LagosVariableID", "Value", "Date", "SamplePosition" and "SampleDepth"
> is unique across rows.
>
> Note that the "Dup" column is where I want to flag whether or not an
> observation is a duplicate (NA = not duplicate, 1 = duplicate).
>
> I have tried the following code, where Final.Export is the data set with
> the 22 columns listed above:
>
> library(data.table)
> #flag the unique (non-duplicate) values as NA
> data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
> data1=data1[unique(data1[,key(data1),with=FALSE]),mult='first']
> data1$Dup=NA
> #flag the duplicate values as "1"
> data2=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
> data2=data2[duplicated(data2[,key(data2),with=FALSE]),mult='first']
> data2$Dup=1
> #check to see if adds to total
> (length(data1$Value))+((length(data2$Value)))
> length(data2$Value)
> length(Final.Export$Value) #adds up to total
> #bind the tables
> Final.Export1=rbind(data1,data2,use.names=TRUE)
>
> The code works for flagging the duplicate observations, however, the values
> for several of the variables in the original data frame "Final.Export" are
> converted to NA in "Final.Export1."
>
> Any ideas how to prevent that from happening?
