[datatable-help] Reading corrupt csv and replace wrong value

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Jun 16 23:55:28 CEST 2011


Hi,

On Thu, Jun 16, 2011 at 5:40 PM, DanMik <dan at dd-software.dk> wrote:
> Im fairly new to R.
>
> I have a huge csv file, of 400.000+ K, and now it looks like one of the
> values is corrupt. (it contains a ?, so one value becomes:
> "0,0742076391?39524")
> Because of the size i can't edit it in a text editor, and the file took
> several days to create (many calculations)
>
> When i read the file it cant be converted to numbers because of this one
> value which i found with scan() and have found the coordinates of.
>
> I'm reading the file with:
>
> x <- read.csv2("filename.csv", stringsAsFactor= FALSE)
>
> Can i read the file with everything as numeric, and replace non numeric
> values with 0 ?
>
> or somehow correct this one value?
>
> I have tried first reading the file, then set the value to 0 and then use
> as.matrix and afterwards as.numeric. This just creates a lot of NA

I think you've got the right approach here.

Maybe the as.numeric is creating a lot of NAs because you have "," as
your decimal separator? For example:

R> as.numeric("0,000")
[1] NA
Warning message:
NAs introduced by coercion

I'm not sure whether R's locale settings handle this correctly.

Maybe it's worth trying to change "," to "." after you read the file in
and before you try to convert it to a numeric matrix. You can convert
all "," to "." with something like this:

R> converted <- gsub(",", ".", your.data, fixed=TRUE)

Or something like that (depending on how `your.data` is stored ... is
it just a character vector?).
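
To make the idea concrete, here is a minimal sketch. It assumes the
affected column came in as a plain character vector; `bad.column` and
`values` are made-up names for illustration, and the three entries are
toy data (only the first is the corrupt value from your mail):

```r
# Hypothetical column that was read as text because one entry is corrupt
bad.column <- c("0,0742076391?39524", "1,5", "0,000")

# Swap the decimal comma for a dot so as.numeric() can parse the values
converted <- gsub(",", ".", bad.column, fixed=TRUE)

# The corrupt entry still cannot be parsed and becomes NA; the warning
# is expected here, so it is suppressed
values <- suppressWarnings(as.numeric(converted))

# Replace whatever failed to parse with 0, as asked in the question
values[is.na(values)] <- 0
```

If the data sits in a data.frame rather than a vector, the same steps
would have to be applied to each character column in turn.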

Alternatively, you can try to edit the file using an editor like vi (I
bet it can handle it) -- the problem is you have to know how to use vi
a bit, but if you know which line your problem is in, you can quickly
hop to it and fix it.
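
If you would rather stay inside R, the same line-based fix can be
sketched roughly like this. It assumes the whole file fits in memory as
text, and it uses a small temporary stand-in file. `bad.line` is a
hypothetical line number, and simply deleting the "?" is only a guess at
what the value should have been:

```r
# Small stand-in for the real file (which would be "filename.csv")
path <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1,5;2,5", "0,0742076391?39524;3,0"), path)

# Read the raw text, patch the one known-bad line, and write it back
lines <- readLines(path)
bad.line <- 3  # hypothetical: the line where the corrupt value was found
lines[bad.line] <- sub("?", "", lines[bad.line], fixed=TRUE)
writeLines(lines, path)

# With the stray character gone, read.csv2 parses both columns as numbers
x <- read.csv2(path, stringsAsFactors=FALSE)
```

For the real file you would probably want to write to a new file name
rather than overwrite something that took days to create.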

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

