[datatable-help] Losing header names when using skip argument in fread in R
FCH808
fch808 at gmail.com
Fri May 9 23:34:56 CEST 2014
R package: data.table, version 1.9.2
I have a ";"-delimited text file that I need to subset based on the dates
that appear in the first column. I used fread() to read only the first column
and find the indices of the rows with the dates I need. The min() of those
indices gives the number of lines to skip, and their length() gives the
number of rows to read. (In this case I only need 2 consecutive days: 2880
rows/readings.)
The problem is that header = TRUE only seems to capture the row of data
immediately preceding the rows returned and use it as the header, instead
of the actual headers in the first line of the text file.
I wrapped it in a function and timed it, and it seems to be a reasonably
quick way to have a minimal impact on RAM usage for the filtering needed.
This file is only about 2 million rows so it wouldn't be a problem just
reading the whole thing in and subsetting but I would like a solution that
works as my text files get larger.
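# Pass 1: read only the first column to locate the dates of interest
findRows <- fread("power.txt", header = TRUE, select = 1)
all <- which(findRows$Date %in% c("14/2/2008", "15/2/2008"))
skipLines <- min(all)    # index of the first matching data row
keepRows <- length(all)  # number of matching rows
# Pass 2: skip to the matching rows and read only those
feb <- fread("power.txt", skip = skipLines, nrows = keepRows,
             header = TRUE)
rm(findRows)
head(feb)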
14/2/2008 00:00:00 0.252 0.000 244.230 1.000 0.000 0.000 0.000
1: 14/2/2008 00:01:00 0.254 0 245.24 1 0 0 0
2: 14/2/2008 00:01:00 0.254 0 245.24 1 0 0 0
3: 14/2/2008 00:02:00 0.254 0 245.31 1 0 0 0
4: 14/2/2008 00:03:00 0.252 0 244.44 1 0 0 0
5: 14/2/2008 00:04:00 0.252 0 244.27 1 0 0 0
6: 14/2/2008 00:05:00 0.252 0 244.62 1 0 0 0
> system.time(loadF())
user system elapsed
0.55 0.01 0.56
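A minimal sketch of why the header goes missing, assuming a made-up five-line stand-in for power.txt (the file contents, and the 13/2/2008 row in particular, are invented for illustration; only the fread() arguments come from the code above). Because line 1 of the file is the header, data row i sits on file line i + 1, so skipping min(all) lines leaves the first matching reading as the first line fread() sees:

```r
library(data.table)

# Hypothetical miniature of power.txt: one header line, then four readings.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Date;Time;Voltage",
  "13/2/2008;23:59:00;243.10",
  "14/2/2008;00:00:00;244.23",
  "14/2/2008;00:01:00;245.24",
  "14/2/2008;00:02:00;245.31"
), tmp)

# Locate the matching data rows from the first column only.
dates     <- fread(tmp, sep = ";", header = TRUE, select = 1)
hits      <- which(dates$Date == "14/2/2008")
skipLines <- min(hits)     # skips the header line plus the rows before the match
keepRows  <- length(hits)

# header = TRUE: the first line after the skip is consumed as column names.
lost <- fread(tmp, sep = ";", skip = skipLines, nrows = keepRows, header = TRUE)

# header = FALSE: the same line is kept as the first row of data.
kept <- fread(tmp, sep = ";", skip = skipLines, nrows = keepRows, header = FALSE)

names(lost)  # the first 14/2/2008 reading, used as names
names(kept)  # default names V1, V2, V3
```

In this sketch fread() never looks at line 1 once skip is greater than zero, so the real column names have to come from somewhere else.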
I was able to work around this by setting header = FALSE, reading the first
line of the file into another tiny data.table to get all the column names
(the first pass only ever reads the first column), and applying those names
with setnames(). This doesn't seem like the best solution if there is a way
to do it within the fread() call.
findRows <- fread("power.txt", header = TRUE, select = 1)
all <- which(findRows$Date %in% c("14/2/2008", "15/2/2008"))
skipLines <- min(all)
keepRows <- length(all)
# header = FALSE keeps the first matching row as data (columns V1, V2, ...)
feb <- fread("power.txt", skip = skipLines, nrows = keepRows,
             header = FALSE)
rm(findRows)
# Read the real column names from the top of the file and apply them
febNames <- names(fread("power.txt", nrows = 1))
setnames(feb, febNames)
head(feb)
        Date     Time Global_active_power Global_reactive_power Voltage
1: 14/2/2008 00:00:00               0.252                     0  244.23
2: 14/2/2008 00:01:00               0.254                     0  245.24
3: 14/2/2008 00:02:00               0.254                     0  245.31
4: 14/2/2008 00:03:00               0.252                     0  244.44
5: 14/2/2008 00:04:00               0.252                     0  244.27
6: 14/2/2008 00:05:00               0.252                     0  244.62
Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 1 0 0 0
2: 1 0 0 0
3: 1 0 0 0
4: 1 0 0 0
5: 1 0 0 0
6: 1 0 0 0
> system.time(loadF())
user system elapsed
0.61 0.05 0.66
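The workaround can be folded into one helper, sketched below under some assumptions: readDates() is a hypothetical name (it is not the loadF() timed above, which was not posted), the sample file is invented, and the col.names argument of fread() is assumed to be available, which requires a data.table release newer than 1.9.2. It still reads the header line separately, so it tidies the workaround rather than replacing it.

```r
library(data.table)

# Hypothetical helper: read only the rows whose first column matches `dates`.
readDates <- function(file, dates, sep = ";") {
  # Pass 1: first column only, to locate the wanted rows.
  first <- fread(file, sep = sep, header = TRUE, select = 1)
  hits  <- which(first[[1]] %in% dates)
  # skip/nrows can only describe one contiguous block of rows.
  stopifnot(length(hits) > 0, all(diff(hits) == 1))

  # The real column names live on line 1 of the file.
  colNames <- names(fread(file, sep = sep, header = TRUE, nrows = 1))

  # Pass 2: skip to the block and name the columns as they are read.
  fread(file, sep = sep, skip = min(hits), nrows = length(hits),
        header = FALSE, col.names = colNames)
}

# Made-up stand-in for power.txt (values invented for illustration).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Date;Time;Voltage",
  "13/2/2008;23:59:00;243.10",
  "14/2/2008;00:00:00;244.23",
  "14/2/2008;00:01:00;245.24",
  "15/2/2008;00:00:00;240.00",
  "16/2/2008;00:00:00;241.00"
), tmp)

feb <- readDates(tmp, c("14/2/2008", "15/2/2008"))
print(feb)
```

On 1.9.2 itself, the same function would presumably need header = FALSE followed by setnames(), as in the code above.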
Is there a way to accomplish this within the fread() call that skips to row
610,957 and initially creates the feb data.table instead of having to create
another data.table of length 1 just to read the headers?
--
View this message in context: http://r.789695.n4.nabble.com/Losing-header-names-when-using-skip-argument-in-fread-in-R-tp4690268.html
Sent from the datatable-help mailing list archive at Nabble.com.