[datatable-help] fread(character string) limited to strings less than 4096 long?

Timothée Carayol timothee.carayol at gmail.com
Thu Mar 28 15:58:37 CET 2013


Input contains a \n (or is ""), taking this to be text input (not a
filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.

Using line 30 to detect sep (the last non blank line in the first 30) ...
'\t'
Found 2 columns

First row with 2 fields occurs on line 1 (either column names or first row
of data)
All the fields on line 1 are character fields. Treating as the column
names.
Count of eol after first data row: 1023

Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data
rows
Type codes: 33 (first 5 rows)

Type codes: 33 (+middle 5 rows)

Type codes: 33 (+last 5 rows)

   0.000s (-nan%) Memory map (rerun may be quicker)

   0.000s (-nan%) sep and header detection

   0.000s (-nan%) Count rows (wc -l)

   0.000s (-nan%) Column type detection (first, middle and last 5 rows)

   0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM

   0.000s (-nan%) Reading data

   0.000s (-nan%) Allocation for type bumps (if any), including gc time if
triggered
   0.000s (-nan%) Coercing data already read in type bumps (if any)

   0.000s (-nan%) Changing na.strings to NA

   0.000s        Total

4092 1022

Input contains a \n (or is ""), taking this to be text input (not a
filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.

Using line 30 to detect sep (the last non blank line in the first 30) ...
'\t'
Found 2 columns

First row with 2 fields occurs on line 1 (either column names or first row
of data)
All the fields on line 1 are character fields. Treating as the column
names.
Count of eol after first data row: 1023

Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
rows
Type codes: 33 (first 5 rows)

Type codes: 33 (+middle 5 rows)

Type codes: 33 (+last 5 rows)

   0.000s (-nan%) Memory map (rerun may be quicker)

   0.000s (-nan%) sep and header detection

   0.000s (-nan%) Count rows (wc -l)

   0.000s (-nan%) Column type detection (first, middle and last 5 rows)

   0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM

   0.000s (-nan%) Reading data

   0.000s (-nan%) Allocation for type bumps (if any), including gc time if
triggered
   0.000s (-nan%) Coercing data already read in type bumps (if any)

   0.000s (-nan%) Changing na.strings to NA

   0.000s        Total

4096 1023

Input contains a \n (or is ""), taking this to be text input (not a
filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.

Using line 30 to detect sep (the last non blank line in the first 30) ...
'\t'
Found 2 columns

First row with 2 fields occurs on line 1 (either column names or first row
of data)
All the fields on line 1 are character fields. Treating as the column
names.
Count of eol after first data row: 1023

Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
rows
Type codes: 33 (first 5 rows)

Type codes: 33 (+middle 5 rows)

Type codes: 33 (+last 5 rows)

   0.000s (-nan%) Memory map (rerun may be quicker)

   0.000s (-nan%) sep and header detection

   0.000s (-nan%) Count rows (wc -l)

   0.000s (-nan%) Column type detection (first, middle and last 5 rows)

   0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM

   0.000s (-nan%) Reading data

   0.000s (-nan%) Allocation for type bumps (if any), including gc time if
triggered
   0.000s (-nan%) Coercing data already read in type bumps (if any)

   0.000s (-nan%) Changing na.strings to NA

   0.000s        Total

4100 1023

Input contains a \n (or is ""), taking this to be text input (not a
filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.

Using line 30 to detect sep (the last non blank line in the first 30) ...
'\t'
Found 2 columns

First row with 2 fields occurs on line 1 (either column names or first row
of data)
All the fields on line 1 are character fields. Treating as the column
names.
Count of eol after first data row: 1023

Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data
rows
Type codes: 33 (first 5 rows)

Type codes: 33 (+middle 5 rows)

Type codes: 33 (+last 5 rows)

   0.000s (-nan%) Memory map (rerun may be quicker)

   0.000s (-nan%) sep and header detection

   0.000s (-nan%) Count rows (wc -l)

   0.000s (-nan%) Column type detection (first, middle and last 5 rows)

   0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM

   0.000s (-nan%) Reading data

   0.000s (-nan%) Allocation for type bumps (if any), including gc time if
triggered
   0.000s (-nan%) Coercing data already read in type bumps (if any)

   0.000s (-nan%) Changing na.strings to NA

   0.000s        Total

40000 1023
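
An aside on the arithmetic behind the four size/row pairs above, as a base R sketch. It does not call fread(). It only assumes that text input is being cut off after 4096 characters, which is a hypothesis consistent with the output and not something confirmed in this thread. The names `line`, `limit` and `cap_rows` are illustrative.

```r
# Base R only; fread() is not called here.
# Assumption (unconfirmed): only the first 4096 characters of text input are read.
line  <- "a\tb\n"   # one line of input: 'a', tab, 'b', newline = 4 characters
limit <- 4096L      # hypothesised cut-off, in characters

# Whole lines that fit below the cut-off, minus one for the header line.
cap_rows <- limit %/% nchar(line) - 1L

for (n in c(1023:1025, 10000)) {
  input <- paste(rep(line, n), collapse = "")
  expected_rows <- n - 1L   # the first line is taken as column names
  # Columns: input size, rows expected with no cut-off, rows under the assumed cut-off.
  cat(nchar(input), expected_rows, min(expected_rows, cap_rows), "\n")
}
```

Under that assumption the third column comes out as 1022, 1023, 1023, 1023, which matches the row counts reported above.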




On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
>
>
> Hm this is odd.
>
> Could you run the following and paste back the (verbose) results please.
>
> for (n in c(1023:1025, 10000)) {
>  input = paste( rep('a\tb\n', n), collapse='')
>  A = fread(input,verbose=TRUE)
>  cat(nchar(input), nrow(A), "\n")
> }
>
>
>
>
>
> On 28.03.2013 14:38, Timothée Carayol wrote:
>
>  Curiouser and curiouser..
>
> I can reproduce on two computers with different versions of R and of
> data.table.
>
>
>
> Computer 1 (it says unknown-linux but is actually ubuntu):
>
> R version 2.15.3 (2013-03-01)
>
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
>
>
> locale:
>
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
> LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
> LC_MONETARY=en_GB.UTF-8
>    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=C                 LC_NAME=C
>          LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8
> LC_IDENTIFICATION=C
>
>
>
> attached base packages:
>
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
>
> other attached packages:
>
> [1] bit64_0.9-2      bit_1.1-10       data.table_1.8.9 colorout_1.0-0
>
> Computer 2:
>  R version 2.15.2 (2012-10-26)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] data.table_1.8.8
>
> loaded via a namespace (and not attached):
> [1] tools_2.15.2
>
>
> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>>
>> Interesting, what's your sessionInfo() please?
>>
>> For me it seems to work ok :
>>
>> [1] 1022
>> [1] 1023
>> [1] 1024
>> [1] 9999
>>
>> > sessionInfo()
>> R version 2.15.2 (2012-10-26)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>
>>
>>
>> On 27.03.2013 22:49, Timothée Carayol wrote:
>>
>>  Agree with Muhammad, longer character strings are definitely permitted
>> in R.
>> A minimal example that shows something strange happening with fread:
>> for (n in c(1023:1025, 10000)) {
>>   # Build n identical tab-separated lines and pass them to fread as text
>>   A <- fread(
>>     paste(rep('a\tb\n', n), collapse=''),
>>     sep='\t'
>>   )
>>   print(nrow(A))
>> }
>> On my computer, I obtain:
>>  [1] 1022
>> [1] 1023
>> [1] 1023
>> [1] 1023
>>  Hope this helps
>> Timothée
>>
>>
>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>>> Hi,
>>> Nice to hear from you. Nope, not known to me. Obviously 4096 is 4k; is that
>>> the R limit for a character string length? What happens at 4097?
>>> Matthew
>>>
>>> > Hi,
>>> >
>>> > I have an example of a string of 4097 characters which can't be parsed by
>>> > fread; however, if I remove any character, it can be parsed just fine. Is
>>> > that a known limitation?
>>> >
>>> > (If I write the string to a file and then fread the file name, it works
>>> > too.)
>>> >
>>> > Let me know if you need the string and/or a bug report.
>>> >
>>> > Thanks
>>> > Timothée
>>>  > _______________________________________________
>>> > datatable-help mailing list
>>> > datatable-help at lists.r-forge.r-project.org
>>> >
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>>
>>>
>>
>>
>
>
>