[Traminer-users] Error message with SPELL, R hangs with TSE

Fri Apr 13 01:30:25 CEST 2012

Thanks for your reply, Gilbert.

My data record which webpage (i.e., StepNumber) a person visits within a
sequence of webpages. Although there's an intended order (1.1, 1.2, 1.3,
etc.), people can visit the pages in any order they wish. Ultimately, I'd
like to relate the different ways people navigate this website to some
outcome measure data I have elsewhere.

I converted my date/time variables into integers as you suggested. Following
Nathan Green's experience (posted here:
<http://lists.r-forge.r-project.org/pipermail/traminer-users/2011-June/000073.html>),
I rounded the numbers so that "11/28/2011 9:41:23 AM" eventually became
"35". I think this worked, but I suspect it may cause problems later
because the rounding discards a lot of finer-grained timing information. I
just wanted to get something working for now, but I wonder if you have any
suggestions on this.
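
One possible way to keep more of the timing information is to parse the timestamps and count whole minutes from the earliest one, rather than rounding by hand. The following is only a sketch in base R: the example timestamps are hypothetical, the format string assumes the "11/28/2011 9:41:23 AM" layout shown above, and parsing the AM/PM marker with %p assumes an English locale.

```r
# Hypothetical timestamps in the same layout as the example above
stamps <- c("11/28/2011 9:37:00 AM",
            "11/28/2011 9:41:23 AM",
            "11/28/2011 10:02:10 AM")

# Parse to POSIXct; %I is the 12-hour clock, %p the AM/PM marker
# (%p parsing is locale dependent; an English locale is assumed here)
t <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Whole minutes elapsed since the earliest timestamp,
# plus 1 so that the first position is 1 rather than 0
origin  <- min(t)
minutes <- as.integer(floor(as.numeric(difftime(t, origin, units = "mins")))) + 1L
minutes
```

Choosing a finer unit (for example units = "secs") keeps more detail but produces much longer sequences, so the unit is a trade-off rather than a fixed choice.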

I'm now encountering some errors and warning messages about missing values
that I hope you can help me interpret.

Here's how my data now look:

> head(d[1:5,3:7])
  WorkgroupID StartTime StopTime StepNumber                   StepTitle
1       45857        32       34        1.1     1.1 Meet the scientist!
2       45857        34       35        1.2 1.2 Your ideas about cancer
3       45857        35       35        1.1     1.1 Meet the scientist!
4       45857        35       35        1.2 1.2 Your ideas about cancer
5       45857        35       36        1.3        1.3 What is mitosis?

And here are the various data types:

> str(d)
'data.frame': 11977 obs. of  30 variables:
 $ WorkgroupID             : int  45857 45857 45857 45857 45857 45857 45857 45857 45857 45857 ...
 $ StartTime               : int  32 34 35 35 35 36 36 37 39 39 ...
 $ StopTime                : int  34 35 35 35 36 36 37 39 39 40 ...
 $ StepNumber              : num  1.1 1.2 1.1 1.2 1.3 1.2 1.3 1.2 1.3 1.2 ...
 $ StepTitle               : Factor w/ 38 levels "1.1 Meet the scientist!",..: 1 2 1 2 4 2 4 2 4 2 ...
 $ StepType                : Factor w/ 10 levels "AssessmentList",..: 5 6 5 6 4 6 4 6 4 6 ...
 $ TimeSpent.Seconds       : int  138 49 2 2 73 49 102 146 17 40 ...
 $ StepNumber_factor       : Factor w/ 36 levels "1.1","1.2","1.25",..: 1 2 1 2 4 2 4 2 4 2 ...
 $ StepNumber_int          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ StepTitle_int           : int  1 2 1 2 4 2 4 2 4 2 ...

This is how I've made a sequence object from SPELL formatted data, and the
messages I see as a result:

> d.labels <- levels(d$StepTitle)

> d.states <- 1:length(d.labels)

> d.seq <- seqdef(d, var = c("WorkgroupID", "StartTime", "StopTime",
"StepTitle_int"), informat = "SPELL", states = d.states, labels = d.labels,
process = TRUE)

 [>] SPELL data converted into 77 STS sequences

 [>] found missing values ('NA') in sequence data

 [>] preparing 77 sequences

 [>] coding void elements with '%' and missing values with '*'

 [!] sequence with index:
9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77
contains only missing values.

     This may produce inconsistent results.

 [>] alphabet (state labels):

     1 = 1 (1.1 Meet the scientist!)

     2 = 2 (1.2 Your ideas about cancer)

     3 = 3 (1.25 Distinguish the phases: A time-lapse animation 2)

     4 = 4 (1.3 What is mitosis?)

     5 = 5 (1.3 What is mitosis? )

     6 = 6 (1.4 Fast and slow dividers)

     7 = 7 (1.5 Why do some cells divide fast and others slow?)

     8 = 8 (1.6 A definition of cancer)

     9 = 9 (2.1 Interphase: When cells don't divide)

     10 = 10 (2.10 Another look through the microscope)

     11 = 11 (2.2 Look through the microscope)

     12 = 12 (2.3 Putting the picture together)

      ... (38 states)

 [>] no color palette attributed, provide one to use graphical functions

 [>] 77 sequences in the data set

 [>] min/max sequence length: 5/100

Warning message:

 [!] no automatic color palete attributed, number of states>12.

     Use 'cpal' argument to define one.
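
The warning asks for a palette with one color per state. A minimal sketch of building one in base R follows; the count of 38 is taken from the "(38 states)" line above, and rainbow() is just one convenient generator, not a recommendation from the package.

```r
# One color per state; 38 matches the alphabet size reported above
n_states     <- 38
state_colors <- rainbow(n_states)

# Quick look at the first few hex codes
head(state_colors)
```

The resulting vector could then be passed as cpal = state_colors in the seqdef() call, as the warning suggests.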

If I understand correctly, all the missing values arise because the
sequences begin at different times (i.e., the website visitors began their
visits at different times). I figure that specifying a process time axis
requires more than just setting process=TRUE. Is that correct?
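
If that hypothesis is right, one way to check it is to shift each workgroup's spells so that its first spell starts at position 1. The sketch below uses base R on toy data with hypothetical values; the Begin and End column names are made up for illustration, and whether the shifted columns should then be used with process = FALSE is an assumption to verify against the manual.

```r
# Toy SPELL data: hypothetical values, same column names as in d
spells <- data.frame(
  WorkgroupID = c(45857, 45857, 45857, 45859, 45859),
  StartTime   = c(32, 34, 35, 34, 36),
  StopTime    = c(34, 35, 36, 36, 40)
)

# Earliest start time of each workgroup, repeated on every row of that group
first <- ave(spells$StartTime, spells$WorkgroupID, FUN = min)

# Shift both endpoints so that every sequence starts at position 1
spells$Begin <- spells$StartTime - first + 1
spells$End   <- spells$StopTime  - first + 1
spells
```

With positions expressed this way, no sequence has leading positions that precede its own first spell.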

I also tried your suggestion and converted my SPELL data to an STS format:

> d.sts <- seqformat(d, id="WorkgroupID", begin = "StartTime", end =
"StopTime", status = "StepTitle_int", from = "SPELL", to = "STS", process =
"TRUE")

 [>] SPELL data converted into 77 STS sequences

But I still see these missing values when I then create a sequence object
from the STS formatted data:

> d.sts.seq <- seqdef(d.sts)

 [>] found missing values ('NA') in sequence data

 [>] preparing 77 sequences

 [>] coding void elements with '%' and missing values with '*'

 [!] sequence with index:
9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77
contains only missing values.

     This may produce inconsistent results.

 [>] 15 distinct states appear in the data:

     1 = 1

     2 = 2

     3 = 4

     4 = 6

     5 = 7

     6 = 8

     7 = 9

     8 = 11

     9 = 12

     10 = 13

     11 = 14

     12 = 15

      ...

 [>] alphabet (state labels):

     1 = 1 (1)

     2 = 2 (2)

     3 = 4 (4)

     4 = 6 (6)

     5 = 7 (7)

     6 = 8 (8)

     7 = 9 (9)

     8 = 11 (11)

     9 = 12 (12)

     10 = 13 (13)

     11 = 14 (14)

     12 = 15 (15)

      ... (15 states)

 [>] no color palette attributed, provide one to use graphical functions

 [>] 77 sequences in the data set

 [>] min/max sequence length: 5/100

Warning message:

 [!] no automatic color palete attributed, number of states>12.

     Use 'cpal' argument to define one.

So I'm following the directions in sections 5.2.2 (p.44) and 6.1 (p.50) of
the manual to specify a process time axis. I created a new dataset that
looks like this:

> head(d.StartTimes)

  WorkgroupID StartTime

1       45813         1

2       45848        26

3       45857        32

4       45859        34

5       45860        35

6       45861        36

And when I then try to create a sequence object from SPELL formatted data,
I get this error and don't know what it means:

> d.seq <- seqdef(d, var = c("WorkgroupID", "StartTime", "StopTime",
"StepTitle_int"), informat = "SPELL", states = d.states, labels = d.labels,
process = TRUE, pdata = d.StartTimes, pvar = c("WorkgroupID", "StartTime"))

Error in rep(state, dur) : invalid 'times' argument
In addition: Warning messages:
1: In if (is.na(age1)) { :
  the condition has length > 1 and only the first element will be used
2: In if (age1 >= 0) { :
  the condition has length > 1 and only the first element will be used
3: In if (is.na(sstart) | is.na(sstop)) { :
  the condition has length > 1 and only the first element will be used
4: In if (sstop <= limit) { :
  the condition has length > 1 and only the first element will be used
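
Two things might be worth checking here, though both are guesses rather than a diagnosis: rep() raises this error when a duration is negative or NA, and the "length > 1" warnings suggest that a lookup expected to return one value per ID returned several. The sketch below runs both checks in base R on toy data with hypothetical values.

```r
# Toy data with hypothetical values
spells <- data.frame(
  WorkgroupID = c(1, 1, 2, 2),
  StartTime   = c(1, 3, 2, NA),
  StopTime    = c(3, 2, 4, 5)
)
starts <- data.frame(WorkgroupID = c(1, 2, 2), StartTime = c(1, 2, 2))

# Rows whose duration would be negative or undefined
dur        <- spells$StopTime - spells$StartTime
bad_spells <- which(is.na(dur) | dur < 0)

# IDs that occur more than once in the start-time table
dup_ids <- unique(starts$WorkgroupID[duplicated(starts$WorkgroupID)])

# rep() rejects a negative count with an "invalid 'times'" error
msg <- tryCatch(rep("1", -1), error = function(e) conditionMessage(e))

bad_spells
dup_ids
msg
```

Running the same two checks on d and d.StartTimes would show whether either condition occurs in the real data.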

I tried it with STS formatted data, and I still see all these missing
values:

> d.seq <- seqdef(d.sts, var = 1:29, states = d.states, labels = d.labels,
process = TRUE, pdata = d.StartTimes, pvar = c("WorkgroupID", "StartTime"))

 [>] found missing values ('NA') in sequence data

 [>] preparing 77 sequences

 [>] coding void elements with '%' and missing values with '*'

 [!] sequence with index:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77
contains only missing values.

     This may produce inconsistent results.

 [>] alphabet (state labels):

     1 = 1 (1.1 Meet the scientist!)

     2 = 2 (1.2 Your ideas about cancer)

     3 = 3 (1.25 Distinguish the phases: A time-lapse animation 2)

     4 = 4 (1.3 What is mitosis?)

     5 = 5 (1.3 What is mitosis? )

     6 = 6 (1.4 Fast and slow dividers)

     7 = 7 (1.5 Why do some cells divide fast and others slow?)

     8 = 8 (1.6 A definition of cancer)

     9 = 9 (2.1 Interphase: When cells don't divide)

     10 = 10 (2.10 Another look through the microscope)

     11 = 11 (2.2 Look through the microscope)

     12 = 12 (2.3 Putting the picture together)

      ... (38 states)

 [>] no color palette attributed, provide one to use graphical functions

 [>] 77 sequences in the data set

 [>] min/max sequence length: 5/29

Warning message:

 [!] no automatic color palete attributed, number of states>12.

     Use 'cpal' argument to define one.

I found section 6.5.2 (p. 57) on handling missing values, and tried this:

> d.sts.noNA <- seqdef(d.sts, left="DEL")

 [>] found missing values ('NA') in sequence data

 [>] preparing 77 sequences

 [>] coding void elements with '%' and missing values with '*'

 [!] sequence with index:
9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77
contains only missing values.

     This may produce inconsistent results.

 [>] 15 distinct states appear in the data:

     1 = 1

     2 = 2

     3 = 4

     4 = 6

     5 = 7

     6 = 8

     7 = 9

     8 = 11

     9 = 12

     10 = 13

     11 = 14

     12 = 15

      ...

 [>] alphabet (state labels):

     1 = 1 (1)

     2 = 2 (2)

     3 = 4 (4)

     4 = 6 (6)

     5 = 7 (7)

     6 = 8 (8)

     7 = 9 (9)

     8 = 11 (11)

     9 = 12 (12)

     10 = 13 (13)

     11 = 14 (14)

     12 = 15 (15)

      ... (15 states)

 [>] no color palette attributed, provide one to use graphical functions

 [>] 77 sequences in the data set

 [>] min/max sequence length: 4/100

Warning message:

 [!] no automatic color palete attributed, number of states>12.

     Use 'cpal' argument to define one.

I sort of feel like I'm stumbling around in the dark trying to understand
what this all means. Are these missing values problematic? I hope you or
someone can clarify for me what's going on, and what's the best way to
approach these data.

Thanks in advance for your help!

Camillia

On Mon, Apr 9, 2012 at 11:46 AM, Gilbert Ritschard <
Gilbert.Ritschard at unige.ch> wrote:

Dear Camillia,

It is not clear to me what your StepNumber is. Does it stand for states?
How many different values does it take?

Anyway, I think you would do better to first transform your spell data into
STS format with the seqformat() function, and then define your state
sequence object from the STS data. You will have to specify whether you want
to align your sequences on calendar time (the default) or on process time
(time since an individual start event).

Currently, the seqformat function of TraMineR does not support date or time
format for the "begin" and "end" arguments. You should first transform
those start and end times into integers, so that they can be interpreted as
positions in the sequence.

As for your attempt to use the methods for event sequences, again, I am not
sure what your StepNumber stands for. You use it as if it defined the event
occurring at the time stamp. Is that what you want to do?

Gilbert

On 07-Apr-12 21:24, Camillia Matuk wrote:

Hello,

I'm having problems getting started with my data, and am very new to both R
and to TraMineR. I hope someone can help.

Relevant columns in my csv file are WorkgroupID (e.g., 65472), Start and
Stop times (e.g., 11/28/11 9:37 AM), and StepNumber (e.g., "4.5").

After reading in my data, this is what I did:

> WorkgroupID_factor <- factor(d$WorkgroupID)

> StepNumber_factor <- factor(d$StepNumber)

> d <- data.frame(d, WorkgroupID_factor, StepNumber_factor)

I figured I should treat this as SPELL formatted data, so I did this:

> d.labels <- seqstatl(d$StepNumber_factor)

> d.states <- 1:length(d.labels)

But I get error messages when I do this:

> d.seq <- seqdef(d, var = c("WorkgroupID_factor", "StartTime", "StopTime",
"StepNumber_factor"), informat = "SPELL", states = d.states, labels =
d.labels, process = FALSE)

Error in Summary.factor(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,  :
  min not meaningful for factors
In addition: Warning messages:
1: In Ops.factor(begincolumn, 1) : < not meaningful for factors
2: In Ops.factor(endcolumn, begincolumn) : - not meaningful for factors
3: In Ops.factor(begincolumn, 0) : > not meaningful for factors
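
These messages indicate that the begin and end columns reached seqdef() as factors, on which min() and arithmetic are not defined. The base R sketch below, on hypothetical values, shows the failure and the usual pitfall when converting a factor to numbers. It only covers factors whose labels are already numeric; date strings would need to be parsed first.

```r
# A factor whose labels happen to be numbers (hypothetical values)
f <- factor(c("32", "34", "35", "100"))

# min() is not defined for unordered factors
err <- tryCatch(min(f), error = function(e) conditionMessage(e))

# as.integer() on a factor returns the internal level codes ...
codes <- as.integer(f)

# ... so convert through character to recover the actual values
values <- as.integer(as.character(f))

err
codes
values
```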

Abandoning that, I then tried treating my data as though it were in TSE
format. I'm not sure if that's the proper thing to do...

d.seqe <- seqecreate(id = d$WorkgroupID_factor, timestamp = d$StartTime,
event = d$StepNumber_factor)

This works, although I'm still unsure about how to read it:

> print(d.seqe[2]) #Displays the sequence of events

[1]
67.00-(1.1)-2.00-(1.2)-3.00-(1.3)-3.00-(1.4)-2.00-(1.5)-2.00-(1.4)-3.00-(1.5,1.5,1.6)-198.00-(1.6)-1.00-(1.5,1.5,1.6)-1.00-(1.4,1.4,1.5)-2.00-(2.1,2.3)-4.00-(2.3,2.3)-1.00-(2.3,2.3,2.3,2.4)-1.00-(2.3,2.3,2.4)-3.00-(2.3,2.3)-3.00-(2.3)-7.00-(2.3,2.4)-9.00-(2.3,2.4,2.4)-167.00-(2.4,2.4,2.5,2.6)-(...)

But when I run this command, R hangs and I have to force quit and restart:

d.fsubseq <- seqefsub(d.seqe, minSupport = 50)
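
The printed sequence above contains the same event several times at one timestamp, for example (1.5,1.5,1.6). It is only an assumption that these repeats contribute to the long running time, but removing exact duplicates before calling seqecreate() is cheap to try. A base R sketch on toy data with hypothetical values:

```r
# Toy TSE-style data: hypothetical values with a repeated event at time 10
events <- data.frame(
  id        = c(1, 1, 1, 1),
  timestamp = c(10, 10, 10, 12),
  event     = c("1.5", "1.5", "1.6", "1.6")
)

# Keep only the first occurrence of each (id, timestamp, event) row
events_unique <- events[!duplicated(events), ]
events_unique
```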

I hope someone can point out what I'm doing wrong. Thanks for any
assistance!

-- 

Camillia

http://sites.google.com/site/cfmatuk/

_______________________________________________

Traminer-users mailing list

Traminer-users at lists.r-forge.r-project.org

https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users

-- 

Gilbert Ritschard, Department of Economics and

Institute for Demographic and Life Course Studies,

University of Geneva, 40, bd du Pont-d'Arve, CH-1211 Genève 4, Switzerland

http://mephisto.unige.ch

-- 

Camillia

http://sites.google.com/site/cfmatuk/