[Traminer-users] Error message with SPELL, R hangs with TSE

Gilbert Ritschard Gilbert.Ritschard at unige.ch
Fri Apr 13 09:17:41 CEST 2012


Dear Camillia,

There are two points to watch out for when converting from SPELL data:

1. By default, the STS sequence length is set to 100. Your sequences seem 
to be longer, so you should specify the maximum sequence length with 
the limit argument of seqformat, e.g. "limit=300".

2. By default, sequences are aligned on a calendar axis and, as you 
pointed out, this generates missing values at the positions before the 
first spell as well as after the last spell. You would probably do better 
to align the start times of all sequences, which you do by specifying 
"process=TRUE" along with the "pdata" and "pvar" arguments. See 
help(seqformat).
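
As a rough illustration of both points, here is a minimal base-R sketch on made-up spell data that reuses the column names from the thread. It shows why a calendar axis needs a large limit, and how a per-ID start-time table (the kind of data frame passed as "pdata") can be derived. The toy values, the helper name spell_to_sts and the "+ 1" shift are assumptions for illustration only; the wrapper uses only the seqformat arguments already named in this thread and is defined but not run here, since its exact behaviour depends on the installed TraMineR version.

```r
# Toy spell data shaped like the thread's columns (values are invented).
spells <- data.frame(
  WorkgroupID   = c(1, 1, 1, 2, 2),
  StartTime     = c(32, 34, 35, 120, 123),
  StopTime      = c(34, 35, 36, 123, 130),
  StepTitle_int = c(1, 2, 4, 1, 2)
)

# Point 1: on a calendar axis, positions run up to the largest stop time,
# so the limit has to be at least this large.
calendar_limit <- max(spells$StopTime)

# Point 2: one start time per ID, i.e. the first spell's StartTime.
first_start <- aggregate(StartTime ~ WorkgroupID, data = spells, FUN = min)

# Shift every spell so that each ID starts at position 1.
# The "+ 1" is an illustrative convention, not necessarily TraMineR's own.
offset <- first_start$StartTime[match(spells$WorkgroupID,
                                      first_start$WorkgroupID)]
spells$ProcStart <- spells$StartTime - offset + 1
spells$ProcStop  <- spells$StopTime  - offset + 1

# On the process axis the required length is much shorter.
process_limit <- max(spells$ProcStop)

# Hypothetical wrapper (not called here): hands the unshifted spells and the
# start-time table to seqformat, using the arguments mentioned in the thread.
spell_to_sts <- function(spells, starts, limit) {
  TraMineR::seqformat(spells,
                      id = "WorkgroupID", begin = "StartTime",
                      end = "StopTime", status = "StepTitle_int",
                      from = "SPELL", to = "STS",
                      process = TRUE, pdata = starts,
                      pvar = c("WorkgroupID", "StartTime"),
                      limit = limit)
}
```

With TraMineR installed, something like spell_to_sts(spells, first_start, limit = calendar_limit) would be the call to try first, lowering the limit once the process-time alignment is confirmed to work.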

Good luck.

Gilbert


On 13-Apr-12 1:30, Camillia Matuk wrote:
>
> Thanks for your reply, Gilbert.
>
> The data I have describes which webpage (i.e., StepNumber) in a 
> sequence of webpages that a person visits. Although there's an 
> intended sequence (1.1, 1.2, 1.3, etc.), the person can visit them in 
> any order they wish. Ultimately, I'd like to relate the different ways 
> people navigate this website to some outcome measure data I have 
> elsewhere.
>
> I converted my date/time variables into integers as you suggested. And 
> following Nathan Green's experience (posted here 
> <http://lists.r-forge.r-project.org/pipermail/traminer-users/2011-June/000073.html>), 
> I rounded the numbers so that this "11/28/2011 9:41:23 AM" eventually 
> became this "35". I think this worked but I suspect it may be 
> problematic later that a lot of finer grained timing information was 
> lost by rounding the numbers. I just wanted to get something working 
> for now, but I wonder if you have any suggestions on this.
>
> I'm now encountering some errors and warning messages about missing 
> values that I hope you can help me interpret.
>
> Here's how my data now look:
>
> > head(d[1:5,3:7])
>   WorkgroupID StartTime StopTime StepNumber                   StepTitle
> 1       45857        32       34        1.1     1.1 Meet the scientist!
> 2       45857        34       35        1.2 1.2 Your ideas about cancer
> 3       45857        35       35        1.1     1.1 Meet the scientist!
> 4       45857        35       35        1.2 1.2 Your ideas about cancer
> 5       45857        35       36        1.3        1.3 What is mitosis?
>
>
> And here are the various data types:
>
> > str(d)
> 'data.frame': 11977 obs. of  30 variables:
>  $ WorkgroupID             : int  45857 45857 45857 45857 45857 45857 45857 45857 45857 45857 ...
>  $ StartTime               : int  32 34 35 35 35 36 36 37 39 39 ...
>  $ StopTime                : int  34 35 35 35 36 36 37 39 39 40 ...
>  $ StepNumber              : num  1.1 1.2 1.1 1.2 1.3 1.2 1.3 1.2 1.3 1.2 ...
>  $ StepTitle               : Factor w/ 38 levels "1.1 Meet the scientist!",..: 1 2 1 2 4 2 4 2 4 2 ...
>  $ StepType                : Factor w/ 10 levels "AssessmentList",..: 5 6 5 6 4 6 4 6 4 6 ...
>  $ TimeSpent.Seconds       : int  138 49 2 2 73 49 102 146 17 40 ...
>  $ StepNumber_factor       : Factor w/ 36 levels "1.1","1.2","1.25",..: 1 2 1 2 4 2 4 2 4 2 ...
>  $ StepNumber_int          : int  1 1 1 1 1 1 1 1 1 1 ...
>  $ StepTitle_int           : int  1 2 1 2 4 2 4 2 4 2 ...
>
>
> This is how I've made a sequence object from SPELL formatted data, and 
> the messages I see as a result:
>
> > d.labels <- levels(d$StepTitle)
>
> > d.states <- 1:length(d.labels)
>
> > d.seq <- seqdef(d, var = c("WorkgroupID", "StartTime", "StopTime", 
> "StepTitle_int"), informat = "SPELL", states = d.states, labels = 
> d.labels, process = TRUE)
>
>  [>] SPELL data converted into 77 STS sequences
>
>  [>] found missing values ('NA') in sequence data
>
>  [>] preparing 77 sequences
>
>  [>] coding void elements with '%' and missing values with '*'
>
>  [!] sequence with index: 
> 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77 
> contains only missing values.
>
>      This may produce inconsistent results.
>
>  [>] alphabet (state labels):
>
>      1 = 1 (1.1 Meet the scientist!)
>
>      2 = 2 (1.2 Your ideas about cancer)
>
>      3 = 3 (1.25 Distinguish the phases: A time-lapse animation 2)
>
>      4 = 4 (1.3 What is mitosis?)
>
>      5 = 5 (1.3 What is mitosis? )
>
>      6 = 6 (1.4 Fast and slow dividers)
>
>      7 = 7 (1.5 Why do some cells divide fast and others slow?)
>
>      8 = 8 (1.6 A definition of cancer)
>
>      9 = 9 (2.1 Interphase: When cells don't divide)
>
>      10 = 10 (2.10 Another look through the microscope)
>
>      11 = 11 (2.2 Look through the microscope)
>
>      12 = 12 (2.3 Putting the picture together)
>
>       ... (38 states)
>
>  [>] no color palette attributed, provide one to use graphical functions
>
>  [>] 77 sequences in the data set
>
>  [>] min/max sequence length: 5/100
>
> Warning message:
>
>  [!] no automatic color palete attributed, number of states>12.
>
>      Use 'cpal' argument to define one.
>
> If I understand correctly, I think all the missing values are due to 
> the events each beginning at different times (i.e., the website 
> visitors began their visits at different times). I figure I need to do 
> something more about specifying a process time axis than to just say 
> process=TRUE. Is that correct?
>
> I also tried your suggestion and converted my SPELL data to an STS format:
>
> > d.sts <- seqformat(d, id="WorkgroupID", begin = "StartTime", end = 
> "StopTime", status = "StepTitle_int", from = "SPELL", to = "STS", 
> process = "TRUE")
>
>  [>] SPELL data converted into 77 STS sequences
>
>
> But I still see these missing values when I then create a sequence 
> object from the STS formatted data:
>
> > d.sts.seq <- seqdef(d.sts)
>
>  [>] found missing values ('NA') in sequence data
>
>  [>] preparing 77 sequences
>
>  [>] coding void elements with '%' and missing values with '*'
>
>  [!] sequence with index: 
> 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77 
> contains only missing values.
>
>      This may produce inconsistent results.
>
>  [>] 15 distinct states appear in the data:
>
>      1 = 1
>
>      2 = 2
>
>      3 = 4
>
>      4 = 6
>
>      5 = 7
>
>      6 = 8
>
>      7 = 9
>
>      8 = 11
>
>      9 = 12
>
>      10 = 13
>
>      11 = 14
>
>      12 = 15
>
>       ...
>
>  [>] alphabet (state labels):
>
>      1 = 1 (1)
>
>      2 = 2 (2)
>
>      3 = 4 (4)
>
>      4 = 6 (6)
>
>      5 = 7 (7)
>
>      6 = 8 (8)
>
>      7 = 9 (9)
>
>      8 = 11 (11)
>
>      9 = 12 (12)
>
>      10 = 13 (13)
>
>      11 = 14 (14)
>
>      12 = 15 (15)
>
>       ... (15 states)
>
>  [>] no color palette attributed, provide one to use graphical functions
>
>  [>] 77 sequences in the data set
>
>  [>] min/max sequence length: 5/100
>
> Warning message:
>
>  [!] no automatic color palete attributed, number of states>12.
>
>      Use 'cpal' argument to define one.
>
>
> So I'm following the directions in sections 5.2.2 (p.44) and 6.1 
> (p.50) of the manual to specify a process time axis. I created a new 
> dataset that looks like this:
>
> > head(d.StartTimes)
>
>   WorkgroupID StartTime
>
> 1       45813         1
>
> 2       45848        26
>
> 3       45857        32
>
> 4       45859        34
>
> 5       45860        35
>
> 6       45861        36
>
>
> And when I then try to create a sequence object from SPELL formatted 
> data, I get this error and don't know what it means:
>
> > d.seq <- seqdef(d, var = c("WorkgroupID", "StartTime", "StopTime", 
> "StepTitle_int"), informat = "SPELL", states = d.states, labels = 
> d.labels, process = TRUE, pdata = d.StartTimes, pvar = 
> c("WorkgroupID", "StartTime"))
>
> Error in rep(state, dur) : invalid 'times' argument
>
> In addition: Warning messages:
>
> 1: In if (is.na(age1)) { :
>
>   the condition has length > 1 and only the first element will be used
>
> 2: In if (age1 >= 0) { :
>
>   the condition has length > 1 and only the first element will be used
>
> 3: In if (is.na(sstart) | is.na(sstop)) { :
>
>   the condition has length > 1 and only the first element will be used
>
> 4: In if (sstop <= limit) { :
>
>   the condition has length > 1 and only the first element will be used
>
>
> I tried it with STS formatted data, and I still see all these missing 
> values:
>
> > d.seq <- seqdef(d.sts, var = 1:29, states = d.states, labels = 
> d.labels, process = TRUE, pdata = d.StartTimes, pvar = 
> c("WorkgroupID", "StartTime"))
>
>  [>] found missing values ('NA') in sequence data
>
>  [>] preparing 77 sequences
>
>  [>] coding void elements with '%' and missing values with '*'
>
>  [!] sequence with index: 
> 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77 
> contains only missing values.
>
>      This may produce inconsistent results.
>
>  [>] alphabet (state labels):
>
>      1 = 1 (1.1 Meet the scientist!)
>
>      2 = 2 (1.2 Your ideas about cancer)
>
>      3 = 3 (1.25 Distinguish the phases: A time-lapse animation 2)
>
>      4 = 4 (1.3 What is mitosis?)
>
>      5 = 5 (1.3 What is mitosis? )
>
>      6 = 6 (1.4 Fast and slow dividers)
>
>      7 = 7 (1.5 Why do some cells divide fast and others slow?)
>
>      8 = 8 (1.6 A definition of cancer)
>
>      9 = 9 (2.1 Interphase: When cells don't divide)
>
>      10 = 10 (2.10 Another look through the microscope)
>
>      11 = 11 (2.2 Look through the microscope)
>
>      12 = 12 (2.3 Putting the picture together)
>
>       ... (38 states)
>
>  [>] no color palette attributed, provide one to use graphical functions
>
>  [>] 77 sequences in the data set
>
>  [>] min/max sequence length: 5/29
>
> Warning message:
>
>  [!] no automatic color palete attributed, number of states>12.
>
>      Use 'cpal' argument to define one.
>
>
> I found section 6.5.2 (p. 57) on handling missing values, and tried this:
>
> > d.sts.noNA <- seqdef(d.sts, left="DEL")
>
>  [>] found missing values ('NA') in sequence data
>
>  [>] preparing 77 sequences
>
>  [>] coding void elements with '%' and missing values with '*'
>
>  [!] sequence with index: 
> 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77 
> contains only missing values.
>
>      This may produce inconsistent results.
>
>  [>] 15 distinct states appear in the data:
>
>      1 = 1
>
>      2 = 2
>
>      3 = 4
>
>      4 = 6
>
>      5 = 7
>
>      6 = 8
>
>      7 = 9
>
>      8 = 11
>
>      9 = 12
>
>      10 = 13
>
>      11 = 14
>
>      12 = 15
>
>       ...
>
>  [>] alphabet (state labels):
>
>      1 = 1 (1)
>
>      2 = 2 (2)
>
>      3 = 4 (4)
>
>      4 = 6 (6)
>
>      5 = 7 (7)
>
>      6 = 8 (8)
>
>      7 = 9 (9)
>
>      8 = 11 (11)
>
>      9 = 12 (12)
>
>      10 = 13 (13)
>
>      11 = 14 (14)
>
>      12 = 15 (15)
>
>       ... (15 states)
>
>  [>] no color palette attributed, provide one to use graphical functions
>
>  [>] 77 sequences in the data set
>
>  [>] min/max sequence length: 4/100
>
> Warning message:
>
>  [!] no automatic color palete attributed, number of states>12.
>
>      Use 'cpal' argument to define one.
>
>
> I sort of feel like I'm stumbling around in the dark trying to 
> understand what this all means. Are these missing values problematic? 
> I hope you or someone can clarify for me what's going on, and what's 
> the best way to approach these data.
>
> Thanks in advance for your help!
>
> Camillia
>
>
>
>
>
> On Mon, Apr 9, 2012 at 11:46 AM, Gilbert Ritschard 
> <Gilbert.Ritschard at unige.ch> wrote:
>
> Dear Camillia,
>
>
> It is not clear to me what your StepNumber is. Does it stand for 
> states? How many different values does it take?
>
>
> Anyway, I think you would do better to first transform your spell data 
> into STS format with the seqformat() function, and then define your 
> state sequence object from the STS data. You will have to specify 
> whether you want to align your sequences on calendar time (default) or 
> on a process time (time since an individual start event).
>
>
> Currently, the seqformat function of TraMineR does not support date or 
> time format for the "begin" and "end" arguments. You should first 
> transform those start and end times into integers, so that they can be 
> interpreted as positions in the sequence.
>
>
> As for your attempt to use the methods for event sequences, again, I am 
> not sure what your StepNumber stands for. You use it as if it defined 
> the event occurring at the time stamp. Is that what you want to do?
>
>
> Gilbert
>
>
>
>
>
>
>
> On 07-Apr-12 21:24, Camillia Matuk wrote:
>
> Hello,
>
>
> I'm having problems getting started with my data, and am very new to 
> both R and to TraMineR. I hope someone can help.
>
>
> Relevant columns in my csv file are WorkgroupID (e.g., 65472), Start 
> and Stop times (e.g., 11/28/11 9:37 AM), and StepNumber (e.g., "4.5").
>
>
> After reading in my data, this is what I did:
>
>
> > WorkgroupID_factor <- factor(d$WorkgroupID)
>
> > StepNumber_factor <- factor(d$StepNumber)
>
> > d <- data.frame(d, WorkgroupID_factor, StepNumber_factor)
>
>
> I figured I should treat this as SPELL formatted data, so I did this:
>
>
> > d.labels <- seqstatl(d$StepNumber_factor)
>
> > d.states <- 1:length(d.labels)
>
>
> But I get error messages when I do this:
>
> > d.seq <- seqdef(d, var = c("WorkgroupID_factor", "StartTime", 
> "StopTime", "StepNumber_factor"), informat = "SPELL", states = 
> d.states, labels = d.labels, process = FALSE)
>
>
> Error in Summary.factor(c(NA_integer_, NA_integer_, NA_integer_, 
> NA_integer_,  :
>
>  min not meaningful for factors
>
> In addition: Warning messages:
>
> 1: In Ops.factor(begincolumn, 1) : < not meaningful for factors
>
> 2: In Ops.factor(endcolumn, begincolumn) : - not meaningful for factors
>
> 3: In Ops.factor(begincolumn, 0) : > not meaningful for factors
>
>
> Abandoning that, I then tried treating my data as though it were in 
> TSE format. I'm not sure if that's the proper thing to do...
>
> d.seqe <- seqecreate(id = d$WorkgroupID_factor, timestamp = 
> d$StartTime, event = d$StepNumber_factor)
>
>
> This works, although I'm still unsure about how to read it:
>
> > print(d.seqe[2]) #Displays the sequence of events
>
> [1] 
> 67.00-(1.1)-2.00-(1.2)-3.00-(1.3)-3.00-(1.4)-2.00-(1.5)-2.00-(1.4)-3.00-(1.5,1.5,1.6)-198.00-(1.6)-1.00-(1.5,1.5,1.6)-1.00-(1.4,1.4,1.5)-2.00-(2.1,2.3)-4.00-(2.3,2.3)-1.00-(2.3,2.3,2.3,2.4)-1.00-(2.3,2.3,2.4)-3.00-(2.3,2.3)-3.00-(2.3)-7.00-(2.3,2.4)-9.00-(2.3,2.4,2.4)-167.00-(2.4,2.4,2.5,2.6)-(...)
>
>
> But when I run this command, R hangs and I have to force quit and restart:
>
> d.fsubseq <- seqefsub(d.seqe, minSupport = 50)
>
>
>
> I hope someone can point out what I'm doing wrong. Thanks for any 
> assistance!
>
>
> -- 
>
> Camillia
>
>
> http://sites.google.com/site/cfmatuk/
>
>
>
> _______________________________________________
>
> Traminer-users mailing list
>
> Traminer-users at lists.r-forge.r-project.org
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users
>
>
> -- 
>
> Gilbert Ritschard, Department of Economics and
>
> Institute for Demographic and Life Course Studies,
>
> University of Geneva, 40, bd du Pont-d'Arve, CH-1211 Genève 4, Switzerland
>
> http://mephisto.unige.ch
>
>
>
>
>
> -- 
>
> Camillia
>
>
> http://sites.google.com/site/cfmatuk/
>

-- 
Gilbert Ritschard, Department of Economics and
Institute for Demographic and Life Course Studies,
University of Geneva, 40, bd du Pont-d'Arve, CH-1211 Genève 4, Switzerland
http://mephisto.unige.ch


