Stata: Counting number of consecutive occurrences of a pre-defined length


Observations in my data set contain the history of moves for each player. I would like to count the number of consecutive series of moves of some pre-defined length (2, 3 and more than 3 moves) in the first and second halves of the game. The sequences cannot overlap, i.e. the sequence 1111 should be considered as one sequence of length 4, not two sequences of length 2. That is, for an observation like this:

+-------+-------+-------+-------+-------+-------+-------+-------+
| move1 | move2 | move3 | move4 | move5 | move6 | move7 | move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|     1 |     1 |     1 |     1 | .     | .     |     1 |     1 |
+-------+-------+-------+-------+-------+-------+-------+-------+

…the following variables should be generated:

number of sequences of 2 in first half = 0
number of sequences of 2 in second half = 1
number of sequences of 3 in first half = 0
number of sequences of 3 in second half = 0
number of sequences of >3 in first half = 1
number of sequences of >3 in second half = 0

I have two potential options for how to proceed with the task, but neither of them leads to the final solution:

Option 1: Elaborating on Nick's tactical suggestion to use strings (Stata: Maximum number of consecutive occurrences of the same value across variables), I have concatenated the "move*" variables and tried to identify the starting position of a substring:

  egen test1 = concat(move*)
  gen test2 = subinstr(test1,"11","x",.) // find consecutive series of length 2

There are several problems with option 1: (1) it does not account for cases with overlapping sequences ("1111" is recognized as two sequences of 2); (2) it shortens the resulting string test2, so that the positions of x no longer correspond to starting positions in test1; (3) it does not account for the variable length of the substring if I need to check for sequences of length greater than 3.

Option 2: Create an auxiliary set of variables to identify the starting positions of the consecutive set (sets) of 1s of a fixed pre-defined length. Building on the earlier example, in order to count sequences of length 2, what I am trying to get is an auxiliary set of variables that are equal to 1 if a sequence started at a given move, and 0 otherwise:

+-------+-------+-------+-------+-------+-------+-------+-------+
| move1 | move2 | move3 | move4 | move5 | move6 | move7 | move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|     0 |     0 |     0 |     0 |     0 |     0 |     1 |     0 |
+-------+-------+-------+-------+-------+-------+-------+-------+

My code looks as follows, but it breaks when I try to restart counting consecutive occurrences:

  quietly forval i = 1/42 {
      gen temprow`i' = .
      egen rowsum = rownonmiss(seq1-seq`i') // count number of occurrences
      replace temprow`i' = rowsum
      mvdecode seq1-seq`i', mv(1) if rowsum==2
      drop rowsum
  }

Does anyone know a way of solving this task?

I will assume a string variable concatenating the moves, called all (the name test1 is hardly evocative).

First try: taking your example literally

From your example with 8 moves, the first half of the game is moves 1-4 and the second half is moves 5-8. Thus there is for each half only one way to have >3 moves, namely that there are 4 moves. In that case each substring will be "1111" and counting reduces to testing for that one possibility:

  gen count_1_4 = substr(all, 1, 4) == "1111"
  gen count_2_4 = substr(all, 5, 4) == "1111"

Extending this approach, there are only two ways to have 3 moves in sequence:

  gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
  gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")

In similar style, there can't be two instances of 2 moves in sequence in each half of the game, as that would qualify as 4 moves. So, at most there is one instance of 2 moves in sequence in each half. That instance must match either of two patterns, "11." or ".11". ".11." is also allowed, as it matches both patterns. We must also exclude any false match with a sequence of 3 moves, as just mentioned.

  gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11") ) & !count_1_3
  gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11") ) & !count_2_3

The result of each strpos() evaluation will be positive if a match is found, and (arg1 | arg2) will be true (1) if either argument is positive. (For Stata, non-zero is true in logical evaluations.)
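As a quick check, here is a minimal sketch that rebuilds the example observation from the question and applies the commands above. It assumes, as the patterns above do, that egen's concat() writes numeric missings as "." in the string.

```stata
clear
input move1 move2 move3 move4 move5 move6 move7 move8
1 1 1 1 . . 1 1
end

* concatenate the moves; missing values are assumed to show up as "."
egen all = concat(move*)

* one sequence of 4 in a half means the half is exactly "1111"
gen count_1_4 = substr(all, 1, 4) == "1111"
gen count_2_4 = substr(all, 5, 4) == "1111"

* a sequence of exactly 3 can only sit at the start or end of a half
gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")

* a sequence of 2 must touch a ".", and must not be part of a sequence of 3
gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11")) & !count_1_3
gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11")) & !count_2_3

list all count_*
```

For this observation only count_1_4 and count_2_2 should be 1, which matches the results asked for in the question.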

That's very much tailored to your particular problem, but not much worse for that.

P.S. I didn't try hard to understand your code. You seem to be confusing subinstr() with strpos(). If you want to know positions, subinstr() cannot help.
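To make the distinction concrete, here is a small sketch on the example string from the question. The variable names pos and swapped are made up for illustration. strpos() returns the position of the first match, while subinstr() returns an edited copy of the string.

```stata
clear
set obs 1
gen all = "1111..11"

* strpos() reports where the first ".." starts
gen pos = strpos(all, "..")

* subinstr() rewrites the string; the original positions are lost
gen swapped = subinstr(all, "11", "x", .)

list all pos swapped
```

Here pos is 5 and swapped is "xx..x", so the x standing for the last pair sits at position 5 rather than at its original position 7.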

second try

Your last code segment implies that your example is quite misleading: if there can be 42 moves, the approach above cannot be extended without pain. You need a different approach.

Let's suppose that the string variable all can be 42 characters long. I will set aside the distinction between first and second halves, which can be tackled by modifying this approach. At its simplest, just split the history into two variables, one for the first half and one for the second, and repeat the approach twice.
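A minimal sketch of such a split, assuming 42 moves so that each half is 21 characters. The names half1 and half2 and the toy history are hypothetical, not taken from the question.

```stata
clear
set obs 1

* toy 42-character history; _dup() pads each half with "." (missing moves)
local pad1 : di _dup(10) "."
local pad2 : di _dup(15) "."
gen all = "11.111.1111" + "`pad1'" + ".11.11" + "`pad2'"

* half1 and half2 are hypothetical names for the two halves
gen half1 = substr(all, 1, 21)
gen half2 = substr(all, 22, 21)

list half1 half2
```

Note that a run of 1s straddling the midpoint would be cut in two by this split; whether that is what you want depends on how the halves of the game are defined.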

You can clone your history

  clonevar work = all
  gen length1 = .
  gen length2 = .

and set up your count variables. Here count_4 will hold counts of sequences of 4 or more.

  gen count_4 = 0
  gen count_3 = 0
  gen count_2 = 0

First we look for move sequences of length 42, ..., 2. Every time we find one, we blank it out and bump up the count.

  qui forval j = 42(-1)2 {
      * length before deleting runs of `j' 1s
      replace length1 = length(work)
      local pattern : di _dup(`j') "1"
      replace work = subinstr(work, "`pattern'", "", .)
      * length after deleting
      replace length2 = length(work)
      if `j' >= 4 {
          replace count_4 = count_4 + (length1 - length2) / `j'
      }
      else if `j' == 3 {
          replace count_3 = count_3 + (length1 - length2) / 3
      }
      else if `j' == 2 {
          replace count_2 = count_2 + (length1 - length2) / 2
      }
  }

The important details here are

  1. If we delete (repeated instances of) a pattern and measure the change in length, we have just deleted (change in length) / (length of pattern) instances of that pattern. So, if I was looking for "11" and found that the length decreased by 4, I just found two instances.

  2. Working downwards and deleting what we found ensures that we don't find false positives, e.g. if "1111111" is deleted, we don't later find "111111", "11111", ..., "11", which are included within it.

  3. Deletion implies that we should work on a clone in order not to destroy what is of interest.
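The first two details can be seen on toy data. The sketch below uses two made-up histories and hypothetical variable names (before, n_naive, n4, n2). It first counts "11" directly, then counts "1111" before "11".

```stata
clear
set obs 2
gen all = "11.11.1" in 1
replace all = "1111.11" in 2

* naive: delete "11" straight away and count from the change in length
clonevar work = all
gen before = length(work)
replace work = subinstr(work, "11", "", .)
gen n_naive = (before - length(work)) / 2

* working downwards: delete "1111" first, then "11"
replace work = all
replace before = length(work)
replace work = subinstr(work, "1111", "", .)
gen n4 = (before - length(work)) / 4
replace before = length(work)
replace work = subinstr(work, "11", "", .)
gen n2 = (before - length(work)) / 2

list all n_naive n4 n2
```

For "11.11.1" both routes find two sequences of 2. For "1111.11" the naive count is 3, because "1111" is read as two pairs, whereas working downwards gives one sequence of 4 and one sequence of 2.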

