r - How to optimize a for loop over an extremely large dataframe
I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber (integer), compdate (integer), and judge (character), representing individual cases completed in an administrative court. The data were imported from a Stata dataset, and the date field came in as an integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the judge within a 30-day window of the completion date of the case at issue.
Here are the first 34 rows of data:
    idnumber compdate judge
           1     9615   jvc
           2    15316   ban
           3    15887   wla
           4    11968   wfn
           5    15001   clr
           6    13914   ieb
           7    14760   hsd
           8    11063   rjd
           9    10948   ppl
          10    16502   ban
          11    15391   wcp
          12    14587   lrd
          13    10672   rtg
          14    11864   jcw
          15    15071   gmr
          16    15082   pam
          17    11697   dlk
          18    10660   adp
          19    13284   ecc
          20    13052   jwr
          21    15987   mak
          22    10105   hea
          23    14298   clr
          24    18154   mmt
          25    10392   hea
          26    10157   erh
          27     9188   rbr
          28    12173   jcw
          29    10234   par
          30    10437   adp
          31    11347   rdw
          32    14032   jtz
          33    11876   amc
          34    11470   amc
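For anyone wanting to follow along, here is a minimal sketch that rebuilds the first few rows as a dataframe (hypothetical reconstruction; the real dataset has 5.9 million rows). Since the file came from Stata, the integer dates count days since 1960-01-01, should calendar dates ever be needed:

    # minimal reconstruction of the first rows (hypothetical; real data has 5.9M rows)
    x <- data.frame(
      idnumber = 1:6,
      compdate = c(9615L, 15316L, 15887L, 11968L, 15001L, 13914L),
      judge    = c("jvc", "ban", "wla", "wfn", "clr", "ieb"),
      stringsAsFactors = FALSE
    )
    # Stata stores %td dates as days since 1960-01-01:
    x$date <- as.Date(x$compdate, origin = "1960-01-01")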
Here's what I came up with. For each record, I take a subset of the data for that particular judge, subset again to the cases decided within the 30-day window, and assign the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
    for (i in 1:length(x$idnumber)) {
      e <- x$compdate[i]                                    # completion date of the case at issue
      f <- e - 29                                           # start of the 30-day window
      a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]  # all cases for this judge
      b <- a[a$compdate <= e & a$compdate >= f, ]           # cases decided within the window
      x$caseload[i] <- length(b$idnumber)                   # window count = caseload
    }
It works, but it is taking extremely long to complete. How can I optimize it, or do this more easily? Sorry if I'm naive about R and programming -- I'm a law professor trying to analyze court data.... Any help is appreciated. Thanks. Ken
I don't have much experience with rolling calculations, but...
- Calculate per-day, not per-case (since the count is the same for all cases completed on the same day).
- Calculate the cumulative sum of the number of cases, and take the difference between the current value of the sum and its value from 31 days ago (or min{daysago : daysago > 30}, since cases are not resolved every day). A base-R sketch of this idea follows right below.
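In base R, the per-day counting plus cumulative-sum idea looks roughly like this for a single judge (a sketch only, using the same >30-day cutoff as the data.table version below; "ban" is a judge code from the sample data):

    # sketch of the per-day + cumulative-sum idea for a single judge
    one_judge <- sort(x$compdate[x$judge == "ban"])
    days   <- unique(one_judge)              # each resolution day appears once
    counts <- as.integer(table(one_judge))   # cases completed on each day
    csum   <- cumsum(counts)                 # running total of cases
    # workload on day d = cases completed on days in [d - 30, d]
    wload <- sapply(seq_along(days), function(i) {
      past <- which(days < days[i] - 30)     # days strictly more than 30 days older
      csum[i] - if (length(past)) csum[max(past)] else 0L
    })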
It's fastest to use data.table. Here is my attempt, using simulated data from @nograpes. Comments start with #.
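@nograpes's simulation isn't reproduced here; a hypothetical stand-in of the right shape would be:

    # hypothetical stand-in for the simulated data (not @nograpes's actual code)
    set.seed(1)
    n <- 5.9e6
    judges <- replicate(300, paste(sample(letters, 3), collapse = ""))  # made-up 3-letter codes
    x <- data.frame(
      idnumber = 1:n,
      compdate = sample(9000:18500, n, replace = TRUE),
      judge    = sample(judges, n, replace = TRUE),
      stringsAsFactors = FALSE
    )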
    require(data.table)
    dt <- data.table(x)
    dt[, compdate := as.integer(compdate)]
    setkey(dt, judge, compdate)

    # count cases for each day
    ldt <- dt[, .N, by = 'judge,compdate']

    # cumulative sum of counts
    ldt[, nrun := cumsum(N), by = judge]

    # see how far back to look
    ldt[, lookbk := sapply(1:.N, function(i) {
      z <- compdate[i] - compdate[i:1]
      older <- which(z > 30)
      if (length(older)) min(older) - 1L else as(NA, 'integer')
    }), by = judge]

    # compute cumsum(today) - cumsum(more than 30 days ago)
    ldt[, wload := list(sapply(1:.N, function(i)
      nrun[i] - ifelse(is.na(lookbk[i]), 0, nrun[i - lookbk[i]])
    ))]
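ldt has one row per judge-day, while the question asks for a per-case variable; one way to get there (my addition, not part of the code above) is to join the day-level wload back onto the case-level table:

    # map the per-day workload back onto individual cases (my addition)
    setkey(ldt, judge, compdate)
    dt[ldt, caseload := i.wload]   # every case completed on that judge-day gets the day's count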
On my laptop, this takes under a minute. Run this command to see the output for one judge (substitute a real judge code for 'xyz'):
print(ldt['xyz'],nrow=120)
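To build confidence in the rolling logic, one can re-run the slow loop on a small slice and compare against the joined caseload column from the sketch above. One subtlety: the question's loop counts a 30-day window (compdate between e-29 and e), while the z > 30 cutoff above yields a 31-day window; change the cutoff to z > 29 if the two need to match exactly.

    # sanity check on a small slice (assumes dt$caseload exists from the join sketch above)
    some <- head(unique(dt$judge), 2)
    sub <- as.data.frame(dt[judge %in% some])
    sub$loop_count <- NA_integer_
    for (i in seq_len(nrow(sub))) {
      e <- sub$compdate[i]
      same <- sub$judge == sub$judge[i] & !is.na(sub$compdate)
      sub$loop_count[i] <- sum(same & sub$compdate <= e & sub$compdate >= e - 29)
    }
    all.equal(sub$loop_count, sub$caseload)   # TRUE once the window cutoffs agree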