[R] How to identify runs or clusters of events in time

Sat Jul 2 04:31:22 CEST 2016

See below

On Fri, 1 Jul 2016, Mark Shanks wrote:

> Hi,
>
>
> Imagine the two problems:
>
>
> 1) You have an event that occurs repeatedly over time. You want to 
> identify periods when the event occurs more frequently than the base 
> rate of occurrence. Ideally, you don't want to have to specify the 
> period (e.g., break into months), so the analysis can be sensitive to 
> scenarios such as many events happening only between, e.g., June 10 and 
> June 15 - even though the overall number of events for the month may not 
> be much greater than usual. Similarly, there may be a cluster of events 
> that occur from March 28 to April 3. Ideally, you want to pull out the 
> base rate of occurrence and highlight only the periods when the 
> frequency is less or greater than the base rate.
>

A good place to start is:

Siegmund, D. O., N. R. Zhang, and B. Yakir. "False discovery rate
for scanning statistics." Biometrika 98.4 (2011): 979-985.

and

Aldous, David. Probability approximations via the Poisson clumping 
heuristic. Vol. 77. Springer Science & Business Media, 2013.

---

A nice illustration of how scan statistics can be used is:

Aberdein, Jody, and David Spiegelhalter. "Have London's roads
become more dangerous for cyclists?." Significance 10.6 (2013):
46-48.
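
To make the idea concrete, here is a minimal sketch in R of a fixed-width
scan statistic for problem 1. It is not the method of any of the references
above, just the simplest version of the idea: slide a window of width w over
the event times, record the largest count, and compare it by simulation with
what a constant base rate would give. The function names (scan_max,
scan_pvalue), the window width, and the simulated data are all made up for
illustration. The null hypothesis assumed here is that the events are
scattered uniformly over the observation period.

```r
# Largest number of events falling in any window of width w.
scan_max <- function(times, w) {
  if (length(times) == 0) return(0L)
  times <- sort(times)
  # Any window can be slid right until it starts at an event without
  # losing events, so only windows [t_i, t_i + w] need to be checked.
  # findInterval() counts the events <= t_i + w; subtracting the (i - 1)
  # events before t_i leaves the count inside the window.
  counts <- findInterval(times + w, times) - seq_along(times) + 1L
  max(counts)
}

# Monte Carlo p-value: given the number of events, a constant-rate
# process puts them uniformly on [0, T_end].
scan_pvalue <- function(times, w, T_end, nsim = 999) {
  obs <- scan_max(times, w)
  sim <- replicate(nsim, scan_max(runif(length(times), 0, T_end), w))
  (1 + sum(sim >= obs)) / (nsim + 1)
}

# Made-up data: 40 background events over one year (in days) plus
# 8 extra events between days 161 and 166 (June 10 to June 15).
set.seed(1)
times <- c(runif(40, 0, 365), runif(8, 161, 166))
obs <- scan_max(times, w = 7)
p <- scan_pvalue(times, w = 7, T_end = 365)
```

The window width still has to be chosen. If you scan over several widths,
the simulation has to take the maximum over all of them as well, otherwise
the p-value is too optimistic.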

>
> 2) Events again occur repeatedly over time in an inconsistent way. 
> However, this time, the event has positive or negative outcomes - such 
> as a spot check of conformity to regulations. You again want to know 
> whether there is a group of negative outcomes close together in time. 
> This analysis should take into account the positive outcomes as well 
> though. E.g., if from June 10 to June 15 you get 5 negative outcomes and 
> no positive outcomes it should be flagged. On the other hand, if from 
> June 10 to June 15 you get 5 negative outcomes interspersed between many 
> positive outcomes it should be ignored.
>
>
> I'm guessing that there is some statistical approach designed to look at 
> these types of issues. What is it called?

`Scan statistic' is a good search term. `Poisson clumping', too.
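
For problem 2, a rough sketch of the same idea applied to a 0/1 sequence of
outcomes ordered in time: count the negative outcomes in every run of k
consecutive checks, take the largest count, and compare it with what you get
when the outcome labels are shuffled at random. This is only an illustration
under my own assumptions, not what geneRxCluster or any CRAN package does.
The names neg_scan and neg_scan_pvalue, the choice k = 5, and the data are
hypothetical.

```r
# Largest number of negative outcomes in any k consecutive checks.
# neg: 0/1 vector ordered by time, 1 = negative outcome.
neg_scan <- function(neg, k) {
  n <- length(neg)
  if (n < k) return(sum(neg))
  cs <- c(0, cumsum(neg))
  # Differences of cumulative sums give the window totals.
  max(cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)])
}

# Permutation p-value: shuffling keeps the overall proportion of
# negative outcomes fixed and only destroys their ordering in time.
neg_scan_pvalue <- function(neg, k, nsim = 999) {
  obs <- neg_scan(neg, k)
  sim <- replicate(nsim, neg_scan(sample(neg), k))
  (1 + sum(sim >= obs)) / (nsim + 1)
}

# Made-up data: 60 checks with 5 negative outcomes in each series.
clustered <- rep(0, 60)
clustered[30:34] <- 1                  # five negatives in a row
spread <- rep(0, 60)
spread[c(5, 17, 29, 41, 53)] <- 1      # same five, among many positives

set.seed(1)
p_clustered <- neg_scan_pvalue(clustered, k = 5)
p_spread <- neg_scan_pvalue(spread, k = 5)
```

Because the window is defined by consecutive checks rather than by calendar
time, negatives interspersed among many positives never pile up in one
window, which is the behaviour you asked for.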

> What package in R implements it? I basically just need to know where to 
> start.
>
>

There are some R packages.

CRAN has packages SNscan and graphscan, which sound like they 
might interest you.

My Bioconductor package geneRxCluster:

http://bioconductor.org/packages/release/bioc/html/geneRxCluster.html

seeks clusters in a binary sequence as described in detail at

http://bioinformatics.oxfordjournals.org/content/30/11/1493

HTH,

Chuck