[R] How to identify runs or clusters of events in time
Charles C. Berry
ccberry at ucsd.edu
Sat Jul 2 04:31:22 CEST 2016
See below
On Fri, 1 Jul 2016, Mark Shanks wrote:
> Hi,
>
>
> Imagine the two problems:
>
>
> 1) You have an event that occurs repeatedly over time. You want to
> identify periods when the event occurs more frequently than the base
> rate of occurrence. Ideally, you don't want to have to specify the
> period (e.g., break into months), so the analysis can be sensitive to
> scenarios such as many events happening only between, e.g., June 10 and
> June 15 - even though the overall number of events for the month may not
> be much greater than usual. Similarly, there may be a cluster of events
> that occur from March 28 to April 3. Ideally, you want to pull out the
> base rate of occurrence and highlight only the periods when the
> frequency is less or greater than the base rate.
>
Good places to start are:
Siegmund, D. O., N. R. Zhang, and B. Yakir. "False discovery rate
for scanning statistics." Biometrika 98.4 (2011): 979-985.
and
Aldous, David. Probability approximations via the Poisson clumping
heuristic. Vol. 77. Springer Science & Business Media, 2013.
---
A nice illustration of how scan statistics can be used is:
Aberdein, Jody, and David Spiegelhalter. "Have London's roads
become more dangerous for cyclists?." Significance 10.6 (2013):
46-48.
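
To make the idea concrete, here is a rough sketch of a fixed-width scan
statistic in base R. It is not taken from any of the references above.
The function name scan_max, the window width w, and the simulated event
times are all made up for illustration. It finds the largest number of
events in any window of width w and compares that with what you get
when the same number of events is scattered uniformly over the year
(i.e. a constant base rate).

```r
## Sketch only: fixed-width scan statistic with a Monte Carlo p-value.
## scan_max(), w, and the simulated data are hypothetical.
scan_max <- function(times, w) {
  times <- sort(times)
  ## for a window starting at each event, find the index of the
  ## last event that falls at or before (start + w)
  ends <- findInterval(times + w, times)
  ## number of events in [t_i, t_i + w], maximised over i
  max(ends - seq_along(times) + 1L)
}

set.seed(1)
n_days <- 365
## 60 background events over the year, plus 8 extra between day 161 and 166
times <- c(runif(60, 0, n_days), runif(8, 161, 166))

w <- 5                       # window width in days (an assumption)
obs <- scan_max(times, w)    # observed maximum count in any 5-day window

## Null distribution: condition on the total number of events and
## place them uniformly over the same period.
null <- replicate(999, scan_max(runif(length(times), 0, n_days), w))
p_value <- (1 + sum(null >= obs)) / (1 + length(null))
```

This still fixes the window width, which you said you want to avoid.
Scanning over several widths is possible, but then the multiplicity has
to be accounted for, which is what the papers above deal with.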
>
> 2) Events again occur repeatedly over time in an inconsistent way.
> However, this time, the event has positive or negative outcomes - such
> as a spot check of conformity to regulations. You again want to know
> whether there is a group of negative outcomes close together in time.
> This analysis should take into account the negative outcomes as well
> though. E.g., if from June 10 to June 15 you get 5 negative outcomes and
> no positive outcomes it should be flagged. On the other hand, if from
> June 10 to June 15 you get 5 negative outcomes interspersed between many
> positive outcomes it should be ignored.
>
>
> I'm guessing that there is some statistical approach designed to look at
> these types of issues. What is it called?
`Scan statistic' is a good search term. `Poisson clumping', too.
> What package in R implements it? I basically just need to know where to
> start.
>
>
There are some R packages.
CRAN has packages SNscan and graphscan, which sound like they
might interest you.
My Bioconductor package geneRxCluster:
http://bioconductor.org/packages/release/bioc/html/geneRxCluster.html
seeks clusters in a binary sequence as described in detail at
http://bioinformatics.oxfordjournals.org/content/30/11/1493
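
For your second problem, a much cruder sketch of the same idea in base
R (this is NOT the method geneRxCluster uses, just a simple permutation
check) would be to take the outcomes in time order, count the negatives
in every run of k consecutive checks, and compare the maximum with what
you get when the outcomes are shuffled. Positives interspersed among
the negatives dilute the window count, so they are taken into account.
The function name max_neg_run, the window length k, and the simulated
outcomes are hypothetical.

```r
## Sketch only: largest number of negatives in any k consecutive checks,
## with a permutation null. max_neg_run(), k, and the data are made up.
max_neg_run <- function(neg, k) {
  ## neg: 0/1 vector in time order (1 = negative outcome)
  cs <- cumsum(c(0, neg))
  ## sums over all windows of k consecutive checks
  max(cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)])
}

set.seed(2)
neg <- rbinom(200, 1, 0.1)   # 200 checks, about 10% negative
neg[101:105] <- 1            # plant 5 negatives in a row

k <- 5
obs_neg <- max_neg_run(neg, k)

## Null: same outcomes, assigned to the checks in random order
null_neg <- replicate(999, max_neg_run(sample(neg), k))
p_neg <- (1 + sum(null_neg >= obs_neg)) / (1 + length(null_neg))
```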
HTH,
Chuck