[R] Calculate daily means from 5-minute interval data
Rich Shepard
r@hep@rd @end|ng |rom @pp|-eco@y@@com
Tue Aug 31 23:11:05 CEST 2021
On Sun, 29 Aug 2021, Jeff Newmiller wrote:
> The general idea is to create a "grouping" column with repeated values for
> each day, and then to use aggregate to compute your combined results. The
> dplyr package's group_by/summarise functions can also do this, and there
> are also proponents of the data.table package which is high performance
> but tends to depend on altering data in-place unlike most other R data
> handling functions.
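
As a minimal sketch of the grouping idea quoted above, assuming a data frame shaped like the one in this thread: the date column already repeats once per reading, so it can act as the grouping column for base R's aggregate(). The rows for the second date are invented purely for illustration.

```r
# Toy data in the thread's layout; the 2020-08-27 rows are made up.
discharge <- data.frame(
  sampdate = as.Date(c("2020-08-26", "2020-08-26", "2020-08-27", "2020-08-27")),
  samptime = c("09:30", "09:35", "09:30", "09:35"),
  cfs      = c(136000, 126000, 130000, 128000)
)

# sampdate repeats for every reading taken on the same day, so it serves
# as the grouping column; aggregate() applies mean() within each date.
daily_mean <- aggregate(cfs ~ sampdate, data = discharge, FUN = mean)
print(daily_mean)
```

The formula interface drops rows with missing cfs by default, which may or may not be what is wanted for gappy gauge data.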
Jeff,
I've read a number of docs discussing dplyr's summarize and group_by
functions (including that section of Hadley's 'R for Data Science' book), yet
I'm missing something. I think that I need to separate the single sampdate
column into columns for year, month, and day, and then group_by year/month,
summarizing within those groups.
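
A hedged sketch of both groupings with dplyr, assuming sampdate is already a Date column. For daily statistics the Date column can be grouped on directly; splitting is only needed for monthly statistics, and format() is one way to derive the pieces. The toy values and the names mean_cfs, sd_cfs, by_day, year, and month are illustrative, not from the original script.

```r
library(dplyr)

# Toy data; the September rows are made up for illustration.
discharge <- data.frame(
  sampdate = as.Date(c("2020-08-26", "2020-08-26", "2020-09-01", "2020-09-01")),
  cfs      = c(136000, 126000, 130000, 128000)
)

# Daily summary: group on the Date column itself, no splitting required.
by_day <- discharge %>%
  group_by(sampdate) %>%
  summarize(mean_cfs = mean(cfs, na.rm = TRUE),
            sd_cfs   = sd(cfs, na.rm = TRUE),
            .groups  = "drop")

# Monthly summary: derive year and month as strings, then group on both.
by_month <- discharge %>%
  mutate(year  = format(sampdate, "%Y"),
         month = format(sampdate, "%m")) %>%
  group_by(year, month) %>%
  summarize(mean_cfs = mean(cfs, na.rm = TRUE),
            sd_cfs   = sd(cfs, na.rm = TRUE),
            .groups  = "drop")
```

Note that group_by() on its own only marks the groups; the result stays at one row per reading until summarize() is applied in the same pipeline.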
The data are of this format:
sampdate,samptime,cfs
2020-08-26,09:30,136000
2020-08-26,09:35,126000
2020-08-26,09:40,130000
2020-08-26,09:45,128000
2020-08-26,09:50,126000
2020-08-26,09:55,125000
2020-08-26,10:00,121000
2020-08-26,10:05,117000
2020-08-26,10:10,120000
My current script is:
-------8<--------------
library('tidyverse')
discharge <- read.table('../data/discharge.dat', header = TRUE, sep = ',', stringsAsFactors = TRUE)
discharge$sampdate <- as.Date(discharge$sampdate)
discharge$cfs <- as.numeric(discharge$cfs, length = 6)
# use dplyr.summarize grouped by date
# need to separate sampdate into %Y-%m-%d in order to group_by the month?
by_month <- discharge %>%
group_by(sampdate ...
summarize(by_month, exp_value = mean(cfs, na.rm = TRUE), sd(cfs))
---------------->8--------
and the results are:
> str(discharge)
'data.frame': 93254 obs. of 3 variables:
$ sampdate: Date, format: "2020-08-26" "2020-08-26" ...
$ samptime: Factor w/ 728 levels "00:00","00:05",..: 115 116 117 118 123 128 133 138 143 148 ...
$ cfs : num 176 156 165 161 156 154 144 137 142 142 ...
> ls()
[1] "by_month" "discharge"
> by_month
# A tibble: 93,254 × 3
# Groups:   sampdate [322]
   sampdate   samptime   cfs
   <date>     <fct>    <dbl>
 1 2020-08-26 09:30      176
 2 2020-08-26 09:35      156
 3 2020-08-26 09:40      165
 4 2020-08-26 09:45      161
 5 2020-08-26 09:50      156
 6 2020-08-26 09:55      154
 7 2020-08-26 10:00      144
 8 2020-08-26 10:05      137
 9 2020-08-26 10:10      142
10 2020-08-26 10:15      142
# … with 93,244 more rows
I don't know why the discharge values are truncated to 3 digits when they're
6 digits in the input data.
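
One possible explanation, offered as an assumption rather than a diagnosis: if the cfs column contains even one non-numeric entry, read.table() with stringsAsFactors = TRUE reads the whole column as a factor, and as.numeric() on a factor returns the integer level codes instead of the printed values. The sketch below reproduces that effect on a hypothetical column; the "Eqp" entry is invented to stand in for whatever non-numeric flag the real file might contain.

```r
# Hypothetical column: a single non-numeric entry ("Eqp") is enough for
# the column to be read as text and, with stringsAsFactors = TRUE, as a factor.
cfs_raw <- factor(c("136000", "126000", "Eqp", "130000"))

# as.numeric() on a factor yields the internal level codes (small
# integers), not the numbers shown in the labels.
codes <- as.numeric(cfs_raw)
print(codes)

# Going through character first recovers the real values; entries that
# are not numbers become NA (the warning is suppressed here).
values <- suppressWarnings(as.numeric(as.character(cfs_raw)))
print(values)
```

If this is the cause, reading the file with stringsAsFactors = FALSE and then checking which rows turn into NA after conversion should locate the offending entries. The length = 6 argument in the script does not appear to be something as.numeric() acts on.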
Suggested readings appreciated,
Rich