[R] Antwort: Re: R Script Template
G.Maubach at weinwolf.de
G.Maubach at weinwolf.de
Thu Apr 28 09:17:50 CEST 2016
Hi All,
please find enclosed the missing attachment.
Kind regards
Georg
-- cut --
#-[ Header ]
------------------------------------------------------------------
# Program : Framework for R scripts
# Author : Georg Maubach
# Date : 2016-03-03
# Update : 2016-04-27
# Description : Foundation for the analysis process
# Source System : R 3.2.5 (64 Bit)
# Target System : R 3.2.5 (64 Bit)
# Release : 1
# License : CC-BY-NC-SA
# File Name : 2016-04-27_Template_Scipt.R
#-------------------------------------------------------------------------------
#- [ Purpose of the document ]
------------------------------------------------
# This document provides a framework for a script able to handle real
world
# data throughout the complete analysis process. In each step examples or
# prototypes of needed or helpful commands are given. Chapters and
sections in
# this document can be regarded as a toolbox. The needed tools shall be
adapted
# to the processed data. Commands are ordered an a consistent way to
support the
# user to produce high quality output.
#-------------------------------------------------------------------------------
# - [ At hand ]
----------------------------------------------------------------
# help("function") # Extract or Replace Parts of an Object
# example("function") # Examples on "Extract"
# demo(package = .packages(all.available = TRUE)) # Show demos of packages
#-------------------------------------------------------------------------------
# - [ Editing Marks ]
----------------------------------------------------------
# %ROTA% : Result of the analysis in text form if needed to explain
further
# steps
# %ToDo% : ToDo's
#-------------------------------------------------------------------------------
# - [ Warrenty Disclaimer ]
----------------------------------------------------
# The software is provided "as-is". The author disclaims to the fullest
extent
# authorized by law any and all warranties, whether express or implied,
# including, without limitation, any implied warranties of merchantability
or
# fitness for a particular purpose. Without limitation of the foregoing,
the
# author expressly does not warrant that:
#
# (a) the software will meet your requirements or expectations;
# (b) the software or the software content will be free of bugs, errors,
# viruses or other defects;
# (c) any results, output, or data provided through or generated by the
software
# will be accurate, up-to-date, complete or reliable;
# (d) the software will be compatible with third party software;
# (e) any errors in the software will be corrected.
#-------------------------------------------------------------------------------
# - [ Limitation of Liability ]
------------------------------------------------
# In no event will the author be liable for any direct, indirect,
consequential,
# incidental, special, exemplary, or punitive damages or liabilities
whatsoever
# arising from or relating to the software, the software content or this
# agreement, whether based on contract, tort (including negligence),
strict
# liability or other theory, even if the author has been advised of the
# possibility of such damages.
#
# The use of the software goes to the whole risk of the user.
#-------------------------------------------------------------------------------
#--------1---------2---------3---------4---------5---------6---------7---------8
#-------#
# Setup #
#-------#
# Environment
# Please make sure that RTools is installed
Sys.getenv("R_ZIPCMD", "zip")
# needed for openxlsx::write.xlsx()
Sys.setenv(R_ZIPCMD= "C:/R-Project/Rtools/bin/zip")
.libPaths() # Install directory for libraries
# .libPaths("new path if needed")
# Workplace
sessionInfo() # Environment
list.files(R.home()) # Show R home directory
getwd() # Get working directory
list.dirs() # List directories in working directory
list.files() # List files in working directory
library() # List all installed packages
search() # List all loaded packages
ls() # List objects in environment
#-----------#
# Configure #
#-----------#
path <- file.path("path", "to","directory")
setwd(path) # Set working directory
options(width = 65) # Set output width
#---------#
# Install #
#---------#
available.packages()
# Desired packages
my_packages <- c(
"ctv" # Package to install packages based on themes
"data.table", # Fast manipulation of large datasets
"dplyr", # Data manipulation for data frames
"geoR",
"haven", # import data from stastical packages
"Hmisc",
"httr", # package to deal with HTTP requests
"installr", # Dependency of openxlsx::write.xlsx()
"lubridate",
"mapdata", # data for high-quality maps
"maps", # draw maps
"maptools", # import ESRI data
"memisc", # package data import and management for SPSS
"openxlsx", # Read and write Excel files
"reshape2", # Restructure data
"stringr",
"tidyr", # Data cleaning
"plotrix",
"plyr", # Data manipulation
# "rattle", # Be careful! rattel needs a different RGTK2 lib!
"RColorBrewer", # Install ColorBrewer color palettes
"Rcpp",
"rgdal", # Connect to GDAL, Mercator transformation
"RMySQL", # Replace by package for your database
"sp", # Draw maps, depends on grid and lattice
"sqldf", # Execute SQL queries on R datasets
"zoo" # Time series analysis
)
# Install
## EITHER: Install from CRAN
## install.packages(pkgs = my_packages, dependencies = TRUE)
## OR: Build local CRAN repository
### EITHER: Install miniCRAN from CRAN
### install.packages(pkgs = "miniCRAN", dependencies = TRUE)
### OR: Install miniCRAN from localhost
install.packages(pkgs =
"C:/Software/R-Project/CRAN/bin/windows/contrib/3.2/xml2_0.1.2.zip",
dependencies = TRUE,
repos = NULL)
install.packages(pkgs =
"C:/Software/R-Project/CRAN/bin/windows/contrib/3.2/miniCRAN_0.2.5.zip",
dependencies = TRUE,
repos = NULL)
library(miniCRAN)
### Determine the dependencies for desired packages
pkg_list <- pkgDep(pkg = my_packages, suggest = TRUE)
### Define a path to local repository
repo <- file.path("H:","2016","Software","R-Project", "CRAN")
# Create repository
# Internet connection required
# If no internet connection available download at machine with internet
# connection and transfer the downloaded repository manually to the target
# machine.
makeRepos(pkgs = pkg_list, path = repo, type = "win.binary")
repo_path = paste0("file:", repo)
install.packages(pkgs = pkg_list,
dependencies = TRUE,
repos = repo_path)
# Install packages bases on themes
# see (*6) p. 26
install.packages("ctv")
library(ctv)
install.views("SocialSciences")
# update.packages(ask = FALSE) # Update (if necessary)
# vignette(all = TRUE) # Show vignettes of all installed
packages
#--------#
# Import #
#--------#
# rm(list = ls()) # Clear workplace (if necessary)
# Comment out if handed out to someone else
# for not deleting somebody else's workplace
# Load data
path <- file.path("path", "to", "filename")
load(path)
# Import data
# see (*10)
path <- file.path("path", "to", "filename")
## From Spreadsheets, e. g. Microsoft Excel
library(readxl)
### Read only one Excel sheet
path <- file.path("path", "to", "filename")
sheet1 <- read_excel(path, sheet = 1)
sheet2 <- read_excel(path, sheet = 2)
sheets <- excel_sheets(path)
first_sheet <- read_excel(path, sheet = "first_sheet")
first_sheet <- read_excel(path, sheet = "second_sheet")
sheets <- excel_sheets(path)
first_sheet <- read_excel(path,
sheet = sheets[1],
col_names = TRUE | FALSE | c("name1", "name2",
...),
skip = n)
columns <- c("Column_Name_1", paste0("Column_Name_", 2:n))
first_sheet <- read_excel(path,
sheet = sheets[2],
col_names = columns,
col_types = c(NULL | "numeric" | "text" | "date"
| "blank"))
### Read all Excel sheets within a workbook
### using lapply creating a list of data.frames
path <- file.path("path", "to", "filename")
my_workbook <- lapply(excel_sheets(path),
read_excel,
path = path)
detach("package:readxl")
library(XLConnect)
### Create a workbook object
path <- file.path("path", "to", "filename")
book <- loadWorkbook(filename = path)
sheets <- getSheets(book)
data <- readWorksheet(book,
sheet = 1 | "sheet name",
header = TRUE | FALSE,
startCol = n,
startRow = n,
endRow = n)
df <- data.frame("a data.frame")
createSheet(book, "sheet name")
writeWorksheet(book, df, "sheet_name")
saveWorkbook(book, file = path)
detach("package:XLConnect")
## From other statistical packages
### With haven
### From SAS
#### From SPSS
library(haven)
path <- file.path("path", "to", "filename")
dataset <- read_sav(path = path)
#### Convert the labelled class from SPSS to factor in R
dataset$variable <- haven::as_factor(dataset$variable)
detach("package:haven")
#### From STATA
### With foreign
library(foreign)
#### From SAS
#### From SPSS
dataset <- read.spss(file = path,
to.data.frame = TRUE,
# convert labelled variables to factors
use.value.labels = TRUE)
#### From STATA
dataset <- foreign::read.dta(file = path,
convert.factors = FALSE
convert.underscore = FALSE)
detach("package:foreign")
## From Databases
library("DBI") # library(RMySQL) not required
con <- dbConnect(drv = RMySQL::MySQL(),
dbname = "database name",
host = "hostname",
port = port number,
user = "username",
password = "password")
tableList <- dbListTables(conn = con)
### Read entire database table
table <- dbReadTable(con,
name = "table name")
### Read a selection respectively a subset of data from a database table
selection <- dbGetQuery(con,
statement = "SELECT col_name
FROM table_name
WHERE col_name > some_condition")
### Example 1
products <- dbReadTable(conn = con,
name = "Products")
products_selection <- subset(products,
subset = contract == 1)
### Example 2
### Produces the same result as Example 1
### but is more efficient because the subsetting is done in the database
### and only the needed entries are read into R.
products_selection <- dbGetQuery(conn = con,
statement = "SELECT *
FROM products_selection
WHERE contract = 1")
### Example 3
### Read a database table one chunk at a time
res <- dbSendQuery(conn = con,
statement = "SELECT *
FROM products
WHERE contract = 1")
#### The data is stored in temporary file
while(!dbHasCompleted(res)) {
chunk <- dbFetch(res, n = 1) # n can have any suitable value
print(chunk) # work with chunk of data in any suitable
way
}
dbClearResult(res) # deletes the file temporarily created by dbSendQuery
### Always disconnect from the database
dbDisconnect(conn = con)
detach("packages:DBI")
## From the web
## Check if the import function can access web sites
## right away using e. g. the file argument
## Downloading a file to you local machine
url <- "http://machine.server.com/path/to/filename"
destination <- file.path("path", "to", "filename")
download.file(url = url,
destfile = destination)
### Import the locaclly stored file with the known import functions
## With httr
library(httr)
url <- "http://machine.server.com/path/to/filename"
response <- GET(url)
content <- content(x = response, as = data.frame)
detach("package:httr")
## With jsonlite
library("jsonlite")
fromJSON("string")
### JSON object: unordered collection of name:value pairs
### name = string
### value = string | number | boolean | null | JSON object | JSON array
### JSAN array: ordered sequence of objects
detach("packages:jsonlite")
# Build datasets from data
## Rename variables
names(dataset) <- new_colnames
names(dataset)[names(kino == "variable name")] <- "new variable name"
## Sort variables
dataset2 <- dataset[sort(names(dataset))]
## Sort cases
## see (*1) p. 333ff
## Sorting can only be done for numeric variables.
### Save the original order
#### Row numbers are stored as characters.
#### Thus type conversion is necessary.
dataset$orig_order <- as.numeric(row.names(dataset))
### Missing values are placed at the end by default
#### na.last = FALSE places missing values at the beginning
#### na.last = NA removes missing values from sorted data
#### Order is ascending by default.
#### Reverse order is only available for numeric variables
#### and done using a minus sign ("-") before each variable.
dataset2 <- order(-as.numeric(dataset$gender), # descending gender
dataset$age) # ascending age
# Save data
path <- file.path("path", "to", "filename")
save.image(path)
# Match datasets
## Merge only two datasets at a time
## see (*1) p. 288
dataset3 <- merge(dataset1, dataset2,
# use by if the variable names match
by.x = id_variable_first_dataset,
by.y = id_variable_second_dataset,
# use all = TRUE if both datasets deliver cases
all.x = TRUE, all.y = TRUE)
## Merge two or more datasets at a time
## %ToDo%
# Check
xlsx_check <- "Projectname_Checks.xlsx"
wb <- createWorkbook()
addWorksheet(wb,
sheetName = "Import")
writeData(wb,
sheet = "Import",
x = dataset)
saveWorkbook(wb,
file = xlsxCheck,
overwrite = TRUE)
# Create Recovery Point
path <- file.path("Path", "to", "file", "Projectname_Import.RData")
save(dataset, file = path)
#----------#
# Cleaning #
#----------#
# Recover data from previous section
path <- file.path("Path", "to", "file", "Projectname_Import.RData")
load(file = path)
## 1. Inspect raw data
### Get to know the structure of datasets
### see (*6) p. 59
class()
dim()
names()
str()
dplyr::glimpse()
summary()
memisc::codebook(dataset)
### Get to know the data within datasets
### see (*6) p. 59
### see (*8)
head(dataset, n = 10)
tail(dataset, n = 10)
#print()
summary()
Hmisc::describe(dataset)
hist()
plot()
## see (*9)
## 2. Tidy the data according to the principals of tidy data
library(tidyr)
### a) Ensure observations/values are in rows
tidyr::gather()
### b) Ensure variables are in columns
tidyr::spread()
### c) each observation type is stored in its own dataset
tidyr::separate()
### d) Each table is one type of observational unit
detach("package:tidyr")
## 3. Type conversions
### Numericals
library(dplr)
dataset2 <- dplyr::mutate_each(dataset1, funs(as.numeric); var1:varX)
detach("package:dplr")
### Strings
library(stringr)
stringr::str_replace()
stringr::unite()
detach("package:stringr")
### Dates
library(lubridate)
lubridate::ymd(dataset$dateString)
detach("package:lubridate")
## 4. Missing value analysis and handling
### Missing value analysis
### see (*3)
any(is.na()) # Showing if observations contain NA
sum(is.na(variable)) # Show number of observations system missing values
sum(variable == -999, na.rm = TRUE) # Count the occurrence of -999 (omit
NA)
sum(variable %in% c(-998,-999)) # Count multiple user missing values
summary() # Check summary result for NA
plot(variable) # Spot missings graphically
table(factor(variable)) # Spot missings in contingency table
sum(!complete.cases(dataset)) # Count complete cases
which(!complete.cases(dataset)) # Show incomplete cases
cases_with_na <- which(is.na(dataset$variable)) # find indices of cases
with NA
dataset[cases_with_na, ] # Look at the full rows for records having
missings
#### Missing value handling
##### Replace missing value in new variable
dataset1$new_variable <- dataset1$variable_with_na
dataset$new_variable[cases_with_na] <- new_value
##### Replace missing value in new dataset
dataset2 <- dataset1
dataset2$variable_with_na[cases_with_na] <- new_value
##### Recode missing values
##### see (*3)
variable[variable == -999] = NA # Recode all -999 as NA
variable[is.na(x)] = -999 # Recode all NA in x as -999
variable[variable %in% c(-998,-999)] = NA # Recode any -998 or -999 as NA
variable[variable %in% -990:-999] = 0 # Recode any value between -990
and
# -999 as 0
#### Keep only cases without any missings
dataset2 <- dataset1[complete.cases(dataset1), ]
dataset2 <- na.omit(dataset1)
# Attention:
# Watch for symbols for missing values from
# a) external sources:
# #N/A (Excel)
# . (SPSS/SAS)
# empty string
# b) internal sources
# Inf (infinite value, e. g. 1/0)
# NaN (not a number, e. g. 0/0)
## 5. Identify and correct errors
### Outliers
summary(dataset)
hist(dataset)
boxplot(dataset)
cases_with_data_error <- which(dataset$variable == error_value)
dataset[cases_with_data_error, ]
dataset$variable[cases_with_data_error] <- correct_value
## 6. Visualize the results of data cleaning
summary(dataset)
head(dataset, n = 20)
hist(dataset)
plot(dataset$variable1, dataset$variable2)
boxplot(dataset$variable)
# Check
xlsx_check <- "Projectname_Checks.xlsx"
wb <- createWorkbook()
addWorksheet(wb,
sheetName = "Cleaning")
writeData(wb,
sheet = "Cleaning",
x = dataset)
saveWorkbook(wb,
file = xlsxCheck,
overwrite = FALSE)
# Create Recovery Point
path <- file.path("Path", "to", "file", "Projectname_Cleaning.RData")
save(dataset, file = path)
#---------#
# Prepare #
#---------#
# Recover data from previous section
path <- file.path("Path", "to", "file", "Projectname_Cleaning.RData")
load(file = path)
# Delete variables
reduced_dataset <- dataset$variable <- NULL # delete a variable
reduced_dataset <- dataset[, -c(index_of_variables)] # delete variables
# Create new variables
dataset2 <- data.frame(dataset1, new_variable)
dataset2 <- cbind(dataset, new_variable)
# Recode 1: Manually
## see (*2) p. 87ff
dataset$recoded <- (dataset$variable <=17) * 1 +
(dataset$variable >18 & dataset$variable <= 30) * 2 +
(dataset$variable >30 & dataset$variable <= 65) * 3 +
(dataset$variable <65) * 4
## see (*1) p. 378ff
dataset$recoded <- factor(dataset$recoded,
levels = c(1, 2, 3, 4),
labels = c("Pupils", "Young Professionals",
"Professionals", "Retired"))
# Handling of character vectors
# see (*4)
# x = variable
tolower(x) # converts x to all lower case
toupper(x) # converts x to all upper case
nchar(x) # a vector of the lengths of each value
paste(a,b,sep="_") # concatenates character values
substr(x,start,stop) # extract characters from positions start to
stop
strsplit(x,split) # split each value of x into a list of
strings
# using split as the delimiter
grep(pattern,x) # return a vector of the elements that
included
# pattern
grepl(pattern,x) # returns a logical vector indicating whether
# each element of x contained pattern
regexpr(pattern,x) # returns the integer positions of the first
# occurrence of pattern in each element of x
gregexpr(pattern,x) # returns a list of the integer positions of
all
# of the occurrences of pattern in each value
of x
gsub(pattern,replacement,x) # replaces each occurrence of pattern with
occurrence
# see (*5)
match() # compares two vectors, can be also numeric
pmatch() # compares parts of two vectors, can be also
numeric
# Creating variables on the fly
# see (*7)
str(Kunden01)
for (year in 2011:2015) {
Reeller_Kunde <- paste0("Reeller_Kunde_", year)
Umsatz <- paste0("Umsatz_", year)
cat('Creating', Reeller_Kunde,'from', Umsatz,'\n')
Kunden01[[ Reeller_Kunde ]] <- ifelse( Kunden01[[ Umsatz ]] <= 0, 1, 2)
Kunden01[[ Reeller_Kunde ]] <- factor( Kunden01[[ Reeller_Kunde ]],
levels=c(1,2),
labels= c("NICHT kaufend",
"kaufend")
)
}
str(Kunden01)
# Sort dataset
dataset2 <- dataset[order(dataset$var_1_to_be_sorted_by,
dataset$var_2_to_be_sorted_by), ]
# Save data
## As R data file
path <- file.path("path", "to", "filename")
save.image(path)
## As Excel file
library(XLConnect)
### Create a workbook object
path <- file.path("path", "to", "filename")
book <- loadWorkbook(filename = path)
df <- data.frame("a data.frame")
createSheet(book, "sheet name")
writeWorksheet(book, df, "sheet_name")
saveWorkbook(book, file = path)
detach("packages:XLConnect")
# Check
xlsx_check <- "Projectname_Checks.xlsx"
wb <- createWorkbook()
addWorksheet(wb,
sheetName = "Preparation")
writeData(wb,
sheet = "Cleaning",
x = dataset)
saveWorkbook(wb,
file = xlsxCheck,
overwrite = FALSE)
# Create Recovery Point
path <- file.path("Path", "to", "file", "Projectname_Preparation.RData")
save(dataset, file = path)
#---------#
# Analyse #
#---------#
# Recover data from previous section
path <- file.path("Path", "to", "file", "Projectname_Preparation.RData")
load(file = path)
library(dplyr)
# Combination of group_by() and mutate() creates new variables
# within each group.
# see (*8)
# If mutate() uses the rank() function within-group rankings
# are calculated.
# Example:
# Filter ArrDelay, group by carrier, create a mean by carrier,
# rank this new mean and then sort the carriers based on the
# ranking.
# Combination of arrange() and rank() ranks the values within-groups
# from the largest to the smallest.
dataset %>%
filter(!is.nat(var1) & var1 > 0) %>%
group_by(var2) %>%
summarise(avg = mean(var1)) %>%
mutate(rank = rank(avg)) %>%
arrange(rank)
detach("package:dplr")
# References
# (*1) Muenchen: R for SAS and SPSS Users, 2. Ed., New York, 2011
# (*2) Hain: Statistik mit R, 1. Ed., Hannover, 2011
# (*3) Allerhand: R Programming, Essential Functions, Missing Values
# (http://forums.psy.ed.ac.uk/R/P01582/essential-10/)
# (*4) Philippi: Data Manipulation in R
# (
http://science.nature.nps.gov/im/datamgmt/statistics/r/fundamentals/manipulation.cfm
)
# (*5) Spector: Introduction to R
# (https://www.stat.berkeley.edu/~spector/Rcourse.pdf)
# (*6) Manderscheid: Sozialwissenschaftliche Datenanalyse mit R, 1.
Aufl.,
# Wiesbaden, 2012
# (*7) MacQueen: Creating Variables on the Fly
# (
http://r.789695.n4.nabble.com/Creating-variables-on-the-fly-td4720034.html
)
# (*8) Grolemund: Data Manipulation in R with dplyr in: Datacamp.com
# (
https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial)
# (*9) Carchedi: Cleaning Data in R, in: Datacamp.com
# (https://www.datacamp.com/courses/cleaning-data-in-r)
# (*10) Schouwenaars: Importing Data into R, in: Datacamp.com
# (https://www.datacamp.com/courses/importing-data-into-r)
# EOF
Von: Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
An: G.Maubach at gmx.de, r-help at r-project.org,
Datum: 27.04.2016 22:23
Betreff: Re: [R] R Script Template
Gesendet von: "R-help" <r-help-bounces at r-project.org>
The subject of your email is missing. Perhaps you need to read the Posting
Guide (again?) about attachments. Embedding your example directly in the
body of the email is generally more accessible in archives than attaching
it.
--
Sent from my phone. Please excuse my brevity.
On April 27, 2016 1:14:17 PM GMT+01:00, G.Maubach at gmx.de wrote:
>Hi All,
>
>I am addressing this post to all who are new to R.
>
>When learing R in the last weeks I took some notes for myself to have
>code snippets ready for the data analysis process. I put these snippets
>
>together as a script template for future use. Almost all of the given
>command prototypes are tested. The template script contains snippets
>for best practices and leaves out the commands that should not be used.
>Relying on the given snippets shall lead to high quality code.
>
>The code is based on examples from the ressources given in the
>template. I highly recommend to read the books or take the online
>courses to see how everything works and fits together.
>
>Despite putting everything together with care, the script is provided
>as-is with no warrenty or liability whatsoever.
>
>Please address any remarks or suggestions for improvement to the R-Help
>mailing list.
>
>Kind regards
>
>Georg
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list