Getting to know R: a very brief introduction

August 24, 2018
Beginning R R Studio Sampling Data Management and Visualization

Getting to know R: a very brief introduction

What is R?

R is an open-sourced computer language that allows you to conduct data analysis and programming It is extremly versatile with 13317 packages that allow you to do pretty do anything, from spatial statistics to data wrangling to publication graphics. In fact, this document was written with R and Rstudio.

A reason for its acceptance among researchers is its versatility.

R can be used as an object-oriented programming language, or as a statistical environment where functions can be run line-by-line to analyze data.

Why R?

  • R runs on Windows, Mac-OS, and Unix operating systems
  • R provides a vast number of useful statistical tools (and you can write your own)
  • R produces publication-quality graphics in a variety of formats
  • R allows you to generate documents and html
  • R can be intergrated with many other programming languages.
  • R scales, making it useful for small and large projects.
  • There is a huge user base with lots of blogs and answers for your guarenteed many, many questions

Why not R?

  • R can do a lot but it cannot do everything.
  • There is a decent learning curve
  • There have been many improvements over the years but some of the documentation can be opaque.
  • R will make you want to throw your computer through the wall at some point.
  • Contributed packages go through a lot of testing but some are better and more reliable than others
  • R steers clear of point-and-click analysis

Where can I get R?

It is best to download R from CRAN (Comprehensive R Archive Network). It can be installed on Linux, Windows, or Mac operating systems. The current stable version will be available following the link to your operating system. If you are installing on a Windows machine, you will want to select from the base link.

How do I run R?

There are many ways that you can run R (terminal to GUIs to IDEs). You can run R from its basic GUI, while OK it does not have some of the bells and whistles other systems might have.

My favorite IDE is RStudio RStudio truly makes R easier to use. It includes a code editor with syntax highlighting, debugging, and visualization tools.

Using R, you have the options of running code from the console or by saving and reusing scripts. I tend to only use the console for one off things, like pulling up a help page. I keep all of my code in scripts which allows me to reuse code and run things again at a later time. When writing a report or a publication for peer review, you are almost always asked to report something different or revise previous analysis so it is best to work from the script.

How do I learn R?

I will come right out and say it. R is not easy. You will not be able to half ass it and be able to understand R. I came from a background of working with SAS and Matlab and there are still things I do not fully understand. I started learning R in 2007 as a postdoc at the University of Minnesota and it took me a while. I believe the best strategy is immersion. Do not give yourself an out (i.e., use point-and-click propiertory software) and struggle. Over time you will begin to recognize the nuances of the language and how to use it to its fullest ability (even if this requires hacks and work-arounds). I tell students frequently that learning R is like learning a foreign language. How do you best learn that language? Immersion. You go to that country and live there, forcing you to learn how it is. R is no different.

There are three areas that I know I struggled with in R and know lots of students (used in the loose sense referring to anyone trying to learn R) that have found these things problems similarly.

  1. Getting data into R
  2. Understanding the structure of your data (e.g., selecting columns and rows)
  3. Understanding the error messages

I think if you can develop an understanding of these three things, you will be able to get around most of what occurs in R. We will spend more time developing a greater understanding of these areas in subsequent lessons.

How do I get help in R?

The beauty of R and its large and diverse user base is that there is a lot of help out there for you. There are sources that tend to be more prickley (e.g., R mailing lists), to moderatly so (e.g., stackoverflow), and most forgiving (e.g., google groups).

There tend to be three areas that I think students learning R get tripped up on when trying to find help.

  • The first is a general “How do I do so and so in R?” and my solution to that is simple. Call up your favorite search engine and do a search for your question (and add “in R”). For example, to search how to do a linear regression in R, I would pull up google and search “linear regression and R.” Generally, your search results will include tutorials, packages, or questions posted on the online help sites.
  • The second is “how do I use this function in R?” Like the first bullet you can always do a search and come up with an answer. There are several ways to find help within R for a particular function. help(function_name) or ?function_name will bring up the documentation on the particular function you are interested in. This is where I almost always start when I am having issues with a particular function because the pages called will have the input parameters and various options.
  • The third is “What does this warning or error mean?” Part of this will come as you spend more and more time learning the R language but everyonce in a while you will get an error that you have no clue. Very often I will copy that error from the console and past it into google.

Generally when you are trying to help, whether you are using an online source or the R guru down the hall, it important to do your homework ahead of time. The key to getting good feedback no matter from what source is to 1) Google search the issue. It is more than likely that someone else has already had the issue, 2) do not make it seem like you are asking questions about a homework assignment, and 3) provide a reproducible example.

I think going through the process of trying to make a reproducible example ends up helping you solve your issue before you even post to one of the help sites.

Some basics of R

Laying out your R scripts

Whenever I start a new R file I always have several things at the very top of your script. This will help you understand what was done in the file and improves the reproducibility of your analysis.

  1. Title of your R script or a basic comment of what is done
    • This will help you later on when you are trying to go back to the analysis.
  2. The date that your script was most recently edited.
  3. Your name.

Comments

These all can be created using comments in R. Comments are designated as # and anything to the right of them will not be run through R. These are a great way to annotate your code and help you understand what your are doing in your analysis.

# This is a comment symbol.  
 2 + 2 # Anything to the right of the symbol will not be run in the analysis
## [1] 4

TIP If you are using RStudio you can comment on and off blocks of code using control-shift-c.

So using comments at the top of my script, I would write:

## Getting to know R
## 8-26-16
## Chris Chizinski

Options

The next thing that I like to lay out on the top are my options. Option allow the user to set global options that affect how R computes and displays the results. While there are many of them, the one that I conistently have is stringsAsFactors set to FALSE. We will go over factors in a bit, but I like to retain control when factors are set (helps in data manipulation and analysis).

## Getting to know R
## 8-26-16
## Chris Chizinski

options(stringsAsFactors = FALSE)

File paths

Now there are a couple of different ways you can represent directory structures in R. If you have a PC you can represent directory seperators as \\ or as /. In Mac you can only use the /. I like to use the / to be consistent. The entire path needs to be enclosed within quotation marks.

## Getting to know R
## 8-26-16
## Chris Chizinski

options(stringsAsFactors = FALSE)
#file.list("/Users/cchizinski2/Documents/DataDepot/RClass")  # PC
#file.list("\\Users\\cchizinski2\\Documents\\DataDepot\\RClass")  # PC
file.list("/Users/cchizinski2/Documents/DataDepot/RClass")  # Mac

Packages

Packages in the most simpliest of descriptions are a collection of functions that can be called upon to run your analysis. Most packages can be downloaded straight from CRAN within Rstudio by running install.packages(<PACKAGENAME>). Many packages can be held in repositories and will need the devtools package.

Setting a standard library path is important when working with R. I find that after updating your R version or working from multiple computers (keeping your pacakges on a flash drive or network drive) saves a lot of time from having to reinstall R packages. R packages are collections of R functions, data, and compiled code in a well-defined format. You can define the directory where these are stored using .libPaths(). I like to use the personal network drive so that whenever I log in to a computer at UNL that drive is available.

## Getting to know R
## 8-26-16
## Chris Chizinski

options(stringsAsFactors = FALSE)
.libPaths("P:/RLibrary")

Once I have set my library path, I like to list all the R packages that I will be using in the subsequent code. There are a few reasons that I like doing this:

  • all the packages are loaded at the top so if I need to jump down the script and run a specific function, I do not need to search around and find where I loaded the appropriate library
  • I can make sure that if there are conflicts with packages, that they are run in the appropriate order (i.e., making sure things are reproducible)

For example, if you were to try to run the dplyr and the plyr packages to do data manipulation, then the order that these are loaded can affect the functionality of the functions.

library(dplyr)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact

Types of data

Numerical data

# Write a value to an object

a = 1  # You can use the equal or the arrow
a <- 1  # In RStudio, key board short cut is 'Alt' and '-' for PC or 'Option' and '-'
a # Display the object by typing it, will display the values
## [1] 1
class(a) # display the class of data; base R
## [1] "numeric"
glimpse(a) # display using tidyverse
##  num 1
str(a) # base R
##  num 1
a <- as.integer(a) # convert to integer
glimpse(a) 
##  int 1
b <- 1.223  #

glimpse(b)
##  num 1.22
A <- 1 # Case sensitive
a <- 2
A == a
## [1] FALSE
a <- 4 # Careful when naming objects, R will use the last assignment your provide

You can use objects

a + b # Always remember to assign values to an object, otherwise it is just displayed
## [1] 5.223
c <- a + b # writign a + b to object c

Working with characters

d <- "dog"
e <- "cat"

e <- elephant # quotations are really important
## Error in eval(expr, envir, enclos): object 'elephant' not found
f <- "see spot run"  # strings
f
## [1] "see spot run"
class(f)
## [1] "character"

Factors

g <- c("a", "b", "c", "defe") # create a vector
g
## [1] "a"    "b"    "c"    "defe"
g <- as.factor(g)
as.numeric(g)
## [1] 1 2 3 4
as.numeric(c("a", "b", "c", "defe"))
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
# Setting levels of a factor

g <- c("a", "b", "c", "defe") # create a vector
g
## [1] "a"    "b"    "c"    "defe"
g <- factor(g, levels = c("defe", "c", "b", "a"))
g     
## [1] a    b    c    defe
## Levels: defe c b a
levels(g)
## [1] "defe" "c"    "b"    "a"
# ordered data 

g <- c("a", "b", "c", "defe") # create a vector
g
## [1] "a"    "b"    "c"    "defe"
g <- ordered (g, levels = c("defe","b", "c", "a"))
g
## [1] a    b    c    defe
## Levels: defe < b < c < a

Data structures

# Vectors
# Vectors have one dimensions
g <- c("a", "b", "c", "defe")
length(g) # number of items in a vector
## [1] 4
#  Matrices 
# Matrices have two dimensions and are one type of data (numeric or character)
h <- matrix(c(1,2,3,4,5,6,7,8), ncol = 2, byrow=TRUE) 
h
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
dim(h) # dimensions
## [1] 4 2
nrow(h) # number of rows
## [1] 4
ncol(h) # number of column
## [1] 2
length(h) # number of items in a matrix
## [1] 8
# Subsetting data

h[2,2] # display the item on the second row and second column
## [1] 4
h[3,2] <- "7"  # changes the matrix to character

## data frames
## Dataframes are the typical tabular form of data that you will work with in R.  It can have multiple types
# of data represented but must be identical within a column

i <- data.frame(Col1 = c(1,2,3,4), Col2 = c("A", "B", "E", "F"))
glimpse(i)
## Observations: 4
## Variables: 2
## $ Col1 <dbl> 1, 2, 3, 4
## $ Col2 <chr> "A", "B", "E", "F"
as.data.frame(as.matrix(i))
##   Col1 Col2
## 1    1    A
## 2    2    B
## 3    3    E
## 4    4    F

Getting to know R: more about data types

August 31, 2018
Beginning R Data Structures Sampling Data Management and Visualization
comments powered by Disqus