- Big Data Analytics Tutorial
- Big Data Analytics - Home
- Big Data Analytics - Overview
- Big Data Analytics - Characteristics
- Big Data Analytics - Data Life Cycle
- Big Data Analytics - Architecture
- Big Data Analytics - Methodology
- Big Data Analytics - Core Deliverables
- Big Data Adoption & Planning Considerations
- Big Data Analytics - Key Stakeholders
- Big Data Analytics - Data Analyst
- Big Data Analytics - Data Scientist
- Big Data Analytics Project
- Data Analytics - Problem Definition
- Big Data Analytics - Data Collection
- Big Data Analytics - Cleansing data
- Big Data Analytics - Summarizing
- Big Data Analytics - Data Exploration
- Big Data Analytics - Data Visualization
- Big Data Analytics Methods
- Big Data Analytics - Introduction to R
- Data Analytics - Introduction to SQL
- Big Data Analytics - Charts & Graphs
- Big Data Analytics - Data Tools
- Data Analytics - Statistical Methods
- Advanced Methods
- Machine Learning for Data Analysis
- Naive Bayes Classifier
- K-Means Clustering
- Association Rules
- Big Data Analytics - Decision Trees
- Logistic Regression
- Big Data Analytics - Time Series
- Big Data Analytics - Text Analytics
- Big Data Analytics - Online Learning
- Big Data Analytics Useful Resources
- Big Data Analytics - Quick Guide
- Big Data Analytics - Resources
- Big Data Analytics - Discussion
Big Data Analytics - Introduction to R
This section is devoted to introduce the users to the R programming language. R can be downloaded from the cran website. For Windows users, it is useful to install rtools and the rstudio IDE.
The general concept behind R is to serve as an interface to other software developed in compiled languages such as C, C++, and Fortran and to give the user an interactive tool to analyze data.
Navigate to the folder of the book zip file bda/part2/R_introduction and open the R_introduction.Rproj file. This will open an RStudio session. Then open the 01_vectors.R file. Run the script line by line and follow the comments in the code. Another useful option in order to learn is to just type the code, this will help you get used to R syntax. In R comments are written with the # symbol.
In order to display the results of running R code in the book, after code is evaluated, the results R returns are commented. This way, you can copy paste the code in the book and try directly sections of it in R.
# Create a vector of numbers numbers = c(1, 2, 3, 4, 5) print(numbers) # [1] 1 2 3 4 5 # Create a vector of letters ltrs = c('a', 'b', 'c', 'd', 'e') # [1] "a" "b" "c" "d" "e" # Concatenate both mixed_vec = c(numbers, ltrs) print(mixed_vec) # [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"
Let’s analyze what happened in the previous code. We can see it is possible to create vectors with numbers and with letters. We did not need to tell R what type of data type we wanted beforehand. Finally, we were able to create a vector with both numbers and letters. The vector mixed_vec has coerced the numbers to character, we can see this by visualizing how the values are printed inside quotes.
The following code shows the data type of different vectors as returned by the function class. It is common to use the class function to "interrogate" an object, asking him what his class is.
### Evaluate the data types using class ### One dimensional objects # Integer vector num = 1:10 class(num) # [1] "integer" # Numeric vector, it has a float, 10.5 num = c(1:10, 10.5) class(num) # [1] "numeric" # Character vector ltrs = letters[1:10] class(ltrs) # [1] "character" # Factor vector fac = as.factor(ltrs) class(fac) # [1] "factor"
R supports two-dimensional objects also. In the following code, there are examples of the two most popular data structures used in R: the matrix and data.frame.
# Matrix M = matrix(1:12, ncol = 4) # [,1] [,2] [,3] [,4] # [1,] 1 4 7 10 # [2,] 2 5 8 11 # [3,] 3 6 9 12 lM = matrix(letters[1:12], ncol = 4) # [,1] [,2] [,3] [,4] # [1,] "a" "d" "g" "j" # [2,] "b" "e" "h" "k" # [3,] "c" "f" "i" "l" # Coerces the numbers to character # cbind concatenates two matrices (or vectors) in one matrix cbind(M, lM) # [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] # [1,] "1" "4" "7" "10" "a" "d" "g" "j" # [2,] "2" "5" "8" "11" "b" "e" "h" "k" # [3,] "3" "6" "9" "12" "c" "f" "i" "l" class(M) # [1] "matrix" class(lM) # [1] "matrix" # data.frame # One of the main objects of R, handles different data types in the same object. # It is possible to have numeric, character and factor vectors in the same data.frame df = data.frame(n = 1:5, l = letters[1:5]) df # n l # 1 1 a # 2 2 b # 3 3 c # 4 4 d # 5 5 e
As demonstrated in the previous example, it is possible to use different data types in the same object. In general, this is how data is presented in databases, APIs part of the data is text or character vectors and other numeric. In is the analyst job to determine which statistical data type to assign and then use the correct R data type for it. In statistics we normally consider variables are of the following types −
- Numeric
- Nominal or categorical
- Ordinal
In R, a vector can be of the following classes −
- Numeric - Integer
- Factor
- Ordered Factor
R provides a data type for each statistical type of variable. The ordered factor is however rarely used, but can be created by the function factor, or ordered.
The following section treats the concept of indexing. This is a quite common operation, and deals with the problem of selecting sections of an object and making transformations to them.
# Let's create a data.frame df = data.frame(numbers = 1:26, letters) head(df) # numbers letters # 1 1 a # 2 2 b # 3 3 c # 4 4 d # 5 5 e # 6 6 f # str gives the structure of a data.frame, it’s a good summary to inspect an object str(df) # 'data.frame': 26 obs. of 2 variables: # $ numbers: int 1 2 3 4 5 6 7 8 9 10 ... # $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ... # The latter shows the letters character vector was coerced as a factor. # This can be explained by the stringsAsFactors = TRUE argumnet in data.frame # read ?data.frame for more information class(df) # [1] "data.frame" ### Indexing # Get the first row df[1, ] # numbers letters # 1 1 a # Used for programming normally - returns the output as a list df[1, , drop = TRUE] # $numbers # [1] 1 # # $letters # [1] a # Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z # Get several rows of the data.frame df[5:7, ] # numbers letters # 5 5 e # 6 6 f # 7 7 g ### Add one column that mixes the numeric column with the factor column df$mixed = paste(df$numbers, df$letters, sep = ’’) str(df) # 'data.frame': 26 obs. of 3 variables: # $ numbers: int 1 2 3 4 5 6 7 8 9 10 ... # $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ... # $ mixed : chr "1a" "2b" "3c" "4d" ... ### Get columns # Get the first column df[, 1] # It returns a one dimensional vector with that column # Get two columns df2 = df[, 1:2] head(df2) # numbers letters # 1 1 a # 2 2 b # 3 3 c # 4 4 d # 5 5 e # 6 6 f # Get the first and third columns df3 = df[, c(1, 3)] df3[1:3, ] # numbers mixed # 1 1 1a # 2 2 2b # 3 3 3c ### Index columns from their names names(df) # [1] "numbers" "letters" "mixed" # This is the best practice in programming, as many times indeces change, but variable names don’t # We create a variable with the names we want to subset keep_vars = c("numbers", "mixed") df4 = df[, keep_vars] head(df4) # numbers mixed # 1 1 1a # 2 2 2b # 3 3 3c # 4 4 4d # 5 5 5e # 6 6 6f ### subset rows and columns # Keep the first five rows df5 = df[1:5, keep_vars] df5 # numbers mixed # 1 1 1a # 2 2 2b # 3 3 3c # 4 4 4d # 5 5 5e # subset rows using a logical condition df6 = df[df$numbers < 10, keep_vars] df6 # numbers mixed # 1 1 1a # 2 2 2b # 3 3 3c # 4 4 4d # 5 5 5e # 6 6 6f # 7 7 7g # 8 8 8h # 9 9 9i