Instructions

Read the instructions carefully and think about how to develop R code to answer each question.

Question 1

Consider the iris dataset (one of the most famous datasets in data mining) and learn the basic commands for working with data.frame objects.

01: access

A data.frame object has the mixed properties of a matrix() and a list(); hence, we can access the object using both styles, in four ways.
  ##-- data type
  str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  • access as list with name
  ##-- access as list
  iris$Sepal.Length[1:10]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
  • access as list with position
  iris[[1]][1:10]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
  • access as matrix with row and column
  ##-- access as matrix
  iris[1:10,1]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
  • access as matrix with position and name
  ##-- access as matrix with name
  iris[1:10,"Sepal.Length"]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
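
The four access styles above should all return the same numeric vector. A minimal sketch that checks this with identical(); the names by.name, by.pos, by.rowcol, by.colname and one.col are illustrative only and not part of the original exercise:

```r
by.name    <- iris$Sepal.Length[1:10]      # list style, by name
by.pos     <- iris[[1]][1:10]              # list style, by position
by.rowcol  <- iris[1:10, 1]                # matrix style, by row and column index
by.colname <- iris[1:10, "Sepal.Length"]   # matrix style, by column name

# TRUE only if every style returns exactly the same vector
all.same <- identical(by.name, by.pos) &&
  identical(by.pos, by.rowcol) &&
  identical(by.rowcol, by.colname)
all.same

# drop = FALSE keeps the result as a one-column data.frame instead of a vector
one.col <- iris[1:10, "Sepal.Length", drop = FALSE]
class(one.col)
```

Matrix-style access to a single column drops to a plain vector by default; drop = FALSE is the usual way to keep the data.frame structure when later code expects one.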

note

  • we can also look up any object by its name stored as a character string using get(), but this is, in general, not very useful.
 df.name <- "iris"
 get(df.name)[1:3,]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

02: sampling

One way to minimize computation time and reduce the operations on a data.frame is to sample and review only a portion of the data. Here are two tips:
  • show top/bottom
  ##-- show top/bottom 6
  head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
  tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
  • show random/sample data
  set.seed(13)
  nSize <- 5
  smp.idx <- sample(1:nrow(iris),nSize)
  iris[smp.idx,]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 3            4.7         3.2          1.3         0.2     setosa
## 101          6.3         3.3          6.0         2.5  virginica
## 74           6.1         2.8          4.7         1.2 versicolor
## 6            5.4         3.9          1.7         0.4     setosa
## 132          7.9         3.8          6.4         2.0  virginica

note: what does this line do?

  ##-- This code is NOT executed
  head(iris[-smp.idx,])
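
A minimal sketch of how positive and negative row indices relate, assuming the same seed and sample size as above; the names picked and rest are illustrative only:

```r
set.seed(13)
smp.idx <- sample(1:nrow(iris), 5)

picked <- iris[smp.idx, ]    # positive index: keep only the sampled rows
rest   <- iris[-smp.idx, ]   # negative index: keep every row except the sampled ones

# the two pieces do not overlap and together cover the whole data.frame
nrow(picked) + nrow(rest) == nrow(iris)
any(rownames(picked) %in% rownames(rest))
```

This pattern is commonly used to split a dataset into two disjoint parts, for example a sample to inspect and a remainder to hold back.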

03: view

Alternatively, a user may want to screen/check data points using the commands View(), edit(), and fix().
  ### SOLUTION TO QUESTION 2A ### 
  temp <- iris

  ##-- view only; changes are not allowed
  View(temp) 
  
  ##-- view and allow change; ???
  fix(temp)  
  
  ##-- view and allow change; ???
  edit(temp) 

note

  • explain the differences among View(), edit(), and fix()

04: misc

Before exploring the data, here are some miscellaneous data.frame tips that may be useful.
  • dimension and names of columns
  ##-- dimension
  nDim <- dim(iris) 
  nCol <- ncol(iris)
  nRow <- nrow(iris)
  
  ##-- column name
  colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
  id.DF <- data.frame(id=1:nRow)
  • combine data.frame
  ##-- join data.frames column-wise
  head(cbind.data.frame(id.DF,iris))
##   id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1  1          5.1         3.5          1.4         0.2  setosa
## 2  2          4.9         3.0          1.4         0.2  setosa
## 3  3          4.7         3.2          1.3         0.2  setosa
## 4  4          4.6         3.1          1.5         0.2  setosa
## 5  5          5.0         3.6          1.4         0.2  setosa
## 6  6          5.4         3.9          1.7         0.4  setosa
  ##-- check duplication
  any(duplicated(iris))
## [1] TRUE
  nrow(unique(iris))
## [1] 149
  ##-- find duplication index
  which(duplicated(iris)==TRUE)
## [1] 143

note

  • identify the pair of duplicated rows and discuss what you want to do with them.
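
duplicated() flags only the second and later copies of a row, so the first copy is not reported. A minimal sketch that shows both members of the pair by also scanning from the bottom; dup.all and iris.unique are illustrative names:

```r
# flag a row if it repeats an earlier row OR is repeated by a later row
dup.all <- duplicated(iris) | duplicated(iris, fromLast = TRUE)

which(dup.all)    # positions of every row involved in a duplication
iris[dup.all, ]   # the duplicated rows side by side

# one possible treatment: keep only the first copy of each row
iris.unique <- iris[!duplicated(iris), ]
nrow(iris.unique)
```

Whether to drop the repeat is a judgment call: two flowers can legitimately share the same measurements, so removing the row is only appropriate if it is believed to be a data-entry error.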

05: summarytools

One trick that we use to quickly view data is the summarytools package.
  require(summarytools)
  
  
  print(summarytools::dfSummary(iris),method = "render")

Data Frame Summary

iris

Dimensions: 150 x 5
Duplicates: 1
No  Variable                Stats / Values                    Freqs (% of Valid)  Valid         Missing
1   Sepal.Length [numeric]  Mean (sd) : 5.8 (0.8)             35 distinct values  150 (100.0%)  0 (0.0%)
                            min ≤ med ≤ max: 4.3 ≤ 5.8 ≤ 7.9
                            IQR (CV) : 1.3 (0.1)
2   Sepal.Width [numeric]   Mean (sd) : 3.1 (0.4)             23 distinct values  150 (100.0%)  0 (0.0%)
                            min ≤ med ≤ max: 2 ≤ 3 ≤ 4.4
                            IQR (CV) : 0.5 (0.1)
3   Petal.Length [numeric]  Mean (sd) : 3.8 (1.8)             43 distinct values  150 (100.0%)  0 (0.0%)
                            min ≤ med ≤ max: 1 ≤ 4.3 ≤ 6.9
                            IQR (CV) : 3.5 (0.5)
4   Petal.Width [numeric]   Mean (sd) : 1.2 (0.8)             22 distinct values  150 (100.0%)  0 (0.0%)
                            min ≤ med ≤ max: 0.1 ≤ 1.3 ≤ 2.5
                            IQR (CV) : 1.5 (0.6)
5   Species [factor]        1. setosa                         50 (33.3%)          150 (100.0%)  0 (0.0%)
                            2. versicolor                     50 (33.3%)
                            3. virginica                      50 (33.3%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2023-09-13

note: other commands in this package include summarytools::freq(), summarytools::ctable(), and summarytools::descr(). When should you apply these commands?


Question 2

Reconsider iris and explore the data by its classification (Species) with R packages. This question is split into three approaches that give identical results (a user's background and personal experience play an important role in which approach they choose):
  • base is the simple/naive way to explore; it is not well suited to large and complex datasets
  • data.table is an extension of the base data.frame structure that can use multiple cores of your machine (it is a little complex)
  • dplyr provides SQL-like verbs that work on both data.frame and tibble objects. It is part of the tidyverse, a set of packages for data science in R.

base

01: stat

Summarize the data with basic descriptive statistics (mean, median, sd, skewness, kurtosis, IQR, and CV) using base R.
  ### SOLUTION TO QUESTION 1Aa ### 
  summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
  ##-- for numeric data
  iris.numer <- iris[,1:4] ##-- only numeric data can be used
  apply(iris.numer,2,mean) ##-- find mean
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
  symnum(cor(iris.numer)) ##-- find the correlation matrix and show it symbolically
##              S.L S.W P.L P.W
## Sepal.Length 1              
## Sepal.Width      1          
## Petal.Length +   .   1      
## Petal.Width  +   .   B   1  
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
  require(moments)
  
  ##-- skewness is the standardized 3rd moment; it measures the asymmetry of the data
  apply(iris.numer,2,skewness) 
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.3117531    0.3157671   -0.2721277   -0.1019342
  ##-- kurtosis is the standardized 4th moment; it measures tail heaviness (3 for a normal distribution)
  apply(iris.numer,2,kurtosis) 
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     2.426432     3.180976     1.604464     1.663933
  ##-- named-function version of IQR
  findIQR <- function(x){ quantile(x,probs=0.75) - quantile(x,probs=0.25)} 
  
  ##-- apply the named function to each column
  apply(iris.numer,2,findIQR)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          1.3          0.5          3.5          1.5
  ##--  inline version of IQR
  apply(iris.numer,2,function(o){ quantile(o,probs=0.75) - quantile(o,probs=0.25)}) 
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          1.3          0.5          3.5          1.5
  ##-- find the coefficient of variation (CV), i.e. the relative sd
  apply(iris.numer,2,function(o){ sd(o)/mean(o) } )
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.1417113    0.1425642    0.4697441    0.6355511
  ##-- find the frequency of the mode (the most common value); why use max and table?
  apply(iris.numer,2,function(o){ max(table(o))} ) 
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##           10           26           13           29
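
max(table(o)) returns how often the most common value occurs, not the value itself. A minimal sketch that recovers the mode value(s) from the same frequency table; findMode and mode.list are hypothetical names introduced here, and ties are kept on purpose:

```r
# Return every value that attains the highest frequency (ties are kept)
findMode <- function(x) {
  freq <- table(x)                              # frequency of each distinct value
  as.numeric(names(freq)[freq == max(freq)])    # values whose count equals the maximum
}

iris.numer <- iris[, 1:4]

# lapply() rather than apply(): columns may have different numbers of modes
mode.list <- lapply(iris.numer, findMode)
mode.list
```

Base R has no built-in function for the statistical mode (mode() reports the storage type), which is why the frequency table from table() is the usual starting point.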

02: query

Find the rows that satisfy the following conditions using base R
  • Species is “versicolor”
  • 1.0 \(\leq\) Petal.Width \(\leq\) 1.5
  ##-- both bounds are inclusive, matching the condition above
  isSelect <- which(iris$Species == "versicolor" & 
                      iris$Petal.Width >= 1.0 & 
                      iris$Petal.Width <= 1.5
                    )
  head(iris[isSelect,])
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor
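
The same query can be written with a logical mask or with subset(). A minimal sketch, assuming both bounds are inclusive (as between() treats them in the data.table and dplyr sections); keep, by.mask and by.subset are illustrative names:

```r
# logical mask: TRUE for rows that satisfy every condition
keep <- iris$Species == "versicolor" &
  iris$Petal.Width >= 1.0 &
  iris$Petal.Width <= 1.5

by.mask   <- iris[keep, ]
by.subset <- subset(iris, Species == "versicolor" &
                      Petal.Width >= 1.0 & Petal.Width <= 1.5)

all.equal(by.mask, by.subset)   # both approaches select the same rows
nrow(by.mask)                   # number of matching rows
```

subset() evaluates column names inside the data.frame, which saves typing iris$ repeatedly; it is convenient interactively, while explicit indexing is often preferred inside functions.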

03: col select

Select the columns whose names contain the word ‘Sepal’ and find their means.
  ### SOLUTION TO QUESTION 1B ### 

  simCol <- grep("Sepal",names(iris))
  head(iris[,simCol])
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6
## 6          5.4         3.9
  aggregate(Sepal.Width~Species,data=iris,mean)
##      Species Sepal.Width
## 1     setosa       3.428
## 2 versicolor       2.770
## 3  virginica       2.974
  aggregate(Sepal.Length~Species,data=iris,mean)
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588
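
The matched columns can also be averaged in one call instead of one aggregate() per column. A minimal sketch, where overall and by.species are illustrative names and value = TRUE makes grep() return column names rather than positions:

```r
simCol <- grep("Sepal", names(iris), value = TRUE)   # names of the matching columns

# overall mean of every matched column
overall <- colMeans(iris[, simCol])
overall

# mean of every matched column within each Species
by.species <- aggregate(iris[, simCol],
                        by = list(Species = iris$Species),
                        FUN = mean)
by.species
```

Selecting by pattern keeps the code unchanged if more ‘Sepal’ columns are added later.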

04: similar

Count the number of rows whose Species contains the word ‘color’ using base R.
  isSelect <- grepl('color',iris$Species)

  head(iris[isSelect,])
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor
  nrow(iris[isSelect,])
## [1] 50

05: sort

Order the data as follows using base R
  • ascending Petal.Width
  • descending Sepal.Width
  iris[order(iris$Petal.Width,-iris$Sepal.Width),]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 33           5.2         4.1          1.5         0.1     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 1            5.1         3.5          1.4         0.2     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 68           5.8         2.7          4.1         1.0 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 51           7.0         3.2          4.7         1.4 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 135          6.1         2.6          5.6         1.4  virginica
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 134          6.3         2.8          5.1         1.5  virginica
## 73           6.3         2.5          4.9         1.5 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 120          6.0         2.2          5.0         1.5  virginica
## 86           6.0         3.4          4.5         1.6 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 130          7.2         3.0          5.8         1.6  virginica
## 84           6.0         2.7          5.1         1.6 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 107          4.9         2.5          4.5         1.7  virginica
## 71           5.9         3.2          4.8         1.8 versicolor
## 126          7.2         3.2          6.0         1.8  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 150          5.9         3.0          5.1         1.8  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 101          6.3         3.3          6.0         2.5  virginica
## 145          6.7         3.3          5.7         2.5  virginica

06: mutate

Count the number of observations in each Species, categorized by the following criteria
type     width of petal       length of petal
low      \([0.00,0.75)\)      \([0.0,2.5)\)
medium   \([0.75,1.75)\)      \([2.5,5.0)\)
high     \([1.75,\infty)\)    \([5.0,\infty)\)
  ##-- This code is NOT executed
  iris.DF <- iris

  iris.DF$tWidth  <- ifelse(iris.DF$Petal.Width<0.75,"low",
                            ifelse(iris.DF$Petal.Width<1.75,"mid","high"))
  iris.DF$tLength <- ifelse(iris.DF$Petal.Length<2.50,"low",
                            ifelse(iris.DF$Petal.Length<5.00,"mid","high"))

  ftable(tWidth+tLength~Species,data=iris.DF)
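
The nested ifelse() calls can also be expressed with cut(). A minimal sketch, assuming left-closed intervals as in the criteria table; the break vectors wBreaks and lBreaks and the check variable viaIfelse are illustrative names:

```r
iris.DF <- iris

wBreaks <- c(0, 0.75, 1.75, Inf)   # petal width boundaries
lBreaks <- c(0, 2.50, 5.00, Inf)   # petal length boundaries
tLabel  <- c("low", "mid", "high")

# right = FALSE makes each interval [a, b), so 5.0 falls in "high", not "mid"
iris.DF$tWidth  <- cut(iris.DF$Petal.Width,  wBreaks, tLabel, right = FALSE)
iris.DF$tLength <- cut(iris.DF$Petal.Length, lBreaks, tLabel, right = FALSE)

# cross-check against the nested ifelse() formulation
viaIfelse <- ifelse(iris.DF$Petal.Length < 2.50, "low",
                    ifelse(iris.DF$Petal.Length < 5.00, "mid", "high"))
all(as.character(iris.DF$tLength) == viaIfelse)

ftable(tWidth + tLength ~ Species, data = iris.DF)
```

cut() returns an ordered set of factor levels (low, mid, high), so tables print in a natural order; with ifelse() the result is a character vector and tables sort alphabetically.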

data.table

00: intro

data.table is a compact and quick package for transforming data in R, based on the general form DT[i, j, by].
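
A data.table query has three slots, DT[i, j, by]: i selects rows, j computes on columns, and by defines groups. A minimal sketch, assuming the data.table package is installed; the result name res and the column names n and mean.PL are illustrative only:

```r
library(data.table)   # assumes the data.table package is installed

iris.DT <- as.data.table(iris)

# i : keep rows with Petal.Width >= 1.0
# j : count the rows and average Petal.Length
# by: do this separately for each Species
res <- iris.DT[Petal.Width >= 1.0,
               .(n = .N, mean.PL = mean(Petal.Length)),
               by = Species]
res
```

Any of the three slots can be left empty, which is why the later examples use forms such as DT[, j] or DT[i].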

01: stat

Summarize the data with basic descriptive statistics (mean, median, sd, skewness, kurtosis, IQR, and CV) using data.table.
  ### SOLUTION TO QUESTION 1Ab ### 
  require(data.table)

  iris.DT <- as.data.table(iris) 
  head(iris.DT)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4         0.2  setosa
## 2:          4.9         3.0          1.4         0.2  setosa
## 3:          4.7         3.2          1.3         0.2  setosa
## 4:          4.6         3.1          1.5         0.2  setosa
## 5:          5.0         3.6          1.4         0.2  setosa
## 6:          5.4         3.9          1.7         0.4  setosa
  iris.num.DT <- as.data.table(iris[,1:4])
  iris.num.DT[,lapply(.SD,mean)]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:     5.843333    3.057333        3.758    1.199333
  iris.num.DT[,lapply(.SD,quantile,probs=0.5)]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:          5.8           3         4.35         1.3
  iris.num.DT[,lapply(.SD,sd)]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:    0.8280661   0.4358663     1.765298   0.7622377
  require(moments)
  iris.num.DT[,lapply(.SD,skewness)]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:    0.3117531   0.3157671   -0.2721277  -0.1019342
  iris.num.DT[,lapply(.SD,kurtosis)]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:     2.426432    3.180976     1.604464    1.663933
  iris.num.DT[,lapply(.SD,function(o){ quantile(o,probs=0.75) - quantile(o,probs=0.25)})]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:          1.3         0.5          3.5         1.5

02: query

Find the rows that satisfy the following conditions using the data.table package
  • Species is “versicolor”
  • 1.0 \(\leq\) Petal.Width \(\leq\) 1.5
  head(iris.DT[Species=="versicolor" & between(Petal.Width,1.0,1.5)])
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1:          7.0         3.2          4.7         1.4 versicolor
## 2:          6.4         3.2          4.5         1.5 versicolor
## 3:          6.9         3.1          4.9         1.5 versicolor
## 4:          5.5         2.3          4.0         1.3 versicolor
## 5:          6.5         2.8          4.6         1.5 versicolor
## 6:          5.7         2.8          4.5         1.3 versicolor

03: col select

Select the columns whose names contain the word ‘Sepal’ and find their means.
  ### SOLUTION TO QUESTION 1B ### 

  simCol <- names(iris.DT)[which(names(iris.DT) %ilike% 'Sepal')]
  
  head(iris.DT[,..simCol],3) 
##    Sepal.Length Sepal.Width
## 1:          5.1         3.5
## 2:          4.9         3.0
## 3:          4.7         3.2
  head(iris.DT[,.SD,.SDcols=simCol],3)
##    Sepal.Length Sepal.Width
## 1:          5.1         3.5
## 2:          4.9         3.0
## 3:          4.7         3.2
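  ##-- mean of each selected 'Sepal' column by Species
  head(iris.DT[,lapply(.SD,mean),by=Species,.SDcols=simCol])
##       Species Sepal.Length Sepal.Width
## 1:     setosa        5.006       3.428
## 2: versicolor        5.936       2.770
## 3:  virginica        6.588       2.974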

04: similar

Count the number of rows whose Species contains the word ‘color’ using data.table.
  iris.DT[Species%like% 'color',]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##  1:          7.0         3.2          4.7         1.4 versicolor
##  2:          6.4         3.2          4.5         1.5 versicolor
##  3:          6.9         3.1          4.9         1.5 versicolor
##  4:          5.5         2.3          4.0         1.3 versicolor
##  5:          6.5         2.8          4.6         1.5 versicolor
##  6:          5.7         2.8          4.5         1.3 versicolor
##  7:          6.3         3.3          4.7         1.6 versicolor
##  8:          4.9         2.4          3.3         1.0 versicolor
##  9:          6.6         2.9          4.6         1.3 versicolor
## 10:          5.2         2.7          3.9         1.4 versicolor
## 11:          5.0         2.0          3.5         1.0 versicolor
## 12:          5.9         3.0          4.2         1.5 versicolor
## 13:          6.0         2.2          4.0         1.0 versicolor
## 14:          6.1         2.9          4.7         1.4 versicolor
## 15:          5.6         2.9          3.6         1.3 versicolor
## 16:          6.7         3.1          4.4         1.4 versicolor
## 17:          5.6         3.0          4.5         1.5 versicolor
## 18:          5.8         2.7          4.1         1.0 versicolor
## 19:          6.2         2.2          4.5         1.5 versicolor
## 20:          5.6         2.5          3.9         1.1 versicolor
## 21:          5.9         3.2          4.8         1.8 versicolor
## 22:          6.1         2.8          4.0         1.3 versicolor
## 23:          6.3         2.5          4.9         1.5 versicolor
## 24:          6.1         2.8          4.7         1.2 versicolor
## 25:          6.4         2.9          4.3         1.3 versicolor
## 26:          6.6         3.0          4.4         1.4 versicolor
## 27:          6.8         2.8          4.8         1.4 versicolor
## 28:          6.7         3.0          5.0         1.7 versicolor
## 29:          6.0         2.9          4.5         1.5 versicolor
## 30:          5.7         2.6          3.5         1.0 versicolor
## 31:          5.5         2.4          3.8         1.1 versicolor
## 32:          5.5         2.4          3.7         1.0 versicolor
## 33:          5.8         2.7          3.9         1.2 versicolor
## 34:          6.0         2.7          5.1         1.6 versicolor
## 35:          5.4         3.0          4.5         1.5 versicolor
## 36:          6.0         3.4          4.5         1.6 versicolor
## 37:          6.7         3.1          4.7         1.5 versicolor
## 38:          6.3         2.3          4.4         1.3 versicolor
## 39:          5.6         3.0          4.1         1.3 versicolor
## 40:          5.5         2.5          4.0         1.3 versicolor
## 41:          5.5         2.6          4.4         1.2 versicolor
## 42:          6.1         3.0          4.6         1.4 versicolor
## 43:          5.8         2.6          4.0         1.2 versicolor
## 44:          5.0         2.3          3.3         1.0 versicolor
## 45:          5.6         2.7          4.2         1.3 versicolor
## 46:          5.7         3.0          4.2         1.2 versicolor
## 47:          5.7         2.9          4.2         1.3 versicolor
## 48:          6.2         2.9          4.3         1.3 versicolor
## 49:          5.1         2.5          3.0         1.1 versicolor
## 50:          5.7         2.8          4.1         1.3 versicolor
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
  iris.DT[Species%like% 'color',.N]
## [1] 50

note: data.table::uniqueN() can be used to count the number of unique rows

  uniqueN(iris.DT)
## [1] 149

05: sort

Sort the data as follows using the data.table package
  • ascending Petal.Width
  • descending Sepal.Width
  iris.DT[order(Petal.Width,-Sepal.Width)]
##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.2         4.1          1.5         0.1    setosa
##   2:          4.9         3.6          1.4         0.1    setosa
##   3:          4.9         3.1          1.5         0.1    setosa
##   4:          4.8         3.0          1.4         0.1    setosa
##   5:          4.3         3.0          1.1         0.1    setosa
##  ---                                                            
## 146:          6.7         3.1          5.6         2.4 virginica
## 147:          5.8         2.8          5.1         2.4 virginica
## 148:          7.2         3.6          6.1         2.5 virginica
## 149:          6.3         3.3          6.0         2.5 virginica
## 150:          6.7         3.3          5.7         2.5 virginica

06: mutate

Count the number of observations in each Species, categorized by the following criteria
type     width of petal       length of petal
low      \([0.00,0.75)\)      \([0.0,2.5)\)
medium   \([0.75,1.75)\)      \([2.5,5.0)\)
high     \([1.75,\infty)\)    \([5.0,\infty)\)
  wRange <- c(0.00,0.75,1.75,10.0)
  wLabel <- c("low","mid","high")
  lRange <- c(0.00,2.50,5.00,15.0)
  lLabel <- c("low","mid","high")
  
  ##-- right=FALSE gives left-closed intervals [a,b), matching the criteria above
  iris.DT[,tWidth :=cut(Petal.Width ,wRange,wLabel,right=FALSE)]
  iris.DT[,tLength:=cut(Petal.Length,lRange,lLabel,right=FALSE)]
  iris.DT[,.N,by=.(tWidth,tLength,Species)]
##    tWidth tLength    Species  N
## 1:    low     low     setosa 50
## 2:    mid     mid versicolor 47
## 3:   high     mid versicolor  1
## 4:    mid    high versicolor  2
## 5:   high    high  virginica 40
## 6:    mid     mid  virginica  1
## 7:    mid    high  virginica  4
## 8:   high     mid  virginica  5

dplyr

00: intro

dplyr is a part of the tidyverse and is used for data transformation. Its main functions mirror clauses of the SQL language:
dplyr            SQL                 description
‘select()’       SELECT              picks columns
‘filter()’       WHERE               picks cases based on their values
‘group_by()’     GROUP BY            groups data
‘summarise()’    -                   reduces each column to a summary
‘arrange()’      ORDER BY            orders rows
‘*_join()’       JOIN                joins data (e.g. left_join(), inner_join())
‘mutate()’       COLUMN ALIAS (AS)   adds new columns
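
A minimal sketch of how these verbs chain together, assuming the dplyr package is installed; the result name res and the summary column names mean.PW and count are illustrative only:

```r
library(dplyr)   # assumes the dplyr package is installed

res <- iris %>%
  filter(Species != "setosa") %>%            # WHERE: drop one species
  group_by(Species) %>%                      # GROUP BY
  summarise(mean.PW = mean(Petal.Width),     # aggregate within each group
            count   = n()) %>%
  arrange(desc(mean.PW))                     # ORDER BY ... DESC
res
```

Each verb takes a data.frame (or tibble) as its first argument and returns one, which is what allows the steps to be piped in the order they would be read in an SQL query.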

01: stat

Summarize the data with basic descriptive statistics (mean, median, sd, skewness, kurtosis, IQR, and CV) using dplyr.
  ### SOLUTION TO QUESTION 1Ab ### 
  require(dplyr)

  iris.numer <- iris[,1:4]
  summarise_all(iris.numer,.funs=mean)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333
  require(moments)
  summarise_all(iris.numer,.funs=skewness)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1    0.3117531   0.3157671   -0.2721277  -0.1019342
  ##-- interquartile range: third quartile minus first quartile
  summarise_all(iris.numer,.funs=function(o){ 
    quantile(o,probs=0.75) - quantile(o,probs=0.25)
    } )
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          1.3         0.5          3.5         1.5
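
The remaining statistics follow the same pattern. As a minimal sketch, sd and CV can be computed with a user-defined function; cvFun is a hypothetical helper name, and CV is taken here as sd divided by mean:

```r
library(dplyr)

iris.numer <- iris[,1:4]

##-- hypothetical helper: coefficient of variation = sd / mean
cvFun <- function(o){ sd(o)/mean(o) }

iris.sd <- summarise_all(iris.numer, .funs = sd)
iris.cv <- summarise_all(iris.numer, .funs = cvFun)

iris.sd
iris.cv
```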

note

  • glimpse() is an alternative to str() that is available once dplyr is loaded
  require(dplyr) 

  glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

02: query

Find the rows that satisfy both of the following conditions using the dplyr package
  • Species is “versicolor”
  • 1.0 \(\leq\) Petal.Width \(\leq\) 1.5
  require(dplyr)  
  
  head(filter(iris,between(Petal.Width,1.0,1.5) & Species == "versicolor"))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          5.5         2.3          4.0         1.3 versicolor
## 5          6.5         2.8          4.6         1.5 versicolor
## 6          5.7         2.8          4.5         1.3 versicolor
  ##-- chain version
  iris %>% filter(between(Petal.Width,1.0,1.5) & Species == "versicolor") -> iris.filter
  head(iris.filter)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          5.5         2.3          4.0         1.3 versicolor
## 5          6.5         2.8          4.6         1.5 versicolor
## 6          5.7         2.8          4.5         1.3 versicolor

03: group

Select the columns whose names contain the word ‘Sepal’ and find their means for each Species using the dplyr package
  ##-- This code is NOT executed
  select(iris, contains("Sepal"),"Species") -> iris.dp  

  summarise_all(group_by(iris.dp,Species),mean) 

04: similar

Count the rows whose Species contains the word ‘color’ using the dplyr package
  head(filter(iris,grepl("color",Species) ))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          5.5         2.3          4.0         1.3 versicolor
## 5          6.5         2.8          4.6         1.5 versicolor
## 6          5.7         2.8          4.5         1.3 versicolor
  nrow(filter(iris,grepl("color",Species) ))
## [1] 50

05: sort

Order the data using the dplyr package by
  • ascending Petal.Width
  • descending Sepal.Width
  ##-- This code is NOT executed
  head(arrange(iris,Petal.Width,desc(Sepal.Width)))

06: manipulate

Count the rows in each Species after binning the petal measurements with the following criteria:

type     width of petal        length of petal
low      \([0.00,0.75)\)       \([0.0,2.5)\)
medium   \([0.75,1.75)\)       \([2.5,5.0)\)
high     \([1.75,\infty)\)     \([5.0,\infty)\)
  ### SOLUTION TO QUESTION 1F ###   
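  wRange <- c(0.75,1.75)  ##-- width thresholds
  lRange <- c(2.50,5.00)  ##-- length thresholds
  
  iris.dply <- iris
  
  ##-- note: `>` labels a value exactly on the upper threshold as "mid";
  ##-- use `>=` to match the [a,b) criteria exactly
  iris.dply <- mutate(iris.dply,tWidth=case_when(
    Petal.Width < wRange[1] ~ "low",
    Petal.Width > wRange[2] ~ "high",
    TRUE ~ "mid"
  ) )
  
  iris.dply <- mutate(iris.dply,tLength=case_when(
    Petal.Length < lRange[1] ~ "low",
    Petal.Length > lRange[2] ~ "high",
    TRUE ~ "mid"
  ) )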
  glimpse(iris.dply)
## Rows: 150
## Columns: 7
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
## $ tWidth       <chr> "low", "low", "low", "low", "low", "low", "low", "low", "…
## $ tLength      <chr> "low", "low", "low", "low", "low", "low", "low", "low", "…
  ftable(tWidth+tLength~Species,data=iris.dply)
##            tWidth  high          low          mid        
##            tLength high low mid high low mid high low mid
## Species                                                  
## setosa                0   0   0    0  50   0    0   0   0
## versicolor            0   0   1    0   0   0    1   0  48
## virginica            38   0   7    0   0   0    3   0   2

EXTRA I

Use your knowledge to query and mutate iris in the following steps
  • label each row by its Petal.Length into equal groups (i.e., ‘PL.H’, ‘PL.M’, ‘PL.L’)
  • label each row by its Sepal.Length into equal groups (i.e., ‘SL.H’, ‘SL.M’, ‘SL.L’)
  • find the number of rows in each group (there is a maximum of \(3 \times 3 \times 3=\) 27 groups)
  • compute the median and sd of column Petal.Width for each group
  • compute the mode and CV of column Sepal.Width for each group
  • ignore NA records

The result should be similar to this table

EXTRA II

Convert the data into long format and convert it back

hint There are two possible packages for this task: reshape2 and data.table

  • reshape2::melt() and reshape2::dcast()
  • data.table::melt() and data.table::dcast()
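
As a starting point, here is a minimal sketch of the round trip with data.table. The column names rowId, dimension, and cm are assumptions made for this example; a row id is added because, without it, rows of the same Species cannot be told apart when casting back:

```r
library(data.table)

iris.DT <- as.data.table(iris)
iris.DT[, rowId := .I]   ##-- row id (assumed name) so that each flower can be rebuilt

##-- wide -> long: one row per (flower, dimension)
iris.long <- melt(iris.DT, id.vars = c("rowId","Species"),
                  variable.name = "dimension", value.name = "cm")

##-- long -> wide: one column per dimension again
iris.wide <- dcast(iris.long, rowId + Species ~ dimension, value.var = "cm")

dim(iris.long)
dim(iris.wide)
```

The reshape2 versions take very similar arguments, so the same sketch should carry over with minor changes.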

Question 3

Visualize the iris dataset using the standard base package and then the lattice package for its classification. Pay attention to the outliers, as you will need these observations for the next question:

01: RECAP

Here are some plots that you should know from the first half. Please recreate them.

NOTE

  • data points 2, 4, 17, 39, and 90 are marked using text()
  • identify() is a powerful command for semi-manual labeling

NOTE

  • the tick marks under the box show the individual data values, drawn using rug()

  • panels can be combined using par(mfrow=c(2,2))

  • a boxplot is a powerful plot for checking distributions and outliers.

  ##-- save only the settable graphical parameters so that they can be restored later
  oldPar <- par(no.readonly=TRUE)
  boxplot(iris,col="gray")

  par(mfcol=c(2,2))
  boxplot(Sepal.Length~Species,data=iris,col="gray")
  boxplot(Sepal.Width~Species,data=iris,col="gray")
  boxplot(Petal.Length~Species,data=iris,col="gray")
  boxplot(Petal.Width~Species,data=iris,col="gray")

  par(oldPar)

02: rare plot

Here are some interesting plots that we rarely used or covered in the first half
  • stem and leaf plot
  stem(iris[,2])
## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   20 | 0
##   21 | 
##   22 | 000
##   23 | 0000
##   24 | 000
##   25 | 00000000
##   26 | 00000
##   27 | 000000000
##   28 | 00000000000000
##   29 | 0000000000
##   30 | 00000000000000000000000000
##   31 | 00000000000
##   32 | 0000000000000
##   33 | 000000
##   34 | 000000000000
##   35 | 000000
##   36 | 0000
##   37 | 000
##   38 | 000000
##   39 | 00
##   40 | 0
##   41 | 0
##   42 | 0
##   43 | 
##   44 | 0
  • sunflower plot
  sunflowerplot(iris[,1],iris[,2])


03: pairs() and hist()

Observe the dataset visually using pairs() and hist(). They are basic tools for understanding distributions and relationships
  ### SOLUTION TO QUESTION 2B ### 
  iris.jitter <- apply(iris[,1:4],2,function(o){jitter(o)})
  pairs(iris.jitter,col=iris$Species)  

  hist(iris$Sepal.Length,breaks=10,col="grey",freq = F)
  rug(jitter(iris$Sepal.Length))
  lines(density(iris$Sepal.Length),col="red")  ##-- overlay the kernel density estimate

04: lattice

Visualize the data with bwplot() in the lattice package to show its classification, and compare the result with boxplot()
  ### SOLUTION TO QUESTION 2C ### 
  boxplot(Sepal.Length~Species,data=iris,col="grey" )

  boxplot(iris[[2]]~iris$Species,col="grey" ) ##-- alternative version using list

This is an advanced version of boxplot() that overlays the three Species
  iris.seto <- iris[1:50,]
  iris.virg <- iris[which(iris$Species=="virginica"),]
  iris.vers <- iris[51:100,]
  boxplot(iris.vers[1:4],pch=16,cex=0.5,col="blue")
  boxplot(iris.seto[1:4],pch=16,cex=0.5,col="orange",add=T)
  boxplot(iris.virg[1:4],pch=16,cex=0.5,col="#F0FF00AA",add=T)  

Alternatively, the lattice package provides a scatter plot using xyplot()
  require(lattice)  
  xyplot(iris[[2]]~iris[[1]]|iris[[5]])

  bwplot(Sepal.Length~factor(ceiling(Sepal.Width)) |Species,data=iris)

05: outlier

We can identify the outliers of boxplot() by saving its result in another variable. For example,
  tempBox <- boxplot(iris[[1]]~iris[[5]],plot=F)
  str(tempBox)
## List of 6
##  $ stats: num [1:5, 1:3] 4.3 4.8 5 5.2 5.8 4.9 5.6 5.9 6.3 7 ...
##  $ n    : num [1:3] 50 50 50
##  $ conf : num [1:2, 1:3] 4.91 5.09 5.74 6.06 6.34 ...
##  $ out  : num 4.9
##  $ group: num 3
##  $ names: chr [1:3] "setosa" "versicolor" "virginica"
Then, we can identify row id using which(). For example,
  which(iris[[1]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])
## [1] 107
Manually, you can keep the row ids found for each column as a set of outliers (‘ol1’, ‘ol2’) and combine the sets with the command ‘union()’.
  tempBox <- boxplot(iris[[1]]~iris[[5]],plot=F)
  ol1 <- which(iris[[1]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])
  tempBox <- boxplot(iris[[2]]~iris[[5]],plot=F)
  ol2 <- which(iris[[2]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])
  
  outlier <- union(ol1,ol2)
  outlier
## [1] 107  42

Note the set operations are:
  • ‘union()’
  • ‘intersect()’
  • ‘setdiff()’

Here is a full implementation of this concept
  • loop for extracting the outlier information of all columns
  iris.boxList <- data.frame()
  for(i in 1:4){
    ## i <- 1
    tempBox <- boxplot(iris[[i]]~iris[[5]],plot=F)
    tempDF  <- as.data.frame(tempBox[c("out", "group")])
    tempDF$colName <- i
    tempDF$species <- tempBox$names[tempDF$group]
    iris.boxList <- rbind(iris.boxList,tempDF)
  }
  iris.boxList
##   out group colName    species
## 1 4.9     3       1  virginica
## 2 2.3     1       2     setosa
## 3 1.0     1       3     setosa
## 4 3.0     2       3 versicolor
## 5 0.5     1       4     setosa
## 6 0.6     1       4     setosa
  • loop for finding the row id of each outlier
  iris.boxList$which <- NA_integer_  ##-- placeholder column, filled in the loop
  for(i in 1:nrow(iris.boxList) ){
    ## i <- 1
    colIdx  <- iris.boxList$colName[i]
    species <- iris.boxList$species[i]
    value   <- iris.boxList$out[i]
    resRow  <- which( iris[[colIdx]] == value  & iris[[5]]==species)
    iris.boxList$which[i] <- resRow
  }
  iris.boxList
##   out group colName    species which
## 1 4.9     3       1  virginica   107
## 2 2.3     1       2     setosa    42
## 3 1.0     1       3     setosa    23
## 4 3.0     2       3 versicolor    99
## 5 0.5     1       4     setosa    24
## 6 0.6     1       4     setosa    44
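
The two loops can also be collapsed into one helper. The sketch below is an alternative formulation, not the workshop's own solution: it relies on boxplot.stats(), which applies the same 1.5 x IQR whisker rule that boxplot() uses by default, and findOutlierRows is a hypothetical name. Unlike the loop above, it keeps every row when several rows share an outlying value.

```r
##-- hypothetical helper: row ids whose value is a boxplot outlier within its group
findOutlierRows <- function(values, groups){
  rowsByGroup <- split(seq_along(values), groups)
  unlist(lapply(rowsByGroup, function(r){
    r[ values[r] %in% boxplot.stats(values[r])$out ]
  }), use.names = FALSE)
}

##-- apply to the four numeric columns and merge the row ids
outlierRows <- sort(unique(unlist(
  lapply(1:4, function(i){ findOutlierRows(iris[[i]], iris[[5]]) })
)))
outlierRows
```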

06: put together

Data cleansing is perhaps the most important part of data mining: it is the most time-consuming process, and it affects every subsequent step. Before analyzing, it is important to clean the data. The major steps are:
  • check for duplication
  • check for missing values
  • check for obvious errors, e.g., swapped columns, out-of-bound values, wrong gender
  • check for irregularities, i.e., not-so-obvious or strange values that usually require investigation, e.g., outliers and noise
  • consider removing or imputing such data points

Because iris is an already-cleaned dataset, we only have to worry about outliers. We can manually identify outliers using identify() on a scatter plot
  ##-- This code is NOT executed
  xAxis <-  iris[,1]
  yAxis <-  iris[,2]  
  plot(xAxis,yAxis)
  identify(xAxis,yAxis)
Removing the outliers can be done later. Recall that
  iris.boxList
##   out group colName    species which
## 1 4.9     3       1  virginica   107
## 2 2.3     1       2     setosa    42
## 3 1.0     1       3     setosa    23
## 4 3.0     2       3 versicolor    99
## 5 0.5     1       4     setosa    24
## 6 0.6     1       4     setosa    44
  outlier.Idx <- unique(iris.boxList$which)
  iris.cln <- iris[-outlier.Idx,]
Alternatively, we can apply the local outlier factor, lof() in the Rlof package, which flags a point as an outlier when its local density is much lower than that of its k nearest neighbours.
  ##-- This code is NOT executed
  require(Rlof)
  lof.dist <- lof(iris[,1:4],k=8) 
  isGood   <-  which(lof.dist < 1.2)
  iris.lof <- iris[isGood,]
  pairs(iris.lof[,1:4],col=iris.lof$Species)  

Question 4

Visualize the iris dataset using the ggplot2 package

00: concept

The ggplot2 package is a part of tidyverse that plots and visualizes data.frame and data.table objects. ggplot2 is based on the grammar of graphics, which separates a plot into components: a data set, a coordinate system, and geoms (visual marks that represent data points).

  • DATA data.frame or data.table (required)
  • GEOM_FUNCTION plot style (required), e.g., geom_point(), geom_boxplot()
  • MAPPING axis, plot component (required), e.g., aes(x=,y=,fill=)
  • STAT_FUNCTION, plot with a statistical summary
  • FACET_FUNCTION, show many plots in the same figure
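
To see where each component goes in a call, here is a minimal sketch on iris. The choice of columns, the linear smoother, and the object name p are only illustrative:

```r
library(ggplot2)

p <- ggplot(data = iris,                                          ##-- DATA
            mapping = aes(x = Sepal.Width, y = Sepal.Length)) +   ##-- MAPPING
  geom_point() +                                                  ##-- GEOM_FUNCTION
  stat_smooth(method = "lm", formula = y ~ x) +                   ##-- STAT_FUNCTION
  facet_wrap(vars(Species))                                       ##-- FACET_FUNCTION

print(p)
```

The components are joined with `+`, so a plot can be stored in an object and extended later.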

note the code is available in the next tab

01: typical

we will cover a simple geom.
  • Histogram and Density plot
  ggplot(iris, aes(Sepal.Width,fill=Species)) + geom_histogram(bins=25) 

  • Dot plot or overlapped histogram
  ggplot(iris) +  geom_dotplot(aes(x=Sepal.Width,fill=Species))

  • scatter plot
  ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,color=Species)) +  geom_point(position="jitter") + theme_classic()


02: adv

In general, ggplot() with additional packages can produce almost any form of visualization. Here are some capabilities that we may use for the team project.
  • facet
  require(ggplot2)
  require(data.table)
  iris.DT <- as.data.table(iris)
  iris.lng <- melt(iris.DT,id.vars="Species")  ##-- long format: Species, variable, value
  ggplot(iris.lng, aes(x=variable,y=value,fill=Species)) + geom_violin() -> gg
  gg + facet_grid(cols=vars(Species) ) + xlab("dimension") + ylab("Unit (cm)")

  • 1D density with facet and 2D density
  ggplot(iris.lng, aes(value,color=variable))+ geom_density() + facet_grid(cols=vars(Species) )

  ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,color=Species)) +  geom_density_2d()

  • combine filter with plot
 ##-- filter the data first, then pipe the result into the plot
 iris.DT[Species %like% "color"] %>%
   ggplot(aes(x=Sepal.Width,y=Sepal.Length,color=Species)) + geom_point()

note

  • %>% = piping (passing data) command from the magrittr package, re-exported by dplyr
  • melt = command to make a long table
  • dcast = command to make a wide table

Question 5

After marking 30 questions, an instructor notices possible cheating among the following ten students. The questions are TRUE-FALSE questions, and the instructor has marked ‘1’ for a correct answer and ‘0’ for an incorrect answer. Can you detect the cheaters (source and copier)?

00: concept

Important Concept
  • How and what should be compared?
    • the answers to each question, between two students at a time (why?)
    • both get a wrong/right answer (any signal?)
  • What are the measurements of similarity?
    • simple matching coefficient: \[ SMC = \frac{\mbox{\# matching attributes}}{\mbox{ \# all attributes}} \]
    • cosine similarity: \[ \cos(\mathbf{x},\mathbf{y}) =\frac{\mathbf{x} \cdot \mathbf{y} }{\| \mathbf{x}\| ~\|\mathbf{y}\|}\]
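
Both measures are short to write in R. Below is a minimal sketch on two made-up answer vectors (x and y are illustrative, not taken from the exam file; smcFun and cosFun are hypothetical helper names):

```r
##-- simple matching coefficient: share of positions where the answers agree
smcFun <- function(x, y){ mean(x == y) }

##-- cosine similarity: dot product divided by the product of the vector lengths
cosFun <- function(x, y){ sum(x*y) / ( sqrt(sum(x^2)) * sqrt(sum(y^2)) ) }

##-- illustrative marked answers of two students on five questions
x <- c(1, 1, 0, 1, 0)
y <- c(1, 0, 0, 1, 1)

smcFun(x, y)   ##-- 3 matches out of 5 = 0.6
cosFun(x, y)   ##-- 2 / (sqrt(3) * sqrt(3)) = 2/3
```

Note that SMC counts a shared wrong answer (0-0) as a match, while the cosine only rewards shared correct answers (1-1).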

01: prepare

STEP 1: prepare data
  ##-- This code is NOT executed
  ##-- If there is an error, check where your file is
  exam.df <- as.data.frame(read.csv(file="examMarked.csv"))
  colnames(exam.df) <- c("id",paste( "Q",1:30,sep=""))
  
  stu1 <- exam.df[1,2:31]
  stu2 <- exam.df[2,2:31]
  
  ##-- example of data 
  rbind(stu1,stu2)[,1:10]
##   Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
## 1  1  1  1  1  0  1  1  1  1   1
## 2  1  1  1  0  0  1  1  0  1   1
Create a function that compares two students
checkFun <- function(stu1,stu2){
  case00 <- length(which(stu1 == 0 & stu2 == 0))
  case01 <- length(which(stu1 == 0 & stu2 == 1))
  case10 <- length(which(stu1 == 1 & stu2 == 0))
  case11 <- length(which(stu1 == 1 & stu2 == 1))
  return( list(case00=case00,
               case10=case10,
               case01=case01,
               case11=case11) )
}

## test your code
checkFun(stu1,stu2)
## $case00
## [1] 8
## 
## $case10
## [1] 5
## 
## $case01
## [1] 5
## 
## $case11
## [1] 12

02: compare

STEP 2: actually compare the students
  ##-- This code is NOT executed
  pair <- combn(10,2)
  nPair<- ncol(pair)
  pairResult <- data.frame(stu1ID=pair[1,],stu2ID=pair[2,],
                           case00=rep(NA,nPair),case10=rep(NA,nPair),
                           case01=rep(NA,nPair),case11=rep(NA,nPair))



#nPair <- nrow(pairResult)
for( i in 1:nPair){
  ## i <- 1 ## debug
  stu1ID <- pairResult$stu1ID[i]
  stu2ID <- pairResult$stu2ID[i]
  exam1  <- exam.df[stu1ID,2:31]
  exam2  <- exam.df[stu2ID,2:31]
  compResult <- checkFun(exam1,exam2)
  pairResult$case00[i] <- compResult$case00
  pairResult$case10[i] <- compResult$case10
  pairResult$case01[i] <- compResult$case01
  pairResult$case11[i] <- compResult$case11  
}
##-- SMC per pair: matching answers (0-0 and 1-1) over all answers of that pair
pairResult$smc <- (pairResult$case00+ pairResult$case11)/rowSums(pairResult[,3:6])
ord <- order(pairResult$smc,decreasing = T)
pairResult[ord,]

03: stat

The correlation between the answers of two students can be tested with cor.test()
  cor.test(as.numeric(exam.df[1,2:31]),as.numeric(exam.df[2,2:31]))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(exam.df[1, 2:31]) and as.numeric(exam.df[2, 2:31])
## t = 1.7951, df = 28, p-value = 0.08343
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04410735  0.61083640
## sample estimates:
##      cor 
## 0.321267

Question 6

Consider the following visualization example: murder arrests (per 100,000 residents) in the US by state, from USArrests in the datasets package, shown as a thermal map

0A: map

The maps package has a built-in function, map(), for drawing maps. The level of detail may vary from country to country. Here is a state map of the US.
  require(maps)
  map('state',col=c("red","blue","green"),fill=T) 

0B: data

Before integrating with map, we need to know more about USArrests data.
  data <- as.data.frame(USArrests)
  head(data)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
  hist(data$Murder,col="grey")
  rug(jitter(data$Murder))
  box()

0C: color

We use the heat.colors() scheme for the colors (red = hot; yellow = warm)
  intQuantile <- c(0.1,0.25,0.5,0.75,0.9)
  nRange      <- length(intQuantile)
  ##-- nRange cut points give nRange+1 intervals, hence nRange+1 colors
  colorRange  <- sort(heat.colors(nRange+1),decreasing = T)
  pie(rep(1,nRange+1),col=colorRange)

Put the colors and the map together
  valMurder   <- quantile(data$Murder,intQuantile)
  myCol       <- as.character(cut(data$Murder,breaks = c(0,valMurder,20),labels=colorRange))
  
  map('state',col=myCol,fill=T)
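
note: with fill=TRUE, map() fills the polygons in its own drawing order. The 'state' database has no polygons for Alaska and Hawaii and splits some states into several polygons, so a color vector in USArrests row order may not line up with the drawn states. Here is a minimal sketch of matching by name, assuming the polygon names are lower-case state names with an optional ':sub-region' suffix (stateNames, stateKey, idx, and polyCol are illustrative names):

```r
library(maps)

##-- same colors as above, one per row of USArrests
data        <- as.data.frame(USArrests)
intQuantile <- c(0.1,0.25,0.5,0.75,0.9)
colorRange  <- sort(heat.colors(length(intQuantile)+1), decreasing = TRUE)
valMurder   <- quantile(data$Murder, intQuantile)
myCol       <- as.character(cut(data$Murder, breaks = c(0,valMurder,20), labels = colorRange))

##-- polygon names in drawing order
stateNames <- map('state', plot = FALSE, namesonly = TRUE)
stateKey   <- sub(":.*$", "", stateNames)                ##-- drop the sub-region part
idx        <- match(stateKey, tolower(rownames(data)))   ##-- NA when not in USArrests
polyCol    <- myCol[idx]                                 ##-- one color per polygon

map('state', col = polyCol, fill = TRUE)
```

Polygons without a matching row get NA as their color and are left unfilled.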

0D: label

preparation

  ##-- prepare legend
  legendText <- c(paste(c(">",valMurder[nRange]),collapse=""))
  for(j in (nRange-1):1){
    legendText <- c(legendText,paste(c(valMurder[j],"-",valMurder[j+1]),collapse=""))
  }
  legendText <- c(legendText,paste(c("<",valMurder[1]),collapse=""))
  
  legendText
## [1] ">13.32"      "11.25-13.32" "7.25-11.25"  "4.075-7.25"  "2.56-4.075" 
## [6] "<2.56"
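combine together
  map('state',col=myCol,fill=T)
  ##-- legendText runs from the highest range to the lowest, while colorRange runs
  ##-- from the lowest range to the highest, so the colors are reversed here
  legend("bottomright",legend=legendText,pch=15,col=rev(colorRange),ncol=1,cex=1.0,pt.cex=3.5
         ,y.intersp=0.5,bty="n")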

01: other maps

Based on the previous code blocks, represent the other three variables, i.e., Assault, UrbanPop, and Rape, in a similar manner with the function plotThermalMap(type=1,quantLv=c(0.1,0.25,0.5,0.75,0.9))
  ##-- This code is NOT executed and is incomplete
  plotThermalMap <- function(type=1,quantLv=c(0.1,0.25,0.5,0.75,0.9)){
    
    data <- as.data.frame(USArrests)

    ##-- This part is intentionally left out --##
    
    return(0)
  }
  
  plotThermalMap(4)

02: rscript (DOS)

Automatically generate the thermal maps and export them as files (note: this can be combined and executed with a batch file using Rscript <fileName>.R in DOS)
  ##-- This code is NOT executed
  typeName <- colnames(USArrests)
  nType    <- length(typeName)
  for(i in 1:nType){
    ## i <- 1 ##-- for debug
    fileName <- paste( c(typeName[i],".png"),collapse="")
    png(fileName,width = 600,height = 600)
    plot.new()
    
    ##-- function from the previous part
    plotThermalMap(i) 
    dev.off()
  }

Question 7

The final step after understanding the patterns and insights of a dataset is to prepare the data for a model. This involves preparing two data sets: one for training the model and one for testing it.

note This last question overlaps with a question in the next workshop.

01: insights

Based on the data exploration so far, list useful insights that can be used for model selection.

02: separate

The data can be separated using the command sample.int() or sample(). It is very important to call set.seed() first so that the split is reproducible
  ##-- using `sample()`
  set.seed(17)
  smpl.idx <- sample(1:nrow(iris),size=30)
  
  ##-- using `sample.int()`
  set.seed(17)
  smpl.idx <- sample.int(nrow(iris),size=30)
  
  iris.test <- iris[smpl.idx,]
  iris.train<- iris[-smpl.idx,]
  
  head(iris.test)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 108          7.3         2.9          6.3         1.8 virginica
## 42           4.5         2.3          1.3         0.3    setosa
## 129          6.4         2.8          5.6         2.1 virginica
## 6            5.4         3.9          1.7         0.4    setosa
## 133          6.4         2.8          5.6         2.2 virginica
## 110          7.2         3.6          6.1         2.5 virginica

NOTE Is the sample balanced in terms of Species? If not, is there any solution?
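
One possible answer is stratified sampling, i.e., drawing the same number of test rows from each Species. This is a minimal sketch, assuming 10 test rows per Species (30 in total, as above); the names test.idx, iris.test.str, and iris.train.str are chosen only for this example:

```r
set.seed(17)

##-- split the row ids by Species, then draw 10 ids from each group
rowsBySpecies <- split(seq_len(nrow(iris)), iris$Species)
test.idx <- unlist(lapply(rowsBySpecies, function(rows){ sample(rows, size = 10) }),
                   use.names = FALSE)

iris.test.str  <- iris[test.idx,]
iris.train.str <- iris[-test.idx,]

##-- every Species now appears equally often in the test set
table(iris.test.str$Species)
```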


03: build model

Different packages have different ways to build a model and obtain results. Here are three examples from three classification packages.

class::knn()

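k-nearest neighbours is the simplest classification method. It is also unusual in how the model is built: knn() takes the training and test sets in a single call and returns the predictions directly, without a separate model object.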
  require(class)
  species.knn <- knn(train=iris.train[,-5],test=iris.test[,-5],
                  cl=iris.train[,5],k=3)  
  
  ftable(iris.test[,5],species.knn)
##            species.knn setosa versicolor virginica
##                                                   
## setosa                     11          0         0
## versicolor                  0          9         1
## virginica                   0          1         8

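stats::glm()

The generalized linear model is an extension of the linear regression model that covers other types of response, such as binary (binomial) and count (Poisson) data. Here the species index is treated as a count purely for illustration, which is why a rounded prediction outside 1-3 can appear.
  ##-- casting factor into number (1-3)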
  iris.train$specIdx <- as.numeric(iris.train$Species)
  iris.test$specIdx  <- as.numeric(iris.test$Species)

  species.glm <- glm(specIdx~. -Species,data=iris.train,family ="poisson")
  
  iris.glm    <- round(predict(species.glm,newdata = iris.test,type = "response"))
  ftable(iris.test[,5],iris.glm)  
##            iris.glm  1  2  3  4
##                                
## setosa              11  0  0  0
## versicolor           0 10  0  0
## virginica            0  0  8  1

rpart::rpart()

Recursive Partitioning and Regression Trees (rpart) builds a decision tree using a greedy algorithm. The method requires pruning, guided by the complexity parameter (cp), to avoid overfitting.
  require(rpart)
  iris.rpart <- rpart(Species~.,data=iris.train)  
  
  require(rpart.plot)
  prp(iris.rpart)

  species.rpart <- apply(predict(iris.rpart,newdata = iris.test),1,which.max)
  ftable(iris.test[,5],species.rpart)
##            species.rpart  1  2  3
##                                  
## setosa                   11  0  0
## versicolor                0 10  0
## virginica                 0  0  9
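
The pruning step is not shown above. A minimal sketch follows, assuming the tree is grown on the full iris data (rather than iris.train) so that the block runs on its own; picking the cp with the smallest cross-validated error is one common rule, not the only one:

```r
library(rpart)

set.seed(17)   ##-- the cross-validated error (xerror) depends on random folds
fit <- rpart(Species~., data = iris)

##-- table of cp values with their cross-validated error
printcp(fit)

##-- choose the cp with the smallest xerror and prune the tree at that value
cpTable    <- fit$cptable
bestCp     <- cpTable[which.min(cpTable[,"xerror"]), "CP"]
fit.pruned <- prune(fit, cp = bestCp)
fit.pruned
```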

NOTE The details of how to select a suitable model will be discussed in the next workshop.


Copyright 2019   Oran Kittithreerapronchai.   All Rights Reserved.   Last modified: 2023-31-13,