Instruction

Read the instructions carefully and think about how to develop R code to answer each question.

Question 1

Consider the `iris` dataset (one of the most famous datasets in data mining) and learn the basic commands for `data.frame` objects.

01: access

A `data.frame` object has mixed properties of `matrix()` and `list()`; hence, we can access it with either style, in the four ways shown below.

  ##-- data type
  str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

access as list with name

  ##-- access as list
  iris$Sepal.Length[1:10]

##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

access as list with position

  iris[[1]][1:10]

##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

access as matrix with row and column

  ##-- access as matrix
  iris[1:10,1]

##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

access as matrix with position and name

  ##-- access as matrix with name
  iris[1:10,"Sepal.Length"]

##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
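As a quick sanity check, the four access styles above can be compared directly. This is a minimal sketch; the variable names (`by_name`, `by_position`, and so on) are illustrative and not part of the original exercise.

```r
# Four equivalent ways to pull the first ten sepal lengths from iris
by_name     <- iris$Sepal.Length[1:10]      # list style, by name
by_position <- iris[[1]][1:10]              # list style, by position
by_index    <- iris[1:10, 1]                # matrix style, by row/column index
by_colname  <- iris[1:10, "Sepal.Length"]   # matrix style, by column name

# identical() checks that the results match exactly
all_same <- identical(by_name, by_position) &&
  identical(by_name, by_index) &&
  identical(by_name, by_colname)
print(all_same)

# Single brackets with one index keep the data.frame structure,
# while double brackets return the underlying vector
print(class(iris["Sepal.Length"]))
print(class(iris[["Sepal.Length"]]))
```

Note that `iris["Sepal.Length"]` is a one-column `data.frame`, which matters when a function expects a plain numeric vector.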

note

We can also access a data set through its name stored as a character string (using `get()`), but this is, in general, not very useful.

 df.name <- "iris"
 get(df.name)[1:3,]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

02: sampling

One way to minimize computation time and reduce operations on a `data.frame` is to sample and review only a portion of the data. Here are two tips:

show top/bottom

  ##-- show top/bottom 6
  head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

  tail(iris)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

show random/sample data

  set.seed(13)
  nSize <- 5
  smp.idx <- sample(1:nrow(iris),nSize)
  iris[smp.idx,]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 3            4.7         3.2          1.3         0.2     setosa
## 101          6.3         3.3          6.0         2.5  virginica
## 74           6.1         2.8          4.7         1.2 versicolor
## 6            5.4         3.9          1.7         0.4     setosa
## 132          7.9         3.8          6.4         2.0  virginica

note: what does this line do?

  ##-- This code is NOT executed
  head(iris[-smp.idx,])
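One way to explore the question is to look at how a negative index behaves. The sketch below assumes the same seed and sample size as above; the names `picked` and `rest` are illustrative only.

```r
set.seed(13)
n_size  <- 5
smp_idx <- sample(seq_len(nrow(iris)), n_size)  # sampled row positions

picked <- iris[smp_idx, ]    # rows at the sampled positions
rest   <- iris[-smp_idx, ]   # negative index: every row EXCEPT those positions

# The two pieces do not overlap and together cover the whole data set
print(nrow(picked))
print(nrow(rest))
print(any(rownames(rest) %in% rownames(picked)))
```

This pattern is commonly used to split data into a sample and its complement, for example a training set and a hold-out set.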

03: view

Alternatively, a user may want to screen/check data points using the commands `View()`, `edit()`, and `fix()`.

  ### SOLUTION TO QUESTION 2A ### 
  temp <- iris

  ##-- view only; changes are not allowed
  View(temp) 
  
  ##-- view and allow change; ???
  fix(temp)  
  
  ##-- view and allow change; ???
  edit(temp)

note

Explain the differences among `View()`, `edit()`, and `fix()`.

04: misc

Before exploring the data, here are some miscellaneous `data.frame` tips that may be useful.

dimension and names of columns

  ##-- dimension
  nDim <- dim(iris) 
  nCol <- ncol(iris)
  nRow <- nrow(iris)
  
  ##-- column name
  colnames(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

  id.DF <- data.frame(id=1:nRow)

combine data.frame

  ##-- join data.frames column-wise
  head(cbind.data.frame(id.DF,iris))

##   id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1  1          5.1         3.5          1.4         0.2  setosa
## 2  2          4.9         3.0          1.4         0.2  setosa
## 3  3          4.7         3.2          1.3         0.2  setosa
## 4  4          4.6         3.1          1.5         0.2  setosa
## 5  5          5.0         3.6          1.4         0.2  setosa
## 6  6          5.4         3.9          1.7         0.4  setosa

  ##-- check duplication
  any(duplicated(iris))

## [1] TRUE

  nrow(unique(iris))

## [1] 149

  ##-- find duplication index
  which(duplicated(iris)==TRUE)

## [1] 143

note

Identify the pair of duplicated rows and discuss what you want to do with them.
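`duplicated()` flags only the second and later copies of a row, which is why only one index is reported above. A minimal sketch for listing both members of the pair, assuming that scanning from both ends is acceptable (the names `dup_all`, `dup_rows`, and `iris_dedup` are illustrative):

```r
# Flag a row if it repeats an earlier row OR is repeated by a later one
dup_all  <- duplicated(iris) | duplicated(iris, fromLast = TRUE)
dup_rows <- which(dup_all)

print(dup_rows)
print(iris[dup_rows, ])

# One possible treatment: keep only the first copy of each row
iris_dedup <- iris[!duplicated(iris), ]
print(nrow(iris_dedup))
```

Whether to drop the repeat is a judgment call: two flowers can legitimately share the same measurements, so a duplicate row is not necessarily a data-entry error.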

05: `summarytools`

One trick that we use to quickly view data is the `summarytools` package.

  require(summarytools)
  
  
  print(summarytools::dfSummary(iris),method = "render")

Data Frame Summary: iris

Dimensions: 150 x 5
Duplicates: 1

Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
Sepal.Length [numeric]	Mean (sd): 5.8 (0.8); min ≤ med ≤ max: 4.3 ≤ 5.8 ≤ 7.9; IQR (CV): 1.3 (0.1)	35 distinct values	150 (100.0%)	0 (0.0%)
Sepal.Width [numeric]	Mean (sd): 3.1 (0.4); min ≤ med ≤ max: 2 ≤ 3 ≤ 4.4; IQR (CV): 0.5 (0.1)	23 distinct values	150 (100.0%)	0 (0.0%)
Petal.Length [numeric]	Mean (sd): 3.8 (1.8); min ≤ med ≤ max: 1 ≤ 4.3 ≤ 6.9; IQR (CV): 3.5 (0.5)	43 distinct values	150 (100.0%)	0 (0.0%)
Petal.Width [numeric]	Mean (sd): 1.2 (0.8); min ≤ med ≤ max: 0.1 ≤ 1.3 ≤ 2.5; IQR (CV): 1.5 (0.6)	22 distinct values	150 (100.0%)	0 (0.0%)
Species [factor]	1. setosa; 2. versicolor; 3. virginica	50 (33.3%); 50 (33.3%); 50 (33.3%)	150 (100.0%)	0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2023-09-13

note: other commands in this package include `summarytools::freq()`, `summarytools::ctable()`, and `summarytools::descr()`. When should you apply these commands?

Question 2

Reconsider `iris` and explore the data by its classification using R packages. This question is answered with three approaches that give identical results (the user's background and personal experience play an important role in which approach one selects):

`base` is the simple/naive way to explore; it is not well suited to large and complex datasets.
`data.table` is an extension of the base `data.frame` structure that can use multiple cores of your machine (its syntax is a little complex).
`dplyr` provides SQL-like verbs that work on both `data.frame` and `tibble` objects. It is part of the `tidyverse`, a set of packages for data science in R.

`base`

01: stat

Summarize the data with basic descriptive statistics (mean, median, sd, skewness, kurtosis, IQR, and CV) using `base`.

  ### SOLUTION TO QUESTION 1Aa ### 
  summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

  ##-- for numeric data
  iris.numer <- iris[,1:4] ##-- only numeric data can be used
  apply(iris.numer,2,mean) ##-- find mean

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

  symnum(cor(iris.numer)) ##-- find correlations and show them as symbols

##              S.L S.W P.L P.W
## Sepal.Length 1              
## Sepal.Width      1          
## Petal.Length +   .   1      
## Petal.Width  +   .   B   1  
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

  require(moments)
  
  ##-- skewness is based on the 3rd moment and describes the asymmetry of the data
  apply(iris.numer,2,skewness)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.3117531    0.3157671   -0.2721277   -0.1019342

  ##-- kurtosis is based on the 4th moment and describes tail heaviness (3 for a normal distribution)
  apply(iris.numer,2,kurtosis)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     2.426432     3.180976     1.604464     1.663933

  ##-- named-function version of IQR
  findIQR <- function(x){ quantile(x,probs=0.75) - quantile(x,probs=0.25)} 
  
  ##-- apply the named function to each column
  apply(iris.numer,2,findIQR)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          1.3          0.5          3.5          1.5

  ##--  inline version of IQR
  apply(iris.numer,2,function(o){ quantile(o,prob=0.75) - quantile(o,prob=0.25)})

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          1.3          0.5          3.5          1.5
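Base R also ships a built-in `IQR()` function in the `stats` package, so the hand-written version is mainly useful for practice. A short sketch comparing the two, using illustrative names (`iqr_builtin`, `iqr_manual`):

```r
iris_num <- iris[, 1:4]

# Built-in interquartile range, applied to each column
iqr_builtin <- sapply(iris_num, IQR)

# Manual version: difference between the 75% and 25% quantiles
iqr_manual <- sapply(iris_num, function(x) {
  unname(diff(quantile(x, probs = c(0.25, 0.75))))
})

print(iqr_builtin)
print(all.equal(iqr_builtin, iqr_manual))
```

`sapply()` iterates over the columns of a `data.frame` directly, which avoids the conversion to a matrix that `apply()` performs.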

  ##-- find the coefficient of variation (CV), i.e. the relative sd
  apply(iris.numer,2,function(o){ sd(o)/mean(o) } )

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.1417113    0.1425642    0.4697441    0.6355511

  ##-- find the frequency of the mode (the highest count) ## Why use max and table?
  apply(iris.numer,2,function(o){ max(table(o))} )

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##           10           26           13           29
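`max(table(o))` returns how often the most frequent value occurs, not the value itself. A minimal sketch that returns both, assuming that the first value is acceptable when several values tie (the helper names `mode_value` and `mode_count` are hypothetical):

```r
# Most frequent value of a numeric vector (first one in case of ties)
mode_value <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Number of times the most frequent value occurs
mode_count <- function(x) {
  max(table(x))
}

iris_num <- iris[, 1:4]
print(sapply(iris_num, mode_value))
print(sapply(iris_num, mode_count))
```

`table()` tabulates each distinct value, `which.max()` locates the largest count, and the names of the table carry the values themselves.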

02: query

Find the rows that satisfy the following conditions using the `base` package:

Species is “versicolor”
1.0 \(\leq\) Petal.Width \(\leq\) 1.5

  isSelect <- which(iris$Species == "versicolor" &
                      iris$Petal.Width >= 1.0 &
                      iris$Petal.Width <= 1.5
                    )
  head(iris[isSelect,])

##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor
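Whether boundary values such as 1.0 and 1.5 are kept depends on the comparison operators. The sketch below contrasts strict and inclusive bounds and also shows `subset()` as an alternative to `which()`; the variable names are illustrative.

```r
is_versicolor <- iris$Species == "versicolor"

# Strict bounds: 1.0 < Petal.Width < 1.5 (boundary values dropped)
strict <- is_versicolor & iris$Petal.Width > 1.0 & iris$Petal.Width < 1.5

# Inclusive bounds: 1.0 <= Petal.Width <= 1.5 (boundary values kept)
inclusive <- is_versicolor & iris$Petal.Width >= 1.0 & iris$Petal.Width <= 1.5

print(sum(strict))
print(sum(inclusive))

# subset() evaluates the condition inside the data.frame
via_subset <- subset(iris, Species == "versicolor" &
                       Petal.Width >= 1.0 & Petal.Width <= 1.5)
print(nrow(via_subset))
```

Summing a logical vector counts its `TRUE` values, which is a quick way to see how many rows a condition selects before subsetting.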

03: col select

Select the columns whose names contain the word ‘Sepal’ and find their means.

  ### SOLUTION TO QUESTION 1B ### 

  simCol <- grep("Sepal",names(iris))
  head(iris[,simCol])

##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6
## 6          5.4         3.9

  aggregate(Sepal.Width~Species,data=iris,mean)

##      Species Sepal.Width
## 1     setosa       3.428
## 2 versicolor       2.770
## 3  virginica       2.974

  aggregate(Sepal.Length~Species,data=iris,mean)

##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588
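The two `aggregate()` calls above can be merged so that every matched column is averaged in one pass. A minimal sketch, assuming the column names are selected with `grep(..., value = TRUE)`; `sepal_cols`, `overall_means`, and `by_species` are illustrative names.

```r
# Names (not positions) of the columns containing "Sepal"
sepal_cols <- grep("Sepal", names(iris), value = TRUE)

# Overall mean of each selected column
overall_means <- colMeans(iris[, sepal_cols])
print(overall_means)

# Mean of each selected column within each species
by_species <- aggregate(iris[, sepal_cols],
                        by = list(Species = iris$Species),
                        FUN = mean)
print(by_species)
```

Because the column list comes from `grep()`, the same code keeps working if more matching columns are added to the data.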

04: similar

Count the number of rows whose Species contains the word ‘color’ using the `base` package.

  isSelect <- grepl('color',iris$Species)

  head(iris[isSelect,])

##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor

  nrow(iris[isSelect,])

## [1] 50

05: sort

Sort the data in the following order using the `base` package:

ascending Petal.Width
descending Sepal.Width

  iris[order(iris$Petal.Width,-iris$Sepal.Width),]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 33           5.2         4.1          1.5         0.1     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 1            5.1         3.5          1.4         0.2     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 68           5.8         2.7          4.1         1.0 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 51           7.0         3.2          4.7         1.4 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 135          6.1         2.6          5.6         1.4  virginica
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 134          6.3         2.8          5.1         1.5  virginica
## 73           6.3         2.5          4.9         1.5 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 120          6.0         2.2          5.0         1.5  virginica
## 86           6.0         3.4          4.5         1.6 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 130          7.2         3.0          5.8         1.6  virginica
## 84           6.0         2.7          5.1         1.6 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 107          4.9         2.5          4.5         1.7  virginica
## 71           5.9         3.2          4.8         1.8 versicolor
## 126          7.2         3.2          6.0         1.8  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 150          5.9         3.0          5.1         1.8  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 101          6.3         3.3          6.0         2.5  virginica
## 145          6.7         3.3          5.7         2.5  virginica
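Negating a column to sort it in descending order only works for numeric columns. As an alternative sketch, `order()` accepts one `decreasing` flag per key when `method = "radix"` is used; `ord` and `iris_sorted` are illustrative names, and `head()` keeps the printed output short.

```r
# Ascending Petal.Width, then descending Sepal.Width
ord <- order(iris$Petal.Width, iris$Sepal.Width,
             decreasing = c(FALSE, TRUE),
             method = "radix")
iris_sorted <- iris[ord, ]

# Show only the first few rows instead of all 150
print(head(iris_sorted, 5))
```

The per-key `decreasing` vector also works for character and factor columns, where the unary minus would fail.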

06: mutate

Count the number of rows in each Species and summarize using the following criteria:

type	width of petal	length of petal
low	\([0.00,0.75)\)	\([0.0,2.5)\)
medium	\([0.75,1.75)\)	\([2.5,5.0)\)
high	\([1.75,\infty)\)	\([5.0,\infty)\)

  ##-- This code is NOT executed
  iris.DF <- iris

  iris.DF$tWidth  <- ifelse(iris.DF$Petal.Width<0.75,"low",
                            ifelse(iris.DF$Petal.Width<1.75,"mid","high"))
  iris.DF$tLength <- ifelse(iris.DF$Petal.Length<2.50,"low",
                            ifelse(iris.DF$Petal.Length<5.00,"mid","high"))

  ftable(tWidth+tLength~Species,data=iris.DF)

`data.table`

00: intro

`data.table` is a compact and fast package for transforming data in R, built around the general form `DT[i, j, by]`: select rows with `i`, compute `j`, grouped by `by`.

more information:

[html](https://www.listendata.com/2016/10/r-data-table.html)
[cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/datatable.pdf)
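A `data.table` query has the general shape `DT[i, j, by]`: filter rows, compute on columns, and group. The sketch below is illustrative only; the threshold of 1.0 and the names `iris_dt` and `wide_petals` are assumptions, not part of the exercise.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# i : keep rows with Petal.Width >= 1.0
# j : count the rows (.N) and average Petal.Length
# by: produce one result row per Species
wide_petals <- iris_dt[Petal.Width >= 1.0,
                       .(n = .N, mean_petal_length = mean(Petal.Length)),
                       by = Species]
print(wide_petals)
```

Species with no rows passing the filter in `i` simply do not appear in the result.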

01: stat

Summarize the data with basic descriptive statistics (mean, median, sd, skewness, kurtosis, IQR, and CV) using `data.table`.

  ### SOLUTION TO QUESTION 1Ab ### 
  require(data.table)

  iris.DT <- as.data.table(iris) 
  head(iris.DT)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4         0.2  setosa
## 2:          4.9         3.0          1.4         0.2  setosa
## 3:          4.7         3.2          1.3         0.2  setosa
## 4:          4.6         3.1          1.5         0.2  setosa
## 5:          5.0         3.6          1.4         0.2  setosa
## 6:          5.4         3.9          1.7         0.4  setosa

  iris.num.DT <- as.data.table(iris[,1:4])
  iris.num.DT[,lapply(.SD,mean)]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:     5.843333    3.057333        3.758    1.199333

  iris.num.DT[,lapply(.SD,quantile,prob=0.5)]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:          5.8           3         4.35         1.3

  iris.num.DT[,lapply(.SD,sd)]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:    0.8280661   0.4358663     1.765298   0.7622377

  require(moments)
  iris.num.DT[,lapply(.SD,skewness)]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:    0.3117531   0.3157671   -0.2721277  -0.1019342

  iris.num.DT[,lapply(.SD,kurtosis)]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:     2.426432    3.180976     1.604464    1.663933

  iris.num.DT[,lapply(.SD,function(o){ quantile(o,prob=0.75) - quantile(o,prob=0.25)})]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:          1.3         0.5          3.5         1.5

02: query

Find the rows that satisfy the following conditions using the `data.table` package:

Species is “versicolor”
1.0 \(\leq\) Petal.Width \(\leq\) 1.5

  head(iris.DT[Species=="versicolor" & between(Petal.Width,1.0,1.5)])

##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1:          7.0         3.2          4.7         1.4 versicolor
## 2:          6.4         3.2          4.5         1.5 versicolor
## 3:          6.9         3.1          4.9         1.5 versicolor
## 4:          5.5         2.3          4.0         1.3 versicolor
## 5:          6.5         2.8          4.6         1.5 versicolor
## 6:          5.7         2.8          4.5         1.3 versicolor

03: col select

Select the columns whose names contain the word ‘Sepal’ and find their means.

  ### SOLUTION TO QUESTION 1B ### 

  simCol <- names(iris.DT)[which(names(iris.DT) %ilike% 'Sepal')]
  
  head(iris.DT[,..simCol],3)

##    Sepal.Length Sepal.Width
## 1:          5.1         3.5
## 2:          4.9         3.0
## 3:          4.7         3.2

  head(iris.DT[,.SD,.SDcols=simCol],3)

##    Sepal.Length Sepal.Width
## 1:          5.1         3.5
## 2:          4.9         3.0
## 3:          4.7         3.2

  head(iris.DT[,mean(Petal.Length),by=Species])

##       Species    V1
## 1:     setosa 1.462
## 2: versicolor 4.260
## 3:  virginica 5.552

04: similar

Count the number of rows whose Species contains the word ‘color’ using `data.table`.

  iris.DT[Species%like% 'color',]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##  1:          7.0         3.2          4.7         1.4 versicolor
##  2:          6.4         3.2          4.5         1.5 versicolor
##  3:          6.9         3.1          4.9         1.5 versicolor
##  4:          5.5         2.3          4.0         1.3 versicolor
##  5:          6.5         2.8          4.6         1.5 versicolor
##  6:          5.7         2.8          4.5         1.3 versicolor
##  7:          6.3         3.3          4.7         1.6 versicolor
##  8:          4.9         2.4          3.3         1.0 versicolor
##  9:          6.6         2.9          4.6         1.3 versicolor
## 10:          5.2         2.7          3.9         1.4 versicolor
## 11:          5.0         2.0          3.5         1.0 versicolor
## 12:          5.9         3.0          4.2         1.5 versicolor
## 13:          6.0         2.2          4.0         1.0 versicolor
## 14:          6.1         2.9          4.7         1.4 versicolor
## 15:          5.6         2.9          3.6         1.3 versicolor
## 16:          6.7         3.1          4.4         1.4 versicolor
## 17:          5.6         3.0          4.5         1.5 versicolor
## 18:          5.8         2.7          4.1         1.0 versicolor
## 19:          6.2         2.2          4.5         1.5 versicolor
## 20:          5.6         2.5          3.9         1.1 versicolor
## 21:          5.9         3.2          4.8         1.8 versicolor
## 22:          6.1         2.8          4.0         1.3 versicolor
## 23:          6.3         2.5          4.9         1.5 versicolor
## 24:          6.1         2.8          4.7         1.2 versicolor
## 25:          6.4         2.9          4.3         1.3 versicolor
## 26:          6.6         3.0          4.4         1.4 versicolor
## 27:          6.8         2.8          4.8         1.4 versicolor
## 28:          6.7         3.0          5.0         1.7 versicolor
## 29:          6.0         2.9          4.5         1.5 versicolor
## 30:          5.7         2.6          3.5         1.0 versicolor
## 31:          5.5         2.4          3.8         1.1 versicolor
## 32:          5.5         2.4          3.7         1.0 versicolor
## 33:          5.8         2.7          3.9         1.2 versicolor
## 34:          6.0         2.7          5.1         1.6 versicolor
## 35:          5.4         3.0          4.5         1.5 versicolor
## 36:          6.0         3.4          4.5         1.6 versicolor
## 37:          6.7         3.1          4.7         1.5 versicolor
## 38:          6.3         2.3          4.4         1.3 versicolor
## 39:          5.6         3.0          4.1         1.3 versicolor
## 40:          5.5         2.5          4.0         1.3 versicolor
## 41:          5.5         2.6          4.4         1.2 versicolor
## 42:          6.1         3.0          4.6         1.4 versicolor
## 43:          5.8         2.6          4.0         1.2 versicolor
## 44:          5.0         2.3          3.3         1.0 versicolor
## 45:          5.6         2.7          4.2         1.3 versicolor
## 46:          5.7         3.0          4.2         1.2 versicolor
## 47:          5.7         2.9          4.2         1.3 versicolor
## 48:          6.2         2.9          4.3         1.3 versicolor
## 49:          5.1         2.5          3.0         1.1 versicolor
## 50:          5.7         2.8          4.1         1.3 versicolor
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species

  iris.DT[Species%like% 'color',.N]

## [1] 50

note: `data.table::uniqueN()` can be used to count the number of unique rows.

  uniqueN(iris.DT)

## [1] 149

05: sort

Sort the data in the following order using the `data.table` package:

ascending Petal.Width
descending Sepal.Width

  iris.DT[order(Petal.Width,-Sepal.Width)]

##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.2         4.1          1.5         0.1    setosa
##   2:          4.9         3.6          1.4         0.1    setosa
##   3:          4.9         3.1          1.5         0.1    setosa
##   4:          4.8         3.0          1.4         0.1    setosa
##   5:          4.3         3.0          1.1         0.1    setosa
##  ---                                                            
## 146:          6.7         3.1          5.6         2.4 virginica
## 147:          5.8         2.8          5.1         2.4 virginica
## 148:          7.2         3.6          6.1         2.5 virginica
## 149:          6.3         3.3          6.0         2.5 virginica
## 150:          6.7         3.3          5.7         2.5 virginica

06: mutate

Count the number of rows in each Species and summarize using the following criteria:

type	width of petal	length of petal
low	\([0.00,0.75)\)	\([0.0,2.5)\)
medium	\([0.75,1.75)\)	\([2.5,5.0)\)
high	\([1.75,\infty)\)	\([5.0,\infty)\)

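  wRange <- c(0.00,0.75,1.75,10.0)
  wLabel <- c("low","mid","high")
  lRange <- c(0.00,2.50,5.00,15.0)
  lLabel <- c("low","mid","high")
  
  ##-- note: cut() builds right-closed intervals (a,b] by default, so a value
  ##-- exactly on a break (e.g. Petal.Length = 5.0) falls in the lower class;
  ##-- pass right=FALSE to get the [a,b) intervals listed in the criteria
  iris.DT[,tWidth :=cut(Petal.Width ,wRange,wLabel)]
  iris.DT[,tLength:=cut(Petal.Length,lRange,lLabel)]
  iris.DT[,.N,by=.(tWidth,tLength,Species)]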

##    tWidth tLength    Species  N
## 1:    low     low     setosa 50
## 2:    mid     mid versicolor 48
## 3:   high     mid versicolor  1
## 4:    mid    high versicolor  1
## 5:   high    high  virginica 38
## 6:    mid     mid  virginica  2
## 7:   high     mid  virginica  7
## 8:    mid    high  virginica  3
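How `cut()` treats values that sit exactly on a break can change these counts. By default the intervals are closed on the right, \((a,b]\), whereas the criteria are written as \([a,b)\). A small sketch on three hand-picked values (the vector and names are illustrative):

```r
petal_length  <- c(4.9, 5.0, 5.1)
length_breaks <- c(0, 2.5, 5, 15)
length_labels <- c("low", "mid", "high")

# Default: right-closed intervals (0,2.5], (2.5,5], (5,15]
right_closed <- cut(petal_length, length_breaks, length_labels)

# right = FALSE: left-closed intervals [0,2.5), [2.5,5), [5,15)
left_closed <- cut(petal_length, length_breaks, length_labels, right = FALSE)

print(data.frame(petal_length, right_closed, left_closed))
```

Only the middle value, which lies exactly on the break at 5, changes class between the two calls. Using `Inf` as the last break is another option that avoids choosing an arbitrary upper limit.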

`dplyr`

00: intro

`dplyr` is a part of `tidyverse` for data transformation. Its main functions (verbs) correspond to SQL clauses:

`dplyr`	SQL	description
`select()`	SELECT	picks columns
`filter()`	WHERE	picks rows based on their values
`group_by()`	GROUP BY	groups data
`summarise()`	-	reduces columns to a summary
`arrange()`	ORDER BY	orders rows
`inner_join()`, `left_join()`, ...	JOIN	joins data
`mutate()`	column alias (AS)	adds new columns
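
To make the correspondence concrete, here is a small sketch that chains several of the verbs on `iris`. The threshold `Sepal.Length > 5` and the names `meanPW`, `n`, and `res` are chosen only for illustration, and the SQL in the comment is a rough equivalent rather than a tested query:

```r
library(dplyr)

##-- SQL equivalent (roughly):
##--   SELECT Species, AVG(Petal.Width) AS meanPW, COUNT(*) AS n
##--   FROM iris WHERE Sepal.Length > 5
##--   GROUP BY Species ORDER BY meanPW DESC
iris %>%
  filter(Sepal.Length > 5) %>%                        ##-- WHERE
  select(Species, Petal.Width) %>%                    ##-- SELECT
  group_by(Species) %>%                               ##-- GROUP BY
  summarise(meanPW = mean(Petal.Width), n = n()) %>%  ##-- aggregate
  arrange(desc(meanPW)) -> res                        ##-- ORDER BY ... DESC
res
```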

more information:

[html](https://www.listendata.com/2016/08/dplyr-tutorial.html)
[cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)

01: stat

Summarize the data with basic descriptive statistics (mean, median, sd, skewness, kurtosis, IQR, and CV) using `dplyr`

  ### SOLUTION TO QUESTION 1Ab ### 
  require(dplyr)

  iris.numer <- iris[,1:4]
  summarise_all(iris.numer,.funs=mean)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333

  require(moments)
  summarise_all(iris.numer,.funs=skewness)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1    0.3117531   0.3157671   -0.2721277  -0.1019342

  ##-- interquartile range (the same quantity as IQR())
  summarise_all(iris.numer,.funs=function(o){ 
    quantile(o,probs=0.75) - quantile(o,probs=0.25)
    } )

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          1.3         0.5          3.5         1.5
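
The remaining statistics in the list can be computed in one call. This is a sketch that assumes `dplyr` 1.0.0 or later (for `across()`); `cv()` and `iris.stat` are names defined here for illustration and are not part of `dplyr`:

```r
library(dplyr)

##-- coefficient of variation; a helper defined here, not a dplyr function
cv <- function(o) sd(o) / mean(o)

iris.stat <- summarise(iris[, 1:4],
                       across(everything(),
                              list(median = median, sd = sd, IQR = IQR, cv = cv)))

##-- one row; columns are named <variable>_<statistic>, so transpose for reading
t(iris.stat)
```

Skewness and kurtosis can be added to the list in the same way once `moments` is loaded.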

note

`glimpse()` is the `dplyr` alternative to `str()`

  require(dplyr) 

  glimpse(iris)

## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

02: query

Find the rows that satisfy the following conditions using the `dplyr` package

Species is “versicolor”
1.0 \(\leq\) Petal.Width \(\leq\) 1.5

  require(dplyr)  
  
  head(filter(iris,between(Petal.Width,1.0,1.5) & Species == "versicolor"))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          5.5         2.3          4.0         1.3 versicolor
## 5          6.5         2.8          4.6         1.5 versicolor
## 6          5.7         2.8          4.5         1.3 versicolor

  ##-- chain version
  iris %>% filter(between(Petal.Width,1.0,1.5) & Species == "versicolor") -> iris.filter
  head(iris.filter)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          5.5         2.3          4.0         1.3 versicolor
## 5          6.5         2.8          4.6         1.5 versicolor
## 6          5.7         2.8          4.5         1.3 versicolor

03: group

Select the columns whose names contain the word ‘Sepal’ and find their means by Species using the `dplyr` package

  ##-- This code is NOT executed
  select(iris, contains("Sepal"),"Species") -> iris.dp  

  summarise_all(group_by(iris.dp,Species),mean)

04: similar

Count the rows whose Species contains the word ‘color’ using the `dplyr` package

  head(filter(iris,grepl("color",Species) ))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          5.5         2.3          4.0         1.3 versicolor
## 5          6.5         2.8          4.6         1.5 versicolor
## 6          5.7         2.8          4.5         1.3 versicolor

  nrow(filter(iris,grepl("color",Species) ))

## [1] 50

05: sort

Order the data with the `dplyr` package by:

ascending Petal.Width
descending Sepal.Width

  ##-- This code is NOT executed
  head(arrange(iris,Petal.Width,desc(Sepal.Width)))

06: manipulate

Count the number of rows in each Species, classified by the following criteria:

type	width of petal	length of petal
low	\([0.00,0.75)\)	\([0.0,2.5)\)
mid	\([0.75,1.75)\)	\([2.5,5.0)\)
high	\([1.75,\infty)\)	\([5.0,\infty)\)

  ### SOLUTION TO QUESTION 1F ###   
  wRange <- c(0.75,1.75)   ##-- class boundaries of Petal.Width
  lRange <- c(2.50,5.00)   ##-- class boundaries of Petal.Length
  
  iris.dply <- iris
  
  ##-- a value exactly on a boundary falls in "mid"
  iris.dply <- mutate(iris.dply,tWidth=case_when(
    Petal.Width < wRange[1] ~ "low",
    Petal.Width > wRange[2] ~ "high",
    TRUE ~ "mid"
  ) )
  
  iris.dply <- mutate(iris.dply,tLength=case_when(
    Petal.Length < lRange[1] ~ "low",
    Petal.Length > lRange[2] ~ "high",
    TRUE ~ "mid"
  ) )
  glimpse(iris.dply)

## Rows: 150
## Columns: 7
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
## $ tWidth       <chr> "low", "low", "low", "low", "low", "low", "low", "low", "…
## $ tLength      <chr> "low", "low", "low", "low", "low", "low", "low", "low", "…

  ftable(tWidth+tLength~Species,data=iris.dply)

##            tWidth  high          low          mid        
##            tLength high low mid high low mid high low mid
## Species                                                  
## setosa                0   0   0    0  50   0    0   0   0
## versicolor            0   0   1    0   0   0    1   0  48
## virginica            38   0   7    0   0   0    3   0   2

EXTRA I

Use your knowledge to query and mutate `iris` in the following steps:

label rows by Petal.Length into three equal groups (i.e., ‘PL.H’, ‘PL.M’, ‘PL.L’)
label rows by Sepal.Length into three equal groups (i.e., ‘SL.H’, ‘SL.M’, ‘SL.L’)
find the number of rows in each group (there are at most \(3 \times 3 \times 3=\) 27 groups)
compute the median and sd of Petal.Width in each group
compute the mode and CV of Sepal.Width in each group
ignore NA records

The result should be similar to this table
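
One possible approach is sketched below; it is not the reference solution. It assumes that "equal groups" means three equal-width intervals (`cut(..., breaks = 3)`); equal-count groups would need quantile breaks instead. The helpers `modeOf()` and `cv()` and the result name `iris.extra` are defined here for illustration, and `.groups` assumes `dplyr` 1.0.0 or later:

```r
library(dplyr)

##-- most frequent value (first one wins on ties); a helper assumed for illustration
modeOf <- function(o) as.numeric(names(which.max(table(o))))
##-- coefficient of variation
cv     <- function(o) sd(o) / mean(o)

iris %>%
  filter(!is.na(Petal.Length), !is.na(Sepal.Length)) %>%
  mutate(PL = cut(Petal.Length, breaks = 3, labels = c("PL.L", "PL.M", "PL.H")),
         SL = cut(Sepal.Length, breaks = 3, labels = c("SL.L", "SL.M", "SL.H"))) %>%
  group_by(Species, PL, SL) %>%
  summarise(n         = n(),
            PW.median = median(Petal.Width),
            PW.sd     = sd(Petal.Width),     ##-- NA when a group has one row
            SW.mode   = modeOf(Sepal.Width),
            SW.cv     = cv(Sepal.Width),
            .groups   = "drop") -> iris.extra
iris.extra
```

Only the combinations that actually occur in the data appear in the result, so the table has fewer than 27 rows.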

EXTRA II

Convert the data into long format and convert it back

hint There are two possible packages for this task: reshape2 and data.table

reshape2::melt() and reshape2::dcast()
data.table::melt() and data.table::dcast()
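
A minimal sketch of the round trip with `data.table` follows. The `id` column is an assumption added here: without a row identifier, `dcast()` cannot tell which long-format values belong to the same flower.

```r
library(data.table)

iris.DT <- as.data.table(iris)
iris.DT[, id := .I]                       ##-- row identifier, needed to cast back

##-- wide -> long: one row per (flower, measurement)
iris.long <- melt(iris.DT, id.vars = c("id", "Species"),
                  variable.name = "variable", value.name = "value")

##-- long -> wide: one column per measurement again
iris.wide <- dcast(iris.long, id + Species ~ variable, value.var = "value")

dim(iris.long)   ##-- 600 rows: 150 flowers x 4 measurements
dim(iris.wide)   ##-- 150 rows, as in the original
```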

Question 3

Visualize the `iris` dataset using the standard `base` package and then the `lattice` package for its classification. Please think about its outliers, as you need the observations for the next question:

01: RECAP

Here are some plots that you should know in the first half. Please recreate them.

NOTE

data points 2, 4, 17, 39, and 90 are marked using text()
identify() is a powerful command for semi-manual labeling

NOTE

the tick marks under each box are the data values, drawn with rug()
panels can be combined using par(mfrow=c(2,2))
a boxplot is a powerful plot for checking distributions and outliers

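  ##-- save only the settable parameters so that par(oldPar) restores them without warnings
  oldPar <- par(no.readonly=TRUE)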
  boxplot(iris,col="gray")

  par(mfcol=c(2,2))
  boxplot(Sepal.Length~Species,data=iris,col="gray")
  boxplot(Sepal.Width~Species,data=iris,col="gray")
  boxplot(Petal.Length~Species,data=iris,col="gray")
  boxplot(Petal.Width~Species,data=iris,col="gray")

  par(oldPar)

02: rare plot

Here are some interesting plots that we rarely use or cover in the first half

stem and leaf plot

  stem(iris[,2])

## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   20 | 0
##   21 | 
##   22 | 000
##   23 | 0000
##   24 | 000
##   25 | 00000000
##   26 | 00000
##   27 | 000000000
##   28 | 00000000000000
##   29 | 0000000000
##   30 | 00000000000000000000000000
##   31 | 00000000000
##   32 | 0000000000000
##   33 | 000000
##   34 | 000000000000
##   35 | 000000
##   36 | 0000
##   37 | 000
##   38 | 000000
##   39 | 00
##   40 | 0
##   41 | 0
##   42 | 0
##   43 | 
##   44 | 0

sunflower plot

  sunflowerplot(iris[,1],iris[,2])

03: `pairs()` and `hist()`

Observe the dataset visually using `pairs()` and `hist()`. They are basic tools for understanding distributions and relationships

  ### SOLUTION TO QUESTION 2B ### 
  iris.jitter <- apply(iris[,1:4],2,function(o){jitter(o)})
  pairs(iris.jitter,col=iris$Species)

  hist(iris$Sepal.Length,n=10,col="grey",freq = F)
  rug(jitter(iris$Sepal.Length))
  points(density(iris$Sepal.Length),col="red",type="l")

04: lattice

Visualize the data by class using `bwplot()` in the `lattice` package and compare the result with `boxplot()`

  ### SOLUTION TO QUESTION 2C ### 
  boxplot(Sepal.Length~Species,data=iris,col="grey" )

  boxplot(iris[[2]]~iris$Species,col="grey" ) ##-- alternative version using list

This is an advanced use of `boxplot()` that overlays the three Species on one plot

  iris.seto <- iris[1:50,]
  iris.virg <- iris[which(iris$Species=="virginica"),]
  iris.vers <- iris[51:100,]
  boxplot(iris.vers[1:4],pch=16,cex=0.5,col="blue")
  boxplot(iris.seto[1:4],pch=16,cex=0.5,col="orange",add=T)
  boxplot(iris.virg[1:4],pch=16,cex=0.5,col="#F0FF00AA",add=T)

Alternatively, the `lattice` package provides conditioned scatter plots through `xyplot()` and box-and-whisker plots through `bwplot()`

  require(lattice)  
  xyplot(iris[[2]]~iris[[1]]|iris[[5]])

  bwplot(Sepal.Length~factor(ceiling(Sepal.Width)) |Species,data=iris)

05: outlier

We can identify the outliers of `boxplot()` by saving its result in another variable. For example,

  tempBox <- boxplot(iris[[1]]~iris[[5]],plot=F)
  str(tempBox)

## List of 6
##  $ stats: num [1:5, 1:3] 4.3 4.8 5 5.2 5.8 4.9 5.6 5.9 6.3 7 ...
##  $ n    : num [1:3] 50 50 50
##  $ conf : num [1:2, 1:3] 4.91 5.09 5.74 6.06 6.34 ...
##  $ out  : num 4.9
##  $ group: num 3
##  $ names: chr [1:3] "setosa" "versicolor" "virginica"

Then, we can identify the row index using `which()`. For example,

  which(iris[[1]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])

## [1] 107

Manually, you can keep each result as a set of outliers (e.g., `ol1`, `ol2`) and combine the sets with the command `union()`.

  tempBox <- boxplot(iris[[1]]~iris[[5]],plot=F)
  ol1 <- which(iris[[1]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])
  tempBox <- boxplot(iris[[2]]~iris[[5]],plot=F)
  ol2 <- which(iris[[2]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])
  
  outlier <- union(ol1,ol2)
  outlier

## [1] 107  42

Note: the set operations are:

`union()`
`intersect()`
`setdiff()`

Here is a full implementation of this concept

loop for extracting the outlier information of all columns

  iris.boxList <- data.frame()
  for(i in 1:4){
    ## i <- 1
    tempBox <- boxplot(iris[[i]]~iris[[5]],plot=F)
    tempDF  <- as.data.frame(tempBox[c("out", "group")])
    tempDF$colName <- i
    tempDF$species <- tempBox$names[tempDF$group]
    iris.boxList <- rbind(iris.boxList,tempDF)
  }
  iris.boxList

##   out group colName    species
## 1 4.9     3       1  virginica
## 2 2.3     1       2     setosa
## 3 1.0     1       3     setosa
## 4 3.0     2       3 versicolor
## 5 0.5     1       4     setosa
## 6 0.6     1       4     setosa

loop for finding the row index of each outlier

  ##-- initialize the result column
  iris.boxList$which <- NA_integer_
  for(i in 1:nrow(iris.boxList) ){
    ## i <- 1
    colIdx  <- iris.boxList$colName[i]
    species <- iris.boxList$species[i]
    value   <- iris.boxList$out[i]
    resRow  <- which( iris[[colIdx]] == value  & iris[[5]]==species)
    iris.boxList$which[i] <- resRow
  }
  iris.boxList

##   out group colName    species which
## 1 4.9     3       1  virginica   107
## 2 2.3     1       2     setosa    42
## 3 1.0     1       3     setosa    23
## 4 3.0     2       3 versicolor    99
## 5 0.5     1       4     setosa    24
## 6 0.6     1       4     setosa    44
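
The two loops can also be collapsed into one helper. The sketch below is an alternative formulation, not the code used above: `flagOutlier()` is a name introduced here, and it relies on `boxplot.stats()`, which applies the same 1.5 IQR rule that `boxplot()` uses by default. Its result should agree with the `which` column above.

```r
##-- flag the values that boxplot.stats() reports as outliers within each Species
flagOutlier <- function(col, group = iris$Species) {
  flag <- logical(length(col))
  for (g in levels(group)) {
    idx       <- which(group == g)
    flag[idx] <- col[idx] %in% boxplot.stats(col[idx])$out
  }
  flag
}

flags       <- sapply(iris[, 1:4], flagOutlier)   ##-- 150 x 4 logical matrix
outlier.row <- which(rowSums(flags) > 0)          ##-- rows flagged in any column
outlier.row
```

Because the match is by value, every row that shares an outlying value within its Species is flagged, whereas the loop above records one row index per outlier.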

06: put together

Perhaps the most important part of data mining is data cleansing: it is the most time-consuming process, and it affects every later step. Before analyzing, it is important to clean the data. The major steps are:

check for duplication
check for missing values
check for obvious errors, e.g., swapped columns, out-of-bound values, wrong gender
check for irregularities: not-so-obvious or strange values that usually require investigation, e.g., outliers, noise
consider removing or imputing such data points
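
The first four checks can be sketched with base R as follows. The validity rule in step 3 (measurements must be positive, Species must not be missing) is an assumption made for `iris`; a real dataset needs its own rules.

```r
##-- 1) duplicated rows
dupRows  <- which(duplicated(iris))
##-- 2) missing values per column
naCount  <- colSums(is.na(iris))
##-- 3) obvious errors: measurements must be positive, Species must be known
badValue <- which(rowSums(iris[, 1:4] <= 0) > 0 | is.na(iris$Species))
##-- 4) irregular values: number of boxplot-rule outliers per column
outCount <- sapply(iris[, 1:4], function(o) length(boxplot.stats(o)$out))

list(duplicated = dupRows, missing = naCount, invalid = badValue, outliers = outCount)
```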

Because `iris` is already a clean dataset, we only have to worry about outliers. We can manually identify outliers using `identify()` on a scatter plot

  ##-- This code is NOT executed
  xAxis <-  iris[,1]
  yAxis <-  iris[,2]  
  plot(xAxis,yAxis)
  identify(xAxis,yAxis)

Removing the outliers can be done later. Recall that

  iris.boxList

##   out group colName    species which
## 1 4.9     3       1  virginica   107
## 2 2.3     1       2     setosa    42
## 3 1.0     1       3     setosa    23
## 4 3.0     2       3 versicolor    99
## 5 0.5     1       4     setosa    24
## 6 0.6     1       4     setosa    44

  outlier.Idx <- unique(iris.boxList$which)
  iris.cln <- iris[-outlier.Idx,]

Alternatively, we can apply the local outlier factor, `lof()` in the `Rlof` package, which identifies outliers by comparing the local density of each point with that of its nearest neighbors.

  ##-- This code is NOT executed
  require(Rlof)
  lof.dist <- lof(iris[,1:4],k=8) 
  isGood   <-  which(lof.dist < 1.2)
  iris.lof <- iris[isGood,]
  pairs(iris.lof[,1:4],col=iris.lof$Species)

Question 4

Visualize the `iris` dataset using the `ggplot2` package

00: concept

The `ggplot2` package is a part of `tidyverse` that plots `data.frame` and `data.table` objects. ggplot2 is based on the grammar of graphics, which separates a plot into components: a data set, a coordinate system, and geoms (visual marks that represent data points).

DATA data.frame or data.table (required)
GEOM_FUNCTION plot style (required), e.g., geom_point(), geom_boxplot()
MAPPING axes and plot components (required), e.g., aes(x=,y=,fill=)
STAT_FUNCTION plot with a statistical result
FACET_FUNCTION show many plots in the same figure
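
As a sketch of how the five components fit together, the following plot labels each part in a comment. The choice of variables is arbitrary, and `stat_summary(fun = ...)` assumes `ggplot2` 3.3.0 or later:

```r
library(ggplot2)

p <- ggplot(data = iris,                                    ##-- DATA
            mapping = aes(x = Species, y = Petal.Length,    ##-- MAPPING
                          fill = Species)) +
  geom_boxplot() +                                          ##-- GEOM_FUNCTION
  stat_summary(fun = mean, geom = "point", shape = 4) +     ##-- STAT_FUNCTION (marks the mean)
  facet_wrap(vars(Species), scales = "free_x")              ##-- FACET_FUNCTION
p
```

Only the first three components are required; removing the last two lines still gives a valid plot.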

note the code is available in the next tab

more information:

[html](https://ggplot2.tidyverse.org/)
[cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)

01: typical

We cover a few simple geoms.

Histogram and Density plot

  require(ggplot2)
  ggplot(iris, aes(Sepal.Width,fill=Species)) + geom_histogram(bins=25)

Dot plot or overlapped histogram

  ggplot(iris) +  geom_dotplot(aes(x=Sepal.Width,fill=Species))

scatter plot

  ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,color=Species)) +  geom_point(position="jitter") + theme_classic()

02: adv

In general, `ggplot()` with additional packages can produce almost any form of visualization. Here are some capabilities that we may use for the team project.

facet

  require(ggplot2)
  require(data.table)
  iris.DT <- as.data.table(iris)
  iris.lng <- melt(iris.DT,id.vars="Species")
  ggplot(iris.lng, aes(x=variable,y=value,fill=Species)) + geom_violin() -> gg
  gg + facet_grid(cols=vars(Species) ) + xlab("dimension") + ylab("Unit (cm)")

1D density with facet and 2D density

  ggplot(iris.lng, aes(value,color=variable))+ geom_density() + facet_grid(cols=vars(Species) )

  ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,color=Species)) +  geom_density_2d()

combine filter with plot

 ##-- filter the data first, then pipe the result into the plot (%>% needs dplyr loaded)
 iris.DT[Species %like% "color"] %>%
   ggplot(aes(x=Sepal.Width,y=Sepal.Length,color=Species)) + geom_point()

note

%>% = pipe operator (passes data to the next command), available through the dplyr package
melt = command to make a long table
dcast = command to make a wide table

Question 5

After marking 30 questions, an instructor notices possible cheating among the following ten students. The questions are TRUE-FALSE questions, and the instructor has marked ‘1’ for a correct answer and ‘0’ for an incorrect answer. Can you detect the cheaters (source and copier)?

00: concept

Important Concept

How and what should be compared?
- the answers to each question between two students at a time (why?)
- whether both get a wrong/right answer (any signal?)
What are the measures of similarity?
- simple matching coefficient: \[ SMC = \frac{\mbox{\# matching attributes}}{\mbox{\# all attributes}} \]
- cosine similarity: \[ \cos(\mathbf{x},\mathbf{y}) =\frac{\mathbf{x} \cdot \mathbf{y} }{\| \mathbf{x}\| ~\|\mathbf{y}\|}\]
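
Both measures take one line each in R. The sketch below uses two toy answer vectors (not taken from `examMarked.csv`); `smc()` and `cosSim()` are names introduced here for illustration:

```r
##-- simple matching coefficient: share of positions where the marks agree
smc    <- function(x, y) mean(x == y)
##-- cosine similarity: dot product divided by the product of the norms
cosSim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

x <- c(1, 1, 0, 1, 0)
y <- c(1, 0, 0, 1, 1)

smc(x, y)      ##-- 3 matches out of 5 = 0.6
cosSim(x, y)   ##-- 2 / (sqrt(3) * sqrt(3)) = 0.667
```

Note the difference: SMC counts a question that both students got wrong (0 and 0) as a match, while the cosine only rewards questions that both got right.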

01: prepare

STEP 1: prepare data

  ##-- This code is NOT executed
  ##-- If an error occurs, check the location of your file
  exam.df <- read.csv(file="examMarked.csv")

  colnames(exam.df) <- c("id",paste( "Q",1:30,sep=""))
  
  stu1 <- exam.df[1,2:31]
  stu2 <- exam.df[2,2:31]
  
  ##-- example of data 
  rbind(stu1,stu2)[,1:10]

##   Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
## 1  1  1  1  1  0  1  1  1  1   1
## 2  1  1  1  0  0  1  1  0  1   1

create a function that compares two students

checkFun <- function(stu1,stu2){
  case00 <- length(which(stu1 == 0 & stu2 == 0))
  case01 <- length(which(stu1 == 0 & stu2 == 1))
  case10 <- length(which(stu1 == 1 & stu2 == 0))
  case11 <- length(which(stu1 == 1 & stu2 == 1))
  return( list(case00=case00,
               case10=case10,
               case01=case01,
               case11=case11) )
}

## test your code
checkFun(stu1,stu2)

## $case00
## [1] 8
## 
## $case10
## [1] 5
## 
## $case01
## [1] 5
## 
## $case11
## [1] 12

02: compare

STEP 2: compare every pair of students

  ##-- This code is NOT executed
  pair <- combn(10,2)
  nPair<- ncol(pair)
  pairResult <- data.frame(stu1ID=pair[1,],stu2ID=pair[2,],
                           case00=rep(NA,nPair),case10=rep(NA,nPair),
                           case01=rep(NA,nPair),case11=rep(NA,nPair))

  for( i in 1:nPair){
    ## i <- 1 ## debug
    stu1ID <- pairResult$stu1ID[i]
    stu2ID <- pairResult$stu2ID[i]
    exam1  <- exam.df[stu1ID,2:31]
    exam2  <- exam.df[stu2ID,2:31]
    compResult <- checkFun(exam1,exam2)
    pairResult$case00[i] <- compResult$case00
    pairResult$case10[i] <- compResult$case10
    pairResult$case01[i] <- compResult$case01
    pairResult$case11[i] <- compResult$case11  
  }

  ##-- SMC of each pair: matching answers / all answers (must be computed before sorting)
  pairResult$smc <- (pairResult$case00 + pairResult$case11)/rowSums(pairResult[,3:6])
  ord <- order(pairResult$smc,decreasing = T)
  pairResult[ord,]

03: stat

cor.test

  cor.test(as.numeric(exam.df[1,2:31]),as.numeric(exam.df[2,2:31]))

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(exam.df[1, 2:31]) and as.numeric(exam.df[2, 2:31])
## t = 1.7951, df = 28, p-value = 0.08343
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04410735  0.61083640
## sample estimates:
##      cor 
## 0.321267

Question 6

Consider the following visualization example: the murder arrest rate (per 100,000 residents) by US state, from `USArrests` in the `datasets` package, shown as a thermal map

0A: map

The `maps` package has a built-in function, `map()`, for drawing maps. The level of detail may vary by country. Here is a state map of the US.

  require(maps)
  map('state',col=c("red","blue","green"),fill=T)

0B: data

Before integrating with map, we need to know more about `USArrests` data.

  data <- as.data.frame(USArrests)
  head(data)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

  hist(data$Murder,col="grey")
  rug(jitter(data$Murder))
  box()

0C: color

We use the `heat.colors()` scheme for the colors (red = hot; yellow = warm)

  intQuantile <- c(0.1,0.25,0.5,0.75,0.9)
  nRange      <- length(intQuantile)
  colorRange  <- sort(heat.colors(nRange+1),decreasing = T)
  ##-- nRange quantiles give nRange+1 classes, so show all nRange+1 colors
  pie(rep(1,nRange+1),col=colorRange)

put color and map together

  valMurder   <- quantile(data$Murder,intQuantile)
  myCol       <- as.character(cut(data$Murder,breaks = c(0,valMurder,20),labels=colorRange))
  
  map('state',col=myCol,fill=T)
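
One caveat, offered as an observation rather than as part of the original solution: `map()` assigns `col` to its polygons in its own order, and the `state` database splits some states into several polygons and omits Alaska and Hawaii, so a color vector in `USArrests` row order may not line up with the polygons. A minimal sketch of matching colors by state name follows; `polyName`, `stateName`, `stateCol`, and `polyCol` are names introduced here:

```r
library(maps)

murder        <- USArrests$Murder
names(murder) <- tolower(rownames(USArrests))

##-- polygon names in drawing order, e.g., "new york:manhattan"
polyName  <- map("state", plot = FALSE, namesonly = TRUE)
stateName <- sub(":.*", "", polyName)             ##-- drop the sub-region suffix

##-- same class breaks and colors as in the code above
intQuantile <- c(0.1, 0.25, 0.5, 0.75, 0.9)
colorRange  <- sort(heat.colors(length(intQuantile) + 1), decreasing = TRUE)
stateCol    <- as.character(cut(murder,
                                breaks = c(0, quantile(murder, intQuantile), 20),
                                labels = colorRange))
names(stateCol) <- names(murder)

##-- look up the color of each polygon by state name (NA = no data, left unfilled)
polyCol <- stateCol[stateName]
map("state", col = polyCol, fill = TRUE)
```

Regions without a row in `USArrests`, such as the District of Columbia, receive `NA` and stay unfilled.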

0D: label

preparation

  ##-- prepare legend
  legendText <- c(paste(c(">",valMurder[nRange]),collapse=""))
  for(j in (nRange-1):1){
    legendText <- c(legendText,paste(c(valMurder[j],"-",valMurder[j+1]),collapse=""))
  }
  legendText <- c(legendText,paste(c("<",valMurder[1]),collapse=""))
  
  legendText

## [1] ">13.32"      "11.25-13.32" "7.25-11.25"  "4.075-7.25"  "2.56-4.075" 
## [6] "<2.56"

combine together

  map('state',col=myCol,fill=T)
  ##-- legendText runs from the highest class to the lowest, so reverse the colors to match
  legend("bottomright",legend=legendText,pch=15,col=rev(colorRange),ncol=1,cex=1.0,pt.cex=3.5
         ,y.intersp=0.5,bty="n")

01: other maps

Based on the previous code blocks, draw the other three variables, i.e., Assault, UrbanPop, and Rape, in a similar manner with the function `plotThermalMap(type=1,quantLv=c(0.1,0.25,0.5,0.75,0.9))`

  ##-- This code is NOT executed and is incomplete
  plotThermalMap <- function(type=1,quantLv=c(0.1,0.25,0.5,0.75,0.9)){
    
    data <- as.data.frame(USArrests)

    ##-- This part is intentionally left out --##
    
    return(0)
  }
  
  plotThermalMap(4)

02: rscript (DOS)

Automatically generate the thermal maps and export them as files (note: this can be combined and executed from a batch file using `Rscript <fileName>.R` in DOS)

  ##-- This code is NOT executed
  typeName <- colnames(USArrests)
  nType    <- length(typeName)
  for(i in 1:nType){
    ## i <- 1 ##-- for debug
    fileName <- paste( c(typeName[i],".png"),collapse="")
    png(fileName,width = 600,height = 600)
    plot.new()
    
    ##-- function from the previous part
    plotThermalMap(i) 
    dev.off()
  }

Question 7

The final step after understanding the patterns and insights of a dataset is to prepare the data for a model. This involves preparing two data sets: one for training the model and one for testing it.

note: This last question overlaps with a question in the next workshop.

01: insights

Based on the data exploration so far, list useful insights that can be used in model selection.

02: separate

This can be done using the command `sample.int()` or `sample()`. It is very important to call `set.seed()` first so that the split is reproducible

  ##-- using `sample()`
  set.seed(17)
  smpl.idx <- sample(1:nrow(iris),size=30)
  
  ##-- using `sample.int()`
  set.seed(17)
  smpl.idx <- sample.int(nrow(iris),size=30)
  
  iris.test <- iris[smpl.idx,]
  iris.train<- iris[-smpl.idx,]
  
  head(iris.test)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 108          7.3         2.9          6.3         1.8 virginica
## 42           4.5         2.3          1.3         0.3    setosa
## 129          6.4         2.8          5.6         2.1 virginica
## 6            5.4         3.9          1.7         0.4    setosa
## 133          6.4         2.8          5.6         2.2 virginica
## 110          7.2         3.6          6.1         2.5 virginica

NOTE Is the sample balanced in terms of Species? If not, is there a solution?
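
One common answer is stratified sampling: draw the same number of test rows from each Species. The following is a minimal sketch with base R; the size of 10 per Species (30 in total, as above) and the names `strat.idx`, `iris.test.st`, and `iris.train.st` are choices made here for illustration:

```r
set.seed(17)

##-- row indices of each Species, then 10 sampled from each
idxBySpecies <- split(seq_len(nrow(iris)), iris$Species)
strat.idx    <- unlist(lapply(idxBySpecies, function(idx) sample(idx, size = 10)))

iris.test.st  <- iris[strat.idx, ]
iris.train.st <- iris[-strat.idx, ]

##-- each Species now contributes exactly 10 test rows
table(iris.test.st$Species)
```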

03: build model

Different packages have different ways to build a model and produce results. Here are three examples from three classification packages.

`class::knn()`

k-nearest neighbors is the simplest classification method. It is also unusual in how the model is built: there is no separate training step, and `knn()` takes the training and test sets in a single call.

  require(class)
  species.knn <- knn(train=iris.train[,-5],test=iris.test[,-5],
                  cl=iris.train[,5],k=3)  
  
  ftable(iris.test[,5],species.knn)

##            species.knn setosa versicolor virginica
##                                                   
## setosa                     11          0         0
## versicolor                  0          9         1
## virginica                   0          1         8
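
A confusion table like this one can be reduced to a single accuracy figure: the correct predictions lie on the diagonal. The sketch below re-enters the counts shown above by hand; `conf.knn` and `acc.knn` are names introduced here:

```r
##-- confusion table copied from the knn() output above (rows = actual, columns = predicted)
lev      <- c("setosa", "versicolor", "virginica")
conf.knn <- matrix(c(11, 0, 0,
                      0, 9, 1,
                      0, 1, 8),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = lev, predicted = lev))

##-- accuracy = correct predictions / all predictions = (11 + 9 + 8) / 30
acc.knn <- sum(diag(conf.knn)) / sum(conf.knn)
acc.knn
```

The same calculation applies to the other two models once their predictions are mapped back to the three Species labels.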

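`stats::glm()`

The generalized linear model is an extension of the linear regression model that covers other response types, such as binary and count (Poisson) responses. Here the Species index (1-3) is treated as a count purely for illustration, which is why a rounded prediction can fall outside the three classes (see column 4 in the table below).

  ##-- casting factor into a class index (1-3)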
  iris.train$specIdx <- as.numeric(iris.train$Species)
  iris.test$specIdx  <- as.numeric(iris.test$Species)

  species.glm <- glm(specIdx~. -Species,data=iris.train,family ="poisson")
  
  iris.glm    <- round(predict(species.glm,newdata = iris.test,type = "response"))
  ftable(iris.test[,5],iris.glm)

##            iris.glm  1  2  3  4
##                                
## setosa              11  0  0  0
## versicolor           0 10  0  0
## virginica            0  0  8  1

`rpart::rpart()`

Recursive Partitioning and Regression Trees (`rpart`) is one way to build a decision tree using a greedy algorithm. The method requires pruning, guided by the complexity parameter `cp`, to avoid overfitting.

  require(rpart)
  iris.rpart <- rpart(Species~.,data=iris.train)  
  
  require(rpart.plot)
  prp(iris.rpart)

  species.rpart <- apply(predict(iris.rpart,newdata = iris.test),1,which.max)
  ftable(iris.test[,5],species.rpart)

##            species.rpart  1  2  3
##                                  
## setosa                   11  0  0
## versicolor                0 10  0
## virginica                 0  0  9

NOTE The detail and how to select a suitable model will be discussed in the next workshop.

Workshop 4: Data Exploration and Visualization of IRIS Data Set

oran.k@chula.ac.th

August 2021
