Chapter 3 Data Transformation

We will be using dplyr for this chapter only.

dplyr allows us to manipulate data sets by passing SQL like query.

We will only be focusing on 5 function in this package.

filter()
arrange()
select()
mutate()
summarise()

Additionally, these functions also works in conjunction with

group_by()
Data Pipelines ( %>% )

The sales dataset have been gotten from the open data website, https://data.gov.sg/.

Before we can start, we have to download and load the required packages as well as load the csv files.

3.1 Initialise your environment

Install the following package

# you only need to install it once
install.packages("dplyr")

Load the packages into RStudio

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

dplyr comes with a few loaded packages.

To check them out simply type the following command

data(package = "dplyr")

To show all the available function within dplyr,

ls("package:dplyr")

3.2 Import the Dataset

Before we can import the Data Sets, we have to ensure the the .csv file is the same project folder. we can first check if we are in the correct directory by using getwd()

# check if the dataset is in the same project folder
getwd()

If it is not in the same project folder, we can set it manually by passing the correct path into the function setwd().

setwd("       ")

Once these are settled, we can import the datasets using the read.csv() function.

# import the dataset
sales <- read.csv(file = "Properties and Facilities Sales.csv")

We can do some prelimanry check on the attributes of the datasets

class(sales)
dim(sales)
names(sales)
str(sales)
glimpse(sales) # dplyr method

As the table is large and hard to read, dplyr has a function called as_data_frame() that allow us to convert it to a tidier format.

sales <- as_data_frame(sales) # note that we have overide the old variable
sales

## # A tibble: 416 x 6
##     year      types                           property_type category
##    <int>     <fctr>                                  <fctr>   <fctr>
##  1  2006 Commercial                 Shops and Eating Houses     Sold
##  2  2006 Commercial                 Shops and Eating Houses   Rented
##  3  2006 Commercial                                 Offices     Sold
##  4  2006 Commercial                                 Offices   Rented
##  5  2006 Commercial                Supermarkets & Emporiums     Sold
##  6  2006 Commercial                Supermarkets & Emporiums   Rented
##  7  2006 Commercial Civil Defence Shelters (commercial use)     Sold
##  8  2006 Commercial Civil Defence Shelters (commercial use)   Rented
##  9  2006 Commercial                 Shops and Eating Houses     Sold
## 10  2007 Commercial                 Shops and Eating Houses   Rented
## # ... with 406 more rows, and 2 more variables: subcategory <fctr>,
## #   no_of_units <int>

We can check data type of the sales datasets using class() function

class(sales)

## [1] "tbl_df"     "tbl"        "data.frame"

Note the difference between as.data.frame() and as_data_frame()

3.3 Filter

filter() allows us to filter out specific rows based on some specified conditions

We can list out all the different type of properties under the column type

unique(sales["types"])

## # A tibble: 3 x 1
##         types
##        <fctr>
## 1  Commercial
## 2  Industrial
## 3 Residential

Now lets filter out all the Residential Properties

filter(sales, types == "Residential")

## Warning: package 'bindrcpp' was built under R version 3.3.3

## # A tibble: 202 x 6
##     year       types property_type category
##    <int>      <fctr>        <fctr>   <fctr>
##  1  2006 Residential           HDB     Sold
##  2  2006 Residential           HDB     Sold
##  3  2006 Residential           HDB     Sold
##  4  2006 Residential           HDB     Sold
##  5  2006 Residential           HDB     Sold
##  6  2006 Residential           HDB     Sold
##  7  2006 Residential           HDB     Sold
##  8  2006 Residential           HDB   Rented
##  9  2006 Residential           HDB   Rented
## 10  2006 Residential           HDB   Rented
## # ... with 192 more rows, and 2 more variables: subcategory <fctr>,
## #   no_of_units <int>

Filtering can also be done through logical operator like,

And – &
Or – |

Filter all types == “Residential”

filter(sales, types == "Residential" & category == "Sold")

## # A tibble: 101 x 6
##     year       types property_type category
##    <int>      <fctr>        <fctr>   <fctr>
##  1  2006 Residential           HDB     Sold
##  2  2006 Residential           HDB     Sold
##  3  2006 Residential           HDB     Sold
##  4  2006 Residential           HDB     Sold
##  5  2006 Residential           HDB     Sold
##  6  2006 Residential           HDB     Sold
##  7  2006 Residential           HDB     Sold
##  8  2006 Residential          DBSS     Sold
##  9  2006 Residential          DBSS     Sold
## 10  2006 Residential          DBSS     Sold
## # ... with 91 more rows, and 2 more variables: subcategory <fctr>,
## #   no_of_units <int>

Filter all types == “HDB”

filter(sales, property_type == "HDB" | property_type == "DBSS")

## # A tibble: 202 x 6
##     year       types property_type category
##    <int>      <fctr>        <fctr>   <fctr>
##  1  2006 Residential           HDB     Sold
##  2  2006 Residential           HDB     Sold
##  3  2006 Residential           HDB     Sold
##  4  2006 Residential           HDB     Sold
##  5  2006 Residential           HDB     Sold
##  6  2006 Residential           HDB     Sold
##  7  2006 Residential           HDB     Sold
##  8  2006 Residential           HDB   Rented
##  9  2006 Residential           HDB   Rented
## 10  2006 Residential           HDB   Rented
## # ... with 192 more rows, and 2 more variables: subcategory <fctr>,
## #   no_of_units <int>

Exercise

You can also use filter with >, <, >=, <=,

Filter Out all the HDB sold in the year 2009

filter(sales, year == "2009", property_type == "HDB")

Filter out all rented HDB units that are 5 Room Flats

filter(sales, property_type == "HDB", category == "Rented", subcategory == "5-room flats")

3.4 Arrange

arrange() allows us to arrange or sort the rows based on the columns that we select.

Lets arrange by property type.

arrange(sales, property_type)

## # A tibble: 416 x 6
##     year      types         property_type category subcategory no_of_units
##    <int>     <fctr>                <fctr>   <fctr>      <fctr>       <int>
##  1  2006 Industrial Canteens/Eating House     Sold          NA          NA
##  2  2006 Industrial Canteens/Eating House   Rented          NA           5
##  3  2007 Industrial Canteens/Eating House     Sold          NA          NA
##  4  2007 Industrial Canteens/Eating House   Rented          NA           1
##  5  2008 Industrial Canteens/Eating House     Sold          NA          NA
##  6  2008 Industrial Canteens/Eating House   Rented          NA           0
##  7  2009 Industrial Canteens/Eating House     Sold          NA          NA
##  8  2009 Industrial Canteens/Eating House   Rented          NA           0
##  9  2010 Industrial Canteens/Eating House     Sold          NA          NA
## 10  2010 Industrial Canteens/Eating House   Rented          NA           0
## # ... with 406 more rows

We can also arrange by multiple columns.

arrange(sales, property_type, category)

## # A tibble: 416 x 6
##     year      types         property_type category subcategory no_of_units
##    <int>     <fctr>                <fctr>   <fctr>      <fctr>       <int>
##  1  2006 Industrial Canteens/Eating House   Rented          NA           5
##  2  2007 Industrial Canteens/Eating House   Rented          NA           1
##  3  2008 Industrial Canteens/Eating House   Rented          NA           0
##  4  2009 Industrial Canteens/Eating House   Rented          NA           0
##  5  2010 Industrial Canteens/Eating House   Rented          NA           0
##  6  2011 Industrial Canteens/Eating House   Rented          NA           0
##  7  2012 Industrial Canteens/Eating House   Rented          NA           0
##  8  2013 Industrial Canteens/Eating House   Rented          NA           1
##  9  2014 Industrial Canteens/Eating House   Rented          NA           0
## 10  2015 Industrial Canteens/Eating House   Rented          NA           0
## # ... with 406 more rows

To sort in descending order, we can use the desc() function.

arrange(sales, desc(property_type))

## # A tibble: 416 x 6
##     year      types     property_type category subcategory no_of_units
##    <int>     <fctr>            <fctr>   <fctr>      <fctr>       <int>
##  1  2014 Industrial Wholesale Centres     Sold          NA          NA
##  2  2014 Industrial Wholesale Centres   Rented          NA           8
##  3  2015 Industrial Wholesale Centres     Sold          NA          NA
##  4  2015 Industrial Wholesale Centres   Rented          NA          15
##  5  2006 Industrial         Warehouse     Sold          NA          NA
##  6  2006 Industrial         Warehouse   Rented          NA           5
##  7  2007 Industrial         Warehouse     Sold          NA          NA
##  8  2007 Industrial         Warehouse   Rented          NA           8
##  9  2008 Industrial         Warehouse     Sold          NA          NA
## 10  2008 Industrial         Warehouse   Rented          NA           0
## # ... with 406 more rows

We can also sort in both ascending and descending order.

arrange(sales, desc(property_type), category)

## # A tibble: 416 x 6
##     year      types     property_type category subcategory no_of_units
##    <int>     <fctr>            <fctr>   <fctr>      <fctr>       <int>
##  1  2014 Industrial Wholesale Centres   Rented          NA           8
##  2  2015 Industrial Wholesale Centres   Rented          NA          15
##  3  2014 Industrial Wholesale Centres     Sold          NA          NA
##  4  2015 Industrial Wholesale Centres     Sold          NA          NA
##  5  2006 Industrial         Warehouse   Rented          NA           5
##  6  2007 Industrial         Warehouse   Rented          NA           8
##  7  2008 Industrial         Warehouse   Rented          NA           0
##  8  2009 Industrial         Warehouse   Rented          NA           1
##  9  2010 Industrial         Warehouse   Rented          NA           0
## 10  2011 Industrial         Warehouse   Rented          NA           0
## # ... with 406 more rows

Exercise

Sort in descending order the total number of units sold

arrange(sales, desc(no_of_units))

Sort in descending order the total number of units sold by year

arrange(sales, year, desc(no_of_units))

3.5 Select

select() allows us to select specific columns that we want to scope in.

Lets says we want to only scope into or select the column, no_of_units,

select(sales, no_of_units)

## # A tibble: 416 x 1
##    no_of_units
##          <int>
##  1          NA
##  2         287
##  3          NA
##  4         112
##  5          NA
##  6           6
##  7          NA
##  8          23
##  9          17
## 10         330
## # ... with 406 more rows

We are not confine to selecting a single column. to select multiple column, all we just need to do is specify all the column name that we want.

select(sales, year, property_type, no_of_units)

## # A tibble: 416 x 3
##     year                           property_type no_of_units
##    <int>                                  <fctr>       <int>
##  1  2006                 Shops and Eating Houses          NA
##  2  2006                 Shops and Eating Houses         287
##  3  2006                                 Offices          NA
##  4  2006                                 Offices         112
##  5  2006                Supermarkets & Emporiums          NA
##  6  2006                Supermarkets & Emporiums           6
##  7  2006 Civil Defence Shelters (commercial use)          NA
##  8  2006 Civil Defence Shelters (commercial use)          23
##  9  2006                 Shops and Eating Houses          17
## 10  2007                 Shops and Eating Houses         330
## # ... with 406 more rows

Alternatively, instead of specifying the column that we want, we can also select the columns by dropping the column the we want. we can do that by appending a negative sign, infront on the column that we wish to drop.

select(sales, -types, -category, -subcategory)

## # A tibble: 416 x 3
##     year                           property_type no_of_units
##    <int>                                  <fctr>       <int>
##  1  2006                 Shops and Eating Houses          NA
##  2  2006                 Shops and Eating Houses         287
##  3  2006                                 Offices          NA
##  4  2006                                 Offices         112
##  5  2006                Supermarkets & Emporiums          NA
##  6  2006                Supermarkets & Emporiums           6
##  7  2006 Civil Defence Shelters (commercial use)          NA
##  8  2006 Civil Defence Shelters (commercial use)          23
##  9  2006                 Shops and Eating Houses          17
## 10  2007                 Shops and Eating Houses         330
## # ... with 406 more rows

Alternative to arrange(), we can also re-arrange coloumns through the select function as well. we used the function `everything()’ to specify the remaining column that we have no selected.

select(sales, no_of_units, everything())

## # A tibble: 416 x 6
##    no_of_units  year      types                           property_type
##          <int> <int>     <fctr>                                  <fctr>
##  1          NA  2006 Commercial                 Shops and Eating Houses
##  2         287  2006 Commercial                 Shops and Eating Houses
##  3          NA  2006 Commercial                                 Offices
##  4         112  2006 Commercial                                 Offices
##  5          NA  2006 Commercial                Supermarkets & Emporiums
##  6           6  2006 Commercial                Supermarkets & Emporiums
##  7          NA  2006 Commercial Civil Defence Shelters (commercial use)
##  8          23  2006 Commercial Civil Defence Shelters (commercial use)
##  9          17  2006 Commercial                 Shops and Eating Houses
## 10         330  2007 Commercial                 Shops and Eating Houses
## # ... with 406 more rows, and 2 more variables: category <fctr>,
## #   subcategory <fctr>

3.6 Mutate

mutate() allows us to add or modify a new coloumn to any datasets. we are also able to use any mathematical function or any user-defined function within the mutate() function.

Lets start by adding a new column which calculate’s the total number of units sold for Residential types only.

To do this we have to first filter out all types that is “Residential”

residential <- filter(sales, types == "Residential")

Next we can add a new column by giving a name “total_units”.

mutate(residential, total_units = sum(no_of_units))

## # A tibble: 202 x 7
##     year       types property_type category
##    <int>      <fctr>        <fctr>   <fctr>
##  1  2006 Residential           HDB     Sold
##  2  2006 Residential           HDB     Sold
##  3  2006 Residential           HDB     Sold
##  4  2006 Residential           HDB     Sold
##  5  2006 Residential           HDB     Sold
##  6  2006 Residential           HDB     Sold
##  7  2006 Residential           HDB     Sold
##  8  2006 Residential           HDB   Rented
##  9  2006 Residential           HDB   Rented
## 10  2006 Residential           HDB   Rented
## # ... with 192 more rows, and 3 more variables: subcategory <fctr>,
## #   no_of_units <int>, total_units <int>

mutate(residential, total_units = sum(no_of_units, na.rm = TRUE))

## # A tibble: 202 x 7
##     year       types property_type category
##    <int>      <fctr>        <fctr>   <fctr>
##  1  2006 Residential           HDB     Sold
##  2  2006 Residential           HDB     Sold
##  3  2006 Residential           HDB     Sold
##  4  2006 Residential           HDB     Sold
##  5  2006 Residential           HDB     Sold
##  6  2006 Residential           HDB     Sold
##  7  2006 Residential           HDB     Sold
##  8  2006 Residential           HDB   Rented
##  9  2006 Residential           HDB   Rented
## 10  2006 Residential           HDB   Rented
## # ... with 192 more rows, and 3 more variables: subcategory <fctr>,
## #   no_of_units <int>, total_units <int>

Exercise

Add a column, “total_sold”, which calculate the total number of sold properties and a column, “total_rented”, which calculate the total number of rented properties.

mutate(sales, total_sold = sum(category == "Sold"),
              total_rented = sum(category == "Rented"))

3.7 Summarise

summarise() allows us to summarise a data based on specific function like finding the sum, mean, median, percentile etc.

Sum

summarise(sales, Sum = sum(no_of_units, na.rm = TRUE))

## # A tibble: 1 x 1
##      Sum
##    <int>
## 1 165807

Mean

summarise(sales, Mean = mean(no_of_units, na.rm = TRUE))

## # A tibble: 1 x 1
##       Mean
##      <dbl>
## 1 594.2903

Median

summarise(sales, median = median(no_of_units, na.rm = TRUE))

## # A tibble: 1 x 1
##   median
##    <int>
## 1     20

Quantile

summarise(sales, Quantile = quantile(no_of_units, 0.9, na.rm = TRUE))

## # A tibble: 1 x 1
##   Quantile
##      <dbl>
## 1   1704.8

Alternatively, we can put them all into a single function

summarise(sales, Sum = sum(no_of_units, na.rm = TRUE),
                 Mean = mean(no_of_units, na.rm = TRUE),
                 Median = median(no_of_units, na.rm = TRUE),
                 Quantile = quantile(no_of_units, 0.9, na.rm = TRUE))

## # A tibble: 1 x 4
##      Sum     Mean Median Quantile
##    <int>    <dbl>  <int>    <dbl>
## 1 165807 594.2903     20   1704.8

Summarise and Mutate by itself is not interesting. we have to pair it with group_by() to derive meaningful insights from the datasets.

3.8 Group By

group_by() allows us to group the data set by certain column. It is, however, better to pair it with summarise(), mutate() and filter().

Supposed we want to find the proportion of each property types to the total property, we 1st have to group them by the property type.

temp <- group_by(sales, property_type)
temp

## # A tibble: 416 x 6
## # Groups:   property_type [15]
##     year      types                           property_type category
##    <int>     <fctr>                                  <fctr>   <fctr>
##  1  2006 Commercial                 Shops and Eating Houses     Sold
##  2  2006 Commercial                 Shops and Eating Houses   Rented
##  3  2006 Commercial                                 Offices     Sold
##  4  2006 Commercial                                 Offices   Rented
##  5  2006 Commercial                Supermarkets & Emporiums     Sold
##  6  2006 Commercial                Supermarkets & Emporiums   Rented
##  7  2006 Commercial Civil Defence Shelters (commercial use)     Sold
##  8  2006 Commercial Civil Defence Shelters (commercial use)   Rented
##  9  2006 Commercial                 Shops and Eating Houses     Sold
## 10  2007 Commercial                 Shops and Eating Houses   Rented
## # ... with 406 more rows, and 2 more variables: subcategory <fctr>,
## #   no_of_units <int>

*Notice that group_by() doesnt do anything visually to the datasets.

Then, we summarise them by counting the total number for each property type.

temp <- summarise(temp, n = n())

Finally, we have to ungroup them before we can add a new column to sum the total property available.

temp <- ungroup(temp)
temp <- mutate(temp, total = sum(n),
                prop = (n/total)*100)
temp

## # A tibble: 15 x 4
##                              property_type     n total       prop
##                                     <fctr> <int> <int>      <dbl>
##  1                   Canteens/Eating House    20   416  4.8076923
##  2 Civil Defence Shelters (commercial use)    20   416  4.8076923
##  3                                    DBSS    60   416 14.4230769
##  4               Flatted/Ramp-up Factories    12   416  2.8846154
##  5                                     HDB   142   416 34.1346154
##  6                    Industrial Complexes     8   416  1.9230769
##  7                        Industrial Shops    20   416  4.8076923
##  8                                 Offices    20   416  4.8076923
##  9                     Prototype Factories    20   416  4.8076923
## 10                 Shops and Eating Houses    20   416  4.8076923
## 11                Supermarkets & Emporiums    20   416  4.8076923
## 12                       Terrace Workshops    20   416  4.8076923
## 13                                TOL Land    10   416  2.4038462
## 14                               Warehouse    20   416  4.8076923
## 15                       Wholesale Centres     4   416  0.9615385

ungroup() is a function that allows us to return to the original data with no particular restriction (ie. without any prior grouping)

3.9 Data Pipelines

In the previous example, we had to constantly assign the output to a new variable (temp) before we can chain multiple output together.

Lets put it all together to have a better view of the flow/process.

temp <- group_by(sales, property_type)
temp <- summarise(temp, n = n())
temp <- ungroup(temp)
temp <- mutate(temp, total = sum(n),
               prop = (n/total)*100)
temp

## # A tibble: 15 x 4
##                              property_type     n total       prop
##                                     <fctr> <int> <int>      <dbl>
##  1                   Canteens/Eating House    20   416  4.8076923
##  2 Civil Defence Shelters (commercial use)    20   416  4.8076923
##  3                                    DBSS    60   416 14.4230769
##  4               Flatted/Ramp-up Factories    12   416  2.8846154
##  5                                     HDB   142   416 34.1346154
##  6                    Industrial Complexes     8   416  1.9230769
##  7                        Industrial Shops    20   416  4.8076923
##  8                                 Offices    20   416  4.8076923
##  9                     Prototype Factories    20   416  4.8076923
## 10                 Shops and Eating Houses    20   416  4.8076923
## 11                Supermarkets & Emporiums    20   416  4.8076923
## 12                       Terrace Workshops    20   416  4.8076923
## 13                                TOL Land    10   416  2.4038462
## 14                               Warehouse    20   416  4.8076923
## 15                       Wholesale Centres     4   416  0.9615385

An easier Method would be to use the Pipe Operator, %>%.

temp_pipe <- sales %>% 
  group_by(property_type) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  mutate(total = sum(n), 
         prop = (n/total)*100)
temp_pipe

## # A tibble: 15 x 4
##                              property_type     n total       prop
##                                     <fctr> <int> <int>      <dbl>
##  1                   Canteens/Eating House    20   416  4.8076923
##  2 Civil Defence Shelters (commercial use)    20   416  4.8076923
##  3                                    DBSS    60   416 14.4230769
##  4               Flatted/Ramp-up Factories    12   416  2.8846154
##  5                                     HDB   142   416 34.1346154
##  6                    Industrial Complexes     8   416  1.9230769
##  7                        Industrial Shops    20   416  4.8076923
##  8                                 Offices    20   416  4.8076923
##  9                     Prototype Factories    20   416  4.8076923
## 10                 Shops and Eating Houses    20   416  4.8076923
## 11                Supermarkets & Emporiums    20   416  4.8076923
## 12                       Terrace Workshops    20   416  4.8076923
## 13                                TOL Land    10   416  2.4038462
## 14                               Warehouse    20   416  4.8076923
## 15                       Wholesale Centres     4   416  0.9615385

This allows us to better see the chaining of operation much easier from a top down approach as well as remove the hassle of constantly assigning to a variable.

Notice that the output for both temp and temp_pipe are the same. we can easily check it with the function identical()

identical(temp, temp_pipe)

## [1] TRUE

dplyr also provides us with a function count() which does the exact same thing as summarise(n = n()) in a much compact way.

temp_pipe <- sales %>% 
  group_by(property_type) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(total = sum(n), 
         prop = (n/total)*100)
temp_pipe

## # A tibble: 15 x 4
##                              property_type     n total       prop
##                                     <fctr> <int> <int>      <dbl>
##  1                   Canteens/Eating House    20   416  4.8076923
##  2 Civil Defence Shelters (commercial use)    20   416  4.8076923
##  3                                    DBSS    60   416 14.4230769
##  4               Flatted/Ramp-up Factories    12   416  2.8846154
##  5                                     HDB   142   416 34.1346154
##  6                    Industrial Complexes     8   416  1.9230769
##  7                        Industrial Shops    20   416  4.8076923
##  8                                 Offices    20   416  4.8076923
##  9                     Prototype Factories    20   416  4.8076923
## 10                 Shops and Eating Houses    20   416  4.8076923
## 11                Supermarkets & Emporiums    20   416  4.8076923
## 12                       Terrace Workshops    20   416  4.8076923
## 13                                TOL Land    10   416  2.4038462
## 14                               Warehouse    20   416  4.8076923
## 15                       Wholesale Centres     4   416  0.9615385

Exercise

dplyr offers a few inbuilt data sets to play around with. Try out the starwars dataset and see what kind of information we can derive from it.

starwars

## # A tibble: 87 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>