Chapter 3 Data Transformation
We will be using dplyr for this chapter only.
dplyr allows us to manipulate data sets by passing SQL like query.
We will only be focusing on 5 function in this package.
- filter()
- arrange()
- select()
- mutate()
- summarise()
Additionally, these functions also works in conjunction with
- group_by()
- Data Pipelines ( %>% )
The sales dataset have been gotten from the open data website, https://data.gov.sg/.
Before we can start, we have to download and load the required packages as well as load the csv files.
3.1 Initialise your environment
Install the following package
# you only need to install it once
install.packages("dplyr")
Load the packages into RStudio
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dplyr comes with a few loaded packages.
To check them out simply type the following command
data(package = "dplyr")
To show all the available function within dplyr,
ls("package:dplyr")
3.2 Import the Dataset
Before we can import the Data Sets, we have to ensure the the .csv file is the same project folder. we can first check if we are in the correct directory by using getwd()
# check if the dataset is in the same project folder
getwd()
If it is not in the same project folder, we can set it manually by passing the correct path into the function setwd()
.
setwd(" ")
Once these are settled, we can import the datasets using the read.csv()
function.
# import the dataset
sales <- read.csv(file = "Properties and Facilities Sales.csv")
We can do some prelimanry check on the attributes of the datasets
class(sales)
dim(sales)
names(sales)
str(sales)
glimpse(sales) # dplyr method
As the table is large and hard to read, dplyr has a function called as_data_frame()
that allow us to convert it to a tidier format.
sales <- as_data_frame(sales) # note that we have overide the old variable
sales
## # A tibble: 416 x 6
## year types property_type category
## <int> <fctr> <fctr> <fctr>
## 1 2006 Commercial Shops and Eating Houses Sold
## 2 2006 Commercial Shops and Eating Houses Rented
## 3 2006 Commercial Offices Sold
## 4 2006 Commercial Offices Rented
## 5 2006 Commercial Supermarkets & Emporiums Sold
## 6 2006 Commercial Supermarkets & Emporiums Rented
## 7 2006 Commercial Civil Defence Shelters (commercial use) Sold
## 8 2006 Commercial Civil Defence Shelters (commercial use) Rented
## 9 2006 Commercial Shops and Eating Houses Sold
## 10 2007 Commercial Shops and Eating Houses Rented
## # ... with 406 more rows, and 2 more variables: subcategory <fctr>,
## # no_of_units <int>
We can check data type of the sales datasets using class()
function
class(sales)
## [1] "tbl_df" "tbl" "data.frame"
Note the difference between as.data.frame() and as_data_frame()
3.3 Filter
filter()
allows us to filter out specific rows based on some specified conditions
We can list out all the different type of properties under the column type
unique(sales["types"])
## # A tibble: 3 x 1
## types
## <fctr>
## 1 Commercial
## 2 Industrial
## 3 Residential
Now lets filter out all the Residential Properties
filter(sales, types == "Residential")
## Warning: package 'bindrcpp' was built under R version 3.3.3
## # A tibble: 202 x 6
## year types property_type category
## <int> <fctr> <fctr> <fctr>
## 1 2006 Residential HDB Sold
## 2 2006 Residential HDB Sold
## 3 2006 Residential HDB Sold
## 4 2006 Residential HDB Sold
## 5 2006 Residential HDB Sold
## 6 2006 Residential HDB Sold
## 7 2006 Residential HDB Sold
## 8 2006 Residential HDB Rented
## 9 2006 Residential HDB Rented
## 10 2006 Residential HDB Rented
## # ... with 192 more rows, and 2 more variables: subcategory <fctr>,
## # no_of_units <int>
Filtering can also be done through logical operator like,
- And –
&
- Or –
|
Filter all types == “Residential”
filter(sales, types == "Residential" & category == "Sold")
## # A tibble: 101 x 6
## year types property_type category
## <int> <fctr> <fctr> <fctr>
## 1 2006 Residential HDB Sold
## 2 2006 Residential HDB Sold
## 3 2006 Residential HDB Sold
## 4 2006 Residential HDB Sold
## 5 2006 Residential HDB Sold
## 6 2006 Residential HDB Sold
## 7 2006 Residential HDB Sold
## 8 2006 Residential DBSS Sold
## 9 2006 Residential DBSS Sold
## 10 2006 Residential DBSS Sold
## # ... with 91 more rows, and 2 more variables: subcategory <fctr>,
## # no_of_units <int>
Filter all types == “HDB”
filter(sales, property_type == "HDB" | property_type == "DBSS")
## # A tibble: 202 x 6
## year types property_type category
## <int> <fctr> <fctr> <fctr>
## 1 2006 Residential HDB Sold
## 2 2006 Residential HDB Sold
## 3 2006 Residential HDB Sold
## 4 2006 Residential HDB Sold
## 5 2006 Residential HDB Sold
## 6 2006 Residential HDB Sold
## 7 2006 Residential HDB Sold
## 8 2006 Residential HDB Rented
## 9 2006 Residential HDB Rented
## 10 2006 Residential HDB Rented
## # ... with 192 more rows, and 2 more variables: subcategory <fctr>,
## # no_of_units <int>
Exercise
You can also use filter with >, <, >=, <=,
- Filter Out all the HDB sold in the year 2009
filter(sales, year == "2009", property_type == "HDB")
- Filter out all rented HDB units that are 5 Room Flats
filter(sales, property_type == "HDB", category == "Rented", subcategory == "5-room flats")
3.4 Arrange
arrange()
allows us to arrange or sort the rows based on the columns that we select.
Lets arrange by property type.
arrange(sales, property_type)
## # A tibble: 416 x 6
## year types property_type category subcategory no_of_units
## <int> <fctr> <fctr> <fctr> <fctr> <int>
## 1 2006 Industrial Canteens/Eating House Sold NA NA
## 2 2006 Industrial Canteens/Eating House Rented NA 5
## 3 2007 Industrial Canteens/Eating House Sold NA NA
## 4 2007 Industrial Canteens/Eating House Rented NA 1
## 5 2008 Industrial Canteens/Eating House Sold NA NA
## 6 2008 Industrial Canteens/Eating House Rented NA 0
## 7 2009 Industrial Canteens/Eating House Sold NA NA
## 8 2009 Industrial Canteens/Eating House Rented NA 0
## 9 2010 Industrial Canteens/Eating House Sold NA NA
## 10 2010 Industrial Canteens/Eating House Rented NA 0
## # ... with 406 more rows
We can also arrange by multiple columns.
arrange(sales, property_type, category)
## # A tibble: 416 x 6
## year types property_type category subcategory no_of_units
## <int> <fctr> <fctr> <fctr> <fctr> <int>
## 1 2006 Industrial Canteens/Eating House Rented NA 5
## 2 2007 Industrial Canteens/Eating House Rented NA 1
## 3 2008 Industrial Canteens/Eating House Rented NA 0
## 4 2009 Industrial Canteens/Eating House Rented NA 0
## 5 2010 Industrial Canteens/Eating House Rented NA 0
## 6 2011 Industrial Canteens/Eating House Rented NA 0
## 7 2012 Industrial Canteens/Eating House Rented NA 0
## 8 2013 Industrial Canteens/Eating House Rented NA 1
## 9 2014 Industrial Canteens/Eating House Rented NA 0
## 10 2015 Industrial Canteens/Eating House Rented NA 0
## # ... with 406 more rows
To sort in descending order, we can use the desc()
function.
arrange(sales, desc(property_type))
## # A tibble: 416 x 6
## year types property_type category subcategory no_of_units
## <int> <fctr> <fctr> <fctr> <fctr> <int>
## 1 2014 Industrial Wholesale Centres Sold NA NA
## 2 2014 Industrial Wholesale Centres Rented NA 8
## 3 2015 Industrial Wholesale Centres Sold NA NA
## 4 2015 Industrial Wholesale Centres Rented NA 15
## 5 2006 Industrial Warehouse Sold NA NA
## 6 2006 Industrial Warehouse Rented NA 5
## 7 2007 Industrial Warehouse Sold NA NA
## 8 2007 Industrial Warehouse Rented NA 8
## 9 2008 Industrial Warehouse Sold NA NA
## 10 2008 Industrial Warehouse Rented NA 0
## # ... with 406 more rows
We can also sort in both ascending and descending order.
arrange(sales, desc(property_type), category)
## # A tibble: 416 x 6
## year types property_type category subcategory no_of_units
## <int> <fctr> <fctr> <fctr> <fctr> <int>
## 1 2014 Industrial Wholesale Centres Rented NA 8
## 2 2015 Industrial Wholesale Centres Rented NA 15
## 3 2014 Industrial Wholesale Centres Sold NA NA
## 4 2015 Industrial Wholesale Centres Sold NA NA
## 5 2006 Industrial Warehouse Rented NA 5
## 6 2007 Industrial Warehouse Rented NA 8
## 7 2008 Industrial Warehouse Rented NA 0
## 8 2009 Industrial Warehouse Rented NA 1
## 9 2010 Industrial Warehouse Rented NA 0
## 10 2011 Industrial Warehouse Rented NA 0
## # ... with 406 more rows
Exercise
- Sort in descending order the total number of units sold
arrange(sales, desc(no_of_units))
- Sort in descending order the total number of units sold by year
arrange(sales, year, desc(no_of_units))
3.5 Select
select()
allows us to select specific columns that we want to scope in.
Lets says we want to only scope into or select the column, no_of_units
,
select(sales, no_of_units)
## # A tibble: 416 x 1
## no_of_units
## <int>
## 1 NA
## 2 287
## 3 NA
## 4 112
## 5 NA
## 6 6
## 7 NA
## 8 23
## 9 17
## 10 330
## # ... with 406 more rows
We are not confine to selecting a single column. to select multiple column, all we just need to do is specify all the column name that we want.
select(sales, year, property_type, no_of_units)
## # A tibble: 416 x 3
## year property_type no_of_units
## <int> <fctr> <int>
## 1 2006 Shops and Eating Houses NA
## 2 2006 Shops and Eating Houses 287
## 3 2006 Offices NA
## 4 2006 Offices 112
## 5 2006 Supermarkets & Emporiums NA
## 6 2006 Supermarkets & Emporiums 6
## 7 2006 Civil Defence Shelters (commercial use) NA
## 8 2006 Civil Defence Shelters (commercial use) 23
## 9 2006 Shops and Eating Houses 17
## 10 2007 Shops and Eating Houses 330
## # ... with 406 more rows
Alternatively, instead of specifying the column that we want, we can also select the columns by dropping the column the we want. we can do that by appending a negative sign, infront on the column that we wish to drop.
select(sales, -types, -category, -subcategory)
## # A tibble: 416 x 3
## year property_type no_of_units
## <int> <fctr> <int>
## 1 2006 Shops and Eating Houses NA
## 2 2006 Shops and Eating Houses 287
## 3 2006 Offices NA
## 4 2006 Offices 112
## 5 2006 Supermarkets & Emporiums NA
## 6 2006 Supermarkets & Emporiums 6
## 7 2006 Civil Defence Shelters (commercial use) NA
## 8 2006 Civil Defence Shelters (commercial use) 23
## 9 2006 Shops and Eating Houses 17
## 10 2007 Shops and Eating Houses 330
## # ... with 406 more rows
Alternative to arrange()
, we can also re-arrange coloumns through the select function as well. we used the function `everything()’ to specify the remaining column that we have no selected.
select(sales, no_of_units, everything())
## # A tibble: 416 x 6
## no_of_units year types property_type
## <int> <int> <fctr> <fctr>
## 1 NA 2006 Commercial Shops and Eating Houses
## 2 287 2006 Commercial Shops and Eating Houses
## 3 NA 2006 Commercial Offices
## 4 112 2006 Commercial Offices
## 5 NA 2006 Commercial Supermarkets & Emporiums
## 6 6 2006 Commercial Supermarkets & Emporiums
## 7 NA 2006 Commercial Civil Defence Shelters (commercial use)
## 8 23 2006 Commercial Civil Defence Shelters (commercial use)
## 9 17 2006 Commercial Shops and Eating Houses
## 10 330 2007 Commercial Shops and Eating Houses
## # ... with 406 more rows, and 2 more variables: category <fctr>,
## # subcategory <fctr>
3.6 Mutate
mutate()
allows us to add or modify a new coloumn to any datasets. we are also able to use any mathematical function or any user-defined function within the mutate()
function.
Lets start by adding a new column which calculate’s the total number of units sold for Residential types only.
To do this we have to first filter out all types that is “Residential”
residential <- filter(sales, types == "Residential")
Next we can add a new column by giving a name “total_units”.
mutate(residential, total_units = sum(no_of_units))
## # A tibble: 202 x 7
## year types property_type category
## <int> <fctr> <fctr> <fctr>
## 1 2006 Residential HDB Sold
## 2 2006 Residential HDB Sold
## 3 2006 Residential HDB Sold
## 4 2006 Residential HDB Sold
## 5 2006 Residential HDB Sold
## 6 2006 Residential HDB Sold
## 7 2006 Residential HDB Sold
## 8 2006 Residential HDB Rented
## 9 2006 Residential HDB Rented
## 10 2006 Residential HDB Rented
## # ... with 192 more rows, and 3 more variables: subcategory <fctr>,
## # no_of_units <int>, total_units <int>
mutate(residential, total_units = sum(no_of_units, na.rm = TRUE))
## # A tibble: 202 x 7
## year types property_type category
## <int> <fctr> <fctr> <fctr>
## 1 2006 Residential HDB Sold
## 2 2006 Residential HDB Sold
## 3 2006 Residential HDB Sold
## 4 2006 Residential HDB Sold
## 5 2006 Residential HDB Sold
## 6 2006 Residential HDB Sold
## 7 2006 Residential HDB Sold
## 8 2006 Residential HDB Rented
## 9 2006 Residential HDB Rented
## 10 2006 Residential HDB Rented
## # ... with 192 more rows, and 3 more variables: subcategory <fctr>,
## # no_of_units <int>, total_units <int>
Exercise
- Add a column, “total_sold”, which calculate the total number of sold properties and a column, “total_rented”, which calculate the total number of rented properties.
mutate(sales, total_sold = sum(category == "Sold"),
total_rented = sum(category == "Rented"))
3.7 Summarise
summarise()
allows us to summarise a data based on specific function like finding the sum, mean, median, percentile etc.
Sum
summarise(sales, Sum = sum(no_of_units, na.rm = TRUE))
## # A tibble: 1 x 1
## Sum
## <int>
## 1 165807
Mean
summarise(sales, Mean = mean(no_of_units, na.rm = TRUE))
## # A tibble: 1 x 1
## Mean
## <dbl>
## 1 594.2903
Median
summarise(sales, median = median(no_of_units, na.rm = TRUE))
## # A tibble: 1 x 1
## median
## <int>
## 1 20
Quantile
summarise(sales, Quantile = quantile(no_of_units, 0.9, na.rm = TRUE))
## # A tibble: 1 x 1
## Quantile
## <dbl>
## 1 1704.8
Alternatively, we can put them all into a single function
summarise(sales, Sum = sum(no_of_units, na.rm = TRUE),
Mean = mean(no_of_units, na.rm = TRUE),
Median = median(no_of_units, na.rm = TRUE),
Quantile = quantile(no_of_units, 0.9, na.rm = TRUE))
## # A tibble: 1 x 4
## Sum Mean Median Quantile
## <int> <dbl> <int> <dbl>
## 1 165807 594.2903 20 1704.8
Summarise and Mutate by itself is not interesting. we have to pair it with group_by()
to derive meaningful insights from the datasets.
3.8 Group By
group_by()
allows us to group the data set by certain column. It is, however, better to pair it with summarise()
, mutate()
and filter()
.
Supposed we want to find the proportion of each property types to the total property, we 1st have to group them by the property type.
temp <- group_by(sales, property_type)
temp
## # A tibble: 416 x 6
## # Groups: property_type [15]
## year types property_type category
## <int> <fctr> <fctr> <fctr>
## 1 2006 Commercial Shops and Eating Houses Sold
## 2 2006 Commercial Shops and Eating Houses Rented
## 3 2006 Commercial Offices Sold
## 4 2006 Commercial Offices Rented
## 5 2006 Commercial Supermarkets & Emporiums Sold
## 6 2006 Commercial Supermarkets & Emporiums Rented
## 7 2006 Commercial Civil Defence Shelters (commercial use) Sold
## 8 2006 Commercial Civil Defence Shelters (commercial use) Rented
## 9 2006 Commercial Shops and Eating Houses Sold
## 10 2007 Commercial Shops and Eating Houses Rented
## # ... with 406 more rows, and 2 more variables: subcategory <fctr>,
## # no_of_units <int>
*Notice that group_by()
doesnt do anything visually to the datasets.
Then, we summarise them by counting the total number for each property type.
temp <- summarise(temp, n = n())
Finally, we have to ungroup them before we can add a new column to sum the total property available.
temp <- ungroup(temp)
temp <- mutate(temp, total = sum(n),
prop = (n/total)*100)
temp
## # A tibble: 15 x 4
## property_type n total prop
## <fctr> <int> <int> <dbl>
## 1 Canteens/Eating House 20 416 4.8076923
## 2 Civil Defence Shelters (commercial use) 20 416 4.8076923
## 3 DBSS 60 416 14.4230769
## 4 Flatted/Ramp-up Factories 12 416 2.8846154
## 5 HDB 142 416 34.1346154
## 6 Industrial Complexes 8 416 1.9230769
## 7 Industrial Shops 20 416 4.8076923
## 8 Offices 20 416 4.8076923
## 9 Prototype Factories 20 416 4.8076923
## 10 Shops and Eating Houses 20 416 4.8076923
## 11 Supermarkets & Emporiums 20 416 4.8076923
## 12 Terrace Workshops 20 416 4.8076923
## 13 TOL Land 10 416 2.4038462
## 14 Warehouse 20 416 4.8076923
## 15 Wholesale Centres 4 416 0.9615385
ungroup()
is a function that allows us to return to the original data with no particular restriction (ie. without any prior grouping)
3.9 Data Pipelines
In the previous example, we had to constantly assign the output to a new variable (temp) before we can chain multiple output together.
Lets put it all together to have a better view of the flow/process.
temp <- group_by(sales, property_type)
temp <- summarise(temp, n = n())
temp <- ungroup(temp)
temp <- mutate(temp, total = sum(n),
prop = (n/total)*100)
temp
## # A tibble: 15 x 4
## property_type n total prop
## <fctr> <int> <int> <dbl>
## 1 Canteens/Eating House 20 416 4.8076923
## 2 Civil Defence Shelters (commercial use) 20 416 4.8076923
## 3 DBSS 60 416 14.4230769
## 4 Flatted/Ramp-up Factories 12 416 2.8846154
## 5 HDB 142 416 34.1346154
## 6 Industrial Complexes 8 416 1.9230769
## 7 Industrial Shops 20 416 4.8076923
## 8 Offices 20 416 4.8076923
## 9 Prototype Factories 20 416 4.8076923
## 10 Shops and Eating Houses 20 416 4.8076923
## 11 Supermarkets & Emporiums 20 416 4.8076923
## 12 Terrace Workshops 20 416 4.8076923
## 13 TOL Land 10 416 2.4038462
## 14 Warehouse 20 416 4.8076923
## 15 Wholesale Centres 4 416 0.9615385
An easier Method would be to use the Pipe Operator, %>%.
temp_pipe <- sales %>%
group_by(property_type) %>%
summarise(n = n()) %>%
ungroup() %>%
mutate(total = sum(n),
prop = (n/total)*100)
temp_pipe
## # A tibble: 15 x 4
## property_type n total prop
## <fctr> <int> <int> <dbl>
## 1 Canteens/Eating House 20 416 4.8076923
## 2 Civil Defence Shelters (commercial use) 20 416 4.8076923
## 3 DBSS 60 416 14.4230769
## 4 Flatted/Ramp-up Factories 12 416 2.8846154
## 5 HDB 142 416 34.1346154
## 6 Industrial Complexes 8 416 1.9230769
## 7 Industrial Shops 20 416 4.8076923
## 8 Offices 20 416 4.8076923
## 9 Prototype Factories 20 416 4.8076923
## 10 Shops and Eating Houses 20 416 4.8076923
## 11 Supermarkets & Emporiums 20 416 4.8076923
## 12 Terrace Workshops 20 416 4.8076923
## 13 TOL Land 10 416 2.4038462
## 14 Warehouse 20 416 4.8076923
## 15 Wholesale Centres 4 416 0.9615385
This allows us to better see the chaining of operation much easier from a top down approach as well as remove the hassle of constantly assigning to a variable.
Notice that the output for both temp and temp_pipe are the same. we can easily check it with the function identical()
identical(temp, temp_pipe)
## [1] TRUE
dplyr also provides us with a function count() which does the exact same thing as summarise(n = n())
in a much compact way.
temp_pipe <- sales %>%
group_by(property_type) %>%
count() %>%
ungroup() %>%
mutate(total = sum(n),
prop = (n/total)*100)
temp_pipe
## # A tibble: 15 x 4
## property_type n total prop
## <fctr> <int> <int> <dbl>
## 1 Canteens/Eating House 20 416 4.8076923
## 2 Civil Defence Shelters (commercial use) 20 416 4.8076923
## 3 DBSS 60 416 14.4230769
## 4 Flatted/Ramp-up Factories 12 416 2.8846154
## 5 HDB 142 416 34.1346154
## 6 Industrial Complexes 8 416 1.9230769
## 7 Industrial Shops 20 416 4.8076923
## 8 Offices 20 416 4.8076923
## 9 Prototype Factories 20 416 4.8076923
## 10 Shops and Eating Houses 20 416 4.8076923
## 11 Supermarkets & Emporiums 20 416 4.8076923
## 12 Terrace Workshops 20 416 4.8076923
## 13 TOL Land 10 416 2.4038462
## 14 Warehouse 20 416 4.8076923
## 15 Wholesale Centres 4 416 0.9615385
Exercise
- dplyr offers a few inbuilt data sets to play around with. Try out the starwars dataset and see what kind of information we can derive from it.
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color
## <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Skywalker 172 77 blond fair blue
## 2 C-3PO 167 75 <NA> gold yellow
## 3 R2-D2 96 32 <NA> white, blue red
## 4 Darth Vader 202 136 none white yellow
## 5 Leia Organa 150 49 brown light brown
## 6 Owen Lars 178 120 brown, grey light blue
## 7 Beru Whitesun lars 165 75 brown light blue
## 8 R5-D4 97 32 <NA> white, red red
## 9 Biggs Darklighter 183 84 black light brown
## 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## # gender <chr>, homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>