Loading, Querying And Manipulating Data In R
R is one of the best software facilities available for calculation, manipulations and graphical display of data available today. Quite similar to the S language, R provides the user with the following:
- Effective facility for data storage and handling
- A large collection of coherent and integrated tools for data analysis
- An operator suite that handles calculations in arrays and matrices
- Graphical capabilities for data display and analysis as a hard copy or on screen
- A simple but well developed programming language that is effective in handling loops, conditionals, input and output facilities with functions that are user defined.
When completing their data science certification, most of the future data analysts want to know the very basics of R which is Loading, Querying and Manipulation of data within R. Since R has become a popular platform for most data analysis by larger corporations, the demand for data analysts who are well versed and experienced in R has increased.
For those seeking ways to increase their knowledge, this article can help you understand all three main aspects of R fully.
Loading Data in R
Most people who will be working on R will be working and handling data in Excel workbook formats. This is where a person needs to know how to transfer that data onto R before you can start your data analysis and manipulation.
To do this, the most basic way is to use the “read.xls” function from the gdata package. The function analyzes the data from the excel spreadsheet and gives back a data frame (a list of vectors of equalt length).
The following will show the way to load data from a spreadsheet names “xyzdata.xls”. The method requires the system to have Perl runtime present on it.
> library(gdata) # load gdata package
> help(read.xls) # documentation
> xyzdata = read.xls("xyzdata.xls") # read from first sheet
There is an alternate method available for this too, where the analyst can use the loadworkbook function from the XLConnect package, but similarly as the previous method requires Perl runtime, this requires Jave to already be present on the system.
> library(XLConnect) # load XLConnect package
> wk = loadWorkbook("xyzdata.xls")
> df = readWorksheet(wk, sheet="Sheet1")
Querying data in R
There are a lot of ways data can be queried in R, which you will learn about when completing the data analytics certification, in this article we will be discussing the most basic and easiest ways to do it.
The three ways are:
- Using DBI
- Using dplyr suntax
Data can be queried in R through DBI using the function, dbGetQuery ( ). All you have to do is paste your SQL code in the function as a quoted string. This method is also known as the pass through SQL code, and is one of the simplest ways to Query data in R. The use should however be careful to escape the code as required.
select "month_idx", "year", "month",
sum(case when "term_deposit" = \'yes\' then 1.0 else 0.0 end) as subscribe,
count(*) as total
group by "month_idx", "year", "month"
Using dplyr syntax
Code can be written in the dplyr syntax, and it will translate the code into SQL. So why prefer this over DBI? Well here are some advantages you get for using dplyr syntax:
- The language for both Ro objects and database tables can be kept consistent
- Dplyr uses lazy evaluation
- It is much easier to read and you can check translation with show_qurey ( ) function.
You can use dplyr syntax in the following way:
q1 <- tbl(con, "bank") %>%
group_by(month_idx, year, month) %>%
subscribe = sum(ifelse(term_deposit == "yes", 1, 0)),
total = n())
SELECT "month_idx", "year", "month", SUM(CASE WHEN ("term_deposit" = 'yes') THEN (1.0) ELSE (0.0) END) AS "subscribe", COUNT(*) AS "total"
GROUP BY "month_idx", "year", "month"
Manipulating Data in R
Manipulation of data in R is not something you can do with a sure fire way, there are different ways you can achieve your goal, given that the data is understood and necessary actions have been taken to end the alteration.
Here are some of the common and simple ways people manipulate data in R:
- The simplest way is to use the inbuilt R functions for the task, but this is a time consuming process so most analysts look towards alternatives. This method is mostly used by beginners learning to use R.
- Packages can be used for data manipulation. The Comprehensive R Archive Network, or CRAn for short, has developed more than 7000 packages to aid analysts for their data manipulations tasks. These packages act as templates for you to use in your own data manipulation and when applied, considerably reduces the effort of the analyst in terms of writing the whole code.
- ML algorithms are the fastest way for data manipulations. Now it may as well be the fastest, but it does require a much better understanding of R and the data itself.
A data science certification can help you understand these tree-based boosting algorithms to help you sort your data out.