Manipulating data with R

Published 08-02-2011 00:00:00

Getting help about a command

help(merge)

merge                   package:base                   R Documentation

Merge Two Data Frames

Description:

     Merge two data frames by common columns or row names, or do other
     versions of database _join_ operations.

Usage:

     merge(x, y, ...)
     
     ## Default S3 method:
     merge(x, y, ...)
[...]

Variable name collisions


To avoid variable name collisions, give your variables a common prefix that acts like a namespace. For example:

my.a <- 1
my.b <- "test"
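As a minimal sketch of why a prefix helps: R ships with short built-in names such as T (an alias for TRUE) that are ordinary variables, so assigning to one of them silently changes the meaning of later code. The variable names below are illustrative only.

```r
# T is a built-in alias for TRUE, but it can be overwritten like any variable.
T <- 0                  # e.g. a "time" variable chosen without thinking
masked <- isTRUE(T)     # FALSE: T no longer means TRUE
rm(T)                   # remove our copy so the built-in T is visible again

# With a prefix, the built-in name is left alone.
my.T <- 0
unmasked <- isTRUE(T)   # TRUE
```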

Installing third-party libraries


To install packages locally (in your HOME directory):

install.packages("psych")
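In a script you may not want to reinstall the package on every run. One possible approach, sketched here with a hypothetical helper named ensure_package, is to install only when the package cannot be found:

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs network access and a configured CRAN mirror
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers an installation.
ok <- ensure_package("stats")
# For this article you would call ensure_package("psych") instead.
```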

Using a specific library


The psych package provides functions for describing sample data. At the moment I only use its describe function.

> library(psych)
> describe(my.data)

Read a CSV file into a variable


A CSV file example (person.csv):

"name","age"
John,25
Jack,50
Paul,12

Read it into a variable called my.person:

> my.person <- read.csv(file="person.csv", header=TRUE, sep=",")

We can access each column with $column_name:

> my.person$name
[1] John Jack Paul
Levels: Jack John Paul
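The Levels: line shows that name was read as a factor. That was the default when this was written. Since R 4.0.0, read.csv keeps strings as character vectors unless stringsAsFactors=TRUE is passed. A self-contained sketch of both behaviours, using the text argument in place of a file so that it can be pasted directly:

```r
csv_text <- '"name","age"
John,25
Jack,50
Paul,12'

# text= lets read.csv parse a string instead of a file on disk.
as_chr <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

class(as_chr$name)    # "character"
class(as_fct$name)    # "factor"
levels(as_fct$name)   # "Jack" "John" "Paul"

# $name and [["name"]] return the same column.
same_column <- identical(as_chr$name, as_chr[["name"]])
```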

Useful information about the data (summary):

> summary(my.person)
   name        age      
 Jack:1   Min.   :12.0  
 John:1   1st Qu.:18.5  
 Paul:1   Median :25.0  
          Mean   :29.0  
          3rd Qu.:37.5  
          Max.   :50.0  

Filter some data (help("["))


Keep only the people older than 20:

> my.older <- my.person[my.person$age > 20, c("name", "age")]
> my.older
  name age
1 John  25
2 Jack  50

This can be written more simply by leaving the column index empty, which keeps all columns (do not forget the trailing comma):

> my.older <- my.person[my.person$age > 20,]
> my.older
  name age
1 John  25
2 Jack  50
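Base R also offers subset(), which evaluates the condition inside the data frame so the my.person$ prefix is not needed. A small sketch, with the data frame built inline rather than read from person.csv:

```r
my.person <- data.frame(name = c("John", "Jack", "Paul"),
                        age  = c(25, 50, 12))

# Bracket form: logical row index, empty column index.
older_idx <- my.person[my.person$age > 20, ]

# subset() form: column names can be used directly in the condition.
older_subset <- subset(my.person, age > 20)
```

The bracket form is usually the safer choice inside functions, while subset() is convenient for interactive use.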

Merging two data frames (merge)


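Merging is like doing a join in SQL. Given a second file:

$ cat country.csv
"First name","country"
Paul,France
Jack,"United States"
John,"Spain"

Read it (read.csv turns the "First name" header into the column name First.name):

> my.country <- read.csv(file="country.csv", header=TRUE, sep=",")

Merge the two data frames on the person's name:

> my.person_country <- merge(my.person, my.country, by.x="name", by.y="First.name")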

> my.person_country
  name age       country
1 Jack  50 United States
2 John  25         Spain
3 Paul  12        France
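By default merge keeps only the keys found in both data frames, like an SQL INNER JOIN. The sketch below uses a hypothetical variation of the data, in which Paul has no country entry, to show how all.x=TRUE gives the equivalent of a LEFT JOIN:

```r
my.person  <- data.frame(name = c("John", "Jack", "Paul"),
                         age  = c(25, 50, 12))
# Hypothetical variation: Paul is missing from the country table.
my.country <- data.frame(First.name = c("Jack", "John"),
                         country    = c("United States", "Spain"))

# Default: only names present in both tables (inner join).
inner <- merge(my.person, my.country, by.x = "name", by.y = "First.name")

# all.x = TRUE keeps every row of the first table (left join);
# the missing country is filled with NA.
left <- merge(my.person, my.country, by.x = "name", by.y = "First.name",
              all.x = TRUE)
```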

Plotting a simple coordinate file


$ cat simple_plot.txt
x,y
1,1
10,-3
25,6
8,14

Read the file:

> my.simple <- read.table('simple_plot.txt', header=TRUE, sep=',')

Plot the file:

> plot(my.simple$x, my.simple$y)
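When R runs without a display, or when the figure should be kept, the plot can be written to a file instead. A minimal sketch, assuming a PDF in the session's temporary directory is acceptable (the output path and the labels are illustrative):

```r
my.simple <- data.frame(x = c(1, 10, 25, 8), y = c(1, -3, 6, 14))

# Write the figure to a PDF file instead of opening a window.
out_file <- file.path(tempdir(), "simple_plot.pdf")   # illustrative path
pdf(out_file)
plot(my.simple$x, my.simple$y,
     type = "p",                  # points only; "l" draws lines, "b" both
     xlab = "x", ylab = "y",
     main = "simple_plot.txt")
dev.off()                         # close the device so the file is complete
```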

Plotting the number of occurrences


An example of durations (for example a session duration):

$ cat durations.csv
duration
2
15
2
2
2
14
18
2
19
15
19
2
19
50

Read the file:

> my.duration <- read.csv(file="durations.csv", header=TRUE)
> summary(my.duration$duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00    2.00   14.50   12.93   18.75   50.00

Display the histogram with about 5 bins (breaks=5 is only a suggestion; R picks round bin edges):

> hist(my.duration$duration, breaks=5)
  • 6 durations are between 0 and 10 seconds,
  • 7 between 10 and 20,
  • 1 between 40 and 50.
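The counts listed above can also be obtained without drawing anything. A small sketch, with the durations typed in directly rather than read from durations.csv:

```r
durations <- c(2, 15, 2, 2, 2, 14, 18, 2, 19, 15, 19, 2, 19, 50)

# plot = FALSE returns the histogram data without drawing anything.
h <- hist(durations, breaks = 5, plot = FALSE)
h$breaks   # bin edges chosen by R: 0 10 20 30 40 50
h$counts   # sessions per bin: 6 7 0 0 1
```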

Percentiles (help(quantile))


With the previous data:

> quantile(my.duration$duration)
   0%   25%   50%   75%  100%
 2.00  2.00 14.50 18.75 50.00
  • 0% of sessions are below 2 seconds
  • 25% are below 2 seconds
  • 50% are below 14.5 seconds
  • 75% are below 18.75 seconds
  • 100% are below 50 seconds

This can also be read as: “If you pick a session at random, there is roughly a 75% chance that its duration is below 18.75 seconds”.

You can specify your own percentiles with the probs parameter:

> quantile(my.duration$duration, probs=c(0.5, 0.9, 0.95, 0.99))
  50%   90%   95%   99% 
14.50 19.00 29.85 45.97
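Values such as 29.85 do not appear in the data because quantile interpolates between neighbouring observations. The sketch below reproduces the default method (type 7) by hand. The helper name quantile_by_hand is hypothetical, and it handles a single probability only.

```r
durations <- c(2, 15, 2, 2, 2, 14, 18, 2, 19, 15, 19, 2, 19, 50)

# Hypothetical helper reproducing quantile()'s default method (type 7).
quantile_by_hand <- function(x, p) {
  x <- sort(x)
  h <- 1 + (length(x) - 1) * p     # fractional position in the sorted data
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

# Position 13.35 falls between the 13th value (19) and the 14th (50).
by_hand  <- quantile_by_hand(durations, 0.95)
built_in <- quantile(durations, probs = 0.95, names = FALSE)
agree    <- isTRUE(all.equal(by_hand, built_in))
```

With 14 observations the 95th percentile sits 35% of the way from 19 to 50, which gives 19 + 0.35 * 31 = 29.85.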