Manipulating data with R
Getting help about a command
help(merge)
merge package:base R Documentation
Merge Two Data Frames
Description:
Merge two data frames by common columns or row names, or do other
versions of database _join_ operations.
Usage:
merge(x, y, ...)
## Default S3 method:
merge(x, y, ...)
[...]
Variable name collisions
To avoid name collisions with existing objects, give your variables a namespace-like prefix (here, my.). For example:
my.a <- 1
my.b <- "test"
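A side benefit of a shared prefix is that such variables are easy to list or clean up together. The following is a minimal sketch, assuming every variable of interest starts with my. (the name prefixed and the cleanup step are illustrative, not part of the original notes):

```r
# Variables following the my. prefix convention
my.a <- 1
my.b <- "test"

# List every object in the current environment whose name starts with "my."
prefixed <- ls(pattern = "^my\\.")
print(prefixed)

# Remove all of them in one call
rm(list = prefixed)
```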
Installing third-party libraries
To install packages locally (in your HOME directory):
install.packages("psych")
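In a script that may run more than once, it can be convenient to install a package only when it is missing. A possible sketch, where my.ensure is a hypothetical helper name and not a function provided by R:

```r
# Hypothetical helper: install a package only if it cannot be loaded
my.ensure <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is available now
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing gets installed here
my.ok <- my.ensure("stats")
print(my.ok)
```

Calling my.ensure("psych") would trigger the same install.packages call as above, but only on the first run.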
Using a specific library
The psych package provides functions for describing sample data.
At the moment I only use the describe function.
> library(psych)
> describe(my.data)
Read a CSV file into a variable
A CSV file example (person.csv):
"name","age"
John,25
Jack,50
Paul,12
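Read it into a variable called my.person (since R 4.0, stringsAsFactors=TRUE is needed to get the factor output shown below):
> my.person <- read.csv(file="person.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)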
We can access each column with $column_name:
> my.person$name
[1] John Jack Paul
Levels: Jack John Paul
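Besides $, square brackets give access to rows and individual cells. A short sketch, assuming the same three-person data; it is read from an inline string here so that the example does not depend on person.csv being present:

```r
# Same content as person.csv, supplied inline through the text argument
my.person <- read.csv(text = '"name","age"
John,25
Jack,50
Paul,12', header = TRUE, stringsAsFactors = TRUE)

str(my.person)                   # column names and types
my.ages  <- my.person$age        # one column, as a vector
my.first <- my.person[1, ]       # one row, as a one-row data frame
my.cell  <- my.person[2, "age"]  # a single cell: row 2, column "age"
```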
Useful information about the data (summary):
> summary(my.person)
   name        age
 Jack:1   Min.   :12.0
 John:1   1st Qu.:18.5
 Paul:1   Median :25.0
          Mean   :29.0
          3rd Qu.:37.5
          Max.   :50.0
Filter some data (help("["))
Keep only people older than 20:
> my.older <- my.person[my.person$age > 20, c("name", "age")]
> my.older
  name age
1 John  25
2 Jack  50
This can be written more simply (do not forget the trailing comma, which keeps all columns):
> my.older <- my.person[my.person$age > 20,]
> my.older
  name age
1 John  25
2 Jack  50
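The same bracket syntax extends to combined conditions and column selection, and subset() offers an alternative spelling of the filter above. A minimal sketch, assuming the three-person data from person.csv (rebuilt inline here; my.middle and my.names are illustrative names):

```r
my.person <- data.frame(name = c("John", "Jack", "Paul"),
                        age  = c(25, 50, 12))

# Same filter as above, written with subset()
my.older <- subset(my.person, age > 20)

# Combine conditions with & (and) or | (or)
my.middle <- my.person[my.person$age > 20 & my.person$age < 40, ]

# Filter rows and keep a single column at the same time
my.names <- my.person[my.person$age > 20, "name"]
```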
Merging two data frames (merge)
Merging is like doing a join in SQL. Given a second CSV file (country.csv):
"First name","country"
Paul,France
Jack,"United States"
John,"Spain"
> my.country <- read.csv(file="country.csv", header=TRUE, sep=",")
> my.person_country <- merge(my.person, my.country, by.x="name", by.y="First.name")
> my.person_country
  name age       country
1 Jack  50 United States
2 John  25         Spain
3 Paul  12        France
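By default merge behaves like an SQL inner join: rows without a match on either side are dropped. The all, all.x and all.y arguments give the outer variants. A sketch with a deliberately mismatched country table (the Anna/Italy row is hypothetical, and Paul is left out to show the effect):

```r
my.person <- data.frame(name = c("John", "Jack", "Paul"),
                        age  = c(25, 50, 12))
# Hypothetical data: Paul has no country, Anna has no person record
my.country <- data.frame(First.name = c("Jack", "John", "Anna"),
                         country    = c("United States", "Spain", "Italy"))

# Inner join (default): only names present in both tables
my.inner <- merge(my.person, my.country, by.x = "name", by.y = "First.name")

# Left join: keep every person, country is NA when unknown
my.left <- merge(my.person, my.country, by.x = "name", by.y = "First.name",
                 all.x = TRUE)

# Full outer join: keep every row from both tables
my.full <- merge(my.person, my.country, by.x = "name", by.y = "First.name",
                 all = TRUE)
```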
Plotting a simple coordinate file
$ cat simple_plot.txt
x,y
1,1
10,-3
25,6
8,14
Read the file:
> my.simple <- read.table('simple_plot.txt', header=TRUE, sep=',')
Plot the file:
> plot(my.simple$x, my.simple$y)
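When R runs without a display, or when the plot should be kept, the same call can be wrapped in a graphics device. A sketch under a few assumptions: the output name simple_plot.pdf is hypothetical, the points are sorted by x so that the connecting line runs left to right, and the data is supplied inline instead of being read from simple_plot.txt:

```r
my.simple <- read.table(text = "x,y
1,1
10,-3
25,6
8,14", header = TRUE, sep = ",")

# Sort by x so that type = "b" joins the points from left to right
my.sorted <- my.simple[order(my.simple$x), ]

pdf("simple_plot.pdf")  # open a PDF device (hypothetical file name)
plot(my.sorted$x, my.sorted$y, type = "b", xlab = "x", ylab = "y")
dev.off()               # close the device so the file is written
```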
Plotting the number of occurrences
An example file of durations (for example, session durations in seconds):
$ cat durations.csv
duration
2
15
2
2
2
14
18
2
19
15
19
2
19
50
Read the file:
> my.duration <- read.csv(file="durations.csv", header=TRUE)
> summary(my.duration$duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00    2.00   14.50   12.93   18.75   50.00
Display the histogram, split into about 5 bins (breaks=5):
> hist(my.duration$duration, breaks=5)
- 6 durations are between 0 and 10 seconds,
- 7 between 10 and 20,
- 1 between 40 and 50.
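The counts read off the plot can also be obtained as numbers. A small sketch, assuming the same 14 durations (typed inline here instead of read from durations.csv); hist with plot=FALSE returns the bins without drawing, and table gives the exact number of occurrences of each value:

```r
my.durations <- c(2, 15, 2, 2, 2, 14, 18, 2, 19, 15, 19, 2, 19, 50)

# breaks = 5 is only a suggestion; R picks "pretty" bin boundaries
my.hist <- hist(my.durations, breaks = 5, plot = FALSE)
print(my.hist$breaks)  # bin boundaries
print(my.hist$counts)  # number of durations in each bin

# Exact number of occurrences of each distinct duration
my.counts <- table(my.durations)
print(my.counts)
```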
Percentiles (help(quantile))
With the previous data:
> quantile(my.duration$duration)
   0%   25%   50%   75%  100%
 2.00  2.00 14.50 18.75 50.00
- 0%: no session is shorter than 2 seconds (the minimum)
- 25% of sessions last 2 seconds or less
- 50% last 14.5 seconds or less
- 75% last 18.75 seconds or less
- 100% last 50 seconds or less (the maximum)
This can also be read as: “If you pick a session at random, there is roughly a 75% chance that its duration is below 18.75 seconds”.
You can specify your own percentiles with the probs parameter:
> quantile(my.duration$duration, probs=c(0.5, 0.9, 0.95, 0.99))
  50%   90%   95%   99%
14.50 19.00 29.85 45.97
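The reverse question, "what share of sessions is at or below a given duration?", can be answered with ecdf. A sketch assuming the same 14 durations typed inline; because quantile interpolates between observations by default, the share found on such a small sample is close to, but not exactly, the requested percentile:

```r
my.durations <- c(2, 15, 2, 2, 2, 14, 18, 2, 19, 15, 19, 2, 19, 50)

# ecdf() returns a function giving the fraction of values <= its argument
my.ecdf <- ecdf(my.durations)
my.share <- my.ecdf(18.75)
print(my.share)  # 10 of the 14 sessions, about 0.71

# The same share computed by hand
my.share.manual <- mean(my.durations <= 18.75)

# A single percentile as a plain number (names = FALSE drops the "90%" label)
my.p90 <- quantile(my.durations, probs = 0.9, names = FALSE)
```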