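Preface

All the same, I ended up working through R systematically from start to finish. In practice it feels very similar to Python's pandas package; it seems all statistical languages belong to one family.

Video tutorial link

Basic data structures

Vectors

Basic concepts

Numeric
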
> x <- c(1, 2, 3, 4, 5)

Character

> y <- c("one", "two", "three")

Logical (all uppercase)

> z <- c (T, F, T)
> z <- c (TRUE, FALSE, T)

Arithmetic sequences

> c(1:10)
[1] 1 2 3 4 5 6 7 8 9 10

seq() function

Generates an arithmetic sequence: from and to set the start and end points, by sets the step, and length.out sets the number of elements.

> seq(from = 1, to = 10, by = 5)
[1] 1 6
> seq(from = 1,to =10 , length.out=10)
[1] 1 2 3 4 5 6 7 8 9 10
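
Two close relatives of seq(), namely seq_len() and seq_along(), are worth knowing at this point. The sketch below is my own illustration rather than part of the tutorial (the names n, bad, good, along and evens are made up); it shows why they tend to be safer than 1:n when n may be zero.

```r
n <- 0

bad  <- 1:n          # counts down: gives 1 0, not an empty vector
good <- seq_len(n)   # integer(0), so a loop over it runs zero times

# seq_along() gives the index positions of an existing vector
along <- seq_along(c("a", "b", "c"))

# seq() with an explicit step
evens <- seq(from = 2, to = 10, by = 2)

print(bad)
print(good)
print(along)
print(evens)
```
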

rep() function

Builds a vector by repetition; the repeated item can be a single value or a whole vector.

> rep(2, 5)
[1] 2 2 2 2 2
> x <- c(1, 2, 3, 4, 5)
> rep(x, 3)
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
> rep(x, each = 5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
> x <- c(1, 2, 3, 4, 5)
> rep(x, each = 3, times = 2)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 1 1 1 2 2 2 3 3 3 4 4
[27] 4 5 5 5

Type coercion

When numeric and character values are combined in one vector, the numeric values are all coerced to character.

> a <- c(1, 2, "three")
> a
[1] "1" "2" "three"
> mode(a)
[1] "character"
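
The coercion above follows a fixed order (logical, then numeric, then character), and it can be reversed explicitly with the as.*() functions. A minimal sketch, with variable names of my own choosing:

```r
# Mixed types in c() are promoted to the most general type present
mixed_num <- c(TRUE, 2, 3.5)      # logical is promoted to numeric
mixed_chr <- c(TRUE, 2, "three")  # everything is promoted to character

# Explicit conversion back; strings that are not numbers become NA
# (R also emits a warning, suppressed here to keep the output clean)
back <- suppressWarnings(as.numeric(c("1", "2", "three")))

print(mixed_num)
print(mixed_chr)
print(back)
```
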

Vector indexing

Integer indexing

> x <- c(1:10)
> x
[1] 1 2 3 4 5 6 7 8 9 10
> length(x)
[1] 10
> x[1]
[1] 1
> x[0]
integer(0)
> x[-5]
[1] 1 2 3 4 6 7 8 9 10
> x[c(2:8)]
[1] 2 3 4 5 6 7 8
> x[c(1, 5, 9)]
[1] 1 5 9

Logical vector indexing

Numeric vectors

> x <- c(1:10)
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x[c(T, F, T, T, F, F, T, T, T, F)]
[1] 1 3 4 7 8 9
> x[c(T)]
[1] 1 2 3 4 5 6 7 8 9 10
> x[c(F)]
integer(0)
> x[c(T, F)]
[1] 1 3 5 7 9
> x[c(T, F, F)]
[1] 1 4 7 10
> x[c(T, F, T, T, F, F, T, T, T, F, T)]
[1] 1 3 4 7 8 9 NA
> x[x > 3 & x < 7]
[1] 4 5 6
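
The last expression works because the comparison itself produces a logical vector that is then used as the index. A small sketch that makes the intermediate step visible (mask, positions, selected and n_selected are illustrative names):

```r
x <- 1:10

mask <- x > 3 & x < 7      # logical vector with the same length as x
positions <- which(mask)   # integer positions where mask is TRUE
selected <- x[mask]        # same result as x[x > 3 & x < 7]
n_selected <- sum(mask)    # TRUE counts as 1, so this counts the matches

print(mask)
print(positions)
print(selected)
print(n_selected)
```
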

Character vectors

> x <- c("one", "two", "three", "four", "five")
> "one" %in% x
[1] TRUE
> x["one" %in% x]
[1] "one" "two" "three" "four" "five"
> x %in% c("one", "three")
[1] TRUE FALSE TRUE FALSE FALSE
> x[x %in% c("one", "three")]
[1] "one" "three"
> k <- x %in% c("one", "three")
> x[k]
[1] "one" "three"

Indexing by name

> x <- c(1:5)
> names(x)
NULL
> names(x) <- c("one" , "two", "three", "four" , "five")
> names(x)
[1] "one" "two" "three" "four" "five"
> x
one two three four five
1 2 3 4 5
> x["two"]
two
2

Modifying vector values

Adding values

> x <- c(1:5)
> x
[1] 1 2 3 4 5
> x[6] = 6
> x
[1] 1 2 3 4 5 6
> x[c(7, 8, 9)] <- c(7, 8, 9)
> x
[1] 1 2 3 4 5 6 7 8 9
> x[20] = 20
> x
[1] 1 2 3 4 5 6 7 8 9 NA NA NA NA NA NA NA NA
[18] NA NA 20

append() function

> append(x = x, values = 10, after = 9)
[1] 1 2 3 4 5 6 7 8 9 10 NA NA NA NA NA NA NA
[18] NA NA NA 20
> append(x = x,values = -99,after = 0)
[1] -99 1 2 3 4 5 6 7 8 9 NA NA NA
[14] NA NA NA NA NA NA NA 20

Deleting a vector

> rm(x)
> x
Error: object 'x' not found

Deleting values

> x <- c(1:10)
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x <- x[-4]
> x
[1] 1 2 3 5 6 7 8 9 10
> x <- x[-c(2:4)]
> x
[1] 1 6 7 8 9 10

Modifying values by index

> x <- c(1:5)
> names(x) <- c("one" , "two", "three", "four" , "five")
> x
one two three four five
1 2 3 4 5
> x["four"] <- -99
> x
one two three four five
1 2 3 -99 5

A string should not be assigned to an element of a numeric vector: doing so coerces the whole vector to character.

> x <- c(1:5)
> x[3] <- -99
> x
[1] 1 2 -99 4 5
> x[4] <- "four"
> x
[1] "1" "2" "-99" "four" "5"

Vector operations

Arithmetic

> x <- 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x + 1
[1] 2 3 4 5 6 7 8 9 10 11
> x - 3
[1] -2 -1 0 1 2 3 4 5 6 7
> x <- x + 1
> x
[1] 2 3 4 5 6 7 8 9 10 11
> y <- seq(2, 20, length.out = 10)
> y
[1] 2 4 6 8 10 12 14 16 18 20
> x - y
[1] 0 -1 -2 -3 -4 -5 -6 -7 -8 -9
> x * y
[1] 4 12 24 40 60 84 112 144 180 220

Exponentiation: ** (equivalent to ^); remainder: %%; integer division: %/%

> x
[1] 2 3 4 5 6 7 8 9 10 11
> y
[1] 2 4 6 8 10 12 14 16 18 20
> x ** y
[1] 4.000000e+00 8.100000e+01 4.096000e+03 3.906250e+05 6.046618e+07 1.384129e+10 4.398047e+12
[8] 1.853020e+15 1.000000e+18 6.727500e+20
> x %% y
[1] 0 3 4 5 6 7 8 9 10 11
> y %/% x
[1] 1 1 1 1 1 1 1 1 1 1

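Operations are applied element by element. If the vectors differ in length, the shorter one is recycled; the length of the longer vector should be a whole multiple of the shorter one, otherwise R still computes a result but issues a warning.
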
> x <- c(1, 3)
> y <- c(2:11)
> x
[1] 1 3
> y
[1] 2 3 4 5 6 7 8 9 10 11
> x + y
[1] 3 6 5 8 7 10 9 12 11 14
> z <- c(1, 2, 3)
> x + z
[1] 2 5 4
Warning message:
In x + z : longer object length is not a multiple of shorter object length
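
Recycling can be made explicit with rep_len(), which stretches the shorter vector to a given length. The following is only a sketch of what R does implicitly (the names short, long, recycled and explicit_sum are mine):

```r
short <- c(1, 3)
long <- 2:11

# Stretch the short vector to the length of the long one: 1 3 1 3 ...
recycled <- rep_len(short, length(long))

# Adding the stretched vector gives the same result as short + long
explicit_sum <- recycled + long

print(recycled)
print(explicit_sum)
print(identical(explicit_sum, short + long))
```
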

Logical operations

> y <- c(5:-5)
> x <- c(0:10)
> x
[1] 0 1 2 3 4 5 6 7 8 9 10
> x > 5
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[8] TRUE TRUE TRUE TRUE
> y <- c(5:-5)
> x > y
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[8] TRUE TRUE TRUE TRUE
> x == y
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8] FALSE FALSE FALSE FALSE
> c(1,2,3) %in% c(2,3,6,3)
[1] FALSE TRUE TRUE

Mathematical functions

> x <- c(-5:5)
> x
[1] -5 -4 -3 -2 -1 0 1 2 3 4 5
> abs(x)
[1] 5 4 3 2 1 0 1 2 3 4 5
> sqrt(x)
[1] NaN NaN NaN NaN NaN 0.000000 1.000000 1.414214 1.732051 2.000000
[11] 2.236068
Warning message:
In sqrt(x) : NaNs produced
> sqrt(abs(x))
[1] 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 1.000000 1.414214 1.732051 2.000000
[11] 2.236068
> log(x, base = 3)
[1] NaN NaN NaN NaN NaN -Inf 0.0000000 0.6309298 1.0000000
[10] 1.2618595 1.4649735
Warning message:
NaNs produced
> log(abs(x) + 1, base = 3)
[1] 1.6309298 1.4649735 1.2618595 1.0000000 0.6309298 0.0000000 0.6309298 1.0000000 1.2618595
[10] 1.4649735 1.6309298
> exp(x)
[1] 6.737947e-03 1.831564e-02 4.978707e-02 1.353353e-01 3.678794e-01 1.000000e+00 2.718282e+00
[8] 7.389056e+00 2.008554e+01 5.459815e+01 1.484132e+02
> sin(x)
[1] 0.9589243 0.7568025 -0.1411200 -0.9092974 -0.8414710 0.0000000 0.8414710 0.9092974
[9] 0.1411200 -0.7568025 -0.9589243
> cos(x)
[1] 0.2836622 -0.6536436 -0.9899925 -0.4161468 0.5403023 1.0000000 0.5403023 -0.4161468
[9] -0.9899925 -0.6536436 0.2836622

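ceiling() returns the smallest integer not less than x; floor() returns the largest integer not greater than x. round() rounds to a given number of decimal places, and signif() to a given number of significant digits.
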
> ceiling(c(-2.3, 3.1415))
[1] -2 4
> floor(c(-2.3, 3.1415))
[1] -3 3
> round(c(-2.3, 3.1415))
[1] -2 3
> round(c(-2.3, 3.1415), digits = 2)
[1] -2.30 3.14
> signif(c(-2.3, 3.1415), digits = 2)
[1] -2.3 3.1

Statistical functions

> vec <- c(1:100)
> vec
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
[45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
[67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
[89] 89 90 91 92 93 94 95 96 97 98 99 100
> sum(vec)
[1] 5050
> max(vec)
[1] 100
> min(vec)
[1] 1
> mean(vec)
[1] 50.5
> var(vec)
[1] 841.6667
> round(var(vec), digits = 2)
[1] 841.67
> sd(vec)
[1] 29.01149
> prod(vec)
[1] 9.332622e+157
> median(vec)
[1] 50.5
> quantile(vec)
0% 25% 50% 75% 100%
1.00 25.75 50.50 75.25 100.00
> quantile(vec, c(0.4, 0.5, 0.8))
40% 50% 80%
40.6 50.5 80.2
> t <- c(1, 2, 4, 6, 7,-2)
> which.max(t)
[1] 5
> which.min(t)
[1] 6
> t[which(t == 7)]
[1] 7
> t[which(t > 5)]
[1] 6 7

Multidimensional vectors

Matrices

> m <- matrix(1:20, nrow = 4, ncol = 5)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

Values are filled column by column by default

> m <- matrix(1:20, 4)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

Filling by row

> m <- matrix(1:20, 4, byrow = T)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20

Filling by column

> m <- matrix(1:20, 4, byrow = F)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

Naming the rows and columns of a matrix

> rnames <- c("R1", "R2", "R3", "R4")
> cnames <- c("C1", "C2", "C3", "C4", "C5")
> dimnames(m) <- list(rnames, cnames)
> m
C1 C2 C3 C4 C5
R1 1 5 9 13 17
R2 2 6 10 14 18
R3 3 7 11 15 19
R4 4 8 12 16 20

Matrix dimensions

> dim(m)
[1] 4 5

Converting a vector to a matrix

> x <- c(1:20)
> dim(x) <- c(4:5)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

Multidimensional arrays

> x <- c(1:20)
> dim(x) <- c(2, 2, 5)
> x
, , 1

[,1] [,2]
[1,] 1 3
[2,] 2 4

, , 2

[,1] [,2]
[1,] 5 7
[2,] 6 8

, , 3

[,1] [,2]
[1,] 9 11
[2,] 10 12

, , 4

[,1] [,2]
[1,] 13 15
[2,] 14 16

, , 5

[,1] [,2]
[1,] 17 19
[2,] 18 20

> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))
> z
, , C1

B1 B2 B3
A1 1 3 5
A2 2 4 6

, , C2

B1 B2 B3
A1 7 9 11
A2 8 10 12

, , C3

B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

B1 B2 B3
A1 19 21 23
A2 20 22 24

Indexing

> m <- matrix(1:20, 4, 5, byrow = T)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
> m[2, 5]
[1] 10
> m[2, c(2, 3, 5)]
[1] 7 8 10
> m[c(2:4), c(4, 5)]
[,1] [,2]
[1,] 9 10
[2,] 14 15
[3,] 19 20
> m[3,]
[1] 11 12 13 14 15
> m[, 3]
[1] 3 8 13 18
> m[2]
[1] 6
> m[-1, 2]
[1] 7 12 17
> rnames <- c("R1", "R2", "R3", "R4")
> cnames <- c("C1", "C2", "C3", "C4", "C5")
> dimnames(m) <- list(rnames, cnames)
> m
C1 C2 C3 C4 C5
R1 1 2 3 4 5
R2 6 7 8 9 10
R3 11 12 13 14 15
R4 16 17 18 19 20
> m["R2", "C5"]
[1] 10

Matrix operations

Arithmetic

> rnames <- c("R1", "R2", "R3", "R4")
> cnames <- c("C1", "C2", "C3", "C4", "C5")
> dimnames(m) <- list(rnames, cnames)
> m
C1 C2 C3 C4 C5
R1 1 2 3 4 5
R2 6 7 8 9 10
R3 11 12 13 14 15
R4 16 17 18 19 20
> m + 1
C1 C2 C3 C4 C5
R1 2 3 4 5 6
R2 7 8 9 10 11
R3 12 13 14 15 16
R4 17 18 19 20 21
> m * 2
C1 C2 C3 C4 C5
R1 2 4 6 8 10
R2 12 14 16 18 20
R3 22 24 26 28 30
R4 32 34 36 38 40
> m + m
C1 C2 C3 C4 C5
R1 2 4 6 8 10
R2 12 14 16 18 20
R3 22 24 26 28 30
R4 32 34 36 38 40
> n <- matrix(1:20, 5, 4)
> n
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
> m + n
Error in m + n : non-conformable arrays

Matrix functions

rnames <- c("R1", "R2", "R3", "R4")
cnames <- c("C1", "C2", "C3", "C4", "C5")
dimnames(m) <- list(rnames, cnames)
m
rowSums(m)
colSums(m)

Element-wise product (*) and matrix product (%*%)

> m <- matrix(1:9, nrow = 3, ncol = 3)
> n <- matrix(2:10, nrow = 3, ncol = 3)
> m
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> n
[,1] [,2] [,3]
[1,] 2 5 8
[2,] 3 6 9
[3,] 4 7 10
> m * n
[,1] [,2] [,3]
[1,] 2 20 56
[2,] 6 30 72
[3,] 12 42 90
> m %*% n
[,1] [,2] [,3]
[1,] 42 78 114
[2,] 51 96 141
[3,] 60 114 168
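
For completeness, the inner and outer products of two vectors can be written as below. This is a sketch of my own, not from the tutorial; v, inner and outer_prod are illustrative names, and %o% is base R's shorthand for outer().

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
n <- matrix(2:10, nrow = 3, ncol = 3)

hadamard <- m * n    # element-wise product
mat_prod <- m %*% n  # matrix product (rows of m times columns of n)

v <- 1:3
inner <- sum(v * v)     # inner (dot) product of v with itself
outer_prod <- v %o% v   # outer product: 3 x 3 matrix with entries v[i] * v[j]

print(inner)
print(outer_prod)
```
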

Diagonal and transpose

> m <- matrix(1:9, nrow = 3, ncol = 3)
> m
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> diag(m)
[1] 1 5 9
> t(m)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9

Lists

> a <- c(1:20)
> b <- matrix(1:20, 4, 5)
> c <- mtcars
> d <- "this is a test"
> mlist <- list(a, b, c, d)
> mlist
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

[[2]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

[[3]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

[[4]]
[1] "this is a test"

Using named elements

> a <- c(1:20)
> b <- matrix(1:20, 4, 5)
> c <- mtcars
> d <- "this is a test"
> mlist <- list(first = a,second = b,third = c,forth = d)
> mlist
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

$second
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

$third
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

$forth
[1] "this is a test"

Indexing

> mlist[1]
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

> mlist[c(1, 4)]
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

$forth
[1] "this is a test"

> mlist[c("second", "third")]
$second
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

$third
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Access with $

> mlist$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Double brackets return the element itself in its original type; single brackets always return a list.

> mlist[1]
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

> class(mlist[1])
[1] "list"
> mlist[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> class(mlist[[1]])
[1] "integer"
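
The distinction matters as soon as the result is passed to another function. A minimal sketch, using a cut-down list with made-up contents rather than the mlist above:

```r
mlist <- list(first = 1:20, forth = "this is a test")

sub_list <- mlist["first"]    # a list of length 1
element  <- mlist[["first"]]  # the integer vector itself

total <- sum(element)         # works on the vector

# sum() on the one-element list fails, because a list is not numeric
failed <- inherits(try(sum(sub_list), silent = TRUE), "try-error")

print(class(sub_list))
print(class(element))
print(total)
print(failed)
```
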

Assigning to a list

> mlist[5] <- iris
Warning message:
In mlist[5] <- iris :
number of items to replace is not a multiple of replacement length
> mlist[[5]] <- iris
> mlist[5]
[[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
...
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica

Deleting elements

Assigning NULL

> mlist[5] <- NULL
> mlist[5]
$<NA>
NULL

Negative index

> mlist <- mlist[-5]
> mlist[5]
$<NA>
NULL
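
Both ways of deleting are easier to follow on a small list where the length can be checked after each step. The sketch below uses a hypothetical three-element list, not the mlist from above:

```r
mlist <- list(first = 1:3, second = "b", third = TRUE)

# Assigning NULL removes the element; the list shrinks to length 2
mlist[["second"]] <- NULL
after_null <- names(mlist)

# A negative index returns a copy without that position
trimmed <- mlist[-1]

# To store an actual NULL as an element, wrap it in list()
mlist["empty"] <- list(NULL)

print(after_null)
print(names(trimmed))
print(length(mlist))
```
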

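Data frames

A data frame can be thought of as what we usually call a "table". It is one of R's data structures, a special list whose elements are the columns. Every column has a unique name and all columns have the same length; values within a column must share one data type, while different columns may have different types.
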
> mystate <- data.frame(state.name, state.abb, state.region)
> mystate
state.name state.abb state.region
1 Alabama AL South
2 Alaska AK West
...
49 Wisconsin WI North Central
50 Wyoming WY West

Indexing

> mystate <- data.frame(state.name, state.abb, state.region)
> mystate[1]
state.name
1 Alabama
2 Alaska
...
49 Wisconsin
50 Wyoming
> mystate[c(2, 3)]
state.abb state.region
1 AL South
2 AK West
...
49 WI North Central
50 WY West
> mystate[-c(1, 3)]
state.abb
1 AL
2 AK
...
49 WI
50 WY
> mystate[, "state.abb"]
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
[16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
[31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
[46] "VA" "WA" "WV" "WI" "WY"
> mystate["1", ]
state.name state.abb state.region
1 Alabama AL South
> mystate[5, ]
state.name state.abb state.region
5 California CA West
> mystate$state.region
[1] South West West South West
[6] West Northeast South South South
[11] West West North Central North Central North Central
[16] North Central South South Northeast South
[21] Northeast North Central North Central South North Central
[26] West North Central West Northeast Northeast
[31] West Northeast South North Central North Central
[36] South West Northeast Northeast South
[41] North Central South South West Northeast
[46] South West South North Central West
Levels: Northeast South North Central West

attach() function: loading a data frame

> attach(mtcars)
> mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
[18] 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
> hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97
[22] 150 150 245 175 66 91 113 264 175 335 109

Row names

> rownames(mtcars)
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Column names

> colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

detach() function: unloading the data frame

> detach(mtcars)
> mpg
Error: object 'mpg' not found
> hp
Error: object 'hp' not found

with() function: accessing data

> with(mtcars,{sum(mpg)})
[1] 642.9
> with(mtcars, {hp})
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97
[22] 150 150 245 175 66 91 113 264 175 335 109
> with(mtcars, {mpg})
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
[18] 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
> with(mtcars, {sum(mpg)})
[1] 642.9
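
Compared with attach(), with() makes the columns visible only inside the expression, so nothing needs to be detached afterwards. A short sketch (avg_mpg, ratio and same are names I made up):

```r
# Columns of mtcars can be used by name inside with()
avg_mpg <- with(mtcars, mean(mpg))
ratio <- with(mtcars, hp / wt)   # horsepower per unit of weight, one value per car

# Equivalent to the explicit $ form
same <- identical(avg_mpg, mean(mtcars$mpg))

print(avg_mpg)
print(head(ratio, 3))
print(same)
```
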

Factors

Factors represent the categories in a set of data; they record both the category names and the number of categories.

Nominal variables

> f <- factor(c("red", "red", "green", "blue", "green", "blue", "blue"))
> f
[1] red red green blue green blue blue
Levels: blue green red

Ordinal variables

> week <- factor(c("Mon", "Fri", "Thu", "Wed", "Mon", "Fri", "Sun"))
> week
[1] Mon Fri Thu Wed Mon Fri Sun
Levels: Fri Mon Sun Thu Wed
> week <-
+ factor(
+ c("Mon", "Fri", "Thu", "Wed", "Mon", "Fri", "Sun"),
+ ordered = T,
+ levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
+ )
> week
[1] Mon Fri Thu Wed Mon Fri Sun
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
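
Once the levels are ordered, comparisons and counts follow the level order rather than alphabetical order. A minimal sketch of what that buys you (days, late_week, counts and codes are illustrative names):

```r
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
week <- factor(
  c("Mon", "Fri", "Thu", "Wed", "Mon", "Fri", "Sun"),
  levels = days,
  ordered = TRUE
)

late_week <- week[week >= "Thu"]  # comparison uses the level order
counts <- table(week)             # unused levels are counted as 0
codes <- as.integer(week)         # position of each value within the levels

print(late_week)
print(counts)
print(codes)
```
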

Plotting factors

> plot(mtcars$cyl)
> plot(factor(mtcars$cyl))

A numeric vector is drawn as a scatter plot.

A factor is drawn as a bar chart.

cut() function

Splits a numeric vector into intervals

> num <- c(1:100)
> num
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
[22] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
[64] 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
> cut(num, c(seq(0, 100, 10)))
[1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10]
[10] (0,10] (10,20] (10,20] (10,20] (10,20] (10,20] (10,20] (10,20] (10,20]
[19] (10,20] (10,20] (20,30] (20,30] (20,30] (20,30] (20,30] (20,30] (20,30]
[28] (20,30] (20,30] (20,30] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40]
[37] (30,40] (30,40] (30,40] (30,40] (40,50] (40,50] (40,50] (40,50] (40,50]
[46] (40,50] (40,50] (40,50] (40,50] (40,50] (50,60] (50,60] (50,60] (50,60]
[55] (50,60] (50,60] (50,60] (50,60] (50,60] (50,60] (60,70] (60,70] (60,70]
[64] (60,70] (60,70] (60,70] (60,70] (60,70] (60,70] (60,70] (70,80] (70,80]
[73] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (80,90]
[82] (80,90] (80,90] (80,90] (80,90] (80,90] (80,90] (80,90] (80,90] (80,90]
[91] (90,100] (90,100] (90,100] (90,100] (90,100] (90,100] (90,100] (90,100] (90,100]
[100] (90,100]
10 Levels: (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] ... (90,100]
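
cut() can also attach readable labels and close the intervals on the left instead of the right. The grading scheme below is a made-up example, not from the tutorial:

```r
scores <- c(55, 60, 72, 89, 90, 100)

# right = FALSE gives intervals [0,60), [60,80), [80,90), [90,100];
# include.lowest = TRUE then also closes the last interval at 100
grade <- cut(
  scores,
  breaks = c(0, 60, 80, 90, 100),
  labels = c("D", "C", "B", "A"),
  right = FALSE,
  include.lowest = TRUE
)

print(grade)
print(table(grade))
```
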

Handling missing data

NA denotes a missing value

> a <- c(NA, 1:49)
> a
[1] NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[29] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
> is.na(a)
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[29] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Operations on missing values

> NA + 1
[1] NA
> NA == 0
[1] NA
> a <- c(NA, 1:49)
> a
[1] NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[29] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
> sum(a)
[1] NA
> mean(a)
[1] NA

Skipping missing values

> a <- c(NA, 1:49)
> a
[1] NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[29] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
> sum(a, na.rm = TRUE)
[1] 1225
> mean(a, na.rm = TRUE)
[1] 25

Removing missing values

> a <- c(NA, 1:10, NA, 10:20, NA, NA)
> a
[1] NA 1 2 3 4 5 6 7 8 9 10 NA 10 11 12 13 14 15 16 17 18 19 20 NA NA
> d <- na.omit(a)
> d
[1] 1 2 3 4 5 6 7 8 9 10 10 11 12 13 14 15 16 17 18 19 20
attr(,"na.action")
[1] 1 12 24 25
attr(,"class")
[1] "omit"
> is.na(d)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> sum(d)
[1] 220
> mean(d)
[1] 10.47619

na.omit() removes the rows of a data set that contain missing data

> install.packages("VIM")
> library(VIM)
> sleep
BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1 6654.000 5712.00 NA NA 3.3 38.6 645.0 3 5 3
2 1.000 6.60 6.3 2.0 8.3 4.5 42.0 3 1 3
3 3.385 44.50 NA NA 12.5 14.0 60.0 1 1 1
4 0.920 5.70 NA NA 16.5 NA 25.0 5 2 3
...
61 3.500 3.90 12.8 6.6 19.4 3.0 14.0 2 1 1
62 4.050 17.00 NA NA NA 13.0 38.0 3 1 1
> sum(is.na(sleep))
[1] 38
> na.omit(sleep)
BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
2 1.000 6.60 6.3 2.0 8.3 4.5 42.0 3 1 3
5 2547.000 4603.00 2.1 1.8 3.9 69.0 624.0 3 5 4
6 10.550 179.50 9.1 0.7 9.8 27.0 180.0 4 4 4
7 0.023 0.30 15.8 3.9 19.7 19.0 35.0 1 1 1
8 160.000 169.00 5.2 1.0 6.2 30.4 392.0 4 5 4
9 3.300 25.60 10.9 3.6 14.5 28.0 63.0 1 2 1
...
60 4.190 58.00 9.7 0.6 10.3 24.0 210.0 4 3 4
61 3.500 3.90 12.8 6.6 19.4 3.0 14.0 2 1 1
> length(rownames(sleep))
[1] 62
> length(rownames(na.omit(sleep)))
[1] 42
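
The same row filtering can be done by hand with complete.cases(), which is handy when you want to see which rows would be dropped first. A sketch on a small invented data frame (no VIM needed):

```r
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z")
)

keep <- complete.cases(df)        # TRUE for rows without any NA
clean <- df[keep, ]               # same rows that na.omit(df) keeps
na_per_col <- colSums(is.na(df))  # number of missing values per column

print(keep)
print(clean)
print(na_per_col)
```
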

Other special values

NaN denotes a value that is not a number (undefined)

Inf denotes infinity

> 1 / 0
[1] Inf
> - 1 / 0
[1] -Inf
> 0 / 0
[1] NaN
> is.nan(0 / 0)
[1] TRUE
> is.infinite(1 / 0)
[1] TRUE
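
The test functions overlap in a way that is easy to get wrong: is.na() is also TRUE for NaN, but is.nan() is not TRUE for NA. A small sketch with an illustrative vector vals:

```r
vals <- c(1, NA, NaN, Inf, -Inf)

na_flags       <- is.na(vals)        # TRUE for NA and for NaN
nan_flags      <- is.nan(vals)       # TRUE only for NaN
finite_flags   <- is.finite(vals)    # TRUE only for ordinary numbers
infinite_flags <- is.infinite(vals)  # TRUE for Inf and -Inf

print(na_flags)
print(nan_flags)
print(finite_flags)
print(infinite_flags)
```
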

Strings

Counting string length

> nchar("Hello World")
[1] 11
> month.name
[1] "January" "February" "March" "April" "May" "June" "July"
[8] "August" "September" "October" "November" "December"
> nchar(month.name)
[1] 7 8 5 5 3 4 4 6 9 7 8 8
> length(month.name)
[1] 12
> nchar(c(12, 3, 100))
[1] 2 1 3

Concatenating strings

> paste("I", "Love", "bnu")
[1] "I Love bnu"
> paste("I", "Love", "bnu", sep = "-")
[1] "I-Love-bnu"
> names <- c("You", "I", "Ta")
> paste(names, "Love bnu")
[1] "You Love bnu" "I Love bnu" "Ta Love bnu"

Extracting substrings

> substr(x = month.name, start = 1, stop = 3)
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Changing case

> temp <- substr(x = month.name, start = 1, stop = 3)
> toupper(temp)
[1] "JAN" "FEB" "MAR" "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT" "NOV" "DEC"
> tolower(temp)
[1] "jan" "feb" "mar" "apr" "may" "jun" "jul" "aug" "sep" "oct" "nov" "dec"
> gsub("^(\\w)", "\\U\\1", tolower(temp), perl = T)
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> gsub("^(\\w)", "\\L\\1", toupper(temp), perl = T)
[1] "jAN" "fEB" "mAR" "aPR" "mAY" "jUN" "jUL" "aUG" "sEP" "oCT" "nOV" "dEC"

Searching strings

> x <- c("a", "A", "+", "A+", "+A", "+a", "a+A")
> grep("A+", x, fixed = T)
[1] 4
> grep("A+", x, fixed = F)
[1] 2 4 5 7
> match("A+", x)
[1] 4
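
The difference between the two grep() calls is that + is a regular-expression quantifier unless fixed = TRUE or the character is escaped. A sketch using grepl(), which returns TRUE/FALSE per element instead of positions (the result names are mine):

```r
x <- c("a", "A", "+", "A+", "+A", "+a", "a+A")

has_A   <- grepl("A", x)                  # contains an uppercase A
literal <- grepl("A+", x, fixed = TRUE)   # contains the literal text "A+"
escaped <- grepl("A\\+", x)               # same match, written as a regex
values  <- grep("A", x, value = TRUE)     # return the matching strings

print(has_A)
print(literal)
print(identical(literal, escaped))
print(values)
```
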

Splitting strings

The result is returned as a list

> path <- "/usr/local/bin/R"
> strsplit(path, "/")
[[1]]
[1] "" "usr" "local" "bin" "R"

> strsplit(c(path, path), "/")
[[1]]
[1] "" "usr" "local" "bin" "R"

[[2]]
[1] "" "usr" "local" "bin" "R"
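
Because the result is a list, the pieces for a single string are usually pulled out with [[1]] before further processing. A minimal sketch (parts, last and rebuilt are illustrative names):

```r
path <- "/usr/local/bin/R"

parts <- strsplit(path, "/", fixed = TRUE)[[1]]  # character vector for this one string
parts <- parts[nzchar(parts)]                    # drop the empty piece before the first "/"

last <- parts[length(parts)]                     # final path component
rebuilt <- paste0("/", paste(parts, collapse = "/"))

print(parts)
print(last)
print(rebuilt)
```

For real file paths, base R's basename() and dirname() do this more robustly.
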

Dates and times

> Sys.Date()
[1] "2022-09-28"
> class(Sys.Date())
[1] "Date"

Converting strings to dates

> a <- "2022-09-28"
> class(a)
[1] "character"
> as.Date(a, format = "%Y-%m-%d")
[1] "2022-09-28"
> class(as.Date(a, format = "%Y-%m-%d"))
[1] "Date"
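
The format argument matters when the input is not already in year-month-day order, and once a value is a Date it supports arithmetic. A sketch with an assumed US-style input string:

```r
# Parse month/day/year input by describing its layout
d <- as.Date("09/28/2022", format = "%m/%d/%Y")

iso <- format(d, "%Y-%m-%d")                     # back to a string, ISO layout
gap <- as.numeric(d - as.Date("2022-08-07"))     # difference in days
later <- format(d + 3, "%Y-%m-%d")               # adding a number adds days

print(iso)
print(gap)
print(later)
```
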

Date sequences

> seq(as.Date("2022-08-07"), as.Date("2022-09-28"), by = 3)
[1] "2022-08-07" "2022-08-10" "2022-08-13" "2022-08-16"
[5] "2022-08-19" "2022-08-22" "2022-08-25" "2022-08-28"
[9] "2022-08-31" "2022-09-03" "2022-09-06" "2022-09-09"
[13] "2022-09-12" "2022-09-15" "2022-09-18" "2022-09-21"
[17] "2022-09-24" "2022-09-27"
> class(seq(as.Date("2022-08-07"), as.Date("2022-09-28"), by = 3))
[1] "Date"
> scores <- round(runif(48, min = 50, max = 100))
> scores
[1] 69 95 67 51 78 95 78 55 92 54 55 99 85 75 87
[16] 73 66 64 69 52 96 65 81 88 77 75 81 51 58 79
[31] 91 76 65 92 99 70 70 70 70 74 89 81 95 75 81
[46] 86 59 100
> ts(
+ scores,
+ start = c(2019, 9),
+ end = c(2023, 9),
+ frequency = 1
+ )
Time Series:
Start = 2027
End = 2031
Frequency = 1
[1] 69 95 67 51 78
> ts(
+ scores,
+ start = c(2019, 9),
+ end = c(2023, 9),
+ frequency = 4
+ )
Qtr1 Qtr2 Qtr3 Qtr4
2021 69 95 67 51
2022 78 95 78 55
2023 92 54 55 99
2024 85 75 87 73
2025 66
> ts(
+ scores,
+ start = c(2019, 9),
+ end = c(2023, 9),
+ frequency = 12
+ )
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2019 69 95 67 51
2020 78 95 78 55 92 54 55 99 85 75 87 73
2021 66 64 69 52 96 65 81 88 77 75 81 51
2022 58 79 91 76 65 92 99 70 70 70 70 74
2023 89 81 95 75 81 86 59 100 69
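
The surprising start years in the first two outputs (2027 and 2021) come from how start = c(year, period) is interpreted: the second number is a position within the year, so 9 only means "September" when frequency = 12. The sketch below is my reading of that behaviour, with made-up data in place of scores:

```r
# Monthly series: period 9 of 12 is September
monthly <- ts(1:48, start = c(2019, 9), frequency = 12)
first <- start(monthly)   # year 2019, period 9
last <- end(monthly)      # 48 months later ends in period 8 of 2023

# Extract one calendar year
y2020 <- window(monthly, start = c(2020, 1), end = c(2020, 12))

# With frequency = 1 there is one period per year, so "period 9"
# is pushed 8 whole years past 2019
yearly_start <- tsp(ts(1:5, start = c(2019, 9), frequency = 1))[1]

print(first)
print(last)
print(y2020)
print(yearly_start)
```

Note also that asking for end = c(2023, 9) with monthly data needs 49 values, so the 48 scores are recycled and the first value appears again at the end.
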

File operations

Reading data

Keyboard input

> patientID <- c(1:4)
> admdate <- c("10/15/2009", "11/01/2009", "10/21/2009", "10/28/2009")
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> data <- data.frame(patientID, admdate, age, diabetes, status)
> data
patientID admdate age diabetes status
1 1 10/15/2009 25 Type1 Poor
2 2 11/01/2009 34 Type2 Improved
3 3 10/21/2009 28 Type1 Excellent
4 4 10/28/2009 52 Type1 Poor

edit() function

> datacopy <-
+ data.frame(
+ patientID = character(0),
+ admdate = character(0),
+ age = numeric(),
+ diabetes = character(),
+ status = character()
+ )
> datacopy <- edit(datacopy)

fix() function

fix(datacopy)

Reading external files

Changing the working directory

> getwd()
[1] "D:/学习/R初体验"
> setwd("D:/学习/R初体验/RData")

read.table() reads a file (relative path)

> x <- read.table("input.txt")
> x
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
...
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30

head() and tail() show the first and last rows of the data

> head(x)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> tail(x)
Ozone Solar.R Wind Temp Month Day
148 14 20 16.6 63 9 25
149 30 193 6.9 70 9 26
150 NA 145 13.2 77 9 27
151 14 191 14.3 75 9 28
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
> head(x, n = 10)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10

sep sets the field separator

> x <- read.table("input.csv", sep = ",")
> x
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 mpg cyl disp hp drat wt qsec vs am gear carb
2 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
...
32 Maserati Bora 15 8 301 335 3.54 3.57 14.6 0 1 5 8
33 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

header = TRUE uses the first row as the variable names

> x <- read.table("input.csv", sep = ",", header = TRUE)
> head(x)
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

skip skips the first lines of the file (usually file comments)

> x <- read.table("input 1.txt", header = TRUE, skip = 5)
> head(x)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6

nrows sets the number of rows to read

> x <- read.table("input 1.txt", skip = 20, nrows = 50)
> x
V1 V2 V3 V4 V5 V6 V7
1 15 18 65 13.2 58 5 15
2 16 14 334 11.5 64 5 16
...
49 63 49 248 9.2 85 7 2
50 64 32 236 9.2 81 7 3

na.strings specifies which strings are treated as missing values

> x <- read.table("input.txt", na.strings = "NA")
> x
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
...
150 NA 145 13.2 77 9 27
151 14 191 14.3 75 9 28
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
> is.na(x)
Ozone Solar.R Wind Temp Month Day
1 FALSE FALSE FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE FALSE
5 TRUE TRUE FALSE FALSE FALSE FALSE
...
150 TRUE FALSE FALSE FALSE FALSE FALSE
151 FALSE FALSE FALSE FALSE FALSE FALSE
152 FALSE FALSE FALSE FALSE FALSE FALSE
153 FALSE FALSE FALSE FALSE FALSE FALSE

stringsAsFactors converts character columns to factors
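
A minimal sketch of the effect, reading from an in-memory string (via the text argument) so that no file is needed; the data and the names txt, as_chr and as_fct are invented for illustration. Since R 4.0.0 the default is stringsAsFactors = FALSE.

```r
txt <- "name,group\nann,a\nbob,b\ncat,a"

as_chr <- read.csv(text = txt, stringsAsFactors = FALSE)
as_fct <- read.csv(text = txt, stringsAsFactors = TRUE)

print(class(as_chr$group))   # character column
print(class(as_fct$group))   # factor column
print(levels(as_fct$group))  # distinct values, sorted
```
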

read.csv() reads CSV files

> x <- read.csv("input.csv")
> x
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

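Reading a file from the web: read.table() accepts a URL in place of a file name and downloads the content. The URL must point to a plain-text table; a zip archive like the one below would first have to be downloaded and unpacked.
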
> x <- read.table("https://codeload.github.com/mperdeck/LINQtocSV/zip/master", header = TRUE)

Reading tables from a web page

> install.packages("XML")
> library(XML)
> x <- readHTMLTable("https://en.wikipedia.org/wiki/World_population" , which = 3)

Reading from the clipboard (content copied with Ctrl+C)

> x <- read.table("clipboard", header =  TRUE, sep = "\t")
> x
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
> readClipboard()
[1] "\tmpg\tcyl\tdisp\thp\tdrat\twt\tqsec\tvs\tam\tgear\tcarb"
[2] "Mazda RX4\t21\t6\t160\t110\t3.9\t2.62\t16.46\t0\t1\t4\t4"
...
[25] "Camaro Z28\t13.3\t8\t350\t245\t3.73\t3.84\t15.41\t0\t0\t3\t4"
[26] "Pontiac Firebird\t19.2\t8\t400\t175\t3.08\t3.845\t17.05\t0\t0\t3\t2"

Reading compressed files

> x <- read.table(gzfile("input.txt.gz"))
> x
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4

readLines() reads a fixed number of lines from a file

> x <- readLines("input.csv", n = 5)
> x
[1] "\"\",\"mpg\",\"cyl\",\"disp\",\"hp\",\"drat\",\"wt\",\"qsec\",\"vs\",\"am\",\"gear\",\"carb\""
[2] "\"Mazda RX4\",21,6,160,110,3.9,2.62,16.46,0,1,4,4"
[3] "\"Mazda RX4 Wag\",21,6,160,110,3.9,2.875,17.02,0,1,4,4"
[4] "\"Datsun 710\",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
[5] "\"Hornet 4 Drive\",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1"

scan() reads the values of a file into a vector or list

> x <- scan("scan.txt", what = list(X1 = character(3), X2 = character(0), X3 = character(0)))
Read 20 records
> x
$X1
[1] "one" "four" "one" "four" "one" "four" "one" "four" "one" "four" "one" "four" "one" "four" "one" "four" "one" "four"
[19] "one" "four"

$X2
[1] "2" "5" "2" "5" "2" "5" "2" "5" "2" "5" "2" "5" "2" "5" "2" "5" "2" "5" "2" "5"

$X3
[1] "3" "6" "3" "6" "3" "6" "3" "6" "3" "6" "3" "6" "3" "6" "3" "6" "3" "6" "3" "6"
> class(x)
[1] "list"
> data <- data.frame(x$X1, x$X2, x$X3)
> data
x.X1 x.X2 x.X3
1 one 2 3
2 four 5 6
...
19 one 2 3
20 four 5 6

Accessing database systems

> install.packages("RODBC")
> library("RODBC")

Writing files

Writing vectors

> x <- rivers
> x
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870 906 202 329 290 1000 600 505 1450 840 1243
[26] 890 350 407 286 280 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600 306 390 420 291 710
[51] 340 217 281 352 259 250 470 680 570 350 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
[76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735 233 435 490 310 460 383 375 1270 545 445
[101] 1885 380 300 380 377 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540 1038 424 310 300 444
[126] 301 268 620 215 652 900 525 246 360 529 500 720 270 430 671 1770
> cat(x)
735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529 500 720 270 430 671 1770
> write(x, file = "x.txt")

Writing tables

> x <- read.table("input.txt", header = TRUE)
> write.table(x, file = "newfile.txt")
> write.table(x, file = "newfile.csv", sep = ",")

Omitting row names

> write.table(x, file = "newfile.csv", sep = ",", row.names = FALSE)

append = TRUE appends to the end of an existing file

> write.table(iris, file = "newfile.txt", append = TRUE)
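
A hedged sketch of appending to a CSV file, assuming a temporary file and two slices of `iris` as stand-in data. Setting `col.names = FALSE` on the second call keeps the header from being written twice:

```r
tmp <- tempfile(fileext = ".csv")  # hypothetical path, for illustration only

# First write creates the file with a header row.
write.table(head(iris, 3), file = tmp, sep = ",", row.names = FALSE)

# Second write appends rows only; the header is suppressed.
write.table(tail(iris, 3), file = tmp, sep = ",",
            row.names = FALSE, col.names = FALSE, append = TRUE)

combined <- read.csv(tmp)
nrow(combined)

unlink(tmp)
```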

Writing compressed files

> write.table(mtcars, gzfile("newfile.txt.gz"))

Reading and writing Excel files

Install the XLConnect package. It depends on Java, so a Java runtime must be installed first.

> install.packages("XLConnect")
> library(XLConnect)

Reading Excel files

> ex <- loadWorkbook("data.xlsx")
> exdata <- readWorksheet(ex, 1)
> head(exdata)
Col1 mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> readWorksheet(ex, 1, startRow = 0, startCol = 0, endRow = 10, endCol = 3, header = TRUE)
Col1 mpg cyl
1 Mazda RX4 21.0 6
2 Mazda RX4 Wag 21.0 6
3 Datsun 710 22.8 4
4 Hornet 4 Drive 21.4 6
5 Hornet Sportabout 18.7 8
6 Valiant 18.1 6
7 Duster 360 14.3 8
8 Merc 240D 24.4 4
9 Merc 230 22.8 4

In a single step

> readWorksheetFromFile("data.xlsx", 1)
Col1 mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Writing Excel files

> wb <- loadWorkbook("exfile.xlsx", create = TRUE)
> createSheet(wb, "Sheet 1")
> writeWorksheet(wb, data = mtcars, sheet = "Sheet 1")
> saveWorkbook(wb)

In a single step

> writeWorksheetToFile("exfile.xlsx", data = iris, sheet = "Sheet 1")

Reading and writing Excel files with the xlsx package

> library(xlsx)
> x <- read.xlsx("data.xlsx", 1, startRow = 1, endRow = 10)
> write.xlsx(x, file = "rdata.xlsx", sheetName = "Sheet 1", append = FALSE)

Reading and writing R-format files

RDS files

> saveRDS(iris, file = "iris.RDS")
> x <- readRDS("iris.RDS")

RData files

> save(iris, iris3, file = "iris.Rdata")
> load("iris.Rdata")
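
The practical difference between the two formats can be sketched as follows, assuming temporary files and the throwaway objects `a` and `b`. An RDS file holds one object and the caller picks its name on reading, while an RData file restores objects under their saved names:

```r
rds_path   <- tempfile(fileext = ".RDS")    # hypothetical paths
rdata_path <- tempfile(fileext = ".RData")

# RDS: one object per file; readRDS() returns it, so any name can be used.
saveRDS(iris, file = rds_path)
flowers <- readRDS(rds_path)

# RData: several objects per file; load() recreates them under their
# original names and returns those names as a character vector.
a <- 1:3
b <- letters[1:3]
save(a, b, file = rdata_path)
rm(a, b)
restored <- load(rdata_path)
restored

unlink(c(rds_path, rdata_path))
```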

Saving all objects in the workspace

> save.image()

Data transformation

Type conversion

is.data.frame(): test whether an object is a data frame

> library(xlsx)
> cars32 <- read.xlsx("mtcars.xlsx", sheetIndex = 1, header = TRUE)
> cars32
NA. mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> is.data.frame(cars32)
[1] TRUE

as.data.frame(): convert to a data frame

> is.data.frame(state.x77)
[1] FALSE
> dstate.x77 <- as.data.frame(state.x77)
> is.data.frame(dstate.x77)
[1] TRUE

as.matrix(): convert to a matrix. If any column is character or factor, every value is converted to a character string.

> data <- data.frame(state.region, state.x77)
> data
state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alabama South 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska West 365 6315 1.5 69.31 11.3 66.7 152 566432
...
Wisconsin North Central 4589 4468 0.7 72.48 3.0 54.5 149 54464
Wyoming West 376 4566 0.6 70.29 6.9 62.9 173 97203
> as.matrix(data)
state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alabama "South" " 3615" "3624" "2.1" "69.05" "15.1" "41.3" " 20" " 50708"
Alaska "West" " 365" "6315" "1.5" "69.31" "11.3" "66.7" "152" "566432"
...
Wisconsin "North Central" " 4589" "4468" "0.7" "72.48" " 3.0" "54.5" "149" " 54464"
Wyoming "West" " 376" "4566" "0.6" "70.29" " 6.9" "62.9" "173" " 97203"
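
Two ways to keep the matrix numeric, sketched under the assumption that either dropping the non-numeric columns or replacing factors by their integer codes is acceptable for the task at hand:

```r
data <- data.frame(state.region, state.x77)

# Option 1: keep only the numeric columns before converting.
numeric_cols <- vapply(data, is.numeric, logical(1))
m <- as.matrix(data[, numeric_cols])
mode(m)

# Option 2: data.matrix() converts factors to their integer codes,
# so the result stays numeric but the labels are lost.
codes <- data.matrix(data)
mode(codes)
```

The second option is convenient but the integer codes only make sense together with `levels(state.region)`.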

as.vector(): convert to a vector

> as.vector(state.x77)
[1] 3615.00 365.00 2212.00 2110.00 21198.00 2541.00 3100.00 579.00 8277.00 4931.00 868.00 813.00
[13] 11197.00 5313.00 2861.00 2280.00 3387.00 3806.00 1058.00 4122.00 5814.00 9111.00 3921.00 2341.00
...
[385] 40975.00 68782.00 96184.00 44966.00 1049.00 30225.00 75955.00 41328.00 262134.00 82096.00 9267.00 39780.00
[397] 66570.00 24070.00 54464.00 97203.00

as.factor(): convert to a factor

> x <- state.abb
> x
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO"
[26] "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
> as.factor(x)
[1] AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN
[43] TX UT VT VA WA WV WI WY
50 Levels: AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR ... WY

as.list(): convert to a list

> x <- state.abb
> as.list(x)
[[1]]
[1] "AL"

[[2]]
[1] "AK"
...
[[50]]
[1] "WY"

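unname() removes names; unlist() flattens to a vector. Because the row mixes types, unlist() returns character values, and the factor state.region appears as its integer code ("4").
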
> x <- state.abb
> state <- data.frame(x, state.region, state.x77)
> state$Income
[1] 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 4963 4119 5107 4458 4628 4669 3712 3545 3694 5299 4755 4751 4675 3098 4254
[26] 4347 4508 5149 4281 5237 3601 4903 3875 5087 4561 3983 4660 4449 4558 3635 4167 3821 4188 4022 3907 4701 4864 3617 4468 4566
> state["Nevada", ]
x state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Nevada NV West 590 5149 0.5 69.03 11.5 65.2 188 109889
> is.data.frame(state["Nevada", ])
[1] TRUE
> y <- state["Nevada", ]
> y
x state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Nevada NV West 590 5149 0.5 69.03 11.5 65.2 188 109889
> unname(y)

Nevada NV West 590 5149 0.5 69.03 11.5 65.2 188 109889
> unlist(y)
x state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
"NV" "4" "590" "5149" "0.5" "69.03" "11.5" "65.2" "188" "109889"

Selecting subsets

Selecting by index

View() opens a data viewer window

> who <- read.csv("WHO.csv", header = TRUE)
> View(who)
> who1 <- who[c(1:50), c(1:10)]
> View(who1)
> who2 <- who[c(1, 3, 5, 8), c(2, 4, 6, 8)]
> View(who2)

Selecting by logical condition

> who3 <- who[which(who$Continent == 7), ]
> View(who3)
> who4 <- who[which(who$CountryID > 50 & who$CountryID <= 100), ]
> View(who4)

Selecting with subset()

> who4 <- subset(who, CountryID > 50 & CountryID <= 100)
> View(who4)
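
Since `WHO.csv` is not available here, the three selection styles can be compared on the built-in `mtcars` data instead. This is only a sketch; the condition and the selected columns are arbitrary:

```r
# Index with which(): rows where the condition is TRUE (NA rows are dropped).
by_which <- mtcars[which(mtcars$mpg > 20 & mtcars$cyl == 4), ]

# Index with the logical vector directly (NA rows would appear as NA rows).
by_logical <- mtcars[mtcars$mpg > 20 & mtcars$cyl == 4, ]

# subset(): column names are looked up inside the data frame,
# and `select` picks columns at the same time.
by_subset <- subset(mtcars, mpg > 20 & cyl == 4, select = c(mpg, cyl, hp))

identical(by_which, by_logical)  # mtcars has no missing values
```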

Random sampling with sample()

> x <- 1:100
> sample(x, 30)
[1] 55 92 50 37 48 16 9 20 28 76 10 87 86 17 59 77 21 67 43 8 34 71 3 39 6 83 69 30 14 54
> sample(x, 60, replace = TRUE)
[1] 74 80 71 81 88 84 52 22 69 9 99 77 62 60 85 48 84 23 1 10 57 92 13 100 2 27 42 64 26 16 81
[32] 55 18 90 84 18 95 92 1 8 48 70 37 84 97 51 2 49 55 9 79 60 64 62 15 50 7 31 15 22
> sort(sample(x, 60, replace = TRUE))
[1] 1 2 3 5 8 9 10 11 12 15 16 25 26 35 35 35 37 40 40 40 41 46 47 48 49 49 50 51 52 53 53
[32] 55 56 57 58 61 65 66 67 68 69 69 69 71 74 76 76 83 84 86 88 89 89 91 92 93 94 95 97 100
> who <- read.csv("WHO.csv", header = TRUE)
> sample(who$CountryID, 30, replace = TRUE)
[1] 76 93 166 197 108 24 174 100 61 65 176 169 127 162 196 193 92 122 77 52 198 118 63 61 38 185 90 25 152 197
> who5 <- who[sample(nrow(who), 30, replace = TRUE), ]
> View(who5)
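
A reproducible sketch of row sampling on `mtcars`, assuming sampling without replacement is wanted. Sampling from `nrow()` draws row positions, which works whatever values an ID column happens to hold:

```r
set.seed(42)                        # arbitrary seed, only for reproducibility
rows <- sample(nrow(mtcars), 10)    # 10 distinct row positions
picked <- mtcars[rows, ]
dim(picked)
```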

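Deleting specific rows and columns

The transcript below appears to have been recorded after mpg had already been removed from mtcars (the final step, mtcars$mpg <- NULL), so the mpg column is missing from every output.
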
> head(mtcars)
cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 6 225 105 2.76 3.460 20.22 1 0 3 1
> head(mtcars[-1:-5, ])
cyl disp hp drat wt qsec vs am gear carb
Valiant 6 225.0 105 2.76 3.46 20.22 1 0 3 1
Duster 360 8 360.0 245 3.21 3.57 15.84 0 0 3 4
Merc 240D 4 146.7 62 3.69 3.19 20.00 1 0 4 2
Merc 230 4 140.8 95 3.92 3.15 22.90 1 0 4 2
Merc 280 6 167.6 123 3.92 3.44 18.30 1 0 4 4
Merc 280C 6 167.6 123 3.92 3.44 18.90 1 0 4 4
> head(mtcars[, -1:-5])
qsec vs am gear carb
Mazda RX4 16.46 0 1 4 4
Mazda RX4 Wag 17.02 0 1 4 4
Datsun 710 18.61 1 1 4 1
Hornet 4 Drive 19.44 1 0 3 1
Hornet Sportabout 17.02 0 0 3 2
Valiant 20.22 1 0 3 1
> mtcars$mpg <- NULL
> head(mtcars)
cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 6 225 105 2.76 3.460 20.22 1 0 3 1

Adding to and combining data frames

Combining into a single data frame

> USArrests
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
...
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6
> state.division
[1] East South Central Pacific Mountain West South Central Pacific Mountain
[7] New England South Atlantic South Atlantic South Atlantic Pacific Mountain ...
[43] West South Central Mountain New England South Atlantic Pacific South Atlantic
[49] East North Central Mountain
9 Levels: New England Middle Atlantic South Atlantic East South Central West South Central ... Pacific
> data.frame(USArrests, state.division)
Murder Assault UrbanPop Rape state.division
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
...
Wisconsin 2.6 53 66 10.8 East North Central
Wyoming 6.8 161 60 15.6 Mountain

cbind() and rbind()

> cbind(USArrests, state.division)
Murder Assault UrbanPop Rape state.division
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
...

Wisconsin 2.6 53 66 10.8 East North Central
Wyoming 6.8 161 60 15.6 Mountain
> data1 <- head(USArrests)
> data2 <- tail(USArrests)
> rbind(data1, data2)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Vermont 2.2 48 32 11.2
Virginia 8.5 156 63 20.7
Washington 4.0 145 73 26.2
West Virginia 5.7 81 39 9.3
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6

Removing duplicate rows

> data1 <- head(USArrests, 30)
> data2 <- tail(USArrests, 30)
> data3 <- rbind(data1, data2)
> length(rownames(data3))
[1] 60
> duplicated(data3)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[22] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> data3[duplicated(data3), ]
Murder Assault UrbanPop Rape
Massachusetts1 4.4 149 85 16.3
Michigan1 12.1 255 74 35.1
Minnesota1 2.7 72 66 14.9
Mississippi1 16.1 259 44 17.1
Missouri1 9.0 178 70 28.2
Montana1 6.0 109 53 16.4
Nebraska1 4.3 102 62 16.5
Nevada1 12.2 252 81 46.0
New Hampshire1 2.1 57 56 9.5
New Jersey1 7.4 159 89 18.8
> data3[!duplicated(data3), ]
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
...
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6
> length(rownames(data3[!duplicated(data3), ]))
[1] 50
> unique(data3)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
...
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6
> length(rownames(unique(data3)))
[1] 50

merge()

> x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
> y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 5:1)
> merge(x, y, by = "k1")
k1 k2.x data.x k2.y data.y
1 4 4 4 4 2
2 5 5 5 5 1
3 NA 1 1 NA 5
4 NA 1 1 3 3
5 NA NA 2 NA 5
6 NA NA 2 3 3
> merge(x, y, by = "k2", incomparables = TRUE)
k2 k1.x data.x k1.y data.y
1 4 4 4 4 2
2 5 5 5 5 1
3 NA NA 2 NA 5
4 NA NA 2 2 4
5 NA 3 3 NA 5
6 NA 3 3 2 4
> merge(x, y, by = "k2", incomparables = NA)
k2 k1.x data.x k1.y data.y
1 4 4 4 4 2
2 5 5 5 5 1
> merge(x, y, by = c("k1", "k2"))
k1 k2 data.x data.y
1 4 4 4 2
2 5 5 5 1
3 NA NA 2 5
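
The examples above are all inner joins. A short sketch of the other join types, using two made-up data frames `left` and `right` keyed on a hypothetical `id` column:

```r
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

inner     <- merge(left, right, by = "id")                # ids in both
left_join <- merge(left, right, by = "id", all.x = TRUE)  # every row of left
full      <- merge(left, right, by = "id", all = TRUE)    # ids from either side

full
```

Rows without a partner are filled with `NA` in the columns that come from the other data frame.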

Modifying data values

t(): transpose

> sractm <- t(mtcars)
> View(sractm)

rev(): reverse a vector

> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> rev(letters)
[1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h" "g" "f" "e" "d" "c" "b" "a"
> women
height weight
1 58 115
2 59 117
...
14 71 159
15 72 164
> rownames(women)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
> rev(rownames(women))
[1] "15" "14" "13" "12" "11" "10" "9" "8" "7" "6" "5" "4" "3" "2" "1"
> women[rev(rownames(women)), ]
height weight
15 72 164
14 71 159
...
2 59 117
1 58 115

transform(): modify or add columns

> transform(women, weight = weight * 2)
height weight
1 58 230
2 59 234
...
14 71 318
15 72 328
> transform(women, cm = height * 2.54)
height weight cm
1 58 115 147.32
2 59 117 149.86
...
14 71 159 180.34
15 72 164 182.88

Sorting

sort()

Sorting vectors

> rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870 906 202 329 290 1000 600 505 1450 840 1243
[26] 890 350 407 286 280 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600 306 390 420 291 710
[51] 340 217 281 352 259 250 470 680 570 350 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
[76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735 233 435 490 310 460 383 375 1270 545 445
[101] 1885 380 300 380 377 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540 1038 424 310 300 444
[126] 301 268 620 215 652 900 525 246 360 529 500 720 270 430 671 1770
> sort(rivers)
[1] 135 202 210 210 215 217 230 230 233 237 246 250 250 250 255 259 260 260 265 268 270 276 280 280 280
[26] 281 286 290 291 300 300 300 301 306 310 310 314 315 320 325 327 329 330 332 336 338 340 350 350 350
[51] 350 352 360 360 360 360 375 377 380 380 383 390 390 392 407 410 411 420 420 424 425 430 431 435 444
[76] 445 450 460 460 465 470 490 500 500 505 524 525 525 529 538 540 545 560 570 600 600 600 605 610 618
[101] 620 625 630 652 671 680 696 710 720 720 730 735 735 760 780 800 840 850 870 890 900 900 906 981 1000
[126] 1038 1054 1100 1171 1205 1243 1270 1306 1450 1459 1770 1885 2315 2348 2533 3710
> month.name
[1] "January" "February" "March" "April" "May" "June" "July" "August" "September" "October"
[11] "November" "December"
> sort(month.name)
[1] "April" "August" "December" "February" "January" "July" "June" "March" "May" "November"
[11] "October" "September"
> rev(sort(rivers))
[1] 3710 2533 2348 2315 1885 1770 1459 1450 1306 1270 1243 1205 1171 1100 1054 1038 1000 981 906 900 900 890 870 850 840
[26] 800 780 760 735 735 730 720 720 710 696 680 671 652 630 625 620 618 610 605 600 600 600 570 560 545
[51] 540 538 529 525 525 524 505 500 500 490 470 465 460 460 450 445 444 435 431 430 425 424 420 420 411
[76] 410 407 392 390 390 383 380 380 377 375 360 360 360 360 352 350 350 350 350 340 338 336 332 330 329
[101] 327 325 320 315 314 310 310 306 301 300 300 300 291 290 286 281 280 280 280 276 270 268 265 260 260
[126] 259 255 250 250 250 246 237 233 230 230 217 215 210 210 202 135
> rev(sort(month.name))
[1] "September" "October" "November" "May" "March" "June" "July" "January" "February" "December"
[11] "August" "April"

Sorting data frames

> mtcars[sort(rownames(mtcars)),]
mpg cyl disp hp drat wt qsec vs am gear carb
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
...
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

order()

order() returns the permutation of indices that sorts its argument

Sorting vectors

> order(rivers)
[1] 8 17 39 108 129 52 36 42 91 117 133 34 56 87 76 55 41 75 37 127 138 107 13 30 72 53 29 19 49 61 103
[32] 124 126 46 94 123 116 14 2 3 35 18 11 65 12 81 51 27 60 78 111 54 43 112 119 134 97 105 102 104 96 33
[63] 47 4 28 73 88 48 110 122 106 139 77 92 125 100 6 74 95 9 57 93 84 136 22 5 31 132 135 113 120 99 62
[94] 59 10 21 45 86 118 80 128 64 40 130 140 58 85 50 32 137 44 1 90 79 71 109 24 38 15 26 63 131 16 82
[125] 20 121 89 114 67 115 25 98 83 23 7 141 101 69 66 70 68

Sorting data frames

> mtcars[order(mtcars$mpg), ]
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
...
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
> mtcars[order(-mtcars$mpg), ]
mpg cyl disp hp drat wt qsec vs am gear carb
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
...
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4

Sorting by multiple keys

> mtcars[order(mtcars$mpg, mtcars$disp), ]
mpg cyl disp hp drat wt qsec vs am gear carb
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
...
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
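
Keys can also be sorted in different directions. A minimal sketch, assuming numeric keys so that negation can be used to reverse one of them:

```r
# cyl descending (negated), then mpg ascending within each cyl group.
sorted <- mtcars[order(-mtcars$cyl, mtcars$mpg), c("cyl", "mpg")]
head(sorted, 3)
```

For non-numeric keys, negation does not apply; the `decreasing` argument of `order()` is one alternative.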

Mathematical calculations

> WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 1815 1646 89 555
1956 60423 29990 4708 2568 2366 1411 733
1957 64721 32510 5230 2695 2526 1546 773
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
> worldphones <- as.data.frame(WorldPhones)
> rs <- rowSums(worldphones)
> rs
1951 1956 1957 1958 1959 1960 1961
74494 102199 110001 118399 124801 133709 141700
> cm <- colMeans(worldphones)
> cm
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
66747.5714 34343.4286 6229.2857 2772.2857 2625.0000 1484.0000 841.7143
> total <- cbind(worldphones, Total = rs)
> total
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer Total
1951 45939 21574 2876 1815 1646 89 555 74494
1956 60423 29990 4708 2568 2366 1411 733 102199
1957 64721 32510 5230 2695 2526 1546 773 110001
1958 68484 35218 6662 2845 2691 1663 836 118399
1959 71799 37598 6856 3000 2868 1769 911 124801
1960 76036 40341 8220 3145 3054 1905 1008 133709
1961 79831 43173 9053 3338 3224 2005 1076 141700
> rbind(total, Mean = cm)
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer Total
1951 45939.00 21574.00 2876.000 1815.000 1646 89 555.0000 74494.00
1956 60423.00 29990.00 4708.000 2568.000 2366 1411 733.0000 102199.00
1957 64721.00 32510.00 5230.000 2695.000 2526 1546 773.0000 110001.00
1958 68484.00 35218.00 6662.000 2845.000 2691 1663 836.0000 118399.00
1959 71799.00 37598.00 6856.000 3000.000 2868 1769 911.0000 124801.00
1960 76036.00 40341.00 8220.000 3145.000 3054 1905 1008.0000 133709.00
1961 79831.00 43173.00 9053.000 3338.000 3224 2005 1076.0000 141700.00
Mean 66747.57 34343.43 6229.286 2772.286 2625 1484 841.7143 66747.57
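Warning message:
In rbind(deparse.level, ...) :
number of columns of result, 8, is not a multiple of vector length 7 of arg 2

The warning arises because cm holds 7 column means while total has 8 columns. R recycles the first value, so the Total entry in the Mean row (66747.57) repeats the N.Amer mean and is not the mean of the Total column. Computing the means from total itself, for example with colMeans(total), avoids the problem.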
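
One way to build the summary row without recycling, sketched here on the same data: compute the column means after the `Total` column has been added, so the vector has as many entries as the data frame has columns. The name `with_mean` is made up for this example.

```r
worldphones <- as.data.frame(WorldPhones)
total <- cbind(worldphones, Total = rowSums(worldphones))

# colMeans(total) has 8 entries, matching the 8 columns of `total`.
with_mean <- rbind(total, Mean = colMeans(total))
tail(with_mean, 2)
```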

apply(): apply a function over the rows or columns of an array or data frame

> apply(worldphones, MARGIN = 1, FUN = sum)
1951 1956 1957 1958 1959 1960 1961
74494 102199 110001 118399 124801 133709 141700
> apply(worldphones, MARGIN = 2, FUN = mean)
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
66747.5714 34343.4286 6229.2857 2772.2857 2625.0000 1484.0000 841.7143
> apply(worldphones, MARGIN = 2, FUN = var)
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
127181160.6 51776902.0 4512287.6 246698.6 273595.0 419524.3 31019.9
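
`FUN` does not have to be a built-in. A small sketch with an anonymous function, plus a check that `apply()` with `sum` agrees with `rowSums()`:

```r
m <- as.matrix(WorldPhones)

# Spread between the largest and smallest region in each year (MARGIN = 1: rows).
row_range <- apply(m, MARGIN = 1, FUN = function(r) max(r) - min(r))
row_range

# apply(..., sum) over rows gives the same result as the dedicated rowSums().
all.equal(apply(m, 1, sum), rowSums(m))
```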

lapply(): apply a function over a list and return a list

> state.center
$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573 -74.9841 -81.6850 -83.3736 -126.2500 -113.9300
[13] -89.3776 -86.0808 -93.3714 -98.1156 -84.7674 -92.2724 -68.9801 -76.6459 -71.5800 -84.6870 -94.6043 -89.8065
[25] -92.5137 -109.3200 -99.5898 -116.8510 -71.3924 -74.2336 -105.9420 -75.1449 -78.4686 -100.0990 -82.5963 -97.1239
[37] -120.0680 -77.4500 -71.1244 -80.5056 -99.7238 -86.4560 -98.7857 -111.3300 -72.5450 -78.2005 -119.7460 -80.6665
[49] -89.9941 -107.2560

$y
[1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928 38.6777 27.8744 32.3329 31.7500 43.5648 40.0495 40.0495 41.9358
[16] 38.4204 37.3915 30.6181 45.6226 39.2778 42.3645 43.1361 46.3943 32.6758 38.3347 46.8230 41.3356 39.1063 43.3934 39.9637
[31] 34.4764 43.1361 35.4195 47.2517 40.2210 35.5053 43.9078 40.9069 41.5928 33.6190 44.3365 35.6767 31.3897 39.1063 44.2508
[46] 37.5630 47.4231 38.4204 44.5937 43.0504

> lapply(state.center, FUN = length)
$x
[1] 50

$y
[1] 50

> class(lapply(state.center, FUN = length))
[1] "list"

sapply(): apply a function over a list and simplify the result to a vector or matrix

> sapply(state.center, FUN = length)
x y
50 50
> class(sapply(state.center, FUN = length))
[1] "integer"

tapply(): apply a function to groups defined by a factor

> state.name
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut"
...
[43] "Texas" "Utah" "Vermont" "Virginia" "Washington" "West Virginia" "Wisconsin"
[50] "Wyoming"
> state.division
[1] East South Central Pacific Mountain West South Central Pacific Mountain
...
[43] West South Central Mountain New England South Atlantic Pacific South Atlantic
[49] East North Central Mountain
9 Levels: New England Middle Atlantic South Atlantic East South Central West South Central ... Pacific
> tapply(state.name, state.division, FUN = length)
New England Middle Atlantic South Atlantic East South Central West South Central East North Central
6 3 8 4 4 5
West North Central Mountain Pacific
7 8 5
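
`tapply()` is not limited to counting. A brief sketch that sums a numeric column by group, and cross-checks the counts above against `table()`:

```r
# Total population per region (Population is a column of the state.x77 matrix).
pop_by_region <- tapply(state.x77[, "Population"], state.region, FUN = sum)
pop_by_region

# Counting group members with tapply() matches table().
counts <- tapply(state.name, state.division, FUN = length)
identical(as.vector(counts), as.vector(table(state.division)))
```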

Data standardization

$$x^* = \frac{x - \bar{x}}{\sigma}$$

> heatmap(as.matrix(mtcars))

Heat map of the unstandardized data

> x <- scale(mtcars, center = TRUE, scale = TRUE)
> heatmap(x)

Heat map of the standardized data
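
A quick check of what `scale()` computes, as a sketch: each column is centered on its mean and divided by its sample standard deviation (the `sd()` value, with n - 1 in the denominator), so every scaled column ends up with mean 0 and standard deviation 1.

```r
x <- scale(mtcars, center = TRUE, scale = TRUE)

# Reproduce one column by hand with the standardization formula.
manual <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
all.equal(as.vector(x[, "mpg"]), manual)

# Column means are numerically zero; column standard deviations are one.
round(colMeans(x), 10)
apply(x, 2, sd)
```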

The reshape2 package

Database-like reshaping operations

> install.packages("reshape2")
> library(reshape2)
> airquality
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
...
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
> names(airquality) <- tolower(names(airquality))
> head(airquality)
ozone solar.r wind temp month day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6

melt()

> melt(airquality)
No id variables; using all as measure variables
variable value
1 ozone 41.0
2 ozone 36.0
...
152 ozone 18.0
153 ozone 20.0
154 solar.r 190.0
155 solar.r 118.0
...
305 solar.r 131.0
306 solar.r 223.0
...
...
460 temp 67.0
461 temp 72.0
...
499 temp 90.0
500 temp 87.0
[ reached 'max' / getOption("max.print") -- omitted 418 rows ]
> aql <- melt(airquality, id.vars = c("month", "day"))
> aql
month day variable value
1 5 1 ozone 41
2 5 2 ozone 36
...
29 5 29 ozone 45
30 5 30 ozone 115
31 5 31 ozone 37
32 6 1 ozone NA
33 6 2 ozone NA
...
60 6 29 ozone NA
61 6 30 ozone NA
...
...
124 9 1 ozone 96
125 9 2 ozone 78
...
152 9 29 ozone 18
153 9 30 ozone 20
154 5 1 solar.r 190
155 5 2 solar.r 118
...
183 5 30 solar.r 223
184 5 31 solar.r 279
185 6 1 solar.r 286
186 6 2 solar.r 287
...
213 6 29 solar.r 31
214 6 30 solar.r 138
...
...
246 8 1 solar.r 83
247 8 2 solar.r 24
248 8 3 solar.r 77
249 8 4 solar.r NA
250 8 5 solar.r NA
[ reached 'max' / getOption("max.print") -- omitted 362 rows ]
> aqw <- dcast(aql, month + day ~ variable)
> aqw
month day ozone solar.r wind temp
1 5 1 41 190 7.4 67
2 5 2 36 118 8.0 72
...
30 5 30 115 223 5.7 79
31 5 31 37 279 7.4 76
32 6 1 NA 286 8.6 78
33 6 2 NA 287 9.7 74
...
60 6 29 NA 31 14.9 77
61 6 30 NA 138 8.0 83
...
...
124 9 1 96 167 6.9 91
125 9 2 78 197 5.1 92
...
152 9 29 18 131 8.0 76
153 9 30 20 223 11.5 68

dcast()

Works like a pivot table

> aqw2 <- dcast(aql, month ~ variable, fun.aggregate = mean, na.rm = TRUE)
> aqw2
month ozone solar.r wind temp
1 5 23.61538 181.2963 11.622581 65.54839
2 6 29.44444 190.1667 10.266667 79.10000
3 7 59.11538 216.4839 8.941935 83.90323
4 8 59.96154 171.8571 8.793548 83.96774
5 9 31.44828 167.4333 10.180000 76.90000
> aqw2 <- dcast(aql, month ~ variable, fun.aggregate = sum, na.rm = TRUE)
> aqw2
month ozone solar.r wind temp
1 5 614 4895 360.3 2032
2 6 265 5705 308.0 2373
3 7 1537 6711 277.2 2601
4 8 1559 4812 272.6 2603
5 9 912 5023 305.4 2307
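
For comparison, the monthly means can also be computed in base R without reshaping. This is only a cross-check sketch using `aggregate()`, not part of reshape2; the names `aq` and `monthly` are made up here:

```r
aq <- airquality
names(aq) <- tolower(names(aq))

# Group the four measurement columns by month; na.rm = TRUE is passed to mean()
# so missing values are dropped per column, as in the dcast() call.
monthly <- aggregate(aq[, c("ozone", "solar.r", "wind", "temp")],
                     by = list(month = aq$month),
                     FUN = mean, na.rm = TRUE)
monthly
```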

The tidyr package

> install.packages(c("tidyr", "dplyr"))
> library(tidyr)
> library(dplyr)

gather()

> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> tdata <- mtcars[1:10, 1:3]
> tdata <- data.frame(names = rownames(tdata), tdata)
> tdata
names mpg cyl disp
Mazda RX4 Mazda RX4 21.0 6 160.0
Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160.0
...
Merc 230 Merc 230 22.8 4 140.8
Merc 280 Merc 280 19.2 6 167.6
> gather(tdata, key = "Key", value = "Value", cyl, disp, mpg)
names Key Value
1 Mazda RX4 cyl 6.0
2 Mazda RX4 Wag cyl 6.0
...
9 Merc 230 cyl 4.0
10 Merc 280 cyl 6.0
11 Mazda RX4 disp 160.0
12 Mazda RX4 Wag disp 160.0
...
19 Merc 230 disp 140.8
20 Merc 280 disp 167.6
21 Mazda RX4 mpg 21.0
22 Mazda RX4 Wag mpg 21.0
...
29 Merc 230 mpg 22.8
30 Merc 280 mpg 19.2
> gather(tdata, key = "Key", value = "Value", cyl: disp, mpg)
names Key Value
1 Mazda RX4 cyl 6.0
2 Mazda RX4 Wag cyl 6.0
...
9 Merc 230 cyl 4.0
10 Merc 280 cyl 6.0
11 Mazda RX4 disp 160.0
12 Mazda RX4 Wag disp 160.0
...
19 Merc 230 disp 140.8
20 Merc 280 disp 167.6
21 Mazda RX4 mpg 21.0
22 Mazda RX4 Wag mpg 21.0
...
29 Merc 230 mpg 22.8
30 Merc 280 mpg 19.2
> gather(tdata, key = "Key", value = "Value", cyl, -disp)
names mpg disp Key Value
1 Mazda RX4 21.0 160.0 cyl 6
2 Mazda RX4 Wag 21.0 160.0 cyl 6
...
9 Merc 230 22.8 140.8 cyl 4
10 Merc 280 19.2 167.6 cyl 6
> gather(tdata, key = "Key", value = "Value", 2:4)
names Key Value
1 Mazda RX4 mpg 21.0
2 Mazda RX4 Wag mpg 21.0
...
9 Merc 230 mpg 22.8
10 Merc 280 mpg 19.2
11 Mazda RX4 cyl 6.0
...
19 Merc 230 cyl 4.0
20 Merc 280 cyl 6.0
21 Mazda RX4 disp 160.0
22 Mazda RX4 Wag disp 160.0
...
29 Merc 230 disp 140.8
30 Merc 280 disp 167.6

spread()

> gdata <- gather(tdata, key = "Key", value = "Value", 2:4)
> spread(gdata, key = "Key", value = "Value")
names cyl disp mpg
1 Datsun 710 4 108.0 22.8
2 Duster 360 8 360.0 14.3
3 Hornet 4 Drive 6 258.0 21.4
4 Hornet Sportabout 8 360.0 18.7
5 Mazda RX4 6 160.0 21.0
6 Mazda RX4 Wag 6 160.0 21.0
7 Merc 230 4 140.8 22.8
8 Merc 240D 4 146.7 24.4
9 Merc 280 6 167.6 19.2
10 Valiant 6 225.0 18.1

separate()

> df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
> df
x
1 <NA>
2 a.b
3 a.d
4 b.c
> separate(df, col = x, into = c("A", "B"))
A B
1 <NA> <NA>
2 a b
3 a d
4 b c
> df <- data.frame(x = c(NA, "a.b-c", "a-d", "a-b.c"))
> df
x
1 <NA>
2 a.b-c
3 a-d
4 a-b.c
> separate(df, col = x, into = c("A", "B"), sep = "-")
A B
1 <NA> <NA>
2 a.b c
3 a d
4 a b.c

unite()

> x <- separate(df, col = x, into = c("A", "B"), sep = "-")
> unite(x, col = "AB", A, B, sep = "-")
AB
1 NA-NA
2 a.b-c
3 a-d
4 a-b.c
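
What the `sep = "-"` split and the re-join do can be sketched in base R with `strsplit()` and `paste()`. This is an illustration of the idea, not how tidyr implements it. Without an explicit `sep`, `separate()` splits on any run of non-alphanumeric characters, which is why the first example above split on the dot.

```r
x <- c("a.b-c", "a-d", "a-b.c")

# Split each string on a literal "-" and pick the first and second piece.
parts <- strsplit(x, "-", fixed = TRUE)
A <- vapply(parts, `[`, character(1), 1)
B <- vapply(parts, `[`, character(1), 2)

# Joining the pieces with the same separator restores the original strings.
rejoined <- paste(A, B, sep = "-")
identical(rejoined, x)
```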

The dplyr package

> install.packages("dplyr")
> library(dplyr)

dplyr::filter()

Filter rows by condition

> dplyr::filter(iris, Sepal.Length > 7)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.1 3.0 5.9 2.1 virginica
2 7.6 3.0 6.6 2.1 virginica
3 7.3 2.9 6.3 1.8 virginica
4 7.2 3.6 6.1 2.5 virginica
5 7.7 3.8 6.7 2.2 virginica
6 7.7 2.6 6.9 2.3 virginica
7 7.7 2.8 6.7 2.0 virginica
8 7.2 3.2 6.0 1.8 virginica
9 7.2 3.0 5.8 1.6 virginica
10 7.4 2.8 6.1 1.9 virginica
11 7.9 3.8 6.4 2.0 virginica
12 7.7 3.0 6.1 2.3 virginica

dplyr::distinct()

Remove duplicate rows

> rbind(iris[1:10, ], iris[1:15, ])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
...
24 4.3 3.0 1.1 0.1 setosa
25 5.8 4.0 1.2 0.2 setosa
> dplyr::distinct(rbind(iris[1:10, ], iris[1:15, ]))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
...
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa

dplyr::slice()

Select rows by position

> dplyr::slice(iris, 10:15)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.1 1.5 0.1 setosa
2 5.4 3.7 1.5 0.2 setosa
3 4.8 3.4 1.6 0.2 setosa
4 4.8 3.0 1.4 0.1 setosa
5 4.3 3.0 1.1 0.1 setosa
6 5.8 4.0 1.2 0.2 setosa

dplyr::sample_n() and dplyr::sample_frac()

Sample a fixed number of rows

> dplyr::sample_n(iris, 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.4 2.8 6.1 1.9 virginica
2 5.1 3.7 1.5 0.4 setosa
3 6.4 2.7 5.3 1.9 virginica
4 4.9 3.1 1.5 0.2 setosa
5 7.1 3.0 5.9 2.1 virginica
6 7.2 3.6 6.1 2.5 virginica
7 4.4 2.9 1.4 0.2 setosa
8 6.5 2.8 4.6 1.5 versicolor
9 6.3 2.5 5.0 1.9 virginica
10 6.4 3.1 5.5 1.8 virginica

Sample a fraction of rows

> dplyr::sample_frac(iris, 0.1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.2 3.0 5.8 1.6 virginica
2 6.6 2.9 4.6 1.3 versicolor
3 5.2 4.1 1.5 0.1 setosa
4 5.4 3.9 1.7 0.4 setosa
5 6.9 3.1 5.1 2.3 virginica
6 6.8 3.2 5.9 2.3 virginica
7 6.1 2.9 4.7 1.4 versicolor
8 5.1 3.8 1.5 0.3 setosa
9 6.6 3.0 4.4 1.4 versicolor
10 5.1 3.8 1.9 0.4 setosa
11 7.7 2.6 6.9 2.3 virginica
12 4.9 2.5 4.5 1.7 virginica
13 5.7 2.9 4.2 1.3 versicolor
14 4.6 3.6 1.0 0.2 setosa
15 5.4 3.0 4.5 1.5 versicolor

dplyr::arrange()

Sort by a column

> dplyr::arrange(iris, Sepal.Length)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
...
149 7.7 3.0 6.1 2.3 virginica
150 7.9 3.8 6.4 2.0 virginica

Wrap the column in desc() to sort in descending order

> dplyr::arrange(iris, desc(Sepal.Length))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.9 3.8 6.4 2.0 virginica
2 7.7 3.8 6.7 2.2 virginica
...
149 4.4 3.2 1.3 0.2 setosa
150 4.3 3.0 1.1 0.1 setosa
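
arrange() also accepts several columns, with later columns breaking ties in earlier ones. The sketch below sorts by species and then by descending sepal length within each species; the variable name sorted is illustrative.

```r
library(dplyr)

# Species ascending (factor level order), then Sepal.Length descending
sorted <- arrange(iris, Species, desc(Sepal.Length))

# The first rows are the setosa flowers with the longest sepals
head(sorted, 3)
```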

dplyr::summarise()

> summarise(iris, avg = mean(Sepal.Length))
avg
1 5.843333
> summarise(iris, sum = sum(Sepal.Length))
sum
1 876.5
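
Several statistics can be computed in a single summarise() call, each becoming one column of a one-row result. A small sketch, with illustrative column names (n, avg, sd_len, max_len):

```r
library(dplyr)

# One row, one column per summary statistic
stats <- summarise(iris,
  n       = n(),                 # number of rows
  avg     = mean(Sepal.Length),
  sd_len  = sd(Sepal.Length),
  max_len = max(Sepal.Length)
)
stats
```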

Pipe operator %>%

> head(mtcars, 20)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
> head(mtcars, 20) %>% tail(10)
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
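
The pipe passes the left-hand value as the first argument of the right-hand call, so x %>% f(y) is the same as f(x, y). The sketch below checks that the piped form above matches the nested call; the variable names piped and nested are illustrative. (R 4.1 and later also provide a native pipe, |>, that behaves similarly for simple cases like this.)

```r
library(dplyr)

# Piped form: read left to right
piped <- head(mtcars, 20) %>% tail(10)

# Equivalent nested form: read inside out
nested <- tail(head(mtcars, 20), 10)

identical(piped, nested)
```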

dplyr::group_by(): grouping

> dplyr::group_by(iris, Species)
# A tibble: 150 × 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
> iris %>% group_by(Species)
# A tibble: 150 × 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
> iris %>% group_by(Species) %>% summarise()
# A tibble: 3 × 1
Species
<fct>
1 setosa
2 versicolor
3 virginica
> iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width))
# A tibble: 3 × 2
Species avg
<fct> <dbl>
1 setosa 3.43
2 versicolor 2.77
3 virginica 2.97
> iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width)) %>% arrange(avg)
# A tibble: 3 × 2
Species avg
<fct> <dbl>
1 versicolor 2.77
2 virginica 2.97
3 setosa 3.43
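
As a sketch extending the grouped summaries above, several per-group statistics can be computed in one call, with n() counting the rows of each group. The column names n, avg_width, and max_length are illustrative.

```r
library(dplyr)

by_species <- iris %>%
  group_by(Species) %>%
  summarise(
    n          = n(),               # rows per species
    avg_width  = mean(Sepal.Width),
    max_length = max(Sepal.Length)
  )
by_species
```

With a single grouping variable, the result of summarise() is no longer grouped, so later steps such as arrange() operate on the whole table.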

dplyr::mutate()

Add a new variable

> dplyr::mutate(iris, new = Sepal.Length + Petal.Length)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
1 5.1 3.5 1.4 0.2 setosa 6.5
2 4.9 3.0 1.4 0.2 setosa 6.3
...
149 6.2 3.4 5.4 2.3 virginica 11.6
150 5.9 3.0 5.1 1.8 virginica 11.0
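
mutate() can create several variables at once, and a later expression may refer to a variable created earlier in the same call. A minimal sketch; the column names sepal_ratio and long_sepal and the threshold 2 are made up for illustration.

```r
library(dplyr)

out <- mutate(iris,
  sepal_ratio = Sepal.Length / Sepal.Width,  # new numeric column
  long_sepal  = sepal_ratio > 2              # uses the column defined just above
)
head(out, 3)

# The original data frame is left unchanged
ncol(iris)
ncol(out)
```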

Join functions

> a <- data.frame(x1 = c("A", "B", "C"), x2 = c(1, 2, 3))
> b <- data.frame(x1 = c("A", "B", "D"), x3 = c(T, F, T))
> a
x1 x2
1 A 1
2 B 2
3 C 3
> b
x1 x3
1 A TRUE
2 B FALSE
3 D TRUE
> dplyr::left_join(a, b, by = "x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 C 3 NA
> dplyr::right_join(a, b, by = "x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 D NA TRUE
> dplyr::full_join(a, b, by = "x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 C 3 NA
4 D NA TRUE
> dplyr::semi_join(a, b, by = "x1")
x1 x2
1 A 1
2 B 2
> dplyr::anti_join(a, b, by = "x1")
x1 x2
1 C 3
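
The list above does not include inner_join(), which keeps only the keys present in both tables. A minimal sketch using the same two data frames, rebuilt here so the example runs on its own:

```r
library(dplyr)

a <- data.frame(x1 = c("A", "B", "C"), x2 = c(1, 2, 3))
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE))

# Only keys found in both a and b survive: A and B
inner <- inner_join(a, b, by = "x1")
inner
```

Compared with semi_join(), which also keeps rows A and B, inner_join() additionally brings in the columns of b.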

Intersection, union, and set difference

> mtcars <- mutate(mtcars, Model = rownames(mtcars))
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb Model
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 280C
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SE
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SL
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Merc 450SLC
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Cadillac Fleetwood
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Lincoln Continental
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Chrysler Imperial
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Fiat 128
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda Civic
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 Dodge Challenger
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 AMC Javelin
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Camaro Z28
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Pontiac Firebird
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Fiat X1-9
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Porsche 914-2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Lotus Europa
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ford Pantera L
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari Dino
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Maserati Bora
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 Volvo 142E
> first <- slice(mtcars, 1:20)
> second <- slice(mtcars, 10:30)

Intersection

> intersect(first, second)
mpg cyl disp hp drat wt qsec vs am gear carb Model
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 280C
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SE
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SL
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Merc 450SLC
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Cadillac Fleetwood
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Lincoln Continental
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Chrysler Imperial
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Fiat 128
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda Civic
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla

Union keeping duplicates (union_all)

> dplyr::union_all(first, second)
mpg cyl disp hp drat wt qsec vs am gear carb Model
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
...
Honda Civic...19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda Civic
Toyota Corolla...20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
Merc 280...21 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280
Merc 280C...22 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 280C
...
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ford Pantera L
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari Dino

Union without duplicates (union)

> dplyr::union(first, second)
mpg cyl disp hp drat wt qsec vs am gear carb Model
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
...
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ford Pantera L
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari Dino

Set difference

> setdiff(first, second)
mpg cyl disp hp drat wt qsec vs am gear carb Model
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230
> setdiff(second, first)
mpg cyl disp hp drat wt qsec vs am gear carb Model
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 Dodge Challenger
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 AMC Javelin
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Camaro Z28
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Pontiac Firebird
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Fiat X1-9
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Porsche 914-2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Lotus Europa
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ford Pantera L
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari Dino
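
As a quick sanity check on the four set operations, the row counts can be compared. The sketch below rebuilds the two slices under a different name (mt, an illustrative variable) instead of overwriting mtcars. Rows 10 to 20 appear in both slices, so 11 rows overlap.

```r
library(dplyr)

mt     <- mutate(mtcars, Model = rownames(mtcars))
first  <- slice(mt, 1:20)    # 20 rows
second <- slice(mt, 10:30)   # 21 rows

n_all   <- nrow(dplyr::union_all(first, second))  # duplicates kept: 20 + 21
n_union <- nrow(dplyr::union(first, second))      # duplicates removed
n_inter <- nrow(dplyr::intersect(first, second))  # rows in both
n_diff  <- nrow(dplyr::setdiff(first, second))    # rows only in first

c(n_all, n_union, n_inter, n_diff)

# union_all counts the overlap twice, union counts it once
n_all == n_union + n_inter
```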

Data analysis