Subsets

DataArrays

The DataArray type is designed to behave like a standard Julia Array and aims to follow the same indexing rules, while also allowing any element to be set to NA:

One-dimensional DataArray:

julia> using DataArrays

julia> dv = data([1, 2, 3])
3-element DataArray{Int64,1}:
 1
 2
 3

julia> dv[1]
1

julia> dv[2] = NA
NA

julia> dv[2]
NA

Two-dimensional DataArray:

julia> using DataArrays

julia> dm = data([1 2; 3 4])
2x2 DataArray{Int64,2}:
 1  2
 3  4

julia> dm[1, 1]
1

julia> dm[2, 1] = NA
NA

julia> dm[2, 1]
NA
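
For comparison, the same lookups and assignments can be written against plain Julia arrays. This is only a minimal sketch using Base Julia (no DataArrays); the names `v` and `m` are illustrative stand-ins for `dv` and `dm`, and the replacement values are arbitrary integers chosen for the example.

```julia
# Plain Julia arrays with the same contents as dv and dm above.
v = [1, 2, 3]
m = [1 2; 3 4]

# Lookup: one index for a vector, (row, column) for a matrix.
println(v[1])
println(m[2, 1])

# Assignment uses the same bracket syntax as lookup.
v[2] = 20
m[2, 1] = 30
println(v)
println(m)
```

The bracket syntax is the same in both cases; what the DataArray transcripts above add is that the assigned value can be NA.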

DataFrames

In contrast, a DataFrame offers substantially more forms of indexing because columns can be referred to by name:

julia> using DataFrames

julia> df = DataFrame(A = 1:10, B = 2:2:20)
10x2 DataFrame
| Row | A  | B  |
|-----|----|----|
| 1   | 1  | 2  |
| 2   | 2  | 4  |
| 3   | 3  | 6  |
| 4   | 4  | 8  |
| 5   | 5  | 10 |
| 6   | 6  | 12 |
| 7   | 7  | 14 |
| 8   | 8  | 16 |
| 9   | 9  | 18 |
| 10  | 10 | 20 |

Referring to the first column by index or name:

julia> df[1]
10-element DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> df[:A]
10-element DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

Referring to the first element of the first column:

julia> df[1, 1]
1

julia> df[1, :A]
1

Selecting a subset of rows by index and an (ordered) subset of columns by name:

julia> df[1:3, [:A, :B]]
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 1 | 2 |
| 2   | 2 | 4 |
| 3   | 3 | 6 |

julia> df[1:3, [:B, :A]]
3x2 DataFrame
| Row | B | A |
|-----|---|---|
| 1   | 2 | 1 |
| 2   | 4 | 2 |
| 3   | 6 | 3 |

Selecting a subset of rows by using a condition:

julia> df[df[:A] % 2 .== 0, :]
5x2 DataFrame
| Row | A  | B  |
|-----|----|----|
| 1   | 2  | 4  |
| 2   | 4  | 8  |
| 3   | 6  | 12 |
| 4   | 8  | 16 |
| 5   | 10 | 20 |

julia> df[df[:B] % 2 .== 0, :]
10x2 DataFrame
| Row | A  | B  |
|-----|----|----|
| 1   | 1  | 2  |
| 2   | 2  | 4  |
| 3   | 3  | 6  |
| 4   | 4  | 8  |
| 5   | 5  | 10 |
| 6   | 6  | 12 |
| 7   | 7  | 14 |
| 8   | 8  | 16 |
| 9   | 9  | 18 |
| 10  | 10 | 20 |
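
The two results above differ because the condition is evaluated to one Boolean per row before any rows are selected. The sketch below illustrates that step with plain Base Julia vectors rather than a DataFrame. It assumes a recent Julia version, where the elementwise remainder is written with the broadcast form `.%` instead of the `%` shown in the transcripts; the names `A`, `B`, `mask_a`, and `mask_b` are illustrative only.

```julia
# Plain vectors standing in for columns A and B of df.
A = collect(1:10)
B = collect(2:2:20)

# Elementwise remainder and comparison yield one Bool per row.
mask_a = (A .% 2) .== 0
mask_b = (B .% 2) .== 0

# Indexing with a Bool vector keeps only the positions marked true.
println(A[mask_a])       # the even values of A
println(count(mask_a))   # number of rows the first condition keeps
println(count(mask_b))   # every value of B is even, so all rows are kept
```

The first mask is true for five of the ten rows, while the second is true for all ten, which matches the 5x2 and 10x2 results shown above.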