Skip to content
This repository has been archived by the owner on Apr 17, 2022. It is now read-only.

Latest commit

 

History

History
514 lines (447 loc) · 24.8 KB

README.md

File metadata and controls

514 lines (447 loc) · 24.8 KB

All functionalities of this repository plus more features has been implemented in a new package called InMemoryDatasets.jl

DFTranspose.jl

DFTranspose.jl is a Julia package to explore the implementation of a new function for reshaping DataFrames.

See JuliaData/DataFrames.jl#2732 (comment)

Installation

run the following code inside a Julia session.

julia> using Pkg
julia> Pkg.add(url = "https://github.com/sl-solution/DFTranspose.jl")

Introduction

The DFTranspose.jl package only exports one function, df_transpose, which can be used to reshape a DataFrame.

   df_transpose(df, cols [, groupbycols]; kwargs...)

In its simplest form df_transpose transposes the specified columns, cols, of a DataFrame, and attaches a new column to the output data frame to track the original names of the transposed variables. The label of this column can be controlled by variable_name keyword.

Simple Transposing

When an id variable is specified, df_transpose transpose the data as above, however, it uses the values of the id variable to label the columns of the output data frame. renamecolid can be used to modify these labels on fly. When there are multiple id variables, the default renamecolid function assigns each label by a Tuple of the values of those variables (see examples).

When a set of groupby variables, groupbycols, are specified, the df_transpose function repeats the simple transposing of data within each group constructed by groupby variables. Like the simplest case, an id variable can be used to label the columns of the output data frame.

The following behaviours may change in future

  • The order of the output data frame is based on the order returned by groupby function.
  • The order of columns labels is based on their order of appearance in the original data.
  • Currently if id value(s) is repeated within a group, df_transpose throws an error.
  • Missing values can be a group level or a value for the id variable. They will be treated as a category.

Groupby Transposing

Examples

Basic usage

julia> df = DataFrame(x1 = [1,2,3,4], x2 = [1,4,9,16])
 4×2 DataFrame
 Row │ x1     x2    
     │ Int64  Int64
─────┼──────────────
   11      1
   22      4
   33      9
   44     16

julia> df_transpose(df, [:x1,:x2])
2×5 DataFrame
 Row │ _variables_  _c1    _c2    _c3    _c4   
     │ String       Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────
   1 │ x1               1      2      3      4
   2 │ x2               1      4      9     16

Specifying ID variable

julia> df = DataFrame(id = ["r1", "r2", "r3" , "r4"], x1 = [1,2,3,4], x2 = [1,4,9,16])
4×3 DataFrame
 Row │ id      x1     x2    
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ r1          1      1
   2 │ r2          2      4
   3 │ r3          3      9
   4 │ r4          4     16

julia> df_transpose(df, [:x1,:x2], id = :id)
2×5 DataFrame
 Row │ _variables_  r1     r2     r3     r4    
     │ String       Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────
   1 │ x1               1      2      3      4
   2 │ x2               1      4      9     16

Specifying groupby variables

julia> df = DataFrame(group = repeat(1:3, inner = 2),
                             b = repeat(1:2, inner = 3),
                             c = repeat(1:1, inner = 6),
                             d = repeat(1:6, inner = 1),
                             e = string.('a':'f'))
6×5 DataFrame
Row │ group  b      c      d      e      
    │ Int64  Int64  Int64  Int64  String
────┼────────────────────────────────────
  11      1      1      1  a
  21      1      1      2  b
  32      1      1      3  c
  42      2      1      4  d
  53      2      1      5  e
  63      2      1      6  f

julia> df_transpose(df, 2:4, [:group])
9×4 DataFrame
 Row │ group  _variables_  _c1     _c2    
     │ Int64  String       Int64?  Int64?
─────┼────────────────────────────────────
   11  b                 1       1
   21  c                 1       1
   31  d                 1       2
   42  b                 1       2
   52  c                 1       1
   62  d                 3       4
   73  b                 2       2
   83  c                 1       1
   93  d                 5       6

julia> df_transpose(df, 2:4, :group, id = :e)
9×8 DataFrame
 Row │ group  _variables_  a        b        c        d        e        f       
     │ Int64  String       Int64?   Int64?   Int64?   Int64?   Int64?   Int64?  
─────┼──────────────────────────────────────────────────────────────────────────
   11  b                  1        1  missing  missing  missing  missing
   21  c                  1        1  missing  missing  missing  missing
   31  d                  1        2  missing  missing  missing  missing
   42  b            missing  missing        1        2  missing  missing
   52  c            missing  missing        1        1  missing  missing
   62  d            missing  missing        3        4  missing  missing
   73  b            missing  missing  missing  missing        2        2
   83  c            missing  missing  missing  missing        1        1
   93  d            missing  missing  missing  missing        5        6

julia> df_transpose(df, 2:4, :group, id = :e, default_fill = 0)
9×8 DataFrame
 Row │ group  _variables_  a      b      c      d      e      f
     │ Int64  String       Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────
   11  b                1      1      0      0      0      0
   21  c                1      1      0      0      0      0
   31  d                1      2      0      0      0      0
   42  b                0      0      1      2      0      0
   52  c                0      0      1      1      0      0
   62  d                0      0      3      4      0      0
   73  b                0      0      0      0      2      2
   83  c                0      0      0      0      1      1
   93  d                0      0      0      0      5      6

Advanced usage

julia> pop = DataFrame(country = ["c1","c1","c2","c2","c3","c3"],
                        sex = ["male", "female", "male", "female", "male", "female"],
                        pop_2000 = [100, 120, 150, 155, 170, 190],
                        pop_2010 = [110, 120, 155, 160, 178, 200],
                        pop_2020 = [115, 130, 161, 165, 180, 203])
6×5 DataFrame
 Row │ country  sex     pop_2000  pop_2010  pop_2020
     │ String   String  Int64     Int64     Int64    
─────┼───────────────────────────────────────────────
   1 │ c1       male         100       110       115
   2 │ c1       female       120       120       130
   3 │ c2       male         150       155       161
   4 │ c2       female       155       160       165
   5 │ c3       male         170       178       180
   6 │ c3       female       190       200       203

julia> df_transpose(pop, r"pop_", :country,
                      id = :sex, 
                      variable_name = "year",
                      renamerowid = x -> replace(x, "pop_"=>""),
                      renamecolid = x -> x * "_pop")
9×4 DataFrame
  Row │ country  year       male_pop  female_pop
      │ String   String         Int64?    Int64?     
 ─────┼──────────────────────────────────────────
    1 │ c1       2000            100         120
    2 │ c1       2010            110         120
    3 │ c1       2020            115         130
    4 │ c2       2000            150         155
    5 │ c2       2010            155         160
    6 │ c2       2020            161         165
    7 │ c3       2000            170         190
    8 │ c3       2010            178         200
    9 │ c3       2020            180         203

julia> using Dates
julia> df = DataFrame([[1, 2, 3], [1.1, 2.0, 3.3],[1.1, 2.1, 3.0],[1.1, 2.0, 3.2]]
                    ,[:person, Symbol("11/2020"), Symbol("12/2020"), Symbol("1/2021")])
3×4 DataFrame
Row │ person   11/2020  12/2020  1/2021  
    │ Float64  Float64  Float64  Float64
────┼────────────────────────────────────
  11.0      1.1      1.1      1.1
  22.0      2.0      2.1      2.0
  33.0      3.3      3.0      3.2

julia> df_transpose(df, Not(:person), :person,
                           variable_name = "Date",
                           renamerowid = x -> Date(x, dateformat"m/y"),
                           renamecolid = x -> "measurement")
   9×3 DataFrame
 Row │ person   Date        measurement
     │ Float64  Date        Float64    
─────┼──────────────────────────────────
   11.0  2020-11-01          1.1
   21.0  2020-12-01          1.1
   31.0  2021-01-01          1.1
   42.0  2020-11-01          2.0
   52.0  2020-12-01          2.1
   62.0  2021-01-01          2.0
   73.0  2020-11-01          3.3
   83.0  2020-12-01          3.0
   93.0  2021-01-01          3.2          

julia> df = DataFrame(region = repeat(["North","North","South","South"],2),
             fuel_type = repeat(["gas","coal"],4),
             load = rand(8),
             time = [1,1,1,1,2,2,2,2],
             )
8×4 DataFrame
  Row │ region  fuel_type  load      time  
      │ String  String     Float64   Int64
 ─────┼────────────────────────────────────
    1 │ North   gas        0.877347      1
    2 │ North   coal       0.412013      1
    3 │ South   gas        0.969407      1
    4 │ South   coal       0.641831      1
    5 │ North   gas        0.856583      2
    6 │ North   coal       0.409253      2
    7 │ South   gas        0.235768      2
    8 │ South   coal       0.655087      2

julia> df_transpose(df, :load, :time, id = 1:2)
2×6 DataFrame
 Row │ time   _variables_  ("North", "gas")  ("North", "coal")  ("South", "gas")  ("South", "coal")
     │ Int64  String       Float64?          Float64?           Float64?          Float64?          
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   11  load                 0.877347           0.412013          0.969407           0.641831
   22  load                 0.856583           0.409253          0.235768           0.655087

julia> df = DataFrame(A_2018=1:4, A_2019=5:8, B_2017=9:12,
                        B_2018=9:12, B_2019 = [missing,13,14,15],
                         ID = [1,2,3,4])
4×6 DataFrame
  Row │ A_2018  A_2019  B_2017  B_2018  B_2019   ID    
      │ Int64   Int64   Int64   Int64   Int64?   Int64
 ─────┼────────────────────────────────────────────────
    11       5       9       9  missing      1
    22       6      10      10       13      2
    33       7      11      11       14      3
    44       8      12      12       15      4

julia> f(x) =  match(r"[0-9]+",x).match
julia> dfA = df_transpose(df, r"A", :ID, renamerowid = f, variable_name = "Year", renamecolid = x->"A");
julia> dfB = df_transpose(df, r"B", :ID, renamerowid = f, variable_name = "Year", renamecolid = x->"B");
julia> outerjoin(dfA, dfB, on = [:ID, :Year])
12×4 DataFrame
 Row │ ID     Year       A        B       
     │ Int64  SubStrin  Int64?   Int64?  
─────┼────────────────────────────────────
   11  2018             1        9
   21  2019             5  missing
   32  2018             2       10
   42  2019             6       13
   53  2018             3       11
   63  2019             7       14
   74  2018             4       12
   84  2019             8       15
   91  2017       missing        9
  102  2017       missing       10
  113  2017       missing       11
  124  2017       missing       12

# emulating the pivoting in python pandas
julia> df = DataFrame(paddockId= [0, 0, 1, 1, 2, 2],
                        color= ["red", "blue", "red", "blue", "red", "blue"],
                        count= [3, 4, 3, 4, 3, 4],
                        weight= [0.2, 0.3, 0.2, 0.3, 0.2, 0.2])
6×4 DataFrame
 Row │ paddockId  color   count  weight  
     │ Int64      String  Int64  Float64
─────┼───────────────────────────────────
   10  red         3      0.2
   20  blue        4      0.3
   31  red         3      0.2
   41  blue        4      0.3
   52  red         3      0.2
   62  blue        4      0.2

julia> df_transpose( df_transpose(df, [:count,:weight], [:paddockId,:color]),
                     :_c1, :paddockId, id = 2:3)
3×6 DataFrame
 Row │ paddockId  _variables_  ("red", "count")  ("red", "weight")  ("blue", "count")  ("blue", "weight")
     │ Int64      String       Float64?          Float64?           Float64?           Float64?           
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────
   10  _c1                       3.0                0.2                4.0                 0.3
   21  _c1                       3.0                0.2                4.0                 0.3
   32  _c1                       3.0                0.2                4.0                 0.2

Relation to other functions

permutedims

df_transpose(df::AbstractDataFrame, cols) is similar to permutedims() with some differences.

  • df_transpose only permutes the columns cols.
  • If no variable is provided as id, it generates the column names of the new data set by mapping a function (renamecolid) on the sequence of rows in df.
  • A function (renamerowid) applies to the rows ids in the output data frame before generating the variable_name columns.

Examples

julia> df = DataFrame(x1 = [9,2,8,6,8], x2 = [8,1,6,2,3], x3 = [6,5,3,10,8])
5×3 DataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   19      8      6
   22      1      5
   38      6      3
   46      2     10
   58      3      8

julia> df_transpose(df, r"x")
3×6 DataFrame
 Row │ _variables_  _c1    _c2    _c3    _c4    _c5
     │ String       Int64  Int64  Int64  Int64  Int64
─────┼────────────────────────────────────────────────
   1 │ x1               9      2      8      6      8
   2 │ x2               8      1      6      2      3
   3 │ x3               6      5      3     10      8

julia> df_transpose(df, r"x", renamerowid = x -> match(r"[0-9]+",x).match,
                     renamecolid = x -> "_column_" * string(x))
3×6 DataFrame
 Row │ _variables_  _column_1  _column_2  _column_3  _column_4  _column_5
     │ SubString   Int64      Int64      Int64      Int64      Int64
─────┼────────────────────────────────────────────────────────────────────
   11                    9          2          8          6          8
   22                    8          1          6          2          3
   33                    6          5          3         10          8


julia> df1 = DataFrame(a=["x", "y"], b=[1.0, 2.0], c=[3, 4], d=[true, false])
2×4 DataFrame
 Row │ a       b        c      d
     │ String  Float64  Int64  Bool
─────┼───────────────────────────────
   1 │ x           1.0      3   true
   2 │ y           2.0      4  false

julia> df_transpose(df, [:b, :c, :d], id = :a) # note the column types
3×3 DataFrame
 Row │ _variables_  x        y
     │ String       Float64  Float64
─────┼───────────────────────────────
   1 │ b                1.0      2.0
   2 │ c                3.0      4.0
   3 │ d                1.0      0.0


julia> df2 = DataFrame(a=["x", "y"], b=[1, "two"], c=[3, 4], d=[true, false])
2×4 DataFrame
 Row │ a       b    c      d
     │ String  Any  Int64  Bool
─────┼───────────────────────────
   1 │ x       1        3   true
   2 │ y       two      4  false

julia> df_transpose(df2, [:b, :c, :d], id = :a, variable_name = "new_col")
3×3 DataFrame
 Row │ new_col  x     y
     │ String   Any   Any
─────┼──────────────────────
   1 │ b        1     two
   2 │ c        3     4
   3 │ d        true  false

stack

df_transpose(df::AbstractDataFrame, cols, gcols) can be used to emulate stack functionalities. Generally speaking, stack transposes each row of a data frame. Thus, to achieve the stack functionality each row of the input data frame should be an individual group. This can be done by inserting a new column to the input data frame or using df_transpose twice.

df_transpose cannot create the output data as a view (see help for stack function).

Examples

julia> df = DataFrame(a = repeat(1:3, inner = 2),
                             b = repeat(1:2, inner = 3),
                             c = repeat(1:1, inner = 6),
                             d = repeat(1:6, inner = 1),
                             e = string.('a':'f'))
6×5 DataFrame
 Row │ a      b      c      d      e
     │ Int64  Int64  Int64  Int64  String
─────┼────────────────────────────────────
   11      1      1      1  a
   21      1      1      2  b
   32      1      1      3  c
   42      2      1      4  d
   53      2      1      5  e
   63      2      1      6  f

julia> df_transpose(df, [:c, :d], Not([:c, :d]),
                     variable_name = "variable",
                     renamecolid = x -> "value")
12×5 DataFrame
 Row │ a      b      e       variable  value
     │ Int64  Int64  String  String    Int64
─────┼────────────────────────────────────────
   11      1  a       c              1
   21      1  a       d              1
   31      1  b       c              1
   41      1  b       d              2
   52      1  c       c              1
   62      1  c       d              3
   72      2  d       c              1
   82      2  d       d              4
   93      2  e       c              1
  103      2  e       d              5
  113      2  f       c              1
  123      2  f       d              6

julia> insertcols!(df, 1, :RowID => 1:nrow(df))
julia> df_transpose(df, [:c, :d], [:RowID, :a])
12×4 DataFrame
 Row │ RowID  a      _variables_  _c1
     │ Int64  Int64  String       Int64
─────┼───────────────────────────────────
   11      1  c                 1
   21      1  d                 1
   32      1  c                 1
   42      1  d                 2
   53      2  c                 1
   63      2  d                 3
   74      2  c                 1
   84      2  d                 4
   95      3  c                 1
  105      3  d                 5
  116      3  c                 1
  126      3  d                 6

unstack

When only one column is set for cols argument df_transpose(df::AbstractDataFrame, cols, gcols) can be used to emulate unstack functionality.

Examples

wide = DataFrame(id = 1:6,
                  a  = repeat(1:3, inner = 2),
                  b  = repeat(1.0:2.0, inner = 3),
                  c  = repeat(1.0:1.0, inner = 6),
                  d  = repeat(1.0:3.0, inner = 2))
6×5 DataFrame
 Row │ id     a      b        c        d       
     │ Int64  Int64  Float64  Float64  Float64
─────┼─────────────────────────────────────────
   11      1      1.0      1.0      1.0
   22      1      1.0      1.0      1.0
   33      2      1.0      1.0      2.0
   44      2      2.0      1.0      2.0
   55      3      2.0      1.0      3.0
   66      3      2.0      1.0      3.0            

# stacking to make data long
julia> long = df_transpose(wide, 3:5, [:id, :a],
                           variable_name = "variable",
                           renamecolid = x->"value")
18×4 DataFrame
 Row │ id     a      variable  value    
     │ Int64  Int64  String    Float64
─────┼──────────────────────────────────
   11      1  b              1.0
   21      1  c              1.0
   31      1  d              1.0
   42      1  b              1.0
   52      1  c              1.0
   62      1  d              1.0
   73      2  b              1.0
                         
  124      2  d              2.0
  135      3  b              2.0
  145      3  c              1.0
  155      3  d              3.0
  166      3  b              2.0
  176      3  c              1.0
  186      3  d              3.0
                          4 rows omitted
# unstack to make the long data wide again
julia> df_transpose(long, [:value], [:id, :a], id = :variable)
6×6 DataFrame
 Row │ id     a      _variables_  b         c         d        
     │ Int64  Int64  String       Float64?  Float64?  Float64?
─────┼─────────────────────────────────────────────────────────
   11      1  value             1.0       1.0       1.0
   22      1  value             1.0       1.0       1.0
   33      2  value             1.0       1.0       2.0
   44      2  value             2.0       1.0       2.0
   55      3  value             2.0       1.0       3.0
   66      3  value             2.0       1.0       3.0