Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed nested ':=' reference assignment fails #6789

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,8 @@ rowwiseDT(

19. An integer overflow in `fread()` with lines longer than `2^(31/2)` bytes is prevented, [#6729](https://github.com/Rdatatable/data.table/issues/6729). The typical impact was no worse than a wrong initial allocation size, corrected later. Thanks to @TaikiSan21 for the report and @aitap for the fix.

20. By reference assignments (':=') with functions that modified the data.table by reference e.g. (`foo=function(DT){modify(DT);return(1L)}`, `DT[,a:=foo(DT)]`) returned a mallformed data.table due to the modification of the targeted named column index ("a") during the j expression evaluation [#6768](https://github.com/Rdatatable/data.table/issues/6768). Thanks @AntonNM for the report and fix.

## NOTES

1. There is a new vignette on joins! See `vignette("datatable-joins")`. Thanks to Angel Feliz for authoring it! Feedback welcome. This vignette has been highly requested since 2017: [#2181](https://github.com/Rdatatable/data.table/issues/2181).
Expand Down
112 changes: 66 additions & 46 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -1154,7 +1154,6 @@ replace_dot_alias = function(e) {
} else if (is.numeric(lhs)) {
m = as.integer(lhs)
if (any(m<1L | ncol(x)<m)) stopf("LHS of := appears to be column positions but are outside [1,ncol] range. New columns can only be added by name.")
lhs = names_x[m]
} else
stopf("LHS of := isn't column names ('character') or positions ('integer' or 'numeric')")
if (!anyNA(m)) {
Expand All @@ -1179,57 +1178,16 @@ replace_dot_alias = function(e) {
return(invisible(x))
}
} else {
# Adding new column(s). TO DO: move after the first eval in case the jsub has an error.
# Adding new column(s). Allocation for columns and recalculation of target cols moved after the jval = eval(jsub)
# in case of error or by-reference modifications to the DT
newnames=setdiff(lhs, names_x)
m[is.na(m)] = ncol(x)+seq_len(length(newnames))
cols = as.integer(m)
# don't pass verbose to selfrefok here -- only activated when
# ok=-1 which will trigger setalloccol with verbose in the next
# branch, which again calls _selfrefok and returns the message then
# ok=-1 which will trigger setalloccol with verbose after
# the jval = eval(jsub, ...)
if ((ok<-selfrefok(x, verbose=FALSE))==0L) # ok==0 so no warning when loaded from disk (-1) [-1 considered TRUE by R]
if (is.data.table(x)) warningf("A shallow copy of this data.table was taken so that := can add or remove %d columns by reference. At an earlier point, this data.table was copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. It's also not unusual for data.table-agnostic packages to produce tables affected by this issue. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.", length(newnames))
# !is.data.table for DF |> DT(,:=) tests 2212.16-19 (#5113) where a shallow copy is routine for data.frame
if ((ok<1L) || (truelength(x) < ncol(x)+length(newnames))) {
DT = x # in case getOption contains "ncol(DT)" as it used to. TODO: warn and then remove
n = length(newnames) + eval(getOption("datatable.alloccol")) # TODO: warn about expressions and then drop the eval()
# i.e. reallocate at the size as if the new columns were added followed by setalloccol().
name = substitute(x)
if (is.name(name) && ok && verbose) { # && NAMED(x)>0 (TO DO) # ok here includes -1 (loaded from disk)
catf("Growing vector of column pointers from truelength %d to %d. A shallow copy has been taken, see ?setalloccol. Only a potential issue if two variables point to the same data (we can't yet detect that well) and if not you can safely ignore this. To avoid this message you could setalloccol() first, deep copy first using copy(), wrap with suppressWarnings() or increase the 'datatable.alloccol' option.\n", truelength(x), n)
# #1729 -- copying to the wrong environment here can cause some confusion
if (ok == -1L) catf("Note that the shallow copy will assign to the environment from which := was called. That means for example that if := was called within a function, the original table may be unaffected.\n")

# Verbosity should not issue warnings, so cat rather than warning.
# TO DO: Add option 'datatable.pedantic' to turn on warnings like this.

# TO DO ... comments moved up from C ...
# Note that the NAMED(dt)>1 doesn't work because .Call
# always sets to 2 (see R-ints), it seems. Work around
# may be possible but not yet working. When the NAMED test works, we can drop allocwarn argument too
# because that's just passed in as FALSE from [<- where we know `*tmp*` isn't really NAMED=2.
# Note also that this growing will happen for missing columns assigned NULL, too. But so rare, we
# don't mind.
}
setalloccol(x, n, verbose=verbose) # always assigns to calling scope; i.e. this scope
if (is.name(name)) {
assign(as.character(name),x,parent.frame(),inherits=TRUE)
} else if (.is_simple_extraction(name)) { # TODO(#6702): use a helper here as the code is very similar to setDT().
k = eval(name[[2L]], parent.frame(), parent.frame())
if (is.list(k)) {
origj = j = if (name[[1L]] == "$") as.character(name[[3L]]) else eval(name[[3L]], parent.frame(), parent.frame())
if (is.character(j)) {
if (length(j)!=1L) stopf("Cannot assign to an under-allocated recursively indexed list -- L[[i]][,:=] syntax is only valid when i is length 1, but its length is %d", length(j))
j = match(j, names(k))
if (is.na(j)) internal_error("item '%s' not found in names of list", origj) # nocov
}
.Call(Csetlistelt,k,as.integer(j), x)
} else if (is.environment(k) && exists(as.character(name[[3L]]), k)) {
assign(as.character(name[[3L]]), x, k, inherits=FALSE)
} else if (isS4(k)) {
.Call(CsetS4elt, k, as.character(name[[3L]]), x)
}
} # TO DO: else if env$<- or list$<-
}
}
}
}
Expand Down Expand Up @@ -1402,6 +1360,68 @@ replace_dot_alias = function(e) {
}

if (!is.null(lhs)) {
# Re-matches characters names in the lhs after jval to account for jsub's that modify the columns of the data.table (#6768)
# Replaces numerical lhs with respective names_x
if(is.character(lhs)){
m = chmatch(lhs, names_x)
if(!anyNA(m)){
# updates by reference to existing columns
cols = as.integer(m)
newnames=NULL
}else{
# Adding new column(s).
newnames=setdiff(lhs, names_x)
m[is.na(m)] = ncol(x)+seq_len(length(newnames))
cols = as.integer(m)
# ok <- selfrefok above called without verbose -- only activated when
# ok=-1 which will trigger setalloccol with verbose in the next
# branch, which again calls _selfrefok and returns the message then
# !is.data.table for DF |> DT(,:=) tests 2212.16-19 (#5113) where a shallow copy is routine for data.frame
if ((ok<1L) || (truelength(x) < ncol(x)+length(newnames))) {
DT = x # in case getOption contains "ncol(DT)" as it used to. TODO: warn and then remove
n = length(newnames) + eval(getOption("datatable.alloccol")) # TODO: warn about expressions and then drop the eval()
# i.e. reallocate at the size as if the new columns were added followed by setalloccol().
name = substitute(x)
if (is.name(name) && ok && verbose) { # && NAMED(x)>0 (TO DO) # ok here includes -1 (loaded from disk)
catf("Growing vector of column pointers from truelength %d to %d. A shallow copy has been taken, see ?setalloccol. Only a potential issue if two variables point to the same data (we can't yet detect that well) and if not you can safely ignore this. To avoid this message you could setalloccol() first, deep copy first using copy(), wrap with suppressWarnings() or increase the 'datatable.alloccol' option.\n", truelength(x), n)
# #1729 -- copying to the wrong environment here can cause some confusion
if (ok == -1L) catf("Note that the shallow copy will assign to the environment from which := was called. That means for example that if := was called within a function, the original table may be unaffected.\n")

# Verbosity should not issue warnings, so cat rather than warning.
# TO DO: Add option 'datatable.pedantic' to turn on warnings like this.

# TO DO ... comments moved up from C ...
# Note that the NAMED(dt)>1 doesn't work because .Call
# always sets to 2 (see R-ints), it seems. Work around
# may be possible but not yet working. When the NAMED test works, we can drop allocwarn argument too
# because that's just passed in as FALSE from [<- where we know `*tmp*` isn't really NAMED=2.
# Note also that this growing will happen for missing columns assigned NULL, too. But so rare, we
# don't mind.
}
setalloccol(x, n, verbose=verbose) # always assigns to calling scope; i.e. this scope
if (is.name(name)) {
assign(as.character(name),x,parent.frame(),inherits=TRUE)
} else if (.is_simple_extraction(name)) { # TODO(#6702): use a helper here as the code is very similar to setDT().
k = eval(name[[2L]], parent.frame(), parent.frame())
if (is.list(k)) {
origj = j = if (name[[1L]] == "$") as.character(name[[3L]]) else eval(name[[3L]], parent.frame(), parent.frame())
if (is.character(j)) {
if (length(j)!=1L) stopf("Cannot assign to an under-allocated recursively indexed list -- L[[i]][,:=] syntax is only valid when i is length 1, but its length is %d", length(j))
j = match(j, names(k))
if (is.na(j)) internal_error("item '%s' not found in names of list", origj) # nocov
}
.Call(Csetlistelt,k,as.integer(j), x)
} else if (is.environment(k) && exists(as.character(name[[3L]]), k)) {
assign(as.character(name[[3L]]), x, k, inherits=FALSE)
} else if (isS4(k)) {
.Call(CsetS4elt, k, as.character(name[[3L]]), x)
}
} # TO DO: else if env$<- or list$<-
}
}
} else if (is.numeric(lhs)) {
lhs = names_x[m]
}
# TODO?: use set() here now that it can add new columns. Then remove newnames and alloc logic above.
.Call(Cassign,x,irows,cols,newnames,jval)
return(suppPrint(x))
Expand Down
13 changes: 11 additions & 2 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -1979,8 +1979,9 @@ test(632, merge(DT1,DT2,all=TRUE), data.table(a=c(1,2,3,4,5),total.x=c(2,NA,1,3,
test(632.1, merge(DT1,DT2,all=TRUE), setkey(adt(merge(adf(DT1),adf(DT2),by="a",all=TRUE)),a))

# Test that with=FALSE by number isn't messed up by dup column names, #2025
DT = data.table(a=1:3,a=4:6)
test(634, DT[,2:=200L], data.table(a=1:3,a=200L))
DT = data.table(a=1:3, a=4:6, a=7:9)
test(634, DT[,2:=200L], data.table(a=1:3, a=200L, a=7:9))
test(634.1, DT[,c(2, 3):=200L], data.table(a=1:3, a=200L, a=200L))

# Test names when not all items are named, #2029
DT = data.table(x=1:3,y=1:3)
Expand Down Expand Up @@ -21063,3 +21064,11 @@ test(2304.100, set(copy(DT), i=2L, j=c("L1", "L2"), value=list(list(NULL), list(

# the integer overflow in #6729 is only noticeable with UBSan
test(2305, { fread(testDir("issue_6729.txt.bz2")); TRUE })

# j expressions that modify a data.table by reference, (#6768)
inner=function(dt){dt[,b:=4:6]}
outer=function(dt){inner(dt); return(7:9)}
foo=function(dt){dt[,b:=4:6];return(7:9)}
DT = data.table(a=1:3)
test(2306, DT[,c:=outer(DT)], data.table(a=1:3, b=4:6, c=7:9))
test(2306.1, DT[,c:=foo(DT)], data.table(a=1:3, b=4:6, c=7:9))
Loading