Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move some setDT validation checks to C #5427

Merged
merged 13 commits into from
Apr 24, 2024
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,8 @@

11. `split.data.table` recognizes `sep=` when splitting with `by=`, just like the default and data.frame methods [#5417](https://github.com/Rdatatable/data.table/issues/5417).

12. `setDT` is faster for data with many columns, thanks @MichaelChirico for reporting and fixing the issue, [#5426](https://github.com/Rdatatable/data.table/issues/5426).

## BUG FIXES

1. `unique()` returns a copy the case when `nrows(x) <= 1` instead of a mutable alias, [#5932](https://github.com/Rdatatable/data.table/pull/5932). This is consistent with existing `unique()` behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix.
Expand Down
35 changes: 13 additions & 22 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -2846,19 +2846,21 @@ setDT = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
stopf("Cannot convert '%1$s' to data.table by reference because binding is locked. It is very likely that '%1$s' resides within a package (or an environment) that is locked to prevent modifying its variable bindings. Try copying the object to your current environment, ex: var <- copy(var) and then using setDT again.", cname)
}
}
# check no matrix-like columns, #3760. Other than a single list(matrix) is unambiguous and depended on by some revdeps, #3581
if (length(x)>1L) {
idx = vapply_1i(x, function(xi) length(dim(xi)))>1L
if (any(idx))
warningf("Some columns are a multi-column type (such as a matrix column): %s. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column.", brackify(which(idx)))
}
if (is.data.table(x)) {
# fix for #1078 and #1128, see .resetclass() for explanation.
setattr(x, 'class', .resetclass(x, 'data.table'))
if (!missing(key)) setkeyv(x, key) # fix for #1169
if (check.names) setattr(x, "names", make.names(names(x), unique=TRUE))
if (selfrefok(x) > 0L) return(invisible(x)) else setalloccol(x)
} else if (is.data.frame(x)) {
# check no matrix-like columns, #3760. Allow a single list(matrix) is unambiguous and depended on by some revdeps, #3581
# for performance, only warn on the first such column, #5426
for (jj in seq_along(x)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be preferable to combine the logic for the for loops instead of just emitting the warning, but I didn't see an easy way to do so -- the setdt_nrows parallel also does the other checks in the same loop. Unless we think looping over columns twice is fine to exchange for the clarity of sharing this logic. But this loop is pretty straightforward.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say the loop overhead is likely to be negligible compared to even this simple logic, so I don't think it's worth much instinctively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good performance improvement.

if (length(dim(x[[jj]])) > 1L) {
.Call(Cwarn_matrix_column_r, jj)
break
}
}
rn = if (!identical(keep.rownames, FALSE)) rownames(x) else NULL
setattr(x, "row.names", .set_row_names(nrow(x)))
if (check.names) setattr(x, "names", make.names(names(x), unique=TRUE))
Expand All @@ -2877,22 +2879,11 @@ setDT = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
x = null.data.table()
} else if (is.list(x)) {
# copied from as.data.table.list - except removed the copy
for (i in seq_along(x)) {
if (is.null(x[[i]])) next # allow NULL columns to be created by setDT(list) even though they are not really allowed
# many operations still work in the presence of NULL columns and it might be convenient
# e.g. in package eplusr which calls setDT on a list when parsing JSON. Operations which
# fail for NULL columns will give helpful error at that point, #3480 and #3471
if (inherits(x[[i]], "POSIXlt")) stopf("Column %d is of POSIXlt type. Please convert it to POSIXct using as.POSIXct and run setDT again. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.", i)
}
n = vapply_1i(x, length)
MichaelChirico marked this conversation as resolved.
Show resolved Hide resolved
n_range = range(n)
if (n_range[1L] != n_range[2L]) {
tbl = sort(table(n))
stopf("All elements in argument 'x' to 'setDT' must be of same length, but the profile of input lengths (length:frequency) is: %s\nThe first entry with fewer than %d entries is %d.", brackify(sprintf('%s:%d', names(tbl), tbl)), n_range[2L], which.max(n<n_range[2L]))
}
nrow = .Call(Csetdt_nrows, x)

xn = names(x)
if (is.null(xn)) {
setattr(x, "names", paste0("V",seq_len(length(x))))
setattr(x, "names", paste0("V", seq_along(x)))
} else {
idx = xn %chin% "" # names can be NA - test 1006 caught that!
if (any(idx)) {
Expand All @@ -2901,8 +2892,8 @@ setDT = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
}
if (check.names) setattr(x, "names", make.names(xn, unique=TRUE))
}
setattr(x,"row.names",.set_row_names(n_range[2L]))
setattr(x,"class",c("data.table","data.frame"))
setattr(x, "row.names", .set_row_names(nrow))
setattr(x, "class", c("data.table", "data.frame"))
setalloccol(x)
} else {
stopf("Argument 'x' to 'setDT' should be a 'list', 'data.frame' or 'data.table'")
Expand Down
12 changes: 9 additions & 3 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -8591,7 +8591,7 @@ dt = data.table(d1="1984-03-17")
ans = data.table(d1="1984-03-17", d2=as.POSIXct("1984-03-17", tz='UTC'))
test(1612.2, dt[, d2 := strptime(d1, "%Y-%m-%d", tz='UTC')], ans, warning="strptime() usage detected and wrapped with as.POSIXct()")
ll = list(a=as.POSIXlt("2015-01-01"), b=2L)
test(1612.3, setDT(ll), error="Column 1 is of POSIXlt type")
test(1612.3, setDT(ll), error="Column 1 has class 'POSIXlt'")

# tests for all.equal.data.table #1106
# diff nrow
Expand Down Expand Up @@ -13723,7 +13723,13 @@ DT = as.data.table(M)
test(1964.3, rownames(DT), character(0))
test(1964.4, colnames(DT), c("v1","v2"))

test(1965, setDT(list(1, 1:2)), error = 'profile of input lengths') #3121
test(1965.1, setDT(list(1, 1:2)), error='input 2 has length 2.*length 1') #3121
# setDT() skips NULL elements, so get the length from the first non-NULL
test(1965.2, setDT(list(NULL, 1, 1:2)), error='input 3 has length 2.*length 1')
M = matrix(nrow=2L, ncol=2L)
ans = structure(list(V1=1:2, V2=M), class=c('data.table', 'data.frame'))
alloc.col(ans)
test(1965.3, setDT(list(1:2, M)), ans, warning='Some columns are a multi-column type.*for example column 2')

# fread/fwrite file name in native and utf-8 encoding, #3078
if (.Platform$OS.type=="windows") {
Expand Down Expand Up @@ -16050,7 +16056,7 @@ test(2088, x[, round(.SD, 1)][, c:=8.88], data.table(a=c(.8, -.4, 1.2), b=c(.6,.
DF = data.frame(a=1:5)
DF$m = matrix(6:15, ncol=2L)
test(2089.1, names(setDT(DF)), c("a","m"),
warning="Some columns are a multi-column type.*[[]2[]].*setDT will retain these columns as-is but.*Please consider as.data.table")
warning="Some columns are a multi-column type.*for example column 2.*setDT will retain these columns as-is but.*Please consider as.data.table")
test(2089.2, as.data.table(DF), data.table(a=1:5, m.V1=6:10, m.V2=11:15)) # the DF here is now a data.table, so as.data.table.data.table() is matrix-column-aware
test(2089.3, print(DF), output="<multi-column>")
DF = data.frame(a=1:5)
Expand Down
42 changes: 42 additions & 0 deletions src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,48 @@ static SEXP shallow(SEXP dt, SEXP cols, R_len_t n)
return(newdt);
}

// Wrapped in a function so the same message is issued for the data.frame case at the R level
void warn_matrix_column(/* 1-indexed */ int i) {
warning(_("Some columns are a multi-column type (such as a matrix column), for example column %d. setDT will retain these columns as-is but subsequent operations like grouping and joining may fail. Please consider as.data.table() instead which will create a new column for each embedded column."), i);
}

// input validation for setDT() list input; assume is.list(x) was tested in R
SEXP setdt_nrows(SEXP x)
{
int base_length = 0;
bool test_matrix_cols = true;

for (R_len_t i = 0; i < LENGTH(x); ++i) {
SEXP xi = VECTOR_ELT(x, i);
/* allow NULL columns to be created by setDT(list) even though they are not really allowed
* many operations still work in the presence of NULL columns and it might be convenient
* e.g. in package eplusr which calls setDT on a list when parsing JSON. Operations which
* fail for NULL columns will give helpful error at that point, #3480 and #3471 */
if (Rf_isNull(xi)) continue;
if (Rf_inherits(xi, "POSIXlt")) {
error(_("Column %d has class 'POSIXlt'. Please convert it to POSIXct (using as.POSIXct) and run setDT() again. We do not recommend the use of POSIXlt at all because it uses 40 bytes to store one date."), i+1);
}
SEXP dim_xi = getAttrib(xi, R_DimSymbol);
R_len_t len_xi;
R_len_t n_dim = LENGTH(dim_xi);
if (n_dim) {
if (test_matrix_cols && n_dim > 1) {
warn_matrix_column(i+1);
test_matrix_cols = false;
}
len_xi = INTEGER(dim_xi)[0];
} else {
len_xi = LENGTH(xi);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with #5981 in mind I do have some wariness about potentially introducing issues by skipping dispatch.

}
if (!base_length) {
base_length = len_xi;
} else if (len_xi != base_length) {
error(_("All elements in argument 'x' to 'setDT' must be of equal length, but input %d has length %d whereas the first non-empty input had length %d"), i+1, len_xi, base_length);
}
}
return ScalarInteger(base_length);
}

SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose)
{
SEXP names, klass; // klass not class at request of pydatatable because class is reserved word in C++, PR #3129
Expand Down
3 changes: 3 additions & 0 deletions src/data.table.h
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,7 @@ SEXP dt_na(SEXP x, SEXP cols);
SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose);
const char *memrecycle(const SEXP target, const SEXP where, const int start, const int len, SEXP source, const int sourceStart, const int sourceLen, const int colnum, const char *colname);
SEXP shallowwrapper(SEXP dt, SEXP cols);
void warn_matrix_column(int i);

SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols,
SEXP xjiscols, SEXP grporder, SEXP order, SEXP starts,
Expand Down Expand Up @@ -275,6 +276,7 @@ SEXP notchin(SEXP x, SEXP table);
SEXP setattrib(SEXP, SEXP, SEXP);
SEXP assign(SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP copy(SEXP);
SEXP setdt_nrows(SEXP);
SEXP alloccolwrapper(SEXP, SEXP, SEXP);
SEXP selfrefokwrapper(SEXP, SEXP);
SEXP truelength(SEXP);
Expand Down Expand Up @@ -317,6 +319,7 @@ SEXP glast(SEXP);
SEXP gfirst(SEXP);
SEXP gnthvalue(SEXP, SEXP);
SEXP dim(SEXP);
SEXP warn_matrix_column_r(SEXP);
SEXP gvar(SEXP, SEXP);
SEXP gsd(SEXP, SEXP);
SEXP gprod(SEXP, SEXP);
Expand Down
2 changes: 2 additions & 0 deletions src/init.c
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@ R_CallMethodDef callMethods[] = {
{"Cdogroups", (DL_FUNC) &dogroups, -1},
{"Ccopy", (DL_FUNC) &copy, -1},
{"Cshallowwrapper", (DL_FUNC) &shallowwrapper, -1},
{"Csetdt_nrows", (DL_FUNC) &setdt_nrows, -1},
{"Calloccolwrapper", (DL_FUNC) &alloccolwrapper, -1},
{"Cselfrefokwrapper", (DL_FUNC) &selfrefokwrapper, -1},
{"Ctruelength", (DL_FUNC) &truelength, -1},
Expand Down Expand Up @@ -141,6 +142,7 @@ R_CallMethodDef callMethods[] = {
{"CstartsWithAny", (DL_FUNC)&startsWithAny, -1},
{"CconvertDate", (DL_FUNC)&convertDate, -1},
{"Cnotchin", (DL_FUNC)&notchin, -1},
{"Cwarn_matrix_column_r", (DL_FUNC)&warn_matrix_column_r, -1},
{NULL, NULL, 0}
};

Expand Down
4 changes: 4 additions & 0 deletions src/wrappers.c
Original file line number Diff line number Diff line change
Expand Up @@ -120,3 +120,7 @@ SEXP dim(SEXP x)
return ans;
}

SEXP warn_matrix_column_r(SEXP i) {
warn_matrix_column(INTEGER(i)[0]);
return R_NilValue;
}
Loading