
Enable dataset support for quidem tests #17699

Merged (76 commits, Feb 10, 2025)
Conversation

@kgyrtkirk (Member) commented Feb 5, 2025

Enables the use of datasets in druidtest URIs.

The datasets option should point to a directory; all JSON files in it become candidates to be interpreted as ingestions.

The system will not run a full ingestion, but recognizing an ingestion spec seemed like the easiest way to simplify usage.

Example:

!use druidtest://?componentSupplier=StandardComponentSupplier&datasets=sql/src/test/quidem/sampledataset
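For illustration, such a directory could contain an ingestion-style JSON file roughly like the one below. All field values are hypothetical; per the code discussed later in this thread, the parts that actually get read are the dataSchema, inputSource, and inputFormat:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "sampledataset",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "city"] }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "sql/src/test/quidem/sampledataset",
        "filter": "*.json"
      },
      "inputFormat": { "type": "json" }
    }
  }
}
```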

Locations are relative to the project root, both for the datasets option and for the LocalInputSource instances looking up their inputs inside the ingestions.
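As a rough illustration of that resolution rule (class and method names here are hypothetical, not the PR's actual relativizeLocalInputSource):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: resolving a location given relative to the project
// root, as the datasets option and LocalInputSource lookups do conceptually.
public class ProjectRootResolver
{
  public static Path resolveAgainstProjectRoot(Path projectRoot, Path location)
  {
    // Absolute locations are kept as-is; relative ones are anchored at the root.
    if (location.isAbsolute()) {
      return location;
    }
    return projectRoot.resolve(location).normalize();
  }

  public static void main(String[] args)
  {
    Path root = Paths.get("/workspace/druid");
    System.out.println(resolveAgainstProjectRoot(root, Paths.get("sql/src/test/quidem/sampledataset")));
  }
}
```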

Setting datasets to a value suppresses the initialization of the component supplier's built-in datasets.

return walker;
}

@Provides
@LazySingleton
public List<TestDataSet> buildCustomTables(ObjectMapper objectMapper, TempDirProducer tdp,

Check notice (Code scanning / CodeQL): Useless parameter. The parameter 'tdp' is never used.
@imply-cheddar (Contributor) left a comment


A few comments, not major, but the Guice @Named annotation thing should probably be addressed. That's a pattern I'd prefer not to exist in the codebase.

  TestClusterQuerySegmentWalker(
-     Map<String, VersionedIntervalTimeline<String, ReferenceCountingSegment>> timelines,
+     @Named(TIMELINES_KEY) Map<String, VersionedIntervalTimeline<String, ReferenceCountingSegment>> timelines,
Contributor

The only "good" usage of a named annotation is when it is completely encapsulated inside a single module (i.e. there's a binding to the named item and a provider method in the same module that depends on the naming). This is because it's too easy for named things to proliferate, and it becomes hard to figure out how things are actually being set up.

In this case, it's unclear to me why you need the annotation at all. Though, from looking at it, it looks like you are trying to push a map around to multiple different things and have them all use the same map. In that case, I'd suggest that you create a relatively simple TestSegmentsBroker class which all of the places that would use this map can depend on and then you bind that. This makes it more clear that you have a singular object that multiple things are potentially updating the state of and reading the state from.
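The suggested pattern might look roughly like this minimal sketch. The value type is a plain String stand-in for Druid's VersionedIntervalTimeline, which isn't available here; only the shape of the idea is shown:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a TestSegmentsBroker: one bound singleton that owns the shared
// timelines map, so collaborators depend on the broker instead of a @Named map.
public class TestSegmentsBroker
{
  // String stands in for VersionedIntervalTimeline<String, ReferenceCountingSegment>.
  private final Map<String, String> timelines = new ConcurrentHashMap<>();

  public void addTimeline(String dataSource, String timeline)
  {
    timelines.put(dataSource, timeline);
  }

  public String getTimeline(String dataSource)
  {
    return timelines.get(dataSource);
  }

  public static void main(String[] args)
  {
    TestSegmentsBroker broker = new TestSegmentsBroker();
    broker.addTimeline("wikipedia", "timeline-stub");
    System.out.println(broker.getTimeline("wikipedia"));
  }
}
```

Bound as a singleton in Guice, this makes the shared mutable state explicit rather than hidden behind a named map binding.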

Member Author

In those static methods this map was used to back-communicate... I didn't want to make them much different, so I just added this @Named tag so it will be less likely to surprise someone :)
By using the constant as the name, I was thinking that people interested in its usage could look it up if needed.

I guess this will be dropped sometime later, and TestTimelineServerView could probably take its place?

But for now I've packed this Map into a TestSegmentsBroker.

Comment on lines +50 to +53
ObjectMapper om = objectMapper.copy();
om.registerSubtypes(new NamedType(MyIOConfigType.class, "index_parallel"));
FakeIndexTask indexTask = om.readValue(src, FakeIndexTask.class);
FakeIngestionSpec spec = indexTask.spec;
Contributor

I probably would've used the object mapper to read it in as a Map<String, Object>, then get on the keys I want to check and use objectMapper.convertValue() to convert into the types that you want. What you've done works as well, but it's a little more difficult to read this code and figure out what structure it wants in the JSON.

Member Author

I went through a few of those approaches:
  public static TestDataSet makeDS(ObjectMapper objectMapper, File src)
  {
    try {
      Map<String, Object> map = objectMapper.readValue(
          src,
          new TypeReference<>(){}
      );
      Map<String, Object> spec = objectMapper.convertValue(
          Preconditions.checkNotNull(map.get("spec"), "spec not specified"),
          new TypeReference<>(){}
      );
      DataSchema dataSchema = objectMapper.convertValue(
          Preconditions.checkNotNull(spec.get("dataSchema"), "dataschema not specified"),
          DataSchema.class
      );
      Map<String, Object> ioConfig = objectMapper.convertValue(
          Preconditions.checkNotNull(spec.get("ioConfig"), "ioConfig not specified"),
          new TypeReference<>(){}
      );
      InputSource inputSource0 = objectMapper.convertValue(
          Preconditions.checkNotNull(ioConfig.get("inputSource"), "inputSource not specified"),
          InputSource.class
      );

      InputFormat inputFormat = objectMapper.convertValue(
          Preconditions.checkNotNull(ioConfig.get("inputFormat"), "inputFormat not specified"),
          InputFormat.class
      );
      InputSource inputSource = relativizeLocalInputSource(
          inputSource0, ProjectPathUtils.PROJECT_ROOT
      );
      TestDataSet dataset = new InputSourceBasedTestDataset(
          dataSchema,
          inputFormat,
          inputSource
      );
      return dataset;
    }
    catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static TestDataSet makeDS3(ObjectMapper objectMapper, File src)
  {
    try {
      Map<String, Object> map = objectMapper.readValue(
          src,
          new TypeReference<>(){}
      );
      Map<String, Object> spec = objectMapper.convertValue(
          Preconditions.checkNotNull(map.get("spec"), "spec not specified"),
          new TypeReference<>(){}
      );
      DataSchema dataSchema = objectMapper.convertValue(
          Preconditions.checkNotNull(spec.get("dataSchema"), "dataschema not specified"),
          DataSchema.class
      );
      ObjectMapper om = objectMapper.copy();
      om.registerSubtypes(new NamedType(MyIOConfigType.class, "index_parallel"));
      MyIOConfigType ioConfig = om.convertValue(
          Preconditions.checkNotNull(spec.get("ioConfig"), "ioConfig not specified"),
          MyIOConfigType.class
      );

      InputSource inputSource = relativizeLocalInputSource(
          ioConfig.inputSource, ProjectPathUtils.PROJECT_ROOT
      );
      TestDataSet dataset = new InputSourceBasedTestDataset(
          dataSchema,
          ioConfig.inputFormat,
          inputSource
      );
      return dataset;
    }
    catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static TestDataSet makeDS2(ObjectMapper objectMapper, File src)
  {
    try {
      Map<String, Object> map = objectMapper.readValue(
          src,
          new TypeReference<>()
          {
          }
      );
      ObjectMapper om = objectMapper.copy();
      om.registerSubtypes(new NamedType(MyIOConfigType.class, "index_parallel"));
      FakeIngestionSpec spec = om
          .convertValue(Preconditions.checkNotNull(map.get("spec"), "spec not specified"), FakeIngestionSpec.class);
      InputSource inputSource = relativizeLocalInputSource(
          spec.getIOConfig().inputSource, ProjectPathUtils.PROJECT_ROOT
      );
      TestDataSet dataset = new InputSourceBasedTestDataset(
          spec.getDataSchema(),
          spec.getIOConfig().inputFormat,
          inputSource
      );
      return dataset;
    }
    catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static TestDataSet makeDS1(ObjectMapper objectMapper, File src)
  {
    try {
      ObjectMapper om = objectMapper.copy();
      om.registerSubtypes(new NamedType(MyIOConfigType.class, "index_parallel"));
      FakeIndexTask indexTask = om.readValue(src, FakeIndexTask.class);
      FakeIngestionSpec spec = indexTask.spec;
      InputSource inputSource = relativizeLocalInputSource(
          spec.getIOConfig().inputSource, ProjectPathUtils.PROJECT_ROOT
      );
      TestDataSet dataset = new InputSourceBasedTestDataset(
          spec.getDataSchema(),
          spec.getIOConfig().inputFormat,
          inputSource
      );
      return dataset;
    }
    catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

There are a few "middle" solutions which still use MyIOConfigType, but I think the cleanest is the original one. It might be harder to understand at first, but if something is missing later it will be easier to adapt.
There is also no need to prepare for missing values and such, as Jackson will do the usual checks.

@cryptoe (Contributor) left a comment

Changes LGTM. This will simplify testing a lot. Thanks.

@cryptoe cryptoe merged commit bba25a6 into apache:master Feb 10, 2025
74 checks passed
3 participants