Skip to content

Commit 3154ea7

Browse files
committed
store: Fix mistake in handling of histogram bounds in VidBatcher
When setting upa VidBatcher we have both accurate values for the range of vids as well as Postgres' estimate of bounds for a histogram with roughly the same number of entries in each bucket. As an example, say we have min and max of 1 and 100, and histogram bounds [5, 50, 96]. We used to then add min and max to these bounds resulting in an ogive over [1, 5, 50, 96, 100]. With that, it seems that there is a bucket [1, 5] with just as many entries as the bucket [5, 50], which is not what the Posgres staistics indicate. Using this ogive will cause e.g. pruning to increase batch size quickly as it tries to get out of the [1, 5] bucket resulting in a batch size that is way too big for the next bucket and a batch that can take a very long time. The first and last entry of the bounds are Postgres' estimate of the min and max. We now simply replace the first and last bound with our known min and max, resulting in an ogive over [1, 50, 100], which reflects the statistics much more accurately and avoids impossibly short buckets.
1 parent 900f10a commit 3154ea7

File tree

1 file changed

+46
-6
lines changed

1 file changed

+46
-6
lines changed

store/postgres/src/vid_batcher.rs

Lines changed: 46 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -136,12 +136,22 @@ impl VidBatcher {
136136
) -> Result<Self, StoreError> {
137137
let start = range.min;
138138

139-
let bounds = bounds
140-
.into_iter()
141-
.filter(|bound| range.min < *bound && range.max > *bound)
142-
.chain(vec![range.min, range.max].into_iter())
143-
.collect::<Vec<_>>();
144-
139+
let bounds = {
140+
// Keep only histogram bounds that are relevent for the range
141+
let mut bounds = bounds
142+
.into_iter()
143+
.filter(|bound| range.min <= *bound && range.max >= *bound)
144+
.collect::<Vec<_>>();
145+
// The first and last entry in `bounds` are Postgres' estimates
146+
// of the min and max `vid` values in the table. We use the
147+
// actual min and max `vid` values from the `vid_range` instead
148+
let len = bounds.len();
149+
if len > 0 {
150+
bounds[0] = range.min;
151+
bounds[len - 1] = range.max;
152+
}
153+
bounds
154+
};
145155
let mut ogive = if range.is_empty() {
146156
None
147157
} else {
@@ -363,6 +373,17 @@ mod tests {
363373
}
364374
}
365375

376+
impl std::fmt::Debug for Batcher {
377+
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
378+
f.debug_struct("Batcher")
379+
.field("start", &self.vid.start)
380+
.field("end", &self.vid.end)
381+
.field("size", &self.vid.batch_size.size)
382+
.field("duration", &self.vid.batch_size.target.as_secs())
383+
.finish()
384+
}
385+
}
386+
366387
#[test]
367388
fn simple() {
368389
let bounds = vec![10, 20, 30, 40, 49];
@@ -414,4 +435,23 @@ mod tests {
414435
batcher.at(360, 359, 80);
415436
batcher.step(360, 359, S010);
416437
}
438+
439+
#[test]
440+
fn vid_batcher_adjusts_bounds() {
441+
// The first and last entry in `bounds` are estimats of the min and
442+
// max that are slightly off compared to the actual min and max we
443+
// put in `vid_range`. Check that `VidBatcher` uses the actual min
444+
// and max from `vid_range`.
445+
let bounds = vec![639, 20_000, 40_000, 60_000, 80_000, 90_000];
446+
let vid_range = VidRange::new(1, 100_000);
447+
let batch_size = AdaptiveBatchSize {
448+
size: 1000,
449+
target: S100,
450+
};
451+
452+
let vid_batcher = VidBatcher::new(bounds, vid_range, batch_size).unwrap();
453+
let ogive = vid_batcher.ogive.as_ref().unwrap();
454+
assert_eq!(1, ogive.start());
455+
assert_eq!(100_000, ogive.end());
456+
}
417457
}

0 commit comments

Comments
 (0)