Implicit Problem Aggregators #7738

GregoryTravis · 2023-09-05T15:10:52Z

GregoryTravis
Sep 5, 2023
Collaborator

Implicit Problem Aggregators

The Problem

Problem aggregation in the Java Table implementation suffers from a number of issues:

We have many different problem builders which should be combined into a unified approach
Each way of handling / aggregating problems is slightly different
Using multiple kinds of problem builders in the same code path requires extra code to combine them
Problems are (sometimes) attached to values, so some objects need their own problem builders while others don't
The developer has to take care to propagate problems anywhere they arise, making it easy to lose problems

Many of these isues would be solved by creating a more general problem aggregator (PA) to use everywhere. But the last problem remains: developers have to take care to propagate problems at all times, both in writing new code and modifying existing code.

Automatic Aggregation

The solution to this problem is to provide a facility for reporting problems that will always ensure that the problems are dealt with – propagated to the user by default, or optionally handled in some other way.

We propose that there exists at all times an ambient PA that problems can be reported to, and which will collect those problems and propagate them upwards to the user.

We maintain a stack of PAs. When crossing the boundary from Enso to Java, we push a PA onto the stack. We can push additional PAs onto the stack to wrap particular computations, and then pop them off. Popping a PA from the stack transfers its problems to its parent.

There is no need to pass PAs everywhere – the current PA is always available through the ProblemAggregators class.

At any point, we can report a problem like this:

report(new ArithmeticError(location, method, row));

Calling report adds the problem to the current PA. Since the PA stack always has at least one entry, there is always a place for a problem to go.

Developers can inspect and modify problems that have been aggregated, but by default they are simply propagated.

Problem Aggregator Scope

PAs are dynamically scoped. You can add a PA at any time:

ProblemAggregators.push(new ProblemAggregator());

This can default to a standard PA:

ProblemAggregators.push();

You only need to specify the PA if you are using a special one:

ProblemAggregators.push(new BulkArithmeticErrorPA());

A push is generally paired with a pop, which should be inside a finally block:

try {
  ProblemAggregators.push(new ProblemAggregator());
} finally {
  ProblemAggregators.pop();
}

Alternately, you can take care of the pop automatically with a wrapper

ProblemAggregators.with(new ProblemAggregator(),
  () -> doSomeStuff());

or

ProblemAggregators.withPA(
  () -> {
    doSomeStuff();
    doMore Stuff();
  }
);

By default, when a PA is popped, it gives its problems to its parent. Alternatively, you can choose to intervene:

try {
  ProblemAggregators.push(new ProblemAggregator());
} finally {
  List<Problem> problems = ProblemAggregators.extractCurrent();
  List<Problem> filteredProblems = filterProblems(problems);
  push(filteredProblems); // Pushes to parents.
}

Or

try {
  ProblemAggregators.push(new ProblemAggregator());
} finally {
  ProblemAggregators.popWithTransformer(Reducers::arithmeticErrorReducer);
}

Or

ProblemAggregators.withFilteredPA(Reducers::arithmeticErrorReducer,
  () -> {
    doSomeStuff();
    doMore Stuff();
  }

);

In the simplest case, you only need call report(problem) for each problem, knowing that the problem will be propagated correctly. All other calls to the PA system are for special cases: reducing / combining bulk problems; removing misleading or unimportant problems; or adding additional information to problems.

Specialized PAs

We currently have a number of aggregators with convenience methods, such as

public class MapOperationProblemBuilder {
  public void reportArithmeticError(String message, Integer row) { … }
}

We could implement these methods on the default PA for many common cases. For more specialized use, you could refer to the PA directly:

try {
  var myPA = new SpecialPA();
  ProblemAggregators.push(myPa);
  myPA.reportVeryWeirdProblem(byteArray, ioStream, serverSocket);
} finally {
  ProblemAggregators.popWithTransformer(Reducers::arithmeticErrorReducer);
}

Other specialized PAs might be:

Report only one of each class of error
Report up to five of each class of error
Report only one of each kind of error, where the ‘kind’ of error is determined by tags passed along with the error:
- report(new ArithmeticError(), columnName)
Merge errors using custom merging logic (which we have right now in some cases)

Extension to Enso code

While this proposal is for reporting problems within Java code, we can also consider using it from Enso. Enso code has some of the same problem aggregation problems: multiple implementations, impedance mismatch between different aggregators, and of course the possibility of dropping problems on the floor.

With Enso, the difference is that problems are attached to values. Reporting problems in Enso then means including the value in the aggregator, which allows the problem to be displayed in the right place in the IDE.

By default, any problem encountered while evaluating a node will be attached to that node. Additional logic could be used to rearrange or reattach problems, if needed.

Using PAs from Enso will look a lot like using it from Java, except perhaps even cleaner:

  handle_bulk_cast_problems <|
    do_things 1 2 3

As before, there is no need to pass PAs around; they’re always available in any scope.

radeusgd · 2023-09-05T15:27:58Z

radeusgd
Sep 5, 2023
Collaborator

What is the reasoning for switching away from the solution that we discussed last week into the 'global implicit' approach?

I don't see much gain in this approach versus the one from last week.

For the sake of comments, let's call this approach 'implicit' and the one discussed last week 'argument-based'.

Calling report adds the problem to the current PA. Since the PA stack always has at least one entry, there is always a place for a problem to go.

How do we guarantee that there is always at least one entry? Who's responsibility is to guarantee that? I like that with the 'argument-based' approach, the API user has to construct a problem aggregator to pass down the API call, so there the API gives us a guarantee that indeed the aggregator is constructed.

Here I don't see any guarantee apart from 'we try to remember about it'.

We maintain a stack of PAs. When crossing the boundary from Enso to Java, we push a PA onto the stack. We can push additional PAs onto the stack to wrap particular computations, and then pop them off. Popping a PA from the stack transfers its problems to its parent.

What if I forget to pop off an element from the stack?

A push is generally paired with a pop, which should be inside a finally block:

This is really more complicated than the other approach. With the 'argument-based' approach, I'd just create a nested aggregator (+- a 'push') and I do not need to care about a 'pop' - the whole structure is summarized once the problem summary is requested. The structure is maintained automatically by passing a nested instance as an argument. There is no risk of forgetting to pop it. No added complexity of exceptions.

While this proposal is for reporting problems within Java code, we can also consider using it from Enso.

Most of the time the Warnings API is sufficient in Enso. We have a Problem_Builder that we use for rather specialized cases. We may need to improve it, but I don't think we need a common solution here.

I don't understand the reasons behind a switch towards the solution described above. I thought the solution we discussed last week was good enough.

I really liked it and found it very elegant. This new solution is IMO not at all as elegant - it requires more work with maintaining the 'stack'. It will not port well to possible multi-threaded scenarios due to the global state.

The benefit of this seems to be that we don't need to pass any problem aggregator explicitly. But IMO that is a disadvantage - this creates a risk of unhandled problems. The 'argument-based' approach was giving us assurance that if my function holds a reference to a problem aggregator, someone up the call-chain will summarize these problems and so will handle them. This does no longer have this, and makes the problem handling much more implicit.

I don't think using global state is a good tool for this job. Problem handling should be done in some explicit manner IMO.

0 replies

radeusgd · 2023-09-05T17:06:02Z

radeusgd
Sep 5, 2023
Collaborator

Here's how I've been imagining the idea that we discussed together last week could work:

package org.enso.table.parsing.problems.proto;

import org.enso.table.data.column.storage.Storage;
import org.enso.table.data.column.storage.type.FloatType;
import org.enso.table.data.column.storage.type.IntegerType;
import org.enso.table.data.column.storage.type.StorageType;
import org.enso.table.data.table.Column;
import org.enso.table.data.table.problems.ArithmeticOverflow;
import org.enso.table.data.table.problems.FloatingPointGrouping;
import org.enso.table.problems.Problem;

import java.util.ArrayList;
import java.util.List;

public class ProblemAggregator {
  protected List<Problem> directlyReportedProblems = new ArrayList<>();
  protected List<ProblemAggregator> children = new ArrayList<>();

  // Called by any processing code to report a simple problem. Specialized implementations may be available too.
  public void report(Problem problem) {
    directlyReportedProblems.add(problem);
  }

  // Called by the top-level user after all processing is completed, to summarize problems that happened.
  public List<Problem> summarize() {
    List<Problem> result = new ArrayList<>();
    result.addAll(directlyReportedProblems);
    for (ProblemAggregator child : children) {
      result.addAll(child.summarize());
    }
    return result;
  }

  // The simple constructor is private, so children need to use one that specifies the parent, thus guaranteeing that a parent exists.
  private ProblemAggregator() {}

  // The constructor to use for inheritors, that guarantees that it is attached to a parent.
  protected ProblemAggregator(ProblemAggregator parent) {
    parent.children.add(this);
  }

  /* This should only be used by top-level code, and any call to this method should be paired up with a summarize
   * call that translates problems from Java to Enso.
   *
   * We can guarantee this by relying on an Enso wrapper:
   * with_problem_aggregator on_problems code =
   *     agg = ProblemAggregator.makeTopLevelAggregator
   *     result = code agg
   *     on_problems.attach_problems_before (Java_Problems.translate agg.summarize) result
   *
   * No function other than `with_problem_aggregator` should ever call `makeTopLevelAggregator`, to guarantee correctness.
   * Usage:
   * with_problem_aggregator Problem_Behavior.Report_Warning problem_aggregator->
   *     Java_Helper.call args... problem_aggregator
   */
  public static ProblemAggregator makeTopLevelAggregator() {
    return new ProblemAggregator();
  }
}

// We can use a more specific aggregator for specific tasks, that will contain a more efficient implementation.
public class ArithmeticOperationProblemAggregator extends ProblemAggregator {

  public ArithmeticOperationProblemAggregator(ProblemAggregator parent) {
    super(parent);
  }

  private long overflowCount = 0;
  private Object[] overflowExample = null;
  private StorageType overflowTargetType = null;

  public void reportOverflow(StorageType targetType, long x, String op, long y) {
    overflowCount++;
    if (overflowTargetType == null) {
      overflowTargetType = targetType;
      overflowExample = new Object[] {x, op, y};
    }
  }

  @Override
  public List<Problem> summarize() {
    List<Problem> problems = super.summarize();
    if (overflowCount > 0) {
      problems.add(new ArithmeticOverflow(overflowTargetType, overflowCount, overflowExample));
    }
    return problems;
  }
}

// Or we can have an aggregator that provides context info to report more 'rich' problems.
public class ColumnOperationProblemAggregator extends ProblemAggregator {
  private final String columnName;

  public ColumnOperationProblemAggregator(ProblemAggregator parent, String columnName) {
    super(parent);
    this.columnName = columnName;
  }

  public void reportFloatingPointEquality(int row) {
    report(new FloatingPointGrouping(columnName, row));
  }
}

// Usage
public class Java_Stuff {
  public static int enso_facing_method(Column java_column, ProblemAggregator aggregator) {
    int x = computeArithmeticOperation("foo", new ArithmeticOperationProblemAggregator(aggregator));
    int y = storageMapping(java_column.getStorage(), new ColumnOperationProblemAggregator(aggregator, java_column.getName()));
    return x + y;
  }

  private static int computeArithmeticOperation(String args, ArithmeticOperationProblemAggregator aggregator) {
    // ...
    aggregator.reportOverflow(IntegerType.INT_64, 123, "+", 42);
    // ...
    return 23;
  }

// alternatively we can create the aggregator at usage site
  private static int computeArithmeticOperation2(String args, ProblemAggregator aggregator) {
    // Create a specialzied aggregator because we want it to be more efficient, all is passed to parent anyway.
    ArithmeticOperationProblemAggregator specialized = new ArithmeticOperationProblemAggregator(aggregator);
    // ...
    specialized.reportOverflow(IntegerType.INT_64, 123, "+", 42);
    // ...
    return 23;
  }

  private static int storageMapping(Storage<?> storage, ColumnOperationProblemAggregator aggregator) {
    if (storage.getType() instanceof FloatType) {
      // ...
      aggregator.reportFloatingPointEquality(0);
      // ...
    }

    return 0;
  }
}

/*
 * Enso code:
 * my_function column =
 *     # The wrapper guarantees that `summarize` will be called (unless an exception is thrown - but then we don't care about the problems - everything is simply discarded and cleaned up by GC).
 *     with_problem_aggregator Problem_Behavior.Report_Warning problem_aggregator->
 *       Java_Stuff.enso_facing_method column.java_column problem_aggregator
 */

Here we can see that we don't need any bookkeeping - thanks to the protected constructor of ProblemAggregator, any child aggregator is guaranteed to be registered to its parent. There is no way to go around this apart from abusing the makeTopLevelAggregator - but it's clear this function should only be called from with_problem_aggregator.

The specialized ArithmeticOperationProblemAggregator shows us how we can be ready to efficiently handle thousands of overflow errors with performing just O(1) allocations.

The specialized ColumnOperationProblemAggregator shows us how we can include additional context info in an aggregator. Ensuring that a method, like storageMapping takes such a specialized aggregator guarantees us that any users will have to construct this specialized instance, thus guaranteeing us that we have all context information we need at the call-site (in this case, the column name).

I don't think we can easily achieve this kind of context-information passing nor specialized implementations for efficient handling of big amounts of problems using the global state approach - there is no place to 'specialize' the aggregators there, at least in an elegant way.

1 reply

radeusgd Sep 7, 2023
Collaborator

I can't wait to improve our problem handling.

A practical example of how in the current approach I forgot to pass-through warnings from a builder, causing a Loss_Of_Integer_Precision warning to be discarded. 329b290 (this is a commit fixing the missing warning handling)

I hope with the new approach we will not have such problems :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enso Analytics

Implicit Problem Aggregators #7738

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

Enso Analytics

Implicit Problem Aggregators #7738

GregoryTravis Sep 5, 2023 Collaborator