
The model should make it easier to spot potential issues and reason about solutions. Before we talk solutions though, let's list the issues; then we can pick which items are actually worth considering. Basic questions to answer in deciding what is worth our while addressing could include:

  • How likely is the problem to happen?
  • If it does happen, what are the consequences? E.g. is it something catastrophic or rather something we can live with?
  • How much effort would it take to fix?

Semantics

Ambiguous empty result

Looking at the failure scenarios from earlier and the possible flows through the model, we see the client may get back an empty list if

  • The list of target tables is empty.
  • The PC constraint (or NAC if the client requested a numeric aggregate) doesn't hold true across the board—i.e. it's false for all the tables touched by the query.
  • For some tables the query brings back no rows whereas PC (NAC) constraints fail for all the other tables.
  • For each table, the query actually brings back no rows.

If there was an error, how is the client supposed to realise it? I.e. how can it distinguish the above cases so that it can fix the request? Perhaps this isn't a big deal since a human will eventually see an empty dashboard and realise there's something fishy going on. Quite reasonable in practice, but not a good assumption to make from an API design standpoint.
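To make the ambiguity concrete, here's a minimal Python sketch of the flow just described. None of it is QuantumLeap's actual code: run_query, pc_holds and fetch_rows are made-up stand-ins for the model's constraint check and per-table SQL step.

```python
def run_query(tables, pc_holds, fetch_rows):
    results = []
    for table in tables:
        if not pc_holds(table):       # PC (or NAC) fails: the table is silently skipped
            continue
        results.extend(fetch_rows(table))
    return results

# Case 1: no target tables at all.
print(run_query([], lambda t: True, lambda t: [{"id": 1}]))             # []
# Case 2: the constraint is false for every table.
print(run_query(["t1", "t2"], lambda t: False, lambda t: [{"id": 1}]))  # []
# Case 3: the constraint fails for t1 and the query matches nothing in t2.
print(run_query(["t1", "t2"], lambda t: t == "t2", lambda t: []))       # []
# Case 4: the query matches nothing anywhere.
print(run_query(["t1", "t2"], lambda t: True, lambda t: []))            # []
```

Four different situations, one identical response, no error and no hint as to why it's empty.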

Hidden data loss

Suppose the query touches multiple tables but PC (or NAC, depending on the case) doesn't hold true for some of them, whereas other tables satisfy all the required constraints and the query produces results for at least one of those. On getting a non-empty result list back, how can the client possibly tell the query was supposed to return more results, assuming the failed tables would have contributed some records?

Here's a pesky scenario. We have two tables T1 and T2 storing waste management device data. T1 stores data for devices sold by manufacturer M and has a numeric level column whose values tell how full a waste container is. T2's data comes from devices sold by another manufacturer and it too has a level column with the same meaning as T1's, but the column type happens to be a string even though it was meant to be a number. (We know this can happen.) Now suppose the client application is a dashboard showing cumulative waste levels. The client requests a sum aggregate spanning both tables and then further sums the returned numbers to show the total waste level from all devices. Unfortunately, the total waste level will always equal the sum of levels in T1, and it could be quite some time before someone realises there's something wrong with that number.
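To see how easily this slips through, here's a toy Python rendition of the scenario; the table contents and the nac_holds check are invented for illustration, only the shape of the problem mirrors the model.

```python
t1 = {"level": [10, 20, 30]}        # numeric, as intended
t2 = {"level": ["15", "25", "35"]}  # same meaning, but stored as strings

def nac_holds(table):
    # numeric-aggregate constraint: every value must actually be a number
    return all(isinstance(v, (int, float)) for v in table["level"])

def sum_aggregate(tables):
    # tables failing NAC are silently dropped from the aggregate
    return [sum(t["level"]) for t in tables if nac_holds(t)]

partial_sums = sum_aggregate([t1, t2])
print(partial_sums)       # [60]  -- only T1 contributed
print(sum(partial_sums))  # 60, shown on the dashboard as the total waste level
```

Nothing in the response hints that T2 was dropped, so the dashboard total is always just T1's.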

In returning whatever results are available, we're assuming that's more useful to the client than no result at all. Is this a fair assumption? Perhaps we shouldn't make any assumptions about how the client will use the data...

Hidden partial results

The model shows how expensive queries get capped so that the number of records fetched from a table by any SQL query is limited to 10,000 or to a client-specified maximum set through the limit parameter. Let's call λ whichever of the two applies. On getting a list of length λ back from a call involving just one table, how can the client tell whether the query was capped and more data can be fetched?

In detail, suppose the client hits Quantum Leap with a single-table query whose predicate p (refer to the model) matches λ + n rows where n > 0. Looking at the algorithm in the model, we see the list returned to the client will only have λ entries. But how is the client supposed to know that another n records match the predicate p?

Perhaps the client explicitly set limit=λ and never actually wants more than λ results. Still, in general the problem remains: how can the client tell partial results apart from complete ones?
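Here's a little sketch of the situation, with made-up row data and a hypothetical fetch_capped function standing in for the model's capping step:

```python
DEFAULT_CAP = 10_000

def fetch_capped(matching_rows, limit=None):
    lam = limit if limit is not None else DEFAULT_CAP  # λ
    return matching_rows[:lam]       # anything past λ is silently dropped

rows = [{"id": i} for i in range(150)]   # predicate p matches λ + n rows

exact  = fetch_capped(rows[:100], limit=100)  # the table holds exactly λ matches
capped = fetch_capped(rows, limit=100)        # the table holds λ + 50 matches

print(len(exact), len(capped))  # 100 100 -- indistinguishable from the client's side
```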

Limit not honoured

Another problem with capping queries is that when results get collected from multiple tables, the size of the result list returned to the client may exceed 10,000 (the API default limit) or the value of the API limit parameter explicitly set by the client. Again, looking at the limit algorithm in the model, we can see this is a straight consequence of capping results per table without considering the accumulated number of results. So what's the scope of the API limit parameter? The spec seems to imply it should be global, i.e. apply to the entirety of records collected from all tables, but in actual fact it is local, i.e. it only applies to single tables.
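A quick sketch of the per-table capping, again with invented data rather than the real code, shows how the accumulated list blows past the limit:

```python
def multi_table_query(tables, limit):
    results = []
    for rows in tables:
        results.extend(rows[:limit])  # the cap is applied per table...
    return results                    # ...never to the accumulated list

tables = [[{"table": i, "row": j} for j in range(500)] for i in range(3)]
print(len(multi_table_query(tables, limit=200)))  # 600 -- three times the requested limit
```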

Too broad multi-table queries?

We saw that if there's no input entity type, then the query method targets all tables in a tenant DB as determined by the input FiWare service. (Have a look at the target function in the model.) Was it really meant to be this way? Doesn't it make more sense to let the client specify a subset of tables to query, e.g. through a list of entity types? The way it stands, the client is forced to filter any irrelevant entity data out of the returned list.

Maintenance

  • Query template/string manipulation instead of a language/interpreter approach; error-prone, difficult to extend; can only cover a small subset of the queries that'd be useful.
  • Lack of modularity, spaghetti code.
  • Lack of abstraction, copy & paste tech.
  • Untyped soup; requires more tests than we can afford to write.
  • Unit, integration and end-to-end tests all crammed together.
  • Test suite takes too long to run, wasting lots of dev time.

Security

  • Query template/string manipulation paves the way for SQL injection; see the sketch below.
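For the record, here's a generic illustration of the point rather than QuantumLeap's actual query builder; the table name, column and input value are invented, and placeholder syntax varies by DB driver.

```python
entity_id = "dev1' OR '1'='1"   # attacker-controlled input from the API call

# String manipulation: the input is spliced into the statement verbatim,
# turning a single-entity lookup into a full-table dump.
unsafe = "SELECT * FROM etdevice WHERE entity_id = '{}'".format(entity_id)
print(unsafe)  # SELECT * FROM etdevice WHERE entity_id = 'dev1' OR '1'='1'

# Parameterised query: the driver keeps the value out of the SQL text,
# so the quote characters can't change the meaning of the statement.
safe_stmt = "SELECT * FROM etdevice WHERE entity_id = ?"
safe_args = (entity_id,)
```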

Performance

The model highlights a bunch of potential performance hot spots, listed below. Don't jump to conclusions though: there's no substitute for measuring (profiling, benchmarking, etc.) production workloads. To be explicit: you're looking at guesstimates.

  • Entire record set sucked into memory, no streaming (constant space); reliance on client to make wise use of API limit parameter.
  • What if hundreds of queries run concurrently though? Limit may not be enough! Say there are 100 single-table queries running at just about the same time, each with a limit of 1,000 (very reasonable) and each matching 1,000 records of roughly 1KB each. Now you have about 100MB of raw data in memory that needs to be massaged into Python objects and then those objects get converted to JSON, still in memory. So your heap could easily grow to over 300MB in less than a couple of seconds. (See the back-of-the-envelope sketch after this list.)
  • For the same reason, even moderate concurrent load may result in significant slowdowns due to Python garbage collection or even OS thrashing.
  • Multi-table queries (e.g. attrs queries without an entity type) span all the tables in a tenant DB which, for large DBs, has the potential of producing result sets outstripping available processing power. (Have a look at the target function in the model.)
  • Multi-table queries require one separate DB call for each table, so they're only marginally more efficient than fetching data separately, one table at a time, e.g. through the types endpoint.
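For what it's worth, here's the back-of-the-envelope arithmetic behind the 100-queries example above, spelled out as a tiny script; the factor-of-3 overhead for Python objects plus in-memory JSON copies is an assumption, not a measurement.

```python
concurrent_queries = 100
rows_per_query     = 1_000   # each query hitting its (very reasonable) limit
bytes_per_row      = 1_024   # roughly 1KB of raw record data

raw_bytes = concurrent_queries * rows_per_query * bytes_per_row
print(raw_bytes / 2**20)     # ~98 MB of raw data in memory at once

overhead = 3                 # rough guess: Python objects + in-memory JSON copies
print(raw_bytes * overhead / 2**20)  # ~293 MB -- a heap in 300MB territory
```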
