Issues
The model should make it easier to spot potential issues and reason about solutions. Before we talk solutions though, let's list the issues. Then we can pick which items are actually worth considering. Basic questions to answer in order to make up our minds about what's worth our while addressing could include:
- How likely is the problem to happen?
- If it does happen, what are the consequences? E.g. is it something catastrophic or rather something we can live with?
- How much effort would it take to fix?
Looking at the failure scenarios from earlier and the possible flows through the model, we see the client may get back an empty list if
- The list of target tables is empty.
- The `PC` constraint (or `NAC` if the client requested a numeric aggregate) doesn't hold true across the board, i.e. it's false for all the tables touched by the query.
- For some tables the query brings back no rows whereas the `PC` (`NAC`) constraints fail for all the other tables.
- For each table, the query actually brings back no rows.
If there was an error, how is the client supposed to realise it? I.e. how can it distinguish the above cases so that it can fix the request? Perhaps it's not a big deal since a human will eventually get to see an empty dashboard and realise there's something fishy going on? Quite reasonable actually, but not a good assumption to make from an API design standpoint.
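To make the ambiguity concrete, here's a minimal sketch of the flow, with a made-up `collect_results` helper and parameter names rather than the model's actual definitions, showing how every one of those cases collapses into the same empty list:

```python
# Hypothetical sketch of the flow discussed above, not Quantum Leap's actual code.
def collect_results(tables, fetch, pc_holds):
    """Collect rows from every target table whose PC/NAC constraint holds."""
    results = []
    for table in tables:
        if not pc_holds(table):       # constraint violated: table silently skipped
            continue
        results.extend(fetch(table))  # may well bring back no rows at all
    return results

# All of the following return [] and look exactly the same to the client:
#   collect_results([], fetch, pc_holds)                    # no target tables
#   collect_results(tables, fetch, lambda t: False)         # constraint fails everywhere
#   collect_results(tables, lambda t: [], pc_holds)         # constraints hold, zero matching rows
```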
Suppose the query touches multiple tables but for some of them `PC` (or `NAC` depending on the case) doesn't hold true whereas there are tables that satisfy all required constraints and the query produces results for at least one of them. On getting a non-empty result list back, how can the client possibly tell the query was supposed to return more results, assuming the failed tables would've contributed some records?
Here's a pesky scenario. We have two tables `T1` and `T2` storing waste management device data. `T1` stores data for devices sold by manufacturer `M` and has a `level` numeric column whose values tell how full a waste container is. `T2`'s data comes from devices sold by another manufacturer and it too has a `level` column with the same meaning as `T1`'s, but the column type happens to be a string even though it was meant to be a number. (We know this can happen.) Now suppose the client application is a dashboard to show cumulative waste levels. The client requests a `sum` aggregate spanning both tables and then further sums the returned numbers to show the total waste level from all devices. Unfortunately, the total waste level will always equal the sum of levels in `T1`, and it could be quite some time before someone realises there's something wrong with that number.
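A rough simulation of that scenario, assuming, purely for illustration, that the `NAC` check simply drops any table whose target column isn't numeric; the data and helper below are made up, not taken from the codebase:

```python
# Hypothetical illustration of the T1/T2 scenario; names and the NAC check
# below are assumptions, not Quantum Leap's actual implementation.
tables = {
    "T1": {"level_type": "number", "levels": [40, 55, 70]},   # manufacturer M
    "T2": {"level_type": "string", "levels": ["35", "60"]},   # other manufacturer
}

def sum_aggregate(tables):
    """Return one per-table sum, skipping tables that fail the numeric check (NAC)."""
    sums = []
    for name, t in tables.items():
        if t["level_type"] != "number":   # NAC fails: T2 is silently dropped
            continue
        sums.append(sum(t["levels"]))
    return sums

# The dashboard adds up whatever comes back: 165, i.e. T1's total only.
total_waste = sum(sum_aggregate(tables))
```

The dashboard dutifully shows 165 even though the fleet's real total is 260, and nothing in the response hints at the missing contribution from `T2`.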
In returning whatever results are available, we're assuming that's going to be more useful to the client than no result at all. Is this a fair assumption? Perhaps we shouldn't make any assumptions about how the client will use the data...
The model shows how expensive queries get capped so that the number of records fetched from a table by any SQL query is limited to 10,000 or to a client-specified maximum set through the `limit` parameter. Let's call `λ` either number. On getting a list of length `λ` back from a call involving just one table, how can the client tell if the query was capped and more data can be fetched?
In detail, suppose the client hits Quantum Leap with a single-table query whose predicate `p` (refer to the model) matches `λ + n` rows where `n > 0`. Looking at the algorithm in the model, we see the list returned to the client will only have `λ` entries. But how is the client supposed to know that another `n` records match the predicate `p`?
Perhaps the client explicitly set `limit=λ` and doesn't actually ever want to get back more than `λ` results. Still, in general the problem is there: how does the client tell partial results apart from complete ones?
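A common client-side workaround, sketched below with a hypothetical `fetch_page(offset, limit)` wrapper around the query call rather than anything the current API promises, is to treat a full page as a hint that more data may exist and keep paging until a short page comes back:

```python
# Client-side heuristic; fetch_page(offset, limit) is a hypothetical wrapper
# around the query endpoint, not part of the current API contract.
def fetch_all(fetch_page, limit=1000):
    """Keep requesting pages until one comes back shorter than the limit."""
    rows, offset = [], 0
    while True:
        page = fetch_page(offset=offset, limit=limit)
        rows.extend(page)
        if len(page) < limit:   # short page: (probably) nothing left to fetch
            return rows
        offset += limit         # full page: there may well be more rows
```

The catch is exactly the ambiguity raised above: a page of `λ` rows could be the last one or it could be the cap kicking in, so the heuristic always costs an extra round trip to find out.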
Another problem with capping queries is that when results get collected from multiple tables, the size of the result list returned to the client may exceed 10,000 (the API default limit) or the value of the API `limit` parameter explicitly set by the client. Again, looking at the limit algorithm in the model, we can see this is a straight consequence of dropping results for each table but not considering the accumulated number of results. So what's the scope of the API limit parameter? The spec seems to imply it should be global, i.e. apply to the entirety of records collected from all tables, but in actual fact it is local, i.e. it only applies to single tables.
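In code terms the difference boils down to where the cap gets applied. Here's a minimal sketch of the two interpretations, with made-up function names and a `fetch(table, n)` assumed to return at most `n` rows from a single table:

```python
# Hypothetical sketch contrasting the two readings of the limit parameter;
# fetch(table, n) is assumed to return at most n rows from one table.

def query_local_cap(tables, fetch, limit):
    """What the model describes: each table is capped on its own,
    so the flattened result can grow up to limit * len(tables)."""
    results = []
    for table in tables:
        results.extend(fetch(table, limit))
    return results

def query_global_cap(tables, fetch, limit):
    """What the spec seems to promise: the cap applies to the
    accumulated result, never returning more than limit rows."""
    results = []
    for table in tables:
        room = limit - len(results)
        if room <= 0:
            break
        results.extend(fetch(table, room))
    return results
```

With three tables and `limit=10000`, the first version can hand back up to 30,000 records; the second never exceeds 10,000.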
We saw that if there's no input entity type, then the `query` method targets all tables in a tenant DB as determined by the input FiWare service. (Have a look at the `target` function in the model.) Was it really meant to be this way? Doesn't it make more sense to let the client specify a subset of tables to query, e.g. through a list of entity types? The way it stands, the client is forced to filter any irrelevant entity data out of the returned list.
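So a client only interested in, say, waste container data ends up doing something along these lines, where the record shape and the `entityType` field name are just assumptions for the sake of the example:

```python
# Hypothetical client-side filtering forced by the query-all-tables default;
# the record shape and the 'entityType' field name are assumptions.
results = [
    {"entityType": "WasteContainer", "level": 40},
    {"entityType": "Streetlight", "powerState": "on"},   # irrelevant to this dashboard
]
wanted = {"WasteContainer"}
relevant = [r for r in results if r.get("entityType") in wanted]
```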
- Query template/string manipulation instead of a language/interpreter approach; error-prone, difficult to extend; can only cover a small subset of queries that'd be useful.
- Lack of modularity, spaghetti code.
- Lack of abstraction, copy & paste tech.
- Untyped soup, requires more tests than we can afford to write.
- Unit, integration and end-to-end tests all crammed together.
- Test suite takes too long to run, wasting lots of dev time.
- Query template/string manipulation paves the way to SQL injection, as sketched below.
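To illustrate that last point (table and column names below are invented for the example, not lifted from the codebase), compare building the SQL string by hand with letting the driver bind parameters:

```python
# Illustrative only; identifiers are made up for this example.
entity_id = "urn:x' OR '1'='1"   # attacker-controlled input

# String templating: the stray quote breaks out of the literal and turns
# the WHERE clause into a tautology that matches every row.
unsafe_sql = f"SELECT * FROM etdevice WHERE entity_id = '{entity_id}'"

# Parameter binding: the driver treats the value strictly as data.
safe_sql = "SELECT * FROM etdevice WHERE entity_id = ?"
safe_args = (entity_id,)
# e.g. cursor.execute(safe_sql, safe_args) with a DB-API style driver
```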
The model highlights a bunch of potential performance hot spots, listed below. Don't jump to conclusions though: there's no substitute for measuring (profiling, benchmarking, etc.) production workloads. Put explicitly: you're looking at guesstimates.
- Entire record set sucked into memory; no streaming (which would keep space constant); reliance on the client to make wise use of the API limit parameter.
- What if hundreds of queries run concurrently though? The limit may not be enough! Say there are 100 single-table queries running at just about the same time, each with a limit of 1,000 (very reasonable) and each matching at least 1,000 records of roughly 1KB each. Now you have about 100MB of raw data in memory that needs to be massaged into Python objects, and then those objects get converted to JSON, still in memory. So your heap could easily grow to over 300MB in less than a couple of seconds. (See the back-of-the-envelope calculation after this list.)
- For the same reason, even moderate concurrent load may result in significant slowdowns due to Python garbage collection or even OS thrashing.
- Multi-table queries (e.g. `attrs` queries without an entity type) span all the tables in a tenant DB which, for large DBs, has the potential of producing result sets outstripping available processing power. (Have a look at the `target` function in the model.)
- Multi-table queries require one separate DB call for each table, so they're only marginally more efficient than fetching data separately, one table at a time, e.g. through the `types` endpoint.
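For what it's worth, here's the back-of-the-envelope calculation behind the 100MB figure in the concurrency bullet above; the threefold in-memory expansion (raw data, Python objects, JSON text) is an assumption, not a measurement:

```python
# Rough numbers for the concurrency scenario above; the 3x expansion factor
# (raw data + Python objects + serialised JSON) is an assumption.
concurrent_queries = 100
rows_per_query = 1_000     # each query returns a full limit's worth of rows
row_size_bytes = 1_024     # ~1KB of raw data per record

raw_mb = concurrent_queries * rows_per_query * row_size_bytes / 1e6
print(raw_mb)              # ~102 MB of raw data held in memory at once
print(3 * raw_mb)          # ~307 MB once object and JSON copies pile up
```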