get() handles non-existent q_name cost-effectively #15

Open
mhjohnson wants to merge 3 commits into master

Conversation

@mhjohnson commented Sep 5, 2016

Hi @malthe -
This is a simple fix intended to address Issue #14 that @3manuek posted.

We discovered some very long running queries on our database today that were due to a worker trying to get() a task for a q_name that did not exist in our table. This situation can occur when the rows of a q_name are purged from the tasks table and then put() occurs very infrequently. If the worker keeps calling get() repeatedly in this scenario, it results in the costly queries that @3manuek described (especially on a big table with many other q_names).

I'm no wizard, but these changes seem safe and require little refactoring of the tests. I used time.sleep(timeout) in lieu of the Unix select.select(timeout), since it didn't seem appropriate to monitor the file descriptor in this specific context.
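
Roughly, the approach is the sketch below. This is a hypothetical helper built on top of pq's public get(block=False) call rather than the PR's actual diff; the helper and constant names are illustrative:

    import time

    POLL_INTERVAL = 1.0  # seconds; illustrative back-off interval

    def get_with_backoff(queue, interval=POLL_INTERVAL):
        # Non-blocking pull; pq returns None when there is no task
        # queued under this q_name.
        task = queue.get(block=False)
        if task is None:
            # Nothing to do for this q_name: back off with a plain
            # sleep instead of select()-ing on the connection's file
            # descriptor, so the worker doesn't wake up (and re-query)
            # on notifications for unrelated queues.
            time.sleep(interval)
            return None
        return task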

I hope it works out! If not, I'd love to hear your thoughts on a better approach.

Thanks,
-Matt

@malthe (Owner) commented Sep 5, 2016

I think it's the wrong way to fix this issue.

It might actually just work if you use something like this:

    SET statement_timeout TO <timeout>;
    QUERY;
    RESET statement_timeout;

I think you can fold those into a WITH block to avoid extra queries.

You would put that timeout configuration in the _pull_item method. The actual SQL code is right there in the docstring.
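
A minimal sketch of that suggestion, assuming a psycopg2 cursor (the helper name is illustrative; note that statement_timeout is in milliseconds by default):

    def execute_with_timeout(cursor, timeout_ms, query, params=()):
        # Bound the query's execution time, then restore the session
        # default even if the query raises a timeout error.
        cursor.execute("SET statement_timeout TO %s", (timeout_ms,))
        try:
            cursor.execute(query, params)
            return cursor.fetchone()
        finally:
            cursor.execute("RESET statement_timeout")

Alternatively, SET LOCAL statement_timeout inside the surrounding transaction makes the RESET unnecessary, since the setting reverts on commit or rollback.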

@mhjohnson (Author)

@malthe I think I described the issue poorly by using the term 'long running queries'. Please take a look at the extra details I posted in the discussion of the issue and let me know what you think. I might be misunderstanding your suggestion, but I don't think aborting long running queries will solve the high call volume.

@3manuek (Contributor) commented Sep 5, 2016

@malthe @mhjohnson

I don't think it would be a good idea to set a timeout in this case. The problem does not lie in the query execution length; it is how much noise the async calls generate in Postgres's internal central notification queue.

The fix to the q_name indexing already helped reduce the locks over the index records that the "selected" part of the CTE returns (the q_name filter now narrows the set of index records to be locked). However, there is still an additional delay when the number of selected rows is considerable. In other words, the current implementation of the async queues is not scalable.

I'm thinking that the only way to scale these calls is to split the queue table into different databases in the same instance, which amounts to sharding by q_name.

Currently, every backend connected to the same database will listen and wake up on notifications in that database, triggering the query execution. Queries that find no rows just add noise to the channels and cause additional query executions.

@malthe (Owner) commented Sep 5, 2016

Okay, I got it now.

I think it makes sense to consider if LISTEN/NOTIFY should be qualified with both the table name and q_name.

Yes, it doesn't scale to many such pairs, but it does give good performance in the case of up to a dozen queues, with some idle and some active.

Plus it leaves an option for the PostgreSQL developers to improve the performance if there are users that run into a limitation here.

Wdyt?

@3manuek (Contributor) commented Sep 6, 2016

@malthe I think that is a good idea (see below); however, I still don't know whether it will avoid all the queries with a non-existent q_name (though it looks like it will for cases where q_names do not match across different tables). The case that @mhjohnson is describing seems to come from another part of the code: "This situation can occur when the rows of a q_name are purged from the tasks table and then put() occurs very infrequently."

https://github.com/malthe/pq/blob/master/pq/create.sql

create function pq_notify() returns trigger as $$ begin
  perform pg_notify(TG_TABLE_NAME || '_' || new.q_name, '');
  return null;
end $$ language plpgsql;

https://github.com/malthe/pq/blob/master/pq/__init__.py#L250:

    def _listen(self, cursor):
        # Qualify the channel with both the table name and the q_name,
        # matching the channel built by the pq_notify() trigger above.
        cursor.execute('LISTEN "%s_%s"' % (self.table, self.name))

This would avoid running the query for listeners of the same q_name on a different table. However, as the notification queue is per-database, each listening backend will still be woken up:

 *   3. Every backend that is listening on at least one channel registers by
 *    entering its PID into the array in AsyncQueueControl. It then scans all
 *    incoming notifications in the central queue and first compares the
 *    database OID of the notification with its own database OID and then
 *    compares the notified channel with the list of channels that it listens
 *    to. In case there is a match it delivers the notification event to its
 *    frontend.  Non-matching events are simply skipped.

After reviewing asyncQueueProcessPageEntries in the Postgres source (async.c), NotifyMyFrontEnd happens only after IsListeningOn(channel) passes. This means that every notification in the same database still costs each backend some computation to discard entries from transactions in progress and from channels it isn't listening on. However, splitting the channels as above avoids the NotifyMyFrontEnd call, reducing the notification interruptions considerably.
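
For illustration, a worker listening on such a qualified channel with psycopg2 could look like this (the DSN, table, and queue names are hypothetical):

    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=example")  # hypothetical DSN
    conn.set_isolation_level(
        psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cursor = conn.cursor()
    # Channel is "<table>_<q_name>", matching the pq_notify() trigger.
    cursor.execute('LISTEN "tasks_jobs"')

    while True:
        # Wait up to 5 seconds for the backend to deliver a notification
        # on this channel; unrelated channels never reach this process.
        if select.select([conn], [], [], 5.0) == ([], [], []):
            continue  # timed out, nothing delivered
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            print("notified on channel", notify.channel)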

@malthe (Owner) commented Sep 6, 2016

By "backend" here you mean the PostgreSQL server process that handles a single client connection (session) – ?

@3manuek (Contributor) commented Sep 7, 2016

@malthe Yes, that extract is from the Postgres code, so it refers to server child processes handling connections.

@malthe (Owner) commented Sep 7, 2016

I don't see how it's a huge problem that the backend (which is a child process on the PG server) does this filtering. You shouldn't have very many backends when using PostgreSQL (use a dedicated proxy if you can't work around it in your application). And if you have very many queues, then I think it could be an indication that your overall design is wrong.

My suggestion would be to encourage the use of an application-level message bus to multiplex messages from (relatively) few queues to multiple handlers.


@3manuek (Contributor) commented Sep 8, 2016

@malthe The problem is that the filtering is not done by only one backend: every backend that is listening on at least one channel has an entry in the async array. We know that the design will eventually need to step away from a transactional queue toward an app-level queue system, but we need to keep it as it is until then.

With the change proposed above (splitting the listeners), I'm pretty confident that CPU usage will drop considerably in certain scenarios (when you split the queues in the same database).

@malthe (Owner) commented Sep 8, 2016

It seems like there's a less than perfect behavior there in the PG codebase. But then again, they have always maintained that you can't really do queues.

An inline review comment on this snippet from the diff:

        (self.table, self.name)
    )
    row = cursor.fetchone()
    self.has_records = True if row else False

suggests simplifying the last line to:

    self.has_records = bool(row)
