get() handles non-existent q_name cost-effectively #15

Open
mhjohnson wants to merge 3 commits into master

Conversation

@mhjohnson commented Sep 5, 2016

Hi @malthe -
This is a simple fix intended to address Issue #14 that @3manuek posted.

We discovered some very long running queries on our database today that were due to a worker trying to get() a task for a q_name that did not exist in our table. This situation can occur when the rows of a q_name are purged from the tasks table and then put() occurs very infrequently. If the worker keeps calling get() repeatedly in this scenario, it results in the costly queries that @3manuek described (especially on a big table with many other q_names).

I'm no wizard, but these changes seem safe and require little refactoring of the tests. I used time.sleep(timeout) in lieu of the Unix select.select(timeout), since it didn't seem appropriate to monitor the file descriptor in this specific context.
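
Roughly, the approach is the sketch below. This is a hypothetical helper built on top of pq's public get(block=False) call rather than the PR's actual diff; the helper and constant names are illustrative:

    import time

    POLL_INTERVAL = 1.0  # seconds; illustrative back-off interval

    def get_with_backoff(queue, interval=POLL_INTERVAL):
        # Non-blocking pull; pq returns None when there is no task
        # queued under this q_name.
        task = queue.get(block=False)
        if task is None:
            # Nothing to do for this q_name: back off with a plain
            # sleep instead of select()-ing on the connection's file
            # descriptor, so the worker doesn't wake up (and re-query)
            # on notifications for unrelated queues.
            time.sleep(interval)
            return None
        return task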

I hope it works out! If not, I'd love to hear your thoughts on a better approach.

Thanks,
-Matt

@malthe (Owner) commented Sep 5, 2016

I think it's the wrong way to fix this issue.

It might actually just work if you use something like this:

    SET statement_timeout TO <timeout>;
    QUERY;
    RESET statement_timeout;

I think you can fold those into a WITH block to avoid extra queries.

You would put that timeout configuration in the _pull_item method. The actual SQL code is right there in the docstring.
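
A minimal sketch of that suggestion, assuming a psycopg2 cursor (the helper name is illustrative; note that statement_timeout is in milliseconds by default):

    def execute_with_timeout(cursor, timeout_ms, query, params=()):
        # Bound the query's execution time, then restore the session
        # default even if the query raises a timeout error.
        cursor.execute("SET statement_timeout TO %s", (timeout_ms,))
        try:
            cursor.execute(query, params)
            return cursor.fetchone()
        finally:
            cursor.execute("RESET statement_timeout")

Alternatively, SET LOCAL statement_timeout inside the surrounding transaction makes the RESET unnecessary, since the setting reverts on commit or rollback.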

@mhjohnson (Author)

@malthe I think I described the issue poorly by using the term 'long running queries'. Please take a look at the extra details I posted in the discussion of the issue and let me know what you think. I might be misunderstanding your suggestion, but I don't think aborting long running queries will solve the high call volume.

@3manuek (Contributor) commented Sep 5, 2016

@malthe @mhjohnson

I don't think it would be a good idea to set a timeout in this case. The problem does not lie in the query execution length; it is how much noise the async calls generate in Postgres's internal central notification queue.

The fix to the q_name indexing already helped reduce the locks over the index records that the "selected" part of the CTE returns (the q_name filter now narrows the set of index records to be locked). However, there is still an additional delay when the number of selected rows is considerable. In other words, the current implementation of the async queues is not scalable.

I'm thinking that the only way to scale these calls is to split the queue table into different databases in the same instance, which amounts to sharding by q_name.

Currently, every backend connected to the same database will listen and wake up on notifications in that database, triggering the query execution. Queries that find no rows just add noise to the channels and cause additional query executions.

@malthe (Owner) commented Sep 5, 2016

Okay, I got it now.

I think it makes sense to consider if LISTEN/NOTIFY should be qualified with both the table name and q_name.

Yes, it doesn't scale to many such pairs, but it does give good performance in the case of up to a dozen queues, with some idle and some active.

Plus it leaves an option for the PostgreSQL developers to improve the performance if there are users that run into a limitation here.

Wdyt?

@3manuek (Contributor) commented Sep 6, 2016

@malthe I think that is a good idea (see below); however, I still don't know whether it will avoid all the queries with a non-existent q_name (though it looks like it will for cases where q_names do not match across different tables). The case that @mhjohnson is describing seems to come from another part of the code: "This situation can occur when the rows of a q_name are purged from the tasks table and then put() occurs very infrequently."

https://github.com/malthe/pq/blob/master/pq/create.sql

create function pq_notify() returns trigger as $$ begin
  perform pg_notify(TG_TABLE_NAME || '_' || new.q_name, '');
  return null;
end $$ language plpgsql;

https://github.com/malthe/pq/blob/master/pq/__init__.py#L250:

    def _listen(self, cursor):
        # Qualify the channel with both the table name and the q_name,
        # matching the channel built by the pq_notify() trigger above.
        cursor.execute('LISTEN "%s_%s"' % (self.table, self.name))

This would avoid running the query for listeners of the same q_name on a different table. However, as the notification queue is per-database, each listening backend will still be woken up:

 *   3. Every backend that is listening on at least one channel registers by
 *    entering its PID into the array in AsyncQueueControl. It then scans all
 *    incoming notifications in the central queue and first compares the
 *    database OID of the notification with its own database OID and then
 *    compares the notified channel with the list of channels that it listens
 *    to. In case there is a match it delivers the notification event to its
 *    frontend.  Non-matching events are simply skipped.

After reviewing asyncQueueProcessPageEntries in the Postgres source (async.c), NotifyMyFrontEnd happens only after IsListeningOn(channel) passes. This means that every notification in the same database still costs each backend some computation to discard entries from transactions in progress and from channels it isn't listening on. However, splitting the channels as above avoids the NotifyMyFrontEnd call, reducing the notification interruptions considerably.
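
For illustration, a worker listening on such a qualified channel with psycopg2 could look like this (the DSN, table, and queue names are hypothetical):

    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=example")  # hypothetical DSN
    conn.set_isolation_level(
        psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cursor = conn.cursor()
    # Channel is "<table>_<q_name>", matching the pq_notify() trigger.
    cursor.execute('LISTEN "tasks_jobs"')

    while True:
        # Wait up to 5 seconds for the backend to deliver a notification
        # on this channel; unrelated channels never reach this process.
        if select.select([conn], [], [], 5.0) == ([], [], []):
            continue  # timed out, nothing delivered
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            print("notified on channel", notify.channel)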

@malthe (Owner) commented Sep 6, 2016

By "backend" here you mean the PostgreSQL server process that handles a single client connection (session) – ?

@3manuek (Contributor) commented Sep 7, 2016

@malthe Yes, that extract is from the Postgres code, so it refers to server child processes handling connections.

@malthe (Owner) commented Sep 7, 2016

I don't see how it's a huge problem that the backend (which is a child process on the PG server) does this filtering. You shouldn't have very many backends when using PostgreSQL (use a dedicated proxy if you can't work around it in your application). And if you have very many queues, then I think it could be an indication that your overall design is wrong.

My suggestion would be to encourage the use of an application-level message bus to multiplex messages from (relatively) few queues to multiple handlers.


@3manuek (Contributor) commented Sep 8, 2016

@malthe The problem is that the filtering is not done by only one backend: every backend that is listening on at least one channel has an entry in the async array. We know that the design will eventually need to step away from a transactional queue toward an app-level queue system, but we need to keep it as it is until then.

With the change proposed above (splitting the listeners), I'm pretty confident that CPU usage will drop considerably in certain scenarios (when you split the queues in the same database).

@malthe (Owner) commented Sep 8, 2016

It seems like there's a less than perfect behavior there in the PG codebase. But then again, they have always maintained that you can't really do queues.

An inline review comment on this snippet from the diff:

        (self.table, self.name)
    )
    row = cursor.fetchone()
    self.has_records = True if row else False

suggests simplifying the last line to:

    self.has_records = bool(row)
