Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Session reset callbacks cannot be used with the background reader #1673

Closed
adriansmares opened this issue Jul 5, 2023 · 7 comments
Closed
Labels

Comments

@adriansmares
Copy link
Contributor

adriansmares commented Jul 5, 2023

Describe the bug

The background reader introduced in #1629 maintains the FD lock over the underlying net.Conn / file descriptor. As such, because the background reader keeps the file descriptor lock, the session reset callback cannot do a low level file descriptor Read in order to check if the connection has been gracefully ended or not.

To Reproduce

We use this implementation of a session reset fallback.

Because the background reader is still operating in the background after a connection has been used for a query at least once, the session reset fallback basically deadlocks while waiting for the background read to end (which it never does, since the background reader has no deadline).

goroutine 84 [semacquire, 4 minutes]:
internal/poll.runtime_Semacquire(0x4058cf?)
	/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/sema.go:67 +0x27
internal/poll.(*fdMutex).rwlock(0xc0001ba780, 0xd0?)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_mutex.go:154 +0xd2
internal/poll.(*FD).readLock(...)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_mutex.go:221
internal/poll.(*FD).RawRead(0xc0001ba780, 0xc0018342d0)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_unix.go:755 +0x45
net.(*rawConn).Read(0xc001814cc0, 0x2e22328?)
	/opt/hostedtoolcache/go/1.20.5/x64/src/net/rawconn.go:43 +0x45
go.thethings.network/lorawan-stack/v3/pkg/util/store.checkConn(***0x2e22328, 0xc000d50a20***)
	/home/runner/work/lorawan-stack/lorawan-stack/pkg/util/store/conncheck_unix.go:45 +0xba

...

 goroutine 1010 [IO wait, 4 minutes]:
internal/poll.runtime_pollWait(0x7ff1457afe78, 0x72)
	/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc0001ba780?, 0xc001828000?, 0x0)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001ba780, ***0xc001828000, 0x2000, 0x2000***)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_unix.go:167 +0x299
net.(*netFD).Read(0xc0001ba780, ***0xc001828000?, 0xc000e5b560?, 0xc001786f28?***)
	/opt/hostedtoolcache/go/1.20.5/x64/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc000d50a20, ***0xc001828000?, 0xc0011de750?, 0xc001810ab0?***)
	/opt/hostedtoolcache/go/1.20.5/x64/src/net/net.go:183 +0x45
github.com/jackc/pgx/v5/pgconn/internal/bgreader.(*BGReader).bgRead(0xc0011de700)
	/home/runner/go/pkg/mod/github.com/jackc/pgx/v5@v5.4.1/pgconn/internal/bgreader/bgreader.go:68 +0x98
created by github.com/jackc/pgx/v5/pgconn/internal/bgreader.(*BGReader).Start
	/home/runner/go/pkg/mod/github.com/jackc/pgx/v5@v5.4.1/pgconn/internal/bgreader/bgreader.go:40 +0xea

Expected behavior

The session reset fallback should have access to the underlying net.Conn bidirectionally.

Actual behavior

The background reader keeps the file descriptor of the net.Conn locked in the background.

Version

  • Go: $ go version -> go version go1.20.5 darwin/arm64
  • PostgreSQL: $ psql --no-psqlrc --tuples-only -c 'select version()' -> PostgreSQL 14.8 (Debian 14.8-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
  • pgx: $ grep 'github.com/jackc/pgx/v[0-9]' go.mod -> v5.4.1

Additional context

Perhaps the *PgConn should have some form of 'peek' method that allows the session reset callback to check the error of the last unread Read call ? It is racy, but we could check for io.EOF there and return driver.ErrBadConn.

Another possible solution is to make the exitPotentialWriteReadDeadlock set a deadline on the connection and read the error perhaps. This is racy again.

A non racy solution would be to allow *PgConn users to stop the background reader and wait for it close, but I am not sure if interrupting the Read intentionally via deadlines is the best way of achieving this.

@adriansmares
Copy link
Contributor Author

I think also the background reader broke Hijack, since it is not stopped before the hijacked connection is returned.

@jackc
Copy link
Owner

jackc commented Jul 8, 2023

I've merged #1670. Will that also resolve this?

@adriansmares
Copy link
Contributor Author

Unfortunately not. The background reader is still there, so other entities that will try to use the underlying net.Conn will have to fight with it. I have proposed some solutions in the issue description, but none of them are clean from my perspective.

I understand the need for the background reader, and the issue that it solves (deadlocks due to full write buffers on both sides of a connection) - I just don't have a good solution regarding how to stop the background reader when it is not used (i.e. after exitPotentialWriteReadDeadlock).

In our project we have removed the session reset callback and rely on the health checks that PgxConn does periodically. It seems that they will suffice for production environments - I just hope we don't end up with some extra flaky tests (we don't see this so far).

@jackc
Copy link
Owner

jackc commented Jul 8, 2023

If I understand correctly, the background reader will only be activated if there was a slow write. But then the background goroutine will continue running even after exitPotentialWriteReadDeadlock until it returns from a Read.

In theory SetReadDeadline would be the cleanest approach to stopping the backgrounder reader. But that introduces some trickier concurrency. It also might be an issue for golang.org/x/crypto/ssh.Conn as it implements SetReadDeadline as always returning an error. If we're not careful, we break the ability to run pgx through an SSH tunnel -- and pgx v5.4.0+ just fixed that.

The only other option for stopping it seems to be to induce a read via Ping.

I suppose we could introduce some method(s) to expose the state of the background reader. If it is running we can consider the connection alive. Not too excited about exposing internal state though. Don't want to get locked into this particular implementation of write deadlock prevention.

@adriansmares
Copy link
Contributor Author

In theory SetReadDeadline would be the cleanest approach to stopping the backgrounder reader. But that introduces some trickier concurrency. It also might be an issue for golang.org/x/crypto/ssh.Conn as it implements SetReadDeadline as always returning an error. If we're not careful, we break the ability to run pgx through an SSH tunnel -- and pgx v5.4.0+ just fixed that.

I agree with this assessment - SetReadLine is the closest solution, and it still is not clean enough for other net.Conn implementations.

The only other option for stopping it seems to be to induce a read via Ping.

Indeed, we could Stop(), then trigger a Ping, but that's an extra round-trip that probably would affect throughput.

I suppose we could introduce some method(s) to expose the state of the background reader. If it is running we can consider the connection alive. Not too excited about exposing internal state though. Don't want to get locked into this particular implementation of write deadlock prevention.

That's fair. Unless someone comes up with a smarter solution regarding how we could manage the lifetime of the background reader, I think the only action point for now is to ensure that the background reader is stopped in Hijack.

jackc added a commit that referenced this issue Jul 8, 2023
This provides a way to ensure it is safe to directly read or write to
the underlying net.Conn.

#1673
@jackc
Copy link
Owner

jackc commented Jul 8, 2023

@adriansmares Please take a look at a potential solution in PR #1679.

jackc added a commit that referenced this issue Jul 12, 2023
This provides a way to ensure it is safe to directly read or write to
the underlying net.Conn.

#1673
@jackc
Copy link
Owner

jackc commented Jul 12, 2023

I believe this should be resolved in v5.4.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants