Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow parsing arrays without connection #841

Closed
alexolog opened this issue May 27, 2024 · 25 comments
Closed

Allow parsing arrays without connection #841

alexolog opened this issue May 27, 2024 · 25 comments

Comments

@alexolog
Copy link

alexolog commented May 27, 2024

Currently, parsing an array requires an active connection, which is used only to retrieve the client encoding. This prevents using containers inside tuples for automatic conversion and requires jumping through hoops to parse arrays.

This also applies to pqxx::array constructor.

However, in many cases the encoding is known in advance, especially when developing a product where both the client and the server are controlled by the same actor.

I believe it would be better if we could have a standalone function to globally set the encoding to a constant or from a connection, and it would persist until changed.

@tt4g
Copy link
Contributor

tt4g commented May 27, 2024

There is a constructor pqxx::array that can be created without a connection by passing pqxx::internal::encoding_group as the 2nd argument.

explicit array(std::string_view data, pqxx::internal::encoding_group enc)
{
using group = pqxx::internal::encoding_group;
switch (enc)
{
case group::MONOBYTE: parse<group::MONOBYTE>(data); break;
case group::BIG5: parse<group::BIG5>(data); break;
case group::EUC_CN: parse<group::EUC_CN>(data); break;
case group::EUC_JP: parse<group::EUC_JP>(data); break;
case group::EUC_KR: parse<group::EUC_KR>(data); break;
case group::EUC_TW: parse<group::EUC_TW>(data); break;
case group::GB18030: parse<group::GB18030>(data); break;
case group::GBK: parse<group::GBK>(data); break;
case group::JOHAB: parse<group::JOHAB>(data); break;
case group::MULE_INTERNAL: parse<group::MULE_INTERNAL>(data); break;
case group::SJIS: parse<group::SJIS>(data); break;
case group::UHC: parse<group::UHC>(data); break;
case group::UTF8: parse<group::UTF8>(data); break;
default: PQXX_UNREACHABLE; break;
}
}

However, it is a private API, cannot be used by applications.
The reason why this is a private API cannot be read from the source code, but there must be a reason.

@jtv
Copy link
Owner

jtv commented May 27, 2024

The reason is that so far, encodings and encoding groups are entirely hidden from the application. If we're going to introduce it into the API, then we'll have to answer some questions:

  • As the API evolves, will we ever need to know the actual encoding, or is just the encoding group enough?
  • Do we need stability guarantees for the enum as encodings and encoding groups are added or removed?
  • Will we be able to work encoding awareness into the string conversion API in the future?

@alexolog
Copy link
Author

Adding the following constructor can alleviate the problem:

  explicit array(std::string_view data, std::string_view encoding_name = "UTF8") :
          array{data, pqxx::internal::enc_group(encoding_name)}
  {}

@jtv
Copy link
Owner

jtv commented May 27, 2024

Yes, I suppose using the encoding's name would solve it. I'd probably want a way for the caller to move that string lookup out of loops though.

We'd also need to check lifetime guarantees. If that encoding name happens to live inside a buffer held by the libpq-level connection struct somewhere, then closing the connection would invalidate that memory and we'd be back to the original problem.

I wonder if it would help to encapsulate encoding in a class. You'd construct an object using an encoding name, but internally it might store just the encoding group. No transparency of enums, no persistent storage of the numerical values.

@alexolog
Copy link
Author

alexolog commented May 27, 2024

I wonder if it would help to encapsulate encoding in a class. You'd construct an object using an encoding name, but internally it might store just the encoding group. No transparency of enums, no persistent storage of the numerical values.

It probably will.

That said, I would go one step further and allow for programmatically setting a "global" persistent encoding that would be used whenever array and similar objects are constructed.

The existing interface will use the encoding from the supplied connection, and the new one (that does not accept a connection parameter) will use the globally set encoding.

@fallenworld1
Copy link
Contributor

fallenworld1 commented May 27, 2024

the problem we're trying to solve here is we're unable to write something like this:

auto array = row[field_id].as<array<int>>();

// or

using Tuple = std::tuple<pqxx::array<int>>;
auto tuple = row.as<Tuple>();

because as internally only passes view

if you could create template specification for as that would work for pqxx::array (or maybe even for vector like containers) - it would keep groups internal

e.g.:

  template<T> pqxx::array<T> as() const
  {
    if (is_null())
    {
      if constexpr (not nullness<T>::has_null)
        internal::throw_null_conversion(type_name<T>);
      else
        return nullness<T>::null();
    }
    else
    {
      return  pqxx::array<T>(this->view(), this->m_home.m_encoding);
    }
  }

changing from_string to take field instead of string_view may also work

@alexolog
Copy link
Author

alexolog commented May 27, 2024

It seems that we only pass connections for encoding in array, so we could do something quick like this:

template<...>
class array final
{
public:
  explicit array(std::string_view data) :
    array{data, m_default_enc_group}
  {}

  static void set_default_encoding(std::string_view encoding_name) {
    m_default_enc_group = pqxx::internal::enc_group(encoding_name); // may throw
    m_default_enc_name = std::string(encoding_name);
  }

  static [[nodiscard]] const std::string& get_default_encoding() {
    return m_default_enc_name;
  }

private:
  static std::string m_default_enc_name{"UTF8"};
  static pqxx::internal::encoding_group m_default_enc_group{UTF8};
};

Alternatively, the "default encoding" could be a static property of connection, and array will query it.

@alexolog
Copy link
Author

(or maybe even for vector like containers)

C++20 ranges would be nice :)

@alexolog
Copy link
Author

libpqxx needs to be aware in various situations of where the next character begins. Without that, for example, it might stumble when it sees a byte that happens to be the same numeric value as an ASCII quote or backslash, but is actually just a byte inside a multi-byte character value. It's a painful problem with real security implications.

This cannot happen with UTF-8. I am not familiar with other multi-byte encodings.

That said, the problem we are trying to solve is array parsing without an active connection, when both the client and server encodings are known.

@jtv
Copy link
Owner

jtv commented May 28, 2024

@fallenworld1 the problem with specialising field::as() is that it's not the only place where we use string conversion. We'd probably have to do that for more functions. But much worse, the trick won't work for streaming queries.

@jtv
Copy link
Owner

jtv commented May 28, 2024

@alexolog two points that I want to get out of the way:

  1. Server encoding is (thankfully) irrelevant here. It's client encoding that matters. (Unfortunately it can change on the fly, even though it rarely will.)
  2. I'm not going to introduce a global or static variable for the default encoding.

Now, one thing we could try doing, is make result aware of its encoding. I see two advantages:

  • can support field::as() and streaming queries more or less equally
  • behaves sensibly when client encoding changes on the fly

That still leaves us with the problem of passing the encoding into the string conversions. This is in a way the hard one: it affects the string conversion API, and therefore it may affect people who have written their own string conversions.

@jtv
Copy link
Owner

jtv commented Jun 6, 2024

Phew. I finally got the documentation working again. That's been a big de-motivating factor for the last three months. @alexolog @fallenworld1 I'd like to focus on this issue next, barring any urgent problems elsewhere, since we seem to have a clear direction.

@jtv
Copy link
Owner

jtv commented Jun 23, 2024

Again there were lots of things to distract me but I did discover that pqxx::result is already aware of its encoding.

So I'm now thinking along the lines of @fallenworld1's idea.

First step: Add a completely separate function for reading an SQL array field as a pqxx::array<...>.

This new function would have an entirely new name. That means you can't use it in row-at-a-time conversions. So yes, its utility would be limited. Unless I find it's entirely safe to provide conversions such as pqxx::field::as<pqxx::array<...>>(). I'm just thinking out loud here, just getting back into the subject.

Second step: Start integrating the new function into the existing field conversions system, so you can use the regular field::as() template on array types.

Supporting conversion straight to some "C++-native" type like std::vector<int> is not trivial, because we get to deal with those conflicting concepts etc. This should probably wait until some time in the libpqxx 8.0 cycle, which will require a compiler that supports concepts.

Third step: Make this work with streaming queries.

If we're lucky we can actually do this before the second step. We'll see.

Fourth step: Can we make this work with pqxx::field::to<pqxx::array<...>>()? Or default values? What else?

@jtv
Copy link
Owner

jtv commented Jun 23, 2024

I'm trying the first step.

(At the second step I run into the problem that I need a partial template specialisation for pqxx::field::as(). Current C++ does not allow partial specialisations of template functions — only for classes and variables. IIRC future C++ versions may fix this.)

@jtv
Copy link
Owner

jtv commented Jun 30, 2024

@alexolog, @fallenworld1, @tt4g — I think #854 would take care of the first step. Any thoughts?

@fallenworld1
Copy link
Contributor

@jtv couple of thoughts on second step - you can hide your encoding type inside

struct ConnectionEncoding {};

array_paser(string_view data, const ConnectionEncoding *enc): encoding(std::bitcast<pqxx::internal::encoding_group>(enc))

// or something like this
class ConnectionEncoding 
{
private:
    ConnectionEncoding(pqxx::internal::encoding_group);

    friend fiel, array, array_parser;
    int val
};

@tt4g
Copy link
Contributor

tt4g commented Jun 30, 2024

LGTM.

@jtv
Copy link
Owner

jtv commented Jun 30, 2024

@jtv couple of thoughts on second step - you can hide your encoding type inside

struct ConnectionEncoding {};

array_paser(string_view data, const ConnectionEncoding *enc): encoding(std::bitcastpqxx::internal::encoding_group(enc))

@fallenworld1 I think this is more or less what we've already discussed. Be aware however that pqxx::array_parser is an awful API, and pqxx::array supersedes it.

@jtv
Copy link
Owner

jtv commented Jun 30, 2024

LGTM.

Thanks @tt4g!

@jtv
Copy link
Owner

jtv commented Jul 5, 2024

@alexolog I suppose field::as_sql_array() actually solves this ticket! Not perfectly, but we can expand on it gradually. Mind if I close this one?

jtv added a commit that referenced this issue Jul 5, 2024
It's annoying that there is already a `field::as_array()`, which returns an obsolete `array_parser` with its clunky API.  I'm being aggressive in deprecating that one, so that eventually the _new_ function can assume its name without clashing.
@jtv
Copy link
Owner

jtv commented Jul 20, 2024

There's more work to be done, but not quite as urgent I suppose.

@jtv jtv closed this as completed Jul 20, 2024
@alexolog
Copy link
Author

alexolog commented Feb 3, 2025

@alexolog I suppose field::as_sql_array() actually solves this ticket! Not perfectly, but we can expand on it gradually. Mind if I close this one?

Hi @jtv

Unfortunately we lost track of this issue, so I apologize for the late reply.

This does not really work for us since we use the to() and as() functions throughout (in an intermediate layer), and they use from_string() internally, which is not overloaded/specialized for arrays.

Can you please take another look?

@jtv
Copy link
Owner

jtv commented Feb 4, 2025

@alexolog that's that part that's not so easy. I think it requires...

  1. Compiler support for concepts, and therefore, C++20 as a minimum. That means 8.0 at the earliest, but I'm working on that right now.
  2. A change in the string conversions API. This could be a biggie.
  3. A way of encapsulating encoding knowledge, as discussed.
  4. Solid concepts to distinguish arrays from strings from binary data. I'm working on this for 8.0.

I've just filed #948 for this problem. I happen to be facing a similar problem for the introduction of std::source_location.

@alexolog
Copy link
Author

alexolog commented Feb 5, 2025

  1. Compiler support for concepts, and therefore, C++20 as a minimum. That means 8.0 at the earliest, but I'm working on that right now.

Not necessarily. Concepts give you better error messages, but they are not required for correctness.
SFINAE and static_assert can be viable pre-C++20 alternatives.

@jtv
Copy link
Owner

jtv commented Feb 5, 2025

  1. Compiler support for concepts, and therefore, C++20 as a minimum. That means 8.0 at the earliest, but I'm working on that right now.

Not necessarily. Concepts give you better error messages, but they are not required for correctness.
SFINAE and static_assert can be viable pre-C++20 alternatives.

I'm not just talking about what is required to get something that compiles. It also has to be maintainable and fit in with my plans.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants