Support for Wide Strings in SOCI for Enhanced Unicode Handling #1133

bold84 · 2024-03-22T11:21:02Z

This pull request adds comprehensive support for wide strings (wchar_t, std::wstring) to the SOCI database library, significantly improving its support for Unicode string types such as SQL Server's NVARCHAR and NTEXT. This enhancement is crucial for applications that require robust handling of international and multi-language data.

Key Changes:

Introduced exchange_type_traits and exchange_traits Specializations:
- These specializations facilitate the handling of wide strings during type exchange, ensuring proper conversion and management within the SOCI library.
Updated ODBC Backend:
- Added support for wide strings, specifically for wchar_t and std::wstring.
- Adjusted the parameter binding and data retrieval mechanisms to correctly process wide characters.
Enhanced Buffer Management:
- Modified buffer allocation and management to accommodate the larger size of wide characters, which are essential for proper Unicode support.
- Implemented logic to handle buffer size overflow, ensuring safety and stability when processing large text data.
Improved Unicode Support:
- Incorporated routines to convert between different Unicode encodings (UTF-16 and UTF-32 on Unix-like systems, native UTF-16 on Windows) to handle wide strings properly across various platforms.
Extended Test Coverage:
- Added comprehensive tests focusing on wide string handling, especially ensuring compatibility with SQL Server.
- Included edge cases for large strings to test buffer management and overflow handling.
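The buffer sizing concern in the list above can be illustrated with a small sketch. It is not the code from this pull request: `utf16_buffer_bytes_sketch` is a hypothetical helper, and `char16_t` stands in for the ODBC `SQLWCHAR` type. It only shows the general idea of checking for overflow before multiplying a character count by the character size.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not part of SOCI): number of bytes needed to hold
// `code_units` UTF-16 code units plus one terminating null code unit.
std::size_t utf16_buffer_bytes_sketch(std::size_t code_units)
{
    // Assumption: char16_t stands in for the backend's wide character type.
    const std::size_t unit = sizeof(char16_t);
    const std::size_t max_units = std::numeric_limits<std::size_t>::max() / unit;

    // Reject sizes for which (code_units + 1) * unit would wrap around.
    if (code_units >= max_units)
    {
        throw std::length_error("wide string buffer size overflow");
    }
    return (code_units + 1) * unit;
}
```

For an NVARCHAR(40) column this yields room for 41 code units; a real backend would also have to consider that one character outside the Basic Multilingual Plane occupies two code units.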

Notes:

The focus of this pull request is on the ODBC backend.
While the original pull request mentioned a focus on C++17 standards, this has been removed to maintain compatibility with earlier C++ versions, ensuring broader usability of these enhancements.
This work sets a foundational framework to extend wide string support to other database backends in SOCI in future updates.

This update significantly bolsters SOCI's capabilities in handling Unicode data, making it a more versatile and powerful tool for database interactions in multi-language applications.

Example usage

Here are a few examples showing how the new wide string features can be used with the ODBC backend.

Example 1: Handling `std::wstring` in SQL Queries

Inserting and Selecting `std::wstring` Data

#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <string>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NVARCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_text NVARCHAR(40) NULL)";

        // Define a wstring to insert
        std::wstring wide_str_in = L"Hello, 世界!";
        
        // Insert the wstring
        sql << "INSERT INTO soci_test(wide_text) VALUES (:wide_text)", soci::use(wide_str_in);

        // Retrieve the wstring
        std::wstring wide_str_out;
        sql << "SELECT wide_text FROM soci_test WHERE id = 1", soci::into(wide_str_out);

        // Output the retrieved wstring
        std::wcout << L"Retrieved wide string: " << wide_str_out << std::endl;
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

Example 2: Working with `wchar_t` Vectors

Inserting and Selecting Wide Characters

#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <vector>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_char NCHAR(2) NULL)";

        // Define a vector of wide characters to insert
        std::vector<wchar_t> wide_chars_in = {L'A', L'B', L'C', L'D'};
        
        // Insert the wide characters
        sql << "INSERT INTO soci_test(wide_char) VALUES (:wide_char)", soci::use(wide_chars_in);

        // Retrieve the wide characters
        std::vector<wchar_t> wide_chars_out(4);
        sql << "SELECT wide_char FROM soci_test WHERE id IN (1, 2, 3, 4)", soci::into(wide_chars_out);

        // Output the retrieved wide characters
        for (wchar_t ch : wide_chars_out)
        {
            std::wcout << L"Retrieved wide char: " << ch << std::endl;
        }
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

Example 3: Using `std::wstring` with the `sql` Stream Operator

Inserting and Selecting `std::wstring` Data with Stream Operator

#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <string>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NVARCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_text NVARCHAR(40) NULL)";

        // Define a wstring to insert
        std::wstring wide_str_in = L"Hello, 世界!";
        
        // Use stream operator to insert the wstring
        sql << "INSERT INTO soci_test(wide_text) VALUES (N'" << wide_str_in << "')";
        
        // Retrieve the wstring using stream operator
        std::wstring wide_str_out;
        sql << "SELECT wide_text FROM soci_test WHERE id = 1", soci::into(wide_str_out);

        // Output the retrieved wstring
        std::wcout << L"Retrieved wide string: " << wide_str_out << std::endl;
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

In this example:

- A soci::session object is created to connect to the database.
- A table is created with an NVARCHAR column.
- A std::wstring is defined for insertion.
- The sql stream operator is used to insert the std::wstring into the database. Note the N'...' prefix, which marks a Unicode string literal in SQL Server.
- The std::wstring is retrieved from the database using the sql stream operator and the soci::into function.
- The retrieved wide string is printed to the console using std::wcout.

These examples demonstrate how to insert and retrieve wide strings and wide characters using SOCI's newly added features for handling wide strings (wchar_t, std::wstring).

Limitation: The current implementation does not handle combining characters correctly. Combining characters, such as accents or diacritical marks, are treated separately instead of being combined with their base characters. This limitation may result in incorrect conversions for strings containing combining characters. A potential solution would be to incorporate Unicode normalization before the conversion process to ensure that combining characters are properly combined with their base characters.
Unicode defines several normalization forms (e.g., NFC, NFD, NFKC, NFKD), each with its own set of rules and behaviors. Choosing the appropriate normalization form is crucial, as different forms may produce different results.

Full Unicode support would require linking against a library such as ICU or iconv. That dependency could be made optional.
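To make the limitation concrete, here is a minimal sketch (not the conversion code from this pull request; `utf32_to_utf8_sketch` is a hypothetical name). It encodes UTF-32 code points to UTF-8 one by one, without any normalization, so a precomposed character and its decomposed equivalent come out as different byte sequences even though they render identically.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not SOCI's API): encode UTF-32 as UTF-8 code point by
// code point. No normalization is applied, so combining marks stay separate.
std::string utf32_to_utf8_sketch(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in)
    {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        {
            throw std::invalid_argument("invalid UTF-32 code point");
        }
        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

With this sketch, `U"\u00E9"` (é as a single code point, the NFC form) encodes to the bytes C3 A9, while `U"e\u0301"` (e followed by COMBINING ACUTE ACCENT, the NFD form) encodes to 65 CC 81. A byte-wise comparison therefore treats them as different strings unless both are normalized to the same form first.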

Disclaimer: This text is partially AI generated.

bold84 · 2024-03-22T12:58:26Z

Converting from UTF-16 to UTF-8 is no problem when retrieving data, because the column data type is known.
When inserting/updating though, it is not so straightforward, as we don't have programmatic knowledge of the column data type in advance.

I'm thinking of adding another argument to "soci::use()" that lets the developer override the data type that's used for the underlying ODBC call.

Another issue is that, when soci::use() is used, Unicode strings currently do not get the N'' enclosure that MSSQL expects.

Another issue is the stream interface. Currently std::wstring isn't supported and as far as I understand, supporting it would require widening the query to UTF-16 before sending it to the DB.
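As a rough illustration of what widening a query to UTF-16 involves, the following sketch decodes a UTF-8 string and re-encodes it as UTF-16. It is not taken from this pull request: `utf8_to_utf16_sketch` is a hypothetical name, and the error handling (throwing `std::invalid_argument` on malformed input) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not SOCI's API): convert UTF-8 text to UTF-16.
std::u16string utf8_to_utf16_sketch(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t len = 0;    // sequence length in bytes
        std::uint32_t min = 0;  // smallest code point allowed for this length

        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3Fu);
        }

        // Reject overlong encodings, surrogates and out-of-range values.
        if (cp < min || cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
            throw std::invalid_argument("invalid UTF-8 code point");

        if (cp < 0x10000u)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            const std::uint32_t v = cp - 0x10000u;
            out.push_back(static_cast<char16_t>(0xD800u + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00u + (v & 0x3FFu)));
        }
        i += len;
    }
    return out;
}
```

A query assembled as UTF-8 could be passed through such a function before being handed to a wide-character ODBC call; whether that is the right place for the conversion is exactly the open question raised in the comment above.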

vadz

Thanks! Overall this looks good, but there are 2 general issues:

The new functionality needs to be documented, notably it should be clearly stated that wstring and wchar_t are only supported in the ODBC backend (and only when using SQL Server?).
The use of/checks for C++17 are confusing as it's not clear if it is required for wide char support or if it's just some kind of optimization (in the latter case I'd drop it, it's not worth the extra code complexity).

include/private/soci-vector-helpers.h

include/soci/soci-backend.h

src/core/use-type.cpp

include/soci/odbc/soci-odbc.h


This commit updates the Unicode conversion functions to handle UTF-16 on Windows and UTF-32 on other platforms. The changes include:
1. Updating the `utf8_to_wide` and `wide_to_utf8` functions to handle UTF-32 on Unix/Linux platforms.
2. Updating the `copy_from_string` function to handle UTF-16 on Windows and convert UTF-32 to UTF-16 on other platforms.
3. Updating the `bind_by_pos` function to handle UTF-16 on Windows and convert UTF-32 to UTF-16 on other platforms.
4. Adding a test case for wide strings in the ODBC MSSQL tests.
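The platform split described in this commit message can be sketched as follows. This is an illustration only, not the code from the commit: `wide_to_utf16_sketch` is a hypothetical name. Where `wchar_t` is 16 bits wide (as on Windows) the code units are copied as they are; where it is 32 bits wide (as on most Unix-like systems) each UTF-32 code point is encoded as one or two UTF-16 code units.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not SOCI's API): turn a std::wstring into UTF-16.
std::u16string wide_to_utf16_sketch(const std::wstring& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (wchar_t wc : in)
    {
        if (sizeof(wchar_t) == sizeof(char16_t))
        {
            // 16-bit wchar_t: the input is already UTF-16, copy it through.
            out.push_back(static_cast<char16_t>(wc));
            continue;
        }

        // 32-bit wchar_t: treat each element as a UTF-32 code point.
        const std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        {
            throw std::invalid_argument("invalid UTF-32 code point");
        }
        if (cp < 0x10000u)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Encode as a high/low surrogate pair.
            const std::uint32_t v = cp - 0x10000u;
            out.push_back(static_cast<char16_t>(0xD800u + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00u + (v & 0x3FFu)));
        }
    }
    return out;
}
```

The result can then be bound as wide character data regardless of the platform's `wchar_t` width.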

bold84 · 2024-06-18T19:18:55Z

Please note that I updated the FreeBSD Image for Cirrus from 13.2 to 13.3.

cirruslabs/cirrus-ci-docs#1277


bold84 · 2024-06-20T23:12:26Z

I'm adding better UTF conversion first.

bold84 · 2024-07-23T16:35:38Z

@vadz
I extended the description; see "Limitation".

Maybe this can be an optional feature, similar to boost.

vadz

Thanks, this looks mostly good to me and the limitation (lack of support for combined forms) can be addressed later.

I have some minor comments below and I admit I didn't read all the code in detail, but it looks superficially fine (if a bit verbose) and the tests look good, thank you.

include/soci/ref-counted-statement.h

include/soci/soci-backend.h

tests/odbc/test-odbc-mssql.cpp

vadz · 2024-07-23T18:40:28Z

tests/odbc/test-odbc-mssql.cpp

+
+// }
+
+TEST_CASE("UTF-8 validation tests", "[unicode]")


All these tests are neither MSSQL nor ODBC specific, they should ideally be in their own file.

I have moved them to the "empty" test module, as it contains other non-backend-specific tests.
I don't think this is the best solution, but I would need more information on how you want a separate unicode test file to be treated in the context of the CMake files. The backend tests use the CMake macro soci_backend_test.

I can just "nail it in", but I assume a more elegant solution is preferred.

src/backends/odbc/standard-into-type.cpp

include/soci/soci-unicode.h

vadz · 2024-07-23T18:53:58Z

Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason to not enable this unconditionally for people who need it, am I missing something?


bold84 · 2024-07-24T03:42:30Z

> Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason to not enable this unconditionally for people who need it, am I missing something?

I was referring to the need to link against ICU or iconv to get combining character support right away. But that's not necessary if we can take care of the normalization later.


Added std::wstring and wchar_t support for ODBC backend

075fe68

bold84 marked this pull request as draft March 22, 2024 11:25

bold84 added 6 commits March 22, 2024 19:11

Fixes for Ubuntu GCC 12

a2fa9c9

fixes for sqlite3 on ubuntu gcc 12

371804f

fixes for oracle backend

dd96bcb

one more

0b28428

removed semicolon

865c5dc

...

7c68f7c

bold84 mentioned this pull request Mar 22, 2024

UTF-16 support for ODBC (for MSSQL) #1041

Open

bold84 marked this pull request as ready for review March 22, 2024 13:09

vadz reviewed Mar 27, 2024


bold84 added 3 commits March 29, 2024 21:29

Removed C++17 specific copy_from_string

77fe29c

Added default labels to be able to remove dt_wstring from deprecated "data_type"

130603b

removed std::wstring& vector_wstring_value(exchange_type e, void* data, std::size_t ind)

708613e

bold84 force-pushed the wstring_support branch from a983010 to 708613e Compare March 31, 2024 23:22

bold84 added 3 commits April 1, 2024 06:51

added TODO comment

5c4eb8f

Merge branch 'SOCI:master' into wstring_support

1aa9410

added wstring stuff needed for building with merged master branch

a9f7996

bold84 force-pushed the wstring_support branch from d1e7608 to a9f7996 Compare April 1, 2024 00:11

bold84 marked this pull request as draft April 1, 2024 00:14

bold84 added 9 commits April 1, 2024 13:33

implicit conversion

a409107

only on windows

f7561d8

wstring stream

c38c4d0

Merge branch 'wstring_support_with_unicode_conversion' of https://github.com/ORDIS-Co-Ltd/soci into wstring_support_with_unicode_conversion

a1999e2

cleaning up

e4cfb8b

more cleanup

d60d92e

unicode conversion

1422524

Refactor string handling for Windows and non-Windows platforms

b71c1e9

bold84 added 8 commits June 19, 2024 02:59

Update AppVeyor configuration to use PostgreSQL 9.6

aa149c1

Suppress MSVC warning C4702 in soci-backend.h

7cf22b7

Update AppVeyor configuration to use PostgreSQL 10

197f4cb

Reverted PostgreSQL service from postgresql10 to postgresql in appveyor.yml

cfe22d0

Add detailed documentation for Unicode conversion functions.

4d56f01

Added documentation

184783e

Merge remote-tracking branch 'origin/wstring_support_with_unicode_conversion' into wstring_support

e9c3113

Optimize UTF conversion functions and improve error handling

a06185d

bold84 marked this pull request as draft June 20, 2024 23:12

bold84 added 2 commits July 23, 2024 19:34

Improved soci-unicode.h

60d826d

Fix wchar_t detection logic

7bb1a0c

bold84 marked this pull request as ready for review July 23, 2024 16:19

vadz reviewed Jul 23, 2024


Update ref-counted-statement.h

a4bce27

Co-authored-by: VZ <vz-github@zeitlins.org>

bold84 added 8 commits July 24, 2024 10:48

Rename SOCI_WCHAR_T_IS_WIDE to SOCI_WCHAR_T_IS_UTF32.

5550169

Add UTF-16 <-> wstring conversion functions

4a3851d

Fix formatting in MS SQL tests

16b8915

Removed pragma for msvc

401b34b

moved unicode tests to empty

82a9fd1

Merge branch 'wstring_support_with_unicode_conversion' into wstring_support

89be602

Merge branch 'master' into wstring_support

0470392

Add check for non-characters U+FFFE and U+FFFF in UTF-32 validation

068a9e3

Add UTF-8 BOM handling to unicode conversion functions.

bold84 force-pushed the wstring_support branch from e42c8d8 to 068a9e3 Compare July 24, 2024 06:24

bold84 added 2 commits July 24, 2024 13:26

Remove unused print_hex helper function in unicode test

19e3927

Remove constexpr from is_valid_utf8_sequence function (MSVC 2015)

1ce5842
