Support for Wide Strings in SOCI for Enhanced Unicode Handling #1133

Open
wants to merge 70 commits into
base: master

@bold84 bold84 commented Mar 22, 2024

This pull request adds comprehensive support for wide strings (wchar_t, std::wstring) to the SOCI database library, significantly improving its support for Unicode string types such as SQL Server's NVARCHAR and NTEXT. This enhancement is crucial for applications that require robust handling of international and multi-language data.

Key Changes:

  1. Introduced exchange_type_traits and exchange_traits Specializations:

    • These specializations facilitate the handling of wide strings during type exchange, ensuring proper conversion and management within the SOCI library.
  2. Updated ODBC Backend:

    • Added support for wide strings, specifically for wchar_t and std::wstring.
    • Adjusted the parameter binding and data retrieval mechanisms to correctly process wide characters.
  3. Enhanced Buffer Management:

    • Modified buffer allocation and management to accommodate the larger size of wide characters, which are essential for proper Unicode support.
    • Implemented logic to handle buffer size overflow, ensuring safety and stability when processing large text data.
  4. Improved Unicode Support:

    • Incorporated routines to convert between Unicode encodings so that wide strings are handled properly across platforms: on Unix-like systems, where wchar_t holds UTF-32, data is converted to and from UTF-16; on Windows, wchar_t is already UTF-16 and is used natively.
  5. Extended Test Coverage:

    • Added comprehensive tests focusing on wide string handling, especially ensuring compatibility with SQL Server.
    • Included edge cases for large strings to test buffer management and overflow handling.
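
To make the buffer-size overflow point above concrete, here is a minimal sketch of a checked size computation for a UTF-16 buffer. The helper name `wide_buffer_bytes` is hypothetical and is not taken from this PR or from SOCI; it only illustrates the kind of check involved, assuming the buffer holds 2-byte code units plus a terminating null.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper (not part of SOCI): computes the number of bytes needed
// to hold `char_count` UTF-16 code units plus one terminating null unit.
// Returns false instead of silently wrapping around if the size does not fit.
inline bool wide_buffer_bytes(std::size_t char_count, std::size_t& bytes_out)
{
    const std::size_t unit = sizeof(char16_t); // 2 bytes per UTF-16 code unit
    const std::size_t limit = std::numeric_limits<std::size_t>::max();

    // (char_count + 1) * unit must not exceed the largest size_t value.
    if (char_count > limit / unit - 1)
        return false;

    bytes_out = (char_count + 1) * unit;
    return true;
}
```

For an NVARCHAR(40) column this yields 82 bytes; a caller would treat a `false` result as an error rather than allocating a truncated buffer.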

Notes:

  • The focus of this pull request is on the ODBC backend.
  • An earlier version of this pull request relied on C++17. That requirement has been removed to keep the code compatible with earlier C++ standards, so the enhancements remain broadly usable.
  • This work sets a foundational framework to extend wide string support to other database backends in SOCI in future updates.

This update significantly bolsters SOCI's capabilities in handling Unicode data, making it a more versatile and powerful tool for database interactions in multi-language applications.

Example usage

Here are a few examples showing how the new wide string features can be used with the ODBC backend.

Example 1: Handling std::wstring in SQL Queries

Inserting and Selecting std::wstring Data

#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <string>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NVARCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_text NVARCHAR(40) NULL)";

        // Define a wstring to insert
        std::wstring wide_str_in = L"Hello, 世界!";
        
        // Insert the wstring
        sql << "INSERT INTO soci_test(wide_text) VALUES (:wide_text)", soci::use(wide_str_in);

        // Retrieve the wstring
        std::wstring wide_str_out;
        sql << "SELECT wide_text FROM soci_test WHERE id = 1", soci::into(wide_str_out);

        // Output the retrieved wstring
        std::wcout << L"Retrieved wide string: " << wide_str_out << std::endl;
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

Example 2: Working with wchar_t Vectors

Inserting and Selecting Wide Characters

#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <vector>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_char NCHAR(2) NULL)";

        // Define a vector of wide characters to insert (bulk operation: one row per element)
        std::vector<wchar_t> wide_chars_in = {L'A', L'B', L'C', L'D'};
        
        // Insert the wide characters
        sql << "INSERT INTO soci_test(wide_char) VALUES (:wide_char)", soci::use(wide_chars_in);

        // Retrieve the wide characters
        std::vector<wchar_t> wide_chars_out(4);
        sql << "SELECT wide_char FROM soci_test WHERE id IN (1, 2, 3, 4)", soci::into(wide_chars_out);

        // Output the retrieved wide characters
        for (wchar_t ch : wide_chars_out)
        {
            std::wcout << L"Retrieved wide char: " << ch << std::endl;
        }
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

Example 3: Using std::wstring with the sql Stream Operator

Inserting and Selecting std::wstring Data with Stream Operator

#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <string>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NVARCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_text NVARCHAR(40) NULL)";

        // Define a wstring to insert
        std::wstring wide_str_in = L"Hello, 世界!";
        
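        // Use stream operator to insert the wstring as an N'...' Unicode literal.
        // Note: this relies on the query stream accepting std::wstring, which the
        // follow-up comment below lists as an open issue.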
        sql << "INSERT INTO soci_test(wide_text) VALUES (N'" << wide_str_in << "')";
        
        // Retrieve the wstring using stream operator
        std::wstring wide_str_out;
        sql << "SELECT wide_text FROM soci_test WHERE id = 1", soci::into(wide_str_out);

        // Output the retrieved wstring
        std::wcout << L"Retrieved wide string: " << wide_str_out << std::endl;
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

In this example:

  1. A soci::session object is created to connect to the database.
  2. A table is created with an NVARCHAR column.
  3. A std::wstring is defined for insertion.
  4. The sql stream operator is used to insert the std::wstring into the database. Note the N'...' prefix, which marks a Unicode string literal in SQL Server.
  5. The std::wstring is retrieved from the database using the sql stream operator and the soci::into function.
  6. The retrieved wide string is printed to the console using std::wcout.

These examples demonstrate how to insert and retrieve wide strings and wide characters using SOCI's newly added features for handling wide strings (wchar_t, std::wstring).

Limitation: The current implementation does not handle combining characters correctly. Combining characters, such as accents or diacritical marks, are treated as separate code points instead of being combined with their base characters, which may result in incorrect conversions for strings that contain them. A potential solution would be to apply Unicode normalization before the conversion, so that combining characters are properly composed with their base characters.

Unicode defines several normalization forms (NFC, NFD, NFKC, NFKD), each with its own rules and behavior. Choosing the appropriate form is crucial, as different forms may produce different results.

Full Unicode support would require linking against a library such as ICU or iconv. This dependency could be made optional.
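
To illustrate the limitation, the sketch below shows the same visible character, "é", written in two ways: precomposed (U+00E9) and decomposed (U+0065 followed by the combining acute accent U+0301). The function names are hypothetical and only serve this example. A conversion that works code point by code point preserves whichever form it is given, so the two strings stay different unless a normalization step is applied first.

```cpp
#include <string>

// Precomposed form: a single code point, U+00E9.
inline std::u32string precomposed_e_acute()
{
    return U"\u00E9";
}

// Decomposed form: base letter 'e' followed by U+0301 (combining acute accent).
inline std::u32string decomposed_e_acute()
{
    return U"e\u0301";
}

// Plain code point comparison, with no Unicode normalization applied.
inline bool same_code_points(const std::u32string& a, const std::u32string& b)
{
    return a == b;
}
```

Here `same_code_points(precomposed_e_acute(), decomposed_e_acute())` is false, even though both strings render identically. Normalizing both sides to the same form (for example NFC) with a library such as ICU would make them compare equal.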

Disclaimer: This text is partially AI generated.

@bold84 bold84 marked this pull request as draft March 22, 2024 11:25
Author

bold84 commented Mar 22, 2024

Converting from UTF-16 to UTF-8 is no problem when retrieving data, because the column data type is known.
When inserting/updating though, it is not so straightforward, as we don't have programmatic knowledge of the column data type in advance.

I'm thinking of adding another argument to "soci::use()" that lets the developer override the data type that's used for the underlying ODBC call.

Another issue is that, with soci::use(), there is currently no N'' enclosure for Unicode strings on MSSQL.

Another issue is the stream interface. Currently std::wstring isn't supported and as far as I understand, supporting it would require widening the query to UTF-16 before sending it to the DB.
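
As a rough idea of what widening a UTF-8 query to UTF-16 would involve, here is a minimal, self-contained sketch. The function name `utf8_to_utf16_sketch` is hypothetical and is not the PR's implementation. It assumes strict input handling: truncated, overlong, or out-of-range sequences are rejected with an exception.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sketch: decode UTF-8 and re-encode as UTF-16 code units.
inline std::u16string utf8_to_utf16_sketch(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;       // decoded code point
        std::size_t extra = 0; // number of continuation bytes expected
        char32_t min_cp = 0;   // smallest code point valid for this length

        if (b0 < 0x80)                { cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

A real implementation would also have to decide how errors are reported through SOCI and whether invalid input should be replaced rather than rejected.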

@bold84 bold84 marked this pull request as ready for review March 22, 2024 13:09
Member

@vadz vadz left a comment

Thanks! This looks good overall, but there are 2 global issues:

  1. The new functionality needs to be documented, notably it should be clearly stated that wstring and wchar_t are only supported in the ODBC backend (and only when using SQL Server?).
  2. The use of/checks for C++17 are confusing as it's not clear if it is required for wide char support or if it's just some kind of optimization (in the latter case I'd drop it, it's not worth the extra code complexity).

@bold84 bold84 marked this pull request as draft April 1, 2024 00:14
This commit updates the Unicode conversion functions to handle UTF-16 on Windows and UTF-32 on other platforms.

The changes include:

1. Updating the `utf8_to_wide` and `wide_to_utf8` functions to handle UTF-32 on Unix/Linux platforms.
2. Updating the `copy_from_string` function to handle UTF-16 on Windows and convert UTF-32 to UTF-16 on other platforms.
3. Updating the `bind_by_pos` function to handle UTF-16 on Windows and convert UTF-32 to UTF-16 on other platforms.
4. Adding a test case for wide strings in the ODBC MSSQL tests.
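
The UTF-32 to UTF-16 step mentioned in this commit message can be sketched as follows. This is an illustration only: the name `utf32_to_utf16_sketch` is hypothetical, and the PR's actual functions may be structured differently. Code points in the Basic Multilingual Plane map to one UTF-16 code unit, and all others map to a surrogate pair.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical sketch: convert UTF-32 code points to UTF-16 code units.
inline std::u16string utf32_to_utf16_sketch(const std::u32string& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char32_t cp : in)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point in UTF-32 input");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point beyond U+10FFFF");

        if (cp < 0x10000)
        {
            // BMP code point: one code unit.
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Supplementary code point: high and low surrogate.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, which is why a UTF-16 buffer may need up to two code units per UTF-32 character.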
Author

bold84 commented Jun 18, 2024

Please note that I updated the FreeBSD Image for Cirrus from 13.2 to 13.3.

cirruslabs/cirrus-ci-docs#1277

@bold84 bold84 marked this pull request as draft June 20, 2024 23:12
Author

bold84 commented Jun 20, 2024

I'm adding better UTF conversion first.

@bold84 bold84 marked this pull request as ready for review July 23, 2024 16:19
Author

bold84 commented Jul 23, 2024

@vadz
I extended the description; see "Limitation".

Maybe this can be an optional feature, similar to boost.

Member

@vadz vadz left a comment

Thanks, this looks mostly good to me and the limitation (lack of support for combined forms) can be addressed later.

I have some minor comments below and I admit I didn't read all the code in detail, but it looks superficially fine (if a bit verbose) and the tests look good, thank you.


// }

TEST_CASE("UTF-8 validation tests", "[unicode]")
Member

All these tests are neither MSSQL nor ODBC specific, they should ideally be in their own file.

Author

I have moved them to the "empty" test module, as it contains other non-backend-specific tests.
I don't think this is the best solution, but I would need more information on how you want a separate unicode test file to be treated in the context of the CMake files. The backend tests use the CMake macro soci_backend_test.

I can just "nail it in", but I assume a more elegant solution is preferred.

Member

vadz commented Jul 23, 2024

Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason to not enable this unconditionally for people who need it, am I missing something?

Co-authored-by: VZ <vz-github@zeitlins.org>
Author

bold84 commented Jul 24, 2024

> Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason to not enable this unconditionally for people who need it, am I missing something?

I was referring to the need to link against ICU or iconv to have combining character support right away. But that's not necessary if we can take care of the normalization later.
