From 191a9ab3d142c7c36a1a685c80e23aa65658047b Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Thu, 2 Feb 2023 04:16:25 -0800 Subject: [PATCH 01/14] Updating docs --- README.md | 11 ++++++----- doc/Emscripten.md | 44 ++++++++++++++++++++++++++++++++++++++++++++ doc/Usage.md | 1 + 3 files changed, 51 insertions(+), 5 deletions(-) create mode 100644 doc/Emscripten.md diff --git a/README.md b/README.md index d7f05ae..c85bf39 100644 --- a/README.md +++ b/README.md @@ -11,7 +11,7 @@ of characters. ## What does it mean? -* Interoperability with platform-native string type means that `sys_string` makes conversions to and from native string types as efficient as possible and ideally 0 cost operations. Native string types are things like `NSString *` or `CFStringRef` on macOS/iOS, Java `String` on Android, `const wchar_t *`, `HSTRING` or `BSTR` on Windows and `const char *` on Linux. For example on Apple's platforms it stores `NSString *` internally allowing zero cost conversion. On Android where no-op conversions to Java strings are impossible for technical reasons, the internal storage is such that it makes conversions as cheap as possible. +* Interoperability with platform-native string type means that `sys_string` makes conversions to and from native string types as efficient as possible and ideally 0 cost operations. Native string types are things like `NSString *` or `CFStringRef` on macOS/iOS, Java `String` on Android, JavaScript `String` on Emscripten/WebAssembly, `const wchar_t *`, `HSTRING` or `BSTR` on Windows and `const char *` on Linux. For example on Apple's platforms it stores `NSString *` internally allowing zero cost conversion. On Android and Emscripten/WebAssembly where no-op conversions to Java/JavaScript strings are impossible for technical reasons, the internal storage is such that it makes conversions as cheap as possible. Some platforms, like Windows, support multiple kinds of native string types. 
Internally, `sys_string` is a specialization of template `sys_string_t` where the `Storage` parameter defines what kind of native string type to use. The default storage for `sys_string` is picked for you based on your platform (you can change it via compilation options) but you can also directly use other specializations in your code if necessary. @@ -52,10 +52,11 @@ Another way to look at it is that `sys_string` sometimes trades micro-benchmarki ## Compatibility This library has been tested with -* Xcode 13 on x86_64 and arm64 -* MSVC 16.9 and 17.1 on x86_64 -* Clang 12.0.5 under Android NDK on x86, x86_64, armeabi-v7a and arm64-v8a architectures -* GCC 9.3 on x86_64 Ubuntu 20.04 +* Xcode 13 - 14 on x86_64 and arm64 +* MSVC 16.9 - 17.4 on x86_64 +* Clang 12.0.5 under Android NDK, ANDROID_PLATFORM=19 on x86, x86_64, armeabi-v7a and arm64-v8a architectures +* GCC 9.3 - 11.3 on x86_64 Ubuntu 20.04 - 22.04 +* Emscripten 3.1.21 ## Usage diff --git a/doc/Emscripten.md b/doc/Emscripten.md new file mode 100644 index 0000000..8065996 --- /dev/null +++ b/doc/Emscripten.md @@ -0,0 +1,44 @@ +## Emscripten platform conversions + +Similar to Android, when compiling under Emscripten there are two storage types available. The default one is optimized for interoperability with JavaScript. It stores a sequence of `char16_t` which can be converted to `String` with the least amount of overhead. + +Additionally you can choose a "generic Unix" storage which stores `char *` and is meant to interoperate with plain Unix API. It can be selected via `#define SYS_STRING_USE_GENERIC 1` and is described under [Linux](Linux.md). + +With JavaScript-optimized storage a conversion to and from `String` is non-trivial. It incurs allocation on JavaScript or native heap and copying between them. + +Conversions rely on the `embind` [library](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html) so you will need to link with that (e.g. `-lembind`). 
+ + +```cpp +//Conversions from/to JavaScript +EM_VAL handle_in = ... //passed from JavaScript, see below +sys_string str(handle_in); +assert(str == S("abc")); +EM_VAL handle_out = str.make_js_string(); +assert(handle_in != handle_out); //in and out are NOT the same! +//Return handle_out to JavaScript see below +``` + +Passing strings from JavaScript can be accomplished as follows + +```javascript +let handle_in = Emval.toHandle("abc"); +try { + callNativeFunction(handle_in); +} finally { + __emval_decref(handle_in); +} +``` + +And receiving strings from native code as follows + +```javascript +let handle_out = callNativeFunction(); +try { + let str = Emval.toValue(handle_out); +} finally { + __emval_decref(handle_out); +} +``` + + diff --git a/doc/Usage.md b/doc/Usage.md index 369e8db..24680d0 100644 --- a/doc/Usage.md +++ b/doc/Usage.md @@ -110,6 +110,7 @@ Those are described on the following pages. * [Windows](Windows.md) * [Android](Android.md) * [Linux](Linux.md) +* [Emscripten](Emscripten.md) ## Adding Strings From 062a14e6edd4637a913ac8b93a9a8a77e1374af7 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 11 Mar 2023 04:52:31 -0800 Subject: [PATCH 02/14] Fixing incorrect type name in currently unused function --- lib/inc/sys_string/impl/util/cursor.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/inc/sys_string/impl/util/cursor.h b/lib/inc/sys_string/impl/util/cursor.h index d5b9d2d..51221d3 100644 --- a/lib/inc/sys_string/impl/util/cursor.h +++ b/lib/inc/sys_string/impl/util/cursor.h @@ -262,7 +262,7 @@ namespace sysstr::util iter_cursor operator++(int) { - index_cursor ret = *this; + iter_cursor ret = *this; ++(*this); return ret; } From cdf87343949ff862edca2d18792c21253bb04db1 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 11 Mar 2023 04:52:52 -0800 Subject: [PATCH 03/14] Removing dev garbage --- test/test_javascript.cpp | 1 - 1 file changed, 1 deletion(-) diff --git a/test/test_javascript.cpp 
b/test/test_javascript.cpp index 09d70ee..e5d0fe5 100644 --- a/test/test_javascript.cpp +++ b/test/test_javascript.cpp @@ -20,7 +20,6 @@ using namespace emscripten; TEST_CASE( "Javascript Conversions", "[javascript]") { - emscripten::val gug; EM_VAL handle = (EM_VAL)EM_ASM_PTR({ return Emval.toHandle(""); }, 0); From 18b77097ac40dba4ce28973d991df5f008e601fd Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 11 Mar 2023 04:53:16 -0800 Subject: [PATCH 04/14] First pass on Python strings --- lib/CMakeLists.txt | 6 +- lib/inc/sys_string/config.h | 6 + lib/inc/sys_string/impl/compare.h | 23 +- lib/inc/sys_string/impl/platform.h | 136 +++--- .../sys_string/impl/platforms/python_any.h | 439 ++++++++++++++++++ test/CMakeLists.txt | 31 ++ test/test_apple.mm | 4 +- test/test_main.cpp | 11 + 8 files changed, 576 insertions(+), 80 deletions(-) create mode 100644 lib/inc/sys_string/impl/platforms/python_any.h diff --git a/lib/CMakeLists.txt b/lib/CMakeLists.txt index cbc8102..634fed8 100644 --- a/lib/CMakeLists.txt +++ b/lib/CMakeLists.txt @@ -10,7 +10,7 @@ cmake_minimum_required(VERSION 3.16) project(sys_string) -find_package (Python3 COMPONENTS Interpreter) +find_package (Python3 COMPONENTS Interpreter Development) set(SRCDIR ${CMAKE_CURRENT_LIST_DIR}) set(LIBNAME sys_string${SYS_STRING_LIBRARY_SUFFIX}) @@ -44,8 +44,7 @@ target_compile_definitions(${LIBNAME} PUBLIC ) target_include_directories(${LIBNAME} - - PUBLIC +PUBLIC ${SRCDIR}/inc ) @@ -70,6 +69,7 @@ set(PLATFORM_FILES ${SRCDIR}/inc/sys_string/impl/platforms/windows_bstr.h ${SRCDIR}/inc/sys_string/impl/platforms/windows_generic.h ${SRCDIR}/inc/sys_string/impl/platforms/emscripten_javascript.h + ${SRCDIR}/inc/sys_string/impl/platforms/python_any.h ${SRCDIR}/inc/sys_string/impl/platforms/unix_generic.h ) source_group("Platforms" FILES ${PLATFORM_FILES}) diff --git a/lib/inc/sys_string/config.h b/lib/inc/sys_string/config.h index b9d57d6..1ccde55 100644 --- a/lib/inc/sys_string/config.h +++ 
b/lib/inc/sys_string/config.h @@ -38,6 +38,12 @@ #endif +#if defined(SYS_STRING_USE_PYTHON) + + #include + +#endif + #if __SIZEOF_POINTER__ == 8 || (defined(_MSC_VER) && _WIN64) #define SYS_STRING_SIZEOF_POINTER 8 #elif __SIZEOF_POINTER__ == 4 || (defined(_MSC_VER) && _WIN32) diff --git a/lib/inc/sys_string/impl/compare.h b/lib/inc/sys_string/impl/compare.h index 85a0602..1412b9a 100644 --- a/lib/inc/sys_string/impl/compare.h +++ b/lib/inc/sys_string/impl/compare.h @@ -12,7 +12,7 @@ namespace sysstr { - #if (defined(__APPLE__) && defined(__MACH__)) + #if (defined(__APPLE__) && defined(__MACH__)) template<> inline auto sys_string_cfstr::compare(const sys_string_t & lhs, const sys_string_t & rhs) noexcept -> compare_result @@ -33,6 +33,27 @@ namespace sysstr #endif + #if defined (SYS_STRING_USE_PYTHON) + + template<> + inline auto sys_string_pystr::compare(const sys_string_t & lhs, const sys_string_t & rhs) noexcept -> compare_result + { + auto lhs_ptr = lhs.py_str(); + auto rhs_ptr = rhs.py_str(); + + if (lhs_ptr == rhs_ptr) + return ordering_equal; + if (!lhs_ptr) + return PyUnicode_GetLength(rhs_ptr) == 0 ? ordering_equal : ordering_less; + if (!rhs_ptr) + return PyUnicode_GetLength(lhs_ptr) == 0 ? ordering_equal : ordering_greater; + + int res = PyUnicode_Compare(lhs_ptr, rhs_ptr); + return util::make_compare_result(res); + } + + #endif + template inline auto sys_string_t::compare(const sys_string_t & lhs, const sys_string_t & rhs) noexcept -> compare_result { diff --git a/lib/inc/sys_string/impl/platform.h b/lib/inc/sys_string/impl/platform.h index 7169a25..d47769d 100644 --- a/lib/inc/sys_string/impl/platform.h +++ b/lib/inc/sys_string/impl/platform.h @@ -9,22 +9,35 @@ #error This header must not be included directly. 
Please include sys_string.h #endif -#if (defined(__APPLE__) && defined(__MACH__)) - #include - #include +#if defined(SYS_STRING_USE_PYTHON) + #include - #if defined(SYS_STRING_USE_GENERIC) + namespace sysstr + { + using sys_string = sys_string_pystr; + using sys_string_builder = sys_string_pystr_builder; + } - namespace sysstr - { - using sys_string = sys_string_generic; - using sys_string_builder = sys_string_generic_builder; - } + #define SYS_STRING_STATIC SYS_STRING_STATIC_PYSTR - #define SYS_STRING_STATIC SYS_STRING_STATIC_GENERIC +#elif defined(SYS_STRING_USE_GENERIC) + + #include + + namespace sysstr + { + using sys_string = sys_string_generic; + using sys_string_builder = sys_string_generic_builder; + } + + #define SYS_STRING_STATIC SYS_STRING_STATIC_GENERIC - #else +#endif + +#if (defined(__APPLE__) && defined(__MACH__)) + #include + #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) namespace sysstr { using sys_string = sys_string_cfstr; @@ -37,20 +50,8 @@ #elif defined(__ANDROID__) #include - #include - #if defined(SYS_STRING_USE_GENERIC) - - namespace sysstr - { - using sys_string = sys_string_generic; - using sys_string_builder = sys_string_generic_builder; - } - - #define SYS_STRING_STATIC SYS_STRING_STATIC_GENERIC - - #else - + #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) namespace sysstr { @@ -66,67 +67,49 @@ #include #include #include - #include - #if SYS_STRING_WIN_BSTR + #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) - namespace sysstr - { - using sys_string = sys_string_bstr; - using sys_string_builder = sys_string_bstr_builder; - } + #if SYS_STRING_WIN_BSTR - #define SYS_STRING_STATIC SYS_STRING_STATIC_BSTR - - #elif SYS_STRING_WIN_HSTRING + namespace sysstr + { + using sys_string = sys_string_bstr; + using sys_string_builder = sys_string_bstr_builder; + } - namespace sysstr - { - using sys_string = sys_string_hstring; - using sys_string_builder = 
sys_string_hstring_builder; - } + #define SYS_STRING_STATIC SYS_STRING_STATIC_BSTR + + #elif SYS_STRING_WIN_HSTRING - #define SYS_STRING_STATIC SYS_STRING_STATIC_HSTRING + namespace sysstr + { + using sys_string = sys_string_hstring; + using sys_string_builder = sys_string_hstring_builder; + } - #elif SYS_STRING_USE_GENERIC + #define SYS_STRING_STATIC SYS_STRING_STATIC_HSTRING - namespace sysstr - { - using sys_string = sys_string_generic; - using sys_string_builder = sys_string_generic_builder; - } + #else - #define SYS_STRING_STATIC SYS_STRING_STATIC_GENERIC - - #else + namespace sysstr + { + using sys_string = sys_string_win_generic; + using sys_string_builder = sys_string_win_generic_builder; + } - namespace sysstr - { - using sys_string = sys_string_win_generic; - using sys_string_builder = sys_string_win_generic_builder; - } + #define SYS_STRING_STATIC SYS_STRING_STATIC_WIN_GENERIC - #define SYS_STRING_STATIC SYS_STRING_STATIC_WIN_GENERIC + #endif #endif + #elif defined(__EMSCRIPTEN__) #include - #include - #if defined(SYS_STRING_USE_GENERIC) - - namespace sysstr - { - using sys_string = sys_string_generic; - using sys_string_builder = sys_string_generic_builder; - } - - #define SYS_STRING_STATIC SYS_STRING_STATIC_GENERIC - - #else - + #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) namespace sysstr { @@ -139,16 +122,21 @@ #endif #elif defined(__linux__) || defined(__FreeBSD__) || defined(__unix__) + + #include + #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) - namespace sysstr - { - using sys_string = sys_string_generic; - using sys_string_builder = sys_string_generic_builder; - } + namespace sysstr + { + using sys_string = sys_string_generic; + using sys_string_builder = sys_string_generic_builder; + } - #define SYS_STRING_STATIC SYS_STRING_STATIC_GENERIC + #define SYS_STRING_STATIC SYS_STRING_STATIC_GENERIC + + #endif #else #error Unsupported platform diff --git 
a/lib/inc/sys_string/impl/platforms/python_any.h b/lib/inc/sys_string/impl/platforms/python_any.h new file mode 100644 index 0000000..6bc3c6e --- /dev/null +++ b/lib/inc/sys_string/impl/platforms/python_any.h @@ -0,0 +1,439 @@ +// +// Copyright 2023 Eugene Gershnik +// +// Use of this source code is governed by a BSD-style +// license that can be found in the LICENSE file or at +// https://github.com/gershnik/sys_string/blob/master/LICENSE +// +#ifndef HEADER_SYS_STRING_H_INSIDE +#error This header must not be included directly. Please include sys_string.h +#endif + +#include + +namespace sysstr +{ + class py_storage; +} + +namespace sysstr::util +{ + struct py_traits + { + using storage_type = char32_t; //we pretend that Python strings store char32_t and convert transparently + static_assert(sizeof(storage_type) == sizeof(Py_UCS4)); + using size_type = Py_ssize_t; + using hash_type = size_t; + using native_handle_type = PyObject *; + + static constexpr size_type max_size = std::numeric_limits::max() / sizeof(char32_t); + }; + + inline PyObject * check_create(PyObject * src) + { + if (!src) { + PyErr_Clear(); + throw std::runtime_error("Python string creation failed"); + } + return src; + } + + class py_builder_storage + { + public: + using value_type = py_traits::storage_type; + using size_type = py_traits::size_type; + + class dynamic_t + { + public: + dynamic_t() noexcept = default; + dynamic_t(size_t size): + m_ptr((value_type *)malloc(size * sizeof(value_type))) + { + if (!m_ptr) + throw std::bad_alloc(); + } + dynamic_t(const dynamic_t &) = delete; + dynamic_t(dynamic_t && src) noexcept : + m_ptr(src.m_ptr) + { src.m_ptr = nullptr; } + ~dynamic_t() noexcept + { + if (m_ptr) + free(m_ptr); + } + dynamic_t & operator=(const dynamic_t &) = delete; + dynamic_t & operator=(dynamic_t && rhs) noexcept + { + if (this != &rhs) + { + if (m_ptr) + free(m_ptr); + m_ptr = rhs.m_ptr; + rhs.m_ptr = nullptr; + } + return *this; + } + + constexpr value_type * data() const 
noexcept + { return m_ptr; } + + void reallocate(size_type size) + { + auto result = (value_type *)realloc(m_ptr, size * sizeof(value_type)); + if (!result) + throw std::bad_alloc(); + m_ptr = result; + } + + value_type * release() noexcept + { + auto ret = m_ptr; + m_ptr = nullptr; + return ret; + } + private: + value_type * m_ptr = nullptr; + }; + + static constexpr size_type minimum_capacity = 32; + + using static_t = std::array; + + using buffer_t = std::variant; + + public: + constexpr size_type capacity() const noexcept + { return m_capacity; } + constexpr value_type * buffer() const noexcept + { + return std::visit([](const auto & val) { + return const_cast(val.data()); + }, m_buffer); + } + static constexpr size_type max_size() noexcept + { return py_traits::max_size; } + + void reallocate(size_type size, size_type used_size) + { + struct reallocator + { + py_builder_storage * me; + size_type size; + size_type used_size; + + void operator()(dynamic_t & buf) const + { + buf.reallocate(size); + me->m_capacity = size; + } + + void operator()(static_t & buf) const + { + if (size > minimum_capacity) + { + dynamic_t new_buf(size); + memcpy(new_buf.data(), buf.data(), used_size * sizeof(value_type)); + me->m_buffer = std::move(new_buf); + me->m_capacity = size; + } else { + me->m_capacity = minimum_capacity; + } + } + }; + + std::visit(reallocator{this, size, used_size}, m_buffer); + } + + buffer_t release() noexcept + { + this->m_capacity = minimum_capacity; + return std::move(m_buffer); + } + private: + buffer_t m_buffer; + size_type m_capacity = minimum_capacity; + }; + + using py_builder_impl = char_buffer; + + inline PyObject * convert_to_string(py_builder_impl & builder) noexcept + { + auto size = builder.size(); + return std::visit([&](auto && buf) { + return size ? 
check_create(PyUnicode_DecodeUTF32((const char *)buf.data(), size * sizeof(char32_t), "replace", 0)) : + nullptr; + }, builder.release()); + } + + class py_char_access + { + public: + using value_type = py_traits::storage_type; + using size_type = py_traits::size_type; + using reference = value_type; + using pointer = void; + + using cursor = index_cursor; + using reverse_cursor = index_cursor; + + using iterator = cursor; + using const_iterator = iterator; + using reverse_iterator = reverse_cursor; + using const_reverse_iterator = reverse_iterator; + + public: + py_char_access(const sys_string_t & src) noexcept; + + + py_char_access(const py_char_access & src) noexcept = delete; + py_char_access(py_char_access && src) noexcept = delete; + py_char_access & operator=(const py_char_access & src) = delete; + py_char_access & operator=(py_char_access && src) = delete; + + PyObject * get_string() const noexcept + { return m_str; } + + value_type operator[](size_type idx) const noexcept + { + assert (idx >= 0 && idx < m_size); + return char32_t(PyUnicode_READ(m_kind, m_data, idx)); + } + + size_type size() const noexcept + { return m_size; } + + template + auto cursor_begin() const noexcept -> index_cursor + { return index_cursor(*this, bool(Direction) ? 0 : m_size); } + + template + auto cursor_end() const noexcept -> index_cursor + { return index_cursor(*this, bool(Direction) ? 
m_size : 0); } + + iterator begin() const noexcept + { return cursor_begin(); } + iterator end() const noexcept + { return cursor_end(); } + const_iterator cbegin() const noexcept + { return begin(); } + const_iterator cend() const noexcept + { return end(); } + reverse_iterator rbegin() const noexcept + { return cursor_begin(); } + reverse_iterator rend() const noexcept + { return cursor_end(); } + const_reverse_iterator crbegin() const noexcept + { return rbegin(); } + const_reverse_iterator crend() const noexcept + { return rend(); } + + const char * c_str() const noexcept + { return PyUnicode_AsUTF8(m_str); } + + friend bool operator==(const py_char_access & lhs, const py_char_access & rhs) noexcept + { return lhs.m_str == rhs.m_str; } + friend bool operator!=(const py_char_access & lhs, const py_char_access & rhs) noexcept + { return !(lhs == rhs); } + private: + PyObject * m_str = nullptr; + PyUnicode_Kind m_kind = PyUnicode_4BYTE_KIND; + void * m_data = nullptr; + size_type m_size = 0; + }; + +} + + +namespace sysstr +{ + + class py_storage + { + public: + using size_type = util::py_traits::size_type; + using storage_type = util::py_traits::storage_type; + using native_handle_type = util::py_traits::native_handle_type; + using hash_type = util::py_traits::hash_type; + using char_access = util::py_char_access; + + using builder_impl = util::py_builder_impl; + + static constexpr size_type max_size = util::py_traits::max_size; + + public: + py_storage() noexcept: + m_str(null_string()) + {} + + py_storage(native_handle_type str, handle_retain retain_handle = handle_retain::yes) noexcept : + m_str(str ? (retain_handle == handle_retain::yes ? retain(str) : str) : null_string()) + {} + + protected: + + py_storage(native_handle_type src, size_type first, size_type last) : + m_str(src ? 
util::check_create(PyUnicode_Substring(src, first, last)) : null_string()) + {} + + py_storage(const py_storage & src, size_type first, size_type last) : + py_storage(src.m_str, first, last) + {} + + template + py_storage(const Char * str, size_t len); + + template<> + py_storage(const char * str, size_t len): + py_storage(util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")), handle_retain::no) + {} + + #if SYS_STRING_USE_CHAR8 + template<> + py_storage(const char8_t * str, size_t len): + py_storage(util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")), handle_retain::no) + {} + #endif + + template<> + py_storage(const char16_t * str, size_t len) : + py_storage(util::check_create(PyUnicode_DecodeUTF16((const char *)str, len * sizeof(char16_t), "replace", 0)), handle_retain::no) + {} + + template<> + py_storage(const char32_t * str, size_t len): + py_storage(util::check_create(PyUnicode_DecodeUTF32((const char *)str, len * sizeof(char32_t), "replace", 0)), handle_retain::no) + {} + + ~py_storage() noexcept + { release(m_str); } + + py_storage(const py_storage & src) noexcept : m_str(retain(src.m_str)) + {} + + py_storage(py_storage && src) noexcept : m_str(src.m_str) + { + src.m_str = null_string(); + } + + auto operator=(const py_storage & rhs) noexcept -> py_storage & + { + PyObject * temp = m_str; + m_str = rhs.m_str; + retain(m_str); + release(temp); + return *this; + } + + inline auto operator=(py_storage && rhs) noexcept -> py_storage & + { + if (this != &rhs) + { + release(m_str); + m_str = rhs.m_str; + rhs.m_str = null_string(); + } + return *this; + } + + auto swap(py_storage & other) noexcept -> void + { + using std::swap; + swap(m_str, other.m_str); + } + + public: + + auto py_str() const noexcept -> native_handle_type + { return m_str; } + + auto data() const noexcept -> const storage_type * { + auto kind = PyUnicode_KIND(m_str); + if (kind == PyUnicode_4BYTE_KIND) + return (const storage_type 
*)PyUnicode_DATA(m_str); + + return nullptr; + } + + auto copy_data(size_type idx, storage_type * buf, size_type buf_size) const noexcept -> size_type + { + auto kind = PyUnicode_KIND(m_str); + auto size = PyUnicode_GET_LENGTH(m_str); + auto data = PyUnicode_DATA(m_str); + + if (idx >= size) + return 0; + + size_type ret; + for(ret = 0; ret < buf_size && idx + ret < size; ++ret) + buf[ret] = PyUnicode_READ(kind, data, idx + ret); + return ret; + } + + protected: + + auto size() const noexcept -> size_type + { return PyUnicode_GET_LENGTH(m_str); } + + private: + static PyObject * retain(PyObject * src) noexcept + { + Py_INCREF(src); + return src; + } + static void release(PyObject * src) noexcept + { + Py_DECREF(src); + } + + static PyObject * null_string() noexcept + { + static py_storage null(util::check_create(PyUnicode_FromString("")), ::sysstr::handle_retain::no); + return retain(null.m_str); + } + + private: + PyObject * m_str = nullptr; + }; +} + +namespace sysstr::util +{ + inline py_char_access::py_char_access(const sys_string_t & src) noexcept: + m_str(src.py_str()), + m_kind(PyUnicode_Kind(PyUnicode_KIND(m_str))), + m_data(PyUnicode_DATA(m_str)), + m_size(PyUnicode_GET_LENGTH(m_str)) + {} + + template<> + inline sys_string_t build(py_builder_impl & builder) noexcept + { + auto str = convert_to_string(builder); + return sys_string_t(str, handle_retain::no); + } +} + +namespace sysstr +{ + template<> + inline sys_string_t::sys_string_t(const char_access::cursor & src, size_type length): + sys_string_t(src.container() ? src.container()->get_string() : nullptr, src.position(), src.position() + length) + {} + + template<> + inline sys_string_t::sys_string_t(const char_access::reverse_cursor & src, size_type length): + sys_string_t(src.container() ? 
src.container()->get_string() : nullptr, src.position() - length, src.position()) + {} + + template<> + inline sys_string_t::sys_string_t(const char_access::iterator & first, const char_access::iterator & last): + sys_string_t(first, last.position() - first.position()) + {} + + using sys_string_pystr = sys_string_t; + using sys_string_pystr_builder = sys_string_builder_t; +} + +#define SYS_STRING_STATIC_PYSTR(x) ::sysstr::sys_string_pystr(::sysstr::util::check_create(PyUnicode_FromString(x)), ::sysstr::handle_retain::no) diff --git a/test/CMakeLists.txt b/test/CMakeLists.txt index 8efaa3b..9ae8460 100644 --- a/test/CMakeLists.txt +++ b/test/CMakeLists.txt @@ -16,6 +16,18 @@ endif() project(test) +find_package (Python3 COMPONENTS Interpreter Development) + +if(${Python3_Development_FOUND}) + include_directories( + SYSTEM + ${Python3_INCLUDE_DIRS} + ) + + link_libraries( + ${Python3_LIBRARIES} + ) +endif() set (CXX_STANDARDS 17 @@ -34,6 +46,12 @@ if (WIN32) unix_gen ) + if (Python3_Development_FOUND) + + list(APPEND STORAGE_TYPES python) + + endif() + elseif(${CMAKE_SYSTEM_NAME} STREQUAL "Darwin") set (STORAGE_TYPES @@ -41,6 +59,12 @@ elseif(${CMAKE_SYSTEM_NAME} STREQUAL "Darwin") unix_gen ) + if (Python3_Development_FOUND) + + list(APPEND STORAGE_TYPES python) + + endif() + elseif (${CMAKE_SYSTEM_NAME} STREQUAL Android) set (STORAGE_TYPES @@ -61,6 +85,12 @@ else() def ) + if (Python3_FOUND) + + list(APPEND STORAGE_TYPES python) + + endif() + endif() set(STORAGE_FLAG_def "") @@ -70,6 +100,7 @@ set(STORAGE_FLAG_win_gen "") set(STORAGE_FLAG_cfstr "") set(STORAGE_FLAG_andr "") set(STORAGE_FLAG_unix_gen SYS_STRING_USE_GENERIC=1) +set(STORAGE_FLAG_python SYS_STRING_USE_PYTHON=1) set(TEST_COMMAND "") set(TEST_DEPS "") diff --git a/test/test_apple.mm b/test/test_apple.mm index 38ea267..b5fe0eb 100644 --- a/test/test_apple.mm +++ b/test/test_apple.mm @@ -12,7 +12,7 @@ using namespace sysstr; -#if !SYS_STRING_USE_GENERIC +#if !SYS_STRING_USE_GENERIC && !SYS_STRING_USE_PYTHON 
TEST_CASE( "Apple Conversions", "[apple]") { @@ -31,7 +31,7 @@ } -#else +#elif SYS_STRING_USE_GENERIC TEST_CASE( "Apple Conversions", "[apple]") { diff --git a/test/test_main.cpp b/test/test_main.cpp index a2b6a1f..b5687bd 100644 --- a/test/test_main.cpp +++ b/test/test_main.cpp @@ -8,6 +8,13 @@ #define CATCH_CONFIG_RUNNER #include "catch.hpp" +#if defined(SYS_STRING_USE_PYTHON) + + #define PY_SSIZE_T_CLEAN + #include + +#endif + #if defined(__ANDROID__) @@ -75,6 +82,10 @@ int main(int argc, char** argv) SetConsoleOutputCP(CP_UTF8); #endif + #if defined(SYS_STRING_USE_PYTHON) + Py_Initialize(); + #endif + return Catch::Session().run( argc, argv ); } From c0e87ceea9588f00ef31f05d897aa77077da78ee Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 11 Mar 2023 05:20:20 -0800 Subject: [PATCH 05/14] Fixing Linux and Windows failures --- lib/CMakeLists.txt | 2 +- .../sys_string/impl/platforms/python_any.h | 45 ++++++++++--------- test/test_windows.cpp | 2 +- 3 files changed, 26 insertions(+), 23 deletions(-) diff --git a/lib/CMakeLists.txt b/lib/CMakeLists.txt index 634fed8..b839a7f 100644 --- a/lib/CMakeLists.txt +++ b/lib/CMakeLists.txt @@ -10,7 +10,7 @@ cmake_minimum_required(VERSION 3.16) project(sys_string) -find_package (Python3 COMPONENTS Interpreter Development) +find_package (Python3 COMPONENTS Interpreter) set(SRCDIR ${CMAKE_CURRENT_LIST_DIR}) set(LIBNAME sys_string${SYS_STRING_LIBRARY_SUFFIX}) diff --git a/lib/inc/sys_string/impl/platforms/python_any.h b/lib/inc/sys_string/impl/platforms/python_any.h index 6bc3c6e..d31f749 100644 --- a/lib/inc/sys_string/impl/platforms/python_any.h +++ b/lib/inc/sys_string/impl/platforms/python_any.h @@ -11,6 +11,8 @@ #include +#include + namespace sysstr { class py_storage; @@ -284,27 +286,6 @@ namespace sysstr template py_storage(const Char * str, size_t len); - template<> - py_storage(const char * str, size_t len): - py_storage(util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")), 
handle_retain::no) - {} - - #if SYS_STRING_USE_CHAR8 - template<> - py_storage(const char8_t * str, size_t len): - py_storage(util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")), handle_retain::no) - {} - #endif - - template<> - py_storage(const char16_t * str, size_t len) : - py_storage(util::check_create(PyUnicode_DecodeUTF16((const char *)str, len * sizeof(char16_t), "replace", 0)), handle_retain::no) - {} - - template<> - py_storage(const char32_t * str, size_t len): - py_storage(util::check_create(PyUnicode_DecodeUTF32((const char *)str, len * sizeof(char32_t), "replace", 0)), handle_retain::no) - {} ~py_storage() noexcept { release(m_str); } @@ -396,6 +377,28 @@ namespace sysstr private: PyObject * m_str = nullptr; }; + + template<> + inline py_storage::py_storage(const char * str, size_t len): + py_storage(util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")), handle_retain::no) + {} + + #if SYS_STRING_USE_CHAR8 + template<> + inline py_storage::py_storage(const char8_t * str, size_t len): + py_storage(util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")), handle_retain::no) + {} + #endif + + template<> + inline py_storage::py_storage(const char16_t * str, size_t len) : + py_storage(util::check_create(PyUnicode_DecodeUTF16((const char *)str, len * sizeof(char16_t), "replace", 0)), handle_retain::no) + {} + + template<> + inline py_storage::py_storage(const char32_t * str, size_t len): + py_storage(util::check_create(PyUnicode_DecodeUTF32((const char *)str, len * sizeof(char32_t), "replace", 0)), handle_retain::no) + {} } namespace sysstr::util diff --git a/test/test_windows.cpp b/test/test_windows.cpp index 6d5e122..2c43670 100644 --- a/test/test_windows.cpp +++ b/test/test_windows.cpp @@ -129,7 +129,7 @@ using namespace sysstr; CHECK(strcmp(sys_string("a水𐀀𝄞bcå🤢").c_str(), "a水𐀀𝄞bcå🤢") == 0); } -#else +#elif !defined(SYS_STRING_USE_PYTHON) TEST_CASE( "Windows Conversions", "[windows]") { From 
12983c922b8f6081f3952036805b2fcef75488f1 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 11 Mar 2023 22:47:54 -0800 Subject: [PATCH 06/14] Separating between using and enabling Python and tests logic cleanup --- lib/inc/sys_string/config.h | 10 +++++++-- lib/inc/sys_string/impl/compare.h | 2 +- lib/inc/sys_string/impl/platform.h | 22 +++++++++---------- test/CMakeLists.txt | 1 + test/test_apple.mm | 19 ----------------- test/test_generic.cpp | 34 ++++++++++++++++++++++++++++++ test/test_javascript.cpp | 2 +- test/test_linux.cpp | 6 +++++- test/test_windows.cpp | 24 ++++----------------- 9 files changed, 65 insertions(+), 55 deletions(-) create mode 100644 test/test_generic.cpp diff --git a/lib/inc/sys_string/config.h b/lib/inc/sys_string/config.h index 1ccde55..dab1d2b 100644 --- a/lib/inc/sys_string/config.h +++ b/lib/inc/sys_string/config.h @@ -38,10 +38,16 @@ #endif -#if defined(SYS_STRING_USE_PYTHON) +#if SYS_STRING_USE_PYTHON + #if defined(SYS_STRING_ENABLE_PYTHON) + #undef SYS_STRING_ENABLE_PYTHON + #endif - #include + #define SYS_STRING_ENABLE_PYTHON 1 +#endif +#if SYS_STRING_ENABLE_PYTHON + #include #endif #if __SIZEOF_POINTER__ == 8 || (defined(_MSC_VER) && _WIN64) diff --git a/lib/inc/sys_string/impl/compare.h b/lib/inc/sys_string/impl/compare.h index 1412b9a..29513d0 100644 --- a/lib/inc/sys_string/impl/compare.h +++ b/lib/inc/sys_string/impl/compare.h @@ -33,7 +33,7 @@ namespace sysstr #endif - #if defined (SYS_STRING_USE_PYTHON) + #if SYS_STRING_ENABLE_PYTHON template<> inline auto sys_string_pystr::compare(const sys_string_t & lhs, const sys_string_t & rhs) noexcept -> compare_result diff --git a/lib/inc/sys_string/impl/platform.h b/lib/inc/sys_string/impl/platform.h index d47769d..158e250 100644 --- a/lib/inc/sys_string/impl/platform.h +++ b/lib/inc/sys_string/impl/platform.h @@ -9,9 +9,13 @@ #error This header must not be included directly. 
Please include sys_string.h #endif -#if defined(SYS_STRING_USE_PYTHON) +#include + +#if SYS_STRING_ENABLE_PYTHON #include +#endif +#if SYS_STRING_USE_PYTHON namespace sysstr { using sys_string = sys_string_pystr; @@ -20,9 +24,7 @@ #define SYS_STRING_STATIC SYS_STRING_STATIC_PYSTR -#elif defined(SYS_STRING_USE_GENERIC) - - #include +#elif SYS_STRING_USE_GENERIC namespace sysstr { @@ -37,7 +39,7 @@ #if (defined(__APPLE__) && defined(__MACH__)) #include - #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) + #if !SYS_STRING_USE_PYTHON && !SYS_STRING_USE_GENERIC namespace sysstr { using sys_string = sys_string_cfstr; @@ -51,7 +53,7 @@ #include - #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) + #if !SYS_STRING_USE_PYTHON && !SYS_STRING_USE_GENERIC namespace sysstr { @@ -68,7 +70,7 @@ #include #include - #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) + #if !SYS_STRING_USE_PYTHON && !SYS_STRING_USE_GENERIC #if SYS_STRING_WIN_BSTR @@ -109,7 +111,7 @@ #include - #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) + #if !SYS_STRING_USE_PYTHON && !SYS_STRING_USE_GENERIC namespace sysstr { @@ -124,9 +126,7 @@ #elif defined(__linux__) || defined(__FreeBSD__) || defined(__unix__) - #include - - #if !defined(SYS_STRING_USE_PYTHON) && !defined(SYS_STRING_USE_GENERIC) + #if !SYS_STRING_USE_PYTHON && !SYS_STRING_USE_GENERIC namespace sysstr { diff --git a/test/CMakeLists.txt b/test/CMakeLists.txt index 9ae8460..2a82751 100644 --- a/test/CMakeLists.txt +++ b/test/CMakeLists.txt @@ -186,6 +186,7 @@ foreach(STORAGE_SUFFIX ${STORAGE_TYPES}) test_builder.cpp test_utf_iteration.cpp test_utf_util.cpp + test_generic.cpp "$<$:test_apple.mm>" "$<$:test_android.cpp>" "$<$:test_windows.cpp>" diff --git a/test/test_apple.mm b/test/test_apple.mm index b5fe0eb..8a7b82f 100644 --- a/test/test_apple.mm +++ b/test/test_apple.mm @@ -31,23 +31,4 @@ } -#elif SYS_STRING_USE_GENERIC - -TEST_CASE( "Apple Conversions", 
"[apple]") { - - REQUIRE(sys_string().c_str()); - CHECK(strcmp(sys_string().c_str(), "") == 0); - - REQUIRE(S("").c_str()); - CHECK(strcmp(S("").c_str(), "") == 0); - - REQUIRE(sys_string("").c_str()); - CHECK(strcmp(sys_string("").c_str(), "") == 0); - - REQUIRE(sys_string((const char*)nullptr).c_str()); - CHECK(strcmp(sys_string((const char*)nullptr).c_str(), "") == 0); - - CHECK(strcmp(sys_string("a水𐀀𝄞bcå🤢").c_str(), "a水𐀀𝄞bcå🤢") == 0); -} - #endif diff --git a/test/test_generic.cpp b/test/test_generic.cpp new file mode 100644 index 0000000..3442193 --- /dev/null +++ b/test/test_generic.cpp @@ -0,0 +1,34 @@ +// +// Copyright 2023 Eugene Gershnik +// +// Use of this source code is governed by a BSD-style +// license that can be found in the LICENSE file or at +// https://github.com/gershnik/sys_string/blob/master/LICENSE +// +#include + + +#include "catch.hpp" + +using namespace sysstr; + +#if SYS_STRING_USE_GENERIC + +TEST_CASE( "Generic Conversions", "[generic]") { + + REQUIRE(sys_string().c_str()); + CHECK(strcmp(sys_string().c_str(), "") == 0); + + REQUIRE(S("").c_str()); + CHECK(strcmp(S("").c_str(), "") == 0); + + REQUIRE(sys_string("").c_str()); + CHECK(strcmp(sys_string("").c_str(), "") == 0); + + REQUIRE(sys_string((const char*)nullptr).c_str()); + CHECK(strcmp(sys_string((const char*)nullptr).c_str(), "") == 0); + + CHECK(strcmp(sys_string("a水𐀀𝄞bcå🤢").c_str(), "a水𐀀𝄞bcå🤢") == 0); +} + +#endif \ No newline at end of file diff --git a/test/test_javascript.cpp b/test/test_javascript.cpp index e5d0fe5..07f7658 100644 --- a/test/test_javascript.cpp +++ b/test/test_javascript.cpp @@ -16,7 +16,7 @@ using namespace emscripten; #pragma clang diagnostic ignored "-Wdollar-in-identifier-extension" -#if !SYS_STRING_USE_GENERIC +#if !SYS_STRING_USE_GENERIC && !SYS_STRING_USE_PYTHON TEST_CASE( "Javascript Conversions", "[javascript]") { diff --git a/test/test_linux.cpp b/test/test_linux.cpp index 803b6c8..2f8e30e 100644 --- a/test/test_linux.cpp +++ 
b/test/test_linux.cpp @@ -12,6 +12,8 @@ using namespace sysstr; +#if !SYS_STRING_USE_GENERIC && !SYS_STRING_USE_PYTHON + TEST_CASE( "Linux Conversions", "[linux]") { REQUIRE(sys_string().c_str()); @@ -27,4 +29,6 @@ TEST_CASE( "Linux Conversions", "[linux]") { CHECK(strcmp(sys_string((const char*)nullptr).c_str(), "") == 0); CHECK(strcmp(sys_string("a水𐀀𝄞bcå🤢").c_str(), "a水𐀀𝄞bcå🤢") == 0); -} \ No newline at end of file +} + +#endif diff --git a/test/test_windows.cpp b/test/test_windows.cpp index 2c43670..842f22c 100644 --- a/test/test_windows.cpp +++ b/test/test_windows.cpp @@ -11,6 +11,8 @@ using namespace sysstr; +#if !SYS_STRING_USE_GENERIC && !SYS_STRING_USE_PYTHON + #if SYS_STRING_WIN_BSTR TEST_CASE("Windows Empty String", "[windows]") { @@ -110,26 +112,7 @@ using namespace sysstr; WindowsDeleteString(hstr); } -#elif SYS_STRING_USE_GENERIC - - TEST_CASE( "Windows Conversions", "[windows]") { - - REQUIRE(sys_string().c_str()); - CHECK(strcmp(sys_string().c_str(), "") == 0); - - REQUIRE(S("").c_str()); - CHECK(strcmp(S("").c_str(), "") == 0); - - REQUIRE(sys_string("").c_str()); - CHECK(strcmp(sys_string("").c_str(), "") == 0); - - REQUIRE(sys_string((const char*)nullptr).c_str()); - CHECK(strcmp(sys_string((const char*)nullptr).c_str(), "") == 0); - - CHECK(strcmp(sys_string("a水𐀀𝄞bcå🤢").c_str(), "a水𐀀𝄞bcå🤢") == 0); - } - -#elif !defined(SYS_STRING_USE_PYTHON) +#else TEST_CASE( "Windows Conversions", "[windows]") { @@ -149,3 +132,4 @@ using namespace sysstr; } #endif +#endif From 77976dcd49f6230df7376792455997e25bc874c3 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 11 Mar 2023 22:48:46 -0800 Subject: [PATCH 07/14] Fixing MSVC warnings --- lib/inc/sys_string/impl/unicode/utf_encoding.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/inc/sys_string/impl/unicode/utf_encoding.h b/lib/inc/sys_string/impl/unicode/utf_encoding.h index 154b352..cfe2ea7 100644 --- a/lib/inc/sys_string/impl/unicode/utf_encoding.h +++ 
b/lib/inc/sys_string/impl/unicode/utf_encoding.h @@ -426,8 +426,8 @@ namespace sysstr if constexpr (Validate) return !( (c & 0xFFFFF800) == 0x0000D800 || c > 0x010FFFF ); - - return true; + else + return true; } constexpr const char32_t * begin() const noexcept From bee9a1928e117552c5890063c27cd290ed68f1b7 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 11 Mar 2023 23:39:05 -0800 Subject: [PATCH 08/14] Second pass on Python strings --- lib/inc/sys_string/config.h | 3 ++ .../sys_string/impl/platforms/python_any.h | 50 ++++++++++--------- test/CMakeLists.txt | 1 + test/test_python.cpp | 38 ++++++++++++++ 4 files changed, 68 insertions(+), 24 deletions(-) create mode 100644 test/test_python.cpp diff --git a/lib/inc/sys_string/config.h b/lib/inc/sys_string/config.h index dab1d2b..c854ef2 100644 --- a/lib/inc/sys_string/config.h +++ b/lib/inc/sys_string/config.h @@ -48,6 +48,9 @@ #if SYS_STRING_ENABLE_PYTHON #include + #if PY_MAJOR_VERSION < 3 || (PY_MAJOR_VERSION == 3 && PY_MINOR_VERSION < 7) + #error This code requires Python 3.7 or newer + #endif #endif #if __SIZEOF_POINTER__ == 8 || (defined(_MSC_VER) && _WIN64) diff --git a/lib/inc/sys_string/impl/platforms/python_any.h b/lib/inc/sys_string/impl/platforms/python_any.h index d31f749..1c26bba 100644 --- a/lib/inc/sys_string/impl/platforms/python_any.h +++ b/lib/inc/sys_string/impl/platforms/python_any.h @@ -270,7 +270,7 @@ namespace sysstr {} py_storage(native_handle_type str, handle_retain retain_handle = handle_retain::yes) noexcept : - m_str(str ? (retain_handle == handle_retain::yes ? 
retain(str) : str) : null_string()) + m_str(canonicalize(str, retain_handle)) {} protected: @@ -284,7 +284,9 @@ namespace sysstr {} template - py_storage(const Char * str, size_t len); + py_storage(const Char * str, size_t len): + py_storage(create_from(str, len), handle_retain::no) + {} ~py_storage() noexcept @@ -374,31 +376,31 @@ namespace sysstr return retain(null.m_str); } + template + static PyObject * create_from(const Char * str, size_t len) + { + if constexpr (utf_encoding_of == utf_encoding::utf8) + return util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")); + else if constexpr (utf_encoding_of == utf_encoding::utf16) + return util::check_create(PyUnicode_DecodeUTF16((const char *)str, len * sizeof(char16_t), "replace", 0)); + else if constexpr (utf_encoding_of == utf_encoding::utf32) + return util::check_create(PyUnicode_DecodeUTF32((const char *)str, len * sizeof(char32_t), "replace", 0)); + } + + static inline PyObject * canonicalize(PyObject * str, handle_retain retain_handle) + { + if (!str) + return null_string(); + #if (PY_MAJOR_VERSION == 3 && PY_MINOR_VERSION < 12) + if (PyUnicode_READY(str) != 0) + throw std::bad_alloc(); + #endif + return (retain_handle == handle_retain::yes ? 
retain(str) : str); + } + private: PyObject * m_str = nullptr; }; - - template<> - inline py_storage::py_storage(const char * str, size_t len): - py_storage(util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")), handle_retain::no) - {} - - #if SYS_STRING_USE_CHAR8 - template<> - inline py_storage::py_storage(const char8_t * str, size_t len): - py_storage(util::check_create(PyUnicode_DecodeUTF8((const char *)str, len, "replace")), handle_retain::no) - {} - #endif - - template<> - inline py_storage::py_storage(const char16_t * str, size_t len) : - py_storage(util::check_create(PyUnicode_DecodeUTF16((const char *)str, len * sizeof(char16_t), "replace", 0)), handle_retain::no) - {} - - template<> - inline py_storage::py_storage(const char32_t * str, size_t len): - py_storage(util::check_create(PyUnicode_DecodeUTF32((const char *)str, len * sizeof(char32_t), "replace", 0)), handle_retain::no) - {} } namespace sysstr::util diff --git a/test/CMakeLists.txt b/test/CMakeLists.txt index 2a82751..017a32b 100644 --- a/test/CMakeLists.txt +++ b/test/CMakeLists.txt @@ -187,6 +187,7 @@ foreach(STORAGE_SUFFIX ${STORAGE_TYPES}) test_utf_iteration.cpp test_utf_util.cpp test_generic.cpp + test_python.cpp "$<$:test_apple.mm>" "$<$:test_android.cpp>" "$<$:test_windows.cpp>" diff --git a/test/test_python.cpp b/test/test_python.cpp new file mode 100644 index 0000000..2f647de --- /dev/null +++ b/test/test_python.cpp @@ -0,0 +1,38 @@ +// +// Copyright 2023 Eugene Gershnik +// +// Use of this source code is governed by a BSD-style +// license that can be found in the LICENSE file or at +// https://github.com/gershnik/sys_string/blob/master/LICENSE +// +#include + + +#include "catch.hpp" + +using namespace sysstr; + +#if SYS_STRING_USE_PYTHON + +TEST_CASE( "Python Conversions", "[python]") { + + auto str = sys_string(); + REQUIRE(str.py_str()); + CHECK(strcmp(PyUnicode_AsUTF8(str.py_str()), "") == 0); + + str = S(""); + REQUIRE(str.py_str()); + 
CHECK(strcmp(PyUnicode_AsUTF8(str.py_str()), "") == 0); + + str = sys_string((PyObject*)nullptr); + REQUIRE(str.py_str()); + CHECK(strcmp(PyUnicode_AsUTF8(str.py_str()), "") == 0); + + CHECK(strcmp(PyUnicode_AsUTF8(sys_string("a水𐀀𝄞bcå🤢").py_str()), "a水𐀀𝄞bcå🤢") == 0); + + auto raw = PyUnicode_FromString("\xEF\xBF\xBD"); + REQUIRE(raw); + CHECK(sys_string(raw) == sys_string(u"�")); +} + +#endif From 68571a5de64b9dbd024f00a436f184b16e86b2dc Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Thu, 16 Mar 2023 10:42:49 -0700 Subject: [PATCH 09/14] Implemented static Python strings --- .../sys_string/impl/platforms/python_any.h | 38 ++++++++++++++++++- 1 file changed, 37 insertions(+), 1 deletion(-) diff --git a/lib/inc/sys_string/impl/platforms/python_any.h b/lib/inc/sys_string/impl/platforms/python_any.h index 1c26bba..8944d30 100644 --- a/lib/inc/sys_string/impl/platforms/python_any.h +++ b/lib/inc/sys_string/impl/platforms/python_any.h @@ -39,6 +39,26 @@ namespace sysstr::util } return src; } + + template + constexpr auto find_max_codepoint(const char32_t (&ar)[N]) -> char32_t { + char32_t max = 0; + for(char32_t c: ar) { + if (c > max) + max = c; + } + return max; + } + + inline auto init_static_string(PyUnicodeObject & str, size_t size, PyUnicode_Kind kind, const void * chars) { + str._base._base.ob_base.ob_refcnt = 2; + str._base._base.ob_base.ob_type = &PyUnicode_Type; + str._base._base.length = size; + str._base._base.state.kind = kind; + str._base._base.state.ready = 1; + str.data.any = const_cast(chars); + } + class py_builder_storage { @@ -441,4 +461,20 @@ namespace sysstr using sys_string_pystr_builder = sys_string_builder_t; } -#define SYS_STRING_STATIC_PYSTR(x) ::sysstr::sys_string_pystr(::sysstr::util::check_create(PyUnicode_FromString(x)), ::sysstr::handle_retain::no) + +#define SYS_STRING_STATIC_PYSTR(x) ([] () noexcept -> ::sysstr::sys_string_pystr { \ + constexpr auto size = ::std::size(U##x); \ + constexpr auto maxChar = 
::sysstr::util::find_max_codepoint(U##x); \ + static PyUnicodeObject str{}; \ + if constexpr (maxChar <= 0x7fu) { \ + ::sysstr::util::init_static_string(str, size - 1, PyUnicode_1BYTE_KIND, x); \ + str._base.utf8 = const_cast(x); \ + str._base.utf8_length = size - 1; \ + } else if constexpr (maxChar <= 0xffffu) { \ + ::sysstr::util::init_static_string(str, size - 1, PyUnicode_2BYTE_KIND, u##x); \ + } else { \ + ::sysstr::util::init_static_string(str, size - 1, PyUnicode_4BYTE_KIND, U##x); \ + } \ + auto ptr = reinterpret_cast(&str); \ + return *reinterpret_cast<::sysstr::sys_string_pystr *>(&ptr); \ + }()) From 0afed114f65c670758067dad0c745b7263134dd8 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Thu, 16 Mar 2023 11:13:17 -0700 Subject: [PATCH 10/14] Bug fixes for static Python strings --- .../sys_string/impl/platforms/python_any.h | 53 ++++++++++++------- 1 file changed, 34 insertions(+), 19 deletions(-) diff --git a/lib/inc/sys_string/impl/platforms/python_any.h b/lib/inc/sys_string/impl/platforms/python_any.h index 8944d30..dc350ff 100644 --- a/lib/inc/sys_string/impl/platforms/python_any.h +++ b/lib/inc/sys_string/impl/platforms/python_any.h @@ -41,7 +41,8 @@ namespace sysstr::util } template - constexpr auto find_max_codepoint(const char32_t (&ar)[N]) -> char32_t { + constexpr auto find_max_codepoint(const char32_t (&ar)[N]) -> char32_t + { char32_t max = 0; for(char32_t c: ar) { if (c > max) @@ -50,16 +51,6 @@ namespace sysstr::util return max; } - inline auto init_static_string(PyUnicodeObject & str, size_t size, PyUnicode_Kind kind, const void * chars) { - str._base._base.ob_base.ob_refcnt = 2; - str._base._base.ob_base.ob_type = &PyUnicode_Type; - str._base._base.length = size; - str._base._base.state.kind = kind; - str._base._base.state.ready = 1; - str.data.any = const_cast(chars); - } - - class py_builder_storage { public: @@ -438,6 +429,32 @@ namespace sysstr::util auto str = convert_to_string(builder); return sys_string_t(str, 
handle_retain::no); } + + template + struct PyUnicodeObject_wrapper : PyUnicodeObject + { + constexpr PyUnicodeObject_wrapper(size_t size, const void * chars) + { + this->_base._base.ob_base.ob_refcnt = 1; + this->_base._base.ob_base.ob_type = &PyUnicode_Type; + this->_base._base.length = size; + this->_base._base.hash = -1; + this->_base._base.state.kind = Kind; + this->_base._base.state.ready = 1; + this->data.any = const_cast(chars); + if constexpr (Kind == PyUnicode_1BYTE_KIND) + { + this->_base.utf8 = static_cast(const_cast(chars)); + this->_base.utf8_length = size; + } + } + + sys_string_t as_string() noexcept + { + auto ptr = reinterpret_cast(this); + return sys_string_t(ptr); + } + }; } namespace sysstr @@ -465,16 +482,14 @@ namespace sysstr #define SYS_STRING_STATIC_PYSTR(x) ([] () noexcept -> ::sysstr::sys_string_pystr { \ constexpr auto size = ::std::size(U##x); \ constexpr auto maxChar = ::sysstr::util::find_max_codepoint(U##x); \ - static PyUnicodeObject str{}; \ if constexpr (maxChar <= 0x7fu) { \ - ::sysstr::util::init_static_string(str, size - 1, PyUnicode_1BYTE_KIND, x); \ - str._base.utf8 = const_cast(x); \ - str._base.utf8_length = size - 1; \ + static ::sysstr::util::PyUnicodeObject_wrapper str(size - 1, x); \ + return str.as_string(); \ } else if constexpr (maxChar <= 0xffffu) { \ - ::sysstr::util::init_static_string(str, size - 1, PyUnicode_2BYTE_KIND, u##x); \ + static ::sysstr::util::PyUnicodeObject_wrapper str(size - 1, u##x); \ + return str.as_string(); \ } else { \ - ::sysstr::util::init_static_string(str, size - 1, PyUnicode_4BYTE_KIND, U##x); \ + static ::sysstr::util::PyUnicodeObject_wrapper str(size - 1, U##x); \ + return str.as_string(); \ } \ - auto ptr = reinterpret_cast(&str); \ - return *reinterpret_cast<::sysstr::sys_string_pystr *>(&ptr); \ }()) From 47aae06beade1395a8e341ce4ae0f1e2a24c0752 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 18 Mar 2023 06:58:32 -0700 Subject: [PATCH 11/14] Making sure UCS1 python static 
strings are in UTF-8 --- lib/inc/sys_string/impl/platforms/python_any.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/inc/sys_string/impl/platforms/python_any.h b/lib/inc/sys_string/impl/platforms/python_any.h index dc350ff..7cb5c7d 100644 --- a/lib/inc/sys_string/impl/platforms/python_any.h +++ b/lib/inc/sys_string/impl/platforms/python_any.h @@ -483,7 +483,7 @@ namespace sysstr constexpr auto size = ::std::size(U##x); \ constexpr auto maxChar = ::sysstr::util::find_max_codepoint(U##x); \ if constexpr (maxChar <= 0x7fu) { \ - static ::sysstr::util::PyUnicodeObject_wrapper str(size - 1, x); \ + static ::sysstr::util::PyUnicodeObject_wrapper str(size - 1, u8##x); \ return str.as_string(); \ } else if constexpr (maxChar <= 0xffffu) { \ static ::sysstr::util::PyUnicodeObject_wrapper str(size - 1, u##x); \ From a68fdd1dadbf4c7072420c3eee629fd29a9c9d7d Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 18 Mar 2023 06:59:00 -0700 Subject: [PATCH 12/14] Proper reference counting in python tests --- test/test_python.cpp | 1 + 1 file changed, 1 insertion(+) diff --git a/test/test_python.cpp b/test/test_python.cpp index 2f647de..3747acd 100644 --- a/test/test_python.cpp +++ b/test/test_python.cpp @@ -33,6 +33,7 @@ TEST_CASE( "Python Conversions", "[python]") { auto raw = PyUnicode_FromString("\xEF\xBF\xBD"); REQUIRE(raw); CHECK(sys_string(raw) == sys_string(u"�")); + Py_DECREF(raw); } #endif From 5415b74d68f906796b96da1ac4bea7058085b264 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 18 Mar 2023 06:59:30 -0700 Subject: [PATCH 13/14] Moving embind dependency to the library CMake target --- lib/CMakeLists.txt | 1 + test/CMakeLists.txt | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/CMakeLists.txt b/lib/CMakeLists.txt index b839a7f..a5221ec 100644 --- a/lib/CMakeLists.txt +++ b/lib/CMakeLists.txt @@ -37,6 +37,7 @@ target_compile_options(${LIBNAME} PRIVATE target_link_libraries(${LIBNAME} PUBLIC "$<$:-framework 
CoreFoundation>" "$<$:runtimeobject.lib>" + "$<$:embind>" ) target_compile_definitions(${LIBNAME} PUBLIC diff --git a/test/CMakeLists.txt b/test/CMakeLists.txt index 017a32b..27c3ad6 100644 --- a/test/CMakeLists.txt +++ b/test/CMakeLists.txt @@ -166,8 +166,7 @@ foreach(STORAGE_SUFFIX ${STORAGE_TYPES}) sys_string${STANDARD_SUFFIX} - "$<$:log>" - "$<$:embind>" + "$<$:log>" ) if (SYS_STRING_TEST_SHARED) From 620a644af83b429c912a5dd87ef1fde135bf27c4 Mon Sep 17 00:00:00 2001 From: Eugene Gershnik Date: Sat, 18 Mar 2023 06:59:49 -0700 Subject: [PATCH 14/14] Updating docs --- README.md | 59 ++++++++++++++++++++++++++++++++++------------- doc/Building.md | 53 ++++++++++++++++++++++++++++-------------- doc/Emscripten.md | 16 +++++++++---- doc/Python.md | 23 ++++++++++++++++++ doc/Usage.md | 1 + 5 files changed, 114 insertions(+), 38 deletions(-) create mode 100644 doc/Python.md diff --git a/README.md b/README.md index c85bf39..f2c4344 100644 --- a/README.md +++ b/README.md @@ -5,28 +5,50 @@ [![License](https://img.shields.io/badge/license-BSD-brightgreen.svg)](https://opensource.org/licenses/BSD-3-Clause) [![Tests](https://github.com/gershnik/sys_string/actions/workflows/test.yml/badge.svg)](https://github.com/gershnik/sys_string/actions/workflows/test.yml) -This library provides a C++ string class `sys_string` that is optimized for **interoperability with platform-native string type**. It is **immutable**, **Unicode-first** and exposes convenient **operations similar to Python or ECMAScript strings**. It uses a separate `sys_string_builder` class to construct strings. It provides fast concatenation via `+` operator that **does not allocate temporary strings**. -The library exposes bidirectional UTF-8/UTF-16/UTF-32 views of `sys_string` as well as of any random access containers +This library provides a C++ string class template `sys_string_t` that is optimized for **interoperability with external native string type**. 
It is **immutable**, **Unicode-first** and exposes convenient **operations similar to Python or ECMAScript strings**. It uses a separate `sys_string_builder_t` class template to construct strings. It provides fast concatenation via `+` operator that **does not allocate temporary strings**. +The library exposes bidirectional UTF-8/UTF-16/UTF-32 views of `sys_string_t` as well as of any random access containers of characters. ## What does it mean? -* Interoperability with platform-native string type means that `sys_string` makes conversions to and from native string types as efficient as possible and ideally 0 cost operations. Native string types are things like `NSString *` or `CFStringRef` on macOS/iOS, Java `String` on Android, JavaScript `String` on Emscripten/WebAssembly, `const wchar_t *`, `HSTRING` or `BSTR` on Windows and `const char *` on Linux. For example on Apple's platforms it stores `NSString *` internally allowing zero cost conversion. On Android and Emscripten/WebAssembly where no-op conversions to Java/JavaScript strings are impossible for technical reasons, the internal storage is such that it makes conversions as cheap as possible. +* **Interoperability with external native string types** means that SysString makes conversions to and from such types as efficient as possible and, ideally, zero-cost operations. Native string types are things like: - Some platforms, like Windows, support multiple kinds of native string types. Internally, `sys_string` is a specialization of template `sys_string_t` where the `Storage` parameter defines what kind of native string type to use. The default storage for `sys_string` is picked for you based on your platform (you can change it via compilation options) but you can also directly use other specializations in your code if necessary. 
+ * `NSString *` or `CFStringRef` on macOS/iOS + * Java `String` on Android + * Python `str` in Python extensions + * JavaScript `String` on Emscripten/WebAssembly + * `const wchar_t *`, `HSTRING` or `BSTR` on Windows + * `const char *` on Linux. -* Immutable. String instances cannot be modified. To do modifications you use a separate "builder" class. This is similar to how many other languages do it and results in improved performance and elimination of whole class of errors. -* Unicode-first. Instances of `sys_string` always store Unicode characters in either UTF-8, UTF-16 or UTF-32, depending on platform. Iteration can be done in all of these encodings and all operations (case conversion, case insensitive comparisons, trimming) are specified as actions on sequence of Unicode codepoints using Unicode algorithms. -* Operations similar to Python or ECMAScript strings means that you can do things like `rtrim`, `split`, `join`, `starts_with` etc. in a way proven to be natural and productive in those languages. -* Concatenation does not allocate temporaries and copies addends once means that `result = s1 + s2 + s3` has one memory allocation and one copy of each of `s1`, `s2` and `s3` content into the result. Not 2 allocations and 5 copies like in other languages or with `std::string`. -* Bidirectional UTF-8/UTF-16/UTF-32 views. You can view `sys_string` as a sequence of UTF-8/16/32 characters and iterate forward or `backward` equally efficiently. Consider trying to find last instance of Unicode whitespace in UTF-8 data. Doing it as fast as finding the first instance is non-trivial. The views also work on any random access containers (C array, `std::array`, `std::vector`, `std::string`) of characters. Thus you can iterate in UTF-8 over `std::vector` etc. + `sys_string_t` and `sys_string_builder_t` are parametrized on `Storage` which defines what kind of native string type to use internally and interoperate with. 
Different `Storage` implementations are provided for all the external types above. + + For example the storage for Apple's platforms uses `NSString *` internally, allowing zero cost conversions between the C++ and native sides. + + On Android and Emscripten/WebAssembly no-op conversions from C++ to native strings are impossible for technical reasons. + The storage for these platforms' strings still makes conversions as cheap as possible (avoiding UTF conversions for example). + + The library also provides typedefs `sys_string`/`sys_string_builder` that use the "default" storage type on each platform (you can change which one it is via compilation options). Regardless of which storage is the default you can always directly use other specializations in your code if necessary. + + +* **Immutability.** String instances cannot be modified. To make modifications you use a separate "builder" class. This is similar to how many other languages do it and results in improved performance and elimination of a whole class of errors. + +* **Unicode-first.** Instances of `sys_string_t` always store Unicode characters in either UTF-8, UTF-16 or UTF-32, depending on their storage. Iteration can be done in all of these encodings and all operations (case conversion, case insensitive comparisons, trimming) are specified as actions on sequences of Unicode codepoints using Unicode algorithms. + +* **Operations similar to Python or ECMAScript strings.** You can do things like `rtrim`, `split`, `join`, `starts_with` etc. on `sys_string_t` in a way proven to be natural and productive in those languages. + +* **Concatenation does not allocate temporaries.** You can safely do things like `result = s1 + s2 + s3`. It will result in **one** memory allocation and one `memcpy` of each of `s1`, `s2` and `s3` content into the final result. Not 2 allocations and 5 copies like in other languages or with `std::string`. + +* **Bidirectional UTF-8/UTF-16/UTF-32 views**.
You can view `sys_string_t` as a sequence of UTF-8/16/32 characters and iterate forward or __backward__ equally efficiently. Consider trying to find the last instance of Unicode whitespace in UTF-8 data. Doing it as fast as finding the first instance is non-trivial. The views also work on any random access containers (C array, `std::array`, `std::vector`, `std::string`) of characters. Thus you can iterate in UTF-8 over `std::vector` etc. ## Why bother? Doesn't `std::string` work well? -An `std::string` storing UTF-8 (or `std::wstring` storing UTF-16 on Windows) works very well for some scenarios but fails miserably for others. `sys_string` class is an attempt to create something that works well where `std::string` would be a bad choice. +An `std::string` storing UTF-8 (or `std::wstring` storing UTF-16 on Windows) works very well for some scenarios but fails miserably for others. The `sys_string` class is an attempt to create something that works well in situations where `std::string` would be a bad choice. Specifically, `std::basic_string` is an STL container of a character type that owns its memory and controls it via a user-supplied allocator. These design choices make it very fast for direct character access but create the following problems: + * They rule out (efficient) reference-counted implementations. Which means that when you copy an `std::string` instance it must copy its content. Some of the penalty of that is alleviated by modern [small string optimization](https://akrzemi1.wordpress.com/2014/04/14/common-optimizations/) but this is, at best, a band-aid. There are workarounds, of course, such as using `std::shared_ptr>` "when it matters" but they result in even more complexity for something that is quite fundamental to any data processing. + * They foreclose any ability to efficiently interchange data with some other string type. It becomes problematic if your code needs to frequently ping-pong data between C++ and your OS string abstraction.
Consider Apple's platforms (macOS, iOS). Applications written for these platforms often have to extensively interoperate with code that requires usage of `NSString *` native string type. If you have to ping-pong string data a lot and/or store the same string data on both sides, using `std::string` will mean a large performance and memory penalty. + * They make `std::basic_string` Unicode hostile. By being oblivious to difference between "storage unit" and a "character", `std::basic_string` cannot really handle encodings such as `UTF-8` or `UTF-16` where the two differ. Yes you can store data in these encodings in it but you need to be extremely careful how you use it. What will `erase(it)` do if the iterator points in the middle of 4-byte UTF-8 sequence? Finally, and unrelatedly to the above, `std::string` lacks some simple things that are taken for granted these days by users of pretty much all other languages. There is case insensitive comparisons, no "trim" or "split" etc. It is possible to write those yourself of course but here the Unicode-unfriendliness raises its ugly head. To do any of these correctly you need to be able to handle a string as a sequence of Unicode characters and doing so with `std::string` is cumbersome. @@ -36,18 +58,23 @@ Finally, and unrelatedly to the above, `std::string` lacks some simple things th The following requirements which other string classes often have are specifically non-goals of this library. -* Support C++ allocators. Since `sys_string` is meant to interoperate with system string class/types, it necessarily has to use the same allocation mechanisms as those. +* Support C++ allocators. Since `sys_string_t` is meant to interoperate with system string class/types, it necessarily has to use the same allocation mechanisms as those. + * Have an efficient `const char * c_str()` method on all platforms. 
The goal of the library is to provide an efficient conversion to the native string types rather than specifically `const char *`. While ability to obtain `const char *` *is* provided everywhere, it might involve additional memory allocations and other overhead. Note that on Linux `char *` is the system type so it can be obtained with 0 cost. -* Make `sys_string` an STL container. Conceptually a string is not a container. You can **view** contents of a string as a sequence of UTF-8 or UTF-16 or UTF-32 codepoints and the library provides such views which function as STL ranges. -* Support non-Unicode "narrow" and "wide" character encodings. `sys_string` only understands Unicode. Conversions to/from non-Unicode encodings are a job for a different library. Specifically `char *` in any of the library's methods is required to be in UTF-8. + +* Make `sys_string_t` an STL container. Conceptually a string is not a container. You can **view** contents of a string as a sequence of UTF-8 or UTF-16 or UTF-32 codepoints and the library provides such views which function as STL ranges. + +* Support non-Unicode "narrow" and "wide" character encodings. `sys_string_t` only understands Unicode. Conversions to/from non-Unicode encodings are a job for a different library. Most significantly it means that any `char *` passed to this library's methods is required to be in UTF-8. + * Provide locale-dependent functionality. Properly supporting locales with Unicode is an important area but it belongs to another library, not this one. This library is focused on locale-independent behavior that works the same everywhere. For example `to_lower` methods implements locale-independent part of Unicode specification. (Final uppercase Σ transforms to ς but I always transforms to i) ## Performance -In general `sys_string` aims to have the same performance of its operations as best hand-crafted code that uses corresponding native string types on every platforms. 
For example on macOS code using `sys_string` should be as fast as code manually using `NSString *`/`CFStringRef`. -This needs to be kept in mind when evaluating whether `sys_string` is a better choice for your application that `std::string`. Continuing Apple's example, an `std::string` is generally faster for direct character access than `NSString *` and thus faster than `sys_string` too. If your code rarely transfers data from `NSString *` to `std::string` and spends most of the time iterating over `std::string` characters then using `std::string` might be the right choice. +In general `sys_string_t` aims to have the same performance of its operations as the best hand-crafted code that uses corresponding native string types on every platform. For example, on macOS, code using `sys_string` should be as fast as code manually using `NSString *`/`CFStringRef`. + +This needs to be kept in mind when evaluating whether `sys_string` is a better choice for your application than `std::string`. Continuing Apple's example, an `std::string` is generally faster for direct character access than `NSString *` and thus faster than `sys_string` too. If your code rarely transfers data from `NSString *` to `std::string` and spends most of the time iterating over `std::string` characters then using `std::string` might be the right choice. -Another way to look at it is that `sys_string` sometimes trades micro-benchmarking performance of individual string operations for reduced copying, allocations and memory pressure overall. Whether this is a right tradeoff for you depends on specifics of your codebase. +Another way to look at it is that `sys_string_t` sometimes trades micro-benchmarking performance of individual string operations for reduced copying, allocations and memory pressure overall. Whether this is the right tradeoff for you depends on the specifics of your codebase.
## Compatibility diff --git a/doc/Building.md b/doc/Building.md index 07b594a..c4fb3f5 100644 --- a/doc/Building.md +++ b/doc/Building.md @@ -1,16 +1,10 @@ # Building and configuration - - ## Building -If you use CMake clone this repository and add the `lib` directory as subdirectory. Something like +### CMake -```cmake -add_subdirectory(PATH_TO_SYS_STRING_REPO/lib, sys_string) -``` - -Alternatively with modern CMake you can just do +With modern CMake the easiest way to use this library is ```cmake include(FetchContent) @@ -23,16 +17,41 @@ FetchContent_Declare(sys_string FetchContent_MakeAvailable(sys_string) ``` -You need to have your compiler to default to at least C++17 or set `CMAKE_CXX_STANDARD` to at least 17 in order for build to succeed. +Alternatively, if you prefer to store the sources locally, clone this repository and +add its `lib` sub-directory as a subdirectory of your CMake project. Something like + +```cmake +add_subdirectory(PATH_TO_SYS_STRING_REPO/lib sys_string) +``` + +In either case you can now use this library by doing something like + +```cmake +target_link_libraries(your_target PRIVATE sys_string) +``` + +Note that your compiler must default to at least C++17, or you must set `CMAKE_CXX_STANDARD` to at least 17, for the build to succeed. + +### Other build systems + +If you use a different build system then: + +* Set your include path to `lib/inc` +* No special preprocessor flags are required except on Windows where `_CRT_SECURE_NO_WARNINGS` must be defined to avoid MSVC bogus warnings. +* Compile the sources under `lib/cpp` into a static library and add it to your link step +* You will need to link with the following libraries: + * Mac: `CoreFoundation` framework + * Windows: `runtimeobject.lib` + * Emscripten: `embind` -If you use a different build system all you need is to set your include path to `lib/inc` and compile the sources under `lib/cpp`.
-No special compilation flags are required except on Windows where `_CRT_SECURE_NO_WARNINGS` must be defined to avoid MSVC bogus warnings. -On Mac you need to link with `CoreFoundation` framework and on Windows with `runtimeobject.lib`. +## Configuration options -### Configuration options +Whichever build system you use, you can set the following macros (either on the command line or _before_ including any library headers) to control the library behavior: -* `SYS_STRING_NO_S_MACRO` - set to 1 to disable short `S()` macro. See [Usage](doc/Usage.md#basics) for details -* `SYS_STRING_WIN_BSTR` - set to 1 to use `BSTR` as native `sys_string` type on Windows -* `SYS_STRING_WIN_HSTRING` - set to 1 to use `HSTRING` as native `sys_string` type on Windows -* `SYS_STRING_USE_GENERIC` - set to 1 to use `const char *` as native `sys_string` type on MacOS (similar to Linux) \ No newline at end of file +* `SYS_STRING_NO_S_MACRO` - set it to 1 to disable short `S()` macro. See [Usage](doc/Usage.md#basics) for details +* `SYS_STRING_WIN_BSTR` - set it to 1 to use `BSTR` as default `sys_string` storage on Windows. It has no effect on other platforms. +* `SYS_STRING_WIN_HSTRING` - set it to 1 to use `HSTRING` as default `sys_string` storage on Windows. It has no effect on other platforms. +* `SYS_STRING_USE_GENERIC` - set it to 1 to use `const char *` as default `sys_string` storage on macOS and Emscripten (similar to Linux). It has no effect on other platforms. +* `SYS_STRING_ENABLE_PYTHON` - set it to 1 to enable Python support. This requires header `` to be available on the include path. If enabled the `sys_string_pystr` and related classes become available but the default `sys_string` will still be your platform default one. +* `SYS_STRING_USE_PYTHON` - set it to 1 to make Python strings the default `sys_string` storage. This automatically enables `SYS_STRING_ENABLE_PYTHON` and has the same requirements.
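As a rough sketch of setting one of the macros above from the command line in a CMake build: the fragment below assumes the definition should be visible both to the library's own sources and to your code, so it is set for the whole directory scope before the library is added. `add_compile_definitions` needs CMake 3.12 or later; `your_target` and `main.cpp` are placeholders, and `SYS_STRING_USE_GENERIC` is used only as an example of a macro you might pick.

```cmake
# Sketch only: choose the macro(s) you actually need.
# Defined before the library is added so that the library sources
# and your own targets are compiled with the same setting.
add_compile_definitions(SYS_STRING_USE_GENERIC=1)

add_subdirectory(PATH_TO_SYS_STRING_REPO/lib sys_string)

add_executable(your_target main.cpp)   # placeholder target and source
target_link_libraries(your_target PRIVATE sys_string)
```

The same ordering should apply when the library is pulled in through `FetchContent_MakeAvailable`, since that also adds the library as a subdirectory of the current scope.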
diff --git a/doc/Emscripten.md b/doc/Emscripten.md index 8065996..c88cc95 100644 --- a/doc/Emscripten.md +++ b/doc/Emscripten.md @@ -1,10 +1,10 @@ ## Emscripten platform conversions -Similar to Android when compiling under Emscripten there are two storage types available. The default one is optimized for interoperability with JavaScript. It stores a sequence of `char16_t` which can be converted to `String` with the least amount of overhead. +When compiling under Emscripten there are two storage types available. The default one is optimized for interoperability with JavaScript. It stores a sequence of `char16_t` which can be converted to `String` with the least amount of overhead. Additionally you can chose a "generic Unix" storage which stores `char *` and is meant to interoperate with plain Unix API. It can be selected via `#define SYS_STRING_USE_GENERIC 1` and is described under [Linux](Linux.md). -With JavaScript-optimized storage a conversion to and from `String` is not-trivial. It incurs allocation on JavaScript or native heap and copying between them. +With JavaScript-optimized storage a conversion to and from `String` is non-trivial. It incurs allocation on JavaScript or native heap and copying between them. Conversions rely on on `embind` [library](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html) so you will need to link with that (e.g. `-lembind`). @@ -14,9 +14,12 @@ Conversions rely on on `embind` [library](https://emscripten.org/docs/porting/co EM_VAL handle_in = ... //passed from JavaScript, see below sys_string str(handle_in); assert(str == S("abc")); + +... + EM_VAL handle_out = str.make_js_string(); assert(handle_in != handle_out); //in and out are NOT the same! 
-//Return handle_out to JavaScript see below +//Return handle_out to JavaScript, see below ``` Passing strings from JavaScript can be accomplished as follows @@ -24,7 +27,7 @@ Passing strings from JavaScript can be accomplished as follows ```javascript let handle_in = Emval.toHandle("abc"); try { - callNativeFunction(handle_in); + nativeFunctionWithStringArg(handle_in); } finally { __emval_decref(handle_in); } @@ -33,7 +36,7 @@ try { And receiving strings from native code as follows ```javascript -let handle_out = callNativeFunction(); +let handle_out = nativeFunctionReturningString(); try { let str = Emval.toValue(handle_out); } finally { @@ -41,4 +44,7 @@ try { } ``` +Note that unlike other platforms you **cannot pass `null` (or `undefined`)** as strings to the C++ side. Neither will `make_js_string()` ever return a handle that will convert to `null` or `undefined` on the JavaScript side. + + diff --git a/doc/Python.md b/doc/Python.md new file mode 100644 index 0000000..627b945 --- /dev/null +++ b/doc/Python.md @@ -0,0 +1,23 @@ +## Python conversions + +When compiled with `SYS_STRING_USE_PYTHON=1` the default storage type for `sys_string` becomes Python string (i.e. `PyObject *`). You can also compile with just +`SYS_STRING_ENABLE_PYTHON=1` in which case Python strings become available via `sys_string_pystr` and `sys_string_builder_pystr` but are not the default. + +With `PyObject *` storage `sys_string` is trivially convertible from and to a Python API string. + +```cpp +//Converting to/from PyObject * +auto raw = PyUnicode_FromString("abc"); +sys_string str(raw); //this increments reference. You still own raw +assert(str == S("abc")); + +auto raw1 = str.py_str(); //returns borrowed reference! +assert(raw1 == raw); +assert(strcmp(PyUnicode_AsUTF8(raw1), "abc") == 0); + +Py_DECREF(raw); //you own raw but not raw1 + +``` + +Note that unlike other platforms `PyObject *` passed to `sys_string` cannot be `null` and `py_str()` will never return `null`.
+This is in keeping with normal Python API semantics where `null` pointers are never valid and signify errors. diff --git a/doc/Usage.md b/doc/Usage.md index 24680d0..66dd787 100644 --- a/doc/Usage.md +++ b/doc/Usage.md @@ -111,6 +111,7 @@ Those are described on the following pages. * [Android](Android.md) * [Linux](Linux.md) * [Emscripten](Emscripten.md) +* [Python](Python.md) ## Adding Strings