Commit
Merge branch 'vnext'
gershnik committed Dec 3, 2024
2 parents f6d5127 + f14cf97 commit d17ff84
Showing 54 changed files with 12,383 additions and 21,436 deletions.
50 changes: 20 additions & 30 deletions .github/workflows/test.yml
@@ -15,7 +15,7 @@ on:

env:
BUILD_TYPE: Release
NDK_VER: 21.3.6528147
NDK_VER: 27.2.12479018
NDK_ARCH: x86_64
NDK_API: 29

@@ -26,33 +26,18 @@ jobs:
fail-fast: false
matrix:
include:
- os: macos-latest
- {os: macos-15, version: 16 }
- {os: macos-14, version: "15.4" }

- os: windows-latest
- os: ubuntu-latest
compiler: gcc
version: 11
- os: ubuntu-latest
compiler: gcc
version: 12
- os: ubuntu-latest
compiler: gcc
version: 13
# See https://github.com/actions/runner-images/issues/8659
# - os: ubuntu-latest
# compiler: clang
# version: 13
# - os: ubuntu-latest
# compiler: clang
# version: 14
- os: ubuntu-latest
compiler: clang
version: 15
- os: ubuntu-latest
compiler: clang
version: 16
- os: ubuntu-latest
compiler: clang
version: 17

- {os: ubuntu-latest, compiler: gcc, version: 12 }
- {os: ubuntu-latest, compiler: gcc, version: 13 }
- {os: ubuntu-24.04, compiler: gcc, version: 14 }

- {os: ubuntu-latest, compiler: clang, version: 16 }
- {os: ubuntu-latest, compiler: clang, version: 17 }
- {os: ubuntu-latest, compiler: clang, version: 18 }

steps:
- name: Checkout
@@ -61,22 +46,27 @@ jobs:
- name: System Setup
shell: bash
run: |
if [[ '${{ matrix.os }}' == 'ubuntu-latest' ]]; then
if [[ '${{ matrix.os }}' == ubuntu-* ]]; then
if [[ '${{ matrix.compiler }}' == 'clang' ]]; then
wget https://apt.llvm.org/llvm.sh
chmod u+x llvm.sh
sudo ./llvm.sh ${{ matrix.version }}
sudo ./llvm.sh ${{ matrix.version }}
sudo apt-get install -y clang-tools-${{ matrix.version }}
echo "CC=clang-${{ matrix.version }}" >> $GITHUB_ENV
echo "CXX=clang++-${{ matrix.version }}" >> $GITHUB_ENV
fi
if [[ '${{ matrix.compiler }}' == 'gcc' ]]; then
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install -y gcc-${{ matrix.version }} g++-${{ matrix.version }}
echo "CC=gcc-${{ matrix.version }}" >> $GITHUB_ENV
echo "CXX=g++-${{ matrix.version }}" >> $GITHUB_ENV
fi
fi
if [[ '${{ matrix.os }}' == macos-* ]]; then
echo "DEVELOPER_DIR=/Applications/Xcode_${{ matrix.version }}.app" >> $GITHUB_ENV
fi
- name: Configure
shell: bash
@@ -149,7 +139,7 @@ jobs:
- name: Set Up Emscripten
uses: mymindstorm/setup-emsdk@v14
with:
version: 3.1.26
version: 3.1.70
actions-cache-folder: 'emsdk-cache'

- name: Configure
2 changes: 1 addition & 1 deletion .gitignore
@@ -8,5 +8,5 @@ out/

*.pyc


CMakeSettings.json
test/android
50 changes: 50 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,56 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

## Unreleased

This is a major release with some breaking changes.

### Changed

- C++20 or higher is now required for compilation. In particular, the following C++20 features must be available:
- Ranges support in standard library (`__cpp_lib_ranges >= 201911`)
- Three-way comparison (spaceship operator)
- `char8_t` type
- `std::endian` support in standard library (`__cpp_lib_endian >= 201907`)
- The minimum compiler versions known to work are: GCC 12, Clang 16, Apple Clang 15.4 and MSVC 17.6.
- The library has been _range_-ified.
- All methods that used to accept iterator pairs now take iterator/sentinel pairs.
- All these methods now also have overloads that accept ranges.
- Existing informal ranges (`sys_string::char_access`, `sys_string::utf_view`, etc.) are now
formal ranges or views.
- As part of the above, the `sys_string::utfX_view` classes have been renamed to `sys_string::utfX_access` (because they are
not formally views as defined by the standard library). The old names have been retained for compatibility but are annotated
as deprecated. Note that `sys_string_builder::utf_view` remains under the same name since it *is* a view.
- Breaking change: as part of the above change, `sys_string::utf_access` and `sys_string_builder::utf_view` now
return distinct iterators and sentinels (that is, they no longer satisfy the `std::ranges::common_range` concept).
You will need to use ranges algorithms with their iterators (e.g. `std::ranges::for_each` rather than `std::for_each`).
- The global `utf_view` template has been split into two: `utf_ref_view` that takes underlying range by reference (similar
to `std::ref_view`) and `utf_owning_view` that owns a movable underlying range (similar to `std::owning_view`). These
are automatically produced by the `as_utf` range adapter closures (see the Added section below).
- Breaking change: the non-standard `Cursor` classes have been removed.
- The library has been _concept_-ified.
- Most templated library calls now have concepts checks that validate their argument types.
- The primitive `std::enable_if` checks used before have been subsumed by these and removed.
- Unicode data used for case folding and whitespace detection has been updated to version 16.0.0.

### Added
- `sys_string_t` can now be `+`-ed with any forward range of any type of character (including C strings and std::string).
This results in the same optimized addition as when adding `sys_string_t` objects.
- `sys_string_t` objects can now be formatted via `std::format` (if available in your library). On platforms
where `wchar_t` is UTF-16 or UTF-32 you can also use wide character formatting.
- `sys_string_t::std_format` method. This formats a new `sys_string_t` (similar to the existing `sys_string_t::format`)
but uses `std::format` machinery and formatting string syntax.
- Range adapter closures: `as_utf8`, `as_utf16`, `as_utf32` and generic `as_utf<encoding>`.
- These can be used to create `utf_ref_view`/`utf_owning_view` from any range/view. For example `as_utf16(std::string("abc"))`.
- If your library supports custom adapter closures (usually `__cpp_lib_ranges >= 202202L`) they can be used in
view pipelines like `std::string("abc") | as_utf16 | std::views::take(2)` etc.

### Fixed
- Printing `sys_string_t` objects into `std::ostream` (and `std::wostream` if available) now functions correctly in the presence
of stream formatting flags. Flags are currently ignored. This might change in a future version.
- Printing/formatting `sys_string_t` objects that use `char` storage type now does not perform sanitizing transcoding. The content
of the string is printed as-is. This allows faithful round-tripping and support for invalid Unicode for those scenarios. Similar
behavior applies to `wchar_t` on platforms where it is UTF-16 or UTF-32.
- `operator<<` no longer pollutes the global namespace.

## [2.14] - 2024-05-02

### Fixed
44 changes: 29 additions & 15 deletions README.md
@@ -1,12 +1,12 @@
## SysString

[![Language](https://img.shields.io/badge/language-C++-blue.svg)](https://isocpp.org/)
[![Standard](https://img.shields.io/badge/C%2B%2B-17-blue.svg)](https://en.wikipedia.org/wiki/C%2B%2B#Standardization)
[![Standard](https://img.shields.io/badge/C%2B%2B-20-blue.svg)](https://en.wikipedia.org/wiki/C%2B%2B#Standardization)
[![License](https://img.shields.io/badge/license-BSD-brightgreen.svg)](https://opensource.org/licenses/BSD-3-Clause)
[![Tests](https://github.com/gershnik/sys_string/actions/workflows/test.yml/badge.svg)](https://github.com/gershnik/sys_string/actions/workflows/test.yml)

This library provides a C++ string class template `sys_string_t` that is optimized for **interoperability with external native string type**. It is **immutable**, **Unicode-first** and exposes convenient **operations similar to Python or ECMAScript strings**. It uses a separate `sys_string_builder_t` class template to construct strings. It provides fast concatenation via `+` operator that **does not allocate temporary strings**.
The library exposes bidirectional UTF-8/UTF-16/UTF-32 views of `sys_string_t` as well as of any random access containers
The library exposes bidirectional UTF-8/UTF-16/UTF-32 views of `sys_string_t` as well as of any C++ input ranges of characters.
of characters.

## What does it mean?
@@ -24,10 +24,10 @@ of characters.

For example the storage for Apple's platforms uses `NSString *` internally, allowing zero cost conversions between C++ and native sides.

On Android and Emscripten/WebAssembly no-op conversions from C++ to native strings are impossible for technical reasons.
The storage for these platforms' strings still makes conversions as cheap as possible (avoiding UTF conversions for example).
On Android and Emscripten/WebAssembly no-op conversions from C++ to native strings are technically impossible.
The storage implementations for these platforms still make conversions as cheap as possible (avoiding UTF conversions, for example).

The library also provides typedefs `sys_string`/`sys_string_builder` that use the "default" storage type on each platform (you can change which one it is via compilation options). Regardless of which storage is the default you can always directly use other specializations in your code if necessary.
The library also provides `sys_string`/`sys_string_builder` typedefs that use the "default" storage type on each platform (you can change which one it is via compilation options). Regardless of which storage is the default you can always directly use other specializations in your code if necessary.


* **Immutability.** String instances cannot be modified. To do modifications you use a separate "builder" class. This is similar to how many other languages do it and results in improved performance and elimination of whole class of errors.
@@ -36,7 +36,7 @@ of characters.

* **Operations similar to Python or ECMAScript strings.** You can do things like `rtrim`, `split`, `join`, `starts_with` etc. on `sys_string_t` in a way proven to be natural and productive in those languages.

* **Concatenation does not allocate temporaries.** You can safely do things like `result = s1 + s2 + s3`. It will result in **one** memory allocation and one `memcpy` of `s1`, `s2` and `s3` content into the final result. Not 2 allocations and 5 copies like in other languages or with `std::string`.
* **Concatenation does not allocate temporaries.** You can safely do things like `result = s1 + s2 + s3`. It will result in **one** memory allocation and 3 calls to `memcpy` to copy each of `s1`, `s2` and `s3` content into the final result. Not 2 allocations and 5 copies like in other languages or with `std::string`.

* **Bidirectional UTF-8/UTF-16/UTF-32 views**. You can view `sys_string_t` as a sequence of UTF-8/16/32 characters and iterate forward or __backward__ equally efficiently. Consider trying to find last instance of Unicode whitespace in UTF-8 data. Doing it as fast as finding the first instance is non-trivial. The views also work on any random access containers (C array, `std::array`, `std::vector`, `std::string`) of characters. Thus you can iterate in UTF-8 over `std::vector<char16_t>` etc.
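
The single-allocation concatenation described in the list above can be sketched with standard library types. This is only an illustration of the idea (compute the total size, allocate once, then `memcpy` each piece); `concat_once` is a hypothetical helper and says nothing about how `sys_string_t` implements `+` internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <initializer_list>
#include <string>
#include <string_view>

// Hypothetical helper: joins all pieces using one buffer allocation
// and one memcpy per non-empty piece.
inline std::string concat_once(std::initializer_list<std::string_view> pieces) {
    std::size_t total = 0;
    for (std::string_view p : pieces)
        total += p.size();

    std::string result(total, '\0');       // the only allocation
    char * out = result.data();
    for (std::string_view p : pieces) {
        if (!p.empty()) {                  // avoid memcpy from a null pointer
            std::memcpy(out, p.data(), p.size());
            out += p.size();
        }
    }
    assert(out == result.data() + total);  // every byte was written exactly once
    return result;
}
```

By contrast, evaluating `s1 + s2 + s3` pairwise with `std::string` builds an intermediate `s1 + s2` string before producing the final result.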

@@ -49,18 +49,18 @@ Specifically, `std::basic_string` is an STL container of a character type that o

* They foreclose any ability to efficiently interchange data with some other string type. It becomes problematic if your code needs to frequently ping-pong data between C++ and your OS string abstraction. Consider Apple's platforms (macOS, iOS). Applications written for these platforms often have to extensively interoperate with code that requires usage of `NSString *` native string type. If you have to ping-pong string data a lot and/or store the same string data on both sides, using `std::string` will mean a large performance and memory penalty.

* They make `std::basic_string` Unicode hostile. By being oblivious to difference between "storage unit" and a "character", `std::basic_string` cannot really handle encodings such as `UTF-8` or `UTF-16` where the two differ. Yes you can store data in these encodings in it but you need to be extremely careful how you use it. What will `erase(it)` do if the iterator points in the middle of 4-byte UTF-8 sequence?
* They make `std::basic_string` Unicode hostile. By being oblivious to difference between a "storage unit" and a "character", `std::basic_string` cannot really handle encodings such as `UTF-8` or `UTF-16` where the two differ. Yes you can store data in these encodings in it but you need to be extremely careful how you use it. What will `erase(it)` do if the iterator points in the middle of 4-byte UTF-8 sequence?

Finally, and unrelatedly to the above, `std::string` lacks some simple things that are taken for granted these days by users of pretty much all other languages. There is case insensitive comparisons, no "trim" or "split" etc. It is possible to write those yourself of course but here the Unicode-unfriendliness raises its ugly head. To do any of these correctly you need to be able to handle a string as a sequence of Unicode characters and doing so with `std::string` is cumbersome.
Finally, and unrelatedly to the above, `std::string` lacks some simple things that are taken for granted these days by users of pretty much all other languages. There are no case-insensitive comparisons, no "trim" or "split" etc. It is possible to write those yourself of course but here the Unicode-unfriendliness raises its ugly head. To do any of these correctly you need to be able to handle a string as a sequence of Unicode characters and doing so with `std::string` is cumbersome.


## Non-goals

The following requirements which other string classes often have are specifically non-goals of this library.

* Support C++ allocators. Since `sys_string_t` is meant to interoperate with system string class/types, it necessarily has to use the same allocation mechanisms as those.
* Support the C++ allocator mechanism. Since `sys_string_t` is meant to interoperate with other string classes/types, it necessarily has to use the same allocation mechanisms as those. Different allocation behavior can be accomplished via a custom `Storage` class.

* Have an efficient `const char * c_str()` method on all platforms. The goal of the library is to provide an efficient conversion to the native string types rather than specifically `const char *`. While ability to obtain `const char *` *is* provided everywhere, it might involve additional memory allocations and other overhead. Note that on Linux `char *` is the system type so it can be obtained with 0 cost.
* Have an efficient `const char * c_str()` method on all platforms. The goal of the library is to provide an efficient conversion to the native string types rather than specifically `const char *`. While the ability to obtain `const char *` *is* provided everywhere, it might involve additional memory allocations and other overhead. Of course, when the storage of `sys_string_t` is `char` it can be obtained at zero cost.

* Make `sys_string_t` an STL container. Conceptually a string is not a container. You can **view** contents of a string as a sequence of UTF-8 or UTF-16 or UTF-32 codepoints and the library provides such views which function as STL ranges.

@@ -78,11 +78,25 @@ Another way to look at it is that `sys_string_t` sometimes trades micro-benchmar

## Compatibility

This library has been tested with
* Xcode 13 - 14 on x86_64 and arm64
* MSVC 16.9 - 17.4 on x86_64
* Clang 12.0.5 under Android NDK, ANDROID_PLATFORM=19 on x86, x86_64, `armeabi-v7a` and `arm64-v8a` architectures
* GCC 9.3 - 11.3 on x86_64 Ubuntu 20.04 - 22.04
Starting from version 3 this library requires a C++20 compiler. In particular, the following C++20 features must be available:
- Ranges support in standard library (`__cpp_lib_ranges >= 201911`)
- Three-way comparison (spaceship operator)
- `char8_t` type
- `std::endian` support in standard library (`__cpp_lib_endian >= 201907`)
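
A consuming project could check these requirements at compile time roughly as follows. This is a minimal sketch: the function name `has_required_cpp20_features` is hypothetical and not part of the library, and using `__cpp_impl_three_way_comparison` and `__cpp_char8_t` to test for the two language features is an assumption rather than something the library documents.

```cpp
#include <version>   // defines the standard library feature-test macros

// Returns true when the toolchain advertises the C++20 features listed above.
constexpr bool has_required_cpp20_features() {
#if defined(__cpp_lib_ranges) && __cpp_lib_ranges >= 201911L \
    && defined(__cpp_lib_endian) && __cpp_lib_endian >= 201907L \
    && defined(__cpp_impl_three_way_comparison) \
    && defined(__cpp_char8_t)
    return true;
#else
    return false;
#endif
}
```

A project that wants to fail early could add `static_assert(has_required_cpp20_features(), "a C++20 toolchain is required");` next to its first include of the library.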

The library is known to work with at least:
* Xcode 15.4
* MSVC 17.6
* Clang 16
* GCC 12
* Emscripten 3.1.70

Version 2 of this library was the last version supporting C++17. It is known to work at least with:

* Xcode 13
* MSVC 16.9
* Clang 12.0.5
* GCC 9.3
* Emscripten 3.1.21

## Usage