@darjus commented Mar 21, 2025

Hey Ferris, I was looking for some punkt-like functionality and came across your crate. It seems it hasn't been updated in a while, and since I need the functionality anyway, I decided to update your crate instead.

Test results:

rust-punkt % cargo t
   Compiling punkt v1.0.6 ()
    Finished `test` profile [unoptimized + debuginfo] target(s) in 1.13s
     Running unittests src/lib.rs (target/debug/deps/punkt-1af74e20e78321db)

running 8 tests
test token::tests::test_token_flags ... ok
test tokenizer::tests::smoke_test_is_multi_char_pass ... ok
test tokenizer::tests::sentence_tokenizer_issue_8_test ... ok
test trainer::tests::test_data_load_from_json_test ... ok
test tokenizer::tests::sentence_tokenizer_issue_5_test ... ok
test tokenizer::tests::periodctxt_tokenizer_compare_nltk ... ok
test tokenizer::tests::word_tokenizer_compare_nltk ... ok
test tokenizer::tests::sentence_tokenizer_compare_nltk_train_on_document ... ok

test result: ok. 8 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.10s

   Doc-tests punkt

running 8 tests
test src/lib.rs - (line 103) ... ok
test src/lib.rs - (line 78) ... ok
test src/lib.rs - (line 43) ... ok
test src/tokenizer.rs - tokenizer::SentenceTokenizer (line 448) ... ok
test src/trainer.rs - trainer::TrainingData (line 93) ... ok
test src/tokenizer.rs - tokenizer::SentenceByteOffsetTokenizer (line 343) ... ok
test src/lib.rs - (line 53) ... ok
test src/lib.rs - (line 26) ... ok

test result: ok. 8 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.20s

Bench results:

Gnuplot not found, using plotters backend
WordTokenizer/short_doc time:   [61.246 µs 61.474 µs 61.720 µs]
                        change: [-3.5504% -1.4128% -0.0352%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
WordTokenizer/medium_doc
                        time:   [71.463 µs 71.617 µs 71.792 µs]
                        change: [-1.3709% -0.4776% +0.1211%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
WordTokenizer/long_doc  time:   [3.2085 ms 3.2154 ms 3.2221 ms]
                        change: [-0.4503% -0.0955% +0.2618%] (p = 0.60 > 0.05)
                        No change in performance detected.
WordTokenizer/very_long_doc
                        time:   [11.827 ms 11.925 ms 12.048 ms]
                        change: [-3.6751% -1.1126% +0.8954%] (p = 0.42 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

SentenceTokenizer/short_doc
                        time:   [44.690 µs 44.775 µs 44.860 µs]
                        change: [-0.8286% -0.5456% -0.2262%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
SentenceTokenizer/medium_doc
                        time:   [56.706 µs 56.874 µs 57.086 µs]
                        change: [-0.2393% +1.6224% +4.9058%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
SentenceTokenizer/long_doc
                        time:   [8.5785 ms 8.5926 ms 8.6067 ms]
                        change: [-1.6259% -0.5729% +0.1424%] (p = 0.25 > 0.05)
                        No change in performance detected.

     Running `rust-punkt/target/release/deps/trainers-190d81f24efadab0 --bench`
Gnuplot not found, using plotters backend
Trainer/short_doc       time:   [120.90 µs 121.10 µs 121.30 µs]
                        change: [-1.0087% -0.7565% -0.4876%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 21 outliers among 100 measurements (21.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  12 (12.00%) high mild
  4 (4.00%) high severe
Trainer/medium_doc      time:   [161.96 µs 162.25 µs 162.51 µs]
                        change: [-0.0192% +0.3593% +0.7069%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 19 outliers among 100 measurements (19.00%)
  2 (2.00%) low severe
  5 (5.00%) low mild
  4 (4.00%) high mild
  8 (8.00%) high severe
Trainer/long_doc        time:   [7.3356 ms 7.3480 ms 7.3614 ms]
                        change: [+0.0040% +0.2177% +0.4423%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Trainer/very_long_doc   time:   [25.326 ms 25.433 ms 25.545 ms]
                        change: [-1.2095% +0.4529% +1.5705%] (p = 0.64 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
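
For anyone reading the verdict lines above ("No change in performance detected." vs. "Change within noise threshold."), here is a rough sketch of how such a verdict can be derived from the reported change interval and p-value. The 0.05 significance level is taken from the output itself; the 1% noise threshold is an assumption (I believe it is Criterion's default, but I haven't checked this run's config), and `classify` and `Verdict` are hypothetical names for illustration, not Criterion's API.

```rust
/// The possible verdicts, mirroring the messages in the bench output.
#[derive(Debug, PartialEq)]
enum Verdict {
    NoChange,    // "No change in performance detected."
    WithinNoise, // "Change within noise threshold."
    Improved,
    Regressed,
}

/// Classify a benchmark change.
/// `lower` and `upper` are the bounds of the reported change interval as
/// fractions (-0.008286 means -0.8286%). `significance` and `noise` are
/// assumed to be 0.05 and 0.01 in the example calls below.
fn classify(lower: f64, upper: f64, p: f64, significance: f64, noise: f64) -> Verdict {
    if p >= significance {
        // The difference is not statistically significant.
        return Verdict::NoChange;
    }
    if lower < -noise && upper < -noise {
        // The whole interval is faster than the noise band.
        Verdict::Improved
    } else if lower > noise && upper > noise {
        // The whole interval is slower than the noise band.
        Verdict::Regressed
    } else {
        // Significant, but the interval overlaps the noise band.
        Verdict::WithinNoise
    }
}

fn main() {
    // SentenceTokenizer/short_doc: [-0.8286% .. -0.2262%], p printed as 0.00
    println!("{:?}", classify(-0.008286, -0.002262, 0.00, 0.05, 0.01));
    // WordTokenizer/short_doc: [-3.5504% .. -0.0352%], p = 0.12
    println!("{:?}", classify(-0.035504, -0.000352, 0.12, 0.05, 0.01));
}
```

Under these assumptions, none of the benchmarks above show a change outside the noise band, which is what I would expect from a maintenance update.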
