Custom dictionary 2.1 does not always take effect; see the example. #58

Open
zw963 opened this issue Nov 20, 2021 · 2 comments

zw963 commented Nov 20, 2021

First, here is the result without any custom dictionary entries. Note the token "感恩" in the output.

psql (13.4)
Type "help" for help.

marketbet_crawler_development=# select * from zhparser.zhprs_custom_word;
 word | tf | idf | attr 
------+----+-----+------
(0 rows)
marketbet_crawler_development=# SELECT ts_parse('zhparser','金市周评:FED加息预期升温且国际贸 易局势缓和,感恩节前金价回落');
  ts_parse  
------------
 (110,金市)
 (110,市)
 (110,周评)
 (118,评)
 (117,:)
 (101,FED)
 (118,加息)
 (118,加)
 (110,息)
 (118,预期)
 (118,升温)
 (118,升)
 (118,温)
 (99,且)
 (110,国际)
 (110,国)
 (110,际)
 (110,贸)
 (97,易)
 (110,局势)
 (110,局)
 (110,势)
 (118,缓和)
 (118,缓)
 (117,,)
 (118,感恩)
 (110,恩)
 (116,节前)
 (110,节)
 (110,金价)
 (110,价)
 (118,回落)
 (118,回)
 (118,落)
(34 rows)

Next, add "感恩节" to the custom dictionary (2.1):

marketbet_crawler_development=# INSERT INTO zhparser.zhprs_custom_word values('感恩节') ON CONFLICT DO NOTHING;
INSERT 0 1
marketbet_crawler_development=# select * from zhparser.zhprs_custom_word; 
 word  | tf | idf | attr 
--------+----+-----+------
 感恩节 |  1 |   1 | @
(1 row)

To make sure the change takes effect, I quit psql, reconnected, and ran sync_zhprs_custom_word();. The entry "感恩节" is still there:

marketbet_crawler_development=# SELECT sync_zhprs_custom_word();
 sync_zhprs_custom_word 
------------------------
 
(1 row)

marketbet_crawler_development=# select * from zhparser.zhprs_custom_word;
  word  | tf | idf | attr 
--------+----+-----+------
 感恩节 |  1 |   1 | @
(1 row)

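Then I ran the query again, and here is the problem: "感恩节" does not appear as a token. In fact, the two outputs are identical, as if the word had never been added.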

marketbet_crawler_development=# SELECT ts_parse('zhparser','金市周评:FED加息预期升温且国际贸 易局势缓和,感恩节前金价回落');
  ts_parse  
------------
 (110,金市)
 (110,市)
 (110,周评)
 (118,评)
 (117,:)
 (101,FED)
 (118,加息)
 (118,加)
 (110,息)
 (118,预期)
 (118,升温)
 (118,升)
 (118,温)
 (99,且)
 (110,国际)
 (110,国)
 (110,际)
 (110,贸)
 (97,易)
 (110,局势)
 (110,局)
 (110,势)
 (118,缓和)
 (118,缓)
 (117,,)
 (118,感恩)
 (110,恩)
 (116,节前)
 (110,节)
 (110,金价)
 (110,价)
 (118,回落)
 (118,回)
 (118,落)
(34 rows)

zlianzhuang (Collaborator) commented

postgres=# SELECT ts_parse('zhparser','感恩节是什么');
ts_parse

(116,感恩节)
(118,是)
(114,什么)
(3 rows)

postgres=# SELECT ts_parse('zhparser','感恩节前金价回落');
ts_parse

(118,感恩)
(116,节前)
(110,金价)
(118,回落)
(4 rows)

postgres=# insert into zhparser.zhprs_custom_word values('感恩节前');
INSERT 0 1
postgres=# select sync_zhprs_custom_word();
sync_zhprs_custom_word

(1 row)

postgres=#
\q
[lzzhang@lzzhang bin]$ ./psql -d postgres
psql (15.0)
Type "help" for help.

postgres=# SELECT ts_parse('zhparser','感恩节前金价回落');
ts_parse

(120,感恩节前)
(110,金价)
(118,回落)
(3 rows)

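"感恩节" is already in the dictionary, but scws seems to favor segmentations that produce more words, for example "感恩 节前" rather than "感恩节 前". Another example:

postgres=# SELECT ts_parse('zhparser','感恩节假日来临');
ts_parse

(118,感恩)
(116,节假日)
(118,来临)

This seems fairly hard to handle.

You could open an issue with scws, but that project has not been maintained for a long time, so it may not get addressed.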
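
The ambiguity described above can be made concrete with a small Python sketch. It is not scws's algorithm, and the toy dictionary below is an assumption chosen for illustration. It only enumerates the ways "感恩节前" can be split when both "感恩节" and "节前" are known words, which is the choice the segmenter has to make.

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Recurse on the remainder and prepend the matched word.
            for tail in segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


# Toy dictionary for illustration only; not the real scws dictionary.
toy_dictionary = {"感恩", "感恩节", "节前", "前"}

candidates = segmentations("感恩节前", toy_dictionary)
for words in candidates:
    print(" ".join(words))
```

Both "感恩 节前" and "感恩节 前" are valid candidates here, so adding "感恩节" to the dictionary does not by itself decide which one the segmenter picks.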

zlianzhuang (Collaborator) commented

Personally, I think long words like "感恩节前" are better handled in the application layer.
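
One possible reading of this suggestion, sketched in Python: after getting tokens from ts_parse, the application scans the raw text for phrases it cares about and adds any that the segmenter missed. The function name `add_phrase_tokens` and the phrase list are hypothetical and not part of zhparser or scws.

```python
def add_phrase_tokens(text, tokens, phrases):
    """Append each known phrase that occurs in `text` but is missing from `tokens`."""
    extra = [p for p in phrases if p in text and p not in tokens]
    return list(tokens) + extra


# Tokens copied from the ts_parse output for '感恩节前金价回落' shown earlier.
tokens = ["感恩", "节前", "金价", "回落"]

# Hypothetical application-level phrase list.
result = add_phrase_tokens("感恩节前金价回落", tokens, ["感恩节"])
print(result)
```

This keeps the segmenter's output untouched and only supplements it, so a search for "感恩节" can still match even when scws splits the text as "感恩 节前".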
