The perplexity operator gives very large values on Chinese datasets #878

### Before Asking

- [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully.

- [x] I have pulled the latest code from the main branch and run it again, and the problem still exists.


### Search before asking

- [x] I have searched the Data-Juicer [issues](https://github.com/alibaba/data-juicer/issues) and found no similar questions.


### Question

When processing Chinese datasets, the perplexity operator returns scores far larger than expected. Is this reasonable?

{"text":"支付完成","bad_type_opdev":"words_num","source_opdev":"badcase","language_opdev":"cn","score":null,"source":null,"dj__stats":{"alnum_ratio":1.0,"char_rep_ratio":0.0,"num_words":1,"perplexity":225308.5,"special_char_ratio":0.0,"text_len":4,"word_rep_ratio":0.0}}

In this example, the text is only "支付完成" ("payment completed"), yet the operator reports a perplexity of 225308.5. What is causing this?
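
For context, here is a minimal sketch of why a very short text can produce such a large number. It assumes the usual n-gram language model definition, perplexity = 10 ** (-mean log10 probability per scored position). Whether the Data-Juicer operator uses exactly this formula, and how it tokenizes Chinese text, are assumptions that this issue does not confirm. The per-token values `rare` and `common` are hypothetical and do not come from any real model.

```python
import math


def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities: 10 ** (-mean)."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))


# Hypothetical per-token log10 probabilities (illustrative only).
rare = -6.0    # a token the model has barely seen
common = -2.0  # a token the model predicts easily

short_text = [rare, common]         # 2 scored positions
long_text = [rare] + [common] * 19  # 20 scored positions

# One rare token dominates the average when there are few positions.
print(perplexity(short_text))
print(perplexity(long_text))

# Average log10 probability per position implied by the reported 225308.5,
# under the assumed definition above.
print(-math.log10(225308.5))
```

Under these assumptions, the reported value corresponds to an average log10 probability of roughly -5.35 per scored position, which is what one would expect if the model treats most of the tokens in the text as nearly unseen.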

### Additional

_No response_
