add Chinese support for the package:collapse #579

Open
anticmason opened this issue May 22, 2024 · 6 comments

Comments

@anticmason

Hi,
I'm a user of your package collapse, writing from China.
Recently, when I tried to use it to improve my work efficiency, I found that it doesn't support Chinese very well, especially when a file has Chinese headers or fields. This affects some very frequently used functions such as funique, fsubset, collag, roworder(v), fgroup_by, join, pivot, etc. I guess more functions like those listed above will also raise errors or return no result. Since I'm a heavy user of this package, would it be possible to fix this bug?
Moreover, could you please write a function to read or write xlsx/csv files that has an encoding parameter to choose from, such as 'utf-8', 'gbk', etc., like pandas's read_csv and read_excel? (The data.table package doesn't support 'gbk' for the encoding parameter when reading or writing.)
Thanks a lot!
Looking forward to your reply.

@SebKrantz
Owner

Hi, so in general, this package is UTF-8 only. I think supporting other character encodings would require checking the encoding of every string (since character vectors can be heterogeneous), which would really slow things down. I'm also really not sure where to start here and would possibly need help from people who understand more about Chinese and string encoding in C.
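To illustrate why the encoding matters for any function that hashes or compares strings by their bytes, here is a minimal Python sketch. It is not collapse's C code; the example text and the helper name `to_utf8` are made up for illustration. The same Chinese text has different byte sequences in UTF-8 and GBK, so a byte-wise comparison treats the two as distinct unless both are first normalized to one encoding.

```python
# Illustrative only: identical text yields different bytes under
# UTF-8 and GBK, which breaks byte-wise matching and hashing.

text = "中文"  # hypothetical column header

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Same characters, different byte representations.
same_bytes = utf8_bytes == gbk_bytes


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# After normalizing to UTF-8, the byte-wise comparison succeeds.
normalized_equal = to_utf8(gbk_bytes, "gbk") == utf8_bytes

print(len(utf8_bytes), len(gbk_bytes), same_bytes, normalized_equal)
```

The conversion step is where the cost comes from: it only works if the source encoding of each string is known, which is the per-string check described above.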

Regarding Excel, at the moment I don't plan to create file readers/writers. The package is already quite large.

@anticmason
Author

anticmason commented May 25, 2024 via email

@SebKrantz
Owner

SebKrantz commented May 25, 2024

Thanks. dplyr is written in R, so it won’t be of much help. I will need to look at C-based packages such as data.table. How does it handle Chinese text?

In general, I’m thinking it may very well be possible to go beyond UTF-8 in a performance-friendly way by assuming that string vectors are homogeneous, i.e. that all strings in a vector share the same encoding.
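A rough Python sketch of that trade-off, under loudly stated assumptions: the `TaggedString` model (raw bytes plus an encoding tag) and both function names are hypothetical and do not reflect how R or collapse actually store strings. It contrasts checking every element with inspecting only the first element and assuming the rest match.

```python
from typing import List, Tuple

# Hypothetical model: each string is raw bytes plus an encoding tag.
TaggedString = Tuple[bytes, str]


def normalize_per_element(vec: List[TaggedString]) -> List[bytes]:
    """Check the encoding of every element (safe, one check per string)."""
    return [raw if enc == "utf-8" else raw.decode(enc).encode("utf-8")
            for raw, enc in vec]


def normalize_homogeneous(vec: List[TaggedString]) -> List[bytes]:
    """Inspect only the first element and assume the rest share its encoding."""
    if not vec:
        return []
    enc = vec[0][1]
    if enc == "utf-8":
        # Fast path: nothing to convert and no further checks.
        return [raw for raw, _ in vec]
    return [raw.decode(enc).encode("utf-8") for raw, _ in vec]


words = ["北京", "上海", "北京"]
gbk_vec = [(w.encode("gbk"), "gbk") for w in words]

# On a homogeneous vector both strategies agree.
assert normalize_per_element(gbk_vec) == normalize_homogeneous(gbk_vec)

# Once normalized, duplicates can be found by comparing bytes.
unique_count = len(set(normalize_homogeneous(gbk_vec)))
print(unique_count)
```

The homogeneous version does a single check per vector instead of one per string, but it would decode incorrectly (or fail) on a vector that mixes encodings, which is the risk the assumption accepts.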

Could you perhaps provide a set of reproducible examples (using reprex::reprex()) of the different ways collapse currently fails? That would greatly help me test any internal improvements towards that end.

@anticmason
Author

anticmason commented May 25, 2024 via email

@SebKrantz
Owner

SebKrantz commented May 25, 2024

Thanks. Python packages are not that useful since R has its own C API. Going forward, it would be helpful if you could provide some reprex using simply your hand-typed Chinese characters (mock data frames) and demonstrating how collapse currently fails. Then I'll see what can be done.

@SebKrantz
Owner

There should be some progress on this in collapse 2.0.16, now on GitHub/R-universe. Perhaps you can try it and update me.
