Language capabilities #247

janEbert · 2025-01-08T14:00:56Z

Congrats on the amazing model release! I've been unable to find which languages you actually consider "supported" for the model, since the paper only mentions that you were "expanding multilingual coverage beyond English and Chinese" without giving further data. If you don't have a concrete answer to that question, maybe you can say which languages the model has been instruction-tuned on. If you could share the per-language sampling/mixture rates in the data distribution, that would also be very helpful.

I also noticed that the paper says you used the non-English part of the MMMLU dataset. The HF repo referenced in the paper does not contain an English part (I'm assuming you mean that the original English MMLU wasn't included), but it does contain a Chinese part.
I'm assuming that you left English out of this evaluation set so as not to skew the multilingual results, because English is one of the model's two strongest languages. Shouldn't Chinese have been filtered out as well, to avoid the same skew?

Thank you and good luck continuing the great work! :)
