Parsing the xP3 dataset #23
Comments
Q1: You can either rely on heuristics (for Arabic, I guess you can just look for the first Latin letters), or you can reprocess the dataset and separate the instructions.
Q2: You can remove them.
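As a rough illustration of both suggestions, here is a minimal Python sketch. It assumes each JSONL line has the "inputs" and "targets" fields shown in the question, and that the English prompt sits entirely before the first or after the last Arabic character. The function names (`split_instruction`, `is_degenerate`, `convert_lines`), the lowercase output keys, and the "no letters or digits" test for an empty target are all hypothetical choices for this example, not part of xP3. Templates that interleave English and Arabic, or articles that end with a Latin word, will be split incorrectly, so spot-check the results.

```python
import json
import re

# Basic Arabic Unicode block only (an assumption; extend the range if needed).
ARABIC = re.compile(r"[\u0600-\u06FF]")
LATIN = re.compile(r"[A-Za-z]")


def split_instruction(inputs):
    """Heuristically split `inputs` into (instruction, input_text)."""
    positions = [m.start() for m in ARABIC.finditer(inputs)]
    if not positions:
        # No Arabic at all: treat the whole string as the instruction.
        return inputs.strip(), ""
    first, last = positions[0], positions[-1]
    head, tail = inputs[:first], inputs[last + 1:]

    match = LATIN.search(tail)
    if match:
        # Instruction at the end: cut at the first Latin letter after
        # the last Arabic character.
        cut = last + 1 + match.start()
        return inputs[cut:].strip(), inputs[:cut].strip()
    if LATIN.search(head):
        # Instruction at the beginning: everything before the first
        # Arabic character.
        return head.strip(), inputs[first:].strip()
    # No Latin text at either end: no instruction found.
    return "", inputs.strip()


def is_degenerate(target):
    """True if the target has no letters or digits (e.g. just ".")."""
    return not any(ch.isalnum() for ch in target)


def convert_lines(lines):
    """Yield instruction/input/output records, skipping degenerate targets."""
    for line in lines:
        row = json.loads(line)
        if is_degenerate(row["targets"]):
            continue
        instruction, text = split_instruction(row["inputs"])
        yield {"instruction": instruction, "input": text, "output": row["targets"]}
```

To run it over a file, something like `records = list(convert_lines(open(path, encoding="utf-8")))` should work, where `path` points to one of the xP3 JSONL files.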
Q1: I am trying to extract the Arabic instructions from the xP3 dataset, and I want to put them in the format: “Instruction”, “Input”, and “Output”. Currently, the data is in this format: “inputs” and “targets”.
I found that the instruction is sometimes in the last part of the “inputs”, preceded by \n, and sometimes appears without any delimiter. In other cases, the instruction is at the beginning of the “inputs”, and so on.
Here is an example where the instruction is at the end of the input, but without any delimiter to recognize it.
File: xp3_GEM_xlsum_arabic_train_xp3longrest.jsonl
{"inputs":"...\nووسط هذه القلة يقف أيضا شقيقيها الفنان فيصل لعيبي، الذي أثر كثيرا في تطورها الفني وباتت تشكل معه ثنائيا فنيا مميزا، يجعل من أعمالهما في حوار دائم، فتحمل كثيرا من الوشائج والتشابهات الأسلوبية والشكلية لكنها تفترق في التوجه. ففي الوقت الذي يسعى فيصل إلى تأصيل فنه في قلب منجز الرسم العراقي بالتركيز على الخصوصية العراقية واللمسة المحلية والنهل من التراث الفني الرافديني في مراحله المختلفة وعكسه بلغة فنية حداثوية معاصرة، تسعى عفيفة إلى تمييز نفسها عنه بالتحليق في فضاء إنساني عام، مبعدة لوحاتها عن أي ... Write the rest of the article:","targets":"حوارا مع نظرتها ويركز عليها ليكتشف أنها تنظر في مكان آخر أو ربما في ماضٍ بعيد. \n\nوتعنى عفيفة باختيار
Q2: I found many incomplete inputs and outputs, for example records containing this string:
"... Continue the article for another 4000 characters max:","targets":"."}
What should we do in such cases?
Thanks
Hamdy Mubarak