We already have gen_cjk() and, per pull #63, might soon have gen_cyrillic().
If we wanted to support other scripts in the future (Tamil, Telugu, etc.), it's easy to see how this would become very cumbersome and duplicative, very quickly.
It might be good to have a generic function that takes any codepoint range, and then wrap it with a function specific to the Unicode block you want to test.
e.g., instead of
codepoints = [random.randint(0x4E00, 0x9FCC) for _ in range(length)]
try:
    # (undefined-variable) pylint:disable=E0602
    output = u''.join(unichr(codepoint) for codepoint in codepoints)
except NameError:
    output = u''.join(chr(codepoint) for codepoint in codepoints)
return _make_unicode(output)
...put this into a generate_unicode_range() function that takes the codepoint values as parameters, and then use that inside a function for any desired Unicode block, e.g. (see the sketch below):
gen_bengali(), gen_hebrew(), gen_hiragana()
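A minimal sketch of that shape, assuming Python 3 (the wrapper ranges are standard Unicode block boundaries; _make_unicode and the Python 2 unichr fallback from the snippet above are left out for brevity):

```python
import random

def generate_unicode_range(start, end, length=10):
    """Return `length` random characters drawn from the inclusive
    codepoint range [start, end]."""
    codepoints = [random.randint(start, end) for _ in range(length)]
    return u''.join(chr(codepoint) for codepoint in codepoints)

def gen_cjk(length=10):
    # Same CJK Unified Ideographs range as the snippet above.
    return generate_unicode_range(0x4E00, 0x9FCC, length)

def gen_hebrew(length=10):
    # Hebrew block: U+0590..U+05FF.
    return generate_unicode_range(0x0590, 0x05FF, length)
```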
Now, there is a sticky wicket in all this: some character sets span multiple, non-contiguous blocks. More details here: http://en.wikipedia.org/wiki/Unicode_block
So really, we should be able to pass all the desired blocks in as a Python list, and then either build a single range to rule them all, or simply choose a random character from one of the blocks in the list.
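For the simpler of those two options, the sketch above could be varied so the generic helper accepts the list of blocks directly and picks a random character from a randomly chosen block each time. A rough sketch, assuming a list of (start, end) tuples (note that picking blocks uniformly makes characters in smaller blocks more likely, which is probably fine for test data):

```python
import random

def generate_unicode_range(blocks, length=10):
    """`blocks` is a list of (start, end) codepoint tuples, so scripts
    spanning multiple, non-contiguous Unicode blocks are covered."""
    chars = []
    for _ in range(length):
        start, end = random.choice(blocks)      # pick a block at random
        chars.append(chr(random.randint(start, end)))
    return u''.join(chars)

# e.g. Cyrillic plus the Cyrillic Supplement block
def gen_cyrillic(length=10):
    return generate_unicode_range([(0x0400, 0x04FF), (0x0500, 0x052F)], length)
```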
Generate codepoints. The valid range of UTF-8 codepoints is
0x0-0x10FFFF, minus the following: 0xC0-0xC1, 0xF5-0xFF and
0xD800-0xDFFF. These 2061 invalid codepoints (2 + 11 + 2048) comprise
0.2% of 0x0-0x10FFFF. Thus, it should be OK to just check for invalid
codepoints and generate new ones if need be.
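A minimal sketch of that check-and-regenerate approach (the invalid ranges below are taken verbatim from the comment above; the helper name is only illustrative):

```python
import random

# Ranges listed above as invalid; everything else in 0x0-0x10FFFF is accepted.
_INVALID_RANGES = ((0xC0, 0xC1), (0xF5, 0xFF), (0xD800, 0xDFFF))

def _random_valid_codepoint():
    while True:
        codepoint = random.randint(0x0, 0x10FFFF)
        if not any(lo <= codepoint <= hi for lo, hi in _INVALID_RANGES):
            return codepoint
```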
I think adding an optional tuple parameter to gen_utf8 would be the best implementation. Then we could either remove the cjk and cyrillic functions or shrink them down so they just pass the correct tuple to gen_utf8.
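Something along these lines, where the parameter name and defaults are only illustrative:

```python
import random

# Surrogates are never valid codepoints; see the fuller invalid list above.
_SURROGATES = (0xD800, 0xDFFF)

def gen_utf8(length=10, allowed_ranges=((0x0, 0x10FFFF),)):
    """`allowed_ranges` is a hypothetical optional tuple of (start, end)
    codepoint pairs; by default the full range is used."""
    chars = []
    while len(chars) < length:
        start, end = random.choice(allowed_ranges)
        codepoint = random.randint(start, end)
        if _SURROGATES[0] <= codepoint <= _SURROGATES[1]:
            continue   # regenerate, per the check described above
        chars.append(chr(codepoint))
    return u''.join(chars)

# The block-specific helpers then shrink to one-liners:
def gen_cjk(length=10):
    return gen_utf8(length, allowed_ranges=((0x4E00, 0x9FCC),))

def gen_cyrillic(length=10):
    return gen_utf8(length, allowed_ranges=((0x0400, 0x04FF), (0x0500, 0x052F)))
```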
So really, we should be able to pass all the desired blocks in as a Python list, and then either build a single range to rule them all, or simply choose a random character from one of the blocks in the list.
Creating a list that contains every character in a given character set and pulling values out of it is not very streamy. We should find a way to generate random-ish characters without first building a list of tens or hundreds of thousands of characters and plucking characters out of it.
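One way to stay streamy is to never build the character list at all: pick a random offset into the combined size of the blocks and map it back to a codepoint. A sketch, assuming the same list-of-(start, end)-tuples representation as above (the function name is illustrative):

```python
import random

def random_char_from_blocks(blocks):
    """Pick one character, uniformly across every character in `blocks`,
    without materializing the full character list."""
    sizes = [end - start + 1 for start, end in blocks]
    offset = random.randrange(sum(sizes))   # index into the "virtual" list
    for (start, end), size in zip(blocks, sizes):
        if offset < size:
            return chr(start + offset)
        offset -= size

# e.g. ten Cyrillic characters, generated one at a time
cyrillic_blocks = [(0x0400, 0x04FF), (0x0500, 0x052F)]
sample = u''.join(random_char_from_blocks(cyrillic_blocks) for _ in range(10))
```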