Unicode issues in Linux and Unix when running the tests. #78

Open
dumol opened this issue Mar 2, 2017 · 4 comments

dumol commented Mar 2, 2017

This happens with scandir 1.5 and Python 2.7.11 on all our Linux, AIX, Solaris, FreeBSD and OpenBSD build slaves, but not on Windows or OS X / Mac OS.

Firstly, test_basic fails with the following error:

======================================================================
ERROR: test_basic (test_scandir.TestScandirC)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 301, in setUp
    TestMixin.setUp(self)
  File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 101, in setUp
    setup_main()
  File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 62, in setup_main
    os.mkdir(join(TEST_PATH, 'subdir', 'unidir\u018F'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u018f' in position 144: ordinal not in range(128)

Subsequently, most of the tests that follow fail with "No such file or directory" errors, e.g.:

======================================================================
ERROR: test_bytes (test_scandir.TestScandirC)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 301, in setUp
    TestMixin.setUp(self)
  File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 104, in setUp
    setup_symlinks()
  File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 74, in setup_symlinks
    os.mkdir(join(TEST_PATH, 'linkdir', 'linksubdir'))
OSError: [Errno 2] No such file or directory: '/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/testdir/linkdir/linksubdir'

======================================================================

Actually, this breaks all scandir tests except the following three:

test_traversal (test_walk.TestWalk) ... ok
test_symlink_to_directory (test_walk.TestWalkSymlink) ... ok
test_symlink_to_file (test_walk.TestWalkSymlink) ... ok

All excerpts are from an Ubuntu 16.04 build slave, but the errors are common across Linux distributions and Unix varieties and versions. However, OS X / Mac OS and Windows are not affected.

@adiroiban

Just a small note: the tests fail when executed in an environment with LANG=C. We use this environment to help detect the implicit encoding done by Python 2.7.

Windows and OS X provide Unicode APIs, so Python 2.7 does not need to do any conversion there.

The tests are passing on Linux with

$ echo $LANG
en_US.UTF-8
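
The error from the traceback can be reproduced without touching the filesystem. This is only a minimal sketch, assuming that under LANG=C Python 2.7 falls back to the ASCII codec when it implicitly encodes a Unicode path (that is the codec named in the error message). The variable names are illustrative and not taken from the test suite.

```python
# Directory name used by the failing test (U+018F is a non-ASCII letter).
dirname = u'unidir\u018F'

# Roughly what happens implicitly under LANG=C: the name is encoded as ASCII.
try:
    dirname.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    # Same exception type as in the test_basic traceback.
    ascii_ok = False

# An explicit UTF-8 encoding succeeds regardless of the locale.
utf8_name = dirname.encode('utf-8')
```

Here `ascii_ok` ends up False, while `utf8_name` holds the two-byte UTF-8 sequence for U+018F appended to `unidir`.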

@dumol
Author

dumol commented Mar 3, 2017

@adiroiban, thank you for the tip! Using UTF-8 locale settings does work, but setting $LANG was not enough; I had to set $LC_ALL.

However, even that is not always enough: the chosen UTF-8 locale also has to be available on that system, which is not always the case in the Linux / Unix world.
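
Whether a given locale is available can be probed from Python before relying on it. The helper below is a hypothetical sketch (`locale_available` is not part of scandir or its tests); it assumes that `locale.setlocale` raises `locale.Error` for a locale the system does not provide, which is how the standard library documents it.

```python
import locale

def locale_available(name):
    """Return True if `name` can be activated for LC_CTYPE on this system."""
    # Calling setlocale without a second argument only queries the current value.
    previous = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Always restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, previous)
```

For example, `locale_available('en_US.UTF-8')` could be checked on a build slave before exporting that value as LC_ALL.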

@benhoyt
Owner

benhoyt commented Sep 29, 2017

Does anyone know what the right fix for this is? Or is it not a problem with scandir at all, but just a matter of setting the LANG/LC_ALL environment variables?

@adiroiban

For the tests, the fix is to be explicit about the encoding and not let Python do the encoding/decoding for you.

So in this code, for example, don't pass Unicode to the Python low-level API, because the bytes that reach the filesystem then depend on whatever encoding Python picks implicitly:

os.mkdir(join(TEST_PATH, 'subdir', 'unidir\u018F'))

but instead be explicit and pass bytes that are already encoded:

path = join(TEST_PATH, 'subdir', 'unidir\u018F')
os.mkdir(path.encode('utf-8'))
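
A slightly fuller sketch of the same idea, run against a temporary directory rather than the real TEST_PATH. The helpers `to_bytes` and `make_unicode_dir` are hypothetical names, not part of the scandir test suite, and UTF-8 is an assumed choice of on-disk encoding.

```python
import os
import shutil
import tempfile

def to_bytes(path, encoding='utf-8'):
    """Return `path` as bytes, encoding it explicitly if it is text."""
    return path if isinstance(path, bytes) else path.encode(encoding)

def make_unicode_dir(parent, name, encoding='utf-8'):
    """Create parent/name from an explicitly encoded bytes path."""
    path = os.path.join(to_bytes(parent, encoding), to_bytes(name, encoding))
    os.mkdir(path)
    return path

root = tempfile.mkdtemp()
try:
    created = make_unicode_dir(root, u'unidir\u018F')
    # A bytes argument makes os.listdir return bytes, so no implicit decoding happens.
    listing = os.listdir(to_bytes(root))
finally:
    shutil.rmtree(root)
```

Because every path handed to `os` is already bytes, the result should not depend on LANG or LC_ALL.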

For general usage, I don't know if there is a fix.

scandir should not try to do anything smart with file names; it should just pass them along as bytes, without trying to decode them.

Linux / Unix filesystems store names as plain bytes, so a file name can hold whatever you want, in whatever encoding you want.

Things can get messy: the same folder can contain names encoded in UTF-8, ASCII and EBCDIC.
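
To illustrate why decoding is guesswork, the sketch below encodes one name in three encodings and then decodes the results with the wrong codec. The sample name and the choice of `cp500` (one of Python's EBCDIC codecs) are assumptions made for the example; nothing here touches the filesystem.

```python
# Hypothetical file name containing one non-ASCII character.
name = u'caf\u00e9'

# The same name as three different byte strings, as three programs might store it.
encoded = {codec: name.encode(codec) for codec in ('utf-8', 'latin-1', 'cp500')}

def try_decode(raw, codec):
    """Decode `raw` with `codec`, returning None if the bytes are invalid for it."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None

# Latin-1 bytes are not valid UTF-8, so this decode fails outright.
as_utf8 = try_decode(encoded['latin-1'], 'utf-8')

# UTF-8 bytes always decode as Latin-1, but to a different, wrong string.
as_latin1 = try_decode(encoded['utf-8'], 'latin-1')
```

One wrong guess fails loudly and the other succeeds silently with a garbled name, which is the argument for leaving names as bytes.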
