MDL crash/fails when markdown file contains UTF-8 characters #502
Comments
@RochaStratovan Can you update/request an update to MDL 0.13.0, released in October 2023? With this version in hand, both your |
Will do. Thank you. |
Hmmmmmm..... so I agree it doesn't happen for the README.md file I posted. I was also able to reproduce it within my environment with that file, and now with MDL.0.13.0 it passes. However, it is still failing with my full README.md file. I'm trying to figure out more to share with you. |
It seems like it's getting a UTF-8 failure on a different README.md file now. The problem no longer happens for that "small" example, but it's still happening on my larger files. I started the "minification" process again to find the problem.
Updated file:
Updated failure message:
rocha@e20c13008e8e:~/JRRTEST2$ mdl README_new.md
Traceback (most recent call last):
9: from /usr/local/bin/mdl:23:in `<main>'
8: from /usr/local/bin/mdl:23:in `load'
7: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/bin/mdl:10:in `<top (required)>'
6: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `run'
5: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:83:in `each'
4: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl.rb:91:in `block in run'
3: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new_from_file'
2: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:52:in `new'
1: from /var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `initialize'
/var/lib/gems/2.7.0/gems/mdl-0.13.0/lib/mdl/doc.rb:39:in `split': invalid byte sequence in UTF-8 (ArgumentError)
[Screenshot: updated GitLab rendered output, with what I think are the problem characters highlighted in yellow]
[Screenshot: what it looks like in a vi session, with yellow highlights again] |
@RochaStratovan I was able to replicate your findings.
the background of the story
The cause is the character encoding, i.e. the code page used by the operating system or the editor. Originally, there was 7-bit ASCII, which can store only 2^7 = 128 characters (some non-visible control characters, A-Z, a-z, 0-9, and a few special characters), enough for US American English. Because that does not cover other languages and scripts, Unicode encodings are the better way today. When working with contemporary Python, you may encounter lines like
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
records = []
with open("example.txt", mode="r", encoding="utf-8") as source:
    records = source.readlines()
which are explicit about the encoding of the Python script file itself (line 2) and/or of the file the script processes (the encoding argument of open).
The character in question here is the (R) / ®.
how to prevent this obstacle with files created in future
Check that the editor you use is set to UTF-8. From your screen photo, I presume Windows is one of the operating systems you use. In Notepad++ (project page, entry on portableapps) you can set this parameter here: [screenshot]
In case you prefer the cross-platform Geany (project page, entry on portableapps), go to Edit -> Preferences, tab Files: [screenshot]
These two are only examples; feel free to use the editor which suits your needs best. Equally, it might be worth double-checking (presuming you use git from Windows) whether your git installation is set up to use Linux line endings. (This is one of the questions on an early pane of the installer.)
how to resolve the current obstacle
You have to edit the files in question, which requires i) identifying them in the first place, and ii) adjusting the code page used for them. The following approach requires some basic Unix/Linux commands; if you don't have access to a Linux distribution such as Ubuntu, Debian, SUSE, or Fedora, you can equally resort to the minimal (Bash) shell provided, e.g., by TortoiseGit in its pull-down menu.
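Step i), locating the bytes that are not valid UTF-8, can also be sketched in Python rather than with shell commands. This is only an illustration and not part of mdl: the function name `find_invalid_utf8` and the sample text are made up. It relies on the fact that ® is the single byte 0xAE in ISO 8859-1, which is not a valid UTF-8 sequence on its own.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, int, bytes]]:
    """Return (line number, byte offset, offending byte) for each byte
    at which UTF-8 decoding fails. Hypothetical helper, for illustration."""
    problems = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            bad = start + err.start  # offset within the whole input
            line = data.count(b"\n", 0, bad) + 1
            problems.append((line, bad, data[bad:bad + 1]))
            start = bad + 1  # resume right after the offending byte
    return problems


if __name__ == "__main__":
    # A sample saved as ISO 8859-1: the ® becomes the lone byte 0xAE.
    latin1_sample = b"# Title\n" + "Brand\u00ae name\n".encode("iso-8859-1")
    print(find_invalid_utf8(latin1_sample))

    # The same text saved as UTF-8 produces no findings.
    utf8_sample = b"# Title\n" + "Brand\u00ae name\n".encode("utf-8")
    print(find_invalid_utf8(utf8_sample))
```

For a file on disk, the bytes could be obtained with `pathlib.Path("README_new.md").read_bytes()` and passed to the same function; the reported line numbers then point at the places to inspect in the editor.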
|
Hello @nbehrnd, Thank you for the detailed analysis and answer. I understand the problem, however, I don't agree with what I think you are proposing as the solution. I believe you are suggesting that in order to avoid/prevent MDL from crashing, we should modify the input tools. First, this isn't really a solution that scales. We have many developers that contribute to the documentation at our company and they use various tools such as:
just to name a few. Second, I would categorize this as an issue with MDL. It crashes on text files that standard text editors can handle. When my devs and I see this crash, it's an MDL error. I agree as a workaround they could scan the text file to find the symbols that MDL is crashing on, but that doesn't take away from the fact that this is an issue with the MDL parsing logic. MDL is a great tool. It just needs a few fixes such as this to be a bit more robust. |
It is true that I didn't test how various editors react if they normally
use one code page (e.g., ISO 8859-1) and now get an input file written
in another, for instance UTF-8. That is: after an intentional edit, will the
document be consistently saved with their usual ISO 8859-1, or with the UTF-8
code page?
On the other hand, presuming the code base were hosted on GitHub, I speculate
that changing the code page of files managed by git could
be automated by one of the CI workflows on offer, or by one you build and
tailor yourself: after local work, one would file the pull request to the repository;
prior to a merge, the automated workflow would i) determine the code page, and
ii) fix it if necessary -- no manual intervention required. Only
after successfully passing this automated step could the merge happen: either
after a manual / peer review by the code owner(s), or equally automated (with
an additional secret key to deposit) by this workflow setup.
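The two workflow steps, i) determine the code page and ii) fix it if necessary, might look roughly like this in a Python script run by a CI job. This is only a sketch: the names `to_utf8` and `normalize_file` are made up for illustration, and it assumes that any file which is not valid UTF-8 was written in ISO 8859-1. That is a guess and not a detection, because ISO 8859-1 accepts every byte value.

```python
from pathlib import Path


def to_utf8(data: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return data unchanged if it is valid UTF-8; otherwise re-encode it
    from the assumed fallback code page to UTF-8."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        # Assumption: the file was saved in the fallback code page.
        return data.decode(fallback).encode("utf-8")


def normalize_file(path: Path) -> bool:
    """Rewrite the file as UTF-8 if needed; return True if it was changed."""
    original = path.read_bytes()
    converted = to_utf8(original)
    if converted != original:
        path.write_bytes(converted)
        return True
    return False


if __name__ == "__main__":
    # ® as the single ISO 8859-1 byte 0xAE becomes the UTF-8 pair 0xC2 0xAE.
    print(to_utf8("Brand\u00ae".encode("iso-8859-1")))
```

A CI job could call `normalize_file` on every changed markdown file and either commit the result or fail the check when a file had to be converted, depending on how strict the workflow should be.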
Recently, I became aware of such a format checker as an automated action in
the avogadro2 project,[1] which can extend to test and build executables,
etc. too.[2] GitHub compiled information on how to use such an action,[3] which
may scale well enough for your work. But perhaps a «local GitHub workflow»
suits your needs better to adhere to local IP policy, and manage performance.
I lack the necessary insight how `markdownlint` itself could become more robust
to process markdown syntax regardless of the code page used.
[1]
https://github.com/OpenChemistry/avogadroapp/blob/master/.github/workflows/clang-format-check.yml
[2]
https://github.com/openmopac/mopac/blob/main/.github/workflows/CI.yaml
[3]
https://docs.github.com/en/actions/examples/using-scripts-to-test-your-code-on-a-runner
|
Description
Running mdl against a markdown file that contains UTF-8 characters causes it to fail/crash.
Environment
Ubuntu 20 Linux docker container running in GitLab pipeline.
MDL Version
0.12.0
Expected Behavior
It should process the UTF-8 characters/file without a problem.
Actual Behavior
It fails/crashes with the following error output
Replication Case
Run mdl against a file such as the following:
README.md
It renders fine as illustrated in the screen shot from GitLab.