Sync reserved characters proposal #9539
Comments
A nitpick, before I forget it. I'll see if I have more substantial comments when I have more time. Re URL encoding and Samba's Catia mapping encoders: FAT12/16/32 do support Unicode (as UTF-16, I suppose), so they should be able to handle the PUA encoder just fine. The Catia mapping also requires Unicode support. But the filesystem specifically using UTF-8 is not required for any encoder.
You may be right. I was going off of here, but it may be wrong. I will remove that claim.
Will update doc.
Sorry, I don't follow. Can you clarify?
I think I got lost in the levels of inception of re-encoding, but I think this should be handled exactly the same way as the "case insensitive fs" wrapper. I agree that you can end up with cases where you switch between the different wrappers leading to unexpected effects, i.e., files that were claimed to be with In the majority of cases the codec should be a no-op, and that's fine: switching it back and forth should have no effect, and will only matter for cases where you do have a genuine ":" in the paths, which should be very few cases. I guess the more interesting case that I don't see handled is where our encoding scheme clashes with files that already exist. Agreed, we can have a helper CLI utility that helps "decode" or "encode" things in place to allow you to convert.
I think taking a step back and defining the assumed invariants would be good before diving into details and an action plan.
Regarding point 3, I really like the idea of storing the encoding scheme with the data, under As to point 1, we do have some kind of encoders already in Syncthing: encrypted names on untrusted devices (not easily reversible) and the Unicode normalization code (also not reversible if the previous name was not normalized). Looking at those might give some hints regarding the invariance questions. Integrating that functionality with the proposed encoding stuff is probably too far-fetched though. Thinking one step further, I could imagine even more radical encoders emerging, such as the mentioned base64 encoding. That might prove useful to implement further filesystem types in Syncthing, e.g. to add object stores. But then it needs to be clear whether this encoding machinery works with only a (non-reversible) hash function. Again, laying down these invariants / requirements for encoding schemes will help set the boundaries for designing the basic encoders we actually need in the first step.
s/UTF-8/unicode/. It doesn't matter if a filesystem uses UTF-8; it needs to support Unicode, and any Unicode encoding will do.
The base filesystems only support 8.3 length non-Unicode filenames, but Windows uses an extension to also store longer Unicode filenames as an add-on.
Thank you for writing this up, it's an excellent summary of the problem, your proposed solutions, and the potential issues. ❤️ For me, however, it also illustrates quite clearly why I'm disinclined to accept the proposal (and the corresponding PR). In a nutshell, the problem ("I want to sync filenames containing reserved/unsupported/special characters on any filesystem") is fairly easily avoided and/or corrected when it surfaces. The proposed solutions, however, are complicated and error-prone, and the results of mistakes and misconfigurations are much harder to reason about and fix than the original problem. In my mind this makes the cost higher than the benefit.
Some small points to start with ...
Oh and as a counterpoint. The requirement seems to be that a particular host has all the files created on every other peer, irrespective of the name it might be given here to overcome any local limitation. This is presumably useful for things like backup servers. In that case a translation like the previously mentioned base64 would be acceptable, BUT might still hit a filename length limitation. Taking a secure hash (MD5, SHA1 etc) of the pathname would give a name with four or five 8 character sections for any original filename, which is (basically) guaranteed to be unique. A small database containing a list of all the paths would be required to know what filenames are stored on the local FS. Working with the filesystem would be mostly trivial, but there would be no method of migrating to or from this scheme except for adding another peer to the swarm. Though individual pathnames can be translated using simple tools like Personally I'm more likely to make the backup server a Linux box.
Hi @rasa, I finally got around to writing a full reply to your proposal.
Use cases
Judging by the other comments, the use cases/purpose needs fleshing out. My personal use case is the following: I use Linux as desktop, and so I have some of my personal documents using Windows reserved characters in their filenames. I also have some documentation downloaded for local use from a website using For my use case I want to be able to view/edit the files with existing Android apps, so proposals such as base64-encoding or storing a filename hash don't cut it for me. In that case I would not be able to identify the file when browsing through the files in e.g. an Android file manager or any other app that is not Syncthing. Using the Unicode PUA works, as I only occasionally have a reserved character in my file names and I can still identify them from the rest of the file name. That is why I proposed including the Samba Catia mapping, in which all characters are still identifiable without using any special software.
Other use cases could be backing up your personal files to a Windows server, or using multiple computers with mixed operating systems where Mac or Linux is your main OS, or when handling files from a WSL/Cygwin/etc environment on Windows. Other use cases could be when you are not syncing your own personal files, but a file set over which you have no direct control and for which you thus can't just change the file names. However some real use cases are probably more compelling than what I can think up.
Restricting file names
I see that, compared to my previous proposal, you've not adopted the part of configuring certain file name characters as disallowed for a folder. I'm totally fine with that; Syncthing cannot actually control what users put into their synced folders anyway. It was primarily a way to surface something similar to the existing proposal in a way that would be easier to understand.
The Default encoder
I would suggest renaming it to "None".
A setting re-encoding 'inception'
As others have also mentioned, the whole part on re-encoding and re-re-encoding is overly complicated. First, the encoder doesn’t know the meaning of the file names it receives; it only knows that it sees some characters incoming that are also in its encoding codomain. The question then is how to handle that. I think it would be clearer to rephrase the section in terms of incoming characters instead of as “re-(re-)*encoding”.
IMO there are two sane ways to handle it: don’t, or escape. Adding a separate encoder for different re-encoding is IMO way too much complexity for questionable gain, so the FAT encoder should just implement one of these options.
If the encoder does not handle incoming encoding target characters, the encoder should reject the file, which should lead to a synchronization failure, just like currently already happens with file names with reserved characters. However encoder target characters, whether PUA or the Samba Catia mapping, should be a lot less common than e.g. If the encoder escapes target characters, it would prefix such codomain characters with an escape character such as For this issue it is also worth finding out what WSL/Cygwin/MSYS2/CIFS do when they encounter the PUA target characters.
There are other ways to store files that sidestep the whole filename character issue, such as base64-encoding them or storing a base64-d hash of the filename and a separate file with the real filenames. But with such solutions it is no longer practical to edit files on the encoded side, and the feature set that Syncthing would offer in such a case would be (practically speaking) one-way backup instead of two-way synchronization, which is Syncthing’s unique selling point.
I have a slight preference for doing escaping, but I’m also fine with not handling re-encoding. Especially if that, being simpler, contributes to the proposal being accepted.
Switching encoder settings
A lot of the potential problems come from changing encoder settings. There’s one simple solution for (most of) this problem: don’t allow changing the encoding setting. Syncthing already does this by not allowing the folder path and ID to be changed. Even though changing the path shouldn’t be such a big problem (as far as I know). Users can still change the encoding by editing the There is still an issue with downgrading Syncthing after creating a folder with the FAT encoding, but I don’t think that is worth bothering about a lot. It seems like a quite rare situation, there won’t be any data loss, all that happens is files can be duplicated or their names messed up, and it can be fixed by a script or CLI tool.
Of course it is also possible to handle this in the GUI. That would certainly be more user friendly, but I’m not sure if it is worth the complexity. In that case Syncthing should handle it as proposed in Potential Issues 5. If all files can be renamed automatically (or don’t need renaming), Syncthing can just do the rename. Otherwise, it should ask the user and affected files would need to be deleted and un-synced.
Technical scope of the encoder framework
Judging by some of the discussion, the proposal should probably clarify that the encoding is something that happens purely locally. File names sent over the wire in the Syncthing protocol are always the unencoded names, and nodes don’t know about each other if they use any kind of encoder when storing their files.
Proposal document structure
The proposal document is a bit too complicated, in my opinion. And that probably contributes to the issue appearing more complex to readers than it actually is. Specifically, the description of what the encoders actually do is spread around the document, under the headings “The encoders”, “Other encoders”, and “Possible encoding methods”.
I think the main proposal (including the PUA encoding) should be at the beginning of the document, so readers not already familiar with the existing discussion have a clear view of what the (current) proposal constitutes.
Also, you divide the work into several “phases”, but you don’t specify that you’re doing that or what these phases are before referring to them. However I think the notion of defining separate phases should be dropped altogether. We only need two “phases”: what will be included in the current proposal and (assuming it gets accepted) pull request, and Future Extensions, i.e. everything that can be implemented as later enhancements. The goal is to limit the scope and complexity of the current proposal, both in the amount of code that needs to be written and, more importantly, in the number of issues to be discussed and that can be disagreed about.
IMO the proposal should correspond mostly to what is now phase 1 using the PUA encoder, and probably not allowing changing the encoder from the GUI. For the choices of changing encoder and handling filenames that are already encoded, these are the simplest options, and other options can be implemented in the future. These other enhancements should still be mentioned under future enhancements, of course.
As a reference it might be helpful to see the list of sections a Python Enhancement Proposal (PEP) should include. Not all of them apply to Syncthing, but it is still a helpful and thought-out structure.
@JanKanis I completely rewrote the proposal, using the PEP layout you referenced. Let me know if I captured all your feedback, or if further simplification is needed. Clearly, my initial draft was way too complicated to be accepted. Lesson learned! Thanks again for the thoughtful and detailed feedback!
I'd like to do that, but I cannot add any emoticon response/vote to it, I guess because the issue is locked. Also after authenticating to roadmap.syncthing.net, voting on the issue there doesn't do anything, presumably also because the issue is locked.
I added a number of comments to the Google doc version on the document structure. One other question: you haven't commented on or adopted what I proposed w.r.t. changing the encoder setting from the GUI (disallowing it, or if allowed, making sure all files are renamed). What do you think about that?
@JanKanis It's a good idea, but I don't think we can make a field read-only on the Actions > Advanced > Folder page. And that page already has a big red message "Be careful! Incorrect configuration may damage your folder contents and render Syncthing inoperable." so the user's been warned enough. And it doesn't appear that page restricts the user from editing any field, including the folderID, so adding logic to change some fields to read-only may defeat the purpose of this page (which is to change any field, no matter how disastrous the change would be, like changing 'Filesystem Type' to But if we ever add the setting to the folder's setting page, we should make the field read-only if it's not None. I purposely left this idea out of the proposal, as, IMO, changing the encoder is an "Advanced" feature, such as changing the "Case Sensitive FS", "Junctions As Dirs", or "modTimeWindowS" settings.
Ah, I wasn't aware of that advanced configuration page. That's basically equivalent to directly editing the configuration file, so I agree with you then. Does that mean this option will be an advanced-configuration-only feature, or do you still want to show the option in the regular UI when creating a new folder?
@JanKanis IMO, yes, as long as we have the potential for duplicate files, I think we need to hide the option from the user.
I don't think all the details around config handling need to be defined right out of the gate, but my gut feeling is that this would be the new default on Windows and Android, editable at folder creation time, and otherwise handled pretty much like the folder path -- not easily editable, with some FAQ or doc article explaining the situation and what to do to change it safely.
See syncthing#9539 for more details.
fixes syncthing#1734 fixes syncthing#9539 See syncthing#9539 for details.
Sync reserved characters proposal v2.1
1. Preamble
This proposal is authored by @rasa and @JanKanis, and was inspired by JanKanis' comments here. It was last updated on 03-Jun-24. Feedback appreciated. An editable copy is here.
2. Abstract
Syncthing will report "Out of Sync" errors on peers where the underlying filesystem does not allow certain filenames that are allowed on other peers. This proposal addresses this issue. On https://roadmap.syncthing.net/, the issue is tied for 32nd, but was locked over five years ago, so it can't be voted on any more. If you value this proposal, click the thumbs up icon on this issue instead.
3. Motivation
As a user, I want to sync filenames containing reserved/unsupported/special characters on any filesystem. Specifically, I want to sync filenames containing `"*:<>?|` characters on NTFS/exFAT/FAT32 filesystems, which disallow these characters in filenames. For more information, see https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations
Use cases
@JanKanis' comment below documents several use cases.
4. Rationale
We chose to encode filenames using Unicode's Private Use Area characters (explained below) because this encoding method is effectively an industry standard: it is how Git Bash, Windows Subsystem for Linux (WSL), Cygwin[1], MSYS, Linux's CIFS driver, and other platforms encode filenames, and it was first implemented in 1996.
5. Specification
Each folder will be configured to use an encoder. Initially, there will be two encoders: the "None encoder" and the "FAT encoder".
All existing folders will start out "using" the None encoder. The None "encoder" isn't really an encoder. It's the way Syncthing works right now. In fact, no new code will be executed when a folder is configured to "use" the None encoder.
Newly created folders will default to using the None encoder as well.
The user can change a folder's encoder setting via the GUI, but only via Actions > Advanced > Folders > Folder. If possible, when the user clicks "Save", a dialog box will pop up that explains the potential pitfalls, and asks for a further confirmation.
The None encoder
The None encoder, as described above, is not really an encoder, as it reads and writes filenames on disk "as is," without any encoding. It does not reject or ignore any filenames it receives. It is designed to be used on filesystems that allow all characters except `/` and `NUL`, but it can be used on any filesystem. If it's used on a FAT-based filesystem, filenames a peer receives containing reserved characters can't be written to disk, leading to out-of-sync errors.
The FAT encoder
The FAT encoder is designed to be used on filesystems that disallow the characters `\"*:<>?|` in filenames. When filenames with these characters are written to disk, the FAT encoder encodes the filename in a format that the filesystem will accept. When read from disk, the filename is decoded to its original filename, before being sent to the other peers. The FAT encoder can be used on any filesystem, but there is no reason to run it on a non-FAT filesystem.
To clarify, encoding is something that happens purely locally. File names sent over the wire in the Syncthing protocol always use the original pre-encoded names, and peers don’t know if another peer is using any sort of encoder when storing their files.
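The encode-on-write, decode-on-read round trip can be sketched as follows. This is only an illustration, not the proposed implementation: it assumes the Private Use Area mapping specified in the next subsection (adding 0xF000 to a reserved character's code point), and the names `encodeName`, `decodeName`, `reserved`, and `puaBase` are hypothetical. Decoding only the PUA characters that correspond to reserved characters is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// reserved lists the characters the FAT encoder replaces, as given in the proposal.
const reserved = "\\\"*:<>?|"

// puaBase is the offset added to a reserved character's code point (assumed from the spec).
const puaBase = 0xF000

// encodeName maps every reserved character into the Private Use Area.
// Names without reserved characters are returned unchanged.
func encodeName(name string) string {
	return strings.Map(func(r rune) rune {
		if strings.ContainsRune(reserved, r) {
			return r + puaBase
		}
		return r
	}, name)
}

// decodeName reverses encodeName. Only PUA characters that correspond to a
// reserved character are mapped back (an assumption of this sketch).
func decodeName(name string) string {
	return strings.Map(func(r rune) rune {
		if r >= puaBase && r <= puaBase+0xFF && strings.ContainsRune(reserved, r-puaBase) {
			return r - puaBase
		}
		return r
	}, name)
}

func main() {
	wire := "acolon:"         // name as sent over the wire
	disk := encodeName(wire)  // name as stored on a FAT-like filesystem
	back := decodeName(disk)  // name as reported to other peers
	fmt.Printf("%q -> %+q -> %q\n", wire, disk, back)
}
```

Because the mapping is applied only at the filesystem boundary, the name handed back to the rest of the program is always the original one, which matches the statement that peers never see encoded names.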
Unicode Private Use Area (PUA) characters
The FAT encoder will replace reserved characters with Unicode Private Use characters (`\xf000`-`\xf0ff`). A character will be replaced by adding `\xf000` to its code point, so for example a `?` (code point `\x003f`) is replaced by `\xf03f`. This requires an underlying filesystem that allows Unicode characters in filenames, such as NTFS, exFAT, or VFAT.
6. Backwards Compatibility
Since all folders, both existing and newly created ones, will default to using the None encoder, there are no backward compatibility issues. From the user's perspective nothing changes, and encoding-aware peers can communicate with non-encoding-aware peers without any issues.
A user can even downgrade a peer from an encoding-aware build to a non-encoding-aware build without issue.
The only issue that can manifest is if all of the following occur:
First, we'll describe the problem in detail, and then the proposed solution.
The problem
We have two peers: N and F. Both use the None encoder. Peer F's filesystem is FAT, and so it had an out-of-sync error when it received a file named `acolon:`. Peer F switched its folder's encoder to FAT, which now can save `acolon:` as `acolon\xf03a`, and the out-of-sync error goes away.
Now, peer F switched the folder's encoder from FAT, back to None. The None encoder on peer F will find the file `acolon\xf03a` on disk and sync this file to peer N, which will see it as a new file, and save it. Peer N now has two files named `acolon:` and `acolon\xf03a`, which are effectively the same file.
Peer N will then sync these files with peer F. Peer F will still accept `acolon\xf03a`, but will reject `acolon:` as it has a reserved character, leading to an out-of-sync issue.
Proposed solution
A separate CLI program is run on any peer where the folder is not on a FAT filesystem. Using the example above, the program is run on peer N. It searches for files where encoded files (`acolon\xf03a`) coexist with their pre-encoded equivalents (`acolon:`). If a pair is found, it will see if the files are the same. If they are, it will delete the encoded version (`acolon\xf03a`).
If the two files differ, it will display the two filenames, timestamps, sizes, and attributes to the user, and ask them to choose:
1. Keep `acolon:` only (by deleting `acolon\xf03a`)
2. Keep `acolon\xf03a` only (by renaming `acolon\xf03a` to `acolon:`)
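The search step of such a program could look roughly like the sketch below. It works on a plain list of names from one directory and only detects coexisting pairs; comparing file contents and prompting the user are left out. The names `findPairs`, `pair`, and `decodeName` are hypothetical, and the decoding rule (subtract 0xF000 from PUA characters that map to a reserved character) is assumed from the specification above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const reserved = "\\\"*:<>?|" // reserved characters from the proposal
const puaBase = 0xF000        // assumed PUA offset from the specification

// decodeName maps PUA-encoded reserved characters back to their originals.
func decodeName(name string) string {
	return strings.Map(func(r rune) rune {
		if r >= puaBase && r <= puaBase+0xFF && strings.ContainsRune(reserved, r-puaBase) {
			return r - puaBase
		}
		return r
	}, name)
}

// pair records an encoded name that coexists with its pre-encoded equivalent.
type pair struct {
	Decoded string
	Encoded string
}

// findPairs returns every encoded name whose decoded form is also present.
func findPairs(names []string) []pair {
	present := make(map[string]bool, len(names))
	for _, n := range names {
		present[n] = true
	}
	var pairs []pair
	for _, n := range names {
		d := decodeName(n)
		if d != n && present[d] {
			pairs = append(pairs, pair{Decoded: d, Encoded: n})
		}
	}
	// Sort for stable, reproducible output.
	sort.Slice(pairs, func(i, j int) bool { return pairs[i].Encoded < pairs[j].Encoded })
	return pairs
}

func main() {
	names := []string{"acolon:", "acolon\uf03a", "notes.txt", "only\uf03f"}
	for _, p := range findPairs(names) {
		fmt.Printf("duplicate: %q and %+q\n", p.Decoded, p.Encoded)
	}
}
```

An encoded name without a pre-encoded sibling (`only\uf03f` in the example) is not reported, since the proposal only treats coexisting pairs as a conflict.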
Option 1 - Keep `acolon:`
Peer N syncs the delete of `acolon\xf03a` with the other peers. None peers will process the delete. FAT peers will silently ignore the delete, as they ignore all encoded filenames they receive on the wire.
Option 2 - Keep `acolon\xf03a`
Syncthing sees this rename of `acolon\xf03a` to `acolon:` as deleting `acolon\xf03a` and updating `acolon:`. None peers will process both the delete and update. FAT peers will ignore the delete, and update `acolon:`, by encoding the filename as `acolon\xf03a`.
Automating the process
The following startup options would automate the above selection process:
`--decoded` - always select the pre-encoded filename (choice 1. above)
`--newer` - always select the newer of the two files
`--encoded` - always select the encoded filename (choice 2. above)
`--older` - always select the older of the two files
The program will not back up files before deleting them. If a user wants backups, they should turn on versioning on a None peer, before running the program.
Which option is most likely to be the right one?
Option 1, "Keep `acolon:`", will almost always be the best choice. Why? Because pre-encoded filenames almost always originated on non-FAT peers, as users cannot generally create these filenames on FAT peers. The most likely way a user on a FAT peer created an encoded filename themselves is if they created the file via a CLI environment, such as Git Bash, Cygwin, MSYS2, WSL, etc. So, since they most likely didn't author the file, it's less likely that a FAT peer will be the last one updating it.
7. How to Teach This
The documentation will explain the benefits and drawbacks of changing a folder's encoder.
8. Reference Implementation
@rasa has volunteered to draft a PR with full unit tests if this proposal is accepted. Integration tests will also be provided using the new framework provided in #9266. @rasa will also draft a PR for the documentation needed.
9. Alternatives
Other encoding methods that could be implemented
URL-encoded
This encoding replaces reserved characters with their URL-encoded equivalent. See https://en.m.wikipedia.org/wiki/Percent-encoding. This would be a good choice on filesystems that don't support Unicode characters. Proposed by @AudriusButkevicius.
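A rough sketch of this alternative is shown below, purely to illustrate the idea. The function names are hypothetical. The sketch also escapes `%` itself so that decoding is unambiguous, which is an assumption the proposal does not spell out. Decoding uses `url.PathUnescape` from Go's standard library.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// reserved lists the characters the proposal treats as reserved on FAT-like filesystems.
const reserved = "\\\"*:<>?|"

// encodePercent replaces each reserved character, and '%' itself, with %XX.
// All of these are ASCII, so scanning byte by byte leaves multi-byte UTF-8 intact.
func encodePercent(name string) string {
	var b strings.Builder
	for i := 0; i < len(name); i++ {
		c := name[i]
		if c == '%' || strings.IndexByte(reserved, c) >= 0 {
			fmt.Fprintf(&b, "%%%02X", c)
		} else {
			b.WriteByte(c)
		}
	}
	return b.String()
}

// decodePercent reverses encodePercent; it fails on malformed escapes.
func decodePercent(name string) (string, error) {
	return url.PathUnescape(name)
}

func main() {
	disk := encodePercent("100%:done?")
	wire, err := decodePercent(disk)
	fmt.Println(disk, wire, err)
}
```

Unlike the PUA encoding, the result is pure ASCII wherever the original name was, at the cost of longer filenames and of having to escape a character (`%`) that is otherwise legal.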
Samba's Catia mapping
This encoding replaces reserved characters using the mapping `"→¨ *→¤ /→ø :→÷ <→« >→» ?→¿ \→ÿ |→¦`. This would be a good choice if the user wants to encode to more visually related characters. See https://www.samba.org/samba/docs/current/man-html/vfs_catia.8.html. Proposed by @JanKanis.
10. Open Issues
None that we are aware of, but here's a good place to list a potential future enhancement:
Warning the user the encoder was changed
Due to the duplicate file issue noted above, we may want to alert the user whenever a folder's encoder is changed from FAT back to None. To do this, we can update `.stfolder/syncthing-folder-xxxxxx.txt` (See #9525), with either `Encoder: None` or `Encoder: FAT`, if the entry is missing.
Then whenever Syncthing starts up, if the encoder in the `.stfolder` file listed FAT, but `config.xml` lists None, a warning is shown in the GUI. The user can select "Revert", "Accept" or "Ignore". If they select "Revert", the encoder setting is changed back to FAT in `config.xml`. If they select "Accept", the `.stfolder` file is updated to contain `Encoder: None`. If they select "Ignore", the message goes away, until Syncthing restarts.
We could also provide CLI users with these options:
`--report-on-encoder-changes`: if the encoder was switched from FAT to None, scan the filesystem, and if there are duplicate files, log the duplicates, and continue
`--abort-on-encoder-changes`: do the above, but quit instead
`--accept-encoder-changes`: Update the `.stfolder` file with `Encoder: None`
`--revert-encoder-changes`: Switch the encoder back to FAT in the `config.xml` file
If no option is provided, a warning about the encoder change is logged.
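The startup check itself could be as small as the sketch below. It assumes the marker file holds a line of the form `Encoder: FAT` or `Encoder: None`, as described above; everything else about the marker file's contents, and the names `markerEncoder` and `encoderReverted`, are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// markerEncoder extracts the value of the "Encoder:" line from the marker
// file's contents. The second result is false if no such line exists.
func markerEncoder(contents string) (string, bool) {
	for _, line := range strings.Split(contents, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "Encoder:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "Encoder:")), true
		}
	}
	return "", false
}

// encoderReverted reports whether the marker file recorded FAT while the
// configuration now says None, which is the case that should trigger a warning.
func encoderReverted(markerContents, configEncoder string) bool {
	enc, ok := markerEncoder(markerContents)
	return ok && enc == "FAT" && configEncoder == "None"
}

func main() {
	marker := "Encoder: FAT\n" // hypothetical marker file contents
	if encoderReverted(marker, "None") {
		fmt.Println("warning: folder encoder was changed from FAT to None")
	}
}
```

A missing `Encoder:` entry is treated as "nothing to warn about" here, matching the proposal's note that the entry is only written if it is missing.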
11. Footnotes
For reference, see:
https://cygwin.com/cygwin-ug-net/using-specialnames.html
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx
https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file
https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations
https://amigotechnotes.wordpress.com/2015/04/02/invalid-characters-in-file-names/
For implementations, see
https://github.com/mirror/newlib-cygwin/blob/fb01286fab9b370c86323f84a46285cfbebfe4ff/winsup/cygwin/path.cc#L435
https://github.com/billziss-gh/winfsp/blob/6e3a8f70b2bd958960012447544d492fc6a2f1af/src/shared/ku/posix.c#L1250
https://github.com/torvalds/linux/blob/master/fs/cifs/cifs_unicode.h#L27
[1] https://cygwin.com/cygwin-ug-net/using-specialnames.html