Lexical issues #5

Open
ikreymer opened this issue Oct 8, 2015 · 4 comments
Comments

@ikreymer

ikreymer commented Oct 8, 2015

When parsing CDXJ/ORS, we need to ensure there is no ambiguity about where the key ends.

Ambiguities can occur if there is a { anywhere in the key.

For CDXJ, this is usually avoided because keys are usually URL-encoded and URLs contain no spaces.
But should this be a requirement? Or should spaces and { be escaped instead?

For ORS, there is of course the general case of multiple JSON dicts, with other nested JSON dicts.

{"foo": "bar"} {"boo": "baz", "foo2": {"a": {"c": "d"}}} {"key": "value", "key2": {"a": "b"}}

Since the value must be a valid JSON dict, it would have to be:
value - {"key": "value", "key2": {"a": "b"}}
key - {"foo": "bar"} {"boo": "baz", "foo2": {"a": {"c": "d"}}}

Could get tricky if this is to be supported with a more generic key, though I guess escaping enforcement should help...
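To make the "value must be a valid JSON dict" reading concrete, here is a rough Python sketch. It is only an illustration, not part of either format: `candidate_splits` is a made-up helper name, and the sample line is a shortened version of the one above. It tries every `{` as a possible start of the value and keeps the splits where the remainder of the line parses as exactly one JSON dict.

```python
import json

def candidate_splits(line):
    """Return every (key, value) split where the value is one complete JSON dict."""
    splits = []
    for i, ch in enumerate(line):
        if ch != "{":
            continue
        try:
            value = json.loads(line[i:])
        except json.JSONDecodeError:
            # Either not JSON at all, or JSON followed by extra data.
            continue
        if isinstance(value, dict):
            splits.append((line[:i].rstrip(), line[i:]))
    return splits

# Hypothetical sample: a dict-shaped key followed by the real value.
line = '{"foo": "bar"} {"key": "value", "key2": {"a": "b"}}'
for key, value in candidate_splits(line):
    print(repr(key), "->", value)
```

Note that this has to attempt a parse at each `{`, whereas a "first `{` starts the value" rule would pick the wrong position for this line, which is why escaping in the key matters.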

@ibnesayeed
Member

This escaping issue has already been covered briefly in the blog post as follows:

Since the opening square and curly brackets indicate the start of the JSON block, hence it is necessary to escape them (as well as the escape and double quote characters) if they appear in the keys, and optionally their closing pairs should also be escaped.

Let me reiterate: neither ORS nor CDXJ supports objects as keys or multiple objects as values. The prefix key is optional and, if present, consists of one or more string tokens, quoted or unquoted. The value portion is exactly one JSON block per line. The value block can be a JSON object or a JSON array, with arbitrary nesting. It can be empty JSON, but it cannot be blank/nil. I hope this resolves all the concerns raised here.
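The rules above could be sketched in Python roughly as follows. This is a minimal illustration under assumptions, not a reference parser: `split_line` is a hypothetical name, the backslash is assumed to be the escape character, quoted key tokens get no special handling, and the sample lines are invented.

```python
import json

def split_line(line, escape="\\"):
    """Split one line into (key, value) at the first unescaped '{' or '['."""
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape:
            i += 2  # skip the escape character and the character it protects
            continue
        if ch in "{[":
            # json.loads raises ValueError unless the rest is one JSON block.
            value = json.loads(line[i:])
            return line[:i].strip(), value
        i += 1
    raise ValueError("no JSON block found; the value cannot be blank")

print(split_line('com,example)/ 20151008 {"status": "200"}'))  # key tokens + object
print(split_line('{}'))                                        # no key, empty JSON
print(split_line('a\\{b [1, 2]'))                              # escaped { in key, array value
```

The returned key still contains the escape characters; unescaping and tokenizing the key would be a separate step.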

@ikreymer
Author

ikreymer commented Oct 8, 2015

Yes, I think so. I bring this up with the default MRJob tab-delimited {...}\t{...} format in mind.

At first glance, that format looks like a compatible subset of ORS, with the key also being a JSON dict. But if escaping { is part of the requirement, it is not compatible, since the first JSON dict is not escaped in the MRJob format.
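A small Python check of that point, as a sketch only: the sample line is made up in the {...}\t{...} shape, and the "first unescaped { starts the value" rule is taken from the escaping requirement discussed above.

```python
import json

# Made-up sample in the tab-delimited {...}\t{...} shape.
mrjob_line = '{"word": "foo"}\t{"count": 3}'

# Under the escaping rule, the first unescaped "{" starts the value block,
# so everything from there on would have to be a single JSON document.
start = mrjob_line.index("{")
try:
    json.loads(mrjob_line[start:])
    compatible = True
except json.JSONDecodeError as err:
    compatible = False
    print("not a single JSON block:", err.msg)

# Splitting on the tab instead recovers both dicts, which is the MRJob reading.
key, value = (json.loads(part) for part in mrjob_line.split("\t"))
print(compatible, key, value)
```

So the same line is readable as two tab-separated JSON dicts, but not as an optional key followed by exactly one JSON block.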

@ibnesayeed
Member

Additionally, if a data key begins with an @ sign, the key should be quoted.

@ibnesayeed
Member

I am not sure about the reason why MRJob has an object for the key instead of a basic data type in the tuple, but in the current format it is not compatible with ORS. I have expressed my thoughts around it in the email.
