v7.0.0 #110

Open
wants to merge 10 commits into
base: master

TkTech
Owner

@TkTech TkTech commented Sep 3, 2023

This is a major breaking change release that removes Array and Object proxies. However, after checking all GitHub repos that have this one as a dependency with > 5 stars, only 2 were using these features. They were generally an anti-pattern - if you needed 1 value, use at_pointer() instead. If you needed more than 1 value, it was almost always faster to use at_pointer() for an entire object at once. This new approach also alleviates memory management issues on PyPy.

If all you used was simdjson.loads() and simdjson.parse(), you should notice no difference.
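As a rough illustration of what a pointer lookup does, here is a minimal pure-Python sketch of RFC 6901 JSON Pointer resolution. The `resolve_pointer` name is hypothetical and this is not pysimdjson's implementation: it walks data that the standard library has already turned into Python objects, whereas the point of at_pointer() is to navigate the parsed document before any Python objects are built.

```python
import json


def resolve_pointer(document, pointer: str):
    """Resolve an RFC 6901 JSON Pointer against already-parsed Python data."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = document
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: "~1" becomes "/" first, then "~0" becomes "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]  # array elements are addressed by index
        else:
            node = node[token]       # object members are addressed by key
    return node


data = json.loads('{"a": {"b": [10, 20, 30]}, "x/y": 1}')
single_value = resolve_pointer(data, "/a/b/1")  # one value
whole_object = resolve_pointer(data, "/a")      # an entire sub-object at once
escaped_key = resolve_pointer(data, "/x~1y")    # key containing a slash
```

Fetching `/a` once and reading several keys from the result mirrors the "entire object at once" advice above.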

  • Drop Python 3.6 and 3.7, which are now beyond end-of-life. Add Python 3.11.
  • Exploit CPython Unicode object internals for significantly faster string creation (up to 45%!)
  • Remove Array and Object proxy objects.
    • Changing our approach to this has significantly improved memory safety internally and fixed PyPy support.
  • Update deprecated GitHub Actions.
  • Update vendored simdjson to version 3.2.3.

ToDo:

  • Re-add the JSON-to-buffer/numpy array helper removed in the initial cleanup (this method is many times faster than naively loading JSON when turning a homogeneous array of JSON values into a numpy array)
  • Add support for latest PyPy
  • Memory optimization pass
  • Update documentation and examples.
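For context on the JSON-to-buffer item above, the sketch below shows the naive two-pass path that such a helper is meant to avoid, using only the standard library. It is an illustration under assumptions, not the removed pysimdjson method: `array("d", ...)` stands in for a numpy array, and the sample input is made up.

```python
import json
from array import array

raw = "[1.5, 2.0, 3.25, 4.0]"

# Naive path, pass 1: the parser creates one Python float object per element.
values = json.loads(raw)

# Naive path, pass 2: every float is unboxed again into a contiguous C double buffer.
packed = array("d", values)

# A buffer-protocol view of the packed doubles; numpy.frombuffer could wrap this.
view = memoryview(packed)
```

A direct JSON-to-buffer helper would write the doubles into the buffer while parsing, skipping the intermediate list of float objects entirely.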

@TkTech TkTech self-assigned this Sep 3, 2023
This was referenced Sep 3, 2023
@TkTech
Owner Author

TkTech commented Sep 4, 2023

For certain benchmarks, especially those that are string-heavy, this version is now roughly 45% faster.

---------------------------------------------------------------------- benchmark 'Complete load of data/twitter.json': 2 tests -----------------------------------------------------------------------
Name (time in us)             Min                   Max                  Mean              StdDev                Median                IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
simdjson (NOW)           916.4979 (1.0)      4,188.6930 (1.0)      1,011.2369 (1.0)      391.3896 (1.0)        939.6220 (1.0)      24.2511 (1.0)         16;88  988.8880 (1.0)         690           1
simdjson (OLD)     1,328.0310 (1.45)     4,533.0260 (1.08)     1,428.9499 (1.41)     414.6389 (1.06)     1,355.7710 (1.44)     31.2135 (1.29)        12;49  699.8146 (0.71)        507           1
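
The parenthesized ratios in the table can be reproduced from the raw columns. The short calculation below is only arithmetic on the Mean and OPS values copied from the two rows above; the variable names are illustrative.

```python
# Mean time (microseconds) and operations per second, copied from the table above.
mean_now, mean_old = 1011.2369, 1428.9499
ops_now, ops_old = 988.8880, 699.8146

slowdown_old = mean_old / mean_now       # the "(1.41)" shown next to OLD's mean
time_saved = 1 - mean_now / mean_old     # fraction of wall time removed by NOW
throughput_gain = ops_now / ops_old - 1  # extra documents parsed per second

print(f"{slowdown_old:.2f}x, {time_saved:.0%} less time, {throughput_gain:.0%} more ops")
```

On the mean, the old build takes about 1.41 times as long, which is the same as the new build taking about 29% less time or completing about 41% more operations per second. The Min and Median columns give the 1.45 and 1.44 ratios closest to the "roughly 45%" figure.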

@edgarsi

edgarsi commented Sep 5, 2023

I am using pysimdjson to do work like this:

doc = Parser().parse('{"a": {"b": ...}}')
b = doc['a']['b']
s = b.mini

The contents under b are huge, and pysimdjson allows me to avoid creating Python objects of them.

With the new changes you destroy this feature. Now at_pointer constructs the Python objects forcefully.

I propose that at_pointer return the Document object, and that the Document object implement the mini property or method. (I could not find how mini is even accessible in the current code.) With these changes, the sample code above can be rewritten as:

doc = Parser().parse('{"a": {"b": ...}}')
b = doc.at_pointer('/a/b')
s = b.mini

But this still requires one to know the full path.

I am also using pysimdjson as follows:

doc = Parser().parse('{"a": [{"x": ...}, ...]}')
items = list(doc['a'])
for item in items:
    item[y] = ...
s = deep_jsonify(items)  # uses .mini when possible

First of all, the drop-in functionality of read-only list and dict structures is very nice here. Second, the new Document does not offer any way to list items at all without creating Python objects for the full JSON subtree. If you hate the Array and Object classes, maybe add Document.parse_shallow, which returns the Python element and, when that element is a list, a list of Document objects (and likewise for a dict)?

P.S. Document.root and Document.as_object are the same function with two names, and neither seems to be implemented for backward compatibility reasons.

@TkTech
Owner Author

TkTech commented Sep 6, 2023

Thanks for the feedback @edgarsi, appreciate it.

With the new changes you destroy this feature. Now at_pointer constructs the Python objects forcefully.

This PR won't be merged until it's back to feature parity with v5. The Array and Object interfaces have to disappear for memory safety. While there are a bunch of ways to make it "safe", they come at a severe performance penalty for small documents. They also tended to be used to access more than a key or two, which is often slower than just getting the entire object.

I propose that at_pointer return the Document object, and that the Document object implement the mini property or method. (I could not find how mini is even accessible in the current code.) With these changes, the sample code above can be rewritten as:

Most of the methods on Document() will mimic their counterparts in py_yyjson, where every method can take a pointer. .mini will become mini(at_pointer: str = "/a/b"). You'll actually see a bit of a speed boost and slightly better memory usage.
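
To make the "every method can take a pointer" shape concrete, here is a small standard-library sketch. `SketchDocument` and its method signatures are assumptions made for illustration, not the final pysimdjson API, and unlike the real library this stand-in builds Python objects internally.

```python
import json


class SketchDocument:
    """Hypothetical container: one parsed JSON document, queried by pointer."""

    def __init__(self, raw: str):
        self._root = json.loads(raw)

    def _resolve(self, at_pointer: str):
        # Walk an RFC 6901 pointer; the empty pointer means the document root.
        node = self._root
        if at_pointer == "":
            return node
        for token in at_pointer[1:].split("/"):
            token = token.replace("~1", "/").replace("~0", "~")
            node = node[int(token)] if isinstance(node, list) else node[token]
        return node

    def as_object(self, at_pointer: str = ""):
        """Return plain Python objects for the subtree at the pointer."""
        return self._resolve(at_pointer)

    def mini(self, at_pointer: str = "") -> str:
        """Return minified JSON text for the subtree at the pointer."""
        return json.dumps(self._resolve(at_pointer), separators=(",", ":"))


doc = SketchDocument('{"a": {"b": [1, 2, {"c": null}]}}')
minified = doc.mini("/a/b")  # only the subtree under /a/b is serialized
```

With this shape, the earlier `doc['a']['b'].mini` chain becomes a single call on the document, so no intermediate proxy objects need to exist.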

when that element is a list, a list of Document objects (and likewise for a dict)?

1 JSON Document will return 1 Document() object. It's a memory container, not meant to represent a simd::element. The list, dict, and numpy helpers will be back before this is merged. Proxy objects cannot be used safely in Python, because the Document() may have been reused between calls. All methods in v6 will return Python objects.
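
The reuse hazard can be imitated in pure Python. In the sketch below, `ReusingParser` is a hypothetical stand-in that recycles one buffer between calls and hands out a `memoryview` as its "proxy"; it does not reflect pysimdjson internals, where a stale proxy can point at freed or overwritten native memory rather than merely at different bytes.

```python
class ReusingParser:
    """Hypothetical parser that recycles a single buffer across parse() calls."""

    def __init__(self, capacity: int = 32):
        self._buffer = bytearray(capacity)

    def parse(self, data: bytes) -> memoryview:
        if len(data) > len(self._buffer):
            raise ValueError("document larger than the reusable buffer")
        # Overwrite the shared buffer in place; earlier views still point here.
        self._buffer[:len(data)] = data
        return memoryview(self._buffer)[:len(data)]


parser = ReusingParser()
proxy = parser.parse(b'{"a": 1}')
before = bytes(proxy)      # eager copy: an independent Python object
parser.parse(b'{"b": 2}')  # the parser is reused for a second document
after = bytes(proxy)       # the old proxy now shows the second document
```

The eager copy in `before` stays valid after the second parse, which is the same trade the PR makes by returning Python objects instead of proxies.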

P.S. Document.root and Document.as_object are the same function with two names, and neither seems to be implemented for backward compatibility reasons.

This was already fixed locally. root() isn't exposed to Python, it's a cdef to return the document root for internal functions.

@TkTech TkTech changed the title v6.0.0 v7.0.0 Feb 4, 2024

2 participants