Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add string schema properties #587

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

bsless
Copy link
Contributor

@bsless bsless commented Dec 5, 2021

Currently only implemented for JVM
What do you think?

Copy link
Member

@ikitommi ikitommi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Great! few comments attached. Also, to be complete, we need also:

  1. JSON Schema mappings
  2. Changes to generator

src/malli/core.cljc Outdated Show resolved Hide resolved
src/malli/core.cljc Outdated Show resolved Hide resolved
src/malli/core.cljc Outdated Show resolved Hide resolved
false))))))

#?(:clj
(defn find-blank-method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could lift the minimum java to 11 and remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a huge breaking change for users. Sadly, there's still plenty of Java 8 in the world and we must accommodate

Copy link
Member

@Deraen Deraen Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikitommi Definitely ~half of our work projects are still on Java 8

(let [p (-charset-predicate charset)]
(string-char-predicate p)))
non-blank (when non-blank #(not (blank? %)))]
(-> non-blank
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the :non-blank needed? much shorter to use :min:

[:string {:non-blank true}]
[:string {:min 1}]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid min 1 string: " ", but it's invalid for non-blank

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be solved by transformers, but users can certainly want to specify they want a string with minimum length which is not blank. Blank in this case is a superset of empty, which I almost missed in the beginning

@Deraen
Copy link
Member

Deraen commented Dec 6, 2021

  • Charset name here is quite strange? It doesn't refer to charset really.
  • Looks like after additional changes this works fully on Cljs now? I would be really careful of adding JVM only features, especially if there is no schema creation time error about using non-supported properties on some env.

@bsless
Copy link
Contributor Author

bsless commented Dec 6, 2021

  • Charset name here is quite strange? It doesn't refer to charset really.

It's a set of characters, but I'm not a fan of the name either, if you have a better idea, I don't mind changing it

  • Looks like after additional changes this works fully on Cljs now? I would be really careful of adding JVM only features, especially if there is no schema creation time error about using non-supported properties on some env.

The only JVM only feature currently is regex generation, but this is not a regression

@bsless bsless requested a review from ikitommi December 6, 2021 21:08
@Deraen
Copy link
Member

Deraen commented Dec 7, 2021

Perhaps just :characters (or :chars)?

Charset (or even character set) means completely different thing in the same context, so it is not an option to use that: https://en.wikipedia.org/wiki/Character_encoding

Quick test to check this is faster than just using equivalent patterns:

(bench/quick-bench (validate [:string {:charset :digit}] "123"))
;; 470ns

(bench/quick-bench (validate [:string {:pattern "^\d*$"}] "123"))
;; 765ns

(defn -charset-predicate
[o]
(case o
:digit #?(:clj #(Character/isDigit ^char %) :cljs -numeric-char?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw. there is both char and int versions of the predicates in Java:

https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(char)
https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(int)

The int version supports unicode characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is probably fine. Char supports 0-65535 so it is enough to support unicode range 0x0000 - 0xFFFF.

Clojure characters don't support supplementary char ranges either. Though strings support:

0x2F81A:

\冬
=> Unsupported character: \冬

(.codePointAt "冬" 0)
=> 194586

(char (.codePointAt "冬" 0))
=> Value out of range for char: 194586

(Character/isLetter (int 0x2F81A))
=> true

(int (.charAt "冬" 0))
=> 55422 in this case .charAt only returns first two bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JS charCodeAt works the same as JVM charAt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem is I'm using charAt which returns char, not int

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x2F81A (which is \⾁) is supported under the 12161 unicode: (link)

(int \⾁)
=> 12161

Same with all other unicode characters:

(char 8809)
=> \≩
(char 508)
=> \Ǽ
(int \Ǽ)
=> 508
(int \≩)
=> 8809
(int \Θ)
=> 920
(char 33071)
=> \脯

(merge {:type "string"} (-> schema m/properties (select-keys [:min :max]) (set/rename-keys {:min :minLength, :max :maxLength}))))
(let [props (-> schema m/properties)
pattern (case (:charset props)
:digit "^[0-9]*$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These patterns don't match the Character predicates, the predicates allow other ranges, e.g.:

'\u0030' through '\u0039', ISO-LATIN-1 digits ('0' through '9')
'\u0660' through '\u0669', Arabic-Indic digits
'\u06F0' through '\u06F9', Extended Arabic-Indic digits
'\u0966' through '\u096F', Devanagari digits
'\uFF10' through '\uFF19', Fullwidth digits

There are some classes in at least JVM Pattern which might match the predicates, but not sure if there are equivelents to all, and what does JS support. Listing all the ranges might work for some cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can use unicode ranges, this works:

(re-find #"[\u0061-\u00F6]" "a")

@Deraen
Copy link
Member

Deraen commented Dec 7, 2021

It is possible to rewrite the string loop and char checks to work with unicode values, using codePointAt:

diff --git a/src/malli/core.cljc b/src/malli/core.cljc
index e8ab01b..2878c12 100644
--- a/src/malli/core.cljc
+++ b/src/malli/core.cljc
@@ -583,13 +583,14 @@
 (defn -charset-predicate
   [o]
   (case o
-    :digit #?(:clj #(Character/isDigit ^char %) :cljs -numeric-char?)
-    :letter #?(:clj #(Character/isLetter ^char %) :cljs -letter?)
-    (:alphanumeric :letter-or-digit) #?(:clj #(Character/isLetterOrDigit ^char %) :cljs -alphanumeric?)
+    :digit #?(:clj #(Character/isDigit ^int %) :cljs -numeric-char?)
+    :letter #?(:clj #(Character/isLetter ^int %)
+               :cljs -letter?)
+    (:alphanumeric :letter-or-digit) #?(:clj #(Character/isLetterOrDigit ^int %) :cljs -alphanumeric?)
     :alphabetic #?(:clj #(Character/isAlphabetic (int %)) :cljs -letter?)
     (cond
       (set? o) (miu/-some-pred (mapv -charset-predicate o))
-      (char? o) #?(:clj #(= ^char o %) :cljs (let [i (.charCodeAt o 0)] #(= i %)))
+      (char? o) #?(:clj #(= ^int o %) :cljs (let [i (.charCodeAt o 0)] #(= i %)))
       :else (eval o))))

 (defn string-char-predicate
@@ -599,10 +600,13 @@
       (loop [i 0]
         (if (= i n)
           true
-          (if (p #?(:clj  (.charAt s (unchecked-int i))
-                    :cljs (.charCodeAt s (unchecked-int i))))
-            (recur (unchecked-inc i))
-            false))))))
+          (let [cur-char #?(:clj  (.codePointAt s (unchecked-int i))
+                            :cljs (.codePointAt s (unchecked-int i)))]
+            (if (p cur-char)
+              (recur (unchecked-add i #?(:clj (Character/charCount cur-char)
+                                         ;; TODO: Is there better way?
+                                         :cljs (.length (String/fromCodePoint cur-char)))))
+              false)))))))

It doesn't seem to hurt the performance either:

;; 470ns
(bench/quick-bench (validate [:string {:charset :letter}] "冬"))
;; 840ns
(bench/quick-bench (validate [:string {:pattern "^\p{L}*$"}] "冬"))

;; string schema helpers
;;

#?(:cljs (defn -numeric-char? [c] (and (< 47 c) (< c 58))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks work differently than the JVM versions.

@bsless
Copy link
Contributor Author

bsless commented Dec 7, 2021

Perhaps just :characters (or :chars)?

I think characters is good, I'll make the appropriate change

@Deraen
Copy link
Member

Deraen commented Dec 7, 2021

What is the use case for this feature?

The performance is a bit better (~2x) than Patterns but I don't think that is a huge difference usually. Patterns already support unicode quite well, and both JVM and JS.

This is surprisingly complex feature due to character encodings, and what letter are included in the character classes.

[A-Za-z] and just checking for 64 < c < 91 and 96 < c < 123 break e.g. with Finnish and Scandinavian languages. I don't think providing features that work only in specific cases is that useful. And now the JVM predicates work considerably differently than Cljs predicates. At least we should be very careful to ensure these work similarly in both platforms.

@bsless
Copy link
Contributor Author

bsless commented Dec 7, 2021

Not sure why mix the constructors here:

(require '[criterium.core :as cc])
(require '[malli.core :as m])

(def v0 (m/validator [:string {:charset :digit}]))
(def v1 (m/validator [:string {:pattern #"^\d*$"}]))

(cc/quick-bench (v0 "123")) ;; 16.756693 ns
(cc/quick-bench (v0 "123456789")) ;; 48.106319 ns
(cc/quick-bench (v0 "123456789a")) ;; 53 ns

(cc/quick-bench (v1 "123")) ;; 71.737728 ns
(cc/quick-bench (v1 "123456789")) ;; 109.460351 ns
(cc/quick-bench (v1 "123456789a")) ;; 128 ns

@bsless
Copy link
Contributor Author

bsless commented Dec 7, 2021

The use case for this feature is twofold:

  • regular expressions are slower
  • expressive power - being able to say "a string is alphanumeric" is more readable than ^[A-Za-z0-9]*$
  • Especially when mixed with other constraints, like "it should have at least two elements"
  • Composition / working with data: specify a set of options instead of a regex. One of the options can be a characters set, too, or an ad-hoc validation function

I've seen too many examples at work for such use cases and I don't really want a regex on my hot path

@bsless
Copy link
Contributor Author

bsless commented Dec 13, 2021

Doing some more thinking about this, how about two properties, include-chars and exclude-chars?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants