Allow the parsing of recursive types (#91) #99

GregBowyer · 2019-10-29T02:31:47Z

Support recursive types (#92) and (#95)

The following is a reworking of the schema module to allow for recursive types to function correctly.
When I started this work I wasnt aware that #53 existed so sorry to @stew for stomping on that work. As far as can be tested, these changes fully support all the crazy recursive schemas that are created for avro

Basic approach

Unlike the approach taken in #53, this implementation hides the internal details of the schema data, and flattens the nesting of the schema to a hash map of ident -> data. The ident is taken from a string interning datastructure to make the mapping as cheap as possible.

This means that, all of the public facade interface types are value type objects and can be passed about as Copy types since they are essentially fat pointers.

Each value type holds a reference to the schema containing all data, as well its id for the actual type for lookup. From here each facade value type implements accessor methods to allow the retrieval of further data, as well as allowing facades to be further placed onto data that might recurse.

This means that there is some added indirection for accessing data, although this is somewhat offset by the fact that the data structure should be smaller, involve less allocations and have no Rc requirements.

Since the datastructure holding the schema is none trivial to construct, a builder interface has been provided that allows for the creation of new schemas, this builder tries to guide schema creation in the simplest fashion possible.

The schema module due to complexities has been split into sub modules for the significant parts of schema management (parsing, serialisation etc).

TODO

The following are a few todos that are left, which should be landed before this PR is totally good to go

Implement a proptest for schema roundtrips
Consider reworking pcf (we could really serialise to a tree via serde and then just manipulate that tree)
Remove some more of the .clones in the codebase that could be &'schema str references
Consider adding a cache for schema parsing against pcf forms

GregBowyer · 2019-11-07T20:22:27Z

Any thoughts?

poros · 2019-11-09T19:28:10Z

Sorry @GregBowyer , I didn't understand that you wanted some feedback even if the PR was work in progress. I'll give it a read when I have a bit of time; sorry for the delay.

I think it's also worth pointing out that this would collide with #90 by @CtrlC-Root .

CtrlC-Root · 2019-11-09T19:31:38Z

I've been pretty busy recently so feel free to merge any PRs that conflict with mine especially if they get us closer to supporting unions of complex types. I can come back and close mine or rebase and refactor as necessary to make it work when I have time.

poros

Wow, this is practically a complete rewrite of more than half of the crate. Reviewing and merging this kind of PRs is difficult and takes a long time, so I'd like to to set a clear expectations here that this one may take some iterations and it could have some long pauses because it is not my paid job to maintain this crate. In addition, for such a big change, I'd like to ask the @flavray to give his blessing too, since he is the owner of the crate.

Having said so, from a first glance, this seems pretty solid code! It must have taken a lot of time and effort, thanks for the impressive contribution.

I am a bit scared of the massive breaking changes this is going to introduce to the public interface of the crate. This could make very hard for users to migrate to the new version. Could you please update the main documentation page living in lib.rs so I can get a better grasp of what the interface change would mean for a user of the library? (if you haven't done it already and I missed it among the 50 files to review).

poros · 2019-11-09T23:52:14Z

tests/schema_int.rs

@@ -0,0 +1,1094 @@
+use std::{fs::File, io::BufReader, path::PathBuf, str::FromStr};


I think I missed why this test file is called schema_int.rs. It might be a stupid question, but I would appreciate if you could explain it to me

There was on master a schema.rs as a integration test, so I made it a new target just to make it clear what changes.

It could be merged with schema.rs

I see, that was int for integration. Thanks for separating it so to make the PR easier to read. I think that at the end it would be best to merge it in test/schema.rs, yes.

poros · 2019-11-10T00:12:37Z

src/schema/builder.rs

+
+#[derive(Fail, Debug)]
+#[fail(display = "Invalid schema construction: {}", _0)]
+pub struct BuilderError(pub String);


nit: perhaps SchemaBuilderError given that the struct is called SchemaBuilder?

poros · 2019-11-10T00:49:45Z

src/schema/builder.rs

+        let mut builder = SchemaBuilder::new();
+        let mut record_builder = builder.record("test");
+        record_builder.field("field1", builder.int());
+        record_builder.field("field2", builder.long());
+        let record = record_builder.build(&mut builder)?;
+
+        let schema = builder.build(record)?;


I fear that this asymmetry with primitive types could actually cause a bit of confusion:

let builder = Schema::builder(); let root = builder.long(); builder.build(root).unwrap()

I understand that same kind of difference for named types must exist, but this line

let record = record_builder.build(&mut builder)?;

where the builder now becomes the argument of the finalizing function for the RecordBuilder is really twisting my brain.

The code here is pretty complex and I must admit I am struggling a bit to follow, but:

Is there any way to make the record type be added to the schema builder with the record() call and then return a mutable reference to it?

If not, what would be your opinion on reversing the finalizing call into something like builder.add(record_builder) instead of record_builder.build(&mut bulder)?

Yeah the builder isn't my favorite part and I only cope with its API as I wrote it.

I will try to take another swing at it

poros · 2019-11-10T01:08:14Z

src/schema/mod.rs

+    }
+
+    /// Get the root element for the schema, this is the element that defines the current schema
+    pub fn root(&self) -> SchemaType<'_> {


I think that the name root is befitting of the recursive nature of avro schemas, but the spec doesn't mention the word at all. I see that it is used in the C++ implementation, but I was wondering if something like schema_type could be more descriptive...

What's your opinion?

I have little opinion, I mostly took root as in root of the tree, I dont mind renaming it to something more descriptive

My vote would be for schema_type, I think. @flavray , do you have any opinion?

poros · 2019-11-10T01:09:47Z

tests/schema.rs

@@ -227,9 +223,6 @@ lazy_static! {
            }"#,
            true
        ),
-        /*
-        // TODO: (#95) support same types but with different names in unions and uncomment (below the explanation)


Thanks for having noticed and uncommented!

poros · 2020-01-14T22:33:09Z

@GregBowyer this one is a great PR and I'd love to see it land. I hope my requests for changes didn't put you off :)

GregBowyer · 2020-01-15T17:19:42Z

No I just been super busy with work and will try to get back to this

praetp · 2020-03-01T08:38:05Z

Just wanted to say I hope this PR could be merged in. I have a very complex schema and with this PR it can be parsed 👍 .
Very nice work.

veloxl · 2020-03-12T06:27:51Z

Agree with @praetp, hoping this PR could be merged as this also allowed me to parse a complex schema.

GregBowyer · 2020-07-20T23:51:08Z

So I am looking at old PRs I have open and I notice this is not merged.

Is this still wanted?

poros · 2020-07-21T00:36:25Z

@GregBowyer yes, very much! The code changed quite a bit in the meanwhile, though, so there will be conflicts to fix.

praetp · 2020-08-07T08:13:32Z

Hope this could be merged in soon..

ghost · 2020-09-01T08:39:18Z

@poros @flavray
Hey, could you take a look at it and merge it? I would need this kind of feature for my project. Thank you.

GregBowyer · 2020-09-01T23:26:05Z

@pavlicekcn I am working on rebasing it, sorry I have not got further (just busy with other things)

jtvoorde · 2021-03-19T21:39:11Z

@GregBowyer

I fixed some more tests (including the resolve_union_schema test) in GregBowyer#2
Actually the parse_list I added was broken was broken as well. Due to other test errors the tests for this functionality were never run.

GregBowyer · 2021-04-23T00:08:50Z

FWIW I have not forgotton about this PR, I have been integrating it into some internal work (which has generated followup changes to make peoples lifes easier)

tzachshabtay · 2022-01-19T22:59:43Z

@GregBowyer any updates on this?

My personal use-case (which is a blocker for adopting this lib and possibly rust in general) is not actually full blown recursive types, but reusing named types (what #53 was seemingly trying to solve), i.e this currently fails for me:

let reader_raw_schema = r#"
    {
        "type": "record",
        "name": "User",
        "namespace": "office",
        "fields": [
            {
              "name": "details",
              "type": [
                {
                  "type": "record",
                  "name": "Employee",
                  "fields": [
                    {
                      "name": "gender",
                      "type": {
                        "type": "enum",
                        "name": "Gender",
                        "symbols": [
                          "male",
                          "female"
                        ]
                      },
                      "default": "female"
                    }
                  ]
                },
                {
                  "type": "record",
                  "name": "Manager",
                  "fields": [
                    {
                      "name": "gender",
                      "type": "Gender"
                    }
                  ]
                }
              ]
            }
          ]
}
"#;

    let reader_schema = Schema::parse_str(reader_raw_schema).unwrap();

(panics with value: ParsePrimitive("Gender")). This MR should solve this use-case, right? Is there a workaround in the meantime?

We use Avro IDL which generates this type of pattern whenever you reuse an enum in the same schema.

martin-g · 2022-01-20T07:37:21Z

Hi, @tzachshabtay !
I've just added support for this with apache/avro@064cc6b

Some history: This project has been donated to Apache Avro some time ago. For some reason its original authors are not involved there anymore. I don't know the reasons. Anyway, I wanted to learn Rust (I'm still learning it!) and decided to give the project my help (I'm am an Apache Avro committer since few weeks, yey!).
I've started adding interop tests to the Rust module and faced the missing support for recursive types, so I've implemented it with https://issues.apache.org/jira/browse/AVRO-3302 & apache/avro#1456.
It is in my TODO list to take a look at all open issues in this repo but I didn't have time to do it yet.
It seems my approach in solving this issue is less invasive than this PR. I didn't look at the diff here yet but I needed to change just a few files after adding Schema::Ref type.

Any feedback is very welcome! Please use https://issues.apache.org/jira/browse/AVRO and users@avro.apache.org.

Similar to 064cc6b for Enum Reported at flavray/avro-rs#99 (comment) Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

tzachshabtay · 2022-01-20T23:42:14Z

Hey @martin-g , thanks! I'm also new to rust, I'm excited to try your build, what do I need to write in my cargo.toml to get it?

martin-g · 2022-01-21T05:54:13Z

You could use either:

avro-rs = { git = "https://github.com/apache/avro" }    # for master branch
avro-rs = { git = "https://github.com/apache/avro", branch="avro-3302-add-interop-tests-for-rust" }  # to test a specific branch

Update: the above does not work because the Rust project is in a sub-folder (lang/rust).

or clone the Git repo locally and refer to it with:

[dependencies]
avro-rs = {path="/home/martin/git/apache/avro/lang/rust"}

This will use the currently checked out branch.

Please consult with https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html for more information.

tzachshabtay · 2022-01-27T03:53:28Z

@martin-g thanks, I can confirm your version works. 🎉

Similar to 064cc6b for Enum Reported at flavray/avro-rs#99 (comment) Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> (cherry picked from commit 29197da)

WaterKnight1998 · 2023-03-02T09:31:42Z

Hi, @tzachshabtay ! I've just added support for this with apache/avro@064cc6b

Some history: This project has been donated to Apache Avro some time ago. For some reason its original authors are not involved there anymore. I don't know the reasons. Anyway, I wanted to learn Rust (I'm still learning it!) and decided to give the project my help (I'm am an Apache Avro committer since few weeks, yey!). I've started adding interop tests to the Rust module and faced the missing support for recursive types, so I've implemented it with https://issues.apache.org/jira/browse/AVRO-3302 & apache/avro#1456. It is in my TODO list to take a look at all open issues in this repo but I didn't have time to do it yet. It seems my approach in solving this issue is less invasive than this PR. I didn't look at the diff here yet but I needed to change just a few files after adding Schema::Ref type.

Any feedback is very welcome! Please use https://issues.apache.org/jira/browse/AVRO and users@avro.apache.org.

@martin-g Is it possible to recover a schema using the ref? Did you implemented any method in the lib for this? Thanks in advance :)

martin-g · 2023-03-02T09:40:09Z

The recovering happens internally to validate and resolve the fields.
What exactly do you need ?

WaterKnight1998 · 2023-03-02T09:49:17Z

The recovering happens internally to validate and resolve the fields. What exactly do you need ?

I am implementing an Arrow to Avro serializer, the parquet was serialized with Java Avro Parquet Write so I have the Avro Schema inside the key "parquet.avro.schema" of metadata.

So I've created a recursive function that converts the parquet arrays into Avro records with apache-avro rust lib. I'm forwarding in the recursion the appropriate parquet and avro subschema in each part, because depending on the schemas I'm doing one conversion or another.

The thing is that I have reached a point in the recursion where I have a reference to a record type, but I need the complete avro schema to continue with the recursion.

If you can help me obtaining the appropiate schema it would be helpful, I will be glad to do a PR to the official lib when I have all this working :)

WaterKnight1998 · 2023-03-02T10:09:19Z

@martin-g I have seen this methods in the lib: https://github.com/apache/avro/blob/5b2c27956b601c580af277614397b2e43e0ba9f9/lang/rust/avro/src/decode.rs#L71

But this structs and methods are not public available :(

martin-g · 2023-03-02T10:48:06Z

The best is to fork apache/avro, make the needed changes and send a PR.

WaterKnight1998 · 2023-03-02T11:29:00Z

The best is to fork apache/avro, make the needed changes and send a PR.

Ok, I will do it, is just making ResolvedSchema public. When I have the parquet to avro conversion I will figure out where is the best place to open PR, I think arrow-rs repo

WaterKnight1998 · 2023-03-06T13:44:14Z

@martin-g I have a Record with two attributes of type recrod two. Eachf of them validates separately angaints its schema. But when I execute:

        let mut record = Record::new(avro_writer.schema()).unwrap();
              record.put(col_name, value);
avro_writer.append(record).unwrap();
avro_writer.flush().unwrap()

I get next error on flush:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ValidationWithReason("Unresolved schema reference: 'XXX'. Parsed names: [Name { name: \"XXX\", namespace: Some(\"com.XXXX.XXXX.XXX\") }]

martin-g · 2023-03-06T14:37:21Z

@WaterKnight1998 Please use Apache Avro users@ mailing list or Slack.

I will need to see the schema and the record to be sure but it looks like the resolved schema for this reference is namespaced while you try to point to it without the namespace.
The Schema::Ref should be either fully qualified or the namespace should be provided as enclosing.

WaterKnight1998 · 2023-03-06T15:10:39Z

@WaterKnight1998 Please use Apache Avro users@ mailing list or Slack.

I will need to see the schema and the record to be sure but it looks like the resolved schema for this reference is namespaced while you try to point to it without the namespace. The Schema::Ref should be either fully qualified or the namespace should be provided as enclosing.

Could you share the slack please?

martin-g · 2023-03-06T15:13:48Z

Give me an email and I will send you an invitation. Either here or at mgrigorov at apache org

WaterKnight1998 · 2023-03-06T15:26:51Z

@martin-g done, i sent you an email

WaterKnight1998 · 2023-03-06T15:35:05Z

@martin-g I managed to reproduce:

use apache_avro::{Schema, Writer};
use apache_avro::types::{Record, Value};

        let raw_schema = r#"
        {
            "type": "record",
            "name": "test",
            "namespace": "com.example.namespace.1",
            "fields": [
                {"name": "ListID", "type": "record", "fields": [ {
                    "name": "id",
                    "type": "long",
                    "namespace": "com.example.namespace.2",
                    "doc": "The cart identifier"
                  }]},
                {"name": "b", "type": "ListID"}
            ]
        }
    "#;
let schema = Schema::parse_str(raw_schema).unwrap();
let mut writer = Writer::new(&schema, Vec::new());
let mut record = Record::new(writer.schema()).unwrap();
record.put("ListID", Value::Record(vec![(String::from("id"), Value::Long(36))]));
record.put("b", Value::Record(vec![(String::from("id"), Value::Long(36))]));
writer.append(record).unwrap();

WaterKnight1998 · 2023-03-06T18:11:05Z

I forked the lib and made public several structs and methods and now is working :)

martin-g · 2023-03-09T12:03:57Z

The code at #99 (comment) works just fine with current master of apache_avro.
@WaterKnight1998 If you still face any problems with it then please report a bug at JIRA!

Similar to apache/avro@064cc6b for Enum Reported at flavray/avro-rs#99 (comment) Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

GregBowyer force-pushed the recursive-types branch 2 times, most recently from cb031d2 to 75bf896 Compare October 29, 2019 18:08

Allow the parsing of recursive types (flavray#92 & flavray#95)

e37b2bd

GregBowyer force-pushed the recursive-types branch from 75bf896 to e37b2bd Compare October 29, 2019 19:31

poros requested changes Nov 9, 2019

View reviewed changes

poros reviewed Nov 9, 2019

View reviewed changes

poros reviewed Nov 10, 2019

View reviewed changes

poros added bug enhancement feature request labels Nov 16, 2019

poros mentioned this pull request Nov 16, 2019

Support 'aliases' in fields #84

Closed

poros mentioned this pull request Apr 11, 2020

fix-for-reusing-enum-references #118

Closed

stew mentioned this pull request Apr 18, 2020

allow for complex schema to refer to previously named schema #53

Closed

poros mentioned this pull request Apr 26, 2020

Record type deserialisation performance #89

Open

poros mentioned this pull request Jul 5, 2020

Add a parser, to enable references to other schema's #139

Closed

Add Topological sort for schemas

4faf9b0

martin-g added a commit to apache/avro that referenced this pull request Jan 20, 2022

AVRO-3302: Add support for recursive types using Fixed schema

29197da

Similar to 064cc6b for Enum Reported at flavray/avro-rs#99 (comment) Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

tzachshabtay mentioned this pull request Jan 27, 2022

Switch to apache-avro gklijs/schema_registry_converter#75

Closed

martin-g mentioned this pull request Mar 5, 2022

Add support for deduplicating definitions in schemas #202

Closed

This was referenced Apr 13, 2022

No support for named type references #203

Closed

Fixes #25 - Migrate to Apache Avro Rust SDK lerouxrgd/rsgen-avro#26

Merged

WaterKnight1998 mentioned this pull request Mar 7, 2023

AVRO-3723: [RUST] These methods should be public. apache/avro#2131

Merged

		@@ -0,0 +1,1094 @@
		use std::{fs::File, io::BufReader, path::PathBuf, str::FromStr};

Allow the parsing of recursive types (#91) #99

Are you sure you want to change the base?

Allow the parsing of recursive types (#91) #99

Conversation

GregBowyer commented Oct 29, 2019 • edited Loading

Support recursive types (#92) and (#95)

Basic approach

TODO

GregBowyer commented Nov 7, 2019

poros commented Nov 9, 2019

CtrlC-Root commented Nov 9, 2019

poros left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

poros Nov 16, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

poros commented Jan 14, 2020

GregBowyer commented Jan 15, 2020

praetp commented Mar 1, 2020

veloxl commented Mar 12, 2020

GregBowyer commented Jul 20, 2020

poros commented Jul 21, 2020

praetp commented Aug 7, 2020

ghost commented Sep 1, 2020 • edited by ghost Loading

GregBowyer commented Sep 1, 2020

jtvoorde commented Mar 19, 2021 • edited Loading

GregBowyer commented Apr 23, 2021

tzachshabtay commented Jan 19, 2022

martin-g commented Jan 20, 2022

tzachshabtay commented Jan 20, 2022

martin-g commented Jan 21, 2022 • edited Loading

tzachshabtay commented Jan 27, 2022

WaterKnight1998 commented Mar 2, 2023

martin-g commented Mar 2, 2023

WaterKnight1998 commented Mar 2, 2023

WaterKnight1998 commented Mar 2, 2023

martin-g commented Mar 2, 2023

WaterKnight1998 commented Mar 2, 2023

WaterKnight1998 commented Mar 6, 2023

martin-g commented Mar 6, 2023

WaterKnight1998 commented Mar 6, 2023

martin-g commented Mar 6, 2023

WaterKnight1998 commented Mar 6, 2023

WaterKnight1998 commented Mar 6, 2023

WaterKnight1998 commented Mar 6, 2023

martin-g commented Mar 9, 2023

GregBowyer commented Oct 29, 2019 •

edited

Loading

poros Nov 16, 2019 •

edited

Loading

ghost commented Sep 1, 2020 •

edited by ghost

Loading

jtvoorde commented Mar 19, 2021 •

edited

Loading

martin-g commented Jan 21, 2022 •

edited

Loading