Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the parsing of recursive types (#91) #99

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

GregBowyer
Copy link

@GregBowyer GregBowyer commented Oct 29, 2019

Support recursive types (#92) and (#95)

The following is a reworking of the schema module to allow for recursive types to function correctly.
When I started this work I wasnt aware that #53 existed so sorry to @stew for stomping on that work. As far as can be tested, these changes fully support all the crazy recursive schemas that are created for avro

Basic approach

Unlike the approach taken in #53, this implementation hides the internal details of the schema data, and flattens the nesting of the schema to a hash map of ident -> data. The ident is taken from a string interning datastructure to make the mapping as cheap as possible.

This means that, all of the public facade interface types are value type objects and can be passed about as Copy types since they are essentially fat pointers.

Each value type holds a reference to the schema containing all data, as well its id for the actual type for lookup. From here each facade value type implements accessor methods to allow the retrieval of further data, as well as allowing facades to be further placed onto data that might recurse.

This means that there is some added indirection for accessing data, although this is somewhat offset by the fact that the data structure should be smaller, involve less allocations and have no Rc requirements.

Since the datastructure holding the schema is none trivial to construct, a builder interface has been provided that allows for the creation of new schemas, this builder tries to guide schema creation in the simplest fashion possible.

The schema module due to complexities has been split into sub modules for the significant parts of schema management (parsing, serialisation etc).

TODO

The following are a few todos that are left, which should be landed before this PR is totally good to go

  • Implement a proptest for schema roundtrips
  • Consider reworking pcf (we could really serialise to a tree via serde and then just manipulate that tree)
  • Remove some more of the .clones in the codebase that could be &'schema str references
  • Consider adding a cache for schema parsing against pcf forms

Sorry, something went wrong.

@GregBowyer GregBowyer force-pushed the recursive-types branch 2 times, most recently from cb031d2 to 75bf896 Compare October 29, 2019 18:08
@GregBowyer
Copy link
Author

Any thoughts?

@poros
Copy link
Collaborator

poros commented Nov 9, 2019

Sorry @GregBowyer , I didn't understand that you wanted some feedback even if the PR was work in progress. I'll give it a read when I have a bit of time; sorry for the delay.

I think it's also worth pointing out that this would collide with #90 by @CtrlC-Root .

@CtrlC-Root
Copy link

I've been pretty busy recently so feel free to merge any PRs that conflict with mine especially if they get us closer to supporting unions of complex types. I can come back and close mine or rebase and refactor as necessary to make it work when I have time.

Copy link
Collaborator

@poros poros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is practically a complete rewrite of more than half of the crate. Reviewing and merging this kind of PRs is difficult and takes a long time, so I'd like to to set a clear expectations here that this one may take some iterations and it could have some long pauses because it is not my paid job to maintain this crate. In addition, for such a big change, I'd like to ask the @flavray to give his blessing too, since he is the owner of the crate.

Having said so, from a first glance, this seems pretty solid code! It must have taken a lot of time and effort, thanks for the impressive contribution.

I am a bit scared of the massive breaking changes this is going to introduce to the public interface of the crate. This could make very hard for users to migrate to the new version. Could you please update the main documentation page living in lib.rs so I can get a better grasp of what the interface change would mean for a user of the library? (if you haven't done it already and I missed it among the 50 files to review).

@@ -0,0 +1,1094 @@
use std::{fs::File, io::BufReader, path::PathBuf, str::FromStr};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I missed why this test file is called schema_int.rs. It might be a stupid question, but I would appreciate if you could explain it to me

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was on master a schema.rs as a integration test, so I made it a new target just to make it clear what changes.

It could be merged with schema.rs

Copy link
Collaborator

@poros poros Nov 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that was int for integration. Thanks for separating it so to make the PR easier to read. I think that at the end it would be best to merge it in test/schema.rs, yes.


#[derive(Fail, Debug)]
#[fail(display = "Invalid schema construction: {}", _0)]
pub struct BuilderError(pub String);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps SchemaBuilderError given that the struct is called SchemaBuilder?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep :P

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

Comment on lines +563 to +569
let mut builder = SchemaBuilder::new();
let mut record_builder = builder.record("test");
record_builder.field("field1", builder.int());
record_builder.field("field2", builder.long());
let record = record_builder.build(&mut builder)?;

let schema = builder.build(record)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear that this asymmetry with primitive types could actually cause a bit of confusion:

let builder = Schema::builder();
let root = builder.long();
builder.build(root).unwrap()

I understand that same kind of difference for named types must exist, but this line

let record = record_builder.build(&mut builder)?;

where the builder now becomes the argument of the finalizing function for the RecordBuilder is really twisting my brain.

The code here is pretty complex and I must admit I am struggling a bit to follow, but:

  1. Is there any way to make the record type be added to the schema builder with the record() call and then return a mutable reference to it?
  2. If not, what would be your opinion on reversing the finalizing call into something like builder.add(record_builder) instead of record_builder.build(&mut bulder)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the builder isn't my favorite part and I only cope with its API as I wrote it.

I will try to take another swing at it

}

/// Get the root element for the schema, this is the element that defines the current schema
pub fn root(&self) -> SchemaType<'_> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the name root is befitting of the recursive nature of avro schemas, but the spec doesn't mention the word at all. I see that it is used in the C++ implementation, but I was wondering if something like schema_type could be more descriptive...

What's your opinion?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have little opinion, I mostly took root as in root of the tree, I dont mind renaming it to something more descriptive

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vote would be for schema_type, I think. @flavray , do you have any opinion?

tests/schema.rs Outdated
@@ -227,9 +223,6 @@ lazy_static! {
}"#,
true
),
/*
// TODO: (#95) support same types but with different names in unions and uncomment (below the explanation)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for having noticed and uncommented!

@poros
Copy link
Collaborator

poros commented Jan 14, 2020

@GregBowyer this one is a great PR and I'd love to see it land. I hope my requests for changes didn't put you off :)

@GregBowyer
Copy link
Author

No I just been super busy with work and will try to get back to this

@praetp
Copy link

praetp commented Mar 1, 2020

Just wanted to say I hope this PR could be merged in. I have a very complex schema and with this PR it can be parsed 👍 .
Very nice work.

@veloxl
Copy link

veloxl commented Mar 12, 2020

Agree with @praetp, hoping this PR could be merged as this also allowed me to parse a complex schema.

@GregBowyer
Copy link
Author

So I am looking at old PRs I have open and I notice this is not merged.

Is this still wanted?

@poros
Copy link
Collaborator

poros commented Jul 21, 2020

@GregBowyer yes, very much! The code changed quite a bit in the meanwhile, though, so there will be conflicts to fix.

@praetp
Copy link

praetp commented Aug 7, 2020

Hope this could be merged in soon..

@ghost
Copy link

ghost commented Sep 1, 2020

@poros @flavray
Hey, could you take a look at it and merge it? I would need this kind of feature for my project. Thank you.

@GregBowyer
Copy link
Author

@pavlicekcn I am working on rebasing it, sorry I have not got further (just busy with other things)

@jtvoorde
Copy link

jtvoorde commented Mar 19, 2021

@GregBowyer

I fixed some more tests (including the resolve_union_schema test) in GregBowyer#2
Actually the parse_list I added was broken was broken as well. Due to other test errors the tests for this functionality were never run.

@GregBowyer
Copy link
Author

FWIW I have not forgotton about this PR, I have been integrating it into some internal work (which has generated followup changes to make peoples lifes easier)

@tzachshabtay
Copy link

@GregBowyer any updates on this?

My personal use-case (which is a blocker for adopting this lib and possibly rust in general) is not actually full blown recursive types, but reusing named types (what #53 was seemingly trying to solve), i.e this currently fails for me:

let reader_raw_schema = r#"
    {
        "type": "record",
        "name": "User",
        "namespace": "office",
        "fields": [
            {
              "name": "details",
              "type": [
                {
                  "type": "record",
                  "name": "Employee",
                  "fields": [
                    {
                      "name": "gender",
                      "type": {
                        "type": "enum",
                        "name": "Gender",
                        "symbols": [
                          "male",
                          "female"
                        ]
                      },
                      "default": "female"
                    }
                  ]
                },
                {
                  "type": "record",
                  "name": "Manager",
                  "fields": [
                    {
                      "name": "gender",
                      "type": "Gender"
                    }
                  ]
                }
              ]
            }
          ]
}
"#;

    let reader_schema = Schema::parse_str(reader_raw_schema).unwrap();

(panics with value: ParsePrimitive("Gender")). This MR should solve this use-case, right? Is there a workaround in the meantime?

We use Avro IDL which generates this type of pattern whenever you reuse an enum in the same schema.

@martin-g
Copy link

Hi, @tzachshabtay !
I've just added support for this with apache/avro@064cc6b

Some history: This project has been donated to Apache Avro some time ago. For some reason its original authors are not involved there anymore. I don't know the reasons. Anyway, I wanted to learn Rust (I'm still learning it!) and decided to give the project my help (I'm am an Apache Avro committer since few weeks, yey!).
I've started adding interop tests to the Rust module and faced the missing support for recursive types, so I've implemented it with https://issues.apache.org/jira/browse/AVRO-3302 & apache/avro#1456.
It is in my TODO list to take a look at all open issues in this repo but I didn't have time to do it yet.
It seems my approach in solving this issue is less invasive than this PR. I didn't look at the diff here yet but I needed to change just a few files after adding Schema::Ref type.

Any feedback is very welcome! Please use https://issues.apache.org/jira/browse/AVRO and users@avro.apache.org.

martin-g added a commit to apache/avro that referenced this pull request Jan 20, 2022
Similar to
064cc6b
for Enum
Reported at flavray/avro-rs#99 (comment)

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
@tzachshabtay
Copy link

Hey @martin-g , thanks! I'm also new to rust, I'm excited to try your build, what do I need to write in my cargo.toml to get it?

@martin-g
Copy link

martin-g commented Jan 21, 2022

You could use either:

avro-rs = { git = "https://github.com/apache/avro" }    # for master branch
avro-rs = { git = "https://github.com/apache/avro", branch="avro-3302-add-interop-tests-for-rust" }  # to test a specific branch 

Update: the above does not work because the Rust project is in a sub-folder (lang/rust).

or clone the Git repo locally and refer to it with:

[dependencies]
avro-rs = {path="/home/martin/git/apache/avro/lang/rust"}

This will use the currently checked out branch.

Please consult with https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html for more information.

@tzachshabtay
Copy link

@martin-g thanks, I can confirm your version works. 🎉

martin-g added a commit to apache/avro that referenced this pull request Feb 1, 2022
Similar to
064cc6b
for Enum
Reported at flavray/avro-rs#99 (comment)

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
(cherry picked from commit 29197da)
@WaterKnight1998
Copy link

Hi, @tzachshabtay ! I've just added support for this with apache/avro@064cc6b

Some history: This project has been donated to Apache Avro some time ago. For some reason its original authors are not involved there anymore. I don't know the reasons. Anyway, I wanted to learn Rust (I'm still learning it!) and decided to give the project my help (I'm am an Apache Avro committer since few weeks, yey!). I've started adding interop tests to the Rust module and faced the missing support for recursive types, so I've implemented it with https://issues.apache.org/jira/browse/AVRO-3302 & apache/avro#1456. It is in my TODO list to take a look at all open issues in this repo but I didn't have time to do it yet. It seems my approach in solving this issue is less invasive than this PR. I didn't look at the diff here yet but I needed to change just a few files after adding Schema::Ref type.

Any feedback is very welcome! Please use https://issues.apache.org/jira/browse/AVRO and users@avro.apache.org.

@martin-g Is it possible to recover a schema using the ref? Did you implemented any method in the lib for this? Thanks in advance :)

@martin-g
Copy link

martin-g commented Mar 2, 2023

The recovering happens internally to validate and resolve the fields.
What exactly do you need ?

@WaterKnight1998
Copy link

The recovering happens internally to validate and resolve the fields. What exactly do you need ?

I am implementing an Arrow to Avro serializer, the parquet was serialized with Java Avro Parquet Write so I have the Avro Schema inside the key "parquet.avro.schema" of metadata.

So I've created a recursive function that converts the parquet arrays into Avro records with apache-avro rust lib. I'm forwarding in the recursion the appropriate parquet and avro subschema in each part, because depending on the schemas I'm doing one conversion or another.

The thing is that I have reached a point in the recursion where I have a reference to a record type, but I need the complete avro schema to continue with the recursion.

If you can help me obtaining the appropiate schema it would be helpful, I will be glad to do a PR to the official lib when I have all this working :)

@WaterKnight1998
Copy link

@martin-g I have seen this methods in the lib: https://github.com/apache/avro/blob/5b2c27956b601c580af277614397b2e43e0ba9f9/lang/rust/avro/src/decode.rs#L71

But this structs and methods are not public available :(

@martin-g
Copy link

martin-g commented Mar 2, 2023

The best is to fork apache/avro, make the needed changes and send a PR.

@WaterKnight1998
Copy link

The best is to fork apache/avro, make the needed changes and send a PR.

Ok, I will do it, is just making ResolvedSchema public. When I have the parquet to avro conversion I will figure out where is the best place to open PR, I think arrow-rs repo

@WaterKnight1998
Copy link

@martin-g I have a Record with two attributes of type recrod two. Eachf of them validates separately angaints its schema. But when I execute:

        let mut record = Record::new(avro_writer.schema()).unwrap();
              record.put(col_name, value);
avro_writer.append(record).unwrap();
avro_writer.flush().unwrap()

I get next error on flush:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ValidationWithReason("Unresolved schema reference: 'XXX'. Parsed names: [Name { name: \"XXX\", namespace: Some(\"com.XXXX.XXXX.XXX\") }]

@martin-g
Copy link

martin-g commented Mar 6, 2023

@WaterKnight1998 Please use Apache Avro users@ mailing list or Slack.

I will need to see the schema and the record to be sure but it looks like the resolved schema for this reference is namespaced while you try to point to it without the namespace.
The Schema::Ref should be either fully qualified or the namespace should be provided as enclosing.

@WaterKnight1998
Copy link

@WaterKnight1998 Please use Apache Avro users@ mailing list or Slack.

I will need to see the schema and the record to be sure but it looks like the resolved schema for this reference is namespaced while you try to point to it without the namespace. The Schema::Ref should be either fully qualified or the namespace should be provided as enclosing.

Could you share the slack please?

@martin-g
Copy link

martin-g commented Mar 6, 2023

Give me an email and I will send you an invitation. Either here or at mgrigorov at apache org

@WaterKnight1998
Copy link

@martin-g done, i sent you an email

@WaterKnight1998
Copy link

@martin-g I managed to reproduce:

use apache_avro::{Schema, Writer};
use apache_avro::types::{Record, Value};

        let raw_schema = r#"
        {
            "type": "record",
            "name": "test",
            "namespace": "com.example.namespace.1",
            "fields": [
                {"name": "ListID", "type": "record", "fields": [ {
                    "name": "id",
                    "type": "long",
                    "namespace": "com.example.namespace.2",
                    "doc": "The cart identifier"
                  }]},
                {"name": "b", "type": "ListID"}
            ]
        }
    "#;
let schema = Schema::parse_str(raw_schema).unwrap();
let mut writer = Writer::new(&schema, Vec::new());
let mut record = Record::new(writer.schema()).unwrap();
record.put("ListID", Value::Record(vec![(String::from("id"), Value::Long(36))]));
record.put("b", Value::Record(vec![(String::from("id"), Value::Long(36))]));
writer.append(record).unwrap();

@WaterKnight1998
Copy link

I forked the lib and made public several structs and methods and now is working :)

@martin-g
Copy link

martin-g commented Mar 9, 2023

The code at #99 (comment) works just fine with current master of apache_avro.
@WaterKnight1998 If you still face any problems with it then please report a bug at JIRA!

martin-g added a commit to apache/avro-rs that referenced this pull request Sep 23, 2024
Similar to
apache/avro@064cc6b
for Enum
Reported at flavray/avro-rs#99 (comment)

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet