-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow the parsing of recursive types (#91) #99
base: main
Are you sure you want to change the base?
Conversation
cb031d2
to
75bf896
Compare
75bf896
to
e37b2bd
Compare
Any thoughts? |
Sorry @GregBowyer , I didn't understand that you wanted some feedback even if the PR was work in progress. I'll give it a read when I have a bit of time; sorry for the delay. I think it's also worth pointing out that this would collide with #90 by @CtrlC-Root . |
I've been pretty busy recently so feel free to merge any PRs that conflict with mine especially if they get us closer to supporting unions of complex types. I can come back and close mine or rebase and refactor as necessary to make it work when I have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is practically a complete rewrite of more than half of the crate. Reviewing and merging this kind of PRs is difficult and takes a long time, so I'd like to to set a clear expectations here that this one may take some iterations and it could have some long pauses because it is not my paid job to maintain this crate. In addition, for such a big change, I'd like to ask the @flavray to give his blessing too, since he is the owner of the crate.
Having said so, from a first glance, this seems pretty solid code! It must have taken a lot of time and effort, thanks for the impressive contribution.
I am a bit scared of the massive breaking changes this is going to introduce to the public interface of the crate. This could make very hard for users to migrate to the new version. Could you please update the main documentation page living in lib.rs
so I can get a better grasp of what the interface change would mean for a user of the library? (if you haven't done it already and I missed it among the 50 files to review).
@@ -0,0 +1,1094 @@ | |||
use std::{fs::File, io::BufReader, path::PathBuf, str::FromStr}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I missed why this test file is called schema_int.rs
. It might be a stupid question, but I would appreciate if you could explain it to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was on master a schema.rs as a integration test, so I made it a new target just to make it clear what changes.
It could be merged with schema.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that was int for integration. Thanks for separating it so to make the PR easier to read. I think that at the end it would be best to merge it in test/schema.rs
, yes.
src/schema/builder.rs
Outdated
|
||
#[derive(Fail, Debug)] | ||
#[fail(display = "Invalid schema construction: {}", _0)] | ||
pub struct BuilderError(pub String); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps SchemaBuilderError
given that the struct is called SchemaBuilder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
let mut builder = SchemaBuilder::new(); | ||
let mut record_builder = builder.record("test"); | ||
record_builder.field("field1", builder.int()); | ||
record_builder.field("field2", builder.long()); | ||
let record = record_builder.build(&mut builder)?; | ||
|
||
let schema = builder.build(record)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fear that this asymmetry with primitive types could actually cause a bit of confusion:
let builder = Schema::builder();
let root = builder.long();
builder.build(root).unwrap()
I understand that same kind of difference for named types must exist, but this line
let record = record_builder.build(&mut builder)?;
where the builder
now becomes the argument of the finalizing function for the RecordBuilder
is really twisting my brain.
The code here is pretty complex and I must admit I am struggling a bit to follow, but:
- Is there any way to make the record type be added to the schema builder with the
record()
call and then return a mutable reference to it? - If not, what would be your opinion on reversing the finalizing call into something like
builder.add(record_builder)
instead ofrecord_builder.build(&mut bulder)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the builder isn't my favorite part and I only cope with its API as I wrote it.
I will try to take another swing at it
} | ||
|
||
/// Get the root element for the schema, this is the element that defines the current schema | ||
pub fn root(&self) -> SchemaType<'_> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the name root
is befitting of the recursive nature of avro schemas, but the spec doesn't mention the word at all. I see that it is used in the C++ implementation, but I was wondering if something like schema_type
could be more descriptive...
What's your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have little opinion, I mostly took root as in root of the tree, I dont mind renaming it to something more descriptive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My vote would be for schema_type
, I think. @flavray , do you have any opinion?
tests/schema.rs
Outdated
@@ -227,9 +223,6 @@ lazy_static! { | |||
}"#, | |||
true | |||
), | |||
/* | |||
// TODO: (#95) support same types but with different names in unions and uncomment (below the explanation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for having noticed and uncommented!
@GregBowyer this one is a great PR and I'd love to see it land. I hope my requests for changes didn't put you off :) |
No I just been super busy with work and will try to get back to this |
Just wanted to say I hope this PR could be merged in. I have a very complex schema and with this PR it can be parsed 👍 . |
Agree with @praetp, hoping this PR could be merged as this also allowed me to parse a complex schema. |
So I am looking at old PRs I have open and I notice this is not merged. Is this still wanted? |
@GregBowyer yes, very much! The code changed quite a bit in the meanwhile, though, so there will be conflicts to fix. |
Hope this could be merged in soon.. |
@pavlicekcn I am working on rebasing it, sorry I have not got further (just busy with other things) |
I fixed some more tests (including the resolve_union_schema test) in GregBowyer#2 |
FWIW I have not forgotton about this PR, I have been integrating it into some internal work (which has generated followup changes to make peoples lifes easier) |
@GregBowyer any updates on this? My personal use-case (which is a blocker for adopting this lib and possibly rust in general) is not actually full blown recursive types, but reusing named types (what #53 was seemingly trying to solve), i.e this currently fails for me: let reader_raw_schema = r#"
{
"type": "record",
"name": "User",
"namespace": "office",
"fields": [
{
"name": "details",
"type": [
{
"type": "record",
"name": "Employee",
"fields": [
{
"name": "gender",
"type": {
"type": "enum",
"name": "Gender",
"symbols": [
"male",
"female"
]
},
"default": "female"
}
]
},
{
"type": "record",
"name": "Manager",
"fields": [
{
"name": "gender",
"type": "Gender"
}
]
}
]
}
]
}
"#;
let reader_schema = Schema::parse_str(reader_raw_schema).unwrap(); (panics with We use Avro IDL which generates this type of pattern whenever you reuse an enum in the same schema. |
Hi, @tzachshabtay ! Some history: This project has been donated to Apache Avro some time ago. For some reason its original authors are not involved there anymore. I don't know the reasons. Anyway, I wanted to learn Rust (I'm still learning it!) and decided to give the project my help (I'm am an Apache Avro committer since few weeks, yey!). Any feedback is very welcome! Please use https://issues.apache.org/jira/browse/AVRO and users@avro.apache.org. |
Similar to 064cc6b for Enum Reported at flavray/avro-rs#99 (comment) Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Hey @martin-g , thanks! I'm also new to rust, I'm excited to try your build, what do I need to write in my |
You could use either:
Update: the above does not work because the Rust project is in a sub-folder ( or clone the Git repo locally and refer to it with:
This will use the currently checked out branch. Please consult with https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html for more information. |
@martin-g thanks, I can confirm your version works. 🎉 |
Similar to 064cc6b for Enum Reported at flavray/avro-rs#99 (comment) Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> (cherry picked from commit 29197da)
@martin-g Is it possible to recover a schema using the ref? Did you implemented any method in the lib for this? Thanks in advance :) |
The recovering happens internally to validate and resolve the fields. |
I am implementing an Arrow to Avro serializer, the parquet was serialized with Java Avro Parquet Write so I have the Avro Schema inside the key So I've created a recursive function that converts the parquet arrays into Avro records with The thing is that I have reached a point in the recursion where I have a reference to a record type, but I need the complete avro schema to continue with the recursion. If you can help me obtaining the appropiate schema it would be helpful, I will be glad to do a PR to the official lib when I have all this working :) |
@martin-g I have seen this methods in the lib: https://github.com/apache/avro/blob/5b2c27956b601c580af277614397b2e43e0ba9f9/lang/rust/avro/src/decode.rs#L71 But this structs and methods are not public available :( |
The best is to fork apache/avro, make the needed changes and send a PR. |
Ok, I will do it, is just making ResolvedSchema public. When I have the parquet to avro conversion I will figure out where is the best place to open PR, I think arrow-rs repo |
@martin-g I have a Record with two attributes of type recrod two. Eachf of them validates separately angaints its schema. But when I execute: let mut record = Record::new(avro_writer.schema()).unwrap();
record.put(col_name, value);
avro_writer.append(record).unwrap();
avro_writer.flush().unwrap() I get next error on flush: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ValidationWithReason("Unresolved schema reference: 'XXX'. Parsed names: [Name { name: \"XXX\", namespace: Some(\"com.XXXX.XXXX.XXX\") }] |
@WaterKnight1998 Please use Apache Avro users@ mailing list or Slack. I will need to see the schema and the record to be sure but it looks like the resolved schema for this reference is namespaced while you try to point to it without the namespace. |
Could you share the slack please? |
Give me an email and I will send you an invitation. Either here or at mgrigorov at apache org |
@martin-g done, i sent you an email |
@martin-g I managed to reproduce: use apache_avro::{Schema, Writer};
use apache_avro::types::{Record, Value};
let raw_schema = r#"
{
"type": "record",
"name": "test",
"namespace": "com.example.namespace.1",
"fields": [
{"name": "ListID", "type": "record", "fields": [ {
"name": "id",
"type": "long",
"namespace": "com.example.namespace.2",
"doc": "The cart identifier"
}]},
{"name": "b", "type": "ListID"}
]
}
"#;
let schema = Schema::parse_str(raw_schema).unwrap();
let mut writer = Writer::new(&schema, Vec::new());
let mut record = Record::new(writer.schema()).unwrap();
record.put("ListID", Value::Record(vec![(String::from("id"), Value::Long(36))]));
record.put("b", Value::Record(vec![(String::from("id"), Value::Long(36))]));
writer.append(record).unwrap(); |
I forked the lib and made public several structs and methods and now is working :) |
The code at #99 (comment) works just fine with current master of apache_avro. |
Similar to apache/avro@064cc6b for Enum Reported at flavray/avro-rs#99 (comment) Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Support recursive types (#92) and (#95)
The following is a reworking of the
schema
module to allow for recursive types to function correctly.When I started this work I wasnt aware that #53 existed so sorry to @stew for stomping on that work. As far as can be tested, these changes fully support all the crazy recursive schemas that are created for avro
Basic approach
Unlike the approach taken in #53, this implementation hides the internal details of the schema data, and flattens the nesting of the schema to a hash map of
ident -> data
. The ident is taken from a string interning datastructure to make the mapping as cheap as possible.This means that, all of the public facade interface types are value type objects and can be passed about as Copy types since they are essentially fat pointers.
Each value type holds a reference to the schema containing all data, as well its id for the actual type for lookup. From here each facade value type implements accessor methods to allow the retrieval of further data, as well as allowing facades to be further placed onto data that might recurse.
This means that there is some added indirection for accessing data, although this is somewhat offset by the fact that the data structure should be smaller, involve less allocations and have no
Rc
requirements.Since the datastructure holding the schema is none trivial to construct, a builder interface has been provided that allows for the creation of new schemas, this builder tries to guide schema creation in the simplest fashion possible.
The schema module due to complexities has been split into sub modules for the significant parts of schema management (parsing, serialisation etc).
TODO
The following are a few todos that are left, which should be landed before this PR is totally good to go
pcf
(we could really serialise to a tree via serde and then just manipulate that tree).clones
in the codebase that could be&'schema str
references