MLJ API for Missing Imputation ? #950

Open
sylvaticus opened this issue Jun 28, 2022 · 5 comments

@sylvaticus
Contributor

sylvaticus commented Jun 28, 2022

[Possibly related to the API discussion on Clustering Models]

I am in the process of implementing several missing-value imputers in a new BetaML Imputation submodule, based on GMM (like the current MissingImputator, which I will deprecate once Imputation is ready), random forests, and simple means.

My tentative BetaML API is currently

mod = ImputerModel(params)
fit!(mod,XWithMissing)
XImputed = predict(mod)
complementaryInfo = info(mod)

This, however, doesn't fit well with the MLJ interface currently used for MissingImputator:

fitresults = fit(model,XWithMissing);
XImputed = transform(model,fitresults,newX)

This approach seems a bit forced to me. In missing-imputation problems we don't really have the concept of generalising a model to new data. What we would need instead is a sort of fit_transform function, with the possibility to extract not only the imputed values (whether in dense or sparse form) but also some information related to the fitting.
What do you think? Should I just call my "low-level" fit! and predict inside the MLJ fit, and return both the imputed values and the fitting info from it?
How should multiple imputations (an option of the random forest imputer) be handled? Currently BetaML.Imputation.predict(mod::RFImputer,X) returns a vector of imputed values instead of a scalar if mod.multipleImputations (a parameter of the model) is higher than 1.
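
For readers who want to see the stateful fit!/predict/info pattern from the top of this comment in concrete form, here is a minimal sketch in plain Julia. It is not BetaML's code: the MeanImputer type, its fields, and the column-mean strategy are hypothetical stand-ins chosen only to keep the example short.

```julia
using Statistics

# Hypothetical imputer, NOT BetaML's actual implementation: fills each missing
# entry with the mean of the observed values in its column.
mutable struct MeanImputer
    imputed::Union{Nothing,Matrix{Float64}}   # result of the last fit!
    info::Dict{Symbol,Any}                    # by-products of fitting
end
MeanImputer() = MeanImputer(nothing, Dict{Symbol,Any}())

# Fitting and imputing happen in one step; results are stored in the model.
function fit!(m::MeanImputer, X::AbstractMatrix)
    col_means = [mean(skipmissing(X[:, j])) for j in axes(X, 2)]
    m.imputed = [ismissing(X[i, j]) ? col_means[j] : Float64(X[i, j])
                 for i in axes(X, 1), j in axes(X, 2)]
    m.info[:n_imputed] = count(ismissing, X)
    m.info[:col_means] = col_means
    return m
end

# No new data is passed: predict just returns what fit! already computed.
predict(m::MeanImputer) = m.imputed
info(m::MeanImputer) = m.info

XWithMissing = [1.0 missing; 3.0 4.0; missing 8.0]
imputer = MeanImputer()
fit!(imputer, XWithMissing)
XImputed = predict(imputer)
complementaryInfo = info(imputer)
```

The point of the sketch is that predict takes no data argument, which is exactly what makes the pattern awkward to map onto a fit/transform pair that expects new data.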

EDIT: This tentative interface implements the fit/predict in the MLJ fit function. Does it look good to you?

https://github.com/sylvaticus/BetaML.jl/blob/b631d82a2a86b13877fafa139c7db97625f36700/src/Imputation/Imputation_MLJ.jl

(I still don't know how to return the imputations when multiple ones are possible depending on a parameter, as will be the case with BetaMLRFImputer. Should I always return a vector, even if most users will need just one? Or should I ignore multiple imputations in the MLJ interface?)

@ablaom
Member

ablaom commented Jul 1, 2022

@sylvaticus Thanks for raising your use case. This kind of issue (the first one) has recurred in a few cases and, to be honest, while there are possible paths within the API, I've never been completely happy with any of them. For now, let me just record what is possible at the moment and try to add more later. I'm sorry, I haven't had a chance to look at your POC yet.

Your first question, if I understand correctly, is what to do if you have a "one-shot" transformer which has byproducts of "training" that you want to inspect - you call it "complementary data" and it's the report in MLJ lingo. Here are options within the current MLJ API (the same options mentioned in the clustering thread).

  1. Unsupervised model with no transform or predict but the output and complementary data are exposed in the report.
  2. Static model for which transform (which does the actual imputing) returns both output and complementary data in the form of a tuple (Xout, report). See Static Transformers.

(Note that presently, predict and transform in MLJ are always associated with generalization to new data.)

From the point of view of model composition, 2 is probably more convenient than 1. You can do imputation |> first for use of the imputator in pipelines, for example, but you will not easily access the "report" in a pipeline. (You could expose it in a manually built composite model by virtue of this.)

As an aside, there have been suggestions to have a fit_transform convenience method, something like

fit_transform(model, X) = transform(machine(model, X) |> fit!, X)

but this is still going to return both (Xout, report) in case 2.

Regarding your second question, I don't quite understand what is meant by multiple imputation. Maybe you can point me to an example of this somewhere.

@sylvaticus
Contributor Author

Thanks @ablaom, I'll look into it.
Multiple imputation refers to the idea that in some contexts (mostly statistical analysis rather than ML) users want to take the uncertainty of the imputations into account in the results of their follow-up analysis. So instead of a single imputation, the imputer returns a set of different imputations (in RF these differ because of the random dimensions chosen, random sampling, etc.), and the user performs the (statistical) analysis on each imputed dataset separately and then pools the final results.
R packages like MICE allow this "multiple analysis" to be performed automatically together with the final pooling; in the BetaML case we would just provide the multiple imputations.
Note that scikit-learn takes a pragmatic approach: it recognises the utility of multiple imputation in certain contexts but says "just run the model multiple times".
One advantage of BetaML is that, conditional on a given RNG, all the different imputations would be deterministic and the experiment replicable.
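
The impute, analyse, pool workflow described above can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: impute_once and multiple_imputations are hypothetical helpers (a simple hot-deck draw from the observed values of each column), not the BetaML random forest imputer, and the "analysis" is just a column mean.

```julia
using Random, Statistics

# Hypothetical stand-in for a stochastic imputer: each missing entry is replaced
# by a random draw from the observed values of its column (hot-deck imputation).
function impute_once(X::AbstractMatrix, rng::AbstractRNG)
    Xout = Matrix{Float64}(undef, size(X))
    for j in axes(X, 2)
        observed = collect(skipmissing(X[:, j]))
        for i in axes(X, 1)
            Xout[i, j] = ismissing(X[i, j]) ? rand(rng, observed) : X[i, j]
        end
    end
    return Xout
end

# Return `m` imputed copies of X; all randomness comes from `rng`.
multiple_imputations(X, m; rng=Random.default_rng()) =
    [impute_once(X, rng) for _ in 1:m]

X = [1.0 10.0; missing 20.0; 3.0 missing; 4.0 40.0]

# Seeding the RNG makes the whole set of imputations reproducible.
imputations = multiple_imputations(X, 5; rng=MersenneTwister(123))
imputations_again = multiple_imputations(X, 5; rng=MersenneTwister(123))

# Run the follow-up analysis (here: column means) on each imputed copy,
# then pool the per-imputation estimates by averaging them.
estimates = [vec(mean(Xi, dims=1)) for Xi in imputations]
pooled = mean(estimates)
```

Averaging the estimates is the simplest possible pooling rule; packages such as MICE also combine the within- and between-imputation variances, which this sketch leaves out.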

@ablaom
Member

ablaom commented Jul 7, 2022

Thinking about this some more today, I think Static is the way to go. And while this makes static imputers look quite different from those that do generalize, I believe this is unavoidable as this is conceptually a big difference, at least as far as model composition is concerned.

The multiple imputations apparatus looks interesting. I don't think this is impossible, but realistically, it's out of scope for MLJ integration at present.

@ablaom
Member

ablaom commented Jul 11, 2022

Okay, I have some further thoughts on how we can improve the API for models that don't generalize to new data, such as some imputers and some clustering algorithms, and where there are byproducts of the computation you want accessible. As above, I suggest these be implemented as Static models, which require a transform(::MyStaticModel, ::Nothing, data...) method but no fit(::MyStaticModel, ...). (In the long term staticness will be a trait, but for now it is a Model subtype.)

Currently, only the fit method can create content to add to a machine's report. Under this proposal, calling transform on a model returns (Ximputed, report), but calling transform on a machine returns only Ximputed, while report is merged into the machine's report. So something like this:

In implementation

function MLJModelInterface.transform(my_imputer::MyImputer, ::Nothing, X)
    ...
    return (Ximputed, report)
end 

# new trait to flag the fact that `transform` is returning extra "report" data:
MLJModelInterface.reporting_operations(::Type{<:MyImputer}) = (:transform,)

User workflow

mach = machine(MyImputer(...))   # No need to `fit!` here
X = ... # some data to impute
Ximputed = transform(mach, X)

report(mach) # returns extra stats about the imputation 

If you don't care for the report, you can just do

Ximputed = transform(machine(MyImputer()), X)

or, if we add the overloading transform(model::Static, data...) = transform(machine(model), data...), we can simplify the last to

Ximputed = transform(MyImputer(), X)

but this last assumes that data never includes nothing and that there are no method ambiguities I haven't thought of. I'd have to check this more carefully. Frankly, I don't think the shortcut is needed for now.

As proposed, this is non-breaking, but requires the addition of a trait. I'm working on a POC but it would be good to get any feedback before I get too far along.

In pipelines, the report would be accessible in the usual way (something like report(pipe_machine).my_imputer) since it is associated with an internal machine.
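
Putting the implementation and user-workflow fragments of this comment together, a minimal end-to-end sketch might look like the following. It assumes MLJBase and MLJModelInterface versions that include the reporting_operations trait proposed here; ColumnMeanImputer and its column-mean strategy are hypothetical and chosen only for brevity, and a plain matrix is used as input for simplicity.

```julia
using MLJBase, Statistics
import MLJModelInterface
const MMI = MLJModelInterface

# Hypothetical static imputer used only for illustration: no hyperparameters,
# and no `fit` method, because nothing is learned for reuse on new data.
mutable struct ColumnMeanImputer <: MMI.Static end

# At the model level, `transform` returns the tuple (output, report).
function MMI.transform(::ColumnMeanImputer, ::Nothing, X::AbstractMatrix)
    col_means = [mean(skipmissing(X[:, j])) for j in axes(X, 2)]
    Ximputed = [ismissing(X[i, j]) ? col_means[j] : Float64(X[i, j])
                for i in axes(X, 1), j in axes(X, 2)]
    rep = (n_imputed = count(ismissing, X), col_means = col_means)
    return (Ximputed, rep)
end

# Flag that `transform` returns extra "report" data alongside its output.
MMI.reporting_operations(::Type{<:ColumnMeanImputer}) = (:transform,)

X = [1.0 missing; 3.0 4.0; missing 8.0]

# At the machine level, `transform` returns only the imputed data;
# the extra data is merged into the machine's report.
mach = machine(ColumnMeanImputer())
Ximputed = transform(mach, X)
rep = report(mach)
```

A named tuple is used for the report so that it can be merged with whatever the machine already reports.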

@ablaom
Member

ablaom commented Jul 14, 2022

Okay, the proposal referenced above has now been implemented. To make use of it in BetaML, you will need to set your lower bound to MLJModelInterface = "1.6". Let me know if further guidance is needed.
