PyTorch feature parity #1431
Open · 60 of 92 tasks
CarloLucibello opened this issue Dec 19, 2020 · 94 comments

@CarloLucibello (Member) commented Dec 19, 2020

A list of PyTorch 1.7 features.
Items are checked if Flux, or the wider Julia ecosystem, has something more or less equivalent that Flux supports.
This list is not complete; it comes from a rough scan of PyTorch's documentation. Please feel free to add anything I missed in the comments, and whoever has write access can edit the list.
Related issue: https://github.com/FluxML/ML-Coordination-Tracker/issues/16, and more generally anything in https://github.com/FluxML/ML-Coordination-Tracker/issues

PyTorch Features

Conv Layers

  • Conv1d, Conv2d, Conv3d.
  • ConvTranspose1d, ConvTranspose2d, ConvTranspose3d.
  • groups in convolution layers
  • Fold, Unfold. In progress: Add fold and unfold NNlib.jl#444

Pooling Layers

  • MaxPool1d, MaxPool2d, MaxPool3d
  • MaxUnPool1d, MaxUnPool2d, MaxUnPool3d
  • AvgPool1d, AvgPool2d, AvgPool3d
  • FractionalMaxPool2d
  • LPPool1d, LPPool2d
  • AdaptiveAvgPool1d, AdaptiveAvgPool2d, AdaptiveAvgPool3d
  • AdaptiveMaxPool1d, AdaptiveMaxPool2d, AdaptiveMaxPool3d

Padding Layers

  • ReflectionPad (1d,2d)
  • ReplicationPad (1d,2d,3d) ( NNlib.pad_repeat)
  • ZeroPad (2d)
  • ConstantPad (1d,2d,3d)
  • Add corresponding layers for all of the above, wrapping the NNlib functions (or keep them as plain functions). Either way, they need to be added to Flux's docs.

Activations

  • ... NNlib has an extensive collection of activation functions, and any Julia function can be used as an activation.

Normalization Layers

Recurrent Layers

  • RNN
  • GRU
  • LSTM

Attention Layers

Linear Layers

  • Identity
  • Linear
  • Bilinear

Dropout Layers

Sparse Layers

Distance Functions

  • CosineSimilarity. We have this in Distances.jl, and it is also easy to hand-code (see the sketch below). TODO: check that it is AD- and GPU-friendly.
  • PairwiseDistance. We have this in Distances.jl. TODO: check that it is AD- and GPU-friendly (Tullio.jl could be used to achieve both).
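
A minimal hand-coded sketch of cosine similarity (not a settled API), sticking to broadcasting and reductions, which should keep it both AD- and GPU-friendly:

# Cosine similarity along dims=1 (column-wise), written with broadcasts and
# reductions only, so Zygote can differentiate it and it maps to CUDA kernels.
function cosine_similarity(x, y; dims=1, eps=eltype(x)(1e-8))
    num = sum(x .* y; dims=dims)
    den = sqrt.(sum(abs2, x; dims=dims)) .* sqrt.(sum(abs2, y; dims=dims))
    return num ./ (den .+ eps)
end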

Loss Functions

Vision Layers

Initialization

Parallelism and Distributed

  • DataParallel
  • DistributedDataParallel (solved by https://github.com/DhairyaLGandhi/DaggerFlux.jl)
  • set_num_threads, set_num_interop_threads. Not sure which operations PyTorch parallelizes; on our side only BLAS operations are multi-threaded (see the snippet below).
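
For reference, the closest Julia-side knobs (a sketch of what exists today, not a Flux API): the BLAS thread count can be set at runtime, while the number of Julia task threads is fixed at startup.

using LinearAlgebra

BLAS.set_num_threads(4)   # threads used by BLAS-backed ops, e.g. dense matmul
Threads.nthreads()        # Julia threads for @threads/@spawn; set via `julia --threads=N`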

Distributions

  • diff rules for logpdf offered by DistributionsAD.jl
  • rsample. Differentiability of the parameters through sampling is supported by many distributions, e.g. gradient(mu -> rand(Normal(mu, 1)), 0) == (1,) (see the sketch below).
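
A minimal sketch of the reparameterization ("rsample") idea, assuming Distributions.jl and Zygote; the rsample helper below is purely illustrative:

using Distributions, Zygote

# Reparameterize Normal(mu, sigma) as mu + sigma * randn(), so the sample is a
# differentiable function of its parameters.
rsample(mu, sigma) = mu + sigma * randn()

Zygote.gradient(mu -> rsample(mu, 1.0), 0.0)       # (1.0,)
Zygote.gradient(mu -> rand(Normal(mu, 1.0)), 0.0)  # (1.0,), as noted above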

ONNX

FFT

  • ... . Zygote has the adjoints for AbstractFFTs.

Quantization

  • ...

Pruning

  • WIP pruning package here

Optim

LinAlg

  • det
  • norm

Tensorboard

XLA

Misc

PyTorch Extras

Torchvision

Torchaudio
...

Torchtext
...

@gxyd (Contributor) commented Dec 19, 2020

Do you mind if I try to implement support in Flux corresponding to Dropout2d in PyTorch?

@CarloLucibello (Member, Author) commented Dec 19, 2020

yes please, essentially this is all up for grabs

@DhairyaLGandhi (Member) commented Dec 19, 2020

Note that we shouldn't add all of these layers here. For example, pixel shuffle already has an implementation, as does Transformers; upsampling and embedding are direct Julia operations, etc.

@CarloLucibello (Member, Author)

Maybe @chengchingwen could provide some suggestions on the last two items

@DrChainsaw (Contributor)

@CarloLucibello About ONNX: This exists which I guess (hope?) is better than nothing: https://github.com/DrChainsaw/ONNXmutable.jl

I haven't registered it because 1) the name sucks and I can't think of anything better and 2) I'm thinking of splitting the import and export into two separate packages. 1 is the main blocker though :)

I'd be happy to donate it to FluxML, or parts of it (e.g. import/export primitives).

@darsnack (Member)

Yeah upsampling is non-trivial to get right and be performant on the GPU as well (last time I tried it, I had to ask in #gpu on Slack to get a good implementation).

For ONNX, is it possible to hand control of ONNX.jl to @DrChainsaw? It seems like ONNXmutable.jl should really supersede that package.

For vision models, there is this Metalhead PR which I think will bring us much closer to PyTorch parity. I am planning on training some of the simpler ones this weekend, but I would appreciate the help to add pre-trained weights from anyone with a GPU.

Lastly, for hyperparameter/learning rate schedules, I just started ParameterSchedulers.jl to break the functionality out of FluxTraining.jl. This is quite a simple package, and I want to finish it this weekend for a project. I am happy to transfer ownership to FluxML.

@bhvieira (Contributor)

I tried implementing WeightNorm before, but it's harder than I thought without doing a per-layer implementation. See #1005
Doing a per layer implementation is actually easy, but maintenance hell at the same time.

@CarloLucibello (Member, Author)

@DrChainsaw what are the limitations of ONNXmutable?

@DrChainsaw (Contributor) commented Dec 19, 2020

@CarloLucibello From the ML-Coordination issue it seems like there are a lot of ways to look at ONNX import/export, so what counts as a limitation appears to be a bit more subjective than I thought.

Here are some things I can think of:

Only a subset of ops is supported. This is imo not a big deal, as I have made an effort to make it easy to add more, and even easy for users to just hack in their own versions locally. Most ops are trivial to add, but I have intentionally not added more than what I happen to need, in the hope that it would encourage contribution.

It has capabilities which perhaps only a small subset of users need w.r.t. model manipulation. This translates into dependencies like JuMP and Cbc (used to solve the problem of keeping all parameter shapes aligned when changing the model structure), as well as metadata used to formulate the shape constraints. This may appear as bloat to users who only want to import a model and use it. The annoying part here is that Chain can't represent arbitrary graphs, and even things like what is proposed in #1289 seem very hard to translate to from a more standard graph format such as the one used in ONNX. NaiveNASlib has an internal graph format without the extra shape-alignment functionality which could perhaps be used, but there seems to be a desire for a 'Flux native' format.

RNNs are currently a bit limited, although this is more on Flux than on ONNXmutable, since ONNX wants RNNs to take 3D input while Flux wants 2D input (in a loop). I have worked around this to some extent by changing the model input to 3D if a recurrent layer is found and then folding the time dimension into the batch dimension whenever a Dense layer is encountered. This only works for a few model architecture types, though.
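
To illustrate the shape convention being discussed (a sketch against the Flux recurrent API as it was at the time, not ONNXmutable code): Flux consumes one features × batch slice per time step inside a loop, whereas ONNX expects the whole features × batch × time array.

using Flux

m = Flux.LSTM(3, 5)          # Flux recurrent layer, called once per time step
x = rand(Float32, 3, 7, 10)  # ONNX-style 3D input: features × batch × time

Flux.reset!(m)
y = [m(x[:, :, t]) for t in 1:size(x, 3)]   # vector of 5×7 outputs, one per time step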

Exporting functionality can't handle 1) non-primitive functions with type constraints (e.g. function thewholemodel(x::AbstractArray)) and 2) non-function control flow (e.g. if/else/for; functions like ifelse/map/reduce or let_onnx_know_this_is_a_loop(f, n) could be solvable, I think). The first can probably be hacked around with IRTools, but I think the latter would require an abstract interpreter or similarly sophisticated code analysis/transformation, e.g. Mjolnir.

Ecosystem-wise, it would be better to refactor at least the export primitives to use NNlib, as that would make them usable from other libraries which use NNlib (Knet, Avalon, etc.). Perhaps that is not so much a limitation in itself, though, and it can always be broken out later down the road. For export there is no limit on how many ways one can choose to translate a Julia function to an ONNX node.

Btw, I think it would be better to try to remove ONNX.jl from the general registry and use a name like OnnxFlux.jl to clearly state that it translates between ONNX and Flux.

@darsnack (Member)

Btw, I think it would be better to try to remove ONNX.jl from the general registry and use a name like OnnxFlux.jl to clearly state that it translates between ONNX and Flux.

Unfortunately we can't remove packages from the registry. But if ONNXFlux.jl makes more sense, then we can just archive the ONNX.jl repo.

@ToucheSir (Member) commented Dec 19, 2020

I don't think it's unreasonable to expect anyone looking to use transformer layers to use Transformers.jl. One potential reason for PyTorch to add them is that there is no canonical library for transformers in that ecosystem (or really for any other domain...).

RE ONNX, why not give that repo name over to ONNXMutable and then consider how best to refactor/reorganize? I highly doubt anyone is using the existing functionality, given that it's broken on most recent versions of Julia that Flux supports.

RE XLA, I presume this is covered by the work Keno and Tim are doing? Not sure if there's a link to any details there.

@jeremiedb (Contributor)

Regarding embeddings, although I haven't dealt with the potential caveats from weight norm and such, are there challenges I'm overlooking compared to fairly trivial matrix indexing? Example:

using Flux
using Flux: @functor, glorot_uniform

struct Embed{T}
    w::T
end

@functor Embed  # make the weight matrix trainable and movable to GPU

Embed(in::Integer, out::Integer; initW=glorot_uniform) = Embed(initW(out, in))

# Look up the embedding columns for a vector of token indices
(m::Embed)(x::AbstractVector) = m.w[:, x]

@ToucheSir (Member)

My understanding is that the trivial indexing triggers scalar indexing on GPU arrays. Transformers.jl has custom implementations for both CPU and CUDA, so in that sense the hard work is already done.
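
For illustration, here is roughly what the lookup looks like when routed through a gather primitive such as NNlib.gather (which landed in NNlib after this discussion); a sketch, not the Transformers.jl implementation:

using Flux, NNlib

# Same idea as the Embed layer above, but the lookup goes through NNlib.gather,
# which has CPU kernels in NNlib and CUDA kernels in NNlibCUDA, avoiding scalar
# indexing on the GPU.
struct GatherEmbed{T<:AbstractMatrix}
    w::T
end
Flux.@functor GatherEmbed
GatherEmbed(in::Integer, out::Integer; initW=Flux.glorot_uniform) = GatherEmbed(initW(out, in))
(m::GatherEmbed)(x::AbstractVector{<:Integer}) = NNlib.gather(m.w, x)

GatherEmbed(10_000, 128)([1, 5, 42])   # 128×3 matrix of embedding vectors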

@ToucheSir (Member)

Something else I'd like to submit for consideration is an equivalent to the upcoming LazyModuleMixin. Not a 1-1 port, but some mechanism to avoid specifying intermediate sizes during model construction.

@CarloLucibello (Member, Author) commented Dec 20, 2020

Are embeddings of general utility beyond Transformers, and worth moving to Flux.jl?

cc @chengchingwen @jeremiedb @ToucheSir

@CarloLucibello (Member, Author) commented Dec 20, 2020

My understanding is that the trivial indexing triggers scalar indexing on GPU arrays. Transformers.jl has custom implementations for both CPU and CUDA, so in that sense the hard work is already done.

Is that gather similar to GeometricFlux's one? Is it worth having as a primitive in Flux.jl or CUDA.jl?
@yuehhua

@bhvieira (Contributor)

Flux is lacking attention modules. That would be good to have (and PyTorch does have them).

@dfdx commented Dec 20, 2020

Is that gather similar to GeometricFlux's one? Is it worth having as a primitive in Flux.jl or CUDA.jl?

Note that there's also a very similar implementation in ScatterNNlib (gather, scatter, and their gradients). It would be great to have them in NNlib and CUDA so that other packages (like my own Avalon) could use them.

@jeremiedb (Contributor)

the trivial indexing triggers scalar indexing on GPU arrays

I recently used this approach for embeddings and can confirm good performance on GPU; maybe there have been recent improvements in CUDA.jl that explain why it doesn't resort to scalar operations. A benchmark against Transformers.jl would be interesting, though.

@gxyd (Contributor) commented Dec 21, 2020

@darsnack

I would appreciate the help to add pre-trained weights from anyone with a GPU.

I'd like to help with that if possible, though I'm not really sure of the process. I do have access to a GPU (GTX 1080), so let me know if I can be of any help. I'll try to figure out the procedure.

@darsnack (Member)

@gxyd Take a look at the PR linked above. Someone already posted a training script (I haven't had the time to check if it works). I would just ping that PR thread if you manage to get something to train.

@chengchingwen (Member)

I think something that needs to be mentioned together with Embedding is the one-hot encoding implementation. The problem with Embedding/one-hot encoding is maintaining semantics and composability without hurting performance on the GPU. Currently the implementation of OneHotVector is not that handy, so I have a custom one-hot implementation in Transformers.jl.

I do think they are worth moving to Flux/NNlib, but there are some questions that need to be discussed. The semantics of gather/scatter in Transformers.jl and ScatterNNlib.jl are different: I follow the TF definition, while @yuehhua follows the one in the pytorch_scatter package. That decision needs to be made before we treat them as basic building blocks in Flux/NNlib.
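
To make the discussion concrete, a rough sketch of the two operations acting on the last dimension (illustrative only; gather_cols and scatter_add! are made-up names, and neither the TF nor the pytorch_scatter convention is implied):

# Gather: pick out the columns of src named by idx.
gather_cols(src::AbstractMatrix, idx::AbstractVector{<:Integer}) = src[:, idx]

# Scatter-add: accumulate column j of src into column idx[j] of dst.
function scatter_add!(dst::AbstractMatrix, src::AbstractMatrix, idx::AbstractVector{<:Integer})
    for (j, i) in enumerate(idx)
        dst[:, i] .+= src[:, j]
    end
    return dst
end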

@chengchingwen (Member)

@CarloLucibello I would like to add Einstein summation and tensor product to the discussion list. They are quite useful in some novel model design.

@CarloLucibello (Member, Author)

@CarloLucibello I would like to add Einstein summation and tensor product to the discussion list. They are quite useful in some novel model design.

I added them as covered by Tullio.jl. Possibly we just have to add references and examples in Flux.

@CarloLucibello (Member, Author)

@chengchingwen could you open an issue here about OneHotVector's limitations?

@DhairyaLGandhi (Member)

I think the issue with ONNX implementations in general isn't writing the package initially, but the additional ops that need to be added regularly. We need a solution to that problem, which is more pressing imo.

I agree we need more attention modules.

I would want to gather the relevant issues with upsampling.

@CarloLucibello https://github.com/FluxML/NNlib.jl/pull/112/files

@darsnack (Member)

That makes sense. Did we reexport those functions in Flux?

@CarloLucibello (Member, Author)

Yes, we have @reexport using NNlib. We still have to add them to the docs, though.

@ToucheSir (Member)

I'm not sure if the top post can be made into a wiki or something, but barring that, some updates to keep this going:

  • Trilinear upsampling is in NNlib and pending for NNlibCUDA. There's also a PR out for linear upsampling.
  • ONNX.jl's old implementation has been replaced. Everything else is still pending.
  • AFAICT the normalization layers are the only ones that don't have a "functional" equivalent in NNlib. This has been tracked for a while by BatchNorm and Dropout NNlib.jl#19, so it comes down to whether or not we want to add them.

@CarloLucibello (Member, Author)

Updated the OP with @ToucheSir's comments.

AFAICT the normalization layers are the only ones that don't have a "functional" equivalent in NNlib. This has been tracked for a while by FluxML/NNlib.jl#19, so it comes down to whether or not we want to add them.

Since there have been some requests, and it's what we do with basically everything else, I think we should do it.

@tantheta01

Apologies if I am slightly late. I went through the discussion and the tracker and want to implement the FractionalMaxPooling layer. Can someone please let me know if it is already implemented? Otherwise, I would love to work on it. Thanks!

@ToucheSir (Member) commented Dec 22, 2021 via email

@tantheta01

Thank you!
I am slightly new to Julia and needed advice on some design choices. I went through the code for max pooling and found that the underlying implementation lives in NNlib.jl, which is then used like this.
Now I am confused as to what would be the best design choice:

  1. Should the implementation of FractionalPooling be added to NNlib.jl and then imported into Flux?
  2. Should we create a new file in the layers directory to contain the implementation?

Any guidance would be appreciated.

@DhairyaLGandhi (Member)

The pooling layer should ideally reuse as much code from NNlib as possible, and the layer itself can live in Flux. We would expect it to be a generalization of NNlib's maxpool that accepts real-valued inputs in addition to integers, so the new layer would generate the fractional sections, apply pooling within those sections, and combine the resulting arrays. Ideally this would be done with minimal changes to NNlib.
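
A rough 1D CPU sketch of that idea (the section generation below is purely illustrative; the real FractionalMaxPool uses the pseudo-random sequences from the paper):

# Split 1:len into roughly len/ratio contiguous sections of fractional size,
# then take the maximum over each section.
function fractional_sections(len::Integer, ratio::Real)
    nsec = floor(Int, len / ratio)
    edges = clamp.(round.(Int, (1:nsec) .* ratio .+ (rand() - 0.5)), 1, len)
    starts = [1; edges[1:end-1] .+ 1]
    return [s:e for (s, e) in zip(starts, edges) if s <= e]
end

fractional_maxpool1d(x::AbstractVector, ratio::Real) =
    [maximum(view(x, sec)) for sec in fractional_sections(length(x), ratio)]

fractional_maxpool1d(rand(10), 1.5)   # ~6 pooled values from 10 inputs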

@DhairyaLGandhi (Member) commented Jan 20, 2022

PyTorch et al. need these constructs because they require users to use custom intrinsics that their own codebases can understand (e.g. torch.log instead of log). In Flux we don't want that: we should be able to use Base methods, and those overloaded by other packages, directly, so we can avoid such special functions and let users call log generically.
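
For example (a throwaway model, just to show the point): any Julia function, Base or user-defined, can appear inside a Flux model without a framework-specific wrapper.

using Flux

# `log` here is just Base.log broadcast over the layer's output; no Flux-specific
# intrinsic is needed for AD to handle it.
model = Chain(Dense(4, 4, relu), x -> log.(abs.(x) .+ 1f0))
model(rand(Float32, 4, 8))   # 4×8 output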

@bhvieira (Contributor)

(BTW that naming convention in PyTorch makes no sense, since it refers to a specific distributional assumption)

@mcognetta mentioned this issue Aug 2, 2022
@Moelf (Contributor) commented Aug 2, 2022

For quantization, maybe https://github.com/google/qkeras is a good reference? Has there been any progress since this issue was opened?

@kpa28-git commented Aug 5, 2022

PyTorch has weight normalization; it would be good to add this to the normalization section.

@CarloLucibello (Member, Author)

PyTorch has weight normalization; it would be good to add this to the normalization section.

weight_norm is in the Misc section (reflecting the organization of the PyTorch docs).

@CarloLucibello (Member, Author)

For quantization, maybe https://github.com/google/qkeras is a good reference? Has there been any progress since this issue was opened?

Not much progress on that front. In any case, this issue just tracks PyTorch's features, not those exposed by specialized libraries (although having other references is good).

@pri1311 commented Dec 31, 2022

Any plans to add HeNormal and HeUniform initialization functions? They are currently not present in PyTorch, however. I'd be happy to send a PR if it's welcome.

@ToucheSir (Member)

Aren't those already covered by Flux.kaiming_{normal,uniform}? We borrowed the PyTorch naming scheme.

@pri1311 commented Dec 31, 2022

@ToucheSir Aah yes, my bad, switching back and forth between TF and PyTorch hasn't been going well :)

@dorn-gerhard

A short question: are there any plans to implement sparse convolutional layers?
Article about sparse convolution

I found the following PyTorch implementation referencing this article.
