
Fix all the models #266

Open
20 of 59 tasks
ghost opened this issue Jan 17, 2021 · 31 comments

Comments

@ghost

ghost commented Jan 17, 2021

I started testing all the models in this repo against the Flux 0.11.4 release on Julia 1.5 (ref FluxML/FluxML-Community-Call-Minutes#9). In this issue I will collect all encountered problems and try to list all other open issues for each model. The tests are run against my fork, which contains the updated manifests and hotfixes needed to actually run some of the models.

👍 = runs without errors on Julia 1.5 + Flux 0.11.4 and shows decent results (converging loss, OK-looking generated images, ...)
❕ = does not run on Julia 1.5 + Flux 0.11.4 or gives wrong results, but a PR with a fix is available
❗ = does not run on Julia 1.5 + Flux 0.11.4 or gives wrong results


vision/cdcgan_mnist: 👍

  • runs without issues

vision/cifar10: 👍

  • dependency problems when updating to Flux 0.11.4 (Fix cifar10 model #270)
  • depends on Metalhead dataset instead of MLDatasets (Fix cifar10 model #270)
  • OOM errors because dataset is fully moved to gpu before training (Fix cifar10 model #270)
  • it defines the vgg19 model, but does not provide an option to use it. Just remove it?
  • can take a while to run, but does not provide much feedback
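To illustrate the kind of change the bullets above call for, here is a hedged sketch of loading CIFAR10 from MLDatasets and streaming minibatches to the GPU instead of moving the whole dataset up front (variable names and the batch size are illustrative, not taken from #270; the MLDatasets calls follow the 0.5-era API):

```julia
using MLDatasets, Flux
using Flux.Data: DataLoader

# Load CIFAR10 from MLDatasets instead of the Metalhead dataset.
xtrain, ytrain = MLDatasets.CIFAR10.traindata(Float32)  # 32×32×3×50000 array
ytrain = Flux.onehotbatch(ytrain, 0:9)

# Move each minibatch to the GPU inside the loop rather than the whole
# dataset before training, which avoids the OOM errors noted above.
loader = DataLoader((xtrain, ytrain), batchsize = 128, shuffle = true)
for (x, y) in loader
    x, y = gpu(x), gpu(y)
    # training step goes here
end
```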

vision/cppn: 👍

  • no GPU support

vision/dcgan_mnist: 👍

vision/lenet_mnist: 👍

vision/mnist/mlp: 👍

  • runs without issues

vision/mnist/conv: 👍

  • depends on Flux.Data.MNIST instead of MLDatasets

vision/mnist/autoencoder: 👍

  • depends on Flux.Data.MNIST instead of MLDatasets

vision/vae_mnist: 👍

text/char-rnn: 👍

  • no GPU support
  • could use MLDataPattern.splitobs
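For reference, `MLDataPattern.splitobs` replaces a hand-rolled train/test split in a couple of lines; this is an illustrative sketch with placeholder data, not the script's actual variables:

```julia
using MLDataPattern

X = rand(Float32, 10, 100)   # 100 observations of 10 features
Y = rand(1:5, 100)           # matching labels

# Split observations 90/10 along the observation dimension (views, no copy).
(xtrain, ytrain), (xtest, ytest) = splitobs((X, Y); at = 0.9)
```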

text/lang-detection: 👍

  • no GPU support
  • uses old logitcrossentropy loss
  • use a standard dataset instead of providing a scraper

text/phonemes: 👍

  • no GPU support
  • data and model split over 2 files, could be merged
  • uses old logitcrossentropy loss

text/treebank: 👍

  • several errors in scripts (Fix treebank model #271)
  • uses deprecated Flux.Data dataset
  • data and model split over 2 files, could be merged

other/housing: 👍

  • diverging loss due to incorrect gradient descent (Fix housing model #273)
  • model overfits on training data, loss on testing data diverges
  • dataset can be replaced with MLDatasets
  • replace custom meansquarederror?
  • could use MLDataPattern splitobs
  • should probably use Float32 instead of Float64
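To show the kind of fix involved, here is a hedged sketch (with made-up data, not the code from #273) of the standard Flux 0.11 pattern: take gradients with respect to `params` and apply them with `update!`, using the built-in `Flux.mse` and `Float32` throughout:

```julia
using Flux
using Flux: mse

x = rand(Float32, 13, 50)   # 13 features × 50 samples (placeholder data)
y = rand(Float32, 1, 50)

W = randn(Float32, 1, 13)
b = zeros(Float32, 1)
model(x) = W * x .+ b

loss(x, y) = mse(model(x), y)   # built-in MSE instead of a custom one

θ   = Flux.params(W, b)
opt = Descent(0.1f0)
for epoch in 1:100
    gs = gradient(() -> loss(x, y), θ)
    Flux.Optimise.update!(opt, θ, gs)  # proper descent step; loss should converge
end
```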

other/iris: 👍

  • no gpu support
  • uses deprecated Flux.Data dataset
  • could use MLDataPatterns.splitobs
  • uses deprecated logitcrossentropy

other/fizzbuzz: 👍

  • no gpu support
  • uses deprecated logitcrossentropy

other/flux-next: ❗

  • does not seem to make sense anymore; built around a non-existent step! optimiser API

other/bitstring-parity: 👍

  • no gpu support
  • split over several files

tutorial/60-minute-blitz: 👍

tutorial/transfer_learning: ❕

contrib/audio/speech-mlstm: ❗

  • dataset seems only available to members (this will also probably prevent using it in a CI setting), unable to test

contrib/games/cartpole: ❗

  • uses unregistered Gym.jl package which fails because of missing @guarded macro

contrib/games/pendulum: ❗

  • uses unregistered Gym.jl package which fails because of missing @guarded macro

contrib/games/trebuchet: ❗

  • DDPG variant is tied to Tracker and does not work with Zygote without significant changes
@DhairyaLGandhi
Member

Can we get PRs for the hotfixes?

@CarloLucibello
Member

I suggest we also rationalize the examples. We could remove

  • vision/mnist/conv
  • vision/mnist/autoencoder
  • vision/cppn

as redundant or not adding much value (in the last case)

@CarloLucibello
Member

Also we could rename

  • vision/cifar10 -> vision/vgg_cifar10
  • vision/mnist/mlp -> vision/mlp_mnist

@darsnack
Member

Just for reference, what's the approach here? Are you following the steps in FluxML/FluxML-Community-Call-Minutes#9 ? So, is this to track hot fixes to get things running, and later we'll refresh the files to use the latest packages in the ecosystem?

@ghost
Author

ghost commented Jan 17, 2021

Yes, steps as mentioned in FluxML/FluxML-Community-Call-Minutes#9. This issue is to keep track of the test results, but I plan to note down other issues found along the way.

For the models that fail to run I do update the dependencies before writing a fix. And if a package is the source of the problem I intend to replace it with the latest ecosystem package. Take for example the cifar10 model: it cannot be used with the newest Flux release due to the dependency on Metalhead for its dataset; with MLDatasets it can actually be tested on Flux 0.11.4.

@DhairyaLGandhi
Member

I wouldn't remove the models; those are some of the better maintained ones, compared to how some others have bloated argument handling taking away from the core of the solved problem. Removing those as redundant seems like the better approach, no?

@ghost ghost changed the title Refresh model zoo Fix all the models Jan 19, 2021
@ghost
Author

ghost commented Jan 19, 2021

I do agree, some of the "simpler" models could actually serve as a great example of a "my first model". But I do think we should have that discussion in the coordination tracker issue, and keep this issue about the quick fixes to get the models running again.

@ghost
Author

ghost commented Jan 20, 2021

The other/flux-next model is a bit problematic. This one is rather old and seems to explain a 'new' API for the optimisers, but this does not work (anymore). It probably needs a full rewrite, but then it wouldn't really add anything to the official Flux docs. I would suggest just removing it from the zoo, or does anyone think it still has a place here?

@DhairyaLGandhi
Member

For now we can leave it be; it will be part of the Optimisers.jl release. It's not expected to be a guarantee, but maybe adding a README saying so would be alright.

@ghost
Author

ghost commented Jan 20, 2021

Are you sure? It does not seem to match the API from the Optimisers.jl package (it uses a step! function)..

@DhairyaLGandhi
Member

It's from a Flux PR that we plan to merge, but it's a breaking release so we may not do it immediately.

@DhairyaLGandhi
Member

Can we add an item in the tracker about adding guiding texts with the models that talk about:
A. The model, and the problem being solved
B. The specific features in Flux that we are trying to communicate (custom structs, loss, training loop, model construction, etc.)

@DhairyaLGandhi
Member

cc @darsnack ^

@darsnack
Member

Sure, did you mean zoo tutorials?

@DhairyaLGandhi
Member

Basically having text pointing users to how to use certain Flux features, and also guiding them through the different problems being solved (recurrent nets, basic features and regression, custom train loops, etc.).

If we can use the Literate.jl format, we would be in good stead to automate moving them to the site.
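For context, a Literate.jl source is just a Julia script in which comment lines starting with `# ` become Markdown prose, so the same file can run as a script, render to HTML for the site, or convert to a notebook. An illustrative sketch (the title and text are made up):

```julia
# # Training an MLP on MNIST
#
# Lines beginning with `# ` turn into Markdown prose when Literate.jl
# processes the script, so the guiding text lives next to the code.

using Flux  # ordinary lines render as code blocks and still execute

## a line starting with `##` stays a plain comment in the rendered output
```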

@CarloLucibello
Member

We can remove contrib/games/cartpole and pendulum; curated Flux-based examples are available in https://github.com/JuliaReinforcementLearning/ReinforcementLearningZoo.jl

@DhairyaLGandhi
Member

Those are not there for the RL; they are there to show what the DP equivalent would look like, so I would update them with the literature and maybe add them as CI examples, because I'm sure recent changes have to support this.

@ghost
Author

ghost commented Jan 23, 2021

Now that all the models are tested I created a new issue (#280) to start giving some direction to the next steps and address the last comments made here.

@DhairyaLGandhi
Member

We still need to update Metalhead I take it.

@ghost ghost mentioned this issue Jan 23, 2021
@ghost
Author

ghost commented Jan 23, 2021

My comment wasn't meant to skip steps. I'm just trying to prevent doing work now that will be flushed down the drain later in the process.

@DhairyaLGandhi
Member

Sure, I was just confirming that that is a pending item blocking some of the updates.

@DhairyaLGandhi
Member

Metalhead doesn't put a compat bound on Flux; also, I have released patches to both. We might want to consolidate the environments and only have one at the top level.

It's hard to keep track otherwise

@ghost
Author

ghost commented Jan 26, 2021

Do you mean a top level project and manifest file for all the models/tutorials?

That could become problematic when one of the tutorials depends on a package that is not compatible with the newest Flux/CUDA. It could hold back updating all the models until that external package has been made compatible. Transformers.jl is an example: if one tutorial depended on it, everything would be held back.

@DhairyaLGandhi
Member

That's not been a big issue since we only need to activate the env, and updating the global env is usually painless, whereas individual ones usually end up becoming inconsistent very quickly and have the same issue of holding back updates. With the added annoyance of having to update many of them for every minor change in a dependency.

@DhairyaLGandhi
Member

We used to have this pattern before, and that's where this learning comes from.

@darsnack
Member

I agree that a single Project.toml and Manifest.toml is better. Allowing certain tutorials to fall out of compatibility with the latest Flux is not what we want. Every tutorial should be running on the latest Flux, or the tutorial should be fixed so that it does.

The flow should be "Release Flux" --> "Update zoo" --> "Release zoo."

@darsnack
Member

My only concern is that a user needs to install lots of packages to run one specific example. Maybe it is better to have a script that automates bumping Flux for all the tutorials.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Jan 26, 2021

My only concern is that a user needs to install lots of packages

Yeah mine too, but that's still not too bad. For instance, things like CUDA would dominate the space/ network usage for most users and adding packages doesn't take too long these days.

The flow should be "Release Flux" --> "Update zoo" --> "Release zoo."

It depends; we would end up doing some benchmarking as well, and see changes with individual PRs with #278.

@DhairyaLGandhi
Member

What's the last column for?

@sambitdash
Contributor

I really like the MNIST models, so please do not remove them if you can. The reason being, they are probably the only models that can run reasonably on a CPU; other vision models need GPUs for good accuracy. I will set up this expectation from the MNIST example.

  1. The MLP one is a quick example to showcase an ANN.
  2. The CNN should show the benefit of using far fewer parameters. Hence, the current one may need some fixing and should also take the data loaders etc. into consideration.
  3. I think the autoencoder should preferably use a CNN-based encoder architecture.

@ToucheSir
Member

Ideally any model in the zoo that can run on GPU should also converge just as well on CPU. I don't think the MNIST models are going anywhere either!

As for the autoencoder example, we should probably have both dense and conv examples. That may be redundant with the VAE models though.
