
Fix all the models #266

Open
20 of 59 tasks
ghost opened this issue Jan 17, 2021 · 31 comments

Comments

@ghost

ghost commented Jan 17, 2021

I started testing all the models in this repo against the Flux 0.11.4 release on Julia 1.5 (ref FluxML/FluxML-Community-Call-Minutes#9). In this issue I will collect all encountered problems and try to list all other open issues for each model. The tests are run against my fork, which contains the updated manifests and hotfixes needed to actually run some of the models.

👍 = runs without errors on Julia 1.5 + Flux 0.11.4 and shows decent results (converging loss, OK-looking generated images, ...)
❕ = does not run on Julia 1.5 + Flux 0.11.4 or gives wrong results, but a PR with a fix is available
❗ = does not run on Julia 1.5 + Flux 0.11.4 or gives wrong results


vision/cdcgan_mnist: 👍

  • runs without issues

vision/cifar10: 👍

  • dependency problems when updating to Flux 0.11.4 (Fix cifar10 model #270)
  • depends on Metalhead dataset instead of MLDatasets (Fix cifar10 model #270)
  • OOM errors because dataset is fully moved to gpu before training (Fix cifar10 model #270)
  • it defines the vgg19 model, but does not provide an option to use it. Just remove it?
  • can take a while to run, but does not provide much feedback
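To illustrate the kind of change the bullets above call for, here is a hedged sketch of loading CIFAR10 from MLDatasets and streaming minibatches to the GPU instead of moving the whole dataset up front (variable names and the batch size are illustrative, not taken from #270; the MLDatasets calls follow the 0.5-era API):

```julia
using MLDatasets, Flux
using Flux.Data: DataLoader

# Load CIFAR10 from MLDatasets instead of the Metalhead dataset.
xtrain, ytrain = MLDatasets.CIFAR10.traindata(Float32)  # 32×32×3×50000 array
ytrain = Flux.onehotbatch(ytrain, 0:9)

# Move each minibatch to the GPU inside the loop rather than the whole
# dataset before training, which avoids the OOM errors noted above.
loader = DataLoader((xtrain, ytrain), batchsize = 128, shuffle = true)
for (x, y) in loader
    x, y = gpu(x), gpu(y)
    # training step goes here
end
```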

vision/cppn: 👍

  • no GPU support

vision/dcgan_mnist: 👍

vision/lenet_mnist: 👍

vision/mnist/mlp: 👍

  • runs without issues

vision/mnist/conv: 👍

  • depends on Flux.Data.MNIST instead of MLDatasets

vision/mnist/autoencoder: 👍

  • depends on Flux.Data.MNIST instead of MLDatasets

vision/vae_mnist: 👍

text/char-rnn: 👍

  • no GPU support
  • could use MLDataPattern.splitobs
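For reference, `MLDataPattern.splitobs` replaces a hand-rolled train/test split in a couple of lines; this is an illustrative sketch with placeholder data, not the script's actual variables:

```julia
using MLDataPattern

X = rand(Float32, 10, 100)   # 100 observations of 10 features
Y = rand(1:5, 100)           # matching labels

# Split observations 90/10 along the observation dimension (views, no copy).
(xtrain, ytrain), (xtest, ytest) = splitobs((X, Y); at = 0.9)
```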

text/lang-detection: 👍

  • no GPU support
  • uses old logitcrossentropy loss
  • use a standard dataset instead of providing a scraper

text/phonemes: 👍

  • no GPU support
  • data and model split over 2 files, could be merged
  • uses old logitcrossentropy loss

text/treebank: 👍

  • several errors in scripts (Fix treebank model #271)
  • uses deprecated Flux.Data dataset
  • data and model split over 2 files, could be merged

other/housing: 👍

  • diverging loss due to incorrect gradient descent (Fix housing model #273)
  • model overfits on training data, loss on testing data diverges
  • dataset can be replaced with MLDatasets
  • replace custom meansquarederror?
  • could use MLDataPattern splitobs
  • should probably use Float32 instead of Float64
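To show the kind of fix involved, here is a hedged sketch (with made-up data, not the code from #273) of the standard Flux 0.11 pattern: take gradients with respect to `params` and apply them with `update!`, using the built-in `Flux.mse` and `Float32` throughout:

```julia
using Flux
using Flux: mse

x = rand(Float32, 13, 50)   # 13 features × 50 samples (placeholder data)
y = rand(Float32, 1, 50)

W = randn(Float32, 1, 13)
b = zeros(Float32, 1)
model(x) = W * x .+ b

loss(x, y) = mse(model(x), y)   # built-in MSE instead of a custom one

θ   = Flux.params(W, b)
opt = Descent(0.1f0)
for epoch in 1:100
    gs = gradient(() -> loss(x, y), θ)
    Flux.Optimise.update!(opt, θ, gs)  # proper descent step; loss should converge
end
```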

other/iris: 👍

  • no gpu support
  • uses deprecated Flux.Data dataset
  • could use MLDataPatterns.splitobs
  • uses deprecated logitcrossentropy

other/fizzbuzz: 👍

  • no gpu support
  • uses deprecated logitcrossentropy

other/flux-next: ❗

  • does not seem to make sense anymore; built around a non-existent step! optimiser API

other/bitstring-parity: 👍

  • no gpu support
  • split over several files

tutorial/60-minute-blitz: 👍

tutorial/transfer_learning: ❕

contrib/audio/speech-mlstm: ❗

  • dataset seems only available to members (this will also probably prevent using it in a CI setting), unable to test

contrib/games/cartpole: ❗

  • uses unregistered Gym.jl package which fails because of missing @guarded macro

contrib/games/pendulum: ❗

  • uses unregistered Gym.jl package which fails because of missing @guarded macro

contrib/games/trebuchet: ❗

  • DDPG variant is tied to Tracker and does not work with Zygote without significant changes
@DhairyaLGandhi
Member

Can we get PRs for the hotfixes?

@CarloLucibello
Member

I suggest we also rationalize the examples. We could remove

  • vision/mnist/conv
  • vision/mnist/autoencoder
  • vision/cppn

as redundant or not adding much value (in the last case)

@CarloLucibello
Member

Also we could rename

  • vision/cifar10 -> vision/vgg_cifar10
  • vision/mnist/mlp -> vision/mlp_mnist

@darsnack
Member

Just for reference, what's the approach here? Are you following the steps in FluxML/FluxML-Community-Call-Minutes#9 ? So, is this to track hot fixes to get things running, and later we'll refresh the files to use the latest packages in the ecosystem?

@ghost
Author

ghost commented Jan 17, 2021

Yes, steps as mentioned in FluxML/FluxML-Community-Call-Minutes#9. This issue is to keep track of the test results, but I plan to note down other issues found along the way.

For the models that fail to run I do update the dependencies before writing a fix. And if a package is the source of the problem I intend to replace it with the latest ecosystem package. Take for example the cifar10 model: it cannot be used with the newest Flux release due to the dependency on Metalhead for its dataset; with MLDatasets it can actually be tested on Flux 0.11.4.

@DhairyaLGandhi
Member

I wouldn't remove the models; those are some of the better maintained ones, compared to how some others have bloated argument handling taking away from the core of the solved problem. Removing those as redundant seems like the better approach, no?

@ghost ghost changed the title Refresh model zoo Fix all the models Jan 19, 2021
@ghost
Author

ghost commented Jan 19, 2021

I do agree, some of the "simpler" models could actually serve as a great example of a "my first model". But I do think we should have that discussion in the coordination tracker issue, and keep this issue about the quick fixes to get the models running again.

@ghost
Author

ghost commented Jan 20, 2021

The other/flux-next model is a bit problematic. This one is rather old and seems to explain a 'new' API for the optimisers, but this does not work (anymore). It probably needs a full rewrite, but then it wouldn't really add anything to the official Flux docs. I would suggest just removing it from the zoo, or does anyone think it still has a place here?

@DhairyaLGandhi
Member

For now we can leave it be; it will be part of the Optimisers.jl release. It's not expected to be a guarantee, but maybe adding a README saying so would be alright.

@ghost
Author

ghost commented Jan 20, 2021

Are you sure? It does not seem to match the API from the Optimisers.jl package (it uses a step! function)..

@DhairyaLGandhi
Member

It's from a Flux PR that we plan to merge, but it's a breaking release so we may not do it immediately.

@DhairyaLGandhi
Member

Can we add an item in the tracker about adding guiding texts with the models that talk about:
A. The model, and the problem being solved
B. The specific features in Flux that we are trying to communicate (custom structs, loss, training loop, model construction, etc.)

@DhairyaLGandhi
Member

cc @darsnack ^

@darsnack
Member

Sure, did you mean zoo tutorials?

@DhairyaLGandhi
Member

Basically having text pointing users to how to use certain Flux features, and also guiding them through the different problems being solved (recurrent nets, basic features and regression, custom train loops, etc.).

If we can use the Literate.jl format, we would be in good stead to automate moving them to the site.
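For context, a Literate.jl source is just a Julia script in which comment lines starting with `# ` become Markdown prose, so the same file can run as a script, render to HTML for the site, or convert to a notebook. An illustrative sketch (the title and text are made up):

```julia
# # Training an MLP on MNIST
#
# Lines beginning with `# ` turn into Markdown prose when Literate.jl
# processes the script, so the guiding text lives next to the code.

using Flux  # ordinary lines render as code blocks and still execute

## a line starting with `##` stays a plain comment in the rendered output
```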

@CarloLucibello
Member

We can remove contrib/games/cartpole and pendulum; curated Flux-based examples are available in https://github.com/JuliaReinforcementLearning/ReinforcementLearningZoo.jl

@DhairyaLGandhi
Member

Those are not there for the RL; they are there to show what the DP equivalent would look like, so I would update them with the literature and maybe add them as CI examples, because I'm sure recent changes have to support this.

@ghost
Author

ghost commented Jan 23, 2021

Now that all the models are tested I created a new issue (#280) to start giving some direction to the next steps and address the last comments made here.

@DhairyaLGandhi
Member

We still need to update Metalhead I take it.

@ghost ghost mentioned this issue Jan 23, 2021
@ghost
Author

ghost commented Jan 23, 2021

My comment wasn't meant to skip steps. I'm just trying to prevent doing work now that will be flushed down the drain later in the process.

@DhairyaLGandhi
Member

Sure, I was just confirming that that is a pending item blocking some of the updates.

@DhairyaLGandhi
Member

Metalhead doesn't put a compat bound on Flux; also, I have released patches to both. We might want to consolidate the environments and only have one at the top level.

It's hard to keep track otherwise

@ghost
Author

ghost commented Jan 26, 2021

Do you mean a top level project and manifest file for all the models/tutorials?

That could become problematic when one of the tutorials depends on a package that is not compatible with the newest Flux/CUDA. It could hold back updating all the models until that external package has been made compatible. Transformers.jl is an example: if one tutorial depended on it, everything would be held back.

@DhairyaLGandhi
Member

That's not been a big issue since we only need to activate the env, and updating the global env is usually painless, whereas individual ones usually end up becoming inconsistent very quickly and have the same issue of holding back updates. With the added annoyance of having to update many of them for every minor change in a dependency.

@DhairyaLGandhi
Member

We used to have this pattern before, and that's where this learning comes from.

@darsnack
Member

I agree that a single Project.toml and Manifest.toml is better. Allowing certain tutorials to fall out of compatibility with the latest Flux is not what we want. Every tutorial should be running on the latest Flux, or the tutorial should be fixed so that it does.

The flow should be "Release Flux" --> "Update zoo" --> "Release zoo."

@darsnack
Member

My only concern is that a user needs to install lots of packages to run one specific example. Maybe it is better to have a script that automates bumping Flux for all the tutorials.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Jan 26, 2021

My only concern is that a user needs to install lots of packages

Yeah mine too, but that's still not too bad. For instance, things like CUDA would dominate the space/ network usage for most users and adding packages doesn't take too long these days.

The flow should be "Release Flux" --> "Update zoo" --> "Release zoo."

It depends; we would end up doing some benchmarking as well, and see changes with individual PRs with #278.

@DhairyaLGandhi
Member

What's the last column for?

@sambitdash
Contributor

I really like the MNIST models, so please do not remove them if you can. The reason being, they are probably the only models that can run reasonably on a CPU; other vision models need GPUs for good accuracy. I will set up this expectation from the MNIST example.

  1. The MLP one is a quick example to showcase an ANN.
  2. The CNN should show the benefit of using far fewer parameters. Hence, the current one may need some fixing and should also take the data loaders etc. into consideration.
  3. I think the autoencoder should preferably use a CNN-based encoder architecture.

@ToucheSir
Member

Ideally any model in the zoo that can run on GPU should also converge just as well on CPU. I don't think the MNIST models are going anywhere either!

As for the autoencoder example, we should probably have both dense and conv examples. That may be redundant with the VAE models though.
