
Use without Run_Eagerly=True #4

Open
roggf opened this issue Apr 16, 2024 · 15 comments
roggf commented Apr 16, 2024

Usage with the parameter run_eagerly=True comes with a very high computation time for training. Is there any way to use this loss function in TensorFlow's native graph mode?
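
For reference, a minimal sketch of the usage being described, with `soft_dtw_loss` as a placeholder stand-in for the loss provided by this repository (the real one would be imported instead):

```python
import tensorflow as tf

# Placeholder stand-in for the repository's soft-DTW loss, so the snippet is
# self-contained; substitute the actual loss function/object when trying this.
def soft_dtw_loss(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

model = tf.keras.Sequential([tf.keras.layers.LSTM(22, return_sequences=True)])
model.compile(
    optimizer="adam",
    loss=soft_dtw_loss,
    run_eagerly=True,  # currently required; this disables graph compilation and slows training
)
```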

@stellarpower

The for loops in the implementation functions would need to be removed, I believe. The code also needs to stop expecting an integral batch size, since this is None for a symbolic tensor at compile time - although removing the for loop takes care of this.
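
A minimal sketch of the symbolic-shape point, assuming (50, 22)-shaped sequences as mentioned later in this thread; the loss body is a stand-in:

```python
import tensorflow as tf

@tf.function(input_signature=[
    tf.TensorSpec([None, 50, 22], tf.float32),  # batch dimension unknown at trace time
    tf.TensorSpec([None, 50, 22], tf.float32),
])
def loss_skeleton(y_true, y_pred):
    print("static batch size at trace time:", y_true.shape[0])  # prints None
    # tf.shape(y_true)[0] gives the run-time batch size, but only as a tensor, so the
    # per-sequence work has to be a batched op (or tf.map_fn / tf.vectorized_map over
    # axis 0) rather than a Python for loop over range(batch_size).
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=[1, 2])  # stand-in per-sequence loss

loss_skeleton(tf.zeros([8, 50, 22]), tf.zeros([8, 50, 22]))
```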

I had a go at this, but given the very terse variable naming etc., I'm a bit hesitant to make major modifications when I'm not confident what the code is doing. Copilot came up with some options, but based on past history I don't trust it any farther than I can throw it.

I guess a first pass is probably making the code more functional and ensuring the output stays the same. I don't know much about TensorFlow internals, but I assume that for optimal execution it needs a functional-style pipeline it can parallelise. Given that R_ is being reused each time, though, I guess it may not be possible to run in parallel.

It would also help to have a mention in the README that this is required, please - I missed it when integrating into my own code, and I think it's an important limitation at this early stage.

Thanks!

@stellarpower

Come to think of it, I've only just looked at the Cython part. I assume it's there for a reason, but as it's very small, couldn't TensorFlow compile the same expression to something that could run on a GPU? I would guess that it currently runs on the CPU, so jumping in and out of the graph would add quite a performance hit.

@stellarpower

@roggf

Usage with the parameter run_eagerly=True comes with a very high computation time for training. Is there any way to use this loss function in TensorFlow's native graph mode?

I have a version that works with backend execution in my branch - I've done some basic tests and it seems to work, producing the same numbers as master here. Time will tell if the wheels all fall off, though, and I have yet to check the performance to see how much better (or worse) it may be.
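
A rough sketch of the kind of parity check involved, comparing eager execution against a tf.function-compiled call on the same batch; `soft_dtw_loss` is a placeholder body so the snippet stands alone:

```python
import numpy as np
import tensorflow as tf

def soft_dtw_loss(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))  # placeholder, not the real soft-DTW

rng = np.random.default_rng(0)
y_true = tf.constant(rng.standard_normal((4, 50, 22)), dtype=tf.float32)
y_pred = tf.constant(rng.standard_normal((4, 50, 22)), dtype=tf.float32)

eager_value = soft_dtw_loss(y_true, y_pred)               # eager, as with run_eagerly=True
graph_value = tf.function(soft_dtw_loss)(y_true, y_pred)  # compiled, as in the branch
np.testing.assert_allclose(eager_value.numpy(), graph_value.numpy(), rtol=1e-5)
```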

@apassero01

@roggf

Usage with the parameter run_eagerly=True comes with a very high computation time for training. Is there any way to use this loss function in TensorFlow's native graph mode?

I have a version that works with backend execution in my branch - I've done some basic tests and it seems to work, producing the same numbers as master here. Time will tell if the wheels all fall off, though, and I have yet to check the performance to see how much better (or worse) it may be.

I checked this out and it is definitely much faster, but I am running into an issue where my loss starts off negative and goes to NaN. I have no issues with any other loss functions - wondering if you've encountered this.

@stellarpower

Yeah, thanks for looking into it!

I think I hit the same issue yesterday; I believe it's fixed now - I have some unpushed commits at the moment.

The problem I am running into right now is that the losses, alignment matrix, etc. are all identical within numerical precision to the PyTorch implementations I've been referencing, but the gradients are not - and only when gamma is not 1.0. When it's 1, the tests pass.

I don't really know what I'm doing with gradient tapes and all that, so maybe I am just not using them properly, but it is really hard to pin down why the values deviate when the intermediate calculations all look the same.
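
Roughly how the TF-side gradients were being inspected for comparison (a sketch assuming the loss accepts a `gamma` keyword; the body is a placeholder):

```python
import tensorflow as tf

def soft_dtw_loss(y_true, y_pred, gamma=1.0):
    return gamma * tf.reduce_mean(tf.square(y_true - y_pred))  # placeholder, not the real soft-DTW

y_true = tf.random.normal([4, 50, 22])
y_pred = tf.random.normal([4, 50, 22])

with tf.GradientTape() as tape:
    tape.watch(y_pred)                        # y_pred is a plain tensor, so watch it explicitly
    loss = soft_dtw_loss(y_true, y_pred, gamma=0.1)
grad = tape.gradient(loss, y_pred)            # this is what gets compared to the PyTorch reference
```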

@stellarpower

Okay, turns out the reference values were wrong. Torch requires you to zero out the accumulated gradients after you use them. Ironic, given it's written in C++ internally, and yet it doesn't use RAII the way Keras does with the GradientTape context manager. And of course, after having Copilot analyse the code multiple times, it did not register this when I am using it in a loop. 🙃
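
To illustrate the pitfall (a sketch with a placeholder loss): PyTorch accumulates into `.grad` across `backward()` calls, so reference gradients collected in a loop are wrong unless the gradient is cleared on each iteration:

```python
import torch

x = torch.randn(50, 22, requires_grad=True)
y = torch.randn(50, 22)

for gamma in (0.1, 1.0):
    loss = gamma * ((x - y) ** 2).mean()  # placeholder for the soft-DTW reference loss
    loss.backward()
    reference_grad = x.grad.clone()       # without the next line this keeps accumulating
    x.grad = None                         # or x.grad.zero_(); this is the step that was missing
```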

I think the version I have pushed to my branch should be working. I have some refactors and tidy-ups coming down the pipeline later tonight, now that it looks like the numbers are (and always were) in fact correct.

@stellarpower

Also, I'd like to profile the memory use as well as the speed. If you hit any situations where you run out of memory to complete the ops, it would be good to know, thanks. Each sequence in the batch is processed independently, so I am hoping TensorFlow is smart enough to schedule them serially if there isn't enough memory to parallelise, thereby allowing relatively large batch sizes. But it will also be critical to know how we do when individual sequences become long or the feature vector gets larger.

@apassero01

Nice work - what does the speed of training look like for you? I obviously expect some drop in performance when using a loss like DTW, but I am seeing estimates of around 1.5 hours per epoch. The shape of my data is (15781, 50, 22), so I wouldn't expect training to fly by, but that is a little much.

@stellarpower

Cheers. Oh, that's a lot. I'm still tidying up and will profile over the next few hours. To confirm, your batch size is 15781? Or do you have 4D input? That seems really large to me (not an expert, but I've heard "friends don't let friends use batch sizes > 32"). Also, I don't know how relevant it would be to performance here specifically, but I've been advised to stick to powers of two (I'm using an LSTM). As a systems programmer I would tend to do so anyway, but it does seem common that machine learning examples use e.g. 50 nodes, which doesn't seem that sensible when 64 is so close. I know CUDA explicitly can only use some optimisations (Tensor Cores, for example) if the size is a multiple of 8/16. Do you have a ballpark of how long it takes with vanilla MSE?

@apassero01

Sorry, I copied the wrong line, oops. My batch size was definitely on the larger side at 128 (if what your friends say is true), but I don't really see that being an issue - I will try smaller, though. I switched to MSE for comparison and it was about 3 s per epoch, which doesn't really compare to the 1.5 h estimate I was getting. Also, the loss was negative. I've seen in some places that with soft-DTW that can be the case, so is that what we are expecting here? Minimize the most negative value?

@apassero01

[image attachment]

@stellarpower

Okay, so that's like the shape of your raw dataset before batching?

Currently I don't think TensorFlow is parallelising the algorithm very well. For larger batches it seems to queue the sequences up in series (good, so we don't exhaust memory), but I don't see the GPU being utilised as much as it could be (bad, as we could run these in parallel). I am running with the profiler to try to see what is happening. Most interestingly, the feature vector length doesn't seem to affect the compute time. It may be that those computations can be vectorised very efficiently, but it's a little concerning if I feed it 64 times more data and there's no change in the time per epoch.

I have set the parallel_iterations parameter here to equal the batch size, but it seems to make no difference.
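
For context, the arrangement looks roughly like this (the per-sequence function is a placeholder; note that `parallel_iterations` only caps how many loop iterations may run concurrently, it doesn't force parallel execution):

```python
import tensorflow as tf

def per_sequence_loss(pair):
    y_true_seq, y_pred_seq = pair
    return tf.reduce_mean(tf.square(y_true_seq - y_pred_seq))  # placeholder for per-sequence soft-DTW

def batched_loss(y_true, y_pred, batch_size=128):
    # map_fn builds a while_loop over the batch dimension.
    return tf.map_fn(
        per_sequence_loss,
        (y_true, y_pred),
        fn_output_signature=tf.float32,
        parallel_iterations=batch_size,  # set to the batch size, as described above
    )
```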

is that what we are expecting here? Minimize the most negative value?

I am not sure. I know it's possible, but I've asked about it myself here, as the technicalities of the maths seem complicated, and I'm not sure if this is something we ought to avoid, or if, as you say, the optimiser doesn't care and we can carry on just as well.

But yeah, keep me posted if you run any more experiments. If we can tune this to get some more performance, then great; I at least want to be able to take it for a road test and see if my model learns better. If it does and it's still too slow, then I'll look into writing a proper kernel in C++.

@apassero01

Yeah, it is interesting. The one concern I have with allowing a negative loss is that when I let the model train for a couple of batches, the loss started off very negative (around -2k) and quickly rose to positive 500. That is probably not a good thing. I would love to hear about your progress with it - I am honestly pretty inexperienced this deep in the weeds, but I think a working DTW loss would be really valuable for seq2seq models.

If you don't mind me asking what are you working on?

@stellarpower

BTW, I'm currently experimenting with JIT compilation using XLA, but FYI, if you change the map_fn to vectorized_map and remove the now-unnecessary keyword arguments, that got me a 6x speedup or so. In theory map_fn should be able to run in parallel, but it looks like it wasn't, whereas vectorized_map explicitly runs each sequence in the batch independently - in case you want to trial it out at all.
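
The change, sketched with the same placeholder per-sequence function: `tf.vectorized_map` vectorises the call across axis 0 instead of building a while_loop, and it takes neither `fn_output_signature` nor `parallel_iterations`:

```python
import tensorflow as tf

def per_sequence_loss(pair):
    y_true_seq, y_pred_seq = pair
    return tf.reduce_mean(tf.square(y_true_seq - y_pred_seq))  # placeholder for per-sequence soft-DTW

def batched_loss(y_true, y_pred):
    # vectorized_map replaces the map_fn call; no extra keyword arguments needed.
    return tf.vectorized_map(per_sequence_loss, (y_true, y_pred))
```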

@gabrielspadon gabrielspadon added bug Something isn't working enhancement New feature or request labels May 3, 2024
@stellarpower

@apassero01 I finally have a branch with a (just about) working version using XLA. To be blunt, I have found the whole TensorFlow-XLA thing like having teeth pulled: it's a nice idea, but the way TensorFlow has to compile everything to a graph lazily, with the shapes not known, and then hand that off to a compiler that explicitly needs to know the shapes and several numerical arguments, is just completely brittle. So, just saying: the branch is a complete mess, as it has taken the best part of a week to get a version that doesn't just keep throwing up new exceptions every time I fix the old ones, and I've walked away the second it seemed to run with a usable time for me (~100ms per epoch) 😅.

Long-term I think just expressing the algorithm in StableHLO would probably be better, but reading that back into TF again seems like it might be tricky.

If you want to see if it's any faster for you, it's available here. To try it out you will need to install JAX - I gave up fighting the TF graph compiler and am instead using JAX for the implementation; this is compiled to XLA and then converted to a TF function. I expect you may need a pretty recent version of TensorFlow too. Bear in mind that the XLA compilation seems to be extremely slow (potentially minutes) and uses an enormous amount of memory. I am not sure why; it may be that, because of the for loops, something vastly more complicated gets generated and the optimiser takes ages to work through it.
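
Roughly the wiring involved, with a toy stand-in for the actual soft-DTW implementation that lives in the branch:

```python
import jax.numpy as jnp
import tensorflow as tf
from jax.experimental import jax2tf

def toy_loss_jax(y_true, y_pred):
    return jnp.mean((y_true - y_pred) ** 2)  # stand-in for the JAX soft-DTW

# Convert the JAX function into something TF can call, then ask TF to JIT it
# through XLA; with_gradient=True keeps it usable as a training loss.
toy_loss_tf = jax2tf.convert(toy_loss_jax, with_gradient=True)
toy_loss_xla = tf.function(toy_loss_tf, jit_compile=True)

y_true = tf.random.normal([32, 50, 22])
y_pred = tf.random.normal([32, 50, 22])
print(toy_loss_xla(y_true, y_pred))  # the first call pays the XLA compilation cost
```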

Feel free to let me know if you encounter any errors, and if it's usably fast for you.
