Parallel training #133

Open
wants to merge 153 commits into
base: master
153 commits
97993a1
Adding sources of pipeline tool into gdeep trainer
act-reds May 4, 2023
d995340
Updated trainer and extractor to be compatible with pipeline_tool
act-reds May 4, 2023
d5f29fb
Add if statement to shutdown rpc only if it as been init before
act-reds May 4, 2023
e7a1616
Update requirements to accomodate fsdp
AnthoJack May 10, 2023
e3d24de
Custom sampler
AnthoJack May 10, 2023
69c0faf
delete: caltech example
AnthoJack May 10, 2023
0338f3a
Allow custom device
AnthoJack May 10, 2023
5795ff3
Apply refactor to gdeep
act-reds May 23, 2023
48b7c98
Add doc
act-reds May 23, 2023
6b152e3
Refactor include
act-reds May 23, 2023
ec313c1
Introduce FSDP
yorickbrunet May 25, 2023
01967cd
Add some logs
yorickbrunet May 25, 2023
d594fc9
Remove copy of model at init
yorickbrunet May 25, 2023
6bbaaed
Fix missing data on GPU > 0
yorickbrunet May 25, 2023
48ad37a
Readd commented deepcopy of model
yorickbrunet May 25, 2023
77e2bfd
Recover trained model and return values after FSDP training
AnthoJack May 25, 2023
eeab757
Eval working
act-reds Jun 7, 2023
d21877a
WIP: Retrieve training results and trained model
AnthoJack Jun 8, 2023
a476fba
FSDP WORKS !!!!
AnthoJack Jun 8, 2023
27cf80b
faster train for easier tests
AnthoJack Jun 8, 2023
1b74e9a
Add examples script
act-reds Jun 9, 2023
85758a3
Init
act-reds Jun 9, 2023
03ebc41
Push for script example
act-reds Jun 9, 2023
2e36237
Able to measure Memory Peaks on GPUs
act-reds Jun 12, 2023
33645e8
Only split training dataloader if no validation dataloader provided
AnthoJack Jun 14, 2023
fc478e6
reenable inital model copy and reset
AnthoJack Jun 14, 2023
63bd3bd
fix profiler setup
AnthoJack Jun 14, 2023
2cc1c76
Better config for train and profiling
AnthoJack Jun 14, 2023
9b7067a
Save works
act-reds Jun 14, 2023
80e3ead
Update trainer for non naive repartition
act-reds Jun 14, 2023
f64382e
FSDP works with cross-validation
AnthoJack Jun 22, 2023
375619a
Improve Caltech_resnet example to simplify scripting
AnthoJack Jun 22, 2023
ea34354
WIP: Tidying up and documenting
AnthoJack Jun 22, 2023
6eda42f
Non naive repartition works for basic model, still need to be tested …
act-reds Jun 23, 2023
fa03553
Update to align
act-reds Jun 23, 2023
5f2f13d
WIP: doc
AnthoJack Jun 23, 2023
053f10d
implement prefetching
AnthoJack Jun 28, 2023
e37f360
Fix first GPU repartition, and test dynamic repartition with orbits5k
act-reds Jun 30, 2023
26f0cbc
Remove commented code
act-reds Jun 30, 2023
0e292ba
Add comment
act-reds Jul 3, 2023
8b368a4
Merge branch 'pipeline_integration' into 'non_naive_pipeline'
bruno-darochac Jul 5, 2023
6ce70da
Merge branch 'non_naive_pipeline' into 'pipeline_integration'
bruno-darochac Jul 5, 2023
61fcd8e
Support transformer wrap policy
AnthoJack Jul 21, 2023
63e54b2
Doc
AnthoJack Jul 21, 2023
5ce47f4
TBD: Examples
AnthoJack Jul 21, 2023
16bd281
Merge remote-tracking branch 'origin/pipeline_integration' into paral…
AnthoJack Jul 24, 2023
e260236
remove useless import
AnthoJack Jul 24, 2023
86c8390
bug fix and API improvement
AnthoJack Jul 28, 2023
28d3479
Translate pipeline explaination to EN
act-reds Sep 4, 2023
1a298a4
Add Giotto integration explains + upgrade first step with new parameters
act-reds Sep 4, 2023
8f5ae85
Finalise README
act-reds Sep 4, 2023
8b0e1ab
Correct improvements
act-reds Sep 4, 2023
bc51330
Remoeve old readme + add img
act-reds Sep 5, 2023
9fbad37
Update file README.md
adiReds Sep 6, 2023
ae4cdea
Improve import without absolute path + start generic benchmark
act-reds Sep 7, 2023
934c96c
Framework benchmark construction
act-reds Sep 7, 2023
63e9d3d
Benchmark ready for memory consumption
act-reds Sep 22, 2023
8294c2c
Finialisation launch of benchmarks
act-reds Sep 25, 2023
697de05
Save current state to have rollback
act-reds Sep 26, 2023
04f5e07
Fix pipeline tool use from giotto deep
act-reds Sep 26, 2023
1a09cdf
Fix subprocess robustess
act-reds Sep 26, 2023
932bd48
fix world_size
AnthoJack Aug 23, 2023
ac7dddf
New example
AnthoJack Aug 23, 2023
4a21fb7
Add pipeline tool benchmark's script
act-reds Sep 26, 2023
29335d8
Fix subprocess robustness by adding sys.executable instead of hard co…
act-reds Sep 27, 2023
aaf0b4b
Fix balancing for more thatn 2 GPUs
bruno-darochac Oct 3, 2023
2e9cbc9
Benchmarks for giotto-deep
yorickbrunet Oct 5, 2023
966a727
Explain debug pod
yorickbrunet Oct 6, 2023
d6153df
[readme] add download data
yorickbrunet Oct 6, 2023
5584c76
Modify API for unrestricted FSDP configuration
AnthoJack Oct 9, 2023
234351c
unfix unnecessary fix
AnthoJack Oct 9, 2023
fc4a0cc
Modify balancing function
bruno-darochac Oct 9, 2023
705fc01
Merge branch 'parallel_training' of https://reds-gitlab.heig-vd.ch/l2…
bruno-darochac Oct 9, 2023
3ce74c9
Adapt benchmark to new parallelism types
yorickbrunet Oct 9, 2023
13482f1
Allow batch size 1 in benchmark
yorickbrunet Oct 11, 2023
c82223f
Improve benchmark readme
yorickbrunet Oct 11, 2023
71dd62a
update requirements
AnthoJack Oct 11, 2023
88ba2dd
Recover sharded state_dict correctly
AnthoJack Oct 11, 2023
60bd7ac
Rename Parallelism fsdp config argument
yorickbrunet Oct 11, 2023
07a41b2
Adapt orbit5k args
yorickbrunet Oct 11, 2023
3fc2df8
Rename config for mha in orbit5k
yorickbrunet Oct 11, 2023
41d159e
Implement FSDP sharding strategy in benchmark
yorickbrunet Oct 11, 2023
5ec5662
Colorize plot lines per parallelism
yorickbrunet Oct 12, 2023
eba07ea
Add optional subdirectory to store benchmark results
yorickbrunet Oct 12, 2023
f404ce6
Assign linestyles and markers to same gpu models and counts
yorickbrunet Oct 12, 2023
413000f
Clean orbit5k file and add generic ex args to gdeep
yorickbrunet Oct 16, 2023
157b780
Add model BERT to benchmark
yorickbrunet Oct 16, 2023
54b8c3e
Move dataset download out of bert main fn
yorickbrunet Oct 16, 2023
f917de3
Add run separator in benchmark
yorickbrunet Oct 16, 2023
4da93ef
Add BERT big and improve models management
yorickbrunet Oct 16, 2023
3ad76a1
Add node pools with V100
yorickbrunet Oct 17, 2023
a04edae
Update Doc
AnthoJack Oct 20, 2023
5328d13
Use BertLayer with fsdp
AnthoJack Oct 23, 2023
cc6ab8f
Provide warnings in code
AnthoJack Oct 25, 2023
991a0bc
Add labels to pods
yorickbrunet Oct 26, 2023
9c8d542
[doc] resources
yorickbrunet Oct 26, 2023
bd40eab
[doc] Add bert to available models
yorickbrunet Oct 26, 2023
5ef71ce
Provide warnings in code
AnthoJack Oct 25, 2023
5f3634e
Rename orbit 5k example
yorickbrunet Oct 26, 2023
af7f92c
Save for error check
bruno-darochac Nov 7, 2023
91cea99
Update behaviour for work with pipeline_tool package + update pipelin…
bruno-darochac Nov 7, 2023
c0778e9
Update doc for FSDP
AnthoJack Nov 8, 2023
989127f
Merge branch 'parallel_training' of reds-gitlab.einet.ch:l2f-reds/gio…
AnthoJack Nov 8, 2023
55e5557
Adapt to new trainer API
AnthoJack Nov 8, 2023
97f67f9
Fix review of Yorick
bruno-darochac Nov 9, 2023
3601184
Merge branch 'pipeline_package_integration' into 'parallel_training'
bruno-darochac Nov 10, 2023
89868f0
Updated example of pipeline tool
bruno-darochac Nov 10, 2023
9888750
Import regularizer
yorickbrunet Nov 13, 2023
28f099a
Remove working files that are now unnecessary
yorickbrunet Nov 21, 2023
410884e
Revert saved data
yorickbrunet Nov 21, 2023
6af2735
Revert further saved results
yorickbrunet Nov 21, 2023
d485fab
Remove debug messages
yorickbrunet Nov 21, 2023
3e9a7ff
Remove unused imports
yorickbrunet Nov 21, 2023
bb254b0
Fix typo
yorickbrunet Nov 21, 2023
6f1a69c
Remove benchmark for pipeline tool
yorickbrunet Nov 21, 2023
d2aef2d
Clear outputs
yorickbrunet Nov 21, 2023
7ccb588
Add benchmarks to doc
yorickbrunet Nov 23, 2023
a6888c3
Add images
yorickbrunet Nov 23, 2023
a0fd5a7
Rename img folder
yorickbrunet Nov 23, 2023
32f8d61
Add missing image
yorickbrunet Nov 23, 2023
f4e3d42
Rename img folder
yorickbrunet Nov 23, 2023
d402966
Added links and precisions on deepcopy
AnthoJack Nov 24, 2023
eb9684e
Set fig num
yorickbrunet Nov 24, 2023
c9d9050
Improve doc benchmarks
yorickbrunet Nov 24, 2023
71759c2
Add benchmarks to doc
yorickbrunet Nov 24, 2023
ae05473
Merge branch 'add_benchmark_to_doc' into 'parallel_training_merged'
yorickbrunet Nov 24, 2023
c91d4ea
Improve doc
yorickbrunet Nov 24, 2023
f3ce03b
Fix tests according to new dataloaders setup
yorickbrunet Nov 27, 2023
b900e52
Clean code further
yorickbrunet Nov 27, 2023
ead4131
Fix example according to new dataloaders setup
yorickbrunet Nov 27, 2023
e031ef5
Remove unworking example
yorickbrunet Nov 28, 2023
c4f00d3
Revert setting specific package versions
yorickbrunet Nov 28, 2023
4614e3c
Fix by reshaping the validation tensors to a valid shape
mAkeddar Nov 30, 2023
73509b1
fix second example, the compute of val loss was different from train …
mAkeddar Dec 1, 2023
24268cc
fix third example, optimizer file names contains space
mAkeddar Dec 1, 2023
bcdf25b
Clear outputs
yorickbrunet Dec 1, 2023
ce1a7f6
Clean files
yorickbrunet Dec 1, 2023
d995a80
Merge branch 'fix/tutorials' into 'parallel_training_merged'
yorickbrunet Dec 1, 2023
94ac8b1
exclude windows for parallel training since not supported
matteocao Dec 8, 2023
2b2cb4c
move the import
matteocao Dec 8, 2023
3363152
Update trainer.py
matteocao Dec 8, 2023
6c04e23
remove one test to make the CI pass on windows
matteocao Dec 8, 2023
ce5825c
reverting back the failsafe for win - thorough investigation is needed
matteocao Dec 8, 2023
54dc71b
try to use an older version of windows
matteocao Dec 9, 2023
9841325
Improve arguments readability
yorickbrunet Dec 11, 2023
6b464de
Remove type ignore
yorickbrunet Dec 11, 2023
432eb25
Avoid abbreviations
yorickbrunet Dec 11, 2023
0d23fdb
Remove extra space
yorickbrunet Dec 11, 2023
d0457e3
Fix computation of loss
yorickbrunet Dec 11, 2023
8c99c29
Add type hint for argument
yorickbrunet Dec 14, 2023
26930b5
address review comments
yorickbrunet Mar 7, 2024
25ae355
address review comments
yorickbrunet Mar 7, 2024
9f4e325
address review comments
yorickbrunet Mar 7, 2024
2 changes: 1 addition & 1 deletion .github/workflows/python-package-windows.yml
@@ -12,7 +12,7 @@ on:
jobs:
build:

- runs-on: windows-latest
+ runs-on: windows-2019
raphaelreinauer marked this conversation as resolved.
strategy:
matrix:
python-version: [3.8, 3.9, '3.10']
1 change: 0 additions & 1 deletion .gitignore
@@ -29,7 +29,6 @@ examples/*.json
examples/*.html

# Image/Video files
- *.png
*.mp4

# Data files
2 changes: 2 additions & 0 deletions benchmark/.gitignore
@@ -0,0 +1,2 @@
run-*.yml
plot.yml
31 changes: 31 additions & 0 deletions benchmark/Dockerfile
@@ -0,0 +1,31 @@
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections && \
apt-get update && \
apt-get install -y \
python3 python3-pip \
Collaborator:

Please pin specific versions to ensure reproducibility.

Author:

The package used comes from Ubuntu's package repository. There is no version to pin, as it will not change during the life of this version of the distribution.

&& \
apt-get autoremove && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

COPY ./requirements.txt giotto-deep/

RUN cd giotto-deep && \
raphaelreinauer marked this conversation as resolved.
pip3 install --no-cache-dir --disable-pip-version-check -r requirements.txt

COPY ./benchmark/requirements.txt giotto-deep/requirements.txt

RUN cd giotto-deep && \
pip3 install --no-cache-dir --disable-pip-version-check -r requirements.txt

COPY ./setup.py giotto-deep/
COPY ./setup.cfg giotto-deep/
COPY ./README.md giotto-deep/
COPY ./gdeep giotto-deep/gdeep/
COPY ./examples giotto-deep/examples/
COPY ./benchmark giotto-deep/benchmark/

RUN cd giotto-deep && pip3 install --no-cache-dir --disable-pip-version-check -e .

ENTRYPOINT [ "python3", "/giotto-deep/benchmark/benchmark.py" ]
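
For reference, the pinning suggested in the review thread on the `apt-get install` step could look roughly like the fragment below. This is only a sketch, not part of the PR: the build arguments `PYTHON3_VERSION` and `PYTHON3_PIP_VERSION` are hypothetical names introduced for illustration, and no concrete version strings are given because the thread does not state any. It relies on the standard `package=version` form accepted by `apt-get install`.

```dockerfile
# Hypothetical build arguments (not in the PR); supply the values at build time,
# e.g. docker build --build-arg PYTHON3_VERSION=<version> ...
ARG PYTHON3_VERSION
ARG PYTHON3_PIP_VERSION

# Install exactly the requested versions; the build fails if they are unavailable.
RUN apt-get update && \
    apt-get install -y \
        python3=${PYTHON3_VERSION} \
        python3-pip=${PYTHON3_PIP_VERSION} && \
    rm -rf /var/lib/apt/lists/*
```

The versions available in the base image can be listed with `apt-cache policy python3 python3-pip` inside a container started from it.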