This repository has been archived by the owner on Dec 11, 2020. It is now read-only.

failed to load the pretrained v2 model to run Go bot #138

Open
hejin opened this issue Feb 16, 2019 · 12 comments

Comments

hejin commented Feb 16, 2019

Hi guys,

I followed the project homepage instructions exactly (all software versions match) and tried to run the Go bot with the pretrained v2 model, but it failed with this message:
"
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var".
Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked".
"

The machine is a 24-core x86-64 box with an NVIDIA V100 GPU (16 GB).

The full log is below. Thanks very much!

(base) roobot@ELF:~/play-ELF/ELF/scripts/elfgames/go$ ./run.sh /home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin
Python version: 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0]
PyTorch version: 1.0.1.post2
CUDA version 10.0.130
Conda env: base
[2019-02-16 22:29:30.383] [rlpytorch.model_loader.load_env0] [info] Loading env
<module 'elfgames.go.game' from '/home/roobot/play-ELF/ELF/src_py/elfgames/go/game.py'> elfgames.go.game
<module 'elfgames.go.df_model3' from '/home/roobot/play-ELF/ELF/src_py/elfgames/go/df_model3.py'> elfgames.go.df_model3
[2019-02-16 22:29:30.394] [rlpytorch.model_loader.load_env0] [info] Parsed options: {'T': 1,
'actor_only': False,
'adam_eps': 0.001,
'additional_labels': ['aug_code', 'move_idx'],
'batchsize': 16,
'batchsize2': -1,
'black_use_policy_network_only': False,
'bn': True,
'bn_eps': 1e-05,
'bn_momentum': 0.1,
'cheat_eval_new_model_wins_half': False,
'cheat_selfplay_random_result': False,
'check_loaded_options': False,
'client_max_delay_sec': 1200,
'comment': '',
'data_aug': -1,
'dim': 256,
'dist_rank': -1,
'dist_url': '',
'dist_world_size': -1,
'dump_record_prefix': '',
'epsilon': 0.0,
'eval_model_pair': '',
'eval_num_games': 400,
'eval_old_model': -1,
'eval_stats': '',
'eval_winrate_thres': 0.55,
'expected_num_clients': -1,
'following_pass': False,
'gpu': 0,
'greedy': True,
'keep_prev_selfplay': False,
'keys_in_reply': ['V', 'rv'],
'leaky_relu': False,
'list_files': [],
'load': '/home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin',
'load_model_sleep_interval': 0.0,
'loglevel': 'debug',
'lr': 0.001,
'mcts_alpha': 0.0,
'mcts_epsilon': 0.0,
'mcts_persistent_tree': True,
'mcts_pick_method': 'most_visited',
'mcts_puct': 1.5,
'mcts_rollout_per_batch': 16,
'mcts_rollout_per_thread': 8192,
'mcts_root_unexplored_q_zero': False,
'mcts_threads': 2,
'mcts_unexplored_q_zero': False,
'mcts_use_prior': True,
'mcts_verbose': False,
'mcts_verbose_time': True,
'mcts_virtual_loss': 1,
'mode': 'online',
'model': 'online',
'momentum': 0.9,
'move_cutoff': -1,
'multipred_backprop': True,
'num_block': 20,
'num_future_actions': 1,
'num_games': 1,
'num_games_per_thread': -1,
'num_minibatch': 5000,
'num_reader': 50,
'num_reset_ranking': 5000,
'omit_keys': [],
'onload': [],
'opt_method': 'adam',
'parameter_print': False,
'parsed_args': ['df_console.py',
'--mode',
'online',
'--keys_in_reply',
'V',
'rv',
'--use_mcts',
'--mcts_verbose_time',
'--mcts_use_prior',
'--mcts_persistent_tree',
'--load',
'/home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin',
'--server_addr',
'localhost',
'--port',
'1234',
'--replace_prefix',
'resnet.module,resnet',
'--no_check_loaded_options',
'--no_parameter_print',
'--verbose',
'--gpu',
'0',
'--num_block',
'20',
'--dim',
'256',
'--mcts_puct',
'1.50',
'--batchsize',
'16',
'--mcts_rollout_per_batch',
'16',
'--mcts_threads',
'2',
'--mcts_rollout_per_thread',
'8192',
'--resign_thres',
'0.05',
'--mcts_virtual_loss',
'1',
'--loglevel',
'debug'],
'ply_pass_enabled': 0,
'policy_distri_cutoff': 0,
'policy_distri_training_for_all': False,
'port': 1234,
'preload_sgf': '',
'preload_sgf_move_to': -1,
'print_result': False,
'q_max_size': 1000,
'q_min_size': 10,
'ratio_pre_moves': 0,
'replace_prefix': ['resnet.module,resnet'],
'resign_thres': 0.05,
'sample_nodes': ['pi,a'],
'sample_policy': 'epsilon-greedy',
'selfplay_async': False,
'selfplay_init_num': 2000,
'selfplay_timeout_usec': 0,
'selfplay_update_num': 1000,
'server_addr': 'localhost',
'server_id': '',
'start_ratio_pre_moves': 0.5,
'store_greedy': False,
'suicide_after_n_games': -1,
'use_data_parallel': False,
'use_data_parallel_distributed': False,
'use_df_feature': False,
'use_fp16': False,
'use_mcts': True,
'use_mcts_ai2': False,
'verbose': True,
'weight_decay': 0.0,
'white_mcts_rollout_per_batch': -1,
'white_mcts_rollout_per_thread': -1,
'white_puct': -1.0,
'white_use_policy_network_only': False}
[2019-02-16 22:29:30.396] [rlpytorch.model_loader.load_env0] [info] Finished loading env
[2019-02-16 22:29:30.397] [elf::base::ThreadedDispatcherT-11] [info] Wait all games[1] to register their mailbox
human_actor: {'input': ['s', 'aug_code', 'move_idx'], 'reply': ['pi', 'a', 'V'], 'batchsize': 1}
SharedMem: "human_actor", keys: ['a', 'V', 'pi', 's', 'aug_code', 'move_idx']
a int64_t [16]
V float [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
a int64_t [16]
V float [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
actor_black: {'input': ['s', 'aug_code', 'move_idx'], 'reply': ['pi', 'V', 'a', 'rv'], 'timeout_usec': 10, 'batchsize': 16}
SharedMem: "actor_black", keys: ['a', 'V', 'rv', 'pi', 's', 'aug_code', 'move_idx']
a int64_t [16]
V float [16]
rv int64_t [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
a int64_t [16]
V float [16]
rv int64_t [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
[2019-02-16 22:29:34.512] [rlpytorch.model_loader.ModelLoader-1-model_indexNone] [info] Loading model from /home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin
[2019-02-16 22:29:34.512] [rlpytorch.model_loader.ModelLoader-1-model_indexNone] [info] replace_prefix for state dict: [['resnet.module', 'resnet']]
Traceback (most recent call last):
  File "df_console.py", line 87, in <module>
    main()
  File "df_console.py", line 47, in main
    model = model_loader.load_model(GC.params)
  File "/home/roobot/play-ELF/ELF/src_py/rlpytorch/model_loader.py", line 161, in load_model
    check_loaded_options=self.options.check_loaded_options)
  File "/home/roobot/play-ELF/ELF/src_py/rlpytorch/model_base.py", line 147, in load
    self.load_state_dict(sd)
  File "/home/roobot/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var".
Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked".
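
Comparing the two key lists above, each unexpected key is a missing key with an extra `.module` segment after `init_conv` (plus one `num_batches_tracked` entry), while the log shows that only `resnet.module` was configured to be replaced by `resnet`. A minimal sketch of that kind of prefix remapping on a plain dictionary is shown here. `remap_prefixes` is a hypothetical helper for illustration, not the ELF loader's actual code, and the extra `init_conv.module` pair is an assumption about what would make the keys line up.

```python
from collections import OrderedDict


def remap_prefixes(state_dict, replacements):
    """Return a copy of state_dict with key prefixes rewritten.

    replacements is a list of (old_prefix, new_prefix) pairs; the first
    matching pair wins. Hypothetical helper, not part of ELF or PyTorch.
    """
    remapped = OrderedDict()
    for key, value in state_dict.items():
        new_key = key
        for old, new in replacements:
            # Match whole dotted segments only, so "resnet.module" does not
            # also match an unrelated key such as "resnet.modules_extra".
            if key == old or key.startswith(old + "."):
                new_key = new + key[len(old):]
                break
        remapped[new_key] = value
    return remapped


# Stand-in string values; a real checkpoint would hold tensors here.
checkpoint = OrderedDict([
    ("init_conv.module.0.weight", "w0"),
    ("init_conv.module.1.running_mean", "m1"),
    ("resnet.module.0.weight", "w2"),  # illustrative key, not from the log
])

fixed = remap_prefixes(checkpoint, [
    ("resnet.module", "resnet"),        # the pair already passed by the script
    ("init_conv.module", "init_conv"),  # assumed extra pair
])
print(list(fixed))
```

With real tensors as values, the remapped dictionary is what would then be handed to `load_state_dict`.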

l1t1 commented Feb 16, 2019

#133 (comment)
I still have two errors that are not solved by using replace_prefix.

l1t1 commented Feb 17, 2019

Did you try server.sh and client.sh?

Author

hejin commented Feb 17, 2019

No :(
I will try. Thanks much! @l1t1

@yuandong-tian
Contributor

This is probably because of the version of PyTorch. A fix is on the way.

Contributor

yuandong-tian commented Feb 18, 2019

@hejin @l1t1 What version of PyTorch did you use? We use PyTorch 1.0.

l1t1 commented Feb 18, 2019

I also use 1.0.1 with elf_convert.py, but the Windows binary df_console.exe shouldn't require the user to install PyTorch.

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.0.1

l1t1 commented Feb 18, 2019

I suggest that df_console.exe also support loading elfv2.bin and training data such as 1500000.bin.

Contributor

jma127 commented Feb 20, 2019

Could you please try the newly-revised gtp.sh in master?

l1t1 commented Feb 21, 2019

I downloaded today's build:

D:\elfv2>\tool\wget -c https://dl.fbaipublicfiles.com/elfopengo/play/play_opengo_v2.zip
--2019-02-21 07:45:54--  https://dl.fbaipublicfiles.com/elfopengo/play/play_opengo_v2.zip
Length: 1076887016 (1.0G) [application/zip]
Saving to: 'play_opengo_v2.zip'

play_opengo_v2.zip            100%[=================================================>]   1.00G

2019-02-21 08:30:52 (391 KB/s) - 'play_opengo_v2.zip' saved [1076887016/1076887016]

and ran the CPU version with the built-in Sabaki,
with the engine set to D:\elfv2\play_opengo_v2\elf_cpu_full\elf\df_console.exe.
It doesn't work at all:

○ newelfv2> name 
connection failed
○ newelfv2> version 
connection failed
○ newelfv2> protocol_version 
connection failed
○ newelfv2> list_commands 
connection failed
○ newelfv2> komi 6.5
connection failed
[5504] Failed to execute script df_console
Traceback (most recent call last):
  File "df_console.py", line 92, in <module>
  File "df_console.py", line 85, in main
  File "elf\utils_elf.py", line 435, in run
  File "elf\utils_elf.py", line 383, in _call
  File "elf\utils_elf.py", line 253, in cpu2gpu
  File "elf\utils_elf.py", line 253, in <dictcomp>
  File "site-packages\torch\cuda\__init__.py", line 161, in _lazy_init
  File "site-packages\torch\cuda\__init__.py", line 75, in _check_driver
AssertionError: Torch not compiled with CUDA enabled
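
The traceback above ends in an assertion raised when CUDA is touched on a PyTorch build without CUDA support. A common guard, sketched here under the assumption that the caller is free to choose the device, is to check `torch.cuda.is_available()` before moving anything to the GPU. `pick_device` is a hypothetical helper and does not reflect how `utils_elf.py` is written.

```python
def pick_device(requested_gpu, cuda_available):
    """Return a device string, falling back to CPU when CUDA is unusable.

    requested_gpu: GPU index, or a negative number to force CPU.
    cuda_available: typically the result of torch.cuda.is_available().
    """
    if requested_gpu >= 0 and cuda_available:
        return "cuda:{}".format(requested_gpu)
    return "cpu"


try:
    import torch
    available = torch.cuda.is_available()
except ImportError:
    # Keep the sketch runnable on machines without PyTorch installed.
    available = False

print(pick_device(0, available))
```

The returned string can be passed to `torch.device(...)` or `tensor.to(...)`, so a CPU-only build never reaches the CUDA initialization path.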

l1t1 commented Feb 21, 2019

But the GPU version works:

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console

list_commands
= boardsize
clear_board
exit
final_score
genmove
komi
list_commands
name
play
protocol_version
quit
showboard
version

play b d16
=

genmove w
= N1

l1t1 commented Feb 21, 2019

And the GPU version also supports loading weights with --load:

D:\>fc /b D:\elfv2\play_opengo_v2\elf_gpu_full\elf\model-v2.bin d:\elfv2.bin |more
Comparing files D:\ELFV2\PLAY_OPENGO_V2\ELF_GPU_FULL\ELF\model-v2.bin and D:\ELFV2.BIN
FC: no differences encountered

Some tests:

quit
[2019-02-21 09:52:26.508] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 09:52:26.692] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 3 y = 15 move: dp please try again
[2019-02-21 09:52:27.259] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 3 y = 15 move: dp please try again
[2019-02-21 09:52:27.369] [elf::base::Context-3] [info] Stop all game threads ...
[2019-02-21 09:52:27.682] [elf::base::Context-3] [info] All games sent notification, Waiting until they join
[2019-02-21 09:52:27.684] [elf::base::Context-3] [info] Stop all collectors ...
[2019-02-21 09:52:27.687] [elf::base::Context-3] [info] Stop tmp pool...

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/elfv2.bin
version
= 1.0

quit
[2019-02-21 09:55:16.300] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 09:55:16.301] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 0 y = 1 move: ab please try again
[2019-02-21 09:55:16.303] [elfgames::go::mcts::MCTSActor-21] [error] model version 1 and required version 1290000 are not consistent

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/1500000.bin
genmove b
= D3


? Invalid input


? Invalid input


? Invalid input


? Invalid input


? Invalid input


? Invalid input

genmove w
= C16

quit
[2019-02-21 10:08:29.307] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 10:08:30.431] [elf::base::Context-3] [info] Stop all game threads ...
[2019-02-21 10:08:30.933] [elf::base::Context-3] [info] All games sent notification, Waiting until they join
[2019-02-21 10:08:30.937] [elf::base::Context-3] [info] Stop all collectors ...
[2019-02-21 10:08:30.957] [elf::base::Context-3] [info] Stop tmp pool...

l1t1 commented Feb 21, 2019

Testing the ELF v1 weights:

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load  d:/pretrained-go-19x19-v1.bin --num_block 20 --dim 224

? Invalid input


? Invalid input

genmove b
= Q16


? Invalid input
