With no clusters defined, after cluster_mgr and node_mgr sit idle overnight, cluster_mgr has no heartbeat and node_mgr produces no log output #25

Open
jd-zhang opened this issue May 18, 2022 · 7 comments

Comments

@jd-zhang
Contributor

Issue migrated from trac ticket # 705

component: cluster manager | priority: major

2022-05-18 11:13:00: [email protected] created the issue


1. Log of cluster_mgr:
Wed May 18 10:50:40 2022 tid:0x5e6b [INFO] [/home/kunlun/program_binaries/test_rbr/cluster_mgr_0513/src/http_server/http_server.cc:300 GenerateRequest]: Http post: {
"version":"1.0",
"job_id":"",
"job_type":"create_cluster",
"user_name":"kunlun_test",
"timestamp":"202205131532",
"paras":{
"nick_name":"rbrcluster001",
"ha_mode":"rbr",
"shards":"2",
"nodes":"3",
"comps":"1",
"max_storage_size":"20",
"max_connections":"6",
"cpu_cores":"8",
"innodb_size":"1",
"dbcfg":"1",
"machinelist": [ {"hostaddr":"192.168.0.129"} ]
}
}
2. At that time, node_mgr produced no log output.

@jd-zhang
Contributor Author

2022-05-18 11:14:15: [email protected] commented


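Both mgrs were running, the metadata tables contained no clusters, and the setup was left idle overnight.
The next morning a create-cluster command was sent. cluster_mgr received it but did not write it to the cluster_general_job_log table in the database.
At the time, the developers suspected that an error in the metadata tables was blocking the write, but after logging in to the metadata primary, writes worked. When the create-cluster command was sent again, it was written normally and cluster creation started.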

@jd-zhang
Contributor Author

2022-05-18 11:15:36: [email protected] commented


The create-cluster request data looks like this: {
"version":"1.0",
"job_id":"",
"job_type":"create_cluster",
"user_name":"kunlun_test",
"timestamp":"202205131532",
"paras":{
"nick_name":"rbrcluster001",
"ha_mode":"rbr",
"shards":"2",
"nodes":"3",
"comps":"1",
"max_storage_size":"20",
"max_connections":"6",
"cpu_cores":"8",
"innodb_size":"1",
"dbcfg":"1",
"machinelist": [ {"hostaddr":"${node_mgr.1}"} ]
}
}

@jd-zhang
Contributor Author

2022-05-19 09:41:04: [email protected] commented


The problem was reproduced on the night of the 18th.
On sending a request to create an rbr cluster, the API returned:
{"attachment":null,"error_code":"1","error_info":"execute query failed [this lead to connection closed]: , error number: 2006, sql: begin","status":"failed","version":"1.0"}

At this point clustermgr logged only the following:
Thu May 19 09:36:40 2022 tid:0x5e63 [INFO] [/home/kunlun/program_binaries/test_rbr/cluster_mgr_0513/src/http_server/http_server.cc:300 GenerateRequest]: Http post: {
"version":"1.0",
"job_id":"",
"job_type":"create_cluster",
"user_name":"kunlun_test",
"timestamp":"202205131532",
"paras":{
"nick_name":"rbrcluster002",
"ha_mode":"rbr",
"shards":"2",
"nodes":"3",
"comps":"1",
"max_storage_size":"20",
"max_connections":"6",
"cpu_cores":"8",
"innodb_size":"1",
"dbcfg":"1",
"machinelist": [ {"hostaddr":"192.168.0.129"} ]
}
}

At this point nodemgr had no log output.

@jd-zhang
Contributor Author

2022-05-19 10:10:56: @chaojie1979 commented


The connection used for writing to the database was most likely dropped. A retry mechanism will be added later.

@jd-zhang
Contributor Author

2022-05-20 10:16:59: @chaojie1979 commented


A retry mechanism is being added inside zettalib. The previous interface configures the number of retries through statement_retries.
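
For illustration only, here is a minimal Python sketch of the kind of reconnect-and-retry logic discussed above. It is not the zettalib implementation: `execute_with_retry`, `DbError`, `FakeConn`, and `make_connector` are hypothetical names, and only the `statement_retries` parameter name is borrowed from this thread. Error number 2006 is the MySQL client code for "MySQL server has gone away"; a common cause is the server closing a connection that has sat idle too long, which would fit the overnight idle period, though that cause is an assumption here.

```python
# Hypothetical sketch: none of these names come from zettalib or cluster_mgr.
CR_SERVER_GONE_ERROR = 2006  # MySQL client error: "MySQL server has gone away"
CR_SERVER_LOST = 2013        # MySQL client error: "Lost connection to MySQL server during query"
RETRYABLE = {CR_SERVER_GONE_ERROR, CR_SERVER_LOST}


class DbError(Exception):
    """Stand-in for a driver exception that carries a MySQL error number."""
    def __init__(self, errno, msg=""):
        super().__init__(f"error number: {errno} {msg}".strip())
        self.errno = errno


def execute_with_retry(connect, sql, statement_retries=3):
    """Run sql, reconnecting and retrying when the connection was dropped.

    connect: zero-argument callable returning an object with execute(sql).
    statement_retries: number of retries allowed after the first attempt.
    """
    conn = connect()
    for attempt in range(statement_retries + 1):
        try:
            return conn.execute(sql)
        except DbError as e:
            # Give up on non-connection errors or once retries are exhausted.
            if e.errno not in RETRYABLE or attempt == statement_retries:
                raise
            conn = connect()  # stale connection: open a fresh one and retry


class FakeConn:
    """Stand-in for a real connection; always fails if marked stale."""
    def __init__(self, stale):
        self.stale = stale

    def execute(self, sql):
        if self.stale:
            raise DbError(CR_SERVER_GONE_ERROR)
        return f"ok: {sql}"


def make_connector():
    """First connection is stale (as if idle overnight); later ones are fresh."""
    calls = {"n": 0}

    def connect():
        calls["n"] += 1
        return FakeConn(stale=(calls["n"] == 1))

    return connect, calls
```

Retrying is safe for the failing statement in this ticket (`begin`) because no transaction state exists yet. A statement that fails in the middle of a transaction would need the whole transaction replayed on the new connection, which this sketch does not attempt.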

@jd-zhang
Contributor Author

2022-05-20 10:16:59: @chaojie1979 changed owner from chaojie to snow

@jd-zhang
Contributor Author

2022-05-20 10:19:16: [email protected] commented


Third reproduction. clustermgr printed the following: Fri May 20 09:46:35 2022 tid:0xf1f61 [ERROR] [/home/kunlun/program_binaries/test_rbr/cluster_mgr_0513/src/http_server/http_server.cc:341 GenerateRequestUniqueId]: execute query failed [this lead to connection closed]: , error number: 2006, sql: begin
