
Nacos 2.3.2 cluster: inconsistent offline behavior across nodes. In a three-node deployment, only one node can take services offline normally; the other nodes return 400 when taking a service offline. (Inconsistent Offline Behavior Across Nodes in Nacos 2.3.2 Cluster) #12063

Open
Dreamer-SK opened this issue May 7, 2024 · 3 comments


@Dreamer-SK

Describe the bug
In a Nacos 2.3.2 cluster deployed with the embedded data source, taking services online and offline works correctly on only one node; the other nodes respond with 400 and the offline operation fails.

Expected behavior
Service online/offline maintenance should work from the web UI. Please either fix this issue or upgrade the console's API calls to v2.

Actual behavior
The cluster is started with the embedded data source on ports 24000, 24002, and 24004. After deployment, only the 24004 node can take services online and offline through the web UI; the 24000 and 24002 nodes respond with 400 when taking a service offline, and the operation fails.

How to Reproduce

  1. Download Nacos and update cluster.conf so that all three nodes share the same, correct configuration (see the sample cluster.conf after this list), then start the three servers from the command line with startup.cmd -p embedded.

  2. Configure the application's startup parameters so the service registers with the cluster: -Dspring.cloud.nacos.server-addr=172.16.20.214:24000,172.16.20.214:24004,172.16.20.214:24002 -Xms256M -Xmx256M -Dspring.profiles.active=dev

  3. After startup, open the web UI of each of the three Nacos nodes. The cluster status shows as normal, and the service registers correctly under Service Management > Service List, but as shown in the screenshots, taking the service offline fails on the 24000 and 24002 nodes.
    (Screenshots: 24000 error 1, 24000 error 2)
    Response message:
    “<!doctype html><html lang="en"><head><title>HTTP Status 400–Bad Request</title><style type="text/css ">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 400 – Bad Request</h1></body></html>”

  4. Taking the service offline via the 24004 node succeeds, and the operation also takes the service offline on the 24000 and 24002 nodes, as expected.
    (Screenshots: 24004 success 1, 24004 success 2)
    Response message:
    ok
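
For reference, a cluster.conf matching the setup in step 1 would look roughly like this (a sketch: the IP is taken from the server-addr parameter in step 2, and the single-machine port layout matches this reproduction):

```
# conf/cluster.conf, identical on all three nodes
172.16.20.214:24000
172.16.20.214:24002
172.16.20.214:24004
```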

Desktop (please complete the following information):

  • OS: reproducible on both CentOS 7.6 and Windows 8
  • Version: nacos-server 2.3.2、nacos-client 2.3.2
  • Module: beta
  • SDK: spring-cloud-alibaba-nacos

Additional context
Same issue as #10345. Testing shows that the v2 instance modification API works correctly.

API tests with Apifox and curl match the web UI behavior: calling the v1 instance modification API on the 24000 and 24002 nodes returns 400, while the same call against 24004 succeeds. The v2 instance modification API works on all three nodes (24000, 24002, and 24004). The web UI uses the v1 API, so this problem may have been introduced into the v1 API by a recent release; sample requests are sketched after the screenshots.
(Screenshots: v1 error 400, v2 normal)
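
For concreteness, the two calls compared above look roughly like this (a sketch: the service name demo-service, group DEFAULT_GROUP, and instance address 172.16.20.214:8080 are placeholders, not values from this report):

```
# v1 instance update (the API the web UI uses): returns 400 on 24000/24002, "ok" on 24004
curl -X PUT 'http://172.16.20.214:24000/nacos/v1/ns/instance?serviceName=demo-service&groupName=DEFAULT_GROUP&ip=172.16.20.214&port=8080&enabled=false'

# v2 instance update: succeeds on all three nodes
curl -X PUT 'http://172.16.20.214:24000/nacos/v2/ns/instance?serviceName=demo-service&groupName=DEFAULT_GROUP&ip=172.16.20.214&port=8080&enabled=false'
```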

@KomachiSion (Collaborator)

From the description, this looks like a failure in inter-node request forwarding: the call only succeeds on the leader node because no forwarding is needed there. I could not reproduce it in my own environment, so I suspect the deployment environment is causing the forwarded request to fail. Please check the server-side logs for anything useful, or analyze the forwarding path with a packet capture or Arthas.
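
A minimal capture along these lines would show whether the forwarded v1 request reaches the responsible node intact (a sketch; run on the Nacos host, with interface and output file chosen freely):

```
# Capture inter-node traffic on the three Nacos server ports for later inspection in Wireshark
tcpdump -i any -w nacos-forward.pcap 'tcp port 24000 or tcp port 24002 or tcp port 24004'
```

Alternatively, watching the v1 naming module's request-forwarding filter with Arthas around the time of a failed offline request should show where the forwarded request is mangled.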

@Dreamer-SK (Author)

OK. You can reproduce the same problem by starting three nodes in embedded cluster mode on a single Windows machine. I will also investigate along the lines you suggested and report back.

@KomachiSion (Collaborator)

A 400 Bad Request like this looks as if the request was rejected by Tomcat itself. Typical causes:

  1. The cluster is set up incorrectly: the nodes have different configurations, so when a request is forwarded to the responsible node it ends up on the wrong node.
  2. The cluster configuration is problematic: for example, Tomcat's max-header-size is set too small, or the request content is too large, so the content is truncated during forwarding and the request can no longer be parsed.

I still recommend analyzing it with a packet capture or Arthas; a quick header-size check is sketched below.
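
If cause 2 is suspected, one quick test is to raise the header limit in the server's conf/application.properties and retry the offline operation (a sketch; server.max-http-header-size is the standard Spring Boot property behind Nacos's embedded Tomcat, and 32 KB is an arbitrary test value):

```
# conf/application.properties: enlarge the embedded Tomcat header limit, then restart and retest
server.max-http-header-size=32768
```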
