grpc channel stub cannot end request when timeout is exceeded, e.g. timeout=30ms, request ended after 200ms #36712

Closed

johnsondeng opened this issue May 24, 2024 · 7 comments

@johnsondeng commented May 24, 2024

What version of gRPC and what language are you using?

python
grpcio 1.54.0
grpcio-tools 1.54.0

What operating system (Linux, Windows,...) and version?

linux

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] on linux

What did you do?

I wrote a multi-threaded gRPC client demo that sets a 30 ms timeout per request, while the gRPC server at "10.155.1.127:1234" always sleeps for 60 ms. So the expectation is that every request ends with a deadline-exceeded error after about 30 ms. But in practice, after sending lots of requests and grepping the elapsed times for statistics, many requests cost much more than 30 ms.

grpc_server/demo_client.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-


from api import test_pb2, test_pb2_grpc
import os
import sys
import argparse
import logging
import time
import toml
from threading import Thread
from multiprocessing import Process

import grpc


grpc_channel_options = [
    ('grpc.max_send_message_length', -1),
    ('grpc.max_receive_message_length', -1),
    # ('grpc.client_idle_timeout_ms', 10 * 1000)
]
# A single channel shared by all worker threads below.
channel = grpc.insecure_channel("10.155.1.127:1234",
                                grpc_channel_options)


def start():

    request_cnt = 0
    while request_cnt < total:
        try:
            start_time = time.perf_counter()

            # Stubs are cheap to create; every thread reuses the single shared channel.
            stub = test_pb2_grpc.TestStub(channel)
            req = test_pb2.Req(msg="hello")

            if args.api == "echo":
                rsp = stub.Echo(req, timeout=0.03)
            elif args.api == "proxyecho":
                rsp = stub.ProxyEcho(req, timeout=1)
            elif args.api == "redisecho":
                rsp = stub.RedisEcho(req, timeout=1)

            if request_cnt % 1000 == 0:
                logging.info("request cnt %d, rsp %s", request_cnt, rsp.msg)
        except Exception as e:
            # Expected: deadline exceeded after ~30 ms for the "echo" API.
            elapsed = (time.perf_counter() - start_time) * 1000
            print("exception %s, elapsed %f" % (e, elapsed))

        request_cnt += 1


parser = argparse.ArgumentParser()
parser.add_argument("--api",
                    type=str,
                    default="echo",
                    help="echo/redisecho/proxyecho")

if __name__ == "__main__":
    args = parser.parse_args()
    # time.sleep(10000)

    total = 10000000
    worker_num = 32
    threads = [Thread(target=start) for _ in range(worker_num)]

    for th in threads:
        th.start()
    for th in threads:
        th.join()
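
The server side at "10.155.1.127:1234" just sleeps 60 ms and echoes the message back. A rough sketch of such a server (hypothetical: the service name Test and a response type Rsp with a msg field are inferred from the client code above; the real server script is only in the issue.zip attached later in this thread):

#!/usr/bin/env python
# Hypothetical server sketch: every Echo call sleeps 60 ms, so a client
# calling with timeout=0.03 should always hit DEADLINE_EXCEEDED.
import time
from concurrent import futures

import grpc

from api import test_pb2, test_pb2_grpc


class TestServicer(test_pb2_grpc.TestServicer):

    def Echo(self, request, context):
        time.sleep(0.06)  # 60 ms, twice the client's 30 ms deadline
        return test_pb2.Rsp(msg=request.msg)


if __name__ == "__main__":
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=32))
    test_pb2_grpc.add_TestServicer_to_server(TestServicer(), server)
    server.add_insecure_port("[::]:1234")
    server.start()
    server.wait_for_termination()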

Run the program with nohup python grpc_server/demo_client.py for nearly 5 minutes, then print the time cost of the top 10 longest requests:

(screenshot: elapsed times of the top 10 longest requests)

What did you expect to see?

All requests end within 30 ms (with a deadline-exceeded error, since the server always sleeps 60 ms).

What did you see instead?

Some requests only ended after more than 100 ms.

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

See TROUBLESHOOTING.md for how to diagnose problems better.

Anything else we should know about your project / environment?

@johnsondeng (Author) commented:

Any help? @gnossen @XuanWang-Amos @hassox @beccasaurus

@sourabhsinghs commented:

@johnsondeng I wasn't able to repro the issue after setting up a simple server against your client.

Please provide the server-side script as well so we can reproduce the issue. If possible, also a Docker image with your server/client setup.

@johnsondeng (Author) commented Jul 12, 2024

(attachment: issue.zip)

@sourabhsinghs The code and Dockerfile are in the attached zip file. I built a simple image for this issue and can still reproduce the problem after running the client for about 5 minutes. Please help; this problem has been impacting our production servers for a long time.

Dockerfile:

FROM python:3.10

WORKDIR /workspace

RUN pip install grpcio==1.65.0 grpcio-tools==1.65.0

# Note: the PATH export below only persists within this single RUN layer;
# protoc --version just verifies the install.
RUN mkdir /opt/protoc && cd /opt/protoc \
    && wget https://github.com/protocolbuffers/protobuf/releases/download/v27.2/protoc-27.2-linux-x86_64.zip \
    && unzip protoc-27.2-linux-x86_64.zip \
    && export PATH=$PATH:/opt/protoc/bin/ \
    && protoc --version

@johnsondeng (Author) commented:

Can anyone help? @gnossen @sourabhsinghs @XuanWang-Amos

@sourabhsinghs commented:

@johnsondeng, I have tried again to repro this issue with the suggested setup. These are my findings:

Elapsed Time Range (ms)    Number of Calls
(10, 20]                   0
(20, 30]                   0
(30, 40]                   0
(40, 50]                   0
(50, 60]                   3200119
(60, 70]                   64
(70, 80]                   2
(80, 90]                   0
(90, 100]                  0

Do you have any suggestion as to why we are unable to see similar delays?

I also did another test to check for a constant overhead on top of the configured timeout; sharing the graph for that as well.

(graph: response delay vs. timeout)
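
For reference, a bucketed summary like the table above can be produced with a small helper along these lines (a sketch, not the exact script used for these measurements):

# Sketch: group per-call elapsed times (in ms) into 10 ms-wide buckets.
import collections
import math


def bucket_elapsed(elapsed_ms_samples, width=10):
    counts = collections.Counter()
    for ms in elapsed_ms_samples:
        hi = math.ceil(ms / width) * width  # e.g. 57.3 ms falls in (50, 60]
        counts[(hi - width, hi)] += 1
    for (lo, hi), n in sorted(counts.items()):
        print(f"({lo}, {hi}]: {n}")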

@johnsondeng (Author) commented Jul 30, 2024

@sourabhsinghs I produced a histogram too, and it seems I get more requests costing > 60 ms than you do. Maybe a hardware difference? My container is running on a Mac M1.

But my question is: why can't the requests that take > 60 ms be ended within 50 ms (the timeout parameter set on the RPC call)? Your histogram above also shows 64 requests in (60, 70]. The timeout is very inaccurate.

0.00 - 10.00: 0
10.00 - 20.00: 0
20.00 - 30.00: 0
30.00 - 40.00: 0
40.00 - 50.00: 0
50.00 - 60.00: 125234
60.00 - 70.00: 197
70.00 - 80.00: 33
80.00 - 90.00: 19
90.00 - 100.00: 8
100.00 - 500.00: 1

@sourabhsinghs commented:

@johnsondeng In the data you've shared, it looks like 99.794% of the RPCs (125234/125492) end within 50.00 - 60.00 ms. The latency of the remaining RPCs can be attributed to hardware / the GIL / network, etc.
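
One way to separate the two effects on the client side (a sketch, not part of the original repro) is to record both the status code and the wall-clock elapsed time: a deadline-exceeded status confirms gRPC cancelled the call at the deadline, and with 32 threads contending for the GIL any extra elapsed time is mostly how long the thread waited to be scheduled again, not a missed deadline:

# Sketch: confirm the deadline was enforced and measure how late the thread noticed.
import time

import grpc

from api import test_pb2, test_pb2_grpc  # same generated stubs as demo_client.py

channel = grpc.insecure_channel("10.155.1.127:1234")
stub = test_pb2_grpc.TestStub(channel)

start = time.perf_counter()
try:
    stub.Echo(test_pb2.Req(msg="hello"), timeout=0.03)
except grpc.RpcError as e:
    observed_ms = (time.perf_counter() - start) * 1000
    # DEADLINE_EXCEEDED means gRPC gave up at ~30 ms; any excess in
    # observed_ms is time this thread spent waiting to run again.
    print(e.code() == grpc.StatusCode.DEADLINE_EXCEEDED, observed_ms)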
