JetStreamError: heartbeats missed #649

Open
erikschul opened this issue Feb 16, 2025 · 2 comments
Labels
defect Suspected defect such as a bug or regression

Comments

erikschul commented Feb 16, 2025

Observed behavior

I'm using the NATS Node client (@nats-io/transport-node) with Bun, and it seems that my application sometimes crashes because of NATS.
It will be in the middle of some database operation when it crashes, so it doesn't seem related to fetch.
I've also tried adding some await Bun.sleep(10) calls to allow task switching for NATS, so I don't think that's the problem.
I'm calling msg.working() every 5s with a setInterval function.

I'm not sure what the heartbeat is for. Can it be disabled? I'm only interested in the working() and ack() calls.
The consumer is configured with ack_wait: 5 * 60 * 10 ** 9, // 5min.
Is it possible to use NATS on an unreliable network connection? Or does it require consistent and very frequent heartbeats?
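
Roughly, the setup looks like this (a sketch only; handleMessage() stands in for the actual long-running work, and the stream/consumer names are the ones that appear in the debug trace below):

import { connect } from "@nats-io/transport-node";
import { jetstream, type JsMsg } from "@nats-io/jetstream";

async function handleMessage(msg: JsMsg): Promise<void> {
  // hypothetical long-running work (database calls, fetch, etc.)
}

const nc = await connect();
const js = jetstream(nc);

// the consumer itself was created with ack_wait: 5 * 60 * 10 ** 9 (5 min)
const consumer = await js.consumers.get("mystream", "pull_consumer");

const messages = await consumer.consume();
for await (const msg of messages) {
  // extend the ack deadline every 5s while the work is in progress
  const wip = setInterval(() => msg.working(), 5_000);
  try {
    await handleMessage(msg);
    msg.ack();
  } finally {
    clearInterval(wip);
  }
}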

nats.js debug console output:

< SUB _INBOX.B8M1G50VZP7G51FZHIKBMV.1 2␍␊
< PUB $JS.API.CONSUMER.MSG.NEXT.mystream.pull_consumer _INBOX.B8M1G50VZP7G51FZHIKBMV.1 74␍␊{"batch":1,"max_bytes":0,"idle_heartbeat":1000000000,"expires":5000000000}␍␊
> MSG mysubject 2 $JS.ACK.mystream.pull_consumer.12.1.52.1739656687327122032.0 39␍␊{"type":"mymsg","name":"abc"}␍␊
> PING␍␊
< PONG␍␊
< UNSUB 2␍␊
     working() called.
     working() finished.
< PUB $JS.ACK.mystream.pull_consumer.12.1.52.1739656687327122032.0 4␍␊+WPI␍␊
     working() called.
     working() finished.

...

< PUB $JS.ACK.mystream.pull_consumer.12.1.52.1739656687327122032.0 4␍␊+WPI␍␊
< PUB $JS.ACK.mystream.pull_consumer.12.1.52.1739656687327122032.0 4␍␊+ACK␍␊UNSUB 2␍␊
24 |     }
25 | }
26 | exports.JetStreamNotEnabled = JetStreamNotEnabled;
27 | class JetStreamError extends Error {
28 |     constructor(message, opts) {
29 |         super(message, opts);
             ^
JetStreamError: heartbeats missed
      at JetStreamError (./node_modules/@nats-io/jetstream/lib/jserrors.js:29:9)

Using the debugging code in the README:

// iter is the ConsumerMessages iterator returned by consumer.consume()
void new Promise(async () => {
  for await (const s of iter.status()) {
    console.log("Status: ", s)
    switch (s.type) {
      case "heartbeats_missed": {
        console.log(`${s.count} heartbeats missed`)
        break
      }
    }
  }
})

It seems that when there are no messages, it works fine:

Status:  {
  type: "discard",
  messagesLeft: 1,
  bytesLeft: 0,
}
Status:  {
  type: "next",
  options: {
    batch: 1,
    max_bytes: 0,
    idle_heartbeat: 2500000000,
    expires: 5000000000,
  },
}
Status:  {
  type: "heartbeat",
  lastConsumerSequence: 55,
  lastStreamSequence: 5,
}

But when the work starts, which involves a lot of fetching with long timeouts (e.g. 10-30 seconds), it seems to cause NATS to miss heartbeats, and the internal NATS code appears to stop the consumer on failure, if I'm reading this right:

// if we are not a consume, give up - this was masked by an
// external timer on fetch - the hb is a more reliable timeout
// since it requires 2 to be missed - this creates an edge case
// that would timeout the client longer than they would expect: we
// could be waiting for one more message, nothing happens, and
// now we have to wait for 2 missed hbs, which would be 1m (max), so
// there wouldn't be a fail fast.
this.stop(new jserrors_1.JetStreamError("heartbeats missed"));
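
If that's the case, one way I could imagine surviving it (a sketch only; the retry delay is arbitrary, and js/handleMessage are as in the sketch above) is to re-create the consume() iterator whenever it stops:

async function consumeWithRetry(): Promise<void> {
  const consumer = await js.consumers.get("mystream", "pull_consumer");
  while (true) {
    try {
      const messages = await consumer.consume();
      for await (const msg of messages) {
        const wip = setInterval(() => msg.working(), 5_000);
        try {
          await handleMessage(msg);
          msg.ack();
        } finally {
          clearInterval(wip);
        }
      }
    } catch (err) {
      // e.g. JetStreamError: heartbeats missed, as in the trace above
      console.warn("consume stopped, retrying:", err);
      await new Promise((r) => setTimeout(r, 1_000));
    }
  }
}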

Expected behavior

No error.

Server and client version

server: 2.10.22

"@nats-io/jetstream": "^3.0.0-37",
"@nats-io/nats-core": "^3.0.0-50",
"@nats-io/transport-node": "^3.0.0-35",

Host environment

No response

Steps to reproduce

No response

@erikschul erikschul added the defect Suspected defect such as a bug or regression label Feb 16, 2025
erikschul (Author) commented:

I managed to work around it by using next() instead.

I still think this is an issue though.
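
Roughly, the workaround looks like this (a sketch; handleMessage() again stands in for the actual work):

const consumer = await js.consumers.get("mystream", "pull_consumer");
while (true) {
  // next() resolves with null when the request expires without a message
  const msg = await consumer.next({ expires: 5_000 });
  if (msg === null) continue;
  const wip = setInterval(() => msg.working(), 5_000);
  try {
    await handleMessage(msg);
    msg.ack();
  } finally {
    clearInterval(wip);
  }
}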

aricart (Member) commented Feb 18, 2025

@erikschul is your API blocking? The heartbeats missed error is usually a direct response to the client being partitioned. Is that possible in your case, i.e. you have a connection but part of it goes silent? For your pattern the more resilient option is doing next(), as you have done. consume() is intended for stable processes.
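
To see whether that's what's happening, something like this (a sketch; nc is the core connection) will log the connection lifecycle events while the work runs:

(async () => {
  for await (const s of nc.status()) {
    // look for disconnect/reconnect events around the time the crash happens
    console.log("connection status:", s);
  }
})();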
