
fix: replication stuck when syncing large batches #4474


Merged

fraidev merged 5 commits into infinyon:master from fraidev:fix_replication_large_2 on Apr 1, 2025

Conversation

fraidev
Contributor

@fraidev fraidev commented Mar 31, 2025

Should fix replica sync problem: #4472

I also noticed better performance with replicas now:

before:

fluvio benchmark producer -p 3 -r 3 -k --num-records 50000
Benchmark started

latencies: 19.7µs min, 7.4ms avg, 33.1ms max, 7.2ms p0.50, 15.1ms p0.95, 17.3ms p0.99
50000 total records sent, 71942 records/sec: (369.1 MB/sec), total time: 695.9ms

**Per Record E2E Latency**

|Variable| p0.00 |p0.50|p0.95 |p0.99 |p1.00 |
|--------|-------|-----|------|------|------|
|Latency |19.7µs |7.2ms|15.1ms|17.3ms|33.1ms|

**Throughput (Total Produced Bytes / Time)**

|     Variable      |   Speed    |
|-------------------|------------|
|Produced Throughput|369.1 MB/sec|

Benchmark completed

now:

flvdr benchmark producer -p 3 -r 3 -k --num-records 50000
Benchmark started

latencies: 69.2µs min, 5.9ms avg, 23.2ms max, 5.6ms p0.50, 12.6ms p0.95, 14.6ms p0.99
50000 total records sent, 99800 records/sec: (512.0 MB/sec), total time: 501.5ms

**Per Record E2E Latency**

|Variable| p0.00 |p0.50|p0.95 |p0.99 |p1.00 |
|--------|-------|-----|------|------|------|
|Latency |69.2µs |5.6ms|12.6ms|14.6ms|23.2ms|

**Throughput (Total Produced Bytes / Time)**

|     Variable      |   Speed    |
|-------------------|------------|
|Produced Throughput|512.0 MB/sec|

Benchmark completed
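
(As a rough sanity check on these numbers: both runs moved about the same amount of data, 369.1 MB/s × 0.6959 s ≈ 257 MB and 512.0 MB/s × 0.5015 s ≈ 257 MB, so the roughly 1.39× gain in records/sec and throughput comes from the shorter total time, 695.9 ms → 501.5 ms, rather than from a change in payload size.)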

@fraidev fraidev force-pushed the fix_replication_large_2 branch from 69621e3 to c45c5e6 on March 31, 2025 22:06
@fraidev fraidev force-pushed the fix_replication_large_2 branch from c45c5e6 to 995fa2b on March 31, 2025 22:16
@fraidev fraidev force-pushed the fix_replication_large_2 branch from c7a8be7 to c021365 on March 31, 2025 22:29
@sehz
Contributor

sehz commented Mar 31, 2025

Trace shows no records were sent, even though the slice was 1 MB, so at least one record should have arrived. Can you add more debugging to see how many bytes were received and why decoding failed?

2025-03-31T22:32:13.626134Z TRACE respond{socket=Socket(12) _connection=peer("127.0.0.1:64972")}:LeaderConnection{self=5002 sink=fd(12)}:update_from_leaders{sink=fd(12)}: fluvio_spu_schema::file: KfFileRecordSet encoding file slice len: 1000000
2025-03-31T22:32:13.649397Z DEBUG respond{socket=Socket(12) _connection=peer("127.0.0.1:64972")}:LeaderConnection{self=5002 sink=fd(12)}: fluvio_spu::replication::leader::connection: waiting for next event
2025-03-31T22:32:13.649391Z TRACE FollowerGroupController{leader=5001}: fluvio_spu::replication::follower::peer_api: decoding with header: RequestHeader {
    api_key: 0,
    api_version: 7,
    correlation_id: 1,
    client_id: "leader: 5001",
}
2025-03-31T22:32:13.649500Z DEBUG FollowerGroupController{leader=5001}:sync_from_leader{sink=fd(11)}: fluvio_spu::replication::follower::controller::inner: update from leader replica=test-0 leader_hw=0 leader_leo=5 records=0 base_offset=-1
2025-03-31T22:32:13.649528Z DEBUG FollowerGroupController{leader=5001}:sync_from_leader{sink=fd(11)}: fluvio_spu::replication::follower::state: no records
2025-03-31T22:32:13.649540Z DEBUG FollowerGroupController{leader=5001}:sync_from_leader{sink=fd(11)}: fluvio_spu::replication::follower::controller::inner: no changes
2025-03-31T22:32:13.649560Z DEBUG FollowerGroupController{leader=5001}: fluvio_spu::replication::follower::controller::inner: waiting request from leader counter=1
test replication::test::test_sync_larger_records_sync ... FAILED
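
(For illustration only: a minimal sketch of the kind of instrumentation being asked for here, assuming a follower-side decode entry point that receives the raw buffer; the function and field names are placeholders, not the actual Fluvio code.)

```rust
use bytes::Buf;
use tracing::{debug, trace};

/// Hypothetical helper showing where extra byte-level logging could go.
fn decode_record_set<B: Buf>(buf: &mut B, version: i16) -> Result<usize, std::io::Error> {
    let available = buf.remaining();
    trace!(available, version, "bytes available before decoding record set");

    if available == 0 {
        debug!("no bytes received from leader, nothing to decode");
        return Ok(0);
    }

    // Real decoding would happen here; on error, log how many bytes were
    // consumed so far so a partial or truncated payload shows up in the trace.
    let consumed = available; // placeholder for the actual decode result
    debug!(consumed, left = buf.remaining(), "record set decode finished");
    Ok(consumed)
}
```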

@fraidev
Contributor Author

fraidev commented Mar 31, 2025

Trace shows no records were sent, even though the slice was 1 MB, so at least one record should have arrived. Can you add more debugging to see how many bytes were received and why decoding failed?

2025-03-31T22:32:13.626134Z TRACE respond{socket=Socket(12) _connection=peer("127.0.0.1:64972")}:LeaderConnection{self=5002 sink=fd(12)}:update_from_leaders{sink=fd(12)}: fluvio_spu_schema::file: KfFileRecordSet encoding file slice len: 1000000
2025-03-31T22:32:13.649397Z DEBUG respond{socket=Socket(12) _connection=peer("127.0.0.1:64972")}:LeaderConnection{self=5002 sink=fd(12)}: fluvio_spu::replication::leader::connection: waiting for next event
2025-03-31T22:32:13.649391Z TRACE FollowerGroupController{leader=5001}: fluvio_spu::replication::follower::peer_api: decoding with header: RequestHeader {
    api_key: 0,
    api_version: 7,
    correlation_id: 1,
    client_id: "leader: 5001",
}
2025-03-31T22:32:13.649500Z DEBUG FollowerGroupController{leader=5001}:sync_from_leader{sink=fd(11)}: fluvio_spu::replication::follower::controller::inner: update from leader replica=test-0 leader_hw=0 leader_leo=5 records=0 base_offset=-1
2025-03-31T22:32:13.649528Z DEBUG FollowerGroupController{leader=5001}:sync_from_leader{sink=fd(11)}: fluvio_spu::replication::follower::state: no records
2025-03-31T22:32:13.649540Z DEBUG FollowerGroupController{leader=5001}:sync_from_leader{sink=fd(11)}: fluvio_spu::replication::follower::controller::inner: no changes
2025-03-31T22:32:13.649560Z DEBUG FollowerGroupController{leader=5001}: fluvio_spu::replication::follower::controller::inner: waiting request from leader counter=1
test replication::test::test_sync_larger_records_sync ... FAILED

If I understood correctly, these logs show that the records were not received by the follower, but they don't say anything about the leader sending them.

@sehz
Contributor

sehz commented Mar 31, 2025

Here is the leader sending to the follower:

2025-03-31T22:32:13.626134Z TRACE respond{socket=Socket(12) _connection=peer("127.0.0.1:64972")}:LeaderConnection{self=5002 sink=fd(12)}:update_from_leaders{sink=fd(12)}: fluvio_spu_schema::file: KfFileRecordSet encoding file slice len: 1000000

Follower receiving:

2025-03-31T22:32:13.649391Z TRACE FollowerGroupController{leader=5001}: fluvio_spu::replication::follower::peer_api: decoding with header: RequestHeader {
    api_key: 0,
    api_version: 7,
    correlation_id: 1,
    client_id: "leader: 5001",
}
2025-03-31T22:32:13.649500Z DEBUG FollowerGroupController{leader=5001}:sync_from_leader{sink=fd(11)}: fluvio_spu::replication::follower::controller::inner: update from leader replica=test-0 leader_hw=0 leader_leo=5 records=0 base_offset=-1
2025-03-31T22:32:13.649528Z DEBUG FollowerGroupController{leader=5001}:sync_from_leader{sink=fd(11)}: fluvio_spu::replication::follower::state: no records
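
(In other words, a 1,000,000-byte file slice is encoded on the leader side, but by the time the follower decodes the update it reports records=0 and base_offset=-1, so the payload is produced but never turns into records on the follower.)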

@fraidev fraidev marked this pull request as ready for review April 1, 2025 00:49
@fraidev fraidev requested a review from sehz April 1, 2025 00:52
@fraidev fraidev force-pushed the fix_replication_large_2 branch from d8f51d7 to a344916 on April 1, 2025 01:00
@fraidev fraidev force-pushed the fix_replication_large_2 branch from a344916 to 22f453e on April 1, 2025 01:00
Review comment on the following diff context:

```rust
),
));
}

self.records.decode(&mut buf, version)?;
```
Contributor

let's restore len validation
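
(A minimal sketch, for illustration, of what a length check before the decode call could look like, assuming a length-prefixed record set; the names and error handling here are placeholders, not the actual fluvio-protocol code.)

```rust
use bytes::Buf;
use std::io::{Error, ErrorKind};

/// Illustrative length validation before decoding a length-prefixed record set.
fn validate_and_decode<B: Buf>(buf: &mut B) -> Result<Vec<u8>, Error> {
    if buf.remaining() < 4 {
        return Err(Error::new(ErrorKind::UnexpectedEof, "missing record set length"));
    }
    let len = buf.get_i32() as usize;

    // Refuse to decode if the buffer does not actually hold `len` bytes,
    // instead of silently decoding a truncated payload.
    if buf.remaining() < len {
        return Err(Error::new(
            ErrorKind::UnexpectedEof,
            format!("record set claims {len} bytes, only {} available", buf.remaining()),
        ));
    }

    // Stand-in for `self.records.decode(&mut buf, version)?` in the real code.
    let mut records = vec![0u8; len];
    buf.copy_to_slice(&mut records);
    Ok(records)
}
```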

@fraidev fraidev force-pushed the fix_replication_large_2 branch 2 times, most recently from 643d20b to 2981229 on April 1, 2025 02:13
@fraidev fraidev force-pushed the fix_replication_large_2 branch from 2981229 to 2e35405 on April 1, 2025 02:35
@fraidev fraidev requested a review from sehz April 1, 2025 05:05
@fraidev fraidev force-pushed the fix_replication_large_2 branch from 74684e3 to a9c4554 on April 1, 2025 05:14
Contributor

@sehz sehz left a comment

LGTM

@sehz
Contributor

sehz commented Apr 1, 2025

Let's change warn to debug here: https://github.com/infinyon/fluvio/blob/master/crates/fluvio-protocol/src/record/data.rs#L320 for both cases, since it is expected that RecordSet will typically not be able to process all the batches, and the warning is misleading.
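
(A minimal sketch of the requested change, assuming a decode loop that stops when the remaining bytes do not hold a complete batch; the loop itself is illustrative, only the warn-to-debug downgrade is the point.)

```rust
use tracing::debug;

/// Illustrative batch loop: a trailing partial batch is the expected case
/// when decoding a bounded slice, so it is logged at debug, not warn.
fn count_complete_batches(total_bytes: usize, batch_len: usize) -> usize {
    let mut remaining = total_bytes;
    let mut batches = 0;
    while remaining >= batch_len {
        remaining -= batch_len;
        batches += 1;
    }
    if remaining > 0 {
        // was: warn!(...); downgraded because leftover bytes are normal here
        debug!(remaining, "not enough bytes left for a full batch, stopping");
    }
    batches
}
```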

@fraidev fraidev changed the title from "test: add test to sync large batch records" to "fix: replication stuck when syncing large batches" on Apr 1, 2025
@fraidev fraidev force-pushed the fix_replication_large_2 branch from a9c4554 to ea52ffa on April 1, 2025 18:30
@fraidev
Contributor Author

fraidev commented Apr 1, 2025

Let's change warn to debug here: https://github.com/infinyon/fluvio/blob/master/crates/fluvio-protocol/src/record/data.rs#L320 for both cases, since it is expected that RecordSet will typically not be able to process all the batches, and the warning is misleading.

sure, done.

@fraidev fraidev enabled auto-merge April 1, 2025 18:32
@fraidev fraidev added this pull request to the merge queue Apr 1, 2025
Merged via the queue into infinyon:master with commit 679991d Apr 1, 2025
102 checks passed
@fraidev fraidev deleted the fix_replication_large_2 branch April 1, 2025 19:09