
[BUGFIX] add Health Check for Range over gRPC Connection Loop #2828

Open
wants to merge 1 commit into base: main from bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop

Conversation

@kpango (Collaborator) commented Feb 3, 2025

Description

[WIP]

Related Issue

Versions

  • Vald Version: v1.7.16
  • Go Version: v1.23.5
  • Rust Version: v1.83.0
  • Docker Version: v27.4.0
  • Kubernetes Version: v1.32.0
  • Helm Version: v3.16.3
  • NGT Version: v2.3.5
  • Faiss Version: v1.9.0

Checklist

Special notes for your reviewer

Summary by CodeRabbit

  • New Features
    • Introduced enhanced metrics tracking and improved context management for more responsive operations.
  • Documentation
    • Updated API documentation across endpoints with comprehensive overviews, structured status code tables, and detailed troubleshooting guidance.
    • Corrected typographical errors in documentation for the Insert RPC method.
  • Refactor
    • Streamlined logging and connection management to improve overall robustness.
  • Chores
    • Upgraded multiple dependency and component versions (e.g., K3S, Docker, FAISS, Helm, GolangCI-Lint) for enhanced stability and performance.

coderabbitai bot (Contributor) commented Feb 3, 2025

📝 Walkthrough

This pull request introduces extensive changes across the project. Updates include modifications in Makefile configurations, enhancements to gRPC client and connection pool functions by adding context parameters and refining error handling, and corrections in API documentation with expanded endpoint summaries and structured status code tables. Kubernetes deployment annotations have been updated, various dependency and version numbers have been adjusted, and new documentation and Rust source files have been added.
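
To illustrate the recurring pattern behind these signature changes, here is a minimal, self-contained sketch (not the actual Vald interfaces; the type and field names are simplified for illustration) of threading a context.Context through a connection-iterating method so callers can cancel or time out the traversal:

package main

import (
	"context"
	"fmt"
	"time"
)

// client is a stand-in for the gRPC client; the real code keeps a
// concurrent map of pool.Conn values keyed by address.
type client struct {
	conns map[string]bool // addr -> healthy
}

// ConnectedAddrs walks the pool and returns only healthy addresses,
// stopping early if the context is cancelled or times out.
func (c *client) ConnectedAddrs(ctx context.Context) (addrs []string) {
	for addr, healthy := range c.conns {
		select {
		case <-ctx.Done():
			return addrs // caller cancelled or timed out
		default:
		}
		if healthy {
			addrs = append(addrs, addr)
		}
	}
	return addrs
}

func main() {
	c := &client{conns: map[string]bool{"10.0.0.1:8081": true, "10.0.0.2:8081": false}}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(c.ConnectedAddrs(ctx))
}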

Changes

  • Makefile.d/k3d.mk, Makefile.d/functions.mk: Added K3D_NETWORK variable; updated k3d install commands, permission adjustments, and revised GitHub Actions version retrieval.
  • internal/backoff/backoff.go, internal/net/grpc/client.go, internal/net/grpc/client_test.go, internal/net/grpc/pool/pool.go, internal/net/grpc/pool/pool_test.go, internal/observability/metrics/grpc/grpc.go, internal/net/net.go, internal/test/mock/grpc/grpc_client_mock.go, internal/test/mock/grpc_testify_mock.go: Enhanced context management by updating method signatures; replaced explicit loops with maps.Clone; refined connection health checks, error logging, and metric reporting.
  • pkg/gateway/mirror/service/mirror.go, pkg/gateway/lb/handler/grpc/aggregation.go: Updated gRPC client calls to pass context and added additional logging for error handling in aggregation searches.
  • apis/swagger/v1/** (e.g. vald, mirror, filter, flush, index, insert, remove, search, update, upsert swagger files): Expanded API endpoint summaries with detailed overviews, status code tables, troubleshooting sections, and corrected documentation typos.
  • k8s/**/deployment.yaml (e.g. discoverer, gateway/lb, gateway/mirror, index/operator, manager/index): Updated ConfigMap checksum annotations to reflect configuration changes.
  • versions/*, go.mod, example/client/go.mod: Updated or downgraded various dependency and tool version numbers.
  • .gitfiles, apis/docs/v1/insert.md, rust/bin/agent/src/handler/object.rs, rust/libs/algorithms/qbg/*: Added new API documentation files, a Rust unit test workflow, and new Rust source files for object handling and algorithm support.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant gRPCClient
    participant ConnectionPool
    Client->>gRPCClient: Call ConnectedAddrs(ctx)
    gRPCClient->>ConnectionPool: rangeConns(ctx, force)
    ConnectionPool-->>gRPCClient: Return connection list
    gRPCClient-->>Client: Return addresses

Suggested labels

size/S, actions/backport/release/v1.7

Suggested reviewers

  • kmrmt
  • vankichi
  • datelier

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 golangci-lint (1.62.2)

level=warning msg="[config_reader] The configuration option run.skip-dirs is deprecated, please use issues.exclude-dirs."
level=warning msg="[config_reader] The configuration option output.format is deprecated, please use output.formats"
level=warning msg="[config_reader] The configuration option linters.govet.check-shadowing is deprecated. Please enable shadow instead, if you are not using enable-all."
Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.0)
Failed executing command with error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.0)


vdaas-ci (Collaborator) commented Feb 3, 2025

[CHATOPS:HELP] ChatOps commands.

  • 🙆‍♀️ /approve - approve
  • 🍱 /format - format codes and add licenses
  • /gen-test - generate test codes
  • 🏷️ /label - add labels
  • 🏷️ /label actions/e2e-deploy - run E2E deploy & integration test

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
internal/net/grpc/pool/pool.go (2)

143-160: Fallback initial connection dial.
The fallback logic for re-dialing on the same or a new port is well-structured. Consider extracting this repeated dial fallback into a helper function if it grows more complex.


437-464: New Reconnect method logic.
The flow checks if the pool is closing or already healthy, then conditionally triggers fresh DNS resolution. This seems correct; however, ensure that concurrent calls to Reconnect won’t lead to inconsistent states (e.g., changing reconnectHash while another goroutine is also resolving DNS). Overall, a good expansion of functionality.
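
One lightweight way to guard against overlapping Reconnect calls, sketched here with the standard library only (this is an illustration, not the project's actual implementation), is a compare-and-swap gate so that only one goroutine performs DNS resolution and mutates reconnectHash at a time:

package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

type pool struct {
	reconnecting  atomic.Bool
	mu            sync.Mutex
	reconnectHash string
}

// Reconnect lets exactly one caller run the resolution path; concurrent
// callers return immediately instead of racing on reconnectHash.
func (p *pool) Reconnect(ctx context.Context) error {
	if !p.reconnecting.CompareAndSwap(false, true) {
		return nil // another goroutine is already reconnecting
	}
	defer p.reconnecting.Store(false)

	p.mu.Lock()
	p.reconnectHash = "freshly-resolved" // placeholder for the real DNS result
	p.mu.Unlock()
	return nil
}

func main() {
	var p pool
	fmt.Println(p.Reconnect(context.Background()))
}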

pkg/gateway/lb/handler/grpc/aggregation.go (1)

108-108: Consider using structured logging with error context.

The added debug logging is good for visibility, but consider using structured logging to include additional context like target address, request ID, etc. This will make debugging easier by providing more context around the errors.

Example:

-log.Debug(err)
+log.Debug("aggregation search error",
+    "error", err,
+    "target", target,
+    "request_id", bcfg.GetRequestId())

Also applies to: 120-120, 177-177, 189-189

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aee34f9 and 3b75a31.

📒 Files selected for processing (12)
  • Makefile.d/k3d.mk (3 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (12 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/K3S_VERSION (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • versions/K3S_VERSION
  • pkg/gateway/mirror/service/mirror.go
🔇 Additional comments (36)
internal/net/grpc/pool/pool.go (17)

23-40: Imports look consistent.
These newly added imports (maps, codes, status, singleflight) seem fitting for the added functionalities (maps.Clone usage, gRPC error codes, singleflight to handle concurrency). No issues spotted.


50-54: Context-aware method signatures.
Introducing context parameters into Disconnect and Get helps ensure clean cancellations and timeouts. This update aligns well with modern best practices.


79-79: Singleflight for connection concurrency control.
Defining group singleflight.Group[Conn] is a neat approach for preventing duplicate connections and consolidating in-flight requests.
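
For readers unfamiliar with the pattern, the sketch below uses golang.org/x/sync/singleflight (the non-generic ancestor of the generic Group[Conn] used here) to show how concurrent dials to the same address collapse into a single in-flight call; the names and the dial stand-in are illustrative only:

package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// dialOnce ensures that concurrent requests for the same address share a
// single dial; every caller receives the same result.
func dialOnce(addr string) (string, error) {
	v, err, shared := group.Do(addr, func() (interface{}, error) {
		return "conn-to-" + addr, nil // stand-in for an expensive grpc dial
	})
	if err != nil {
		return "", err
	}
	_ = shared // true when the result was reused by another caller
	return v.(string), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c, _ := dialOnce("10.0.0.1:8081")
			fmt.Println(c)
		}()
	}
	wg.Wait()
}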


93-94: Shared metrics map under locking.
Using mu to guard the metrics map ensures thread safety. Confirm consistent locking around every access to avoid data races.


107-107: Instantiating singleflight.
Initializing p.group = singleflight.New[Conn]() on creation is helpful for grouping connection calls. Good addition.


193-193: Debug log usage.
The debug log here is straightforward. No issues noted.


241-241: Lock usage within store method.
Acquiring a write lock for the pool store operation is correct. Ensures correct pointer updates.


355-356: Additional debug message.
Just a debug statement; no functional impact.


368-370: Check if pool is closing before singleTargetConnect.
Short-circuiting when p == nil || p.closing.Load() helps avoid redundant dial attempts. Overall logic is sound.

Also applies to: 376-376


466-467: Context-aware disconnection.
Providing a ctx argument to Disconnect is a nice extension for controlled shutdowns.


494-494: Dial debug logging.
This addition is harmless and helpful for tracing dial attempts.


555-559: Calculation of unhealthy connections.
Assigning unhealthy = pl - cnt implies “remaining connections are unhealthy.” If cnt ever surpasses pl, this might yield a negative result. Verify that such a scenario cannot occur, or add a safeguard if needed.
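
A minimal sketch of the kind of safeguard suggested above (illustrative only, not the project's code) would clamp the derived value so a miscount can never go negative:

package pool

// unhealthyCount derives the number of unhealthy connections from the
// pool length and the healthy count, clamping at zero as a safeguard.
func unhealthyCount(poolLen, healthyCnt int) int {
	if healthyCnt >= poolLen {
		return 0
	}
	return poolLen - healthyCnt
}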


604-633: Enhanced Do method error recovery flow.
• Incorporating status checks and re-closing the connection if it’s canceled or fails is a solid approach to preserve resilience.
• The subsequent refreshConn call (line 626) properly attempts a retry.
Looks reasonable overall.


640-644: Get method with a context parameter.
Using getHealthyConn matches the “do once, fail fast” logic. Good step for context awareness.


795-843: poolConn.Close with delayed shutdown logic.
The approach thoroughly checks connection states, waiting for a stable state or the context to expire. Good for graceful closings.


845-872: isHealthy extended logging.
The function properly triggers conn.Connect() for IDLE states and recurses upon state changes. This handles transient states robustly.
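
As a rough illustration of the state handling described here, the following is a sketch against google.golang.org/grpc's connectivity states (not the actual Vald helper): an IDLE connection is nudged with Connect(), and READY, CONNECTING, or a freshly kicked IDLE state is treated as healthy while TransientFailure and Shutdown are not:

package health

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// isHealthy reports whether conn is usable, kicking IDLE connections so
// they start re-establishing instead of being reported as dead.
func isHealthy(ctx context.Context, conn *grpc.ClientConn) bool {
	if conn == nil {
		return false
	}
	switch conn.GetState() {
	case connectivity.Ready:
		return true
	case connectivity.Idle:
		conn.Connect() // trigger reconnection, then treat as transiently healthy
		return true
	case connectivity.Connecting:
		return true
	default: // TransientFailure, Shutdown
		return false
	}
}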


874-883: Metrics function.
Exposing a safe clone of the metrics map is a good pattern to avoid concurrency issues with external callers.
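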

internal/net/grpc/client.go (7)

93-93: Method now accepts context.
Switching ConnectedAddrs() to be context-aware aligns with gRPC best practices, enabling cancellation/timeouts during address checking.


253-253: Using rangeConns with force=true for internal loops.
These lines trigger a forced iteration over connections for rebalancing/health checks. Ensure concurrency is well-managed when multiple rebalances happen simultaneously.

Also applies to: 289-289


418-418: Non-forced rangeConns calls.
Calling rangeConns(ctx, false, ...) indicates partial reconnection logic only if connections are unhealthy. This approach is consistent with your incremental healing design.

Also applies to: 481-481


568-568: Checking IsHealthy before usage.
The approach to skip or reconnect invalid pools is correct. Logging warns about unreachable addresses, which is helpful for debugging.

Also applies to: 637-637


703-703: RoundRobin approach calls rangeConns.
Applying rangeConns(ctx, false, ...) is a valid pattern for enumerating connections until one is healthy. No concerns here.


1088-1101: ConnectedAddrs now uses rangeConns.
Implementation filters out only healthy connections, returning an updated slice of addresses. Helps external callers handle connection states.


1121-1149: New internal helper rangeConns.
force param usage for forced iteration vs. selective iteration is a nice design.
• The check that re-connects unhealthy pool objects ensures self-healing.
Looks cohesive.
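
To make the force/selective distinction concrete, here is a small self-contained sketch of the same iteration pattern (all names are invented for illustration and are not Vald's internal API): force visits every entry, while the selective mode tries to heal unhealthy entries first and skips those that stay down.

package main

import "fmt"

type conn struct{ healthy bool }

// reconnect is a stand-in for the real pool.Conn reconnection logic.
func (c *conn) reconnect() bool { c.healthy = true; return c.healthy }

// rangeConns visits connections; with force=false it attempts to heal
// unhealthy ones and skips them if that fails.
func rangeConns(conns map[string]*conn, force bool, fn func(addr string, c *conn) bool) error {
	var visited int
	for addr, c := range conns {
		if !force && !c.healthy && !c.reconnect() {
			continue // skip connections that stay unhealthy
		}
		visited++
		if !fn(addr, c) {
			break
		}
	}
	if visited == 0 {
		return fmt.Errorf("no gRPC client connection found")
	}
	return nil
}

func main() {
	conns := map[string]*conn{"a:8081": {healthy: true}, "b:8081": {healthy: false}}
	_ = rangeConns(conns, false, func(addr string, c *conn) bool {
		fmt.Println("visiting", addr)
		return true
	})
}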

Makefile.d/k3d.mk (3)

20-20: New K3D_NETWORK variable.
Declaring K3D_NETWORK = bridge is straightforward; no issues.


33-34: Customized install directory.
Using K3D_INSTALL_DIR=$(BINDIR) and ensuring chmod a+x is performed afterwards looks good. This keeps binaries organized.


46-47: Eviction policy arguments for K3S.
These additional kubelet arguments help manage disk usage thresholds. The usage is correct for the “agent” node contexts. No issues spotted.

internal/test/mock/grpc/grpc_client_mock.go (1)

54-54: LGTM! Context parameter properly added.

The method signature update is consistent with the interface changes across the codebase.

internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! New metrics for pool connection health.

Clear and descriptive metric name and description for tracking pool connection health.


69-77: LGTM! Proper metric view configuration.

The view is correctly configured with sum aggregation, which is appropriate for connection count metrics.


83-102: LGTM! Well-implemented metrics registration.

The implementation properly:

  • Creates an observable gauge
  • Handles registration errors
  • Sets up callback for metrics collection
internal/backoff/backoff.go (1)

244-244: LGTM! Improved metrics cloning.

Good use of maps.Clone for a more concise and efficient implementation.
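
maps.Clone is part of the standard library's maps package since Go 1.21. For context, a minimal sketch of the snapshot-under-lock pattern being praised here; the variable names are illustrative, not the backoff package's own:

package main

import (
	"fmt"
	"maps"
	"sync"
)

var (
	mu      sync.RWMutex
	metrics = map[string]int64{"agent-0": 3, "agent-1": 2}
)

// snapshot returns a copy so callers can read it without holding the lock
// and without observing later mutations of the shared map.
func snapshot() map[string]int64 {
	mu.RLock()
	defer mu.RUnlock()
	return maps.Clone(metrics)
}

func main() {
	fmt.Println(snapshot())
}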

internal/net/grpc/client_test.go (1)

3110-3110: LGTM! Context parameter additions improve consistency.

The addition of context parameters to ConnectedAddrs and rangeConns methods aligns with the context-aware design pattern used throughout the codebase. This change enables better control over cancellation and timeouts.

Also applies to: 3306-3306

internal/net/grpc/pool/pool_test.go (3)

2347-2347: LGTM! Context parameter added correctly.

The change to add context.Background() to the Disconnect method call aligns with the updated method signature.


1-14: LGTM! License header is properly maintained.

The Apache 2.0 license header is correctly formatted and includes the appropriate copyright information.


18-108: Implement missing test cases.

The file contains extensive test case structures but most are commented out with TODO markers. Consider implementing these test cases to ensure proper coverage of the pool functionality, especially for the updated Disconnect method.

Let's check the test coverage:

Also applies to: 110-265, 267-422, 424-584, 586-744, 746-909, 911-1061, 1063-1213, 1215-1360, 1362-1531, 1533-1696, 1698-1866, 1868-2035, 2037-2201, 2203-2352, 2354-2522, 2524-2684, 2686-2849, 2851-3015, 3017-3187, 3189-3339, 3341-3491, 3493-3657, 3659-3823, 3825-3975, 3977-4127, 4129-4232, 4234-4321

@kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 3b75a31 to ac007bc on February 3, 2025 08:38

cloudflare-workers-and-pages bot commented Feb 3, 2025

Deploying vald with Cloudflare Pages

Latest commit: a8c444f
Status: ✅  Deploy successful!
Preview URL: https://ecadd6d8.vald.pages.dev
Branch Preview URL: https://bugfix-internal-grpc-add-hea.vald.pages.dev

View logs

@kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from ac007bc to e317f7a on February 3, 2025 09:02
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
internal/observability/metrics/grpc/grpc.go (1)

81-103: Consider error handling improvements in the callback function.

The callback function for observing pool connection metrics could benefit from more robust error handling.

Consider this improvement:

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
         return err
     }
     _, err = m.RegisterCallback(
         func(ctx context.Context, o api.Observer) error {
             ms := pool.Metrics(ctx)
             if len(ms) == 0 {
                 return nil
             }
+            var errs error
             for name, cnt := range ms {
-                o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
+                if err := o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name))); err != nil {
+                    errs = errors.Join(errs, fmt.Errorf("failed to observe metrics for %s: %w", name, err))
+                }
             }
-            return nil
+            return errs
         }, healthyConn,
     )
     return err
 }
internal/net/grpc/pool/pool.go (2)

92-95: Consider using sync.Map for metrics.

The current implementation uses a global map with mutex, which could be replaced with sync.Map for better concurrent access patterns.

Consider this improvement:

-var (
-    mu      sync.RWMutex
-    metrics map[string]int64 = make(map[string]int64)
-)
+var metrics sync.Map // stores int64 healthy-connection counts keyed by addr (sync.Map is not generic)

466-492: Consider adding metrics for disconnection events.

The Disconnect method could benefit from tracking disconnection events in metrics.

Consider adding a new metric for tracking disconnection events:

 func (p *pool) Disconnect(ctx context.Context) (err error) {
     log.Debug("Disconnecting...")
     p.closing.Store(true)
     defer p.closing.Store(false)
     emap := make(map[string]error, p.len())
+    disconnectedConns := 0
     err = p.loop(ctx, func(ctx context.Context, _ int, pc *poolConn) bool {
         if pc != nil && pc.conn != nil {
             ierr := pc.conn.Close()
             if ierr != nil {
                 if !errors.Is(ierr, context.DeadlineExceeded) &&
                     !errors.Is(ierr, context.Canceled) {
                     log.Debugf("failed to close connection pool addr = %s\terror = %v", pc.addr, ierr)
                     emap[ierr.Error()] = err
                 } else {
                     log.Debugf("Disconnect loop operation canceled while closing pool member connection to %s,\terror: %v", pc.addr, ierr)
                     return false
                 }
             }
+            disconnectedConns++
         }
         return true
     })
     p.flush()
+    mu.Lock()
+    metrics[p.addr] = 0 // Reset metrics for this pool
+    mu.Unlock()
     for _, e := range emap {
         err = errors.Join(err, e)
     }
     return err
 }
internal/net/grpc/client.go (1)

1128-1140: Consider adding retry mechanism for connection recovery.

The connection recovery logic could benefit from a retry mechanism with backoff.

Consider this improvement:

 if p == nil || !p.IsHealthy(ctx) {
+    var retries int
+    maxRetries := 3
+    backoffDuration := time.Second
+    for retries < maxRetries {
         pc, err := p.Connect(ctx)
         if pc == nil || err != nil || !pc.IsHealthy(ctx) {
             if pc != nil {
                 if derr := pc.Disconnect(ctx); derr != nil {
                     log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
                 }
             }
+            retries++
+            if retries < maxRetries {
+                select {
+                case <-ctx.Done():
+                    return true
+                case <-time.After(backoffDuration * time.Duration(retries)):
+                    continue
+                }
+            }
             log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
             return true
         }
         p = pc
+        break
+    }
 }
internal/net/grpc/pool/pool_test.go (1)

2347-2347: Update test to include context parameter.

The test is calling p.Disconnect(context.Background()) with a default background context. Consider adding test cases that verify context cancellation behavior.

Add test cases to verify:

  • Context cancellation during disconnect
  • Context timeout handling
  • Context propagation to underlying connections
-err := p.Disconnect(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+defer cancel()
+err := p.Disconnect(ctx)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b75a31 and e317f7a.

📒 Files selected for processing (12)
  • Makefile.d/k3d.mk (3 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (12 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/K3S_VERSION (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • versions/K3S_VERSION
  • internal/backoff/backoff.go
  • internal/test/mock/grpc_testify_mock.go
  • pkg/gateway/mirror/service/mirror.go
  • internal/test/mock/grpc/grpc_client_mock.go
  • Makefile.d/k3d.mk
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • internal/net/grpc/client_test.go
⏰ Context from checks skipped due to timeout of 90000ms (93)
  • GitHub Check: build / build (86 checks)
  • GitHub Check: build / dump-contexts-to-log (7 checks)
🔇 Additional comments (9)
internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! Clear and descriptive metric names.

The new metric names and descriptions are clear and follow the existing naming pattern.


37-39: LGTM! Well-structured struct update.

The addition of poolTargetAddrKey field with a clear purpose enhances the struct's functionality for tracking pool connection metrics.


69-77: LGTM! Consistent view registration.

The new view for pool connection metrics follows the same pattern as existing views and uses appropriate aggregation type (Sum) for counting connections.

internal/net/grpc/pool/pool.go (2)

586-589: LGTM! Thread-safe metrics update.

The metrics update is properly protected by mutex lock.


873-881: LGTM! Clean metrics retrieval implementation.

The Metrics function safely clones and returns the metrics map.

internal/net/grpc/client.go (2)

1121-1148: LGTM! Improved connection health checking.

The rangeConns method now properly handles connection health checks and includes force parameter for different use cases.


1088-1100: LGTM! Context-aware ConnectedAddrs implementation.

The ConnectedAddrs method now properly uses context for health checks.

internal/net/grpc/pool/pool_test.go (2)

16-108: Implement missing test cases for New function.

The test cases for the New function are commented out with TODO placeholders. Since this PR adds health check functionality, it's important to have test coverage for pool initialization.

Run this script to check if there are any existing tests for the New function:

Would you like me to help implement test cases for the New function that verify:

  • Pool initialization with health check configuration
  • Error cases for invalid health check parameters
  • Default health check behavior

4234-4320: Add test cases for health check functionality.

The isHealthy test function is crucial for this PR's health check feature but lacks test implementations.

Run this script to check if there are any existing health check tests:

Would you like me to help implement test cases that verify:

  • Health check success/failure scenarios
  • Health check timeout handling
  • Health check error conditions
  • Health check metrics collection

coderabbitai bot previously approved these changes Feb 3, 2025
@kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch 2 times, most recently from 716f8e0 to a453411 on February 3, 2025 09:16
codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Project coverage is 16.91%. Comparing base (29272c4) to head (a8c444f).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
internal/test/mock/grpc_testify_mock.go 0.00% 2 Missing ⚠️
internal/test/mock/grpc/grpc_client_mock.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2828   +/-   ##
=======================================
  Coverage   16.91%   16.91%           
=======================================
  Files         136      136           
  Lines       15057    15057           
=======================================
  Hits         2547     2547           
  Misses      12213    12213           
  Partials      297      297           

☔ View full report in Codecov by Sentry.

@kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch 2 times, most recently from 8759e30 to 9ecb684 on February 3, 2025 10:00
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
internal/net/grpc/client_test.go (1)

1-3490: Test Implementation Incompleteness – Add Test Cases

It appears that the test file contains extensive scaffolding with 38 TODO markers and 38 placeholder "test_case_X" sections. Although the placeholders are present, no actual test logic is implemented. To ensure thorough coverage, we need to fill these scaffolds with meaningful tests addressing:

• Validating gRPC client connectivity, including context parameter changes
• Testing both happy paths and error scenarios
• Covering concurrent and ordered operations as defined in methods such as Range, RangeConcurrent, ConnectedAddrs, and rangeConns

Would you like assistance in generating concrete test implementations for these sections?

🔗 Analysis chain

Add test cases for the TODO sections.

The test file has extensive test case structures defined but most are commented out and marked as TODO. This indicates incomplete test coverage which should be addressed.

Let's verify the test coverage:

Would you like me to help generate test cases for the TODO sections? I can provide implementations that cover:

  • Happy path scenarios
  • Error cases
  • Edge cases
  • Concurrent execution scenarios
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Count total number of test functions vs implemented test cases
echo "Test functions with TODO markers:"
rg -c "TODO test cases" internal/net/grpc/client_test.go

echo "Actual implemented test cases:"
rg -c "test_case_[0-9]+" internal/net/grpc/client_test.go

Length of output: 272

🧹 Nitpick comments (6)
internal/observability/metrics/grpc/grpc.go (1)

81-103: Consider adding error handling for empty metrics.

The implementation looks good but could benefit from additional error handling.

Consider adding error handling when ms is nil:

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
         return err
     }
     _, err = m.RegisterCallback(
         func(ctx context.Context, o api.Observer) error {
             ms := pool.Metrics(ctx)
-            if len(ms) == 0 {
+            if ms == nil || len(ms) == 0 {
                 return nil
             }
             for name, cnt := range ms {
                 o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
             }
             return nil
         }, healthyConn,
     )
     return err
 }
internal/net/grpc/pool/pool.go (1)

800-848: Consider adding timeout validation.

The connection closing logic could benefit from timeout validation.

Consider validating the delay parameter:

 func (pc *poolConn) Close(ctx context.Context, delay time.Duration) error {
+    if delay <= 0 {
+        return errors.New("delay must be positive")
+    }
     tdelay := delay / 10
     if tdelay < time.Millisecond*5 {
         tdelay = time.Millisecond * 5
internal/net/grpc/client.go (1)

1121-1148: Consider adding connection retry limit.

The connection retry logic in rangeConns could benefit from a retry limit to prevent infinite loops.

Consider adding a retry limit:

 func (g *gRPCClient) rangeConns(ctx context.Context, force bool, fn func(addr string, p pool.Conn) bool) error {
     var cnt int
+    const maxRetries = 3
+    retries := 0
     g.conns.Range(func(addr string, p pool.Conn) bool {
         if force {
             cnt++
             return fn(addr, p)
         }
         if p == nil || !p.IsHealthy(ctx) {
+            if retries >= maxRetries {
+                log.Debugf("Max retries reached for %s", addr)
+                return true
+            }
+            retries++
             pc, err := p.Connect(ctx)
internal/net/grpc/client_test.go (2)

87-87: Consider disabling parallel test execution.

The tests use t.Parallel() but given the shared nature of gRPC connections and the potential for race conditions, parallel execution may lead to flaky tests.

Consider removing t.Parallel() or implementing proper test isolation through:

  • Unique connection pools per test
  • Isolated test servers
  • Proper cleanup between tests

Also applies to: 441-441, 632-632, 824-824, 1019-1019, 1212-1212, 1408-1408, 1611-1611, 1786-1786, 2138-2138, 2322-2322, 2518-2518, 2707-2707, 3071-3071


87-103: Add goleak cleanup in TestMain.

While individual tests use goleak.VerifyNone(), it would be better to centralize this in a TestMain function.

Add a TestMain function:

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m,
        goleak.IgnoreCurrent(),
        // Add other specific goroutine ignore rules
    )
}
internal/net/grpc/pool/pool_test.go (1)

45-46: Implement test cases for the pool operations.

The test file has a well-structured test table setup but lacks actual test cases. I can help implement test cases for various pool operations like connection management, health checks, and reconnection logic.

Would you like me to help generate test cases for any specific pool operations? For example, I can start with:

  1. Connection initialization tests
  2. Health check tests
  3. Reconnection tests
  4. Error handling tests

Also applies to: 147-148, 304-305, 466-467, 624-625, 789-790, 949-950, 1101-1102, 1248-1249, 1407-1408, 1579-1580, 1746-1747, 1915-1916, 2083-2084, 2241-2242, 2402-2403, 2566-2567, 2729-2730, 2897-2898, 3065-3066, 3227-3228, 3379-3380, 3539-3540, 3704-3705, 3863-3864, 4015-4016, 4157-4158, 4257-4258

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e317f7a and c45bf8a.

📒 Files selected for processing (13)
  • Makefile.d/k3d.mk (3 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (12 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (26 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/K3S_VERSION (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • pkg/gateway/mirror/service/mirror.go
  • internal/test/mock/grpc_testify_mock.go
  • internal/test/mock/grpc/grpc_client_mock.go
  • Makefile.d/k3d.mk
  • versions/K3S_VERSION
  • internal/backoff/backoff.go
  • pkg/gateway/lb/handler/grpc/aggregation.go
🔇 Additional comments (9)
internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! Clear and descriptive metric naming.

The new metric for tracking pool connections has a clear name and description that accurately reflects its purpose.


37-39: LGTM! Good field naming and initialization.

The poolTargetAddrKey field is appropriately named and initialized with a descriptive value in the New function.


69-77: LGTM! Proper metric view configuration.

The view for pool connections is correctly configured with AggregationSum which is appropriate for counting healthy connections.

internal/net/net.go (1)

199-199: LGTM! Improved IP address logging.

Using StringExpanded() provides a more detailed representation of IP addresses in logs, which is helpful for debugging.
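
For context, a small sketch of the difference (this is standard net/netip behaviour, not Vald-specific code):

package main

import (
	"fmt"
	"net/netip"
)

func main() {
	addr := netip.MustParseAddr("2001:db8::1")
	fmt.Println(addr.String())         // 2001:db8::1
	fmt.Println(addr.StringExpanded()) // 2001:0db8:0000:0000:0000:0000:0000:0001
}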

internal/net/grpc/pool/pool.go (2)

544-607: LGTM! Robust health check implementation.

The health check implementation is thorough and handles various edge cases appropriately:

  • Checks connection state
  • Handles reconnection attempts
  • Updates metrics
  • Proper error handling

92-95: 🛠️ Refactor suggestion

Consider using a more thread-safe metrics initialization.

The global metrics map initialization could be improved for thread safety.

Consider using sync.Once for thread-safe initialization:

 var (
     mu      sync.RWMutex
-    metrics map[string]int64 = make(map[string]int64)
+    metrics map[string]int64
+    metricsOnce sync.Once
 )

+func init() {
+    metricsOnce.Do(func() {
+        metrics = make(map[string]int64)
+    })
+}

Likely invalid or redundant comment.

internal/net/grpc/client.go (1)

1088-1100: LGTM! Proper context handling in ConnectedAddrs.

The implementation correctly uses context for connection health checks and properly handles errors.

internal/net/grpc/pool/pool_test.go (2)

2347-2347: LGTM! The context parameter addition is consistent.

The change to add context.Background() parameter to p.Disconnect() call aligns with the goal of improving context handling in gRPC connections.


16-108: Implement the TODO test cases to ensure comprehensive test coverage.

The file contains numerous test function stubs with TODO placeholders. Given that this is a gRPC connection pool implementation, having comprehensive test coverage is crucial for ensuring reliability and proper connection handling.

Run the following script to check test coverage:

Also applies to: 109-265, 266-422, 423-584, 585-744, 745-909, 910-1061, 1062-1213, 1214-1360, 1361-1531, 1532-1697, 1698-1866, 1867-2035, 2036-2201, 2202-2353, 2354-2522, 2523-2684, 2685-2849, 2850-3015, 3016-3187, 3188-3339, 3340-3491, 3492-3657, 3658-3823, 3824-3975, 3976-4127, 4128-4232, 4233-4321

@kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch 4 times, most recently from 278a532 to 4247d97 on February 4, 2025 07:14
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
internal/observability/metrics/grpc/grpc.go (1)

81-104: Consider adding error handling for callback registration.

The error from RegisterCallback is correctly propagated, but consider adding specific error handling for the callback function itself to prevent potential panics.

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
 	// The metrics are dynamically registered at the grpc server interceptor package,
 	healthyConn, err := m.Int64ObservableGauge(
 		poolConnMetricsName,
 		metrics.WithDescription(poolConnMetricsDescription),
 		metrics.WithUnit(metrics.Dimensionless),
 	)
 	if err != nil {
 		return err
 	}
 	_, err = m.RegisterCallback(
 		func(ctx context.Context, o api.Observer) error {
 			ms := pool.Metrics(ctx)
 			if len(ms) == 0 {
 				return nil
 			}
+			var errs error
 			for name, cnt := range ms {
-				o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
+				if err := o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name))); err != nil {
+					errs = errors.Join(errs, fmt.Errorf("failed to observe metrics for %s: %w", name, err))
+				}
 			}
-			return nil
+			return errs
 		}, healthyConn,
 	)
 	return err
 }
internal/net/grpc/pool/pool.go (2)

92-95: Consider using sync.Map for metrics.

The global metrics map with mutex could potentially become a bottleneck under high concurrency. Consider using sync.Map for better performance.

-var (
-	mu      sync.RWMutex
-	metrics map[string]int64 = make(map[string]int64)
-)
+var metrics sync.Map // stores int64 healthy-connection counts keyed by addr (sync.Map is not generic)

592-594: Consider batching metrics updates.

The metrics update in IsHealthy could be optimized by batching updates or using atomic operations.

-mu.Lock()
-metrics[p.addr] = int64(pl - unhealthy)
-mu.Unlock()
+metrics.Store(p.addr, int64(pl-unhealthy)) // assuming the sync.Map form suggested above
internal/net/grpc/client.go (1)

1129-1156: Consider adding timeout for health checks.

The rangeConns method should consider adding a timeout for health checks to prevent long-running operations.

 func (g *gRPCClient) rangeConns(ctx context.Context, force bool, fn func(addr string, p pool.Conn) bool) error {
+	// Add timeout for health checks
+	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
+	defer cancel()
+
 	var cnt int
 	g.conns.Range(func(addr string, p pool.Conn) bool {
 		if force {
 			cnt++
 			return fn(addr, p)
 		}
 		if p == nil || !p.IsHealthy(ctx) {
 			pc, err := p.Connect(ctx)
 			if pc == nil || err != nil || !pc.IsHealthy(ctx) {
 				if pc != nil {
 					if derr := pc.Disconnect(ctx); derr != nil {
 						log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
 					}
 				}
 				log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
 				return true
 			}
 			p = pc
 		}
 		cnt++
 		return fn(addr, p)
 	})
 	if cnt == 0 {
 		return errors.ErrGRPCClientConnNotFound("*")
 	}
 	return nil
 }
internal/net/grpc/client_test.go (1)

3110-3110: Add timeout to context in test.

When testing network operations, it's recommended to use context with timeout to prevent tests from hanging indefinitely.

-gotAddrs := g.ConnectedAddrs(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+defer cancel()
+gotAddrs := g.ConnectedAddrs(ctx)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c45bf8a and 9f83fdd.

📒 Files selected for processing (15)
  • Makefile (2 hunks)
  • Makefile.d/functions.mk (2 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (16 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (26 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/K3S_VERSION (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • versions/K3S_VERSION
  • pkg/gateway/mirror/service/mirror.go
  • internal/backoff/backoff.go
  • internal/net/net.go
  • Makefile.d/k3d.mk
  • internal/test/mock/grpc/grpc_client_mock.go
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • internal/test/mock/grpc_testify_mock.go
  • internal/net/grpc/pool/pool_test.go
🔇 Additional comments (13)
Makefile.d/functions.mk (1)

62-67: LGTM: CGO flags are properly defined and ordered.

The CGO flags are correctly defined with proper ordering, ensuring that all necessary flags (CFLAGS, CXXFLAGS, FFLAGS) are set before they're used in CGO-specific variables.

Makefile (4)

161-164: Excellent optimization and security flags configuration.

The new COMMON_FLAGS variable provides a comprehensive set of optimization and security flags:

  • Performance: -Ofast, -march=native, -mtune=native, -flto=auto
  • Security: -fvisibility=hidden
  • Size optimization: -ffunction-sections, -fdata-sections

165-167: LGTM: Language standards are properly set.

The language standards are correctly set for each compiler:

  • C: gnu23
  • C++: gnu++23
  • Fortran: f2018

181-182: LGTM: Clean flag handling for arm64.

The flags for arm64 architecture are cleanly handled without unnecessary modifications.


176-178: Verify AVX-512 disabling flags for NGT on amd64.

The NGT_FLAGS disable AVX-512 instructions which might affect performance on newer CPUs that support these instructions.

Run this script to check if the target system supports AVX-512:

internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! Clear and descriptive metric names.

The new metric names and descriptions are clear and accurately describe their purpose.


37-39: LGTM! Well-structured struct update.

The addition of poolTargetAddrKey field to grpcServerMetrics struct is appropriate for storing the target address key.


69-77: LGTM! Appropriate metric view configuration.

The new view for pool connection metrics is correctly configured with sum aggregation, which is appropriate for counting healthy connections.

internal/net/grpc/pool/pool.go (1)

879-887: LGTM! Clean metrics retrieval implementation.

The Metrics function correctly uses maps.Clone to return a copy of the metrics map, preventing external modifications.

internal/net/grpc/client.go (2)

1096-1108: LGTM! Context-aware connection listing.

The ConnectedAddrs method now correctly uses context for health checks and connection listing.


1115-1124: LGTM! Proper context handling in Close method.

The Close method now correctly handles context cancellation during disconnection.

internal/net/grpc/client_test.go (2)

3306-3308: LGTM! Context parameter correctly added.

The test args struct has been updated to include the context parameter as previously requested.


20-105: Implement test cases for better coverage.

The test file contains extensive boilerplate but most test cases are commented out with TODO placeholders. Consider implementing these test cases to ensure proper test coverage of the gRPC client functionality.

Let's check the current test coverage:

Would you like me to help generate test cases for the key functionality?

Also applies to: 159-246, 346-435, 535-627, 728-819, 921-1014, 1118-1207, 1312-1403, 1510-1605, 1702-1781, 1876-1957, 2052-2133, 2228-2317, 2422-2513, 2612-2702, 2802-2891, 2986-3067, 3166-3254, 3352-3440

coderabbitai bot previously approved these changes Feb 4, 2025
@kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 6c78959 to 7ba32c0 on February 4, 2025 09:54
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🧹 Nitpick comments (9)
internal/observability/metrics/grpc/grpc.go (1)

81-103: Improve error handling in callback registration.

While the implementation is solid, consider handling errors from the callback function more explicitly. Currently, if the callback returns an error, it's only returned from the Register method without any logging or cleanup.

Consider this improvement:

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
         return err
     }
     _, err = m.RegisterCallback(
         func(ctx context.Context, o api.Observer) error {
             ms := pool.Metrics(ctx)
             if len(ms) == 0 {
                 return nil
             }
             for name, cnt := range ms {
                 o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
             }
             return nil
         }, healthyConn,
     )
-    return err
+    if err != nil {
+        log.Errorf("Failed to register metrics callback: %v", err)
+        return fmt.Errorf("failed to register metrics callback: %w", err)
+    }
+    return nil
 }
internal/net/grpc/pool/pool.go (3)

92-95: Consider encapsulating metrics storage.

While the implementation is thread-safe, consider encapsulating the metrics storage in a dedicated type to avoid global variables and provide better control over the metrics lifecycle.

Consider this alternative approach:

-var (
-    mu      sync.RWMutex
-    metrics map[string]int64 = make(map[string]int64)
-)

+type metricsStorage struct {
+    mu      sync.RWMutex
+    metrics map[string]int64
+}
+
+func newMetricsStorage() *metricsStorage {
+    return &metricsStorage{
+        metrics: make(map[string]int64),
+    }
+}
+
+var defaultMetricsStorage = newMetricsStorage()

544-607: Enhance logging consistency in health check.

While the health check implementation is robust, the logging could be more structured and consistent. Consider using structured logging with consistent fields across all log messages.

Consider this improvement:

 func (p *pool) IsHealthy(ctx context.Context) (healthy bool) {
     if p == nil || p.closing.Load() {
         return false
     }
-    log.Debug("Checking health...")
+    log.WithFields(log.Fields{
+        "addr": p.addr,
+        "pool_len": p.len(),
+    }).Debug("Starting health check")
     var cnt, unhealthy int
     pl := p.len()
     err := p.loop(ctx, func(ctx context.Context, idx int, pc *poolConn) bool {
         if pc == nil || !isHealthy(ctx, pc.conn) {
             if p.isIP {
                 if pc != nil && pc.addr != "" {
                     err := p.refreshConn(ctx, idx, pc, pc.addr)
                     if err != nil {
-                        unhealthy = pl - cnt
+                        unhealthy = pl - cnt
+                        log.WithFields(log.Fields{
+                            "addr": pc.addr,
+                            "error": err,
+                        }).Debug("Failed to refresh IP connection")
                         return false
                     }
                     return true
                 }
                 unhealthy = pl - cnt
                 return false
             }

879-887: Consider optimizing metrics map cloning.

While the implementation is thread-safe, consider optimizing the map cloning for better performance with large maps. Pre-allocating the map with the exact size would avoid map growth and rehashing.

Consider this improvement:

 func Metrics(context.Context) map[string]int64 {
     mu.RLock()
-    defer mu.RUnlock()
-
     if len(metrics) == 0 {
+        mu.RUnlock()
         return nil
     }
-    return maps.Clone(metrics)
+    result := make(map[string]int64, len(metrics))
+    for k, v := range metrics {
+        result[k] = v
+    }
+    mu.RUnlock()
+    return result
 }
internal/net/grpc/client.go (3)

1129-1158: Enhance error handling in rangeConns.

While the implementation is solid, consider improving error handling by aggregating errors from connection operations and providing more context in error messages.

Consider this improvement:

 func (g *gRPCClient) rangeConns(
     ctx context.Context, force bool, fn func(addr string, p pool.Conn) bool,
 ) error {
     var cnt int
+    var errs []error
     g.conns.Range(func(addr string, p pool.Conn) bool {
         if force {
             cnt++
             return fn(addr, p)
         }
         if p == nil || !p.IsHealthy(ctx) {
             pc, err := p.Connect(ctx)
             if pc == nil || err != nil || !pc.IsHealthy(ctx) {
                 if pc != nil {
                     if derr := pc.Disconnect(ctx); derr != nil {
-                        log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
+                        errs = append(errs, fmt.Errorf("disconnect %s: %w", addr, derr))
                     }
                 }
-                log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
+                log.WithFields(log.Fields{
+                    "addr": addr,
+                    "error": err,
+                    "pool": p.String(),
+                }).Debug("Unhealthy connection detected")
                 return true
             }
             p = pc
         }
         cnt++
         return fn(addr, p)
     })
     if cnt == 0 {
-        return errors.ErrGRPCClientConnNotFound("*")
+        if len(errs) > 0 {
+            return fmt.Errorf("no healthy connections found: %w", errors.Join(errs...))
+        }
+        return errors.ErrGRPCClientConnNotFound("*")
     }
     return nil
 }

1096-1108: Optimize ConnectedAddrs implementation.

While the implementation is correct, consider pre-allocating the slice with the exact capacity to avoid reallocations during append operations.

Consider this improvement:

 func (g *gRPCClient) ConnectedAddrs(ctx context.Context) (addrs []string) {
-    addrs = make([]string, 0, g.conns.Len())
+    // Pre-count healthy connections to allocate exact capacity
+    healthyCount := 0
+    g.rangeConns(ctx, false, func(addr string, p pool.Conn) bool {
+        if p != nil && p.IsHealthy(ctx) {
+            healthyCount++
+        }
+        return true
+    })
+    addrs = make([]string, 0, healthyCount)
     err := g.rangeConns(ctx, false, func(addr string, p pool.Conn) bool {
         if p != nil && p.IsHealthy(ctx) {
             addrs = append(addrs, addr)
         }
         return true
     })
     if err != nil {
         return nil
     }
     return addrs
 }

1115-1124: Consider parallel disconnection in Close.

While the implementation is correct, consider implementing parallel disconnection for faster cleanup of multiple connections.

Consider this improvement:

-select {
-case <-ctx.Done():
-    return false
-default:
-    derr := g.Disconnect(ctx, addr)
-    if derr != nil && !errors.Is(derr, errors.ErrGRPCClientConnNotFound(addr)) {
-        err = errors.Join(err, derr)
-    }
-    return true
-}
+eg, egctx := errgroup.New(ctx)
+eg.SetLimit(runtime.GOMAXPROCS(0)) // Or another suitable concurrency limit
+g.conns.Range(func(addr string, p pool.Conn) bool {
+    addr := addr // Capture for goroutine
+    eg.Go(safety.RecoverFunc(func() error {
+        select {
+        case <-egctx.Done():
+            return egctx.Err()
+        default:
+            if derr := g.Disconnect(egctx, addr); derr != nil &&
+                !errors.Is(derr, errors.ErrGRPCClientConnNotFound(addr)) {
+                return derr
+            }
+            return nil
+        }
+    }))
+    return true
+})
+if err := eg.Wait(); err != nil {
+    return err
+}
apis/swagger/v1/vald/insert.swagger.json (1)

17-18: Fix typo in API documentation.

There's a typo in the summary: "Inset RPC" should be "Insert RPC"

Apply this diff to fix the typo:

-"summary": "Overview\nInset RPC is the method to add a new single vector.\n
+"summary": "Overview\nInsert RPC is the method to add a new single vector.\n
apis/swagger/v1/vald/update.swagger.json (1)

50-52: MultiUpdate Endpoint Documentation Added:
The changes for the /update/multiple endpoint clearly articulate the multi-update functionality and include a helpful notice regarding gRPC message size limitations. For consistency, consider adding more detailed troubleshooting information in future revisions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ba32c0 and 9c82017.

⛔ Files ignored due to path filters (31)
  • apis/grpc/v1/agent/core/agent.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/agent/sidecar/sidecar.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/discoverer/discoverer.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/egress/egress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/ingress/ingress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/meta/meta.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/rpc/errdetails/error_details.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • example/client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (68)
  • .gitfiles (4 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/security_issue_report.md (1 hunks)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • Makefile (2 hunks)
  • Makefile.d/functions.mk (2 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • apis/swagger/v1/mirror/mirror.swagger.json (1 hunks)
  • apis/swagger/v1/vald/filter.swagger.json (8 hunks)
  • apis/swagger/v1/vald/flush.swagger.json (1 hunks)
  • apis/swagger/v1/vald/index.swagger.json (5 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • apis/swagger/v1/vald/object.swagger.json (4 hunks)
  • apis/swagger/v1/vald/remove.swagger.json (3 hunks)
  • apis/swagger/v1/vald/search.swagger.json (8 hunks)
  • apis/swagger/v1/vald/update.swagger.json (3 hunks)
  • apis/swagger/v1/vald/upsert.swagger.json (2 hunks)
  • example/client/go.mod (2 hunks)
  • go.mod (12 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (16 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (26 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • k8s/discoverer/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/lb/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/mirror/deployment.yaml (1 hunks)
  • k8s/index/operator/deployment.yaml (1 hunks)
  • k8s/manager/index/deployment.yaml (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • rust/rust-toolchain (1 hunks)
  • versions/BUF_VERSION (1 hunks)
  • versions/CMAKE_VERSION (1 hunks)
  • versions/DOCKER_VERSION (1 hunks)
  • versions/FAISS_VERSION (1 hunks)
  • versions/GOLANGCILINT_VERSION (1 hunks)
  • versions/HELM_VERSION (1 hunks)
  • versions/K3S_VERSION (1 hunks)
  • versions/KIND_VERSION (1 hunks)
  • versions/KUBECTL_VERSION (1 hunks)
  • versions/NGT_VERSION (1 hunks)
  • versions/OPERATOR_SDK_VERSION (1 hunks)
  • versions/PROMETHEUS_STACK_VERSION (1 hunks)
  • versions/PROTOBUF_VERSION (1 hunks)
  • versions/RUST_VERSION (1 hunks)
  • versions/TELEPRESENCE_VERSION (1 hunks)
  • versions/USEARCH_VERSION (1 hunks)
  • versions/YQ_VERSION (1 hunks)
  • versions/actions/ACTIONS_SETUP_GO (1 hunks)
  • versions/actions/ACTIONS_SETUP_NODE (1 hunks)
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT (1 hunks)
  • versions/actions/CODECOV_CODECOV_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_QEMU_ACTION (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_INIT (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF (1 hunks)
  • versions/actions/GITHUB_ISSUE_METRICS (1 hunks)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_HADOLINT (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL (1 hunks)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (55)
  • versions/HELM_VERSION
  • versions/actions/ACTIONS_SETUP_GO
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST
  • versions/KIND_VERSION
  • versions/actions/ACTIONS_SETUP_NODE
  • versions/OPERATOR_SDK_VERSION
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL
  • versions/KUBECTL_VERSION
  • versions/NGT_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE
  • versions/DOCKER_VERSION
  • versions/YQ_VERSION
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION
  • versions/actions/DOCKER_SETUP_QEMU_ACTION
  • versions/PROTOBUF_VERSION
  • versions/K3S_VERSION
  • versions/actions/CODECOV_CODECOV_ACTION
  • versions/BUF_VERSION
  • versions/TELEPRESENCE_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_INIT
  • versions/actions/REVIEWDOG_ACTION_HADOLINT
  • versions/RUST_VERSION
  • versions/USEARCH_VERSION
  • apis/swagger/v1/vald/flush.swagger.json
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD
  • versions/actions/GITHUB_ISSUE_METRICS
  • versions/PROMETHEUS_STACK_VERSION
  • versions/GOLANGCILINT_VERSION
  • rust/rust-toolchain
  • .github/ISSUE_TEMPLATE/bug_report.md
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT
  • versions/FAISS_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF
  • pkg/gateway/mirror/service/mirror.go
  • internal/test/mock/grpc_testify_mock.go
  • k8s/index/operator/deployment.yaml
  • Makefile.d/functions.mk
  • internal/test/mock/grpc/grpc_client_mock.go
  • apis/swagger/v1/vald/index.swagger.json
  • internal/backoff/backoff.go
  • k8s/discoverer/deployment.yaml
  • k8s/gateway/gateway/lb/deployment.yaml
  • Makefile.d/k3d.mk
  • versions/CMAKE_VERSION
  • .github/ISSUE_TEMPLATE/security_issue_report.md
  • internal/net/grpc/pool/pool_test.go
  • internal/net/net.go
  • k8s/gateway/gateway/mirror/deployment.yaml
  • k8s/manager/index/deployment.yaml
  • Makefile
  • .github/PULL_REQUEST_TEMPLATE.md
  • example/client/go.mod
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • go.mod
🧰 Additional context used
📓 Learnings (1)
.gitfiles (1)
Learnt from: datelier
PR: vdaas/vald#2807
File: rust/bin/agent/src/handler/object.rs:36-38
Timestamp: 2025-01-22T06:26:59.130Z
Learning: In Rust code for Vald handlers, avoid using unwrap() on Option types. Instead, handle None values explicitly by returning appropriate error statuses to maintain robustness.
🔇 Additional comments (34)
apis/swagger/v1/mirror/mirror.swagger.json (1)

15-18: Enhanced Documentation through Detailed Summary:

The updated summary on line 17 now clearly outlines the endpoint's purpose and includes a structured table of status codes. This approach improves clarity and provides valuable context regarding the expected responses from the endpoint. However, please verify that the multiline formatting and table syntax (using newline characters and pipe symbols) is rendered as expected in your Swagger UI/editor. In some cases, certain tools might not fully support this markdown-like formatting within JSON strings.

.gitfiles (3)

126-126: LGTM! Addition of Rust unit test workflow.

The addition of a dedicated Rust unit test workflow improves the testing coverage for Rust components.


161-163: LGTM! New buf.gen configuration files.

The addition of buf.gen configuration files for documentation generation enhances the project's documentation tooling.


164-177: LGTM! Comprehensive API documentation.

The addition of multiple API documentation files improves the project's documentation coverage and maintainability.

internal/observability/metrics/grpc/grpc.go (3)

16-24: LGTM!

The imports and struct changes are well-organized and necessary for implementing the pool connection metrics functionality.

Also applies to: 37-39


33-34: LGTM!

The metric constants are well-named and have clear descriptions.


69-77: LGTM!

The pool connection metrics view is properly configured with sum aggregation, which is appropriate for tracking the count of healthy connections.

internal/net/grpc/client_test.go (2)

3110-3110: Verify context handling in ConnectedAddrs test.

The test should verify context cancellation and timeout scenarios.

Apply this diff to improve context handling:

-gotAddrs := g.ConnectedAddrs(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+defer cancel()
+gotAddrs := g.ConnectedAddrs(ctx)

3306-3308: 🛠️ Refactor suggestion

Update rangeConns test to include context parameter.

The test function signature has been updated but test cases need to be updated accordingly.

Apply this diff to update the test:

type args struct {
+    ctx context.Context
     fn func(addr string, p pool.Conn) bool
}
✅ Verification successful

Test Case Update Required: Update the args struct to include the context parameter.

The test case in internal/net/grpc/client_test.go still contains commented-out code where the ctx context.Context field is present. This confirms that after the signature change, the test cases haven’t been updated to include the context parameter as expected. Please apply the provided diff to uncomment (or add) the ctx context.Context field within the args struct so that it aligns with the updated function signature.

  • Location: internal/net/grpc/client_test.go (lines around 3306–3308)
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts or web queries found


Script:

#!/bin/bash
# List occurrences of the 'args' struct definition in the test file to inspect context usage.
echo "Occurrences of 'type args' in internal/net/grpc/client_test.go:"
rg -n "type args *{" internal/net/grpc/client_test.go

echo "--------------------------------------------"
# List occurrences of 'ctx context.Context' to see if the context parameter is currently present.
echo "Occurrences of 'ctx context.Context' in internal/net/grpc/client_test.go:"
rg -n "ctx context.Context" internal/net/grpc/client_test.go

echo "--------------------------------------------"
# List the lines around where the commented-out code exists to verify removal or changes.
echo "Context around commented code for 'fn func(addr string, p pool.Conn) bool':"
rg -n -C 5 "fn func\(addr string, p pool.Conn\)" internal/net/grpc/client_test.go

Length of output: 2154

apis/swagger/v1/vald/object.swagger.json (2)

17-18: LGTM! Well-structured API documentation.

The documentation for /exists/{id} endpoint is comprehensive and follows best practices:

  • Clear overview section
  • Structured status code table
  • Detailed troubleshooting guide with common errors and resolutions

106-107: LGTM! Well-structured API documentation.

The documentation for /object/{id.id} endpoint is comprehensive and follows best practices:

  • Clear overview section
  • Structured status code table
  • Detailed troubleshooting guide with common errors and resolutions
apis/swagger/v1/vald/insert.swagger.json (1)

50-51: LGTM! Well-structured API documentation.

The documentation for /insert/multiple endpoint is comprehensive and follows best practices:

  • Clear overview section
  • Important notice about gRPC message size limitation
  • Structured status code table
  • Detailed troubleshooting guide with common errors and resolutions
apis/swagger/v1/vald/upsert.swagger.json (2)

17-18: Enhanced Upsert Endpoint Documentation:
The updated summary and description now clearly explain that the Upsert RPC handles both updating an existing vector and inserting a new one. The detailed status code table and troubleshooting section greatly improve clarity. Consider a slight rewording (for example, “update an existing vector or insert a new one”) to make it even clearer.


50-52: Enhanced MultiUpsert Endpoint Documentation:
The revised summary for the /upsert/multiple endpoint now explains the dual functionality of updating existing vectors and adding new ones in a single request. The added notice on gRPC message size limitations and the structured table of status codes/troubleshooting details provide excellent guidance.

apis/swagger/v1/vald/update.swagger.json (1)

17-18: Enhanced Update Endpoint Documentation:
The updated summary and description for the /update endpoint now include a comprehensive status code table with troubleshooting details. This clear presentation helps users understand the purpose and common error scenarios of this RPC.

apis/swagger/v1/vald/remove.swagger.json (3)

17-19: Improved Single Remove Documentation:
The revised summary and description for the /remove endpoint now include a detailed table of possible responses and troubleshooting tips. This enhancement clarifies what a client should expect in error conditions, improving the overall API usability.


50-52: Enhanced MultiRemove Endpoint Documentation:
The updated documentation for the /remove/multiple endpoint now mirrors the style of the single remove endpoint—with a detailed status code table and troubleshooting suggestions. The clarity and consistency across endpoints are commendable.


83-85: Detailed RemoveByTimestamp Documentation:
The summary and description for the /remove/timestamp endpoint now provide a comprehensive explanation—including how multiple timestamps are combined using an AND condition—with a clear status code table and troubleshooting section.

apis/swagger/v1/vald/filter.swagger.json (8)

17-18: InsertObject Endpoint Documentation Enhancement:
The updated summary now specifies that the InsertObject RPC is used for inserting objects via the Vald Filter Gateway. The added status code table clearly lays out possible responses, which enhances user understanding.


49-50: MultiInsertObject Endpoint Documentation Enhancement:
The revised summary for the /insert/object/multiple endpoint makes it clear that multiple new objects can be inserted in a single request. The detailed presentation of status codes improves clarity.


81-82: SearchObject Endpoint Documentation Updated:
The updated summary for the /search/object endpoint clearly articulates that the SearchObject RPC is used for object-based searches. The error handling table and troubleshooting information add valuable context.


112-113: StreamSearchObject Endpoint Documentation Updated:
The enhanced summary for the /search/object/multiple endpoint now includes a notice on bidirectional streaming, which is very useful. The structured status code table and troubleshooting details further improve the documentation.


145-146: UpdateObject Endpoint Documentation Improved:
For the /update/object endpoint, the revised summary now effectively communicates the update operation for a single object, complete with a clear status code table.


177-178: MultiUpdateObject Endpoint Documentation Updated:
The updated summary and description emphasize the multi-update functionality and note the gRPC message size limitation. Clear troubleshooting steps are provided, enhancing the endpoint’s usability.


210-211: UpsertObject Endpoint Documentation Enhanced:
The summary for the /upsert/object endpoint now clearly states that it is used for both updating an existing object and inserting a new one. The concise explanation is very helpful.


242-244: MultiUpsertObject Endpoint Documentation Enhanced:
The changes for the /upsert/object/multiple endpoint now offer a detailed summary with a clear notice on gRPC message size limitations and troubleshooting information. This consistency and level of detail across endpoints is excellent.

apis/swagger/v1/vald/search.swagger.json (8)

16-19: LinearSearch Endpoint Documentation Enhanced:
The updated summary for the /linearsearch endpoint clearly describes the LinearSearch RPC with a detailed status code table and troubleshooting guide. This helps set a clear expectation for users.


49-52: LinearSearchByID Endpoint Documentation Updated:
The revised documentation for the /linearsearch/id endpoint now explains that the search is performed by a user-defined vector ID – including a note that the vector must be indexed prior to querying. This adds useful context regarding prerequisites.


81-85: MultiLinearSearchByID Endpoint Documentation Enhanced:
The enhanced summary for the /linearsearch/id/multiple endpoint now includes a clear notice about gRPC message size limitations, a thorough status code table, and detailed troubleshooting steps, which together offer a comprehensive guide.


115-118: MultiLinearSearch Endpoint Documentation Improved:
The updated summary for /linearsearch/multiple explains the functionality clearly and provides users with detailed troubleshooting information. This ensures that clients are aware of both the capabilities and limitations of the RPC.


148-151: Search Endpoint Documentation Enhanced:
The /search endpoint now features an expanded summary and detailed troubleshooting information. This helps in setting clear expectations regarding both the success and error conditions of the search operation.


180-184: SearchByID Endpoint Documentation Updated:
The summary for the /search/id endpoint now clearly indicates that the RPC searches based on a user-defined vector ID, including important validation notes and detailed error guidance.


213-217: MultiSearchByID Endpoint Documentation Enhanced:
The updated documentation for /search/id/multiple now includes a detailed status code table and clear troubleshooting advice. This helps users diagnose potential issues when searching with multiple IDs.


246-250: MultiSearch Endpoint Documentation Improved:
The enhanced summary and comprehensive troubleshooting information for the /search/multiple endpoint provide a clear and useful guide for users. The documentation now effectively communicates both functional details and considerations (such as gRPC limitations).

@github-actions github-actions bot removed the size/L label Feb 4, 2025
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 9c82017 to 0e02ce2 Compare February 4, 2025 10:42
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 4

🧹 Nitpick comments (7)
internal/observability/metrics/grpc/grpc.go (1)

81-104: Consider adding error handling for metrics registration.

While the implementation is good, consider adding cleanup or fallback handling if registration fails.

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
+    var registeredMetrics []api.Observable
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
+        // Clean up any previously registered metrics
+        for _, metric := range registeredMetrics {
+            m.UnregisterCallback(metric)
+        }
         return err
     }
+    registeredMetrics = append(registeredMetrics, healthyConn)
     _, err = m.RegisterCallback(
         func(ctx context.Context, o api.Observer) error {
             ms := pool.Metrics(ctx)
             if len(ms) == 0 {
                 return nil
             }
             for name, cnt := range ms {
                 o.ObserveInt64(healthyConn, cnt, api.WithAttributes(attribute.String(gm.poolTargetAddrKey, name)))
             }
             return nil
         }, healthyConn,
     )
     return err
 }
internal/net/grpc/pool/pool.go (1)

592-594: Consider using deferred mutex unlock for metrics update.

The current implementation could lead to mutex contention. Consider using a deferred unlock pattern.

-    mu.Lock()
-    metrics[p.addr] = int64(pl - unhealthy)
-    mu.Unlock()
+    func() {
+        mu.Lock()
+        defer mu.Unlock()
+        metrics[p.addr] = int64(pl - unhealthy)
+    }()
apis/swagger/v1/vald/insert.swagger.json (1)

17-18: Typo in Endpoint Summary – “Inset RPC” vs “Insert RPC”.
The updated summary for the /insert endpoint now reads “Inset RPC is the method to add a new single vector.” This is likely a typographical mistake. Please update “Inset” to “Insert” for clarity and consistency with the rest of the API documentation.

apis/swagger/v1/vald/filter.swagger.json (1)

177-179: MultiUpdateObject Endpoint – Notice on Message Size Limitation.
The /update/object/multiple endpoint now includes a notice about gRPC message size limitations, which is important for users building bulk requests. Also fix the grammar of the notice text (use “exceeds” instead of “exceed”).

Makefile (3)

161-167: Enhance Compiler and Linker Flags Consolidation
The new definitions for LDFLAGS, COMMON_FLAGS, and the language-specific flag variables are well organized. Using a central COMMON_FLAGS variable helps maintain consistency across C, C++ and Fortran builds.

Suggestion: Consider adding an inline comment above these definitions explaining the rationale behind using aggressive optimizations (e.g. -Ofast) and the potential trade-offs regarding floating‐point precision or portability.


181-182: Arm64 Flag Handling
For the arm64 architecture, the flags remain unchanged, as indicated by the reassignment. Although functionally correct, it might be helpful to add a brief comment noting that arm64 targets need neither extra optimization flags nor any feature-disabling flags.


185-186: Default Flag Preservation for Other Architectures
The explicit reassignment of CFLAGS and CXXFLAGS in the fallback path (for architectures other than amd64 or arm64) confirms that the defaults are preserved. This is clear and acceptable as long as no additional architecture-specific flags are needed.

Suggestion: A comment might be added to explain this branch if future maintainers expect architecture-specific optimizations.

🛑 Comments failed to post (4)
.gitfiles (1)

1951-1951: 💡 Codebase verification

Robust Error Handling: Multiple unwrap() Calls Detected

The QBG implementation and object handler contain numerous calls to unwrap(), which do not align with the robust error handling practices outlined in PR #2807. Specifically:

  • rust/libs/algorithms/qbg/src/lib.rs has multiple unwrap() usages (e.g., lines 379, 397, 448, etc.) that may cause runtime panics if an error occurs.
  • rust/bin/agent/src/handler/object.rs (line 43) calls unwrap() on the conversion of hostname, which can also lead to crashes if the hostname isn’t valid.

It’s recommended to refactor these instances to use proper error handling methods, such as ? for propagating errors or matching on the result to handle potential errors gracefully.

🔗 Analysis chain

LGTM: New Rust implementation files.

The addition of Rust source files for object handling and the QBG algorithm implementation aligns with the project's goals. Based on the retrieved learning from PR #2807, ensure that the Rust implementation follows robust error handling practices.

Run the following script to verify the Rust implementation follows the project's error handling guidelines:

Also applies to: 1970-1974

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for unwrap() usage in Rust files
# Note: unwrap() should be avoided as per PR #2807 learning

rg -n 'unwrap\(\)' 'rust/bin/agent/src/handler/object.rs' 'rust/libs/algorithms/qbg/src/lib.rs'

Length of output: 2839

apis/swagger/v1/vald/object.swagger.json (2)

77-77: 🛠️ Refactor suggestion

Add missing status codes and troubleshooting information.

The documentation for /object/meta/{id.id} endpoint has TODO placeholders for status codes and troubleshooting.

-        "summary": "Overview\nRepresent the RPC to get the vector metadata. This RPC is mainly used for index correction process\n---\nStatus Code\nTODO\n---\nTroubleshooting\nTODO",
+        "summary": "Overview\nRepresent the RPC to get the vector metadata. This RPC is mainly used for index correction process\n---\nStatus Code\n|  0   | OK                |\n|  1   | CANCELLED         |\n|  3   | INVALID_ARGUMENT  |\n|  4   | DEADLINE_EXCEEDED |\n|  5   | NOT_FOUND         |\n|  13  | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",
+        "description": "Here are some common reasons and how to resolve each error.\n\n| name              | common reason                                                                                   | how to resolve                                                                           |\n| :---------------- | :---------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------- |\n| CANCELLED         | Executed cancel() of rpc from client/server-side or network problems between client and server. | Check the code, especially around timeout and connection management, and fix if needed.  |\n| INVALID_ARGUMENT  | The Requested vector's ID is empty, or some request payload is invalid.                         | Check request payload and fix request payload.                                           |\n| DEADLINE_EXCEEDED | The RPC timeout setting is too short on the client/server side.                                 | Check the gRPC timeout setting on both the client and server sides and fix it if needed. |\n| NOT_FOUND         | Requested ID is NOT inserted.                                                                   | Send a request with an ID that is already inserted.                                      |\n| INTERNAL          | Target Vald cluster or network route has some critical error.                                   | Check target Vald cluster first and check network route including ingress as second.     |",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        "summary": "Overview\nRepresent the RPC to get the vector metadata. This RPC is mainly used for index correction process\n---\nStatus Code\n|  0   | OK                |\n|  1   | CANCELLED         |\n|  3   | INVALID_ARGUMENT  |\n|  4   | DEADLINE_EXCEEDED |\n|  5   | NOT_FOUND         |\n|  13  | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",
        "description": "Here are some common reasons and how to resolve each error.\n\n| name              | common reason                                                                                   | how to resolve                                                                           |\n| :---------------- | :---------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------- |\n| CANCELLED         | Executed cancel() of rpc from client/server-side or network problems between client and server. | Check the code, especially around timeout and connection management, and fix if needed.  |\n| INVALID_ARGUMENT  | The Requested vector's ID is empty, or some request payload is invalid.                         | Check request payload and fix request payload.                                           |\n| DEADLINE_EXCEEDED | The RPC timeout setting is too short on the client/server side.                                 | Check the gRPC timeout setting on both the client and server sides and fix it if needed. |\n| NOT_FOUND         | Requested ID is NOT inserted.                                                                   | Send a request with an ID that is already inserted.                                      |\n| INTERNAL          | Target Vald cluster or network route has some critical error.                                   | Check target Vald cluster first and check network route including ingress as second.     |",

47-47: 🛠️ Refactor suggestion

Add missing status codes and troubleshooting information.

The documentation for /object/list endpoint has TODO placeholders for status codes and troubleshooting.

-        "summary": "Overview\nA method to get all the vectors with server streaming\n---\nStatus Code\nTODO\n---\nTroubleshooting\nTODO",
+        "summary": "Overview\nA method to get all the vectors with server streaming\n---\nStatus Code\n|  0   | OK                |\n|  1   | CANCELLED         |\n|  3   | INVALID_ARGUMENT  |\n|  4   | DEADLINE_EXCEEDED |\n|  13  | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",
+        "description": "Here are some common reasons and how to resolve each error.\n\n| name              | common reason                                                                                   | how to resolve                                                                           |\n| :---------------- | :---------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------- |\n| CANCELLED         | Executed cancel() of rpc from client/server-side or network problems between client and server. | Check the code, especially around timeout and connection management, and fix if needed.  |\n| INVALID_ARGUMENT  | Some request payload is invalid.                                                                 | Check request payload and fix request payload.                                           |\n| DEADLINE_EXCEEDED | The RPC timeout setting is too short on the client/server side.                                 | Check the gRPC timeout setting on both the client and server sides and fix it if needed. |\n| INTERNAL          | Target Vald cluster or network route has some critical error.                                   | Check target Vald cluster first and check network route including ingress as second.     |",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        "summary": "Overview\nA method to get all the vectors with server streaming\n---\nStatus Code\n|  0   | OK                |\n|  1   | CANCELLED         |\n|  3   | INVALID_ARGUMENT  |\n|  4   | DEADLINE_EXCEEDED |\n|  13  | INTERNAL          |\n---\nTroubleshooting\nThe request process may not be completed when the response code is NOT `0 (OK)`.",
        "description": "Here are some common reasons and how to resolve each error.\n\n| name              | common reason                                                                                   | how to resolve                                                                           |\n| :---------------- | :---------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------- |\n| CANCELLED         | Executed cancel() of rpc from client/server-side or network problems between client and server. | Check the code, especially around timeout and connection management, and fix if needed.  |\n| INVALID_ARGUMENT  | Some request payload is invalid.                                                                 | Check request payload and fix request payload.                                           |\n| DEADLINE_EXCEEDED | The RPC timeout setting is too short on the client/server side.                                 | Check the gRPC timeout setting on both the client and server sides and fix it if needed. |\n| INTERNAL          | Target Vald cluster or network route has some critical error.                                   | Check target Vald cluster first and check network route including ingress as second.     |",
apis/swagger/v1/vald/update.swagger.json (1)

83-84: ⚠️ Potential issue

Incomplete Placeholder for UpdateTimestamp Endpoint.
The summary for /update/timestamp still contains “TODO” placeholders for both status codes and troubleshooting. This should be completed before merging—the documentation must reflect actual status values and common error resolutions for the update timestamp operation.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
apis/swagger/v1/vald/upsert.swagger.json (2)

50-51: Clarify MultiUpsert Endpoint Summary and Description.
The updated summary and description for the /upsert/multiple endpoint clearly outline the method’s purpose and call out the gRPC message size limitation. One minor nitpick: consider whether using "1 request" is optimal for clarity—using "a single request" might be more immediately clear for all readers.


252-253: Fix Typographical Error in UpsertMulti Request Definition.
There is a typo in the description field: "Represent mthe ultiple upsert request." should be corrected to "Represent the multiple upsert request." This change will improve clarity and maintain the professionalism of the API documentation.

Below is a suggested diff for the correction:

-        "description": "Represent mthe ultiple upsert request."
+        "description": "Represent the multiple upsert request."
internal/net/grpc/pool/pool.go (1)

1129-1158: Enhance error handling in rangeConns.

The rangeConns function has been updated to handle connection health checks, but there are a few improvements that could be made:

  1. Consider adding error propagation from Connect and Disconnect calls
  2. Consider adding retry logic for transient failures

Apply this diff to improve error handling:

 func (g *gRPCClient) rangeConns(
     ctx context.Context, force bool, fn func(addr string, p pool.Conn) bool,
 ) error {
     var cnt int
+    var errs []error
     g.conns.Range(func(addr string, p pool.Conn) bool {
         if force {
             cnt++
             return fn(addr, p)
         }
         if p == nil || !p.IsHealthy(ctx) {
             pc, err := p.Connect(ctx)
             if pc == nil || err != nil || !pc.IsHealthy(ctx) {
                 if pc != nil {
                     if derr := pc.Disconnect(ctx); derr != nil {
                         log.Debugf("Failed to disconnect unhealthy connection for %s: %v", addr, derr)
+                        errs = append(errs, derr)
                     }
                 }
                 log.Debugf("Unhealthy connection detected for %s during gRPC Connection Range over Loop:\t%s", addr, p.String())
                 return true
             }
             p = pc
         }
         cnt++
         return fn(addr, p)
     })
     if cnt == 0 {
         return errors.ErrGRPCClientConnNotFound("*")
     }
+    if len(errs) > 0 {
+        return errors.Join(errs...)
+    }
     return nil
 }
internal/net/grpc/client.go (1)

257-291: Consider adding retry logic for connection failures.

The connection monitoring loops could benefit from retry logic for transient failures.

Consider adding exponential backoff for connection retries:

 err = g.rangeConns(ctx, true, func(addr string, p pool.Conn) bool {
+    var retries int
+    var backoff = time.Millisecond * 100
     if addr == "" || p == nil {
         disconnectTargets = append(disconnectTargets, addr)
         return true
     }
     var err error
-    p, err = p.Reconnect(ctx, false)
+    for retries < 3 {
+        p, err = p.Reconnect(ctx, false)
+        if err == nil {
+            break
+        }
+        time.Sleep(backoff)
+        backoff *= 2
+        retries++
+    }
     if err != nil {
         // ... error handling ...
     }
     return true
 })

Also applies to: 294-333

apis/swagger/v1/vald/filter.swagger.json (2)

175-178: Revised Summary & Description for /update/object/multiple Endpoint
The updated text now incorporates a notice regarding potential gRPC message size limitations and provides a comprehensive status code table. One suggestion is to verify that the HTML snippet (i.e. <div class="notice">) renders correctly in your Swagger UI.


240-243: Enhanced Summary & Description for /upsert/object/multiple Endpoint
The updated summary and description now indicate that the endpoint performs multi-upsert operations, complete with a detailed status code table and an HTML-formatted notice about gRPC message size limitations. It is recommended to check that the embedded HTML is supported by all Swagger UI renderers in your environment.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c82017 and 0e02ce2.

⛔ Files ignored due to path filters (31)
  • apis/grpc/v1/agent/core/agent.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/agent/sidecar/sidecar.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/discoverer/discoverer.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/egress/egress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/ingress/ingress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/meta/meta.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/rpc/errdetails/error_details.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • example/client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (68)
  • .gitfiles (4 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/security_issue_report.md (1 hunks)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • Makefile (6 hunks)
  • Makefile.d/functions.mk (2 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • apis/swagger/v1/mirror/mirror.swagger.json (1 hunks)
  • apis/swagger/v1/vald/filter.swagger.json (8 hunks)
  • apis/swagger/v1/vald/flush.swagger.json (1 hunks)
  • apis/swagger/v1/vald/index.swagger.json (5 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • apis/swagger/v1/vald/object.swagger.json (4 hunks)
  • apis/swagger/v1/vald/remove.swagger.json (3 hunks)
  • apis/swagger/v1/vald/search.swagger.json (8 hunks)
  • apis/swagger/v1/vald/update.swagger.json (3 hunks)
  • apis/swagger/v1/vald/upsert.swagger.json (2 hunks)
  • example/client/go.mod (2 hunks)
  • go.mod (12 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (16 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (26 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (1 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • k8s/discoverer/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/lb/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/mirror/deployment.yaml (1 hunks)
  • k8s/index/operator/deployment.yaml (1 hunks)
  • k8s/manager/index/deployment.yaml (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • rust/rust-toolchain (1 hunks)
  • versions/BUF_VERSION (1 hunks)
  • versions/CMAKE_VERSION (1 hunks)
  • versions/DOCKER_VERSION (1 hunks)
  • versions/FAISS_VERSION (1 hunks)
  • versions/GOLANGCILINT_VERSION (1 hunks)
  • versions/HELM_VERSION (1 hunks)
  • versions/K3S_VERSION (1 hunks)
  • versions/KIND_VERSION (1 hunks)
  • versions/KUBECTL_VERSION (1 hunks)
  • versions/NGT_VERSION (1 hunks)
  • versions/OPERATOR_SDK_VERSION (1 hunks)
  • versions/PROMETHEUS_STACK_VERSION (1 hunks)
  • versions/PROTOBUF_VERSION (1 hunks)
  • versions/RUST_VERSION (1 hunks)
  • versions/TELEPRESENCE_VERSION (1 hunks)
  • versions/USEARCH_VERSION (1 hunks)
  • versions/YQ_VERSION (1 hunks)
  • versions/actions/ACTIONS_SETUP_GO (1 hunks)
  • versions/actions/ACTIONS_SETUP_NODE (1 hunks)
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT (1 hunks)
  • versions/actions/CODECOV_CODECOV_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_QEMU_ACTION (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_INIT (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF (1 hunks)
  • versions/actions/GITHUB_ISSUE_METRICS (1 hunks)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_HADOLINT (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL (1 hunks)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (58)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST
  • versions/KUBECTL_VERSION
  • versions/OPERATOR_SDK_VERSION
  • versions/CMAKE_VERSION
  • versions/actions/ACTIONS_SETUP_GO
  • versions/DOCKER_VERSION
  • versions/actions/ACTIONS_SETUP_NODE
  • versions/HELM_VERSION
  • versions/actions/DOCKER_SETUP_QEMU_ACTION
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF
  • versions/PROTOBUF_VERSION
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT
  • versions/KIND_VERSION
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION
  • versions/actions/CODECOV_CODECOV_ACTION
  • versions/RUST_VERSION
  • versions/NGT_VERSION
  • versions/PROMETHEUS_STACK_VERSION
  • versions/FAISS_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE
  • versions/BUF_VERSION
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE
  • versions/USEARCH_VERSION
  • versions/K3S_VERSION
  • internal/net/net.go
  • versions/actions/REVIEWDOG_ACTION_HADOLINT
  • versions/actions/GITHUB_ISSUE_METRICS
  • versions/actions/GITHUB_CODEQL_ACTION_INIT
  • versions/GOLANGCILINT_VERSION
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL
  • versions/YQ_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD
  • internal/test/mock/grpc/grpc_client_mock.go
  • Makefile.d/functions.mk
  • internal/test/mock/grpc_testify_mock.go
  • k8s/gateway/gateway/lb/deployment.yaml
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • internal/backoff/backoff.go
  • versions/TELEPRESENCE_VERSION
  • .github/PULL_REQUEST_TEMPLATE.md
  • rust/rust-toolchain
  • apis/swagger/v1/vald/flush.swagger.json
  • apis/swagger/v1/mirror/mirror.swagger.json
  • k8s/index/operator/deployment.yaml
  • pkg/gateway/mirror/service/mirror.go
  • apis/swagger/v1/vald/index.swagger.json
  • k8s/discoverer/deployment.yaml
  • .github/ISSUE_TEMPLATE/bug_report.md
  • k8s/gateway/gateway/mirror/deployment.yaml
  • k8s/manager/index/deployment.yaml
  • .github/ISSUE_TEMPLATE/security_issue_report.md
  • Makefile.d/k3d.mk
  • internal/net/grpc/pool/pool_test.go
  • apis/swagger/v1/vald/remove.swagger.json
  • Makefile
  • example/client/go.mod
  • apis/swagger/v1/vald/update.swagger.json
  • go.mod
🧰 Additional context used
📓 Learnings (1)
.gitfiles (1)
Learnt from: datelier
PR: vdaas/vald#2807
File: rust/bin/agent/src/handler/object.rs:36-38
Timestamp: 2025-01-22T06:26:59.130Z
Learning: In Rust code for Vald handlers, avoid using unwrap() on Option types. Instead, handle None values explicitly by returning appropriate error statuses to maintain robustness.
🔇 Additional comments (34)
apis/swagger/v1/vald/upsert.swagger.json (1)

17-18: Enhance Upsert Endpoint Documentation.
The updated summary and description now provide a comprehensive overview of the Upsert RPC, including detailed status codes and troubleshooting guidelines. This enriched documentation should be very helpful to API consumers. Please verify that the multiline text with embedded markdown-style tables renders as expected in your Swagger UI.

apis/swagger/v1/vald/object.swagger.json (4)

17-18: Enhanced Documentation for the /exists/{id} Endpoint.
The updated summary and added description now clearly outline the purpose of the Exists RPC along with detailed error codes and troubleshooting steps. Please ensure that these textual details remain in sync with the backend’s error-handling logic and overall documentation standards.


47-47: Incomplete Documentation in /object/list Endpoint.
The summary field still has placeholders ("Status Code TODO" and "Troubleshooting TODO"). Please fill in these sections with the appropriate status codes and troubleshooting information so that API consumers receive complete documentation.


77-77: Incomplete Documentation in /object/meta/{id.id} Endpoint.
Similar to the /object/list endpoint, the summary contains placeholders for "Status Code" and "Troubleshooting". Completing these sections will improve clarity and consistency with other endpoints.


106-107: Enhanced Documentation for the /object/{id.id} Endpoint.
The improved summary and description now offer a detailed explanation of the GetObject RPC, including a clear table of status codes and error scenarios. Verify that these details accurately reflect the actual behavior and error responses from the service, and ensure consistency with similar endpoints such as /exists/{id}.

.gitfiles (2)

1951-1951: Verify error handling in the new Rust handler.

Based on previous feedback, ensure that the new object handler follows proper error handling practices by avoiding unwrap() on Option types.

Run the following script to check for unwrap() usage:

#!/bin/bash
# Description: Check for unwrap() usage in the new Rust handler
rg -A 2 'unwrap\(\)' rust/bin/agent/src/handler/object.rs

1970-1974: Integration Testing and Documentation for QBG Algorithm Are Missing.

The new QBG algorithm implementation requires proper integration testing and documentation.

Run the following script to check for tests and documentation:

#!/bin/bash
# Description: Check for test files and documentation for QBG algorithm
echo "Checking for tests..."
fd -t f "test" rust/libs/algorithms/qbg/

echo "Checking for documentation..."
fd -t f "README|\.md" rust/libs/algorithms/qbg/
internal/observability/metrics/grpc/grpc.go (4)

33-34: LGTM! New metrics for pool connection health.

The new metrics server_pool_conn with description "Count of healthy pool connections by target address" will help monitor the health of gRPC connection pools.


37-39: LGTM! Added field for target address key.

The poolTargetAddrKey field in grpcServerMetrics struct will be used as a label key for the pool connection metrics.


69-77: LGTM! Added view for pool connection metrics.

The new view for poolConnMetricsName with AggregationSum is correctly configured to track the count of healthy pool connections.


83-103: LGTM! Added registration for pool connection metrics.

The registration logic correctly does the following (a standalone sketch follows the list):

  1. Creates an observable gauge for pool connections
  2. Registers a callback to observe connection health
  3. Uses the target address as an attribute
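
To make the shape of that registration concrete, here is a minimal standalone sketch written directly against the upstream OpenTelemetry Go SDK (go.opentelemetry.io/otel). Vald routes these calls through its internal metrics wrapper, so the function name, meter name, and attribute key below are illustrative assumptions rather than the project's actual identifiers.

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/attribute"
	api "go.opentelemetry.io/otel/metric"
	sdk "go.opentelemetry.io/otel/sdk/metric"
)

// registerPoolConnGauge mirrors the three steps listed above: create an
// observable gauge, register a callback, and label each observation with the
// target address. The metric and attribute names are assumptions.
func registerPoolConnGauge(meter api.Meter, counts func(context.Context) map[string]int64) error {
	gauge, err := meter.Int64ObservableGauge(
		"server_pool_conn",
		api.WithDescription("Count of healthy pool connections by target address"),
	)
	if err != nil {
		return err
	}
	_, err = meter.RegisterCallback(func(ctx context.Context, o api.Observer) error {
		for addr, cnt := range counts(ctx) {
			// One observation per target address, labelled with that address.
			o.ObserveInt64(gauge, cnt, api.WithAttributes(attribute.String("target_addr", addr)))
		}
		return nil
	}, gauge)
	return err
}

func main() {
	mp := sdk.NewMeterProvider()
	defer mp.Shutdown(context.Background())
	// The counts function stands in for pool.Metrics(ctx); the address and
	// value below are stand-in data.
	if err := registerPoolConnGauge(mp.Meter("example"), func(context.Context) map[string]int64 {
		return map[string]int64{"vald-agent-0:8081": 3}
	}); err != nil {
		log.Fatal(err)
	}
}

The reviewed code performs the same three steps through the internal wrapper shown in the registration snippet quoted earlier in this review.
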
internal/net/grpc/pool/pool.go (3)

93-95: LGTM! Added global metrics storage.

The global metrics map with mutex synchronization will store connection health metrics per target address.


592-594: LGTM! Added metrics update in health check.

The health check function now updates the metrics map with the count of healthy connections per target address.


879-887: LGTM! Added metrics retrieval function.

The Metrics function safely returns a clone of the metrics map using maps.Clone.
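
For reference, the snapshot pattern described in these three comments looks roughly like the following as a standalone package; the package name and the record helper are illustrative assumptions, while the reviewed code keeps the map, mutex, and Metrics accessor inside internal/net/grpc/pool.

package poolmetrics

import (
	"maps"
	"sync"
)

var (
	mu      sync.RWMutex
	metrics = make(map[string]int64)
)

// record stores the healthy-connection count for one target address; in the
// reviewed code the equivalent write happens inside the pool health check.
func record(addr string, healthy int64) {
	mu.Lock()
	metrics[addr] = healthy
	mu.Unlock()
}

// Metrics returns an independent snapshot (maps.Clone, Go 1.21+) so callers
// such as a metrics callback can iterate without holding the lock.
func Metrics() map[string]int64 {
	mu.RLock()
	defer mu.RUnlock()
	if len(metrics) == 0 {
		return nil
	}
	return maps.Clone(metrics)
}
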

internal/net/grpc/client.go (2)

1096-1108: LGTM! Added context to ConnectedAddrs.

The ConnectedAddrs function now accepts a context parameter and uses it for health checks, improving context propagation.


1115-1124: LGTM! Added context handling in Close.

The Close function now properly handles context cancellation during cleanup.
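
As a usage note, callers can now bound cleanup time by passing a context with a deadline. Below is a hedged, self-contained sketch of that call pattern; the stub type only imitates a client whose Close accepts a context and is not Vald's actual client.

package main

import (
	"context"
	"fmt"
	"time"
)

// closer stands in for a gRPC client whose Close takes a context, as in this PR.
type closer interface {
	Close(ctx context.Context) error
}

type stubClient struct{}

func (stubClient) Close(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // cleanup was cut short by the caller's deadline
	case <-time.After(10 * time.Millisecond): // pretend disconnect work
		return nil
	}
}

func main() {
	var c closer = stubClient{}
	// Bound the cleanup so a stuck disconnect cannot hang shutdown forever.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := c.Close(ctx); err != nil {
		fmt.Println("close failed:", err)
	}
}
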

internal/net/grpc/client_test.go (2)

3110-3110: Context parameter added but test cases need update.

The ConnectedAddrs method now accepts a context parameter but the test cases haven't been fully updated to reflect this change.

The test should verify context cancellation and timeout scenarios. Apply this diff:

-gotAddrs := g.ConnectedAddrs(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+defer cancel()
+gotAddrs := g.ConnectedAddrs(ctx)

3306-3308: Context parameter added to rangeConns but test cases need update.

The rangeConns method signature has been updated to include a context parameter but the test cases need to be updated accordingly.

The test should verify context handling. Apply this diff:

type args struct {
+    ctx context.Context
     fn func(addr string, p pool.Conn) bool
}
apis/swagger/v1/vald/insert.swagger.json (2)

17-18: LGTM! Comprehensive API documentation improvements.

The documentation improvements for the /insert endpoint are excellent, providing:

  • Clear status code table
  • Detailed troubleshooting guide
  • Common error scenarios and resolutions

50-51: LGTM! Well-structured documentation for multi-insert endpoint.

The documentation for /insert/multiple endpoint is well-structured with:

  • Clear warning about gRPC message size limitations
  • Comprehensive status code table
  • Detailed troubleshooting section
apis/swagger/v1/vald/filter.swagger.json (6)

15-17: Enhanced Summary for /insert/object Endpoint
The updated summary now clearly specifies that the endpoint uses the InsertObject RPC method and includes a detailed status code table. This increased clarity should help API consumers understand the possible responses.


47-50: Improved Summary for /insert/object/multiple Endpoint
The revised summary explicitly states that the endpoint corresponds to the MultiInsertObject RPC and incorporates a status code table with a notice on gRPC message size limitations. This additional information is useful for users concerned with message size constraints.


79-82: Updated Summary for /search/object Endpoint
The summary now clearly indicates that this endpoint uses the SearchObject RPC and provides a detailed status code table. This update enhances overall clarity and consistency.


111-114: Clarified Summary for /search/object/multiple Endpoint
The updated summary emphasizes that the endpoint employs the bidirectional streaming mechanism (StreamSearchObject RPC) while including a status code table. This change makes it easier for consumers to understand the streaming behavior.


143-146: Refined Summary for /update/object Endpoint
The summary update explicitly denotes the UpdateObject RPC along with a detailed list of status codes, which promotes better understanding of the operation’s expected outcomes.


208-211: Improved Summary for /upsert/object Endpoint
The revised summary clearly describes that this endpoint facilitates both updating and inserting a single object, and it details the corresponding status codes. This enhances the clarity of the API documentation.

apis/swagger/v1/vald/search.swagger.json (8)

15-19: Enhanced Summary and Troubleshooting for /linearsearch Endpoint
The updated summary now explains that the endpoint uses the LinearSearch RPC method and includes a detailed status code table with troubleshooting guidance. This comprehensive information should help users diagnose potential issues.


48-52: Updated Summary for /linearsearch/id Endpoint
The enhanced summary now identifies the endpoint as using the LinearSearchByID RPC, noting that the vector must be indexed into vald-agent beforehand. Detailed status codes and troubleshooting instructions are provided, which improve clarity.


81-85: Improved Summary for /linearsearch/id/multiple Endpoint
The revised summary now clarifies that this endpoint supports multi-linear search by IDs in a single call. The inclusion of an HTML notice regarding gRPC message size limitations and detailed troubleshooting steps significantly enhances the documentation.


114-118: Revised Summary for /linearsearch/multiple Endpoint
The updated summary and description clearly communicate that the endpoint enables multi-linear search with multiple vectors in one request. The detailed status codes and troubleshooting information offer valuable guidance for resolving issues.


147-151: Enhanced Summary for /search Endpoint
The updated summary now includes a detailed status code table and troubleshooting instructions for the Search RPC endpoint, thereby improving usability and user guidance.


180-184: Updated Summary for /search/id Endpoint
The summary has been refined to clearly indicate that the endpoint uses the SearchByID RPC. It also provides important notes about pre-indexing requirements along with a comprehensive list of status codes and troubleshooting advice.


213-217: Enhanced Summary for /search/id/multiple Endpoint
The revised summary now reflects that this endpoint processes multi-search requests by IDs and provides extensive troubleshooting details, including an HTML notice on gRPC message size limitations.


246-250: Updated Summary for /search/multiple Endpoint
The enhanced summary clearly outlines the multi-search functionality using multiple vectors in a single request. The inclusion of a detailed status code table, troubleshooting guidance, and a notice regarding gRPC message size limitations makes the endpoint documentation much clearer.

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 4, 2025
coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 4, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
Makefile.d/build.mk (1)

45-45: LGTM! Switching to static linking improves binary portability.

The change from $(LDFLAGS) to $(STATIC_LDFLAGS) for CGO-enabled targets is a good improvement that:

  • Reduces runtime dependencies by statically linking libraries
  • Makes binaries more portable across different environments
  • Ensures consistent behavior by bundling all dependencies

Note that this may increase binary sizes and build times.

Also applies to: 50-50, 102-102, 110-110, 114-114

Makefile (4)

161-162: Improved linker flags organization.

The separation of LDFLAGS and STATIC_LDFLAGS provides better control over linking behavior. The new flags focus on security (-z relro -z now), optimization (-flto=auto, -Wl,-O2), and size reduction (-Wl,--gc-sections).


170-175: Well-structured compiler optimization flags.

The introduction of COMMON_FLAGS with carefully chosen optimization flags is excellent:

  • Performance optimizations: -O3, -ffast-math, -march=native
  • Size optimizations: -ffunction-sections, -fdata-sections
  • Security improvements: -fvisibility=hidden

177-190: Good architecture-specific optimizations.

The architecture-specific handling is well-implemented:

  • amd64: Disables newer AVX instructions for better compatibility
  • arm64: Uses appropriate march setting
  • Fallback for other architectures

688-706: Consistent CMake configurations across different installations.

The CMake configurations for NGT, Faiss, and usearch are consistently structured with appropriate flags for:

  • Build type (Release)
  • Static linking
  • Optimization flags
  • Installation paths

Also applies to: 711-734, 740-766

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef74e7e and fd9a346.

📒 Files selected for processing (3)
  • Makefile (5 hunks)
  • Makefile.d/build.mk (2 hunks)
  • Makefile.d/tools.mk (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile.d/tools.mk
⏰ Context from checks skipped due to timeout of 90000ms (98)
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / dump-contexts-to-log
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / dump-contexts-to-log
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / dump-contexts-to-log
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / dump-contexts-to-log
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: build / dump-contexts-to-log
  • GitHub Check: build / build

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 4, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
internal/net/grpc/pool/pool.go (4)

303-347: Consider optimizing connection refresh logic.

The refreshConn function could be optimized by:

  1. Reducing nested if conditions
  2. Moving the connection closure logic to a separate helper function
  3. Adding retry mechanism for failed connections
 func (p *pool) refreshConn(ctx context.Context, idx int, pc *poolConn, addr string) (err error) {
+    closeConn := func(pc *poolConn) {
+        if pc != nil && pc.conn != nil {
+            p.eg.Go(safety.RecoverFunc(func() error {
+                log.Debugf("waiting for invalid connection to %s to be closed...", pc.addr)
+                err := pc.Close(ctx, p.roccd)
+                if err != nil {
+                    log.Debugf("failed to close connection pool addr = %s\terror = %v", pc.addr, err)
+                }
+                return nil
+            }))
+        }
+    }
+
     if pc != nil && pc.addr == addr && isHealthy(ctx, pc.conn) {
         return nil
     }
-    if pc != nil {
-        log.Debugf("connection for %s pool %d/%d len %d is unhealthy trying to establish new pool member connection to %s", pc.addr, idx+1, p.Size(), p.Len(), addr)
-    } else {
-        log.Debugf("connection pool %d/%d len %d is empty, establish new pool member connection to %s", idx+1, p.Size(), p.Len(), addr)
-    }
+    log.Debugf("connection pool %d/%d len %d status: %s, establish new pool member connection to %s",
+        idx+1, p.Size(), p.Len(),
+        map[bool]string{true: "unhealthy", false: "empty"}[pc != nil],
+        addr)
+
     conn, err := p.dial(ctx, addr)
     if err != nil {
         if pc != nil && isHealthy(ctx, pc.conn) {
             log.Debugf("dialing new connection to %s failed,\terror: %v,\tbut existing connection to %s is healthy will keep existing connection", addr, err, pc.addr)
             return nil
         }
-        if pc != nil {
-            if pc.conn != nil {
-                p.eg.Go(safety.RecoverFunc(func() error {
-                    log.Debugf("waiting for invalid connection to %s to be closed...", pc.addr)
-                    err = pc.Close(ctx, p.roccd)
-                    if err != nil {
-                        log.Debugf("failed to close connection pool addr = %s\terror = %v", pc.addr, err)
-                    }
-                    return nil
-                }))
-            }
-        }
+        closeConn(pc)
         return errors.Join(err, errors.ErrInvalidGRPCClientConn(addr))
     }
+
     p.store(idx, &poolConn{
         conn: conn,
         addr: addr,
     })
-    if pc != nil {
-        p.eg.Go(safety.RecoverFunc(func() error {
-            log.Debugf("waiting for old connection to %s to be closed...", pc.addr)
-            err = pc.Close(ctx, p.roccd)
-            if err != nil {
-                log.Debugf("failed to close connection pool addr = %s\terror = %v", pc.addr, err)
-            }
-            return nil
-        }))
-    }
+    closeConn(pc)
     return nil
 }

617-651: Enhance error handling in Do method.

The error handling in the Do method could be improved by:

  1. Adding retries with backoff for transient failures
  2. Better handling of context cancellation
 func (p *pool) Do(ctx context.Context, f func(conn *ClientConn) error) (err error) {
     if p == nil {
         return errors.ErrGRPCClientConnNotFound("*")
     }
+
+    var attempts int
+    maxAttempts := 3
+
+    for attempts < maxAttempts {
+        if ctx.Err() != nil {
+            return ctx.Err()
+        }
+
         idx, pc, ok := p.getHealthyConn(ctx, 0, p.Len())
         if !ok || pc == nil || pc.conn == nil {
+            attempts++
+            if attempts == maxAttempts {
                 return errors.ErrGRPCClientConnNotFound(p.addr)
+            }
+            time.Sleep(time.Duration(attempts) * 100 * time.Millisecond)
+            continue
         }
 
         conn := pc.conn
         err = f(conn)
         if err != nil {
             st, ok := status.FromError(err)
-            if ok && st != nil && st.Code() != codes.Canceled {
+            if ok && st != nil {
+                if st.Code() == codes.Canceled {
+                    return err
+                }
                 if conn != nil {
                     cerr := conn.Close()
                     if cerr != nil {
                         st, ok := status.FromError(cerr)
                         if ok && st != nil && st.Code() != codes.Canceled {
                             log.Warnf("Failed to close connection: %v", cerr)
                         }
                     }
                 }
                 rerr := p.refreshConn(ctx, idx, pc, p.addr)
                 if rerr == nil {
                     if newErr := f(p.load(idx).conn); newErr != nil {
                         return errors.Join(err, newErr)
                     }
                     return nil
                 }
                 err = errors.Join(err, rerr)
+                attempts++
+                if attempts < maxAttempts {
+                    time.Sleep(time.Duration(attempts) * 100 * time.Millisecond)
+                    continue
+                }
             }
         }
+        return err
+    }
     return err
 }

547-615: Optimize health check implementation.

The IsHealthy method could be improved by:

  1. Using atomic operations for counters
  2. Adding metrics for individual connection states
  3. Implementing circuit breaker pattern
+type healthMetrics struct {
+    mu        sync.Mutex
+    total     int64
+    healthy   int64
+    unhealthy int64
+    states    map[connectivity.State]int64
+}
+
 func (p *pool) IsHealthy(ctx context.Context) (healthy bool) {
     if p == nil || p.closing.Load() {
         return false
     }
     log.Debug("Checking health...")
-    var cnt, unhealthy int
+    hm := &healthMetrics{ // a distinct name avoids shadowing the package-level metrics map
+        states: make(map[connectivity.State]int64),
+    }
     pl := p.len()
+    atomic.StoreInt64(&hm.total, int64(pl))
+
     err := p.loop(ctx, func(ctx context.Context, idx int, pc *poolConn) bool {
+        if pc != nil && pc.conn != nil {
+            state := pc.conn.GetState()
+            hm.mu.Lock() // map values are not addressable, so guard the state counts with a lock
+            hm.states[state]++
+            hm.mu.Unlock()
+        }
+
         if pc == nil || !isHealthy(ctx, pc.conn) {
+            atomic.AddInt64(&hm.unhealthy, 1)
             if p.isIP {
                 // ... rest of the IP handling logic
             }
             // ... rest of the DNS handling logic
         }
-        cnt++
+        atomic.AddInt64(&hm.healthy, 1)
         return true
     })

     mu.Lock()
-    metrics[p.addr] = int64(pl - unhealthy)
+    metrics[p.addr] = atomic.LoadInt64(&hm.healthy)
     mu.Unlock()

-    if cnt == 0 {
+    if atomic.LoadInt64(&hm.healthy) == 0 {
         return false
     }
     if p.isIP {
-        return unhealthy == 0
+        return atomic.LoadInt64(&hm.unhealthy) == 0
     }
-    return unhealthy < pl
+    return atomic.LoadInt64(&hm.unhealthy) < int64(pl)
 }

893-901: Consider enhancing metrics with more detailed information.

The current metrics implementation could be improved by adding:

  1. Connection state counts
  2. Connection age
  3. Error counts
  4. Latency metrics
+type ConnectionMetrics struct {
+    HealthyConnections   int64
+    StateMetrics        map[connectivity.State]int64
+    LastErrorTimestamp  int64
+    ErrorCount          int64
+    AverageLatency      int64
+}
+
+var (
+    metricsData map[string]*ConnectionMetrics
+)
+
-func Metrics(context.Context) map[string]int64 {
+func Metrics(context.Context) map[string]*ConnectionMetrics {
     mu.RLock()
     defer mu.RUnlock()
 
-    if len(metrics) == 0 {
+    if len(metricsData) == 0 {
         return nil
     }
-    return maps.Clone(metrics)
+    return maps.Clone(metricsData)
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd9a346 and 551c681.

📒 Files selected for processing (2)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/net.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/net/net.go
⏰ Context from checks skipped due to timeout of 90000ms (109)
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
  • GitHub Check: build / build
  • GitHub Check: coverage
  • GitHub Check: check-format-diff
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: grpc-stream
  • GitHub Check: grpc-sequential
  • GitHub Check: CodeQL
  • GitHub Check: runner / go build
  • GitHub Check: build / build
🔇 Additional comments (1)
internal/net/grpc/pool/pool.go (1)

92-95: LGTM! Global metrics map is properly synchronized.

The implementation uses a mutex to protect concurrent access to the metrics map, which is a good practice for thread safety.

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 5, 2025
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 9726362 to 341a593 Compare February 5, 2025 02:38
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
internal/observability/metrics/grpc/grpc.go (1)

83-103: Consider adding error metrics for connection failures.

While the current implementation tracks healthy connections, it would be valuable to also track connection failures to help diagnose issues.

Consider adding a counter for failed connections:

+const (
+    poolConnFailedMetricsName        = "server_pool_conn_failed"
+    poolConnFailedMetricsDescription = "Count of failed pool connections by target address"
+)

 func (gm *grpcServerMetrics) Register(m metrics.Meter) error {
     healthyConn, err := m.Int64ObservableGauge(
         poolConnMetricsName,
         metrics.WithDescription(poolConnMetricsDescription),
         metrics.WithUnit(metrics.Dimensionless),
     )
     if err != nil {
         return err
     }
+    failedConn, err := m.Int64Counter(
+        poolConnFailedMetricsName,
+        metrics.WithDescription(poolConnFailedMetricsDescription),
+        metrics.WithUnit(metrics.Dimensionless),
+    )
+    if err != nil {
+        return err
+    }
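
If a failure counter is adopted, the pool package also needs a place to record failures. A hedged sketch of one option, mirroring the mutex-guarded metrics map quoted elsewhere in this review rather than wiring OpenTelemetry into the pool package; every name below is an assumption, not Vald's API:

package pool

import "sync"

var (
	failMu   sync.RWMutex
	failures = make(map[string]int64) // failed health checks per target address
)

// recordFailure bumps the failure count for addr; the health check loop could
// call this whenever a pool member fails its check.
func recordFailure(addr string) {
	failMu.Lock()
	failures[addr]++
	failMu.Unlock()
}

// Failures returns a snapshot so the observability layer can export it.
func Failures() map[string]int64 {
	failMu.RLock()
	defer failMu.RUnlock()
	out := make(map[string]int64, len(failures))
	for addr, n := range failures {
		out[addr] = n
	}
	return out
}
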
internal/net/grpc/pool/pool.go (1)

92-95: Consider using sync.Map for metrics.

The current implementation uses a regular map with mutex, which could be replaced with sync.Map for better performance in concurrent scenarios.

-var (
-    mu      sync.RWMutex
-    metrics map[string]int64 = make(map[string]int64)
-)
+var metrics sync.Map
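
For completeness, a sketch of what the call sites could look like under the sync.Map variant; the Metrics snapshot keeps returning a plain map so existing callers are unaffected (helper names are assumptions):

package pool

import (
	"context"
	"sync"
)

var metrics sync.Map // target address (string) -> healthy-connection count (int64)

// setHealthy records the healthy-connection count for a target address.
func setHealthy(addr string, n int64) {
	metrics.Store(addr, n)
}

// Metrics materializes the sync.Map into a regular map for callers.
func Metrics(_ context.Context) map[string]int64 {
	out := make(map[string]int64)
	metrics.Range(func(k, v any) bool {
		if addr, ok := k.(string); ok {
			if n, ok := v.(int64); ok {
				out[addr] = n
			}
		}
		return true
	})
	if len(out) == 0 {
		return nil
	}
	return out
}
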
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 551c681 and 56a7e41.

⛔ Files ignored due to path filters (31)
  • apis/grpc/v1/agent/core/agent.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/agent/sidecar/sidecar.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/discoverer/discoverer.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/egress/egress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/ingress/ingress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/meta/meta.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/rpc/errdetails/error_details.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • example/client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (70)
  • .gitfiles (4 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/security_issue_report.md (1 hunks)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • Makefile (5 hunks)
  • Makefile.d/build.mk (2 hunks)
  • Makefile.d/functions.mk (3 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • Makefile.d/tools.mk (2 hunks)
  • apis/swagger/v1/mirror/mirror.swagger.json (1 hunks)
  • apis/swagger/v1/vald/filter.swagger.json (8 hunks)
  • apis/swagger/v1/vald/flush.swagger.json (1 hunks)
  • apis/swagger/v1/vald/index.swagger.json (5 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • apis/swagger/v1/vald/object.swagger.json (4 hunks)
  • apis/swagger/v1/vald/remove.swagger.json (3 hunks)
  • apis/swagger/v1/vald/search.swagger.json (8 hunks)
  • apis/swagger/v1/vald/update.swagger.json (3 hunks)
  • apis/swagger/v1/vald/upsert.swagger.json (2 hunks)
  • example/client/go.mod (2 hunks)
  • go.mod (12 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (18 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (3 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • k8s/discoverer/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/lb/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/mirror/deployment.yaml (1 hunks)
  • k8s/index/operator/deployment.yaml (1 hunks)
  • k8s/manager/index/deployment.yaml (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • rust/rust-toolchain (1 hunks)
  • versions/BUF_VERSION (1 hunks)
  • versions/CMAKE_VERSION (1 hunks)
  • versions/DOCKER_VERSION (1 hunks)
  • versions/FAISS_VERSION (1 hunks)
  • versions/GOLANGCILINT_VERSION (1 hunks)
  • versions/HELM_VERSION (1 hunks)
  • versions/K3S_VERSION (1 hunks)
  • versions/KIND_VERSION (1 hunks)
  • versions/KUBECTL_VERSION (1 hunks)
  • versions/NGT_VERSION (1 hunks)
  • versions/OPERATOR_SDK_VERSION (1 hunks)
  • versions/PROMETHEUS_STACK_VERSION (1 hunks)
  • versions/PROTOBUF_VERSION (1 hunks)
  • versions/RUST_VERSION (1 hunks)
  • versions/TELEPRESENCE_VERSION (1 hunks)
  • versions/USEARCH_VERSION (1 hunks)
  • versions/YQ_VERSION (1 hunks)
  • versions/actions/ACTIONS_SETUP_GO (1 hunks)
  • versions/actions/ACTIONS_SETUP_NODE (1 hunks)
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT (1 hunks)
  • versions/actions/CODECOV_CODECOV_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_QEMU_ACTION (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_INIT (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF (1 hunks)
  • versions/actions/GITHUB_ISSUE_METRICS (1 hunks)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_HADOLINT (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL (1 hunks)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (59)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST
  • versions/actions/DOCKER_SETUP_QEMU_ACTION
  • versions/HELM_VERSION
  • versions/KUBECTL_VERSION
  • versions/KIND_VERSION
  • versions/actions/ACTIONS_SETUP_GO
  • versions/BUF_VERSION
  • versions/USEARCH_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF
  • versions/OPERATOR_SDK_VERSION
  • versions/TELEPRESENCE_VERSION
  • versions/CMAKE_VERSION
  • versions/actions/CODECOV_CODECOV_ACTION
  • versions/NGT_VERSION
  • versions/FAISS_VERSION
  • versions/actions/ACTIONS_SETUP_NODE
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE
  • versions/GOLANGCILINT_VERSION
  • versions/PROTOBUF_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_INIT
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD
  • versions/K3S_VERSION
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION
  • versions/RUST_VERSION
  • versions/PROMETHEUS_STACK_VERSION
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL
  • versions/actions/GITHUB_ISSUE_METRICS
  • apis/swagger/v1/vald/flush.swagger.json
  • k8s/index/operator/deployment.yaml
  • k8s/gateway/gateway/lb/deployment.yaml
  • apis/swagger/v1/vald/index.swagger.json
  • internal/test/mock/grpc/grpc_client_mock.go
  • .github/ISSUE_TEMPLATE/bug_report.md
  • .github/PULL_REQUEST_TEMPLATE.md
  • internal/test/mock/grpc_testify_mock.go
  • Makefile.d/build.mk
  • Makefile.d/k3d.mk
  • apis/swagger/v1/mirror/mirror.swagger.json
  • internal/backoff/backoff.go
  • k8s/manager/index/deployment.yaml
  • k8s/discoverer/deployment.yaml
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • versions/YQ_VERSION
  • internal/net/grpc/pool/pool_test.go
  • versions/DOCKER_VERSION
  • rust/rust-toolchain
  • versions/actions/REVIEWDOG_ACTION_HADOLINT
  • .github/ISSUE_TEMPLATE/security_issue_report.md
  • internal/net/net.go
  • pkg/gateway/mirror/service/mirror.go
  • Makefile.d/tools.mk
  • k8s/gateway/gateway/mirror/deployment.yaml
  • apis/swagger/v1/vald/remove.swagger.json
  • Makefile.d/functions.mk
  • apis/swagger/v1/vald/update.swagger.json
  • example/client/go.mod
  • go.mod
🧰 Additional context used
📓 Learnings (1)
.gitfiles (1)
Learnt from: datelier
PR: vdaas/vald#2807
File: rust/bin/agent/src/handler/object.rs:36-38
Timestamp: 2025-01-22T06:26:59.130Z
Learning: In Rust code for Vald handlers, avoid using unwrap() on Option types. Instead, handle None values explicitly by returning appropriate error statuses to maintain robustness.
🔇 Additional comments (39)
apis/swagger/v1/vald/upsert.swagger.json (2)

17-18: Enhanced /upsert Documentation: Detailed Summary and Description

The updated summary now offers a comprehensive overview of the Upsert RPC, including a clear description of its update/insert behavior and a detailed status code table. The added description provides valuable troubleshooting tips for common error scenarios. Please verify that the embedded markdown (tables, line breaks, and code formatting with backticks) renders correctly in your Swagger UI output.


50-51: Enhanced MultiUpsert Endpoint Documentation

The changes for the /upsert/multiple endpoint now include a more descriptive summary emphasizing the ability to update or add multiple vectors in a single request, with bold formatting for clarity. The description starts with an HTML notice warning about gRPC message size limitations and then outlines detailed status codes and troubleshooting steps. Ensure that the HTML snippet and markdown content display as intended across all consuming tools.

apis/swagger/v1/vald/insert.swagger.json (3)

18-18: Enhanced Error Handling and Troubleshooting (Description Update)
The updated description for the /insert endpoint now provides a detailed set of common error reasons and resolutions. This added clarity is very beneficial to API users.


50-50: Clear MultiInsert Summary
The summary for the /insert/multiple endpoint is concise and clearly states that multiple vectors can be added in a single request, and the markdown emphasis appropriately highlights the key point.


51-51: Detailed MultiInsert Description with Warning Note
The new description for the /insert/multiple endpoint adds a useful warning about gRPC message size limitations and provides comprehensive status code and troubleshooting information. Please verify that the HTML encoded elements (e.g., \u003cdiv class="notice"\u003e) render correctly in your Swagger UI environment.

internal/observability/metrics/grpc/grpc.go (3)

33-34: LGTM! Clear and descriptive metric names.

The metric names and descriptions are well-defined and follow observability best practices.


37-39: LGTM! Good struct field naming.

The poolTargetAddrKey field is appropriately named and its purpose is clear from the context.


69-77: LGTM! Well-structured metric view.

The view for pool connection metrics is properly configured with:

  • Clear name and description
  • Appropriate aggregation type (Sum)
internal/net/grpc/pool/pool.go (2)

600-602: LGTM! Thread-safe metrics update.

The metrics update is properly protected by a mutex lock.


894-902: LGTM! Added new Metrics function.

The new Metrics function effectively uses maps.Clone for safe access to metrics data.

internal/net/grpc/client.go (2)

1138-1140: LGTM! Good method signature update.

The rangeConns method now properly accepts context and force parameters, improving control over the connection loop.
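
A hedged sketch of the control flow this describes, using stand-in types; the Conn interface, the sync.Map field, and the error value below are assumptions for illustration, not Vald's implementation:

package example

import (
	"context"
	"errors"
	"sync"
)

type Conn interface {
	IsHealthy(ctx context.Context) bool
}

type client struct {
	conns sync.Map // addr (string) -> Conn
}

var errNoConn = errors.New("no connected gRPC connection found")

// rangeConns walks the connection map, skips unhealthy members unless force
// is set, and stops as soon as ctx is cancelled.
func (c *client) rangeConns(ctx context.Context, force bool, fn func(addr string, conn Conn) bool) error {
	var cnt int
	c.conns.Range(func(k, v any) bool {
		select {
		case <-ctx.Done():
			return false
		default:
		}
		addr, _ := k.(string)
		conn, _ := v.(Conn)
		if !force && (conn == nil || !conn.IsHealthy(ctx)) {
			return true // skip unhealthy members instead of handing them to fn
		}
		cnt++
		return fn(addr, conn)
	})
	if cnt == 0 {
		return errNoConn
	}
	return nil
}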


1124-1133: LGTM! Proper context cancellation handling.

The code correctly handles context cancellation during connection closure.

.gitfiles (2)

161-177: LGTM! Comprehensive API documentation coverage.

The addition of detailed API documentation files for core Vald operations (filter, flush, index, insert, mirror, object, remove, search, update, upsert) improves the project's usability and maintainability.


1970-1974: Integration Testing and Documentation for QBG Algorithm Are Missing

The new QBG algorithm implementation lacks:

  • Test files in the QBG directory
  • README or documentation explaining the algorithm's purpose and usage

Run the following script to verify the missing test files and documentation:

#!/bin/bash
# Description: Check for test files and documentation for QBG algorithm
echo "Checking for tests..."
fd -t f "test" rust/libs/algorithms/qbg/

echo "Checking for documentation..."
fd -t f "README|\.md" rust/libs/algorithms/qbg/
internal/net/grpc/client_test.go (2)

3306-3308: Context parameter added to rangeConns but test cases need update.

The rangeConns method signature has been updated to include a context parameter but the test cases need to be updated accordingly.

The test should verify context handling. Apply this diff:

type args struct {
+    ctx context.Context
     fn func(addr string, p pool.Conn) bool
}

3110-3110: 🛠️ Refactor suggestion

Context parameter added but test cases need update.

The ConnectedAddrs method now accepts a context parameter but the test cases haven't been fully updated to reflect this change.

The test should verify context cancellation and timeout scenarios. Apply this diff:

-gotAddrs := g.ConnectedAddrs(context.Background())
+ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+defer cancel()
+gotAddrs := g.ConnectedAddrs(ctx)

Likely invalid or redundant comment.

apis/swagger/v1/vald/object.swagger.json (4)

17-18: LGTM! Documentation improvements for /exists/{id} endpoint.

The documentation has been significantly enhanced with:

  • Clear overview section
  • Comprehensive status code table
  • Detailed troubleshooting guide with common errors and resolutions

47-47: Complete the TODO sections in documentation.

The /object/list endpoint documentation has incomplete sections:

  • Status Code: TODO
  • Troubleshooting: TODO

Would you like me to help generate the missing documentation sections?


77-77: Complete the TODO sections in documentation.

The /object/meta/{id.id} endpoint documentation has incomplete sections:

  • Status Code: TODO
  • Troubleshooting: TODO

Would you like me to help generate the missing documentation sections?


106-107: LGTM! Documentation improvements for /object/{id.id} endpoint.

The documentation has been significantly enhanced with:

  • Clear overview section
  • Comprehensive status code table
  • Detailed troubleshooting guide with common errors and resolutions
apis/swagger/v1/vald/filter.swagger.json (8)

17-17: Enhanced InsertObject Documentation
The updated summary now clearly explains that the endpoint uses the InsertObject RPC with a detailed table of status codes. This added context will help API consumers understand potential responses.


49-49: Improved MultiInsertObject Summary
The description for the /insert/object/multiple endpoint has been expanded to explain that multiple objects can be inserted in one request and now includes clear status code details.


81-81: Clarified SearchObject Summary
The updated summary for /search/object now explains that the SearchObject RPC is used and provides users with a table of status codes. This is a welcome improvement in clarity.


113-113: Enhanced StreamSearchObject Documentation
The revision for /search/object/multiple now emphasizes the bidirectional streaming nature of the RPC along with status codes and troubleshooting information. This extra detail should aid both developers and integrators.


145-145: Clearer UpdateObject Summary
The new summary for /update/object succinctly explains that this endpoint updates a single vector. Including the status code table ensures that consumers know what to expect on error or success.


177-178: MultiUpdateObject Enhanced with Warning Notice
The /update/object/multiple endpoint now includes a summary and a descriptive block notifying users of gRPC message size limitations. This additional troubleshooting section—with potential reasons and guidance—is especially useful.


210-210: UpsertObject Operation Clarified
The summary for /upsert/object has been updated to indicate that it both updates an existing object and inserts a new one, which improves the understanding of its dual role.


241-243: Comprehensive MultiUpsertObject Documentation
The revision in /upsert/object/multiple now details that multiple upsert operations can be handled in one request. The added warning about message size limitations and the structured status code table are both valuable improvements.

Makefile (3)

161-162: Refactored Linker Flags
The adjustment of LDFLAGS to include options like -Wl,--gc-sections and -Wl,-O2 together with the introduction of STATIC_LDFLAGS (which prepends -static) helps optimize the linking process and manage binary size/security more effectively. Please ensure that these removals (e.g. of -static from the dynamic flag set) are fully intended.


170-176: Introduction of COMMON_FLAGS and Compiler Settings
The new COMMON_FLAGS variable consolidates many optimization and visibility options, and its use in CFLAGS, CXXFLAGS, and FFLAGS improves maintainability. The choice of standards (gnu17, gnu++23, f2018) is clearly documented by the variable names. Consider adding a brief comment in the Makefile about the rationale behind these specific flags for future maintainers.


177-190: Architecture-Specific Adjustments
The conditional block for GOARCH correctly applies additional flags for amd64 (with NGT_FLAGS) while leaving arm64 untouched. This ensures that architecture optimizations are only applied where relevant.

apis/swagger/v1/vald/search.swagger.json (8)

17-18: Detailed LinearSearch RPC Overview
The updated summary for /linearsearch now includes a comprehensive status code table and a troubleshooting section. This makes it much clearer what each response code means and which remedies to consider if an error occurs.


50-52: Improved LinearSearchByID Documentation
The /linearsearch/id endpoint now clearly states that the vector must already be indexed and provides precise troubleshooting details. This clarification should help prevent misuse and reduce confusion for API consumers.


83-85: Enhanced MultiLinearSearchByID Summary
The revision for /linearsearch/id/multiple includes both an updated summary and a troubleshooting table, along with a notice about gRPC message size limitations. These additions are very helpful for guiding users through common pitfalls.


116-118: Clear MultiLinearSearch RPC Description
The updated summary and description for /linearsearch/multiple now call out the potential issue with gRPC message sizes along with detailed status codes and troubleshooting tips. This is a very user‑friendly enhancement.


149-151: Refined Search RPC Overview
The /search endpoint summary is now more descriptive and includes the necessary troubleshooting information and status code table. It maintains consistency with the other search endpoints.


182-183: Enhanced SearchByID Documentation
By detailing that the vector must be pre-indexed and offering clear troubleshooting steps, the updated summary for /search/id significantly reduces ambiguity.


214-218: Comprehensive MultiSearchByID Information
The /search/id/multiple update provides an extensive troubleshooting table and clearly states the limitations and remedies. This added level of detail is very helpful for debugging issues when searching by multiple IDs.


248-250: Improved MultiSearch RPC Clarity
The changes in /search/multiple standardize the wording for multi‑vector searches. The combination of summary, status code table, and troubleshooting instructions delivers a consistent and clear documentation experience.

@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 9a713bb to 5f28239 Compare March 15, 2025 21:54
@vdaas-ci
Collaborator

[WARNING:CONFLICT] You may require to fix the conflict. Please check.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
internal/net/grpc/pool/pool.go (1)

773-773: Fixed typo in log message.

Corrected "successuly" to "successfully" in log message.

-				log.Debugf("successuly found available gRPC port on %d", port)
+				log.Debugf("successfully found available gRPC port on %d", port)
apis/swagger/v1/vald/upsert.swagger.json (1)

69-71: Typographical Correction:
There is a small typo in the parameter description—"Represent mthe ultiple upsert request." should be corrected to "Represent the multiple upsert request."

apis/swagger/v1/vald/search.swagger.json (1)

83-84: Enhanced /linearsearch/id/multiple documentation with important gRPC size limitation notice.
The summary concisely states the endpoint’s purpose. The description includes an HTML notice about the gRPC message size limitation followed by the status code table and troubleshooting details. Consider reviewing the inline comment (e.g. the // --- marker) to see if it should be removed for clarity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a713bb and 5f28239.

⛔ Files ignored due to path filters (31)
  • apis/grpc/v1/agent/core/agent.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/agent/sidecar/sidecar.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/discoverer/discoverer.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/egress/egress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/ingress/ingress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/meta/meta.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/rpc/errdetails/error_details.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • example/client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (68)
  • .gitfiles (4 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/security_issue_report.md (1 hunks)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • Makefile.d/functions.mk (1 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • apis/docs/v1/insert.md (1 hunks)
  • apis/proto/v1/vald/insert.proto (1 hunks)
  • apis/swagger/v1/mirror/mirror.swagger.json (1 hunks)
  • apis/swagger/v1/vald/filter.swagger.json (8 hunks)
  • apis/swagger/v1/vald/flush.swagger.json (1 hunks)
  • apis/swagger/v1/vald/index.swagger.json (5 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • apis/swagger/v1/vald/object.swagger.json (4 hunks)
  • apis/swagger/v1/vald/remove.swagger.json (3 hunks)
  • apis/swagger/v1/vald/search.swagger.json (8 hunks)
  • apis/swagger/v1/vald/update.swagger.json (3 hunks)
  • apis/swagger/v1/vald/upsert.swagger.json (2 hunks)
  • docs/api/insert.md (1 hunks)
  • example/client/go.mod (3 hunks)
  • go.mod (12 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (18 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (3 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • k8s/discoverer/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/lb/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/mirror/deployment.yaml (1 hunks)
  • k8s/index/operator/deployment.yaml (1 hunks)
  • k8s/manager/index/deployment.yaml (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/BUF_VERSION (1 hunks)
  • versions/CMAKE_VERSION (1 hunks)
  • versions/DOCKER_VERSION (1 hunks)
  • versions/FAISS_VERSION (1 hunks)
  • versions/GOLANGCILINT_VERSION (1 hunks)
  • versions/HELM_VERSION (1 hunks)
  • versions/K3S_VERSION (1 hunks)
  • versions/KIND_VERSION (1 hunks)
  • versions/KUBECTL_VERSION (1 hunks)
  • versions/NGT_VERSION (1 hunks)
  • versions/OPERATOR_SDK_VERSION (1 hunks)
  • versions/PROMETHEUS_STACK_VERSION (1 hunks)
  • versions/PROTOBUF_VERSION (1 hunks)
  • versions/TELEPRESENCE_VERSION (1 hunks)
  • versions/USEARCH_VERSION (1 hunks)
  • versions/YQ_VERSION (1 hunks)
  • versions/actions/ACTIONS_SETUP_GO (1 hunks)
  • versions/actions/ACTIONS_SETUP_NODE (1 hunks)
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT (1 hunks)
  • versions/actions/CODECOV_CODECOV_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_QEMU_ACTION (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_INIT (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF (1 hunks)
  • versions/actions/GITHUB_ISSUE_METRICS (1 hunks)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_HADOLINT (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL (1 hunks)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (59)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE
  • versions/actions/DOCKER_SETUP_QEMU_ACTION
  • versions/CMAKE_VERSION
  • versions/PROTOBUF_VERSION
  • versions/USEARCH_VERSION
  • versions/KIND_VERSION
  • versions/HELM_VERSION
  • versions/actions/ACTIONS_SETUP_NODE
  • versions/DOCKER_VERSION
  • versions/OPERATOR_SDK_VERSION
  • versions/BUF_VERSION
  • versions/actions/REVIEWDOG_ACTION_HADOLINT
  • versions/YQ_VERSION
  • versions/TELEPRESENCE_VERSION
  • versions/FAISS_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION
  • versions/NGT_VERSION
  • versions/actions/ACTIONS_SETUP_GO
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT
  • versions/KUBECTL_VERSION
  • versions/K3S_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_INIT
  • versions/GOLANGCILINT_VERSION
  • versions/PROMETHEUS_STACK_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD
  • apis/proto/v1/vald/insert.proto
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF
  • apis/docs/v1/insert.md
  • apis/swagger/v1/vald/flush.swagger.json
  • versions/actions/CODECOV_CODECOV_ACTION
  • docs/api/insert.md
  • internal/test/mock/grpc/grpc_client_mock.go
  • k8s/index/operator/deployment.yaml
  • versions/actions/GITHUB_ISSUE_METRICS
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • .github/PULL_REQUEST_TEMPLATE.md
  • internal/test/mock/grpc_testify_mock.go
  • Makefile.d/k3d.mk
  • Makefile.d/functions.mk
  • apis/swagger/v1/vald/index.swagger.json
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL
  • internal/net/net.go
  • pkg/gateway/mirror/service/mirror.go
  • k8s/gateway/gateway/lb/deployment.yaml
  • .github/ISSUE_TEMPLATE/bug_report.md
  • internal/net/grpc/pool/pool_test.go
  • apis/swagger/v1/mirror/mirror.swagger.json
  • internal/backoff/backoff.go
  • k8s/gateway/gateway/mirror/deployment.yaml
  • k8s/discoverer/deployment.yaml
  • internal/observability/metrics/grpc/grpc.go
  • k8s/manager/index/deployment.yaml
  • internal/net/grpc/client_test.go
  • .github/ISSUE_TEMPLATE/security_issue_report.md
  • internal/net/grpc/client.go
  • go.mod
  • example/client/go.mod
🧰 Additional context used
🧠 Learnings (1)
.gitfiles (1)
Learnt from: datelier
PR: vdaas/vald#2807
File: rust/bin/agent/src/handler/object.rs:36-38
Timestamp: 2025-03-15T12:24:12.647Z
Learning: In Rust code for Vald handlers, avoid using unwrap() on Option types. Instead, handle None values explicitly by returning appropriate error statuses to maintain robustness.
🔇 Additional comments (75)
.gitfiles (4)

126-126: Addition of Rust unit tests is a positive step

Adding a dedicated workflow for Rust unit tests will help ensure code quality and stability for the Rust components in the codebase.


161-177: API documentation expansion is valuable

The addition of comprehensive API documentation files for various endpoints (filter, flush, index, etc.) improves the project's usability and developer experience.


1951-1951: Error Handling Concern: Verify unwrap() usage in Rust handler

Based on previous feedback and project guidelines, the Rust handler code should avoid using unwrap() on Option types and instead handle None values explicitly by returning appropriate error statuses.

#!/bin/bash
# Description: Check for unwrap() usage in the new Rust handler
rg -A 2 'unwrap\(\)' rust/bin/agent/src/handler/object.rs

1970-1974: Integration Testing and Documentation for QBG Algorithm

The QBG algorithm files have been added, but ensure that proper integration testing and documentation are provided for this new algorithm.

#!/bin/bash
# Description: Check for test files and documentation for QBG algorithm
echo "Checking for tests..."
fd -t f "test" rust/libs/algorithms/qbg/

echo "Checking for documentation..."
fd -t f "README|\.md" rust/libs/algorithms/qbg/
internal/net/grpc/pool/pool.go (42)

23-23: Import of "maps" package added for metrics functionality.

The "maps" package is used for the new metrics functionality that tracks connection health status.


34-35: Added imports for gRPC error handling.

The imports for "codes" and "status" packages enable better error handling for gRPC connections, which is essential for the connection health checking.


40-40: Updated import to use singleflight package.

The singleflight package helps consolidate multiple concurrent requests for the same resource into a single operation, which is beneficial for connection management.


52-52: Interface method signature updated to accept context.

The Disconnect method now accepts a context.Context parameter, which is a good practice for propagating cancelation signals and timeouts.


54-54: Interface method signature clarification.

The Get method signature now explicitly shows it accepts a context.Context parameter.


79-79: Updated singleflight implementation to use generics.

The singleflight.Group is now using the generic type parameter Conn which improves type safety.


92-95: Added metrics tracking for connection health.

A new global metrics map with mutex protection has been added to track the health status of connections across the application.

This metrics implementation will be crucial for monitoring the health of the gRPC connections.
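
A hedged sketch of one possible consumer inside the Vald module; the reporter function, its name, and the ticker interval are assumptions for illustration, while the real consumer in this PR is the observability gauge:

package example

import (
	"context"
	"log"
	"time"

	"github.com/vdaas/vald/internal/net/grpc/pool"
)

// reportPoolHealth periodically logs the healthy-connection count per target
// address until ctx is cancelled.
func reportPoolHealth(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			for addr, healthy := range pool.Metrics(ctx) {
				log.Printf("pool %s: %d healthy connections", addr, healthy)
			}
		}
	}
}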


107-107: Updated singleflight initialization with generic type.

The singleflight.Group is now initialized with the generic New function.


142-143: Enhanced debug logging for connection initialization.

Added detailed logging for connection attempts to improve diagnostics and visibility into the connection process.

These logs will help in troubleshooting connection issues.

Also applies to: 159-160


193-193: Added logging for connection pool initialization.

This log message provides visibility into the connection pool size and capacity during initialization.


239-244: Cleaned up pool initialization and improved mutex handling.

The commented-out init call and proper mutex locking/unlocking improve the thread safety of the implementation.


253-253: Removed unnecessary pool initialization.

The init call is commented out, as it's likely unnecessary at this position.


307-311: Improved logging for connection refresh.

Enhanced debug logging to provide better visibility into the connection refresh process.


353-355: Enhanced logging for Connect operation.

The debug logging now includes more connection details that will be useful for troubleshooting.


358-358: Refactored Connect method to use singleTargetConnect.

The Connect method now properly delegates to singleTargetConnect when handling IP connections or when DNS resolution fails.

Also applies to: 362-365


371-373: Added validation in connect method.

Added validation to check if the pool is nil or closing before proceeding with the connection.


379-380: Added logging for IP connections.

Enhanced logging for connecting to multiple IPs.


406-438: Implemented singleTargetConnect method to replace Reconnect.

The singleTargetConnect method provides a more focused approach to connecting to a single endpoint, improving error handling and logging.

This method is crucial for the health check functionality as it handles reconnection to unhealthy endpoints.


440-472: Reimplemented Reconnect method with improved health checking.

The Reconnect method now checks connection health before attempting to reconnect, and uses DNS resolution for more efficient IP balancing when possible.

This is a key implementation for the health check functionality mentioned in the PR title.


474-475: Updated Disconnect method to accept context parameter.

The Disconnect method now accepts a context parameter, allowing for better cancellation and timeout control.


479-482: Enhanced logging for connection closure.

Added logging for each connection being closed during the disconnection process.


507-507: Improved logging for dial operations.

Added timeout information to the dial logging message.


551-551: Added health check logging.

Added a debug log statement to indicate when health checks are being performed.


558-559: Enhanced error reporting for unhealthy connections.

Improved logging for unhealthy connections and updated the way unhealthy connections are handled.

Also applies to: 562-563


568-570: Enhanced logging for unhealthy IP connections.

Added warning logs and updated how the unhealthy connection count is calculated.


573-577: Improved handling of DNS connection health.

Enhanced the logic for detecting and handling unhealthy DNS connections.


588-589: Added error logging for DNS reconnection failures.

Added logging when DNS reconnection fails during health checks.


600-602: Added metrics tracking for connection health.

Updated the metrics map with the count of healthy connections, which is crucial for monitoring.


621-648: Enhanced error handling in Do method.

Added status code checking to differentiate between connection errors and canceled operations, with improved logic for reconnecting.

This change is significant for making the connection pool more resilient.
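
A minimal, self-contained sketch of this kind of status-code triage (not the actual Do implementation; the reconnect hook below is a hypothetical stand-in for the pool's refresh logic):

package poolsketch

import (
	"context"
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classifyAndRefresh returns canceled/deadline errors untouched, and only
// triggers the (hypothetical) reconnect hook for transport-level failures.
func classifyAndRefresh(ctx context.Context, err error, reconnect func(context.Context) error) error {
	if err == nil {
		return nil
	}
	if st, ok := status.FromError(err); ok {
		switch st.Code() {
		case codes.Canceled, codes.DeadlineExceeded:
			// The caller gave up; leave the connection alone.
			return err
		case codes.Unavailable, codes.ResourceExhausted, codes.Aborted:
			// Likely a broken or overloaded connection: try to refresh it.
			if rerr := reconnect(ctx); rerr != nil {
				return errors.Join(err, rerr)
			}
		}
	}
	return err
}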


653-659: Simplified Get method implementation.

Streamlined the Get method to use getHealthyConn and return a poolConn directly.


662-663: Updated getHealthyConn method signature.

The method now returns a poolConn reference in addition to index and boolean status.


676-676: Updated Disconnect call to use context.

Updated the Disconnect call to pass context parameter, matching the new method signature.


684-690: Improved connection health check and refresh logic.

Enhanced the logic for checking connection health and refreshing connections, returning the new connection details.


697-699: Simplified healthy connection check.

Streamlined the logic for checking if a connection is healthy and returning it.


761-761: Added logging for port scanning.

Enhanced visibility into the port scanning process with additional logging.


798-811: Updated String method formatting.

Updated the string formatting to include additional connection details.


816-817: Adjusted minimum delay for connection closing.

Set a minimum delay of 5ms for connection closing to prevent connections from being closed too rapidly.


825-825: Enhanced logging for connection closure.

Added debug logging with delay information for connection closing.


830-839: Improved error handling during connection closure.

Added status code checking to better handle errors during connection closure.


851-856: Enhanced error handling for connection closure during state changes.

Added status code checking to better handle errors during connection closure when state changes.


873-873: Enhanced connection health check logging.

Improved visibility into connection health status with more detailed logging.

Also applies to: 879-885


894-902: Implemented Metrics function for health monitoring.

Added a new function to expose connection health metrics, which is essential for the health check functionality mentioned in the PR title.

This function provides a way to access the metrics map with proper synchronization, using maps.Clone for thread safety.
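
A compact sketch of the copy-on-read pattern the review describes (the field and function names here are assumptions; the actual Vald signature may differ):

package poolsketch

import (
	"maps"
	"sync"
)

var (
	mu      sync.RWMutex
	metrics = make(map[string]int64)
)

// setHealthy records the number of healthy connections for an address.
func setHealthy(addr string, n int64) {
	mu.Lock()
	defer mu.Unlock()
	metrics[addr] = n
}

// Metrics returns a snapshot of the health map; maps.Clone keeps callers
// from mutating, or racing on, the internal state.
func Metrics() map[string]int64 {
	mu.RLock()
	defer mu.RUnlock()
	return maps.Clone(metrics)
}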

apis/swagger/v1/vald/update.swagger.json (3)

17-18: Comprehensive Update Endpoint Documentation:
The updated summary and description for the /update endpoint now provide detailed status codes and a clear troubleshooting table. This structured formatting will help developers quickly understand the expected responses.


50-52: Enhanced MultiUpdate Documentation:
The /update/multiple endpoint now includes an explicit overview and structured status code information. The consistent formatting across endpoints is excellent.


83-84: Incomplete Timestamp Update Documentation:
The /update/timestamp endpoint still has "TODO" placeholders for the status code and troubleshooting sections. Please complete these details before merging to ensure users receive full guidance.

apis/swagger/v1/vald/filter.swagger.json (8)

17-18: Enhanced Insert Object Documentation:
The updated summary for the /insert/object endpoint clearly explains that it leverages the InsertObject RPC via the Vald Filter Gateway and includes a well-structured status code table. Excellent improvement for clarity.


49-50: Clear MultiInsert Documentation:
The new summary for the /insert/object/multiple endpoint now explicitly states that it is used to add multiple objects in one request. This enhancement makes the endpoint’s behavior easier to understand.


81-82: Improved Search Object Summary:
The updated summary for the /search/object endpoint is concise and adds clarity by listing potential status codes. This will help users diagnose issues quickly.


113-114: Detailed Streaming Search Documentation:
The /search/object/multiple endpoint now incorporates a description of bidirectional streaming and details that each search request/response is independent, which is very useful for grasping the RPC behavior.


145-147: Updated Update Object RPC Summary:
The summary for the /update/object endpoint is now more informative and includes a structured list of status codes. This consistency will aid in reducing ambiguity during troubleshooting.


176-179: Enhanced MultiUpdate Object Documentation:
The summary and description for the /update/object/multiple endpoint have been expanded to include detailed troubleshooting steps along with status codes. This is consistent with other endpoints and adds valuable detail.


211-213: Consistent Upsert Object Documentation:
The updated summary for the /upsert/object endpoint clearly communicates its dual role in updating or adding a single object, making it easier for users to interpret its behavior.


242-244: Thorough MultiUpsert Documentation:
The /upsert/object/multiple endpoint now includes a notice regarding gRPC message size limitations and provides a detailed status code table. This comprehensive format is a welcome improvement.

apis/swagger/v1/vald/insert.swagger.json (2)

17-18: Improved Insert Endpoint Documentation:
The expanded summary and troubleshooting details for the /insert endpoint now offer clear guidance on both the expected behavior and error-handling strategies.


50-52: Enhanced MultiInsert Documentation:
The summary for the /insert/multiple endpoint now includes a concise notice about gRPC message size limitations and provides a structured table of status codes with corresponding troubleshooting advice, which greatly improves its clarity.

apis/swagger/v1/vald/upsert.swagger.json (2)

17-18: Clear Upsert Endpoint Documentation:
The revised summary for the /upsert endpoint now clearly explains its functionality—both updating an existing vector and adding a new one if the vector isn’t already present. The inclusion of a structured troubleshooting section is very beneficial.


50-52: Comprehensive MultiUpsert Documentation:
The updated documentation for the /upsert/multiple endpoint provides an in-depth explanation, including a notice on gRPC message size limitations and a detailed table of status codes, which ensures a complete understanding of potential issues.

apis/swagger/v1/vald/remove.swagger.json (3)

17-19: Enhanced Remove Endpoint Documentation:
The updated summary for the /remove endpoint now provides a comprehensive overview with a well-organized status code table and troubleshooting steps. This extra detail is very useful.


50-52: Clear MultiRemove Documentation:
The multi-remove endpoint now includes a detailed description and a notice concerning gRPC message size limitations, along with comprehensive troubleshooting guidance. This consistency is highly beneficial for end users.


83-85: Detailed RemoveByTimestamp Documentation:
The updates to the /remove/timestamp endpoint include a clear summary and an informative description that explains the use of repeated timestamps (resulting in an AND condition) along with troubleshooting advice. This added clarity will help prevent misinterpretation.
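
To make the AND semantics concrete, here is a rough sketch with a hypothetical predicate type (this is not the actual payload definition from the proto):

package removesketch

// tsFilter is a hypothetical predicate over a vector's timestamp.
type tsFilter func(ts int64) bool

// matchesAll mirrors the documented behavior: when several timestamp
// conditions are supplied, a vector qualifies for removal only if every
// condition matches.
func matchesAll(ts int64, filters []tsFilter) bool {
	for _, f := range filters {
		if !f(ts) {
			return false
		}
	}
	return true
}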

apis/swagger/v1/vald/object.swagger.json (4)

17-18: Enhanced /exists/{id} documentation is well detailed.
The updated summary now clearly outlines the error status codes and troubleshooting guidelines, and the added description table provides actionable resolution steps. Please ensure that the formatting remains consistent across all endpoints.


47-47: Reminder: Complete TODO placeholders for /object/list.
The summary currently contains "TODO" for the status code and troubleshooting sections. As noted in previous reviews, please update these placeholders with the appropriate values to provide complete documentation.


77-77: Reminder: Complete TODO placeholders for /object/meta/{id.id}.
The summary for this endpoint still shows "TODO" for the status code and troubleshooting sections. Please update these sections to give clear guidance on expected responses and error resolutions.


106-107: Detailed /object/{id.id} documentation update looks solid.
The new summary and description provide a comprehensive overview and clear troubleshooting steps. Just verify that the error codes remain up to date as additional changes occur in the service.

apis/swagger/v1/vald/search.swagger.json (7)

17-18: Well-structured /linearsearch endpoint documentation.
The expanded summary and detailed error troubleshooting table offer clarity on potential failure scenarios. The consistent format makes it easier for API users to understand the behavior.


50-52: Clear and comprehensive /linearsearch/id documentation update.
The new summary now explains the precondition (that the vector must be indexed) and lists the status codes and troubleshooting guidance. The accompanying description table is informative and consistent with other endpoints.


116-117: Improved /linearsearch/multiple endpoint documentation.
The summary clearly explains that multiple vectors can be searched in a single request. The description’s HTML notice regarding gRPC message limits, status code table, and troubleshooting guidance are well integrated.


149-150: Solid update to /search endpoint documentation.
Both the summary and description now provide a clear overview, detailed expected status codes, and troubleshooting steps. This consistency will aid API consumers in diagnosing issues.


182-183: Accurate documentation update for /search/id.
The summary now specifies that the vector must be indexed prior to search, and the detailed error table in the description is consistent with other similar endpoints.


215-216: Comprehensive update for /search/id/multiple endpoint.
The enhanced summary and description now provide clear guidance on how to handle multiple ID searches, along with detailed error resolution instructions and common pitfalls.


248-250: Detailed update for /search/multiple endpoint is well done.
The summary clearly states the purpose, and the description—complete with the HTML notice about gRPC message size limitations, error codes, and troubleshooting guidance—ensures users understand potential issues.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 15, 2025
@kpango kpango force-pushed the bugfix/internal-grpc/add-health-check-for-range-over-grpc-loop branch from 5f28239 to a8c444f Compare March 15, 2025 22:12
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
apis/swagger/v1/vald/update.swagger.json (1)

83-84: ⚠️ Potential issue

Incomplete documentation for the /update/timestamp endpoint.

The documentation contains placeholder "TODO" sections for both status code table and troubleshooting information.

Complete the documentation by adding:

  1. Status code table (following the same format as other endpoints)
  2. Troubleshooting section with common errors and their resolutions

This was previously identified in a review and still needs to be addressed before merging.

apis/swagger/v1/vald/object.swagger.json (2)

47-47: ⚠️ Potential issue

Incomplete documentation for the /object/list endpoint.

The documentation contains placeholder "TODO" sections for both status code table and troubleshooting information.

Complete the documentation by adding:

  1. Status code table (following the same format as other endpoints)
  2. Troubleshooting section with common errors and their resolutions

This was previously identified in a review and still needs to be addressed before merging.


77-77: ⚠️ Potential issue

Incomplete documentation for the /object/meta/{id.id} endpoint.

The documentation contains placeholder "TODO" sections for both status code table and troubleshooting information.

Complete the documentation by adding:

  1. Status code table (following the same format as other endpoints)
  2. Troubleshooting section with common errors and their resolutions

This was previously identified in a review and still needs to be addressed before merging.

🧹 Nitpick comments (11)
internal/net/grpc/pool/pool.go (2)

92-95: New global metrics tracking mechanism.

The introduction of a global metrics map with mutex synchronization enables tracking connection health across the application.

While this implementation works, consider a more structured approach:

-var (
-	mu      sync.RWMutex
-	metrics map[string]int64 = make(map[string]int64)
-)

+type poolMetrics struct {
+	mu      sync.RWMutex
+	metrics map[string]int64
+}
+
+var globalMetrics = &poolMetrics{
+	metrics: make(map[string]int64),
+}
+
+func (pm *poolMetrics) update(addr string, value int64) {
+	pm.mu.Lock()
+	defer pm.mu.Unlock()
+	pm.metrics[addr] = value
+}
+
+func (pm *poolMetrics) get() map[string]int64 {
+	pm.mu.RLock()
+	defer pm.mu.RUnlock()
+	return maps.Clone(pm.metrics)
+}

This encapsulates the metrics logic and provides clearer methods for updating and retrieving metrics.


239-245: Commented initialization code should be removed.

The commented-out initialization call (// p.init(false)) shouldn't be left in the codebase.

-	// p.init(false)
	p.pmu.Lock()

Remove the commented code since it's no longer needed. Similar commented code appears at lines 253 and 353. Removing all instances will improve code clarity.

apis/swagger/v1/vald/filter.swagger.json (1)

178-178: Grammatical error in warning message

There's a grammatical error in the message size limitation warning.

- "Please be careful that the size of the request exceed the limit."
+ "Please be careful that the size of the request does not exceed the limit."

or alternatively:

- "Please be careful that the size of the request exceed the limit."
+ "Please be careful if the size of the request exceeds the limit."
internal/net/grpc/client.go (8)

257-258: Consider verifying necessity of forced rebalance.
Calling rangeConns with force=true can be costly in large deployments. If overhead becomes an issue, consider triggering a forced rebalance only when it is actually needed.


264-264: Beware of logging detailed connection info.
p.String() might reveal internal details. If there's any concern about sensitive data exposure, consider sanitizing or limiting such logs.


296-297: Fix minor grammar in debug log.
The message “starting to checking health...” is slightly off. Changing it to “starting to check health...” or “starting health check...” is clearer.


303-303: Monitor for sensitive data in logs.
Similar to previous logs, ensure that p.String() doesn't leak confidential information at debug level.


348-348: Potentially large output.
log.Debugf("starting to bulk reconnection...") can become verbose if g.crl is large. Consider rate-limiting or grouping logs if spam becomes an issue.


383-383: Potentially large or sensitive outputs.
Be mindful that listing all disconnect targets may flood logs or expose details. Log only essential data if volume or sensitivity is a concern.


391-391: Redact sensitive address details if needed.
This debug statement enumerates addresses and hosts. If there's any risk of leaking confidential info, consider partial redactions.


1147-1162: Consider modularizing reconnection logic.
The nested code for reconnecting, disconnecting, and checking IP connections is quite verbose. Moving it into a helper function could reduce complexity and centralize error handling.
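
One possible shape for such a helper, sketched against the public grpc-go connectivity API (the name ensureUsable and the overall structure are assumptions, not Vald's actual code):

package poolsketch

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// ensureUsable concentrates the reconnect/skip decision in one place:
// it nudges idle or failed connections back toward READY and reports
// whether the connection can be used by the caller.
func ensureUsable(ctx context.Context, conn *grpc.ClientConn) bool {
	if conn == nil {
		return false
	}
	switch state := conn.GetState(); state {
	case connectivity.Ready:
		return true
	case connectivity.Shutdown:
		return false
	default: // Idle, Connecting, TransientFailure
		conn.Connect() // kicks off a connection attempt if the conn is idle
		if conn.WaitForStateChange(ctx, state) {
			return conn.GetState() == connectivity.Ready
		}
		return false
	}
}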

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f28239 and a8c444f.

⛔ Files ignored due to path filters (30)
  • apis/grpc/v1/agent/core/agent.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/agent/sidecar/sidecar.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/discoverer/discoverer.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/egress/egress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/filter/ingress/ingress_filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/meta/meta.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/mirror/mirror_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/payload/payload.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/rpc/errdetails/error_details.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/filter_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/flush_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/index_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/insert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/object_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/remove_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/search_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/update_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • apis/grpc/v1/vald/upsert_vtproto.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • example/client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (68)
  • .gitfiles (4 hunks)
  • .github/ISSUE_TEMPLATE/bug_report.md (1 hunks)
  • .github/ISSUE_TEMPLATE/security_issue_report.md (1 hunks)
  • .github/PULL_REQUEST_TEMPLATE.md (1 hunks)
  • Makefile.d/functions.mk (1 hunks)
  • Makefile.d/k3d.mk (3 hunks)
  • apis/docs/v1/insert.md (1 hunks)
  • apis/proto/v1/vald/insert.proto (1 hunks)
  • apis/swagger/v1/mirror/mirror.swagger.json (1 hunks)
  • apis/swagger/v1/vald/filter.swagger.json (8 hunks)
  • apis/swagger/v1/vald/flush.swagger.json (1 hunks)
  • apis/swagger/v1/vald/index.swagger.json (5 hunks)
  • apis/swagger/v1/vald/insert.swagger.json (2 hunks)
  • apis/swagger/v1/vald/object.swagger.json (4 hunks)
  • apis/swagger/v1/vald/remove.swagger.json (3 hunks)
  • apis/swagger/v1/vald/search.swagger.json (8 hunks)
  • apis/swagger/v1/vald/update.swagger.json (3 hunks)
  • apis/swagger/v1/vald/upsert.swagger.json (2 hunks)
  • docs/api/insert.md (1 hunks)
  • example/client/go.mod (2 hunks)
  • go.mod (17 hunks)
  • internal/backoff/backoff.go (2 hunks)
  • internal/net/grpc/client.go (18 hunks)
  • internal/net/grpc/client_test.go (2 hunks)
  • internal/net/grpc/pool/pool.go (27 hunks)
  • internal/net/grpc/pool/pool_test.go (1 hunks)
  • internal/net/net.go (3 hunks)
  • internal/observability/metrics/grpc/grpc.go (3 hunks)
  • internal/test/mock/grpc/grpc_client_mock.go (1 hunks)
  • internal/test/mock/grpc_testify_mock.go (1 hunks)
  • k8s/discoverer/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/lb/deployment.yaml (1 hunks)
  • k8s/gateway/gateway/mirror/deployment.yaml (1 hunks)
  • k8s/index/operator/deployment.yaml (1 hunks)
  • k8s/manager/index/deployment.yaml (1 hunks)
  • pkg/gateway/lb/handler/grpc/aggregation.go (4 hunks)
  • pkg/gateway/mirror/service/mirror.go (1 hunks)
  • versions/BUF_VERSION (1 hunks)
  • versions/CMAKE_VERSION (1 hunks)
  • versions/DOCKER_VERSION (1 hunks)
  • versions/FAISS_VERSION (1 hunks)
  • versions/GOLANGCILINT_VERSION (1 hunks)
  • versions/HELM_VERSION (1 hunks)
  • versions/K3S_VERSION (1 hunks)
  • versions/KIND_VERSION (1 hunks)
  • versions/KUBECTL_VERSION (1 hunks)
  • versions/NGT_VERSION (1 hunks)
  • versions/OPERATOR_SDK_VERSION (1 hunks)
  • versions/PROMETHEUS_STACK_VERSION (1 hunks)
  • versions/PROTOBUF_VERSION (1 hunks)
  • versions/TELEPRESENCE_VERSION (1 hunks)
  • versions/USEARCH_VERSION (1 hunks)
  • versions/YQ_VERSION (1 hunks)
  • versions/actions/ACTIONS_SETUP_GO (1 hunks)
  • versions/actions/ACTIONS_SETUP_NODE (1 hunks)
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT (1 hunks)
  • versions/actions/CODECOV_CODECOV_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION (1 hunks)
  • versions/actions/DOCKER_SETUP_QEMU_ACTION (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_INIT (1 hunks)
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF (1 hunks)
  • versions/actions/GITHUB_ISSUE_METRICS (1 hunks)
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_HADOLINT (1 hunks)
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL (1 hunks)
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (57)
  • versions/DOCKER_VERSION
  • versions/actions/ACTIONS_SETUP_GO
  • versions/CMAKE_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_UPLOAD_SARIF
  • versions/actions/GITHUB_CODEQL_ACTION_ANALYZE
  • versions/actions/DOCKER_SETUP_QEMU_ACTION
  • versions/KUBECTL_VERSION
  • versions/actions/ACTIONS_SETUP_NODE
  • versions/actions/SOFTPROPS_ACTION_GH_RELEASE
  • versions/KIND_VERSION
  • versions/PROTOBUF_VERSION
  • versions/BUF_VERSION
  • versions/HELM_VERSION
  • docs/api/insert.md
  • versions/actions/PETER_EVANS_CREATE_PULL_REQUEST
  • versions/actions/REVIEWDOG_ACTION_HADOLINT
  • versions/actions/GITHUB_CODEQL_ACTION_INIT
  • versions/actions/REVIEWDOG_ACTION_LANGUAGETOOL
  • apis/swagger/v1/vald/flush.swagger.json
  • versions/USEARCH_VERSION
  • versions/OPERATOR_SDK_VERSION
  • versions/K3S_VERSION
  • versions/actions/GITHUB_CODEQL_ACTION_AUTOBUILD
  • versions/PROMETHEUS_STACK_VERSION
  • apis/docs/v1/insert.md
  • apis/proto/v1/vald/insert.proto
  • versions/FAISS_VERSION
  • versions/GOLANGCILINT_VERSION
  • versions/actions/GITHUB_ISSUE_METRICS
  • versions/NGT_VERSION
  • versions/TELEPRESENCE_VERSION
  • versions/actions/CODECOV_CODECOV_ACTION
  • k8s/index/operator/deployment.yaml
  • internal/backoff/backoff.go
  • Makefile.d/functions.mk
  • versions/actions/DOCKER_SETUP_BUILDX_ACTION
  • .github/PULL_REQUEST_TEMPLATE.md
  • k8s/discoverer/deployment.yaml
  • versions/YQ_VERSION
  • internal/test/mock/grpc_testify_mock.go
  • internal/net/net.go
  • versions/actions/ACTIONS_UPLOAD_ARTIFACT
  • .github/ISSUE_TEMPLATE/bug_report.md
  • k8s/gateway/gateway/lb/deployment.yaml
  • Makefile.d/k3d.mk
  • apis/swagger/v1/mirror/mirror.swagger.json
  • pkg/gateway/lb/handler/grpc/aggregation.go
  • k8s/manager/index/deployment.yaml
  • pkg/gateway/mirror/service/mirror.go
  • .github/ISSUE_TEMPLATE/security_issue_report.md
  • apis/swagger/v1/vald/index.swagger.json
  • internal/test/mock/grpc/grpc_client_mock.go
  • internal/net/grpc/pool/pool_test.go
  • k8s/gateway/gateway/mirror/deployment.yaml
  • internal/net/grpc/client_test.go
  • example/client/go.mod
  • go.mod
🧰 Additional context used
🧠 Learnings (1)
.gitfiles (1)
Learnt from: datelier
PR: vdaas/vald#2807
File: rust/bin/agent/src/handler/object.rs:36-38
Timestamp: 2025-03-15T12:24:12.647Z
Learning: In Rust code for Vald handlers, avoid using unwrap() on Option types. Instead, handle None values explicitly by returning appropriate error statuses to maintain robustness.
⏰ Context from checks skipped due to timeout of 90000ms (111)
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubeval-conflint
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
  • GitHub Check: Run tests for internal packages
  • GitHub Check: Run tests for pkg packages
  • GitHub Check: Run tests for cmd packages
  • GitHub Check: runner / kubelinter
  • GitHub Check: grpc-sequential
  • GitHub Check: grpc-stream
  • GitHub Check: coverage
  • GitHub Check: CodeQL
  • GitHub Check: check-format-diff
  • GitHub Check: runner / go build
  • GitHub Check: Run tests for Rust
🔇 Additional comments (48)
.gitfiles (4)

126-126: New unit testing workflow for Rust components added

This adds a dedicated workflow for Rust unit tests, which is a good step toward ensuring code quality for the Rust implementations in the project. Ensuring proper test coverage is critical for maintaining reliability, especially when adding new functionality like the QBG algorithm and object handler.


161-177: API documentation system has been improved

The addition of these documentation-related files significantly enhances the project's API documentation structure. This includes configuration files for documentation generation (buf.gen.doc.yaml, buf.gen.payload.yaml, etc.) and individual endpoint documentation files like filter.md, flush.md, etc. These changes align with good software development practices by improving developer experience through comprehensive documentation.


970-974: ⚠️ Potential issue

New QBG algorithm implementation added without proper testing and documentation

The pull request introduces a new QBG algorithm implementation in Rust, but there appears to be no test files or documentation for this component. This is concerning as it could lead to integration issues and maintenance challenges.

#!/bin/bash
# Description: Check for test files and documentation for QBG algorithm
echo "Checking for tests..."
fd -t f "test" rust/libs/algorithms/qbg/

echo "Checking for documentation..."
fd -t f "README|\.md" rust/libs/algorithms/qbg/

951-951: ⚠️ Potential issue

Error handling concern in the Rust handler implementation

The new object.rs handler implementation might contain instances of unwrap() on Option types, which violates the project's error handling guidelines. This could lead to panics in production if null values are encountered.

#!/bin/bash
# Description: Check for unwrap() usage in the new Rust handler
rg -A 2 'unwrap\(\)' rust/bin/agent/src/handler/object.rs
apis/swagger/v1/vald/upsert.swagger.json (2)

17-18: Excellent API documentation enhancement for the Upsert endpoint

The summary and description for the /upsert endpoint have been significantly enhanced with:

  1. A clear overview of the endpoint's functionality
  2. A structured table of status codes and their meanings
  3. A troubleshooting section with common error scenarios
  4. Detailed resolution strategies for each potential error

This comprehensive documentation greatly improves developer experience by providing clear guidance on how to use the API and troubleshoot issues.


50-51: Well-structured API documentation for the MultiUpsert endpoint

The /upsert/multiple endpoint documentation has been enhanced with:

  1. A clear explanation of its purpose for updating or adding multiple vectors
  2. An important notice about gRPC message size limitations
  3. A structured table of status codes
  4. Comprehensive troubleshooting information and error resolution strategies

These improvements provide developers with critical information about potential limitations and error handling, which will help ensure correct implementation.

apis/swagger/v1/vald/remove.swagger.json (3)

17-18: Excellent documentation improvements for the /remove endpoint.

The expanded summary and detailed description provide clear guidance for users by:

  • Adding a structured overview of the endpoint functionality
  • Including a comprehensive status code table
  • Adding a detailed troubleshooting section with common error scenarios and resolution steps

These improvements will significantly help developers understand how to properly use the API and handle errors.


50-51: Well-structured documentation for the /remove/multiple endpoint.

The additions here are valuable:

  • Clear warning about gRPC message size limitations
  • Consistent status code table format
  • Comprehensive troubleshooting guidance

This consistent documentation approach across endpoints will improve the developer experience.


83-84: Helpful clarification of /remove/timestamp endpoint behavior.

The documentation now clearly explains:

  • The purpose of the timestamp-based vector removal
  • How the AND condition works when multiple timestamps are provided
  • Specific error handling guidance for timestamp-related operations

This is particularly helpful for users working with timestamp-based operations.

apis/swagger/v1/vald/update.swagger.json (2)

17-18: Thorough documentation for the /update endpoint.

The expanded summary provides valuable information:

  • Clear overview of the update operation
  • Complete status code table with all possible response codes
  • Detailed troubleshooting information for various error scenarios

This will help developers implement robust error handling when using this endpoint.


50-51: Comprehensive documentation for the /update/multiple endpoint.

The additions here mirror the single update endpoint's improvements with appropriate adaptations:

  • Important warning about gRPC message size limitations
  • Consistent formatting with other endpoint documentation
  • Detailed error resolution guidance

This consistency across endpoints is excellent for the API documentation.

internal/net/grpc/pool/pool.go (7)

23-41: Updated imports to support health check functionality.

The new imports enable important functionality:

  • maps package for metrics handling (Go 1.21+ feature)
  • Direct imports for grpc/codes and grpc/status for better error handling
  • singleflight for deduplicating concurrent connection attempts

These imports support the health check mechanism being implemented.


355-368: Improved connection logic with enhanced logging.

The connection logic has been enhanced with additional logging to provide better visibility into the connection process.

The improved logging will make it easier to diagnose connection issues, which is particularly valuable for distributed systems like Vald.


406-438: Renamed and enhanced the Reconnect method.

The method has been renamed to singleTargetConnect and now accepts an address parameter, providing more flexibility.

This change allows more targeted reconnection attempts, which should improve reliability when some targets are unavailable.


474-501: Updated Disconnect method with proper context handling.

The Disconnect method now accepts a context parameter and includes enhanced logging.

This change ensures that connection cleanup can be properly managed with timeouts and cancellation, which is important for graceful shutdowns.


628-648: Improved error handling with gRPC status codes.

The error handling now properly checks gRPC status codes and attempts to refresh connections when appropriate.

This enhancement is crucial for maintaining connection pool health when transient errors occur, and properly distinguishes between different types of errors (e.g., canceled operations vs. actual connection failures).


894-902: New Metrics function provides visibility into connection health.

The added Metrics function returns a copy of the connection health metrics.

This is a valuable addition that enables monitoring and diagnostics of the connection pool's health. The use of maps.Clone() ensures that the returned map is a copy, preventing external code from modifying the internal metrics.


52-54:

✅ Verification successful

Interface method signature changes require attention.

The Disconnect and Get method signatures now require a context.Context parameter, which is a breaking change to the interface.

This change improves context propagation but requires updates to all implementations of this interface. Please verify that all code calling these methods has been updated accordingly.


🏁 Script executed:

#!/bin/bash
# Find all places where the Disconnect or Get methods are called
echo "Searching for Disconnect method calls:"
rg -A 1 -B 1 "\bDisconnect\s*\(\s*\)" --type go

echo "Searching for Get method calls without context:"
rg -A 1 -B 1 "\bGet\s*\(\s*\)" --type go

Length of output: 61454


Interface Update Verified: Context propagation confirmed

The updated method signatures in internal/net/grpc/pool/pool.go now require a context.Context parameter for both Disconnect and Get, and our search didn’t reveal any call sites invoking these methods without a context. Although the broad grep pattern for Get returned many unrelated references (like errgroup.Get and info.Get), no evidence was found of missing context arguments on the pool interface calls.

Please ensure that all implementations and test cases for this interface consistently use the new signatures.

apis/swagger/v1/vald/object.swagger.json (2)

17-18: Well-detailed documentation for the /exists/{id} endpoint.

The expanded summary and detailed description provide:

  • Clear explanation of the endpoint's purpose
  • Comprehensive status code table
  • Detailed troubleshooting guidance

This follows the consistent documentation pattern established in other endpoints.


106-107: Excellent documentation improvements for the /object/{id.id} endpoint.

The expanded summary and detailed description provide:

  • Clear explanation of the GetObject RPC functionality
  • Comprehensive status code table with possible response codes
  • Detailed troubleshooting guidance for various error scenarios

This will significantly help developers understand how to properly use the API.

apis/swagger/v1/vald/search.swagger.json (4)

17-18: Documentation quality enhancement for LinearSearch RPC endpoint

The detailed improvements to the /linearsearch endpoint summary provide excellent context with a well-structured overview, status code table, and troubleshooting guidance. This significantly improves API usability.


50-51: Clear documentation with important pre-condition for LinearSearchByID

The enhanced documentation for /linearsearch/id properly highlights the critical prerequisite that vectors must be indexed before searching, which helps prevent user errors and improves the developer experience.


83-84: Important warning about gRPC limitations for MultiLinearSearchByID

The addition of the gRPC message size limitation warning for the /linearsearch/id/multiple endpoint is valuable, as it helps users avoid potential runtime errors when working with large datasets.


149-150: Consistent documentation improvements across search endpoints

The uniform approach to documentation improvements across all search endpoints (regular, ID-based, and multiple variants) creates a cohesive API reference. The status code tables and troubleshooting sections are particularly helpful for error handling.

Also applies to: 182-183, 215-216, 248-249

apis/swagger/v1/vald/filter.swagger.json (2)

17-17: Enhanced API documentation for filter service endpoints

The improvements to the summaries for filter endpoints provide consistent and clear documentation, following the same pattern as the search API. This consistency across the API surface helps users understand the system behavior.

Also applies to: 49-49, 81-81, 113-113, 145-145, 210-210


243-243: Consistent wording in warning messages

It's good to see the correct grammar used in this similar warning message ("exceeds" instead of "exceed"), which confirms the issue in line 178 was likely a simple oversight.

apis/swagger/v1/vald/insert.swagger.json (2)

17-18: Fixed typo and enhanced documentation for Insert RPC

The summary for the /insert endpoint has been correctly labeled as "Insert RPC" (fixing the previous "Inset RPC" typo noted in past reviews) and includes a comprehensive description with status codes and troubleshooting information.


50-51: Clear documentation for MultiInsert with important size limitations

The documentation for the /insert/multiple endpoint properly highlights the gRPC message size limitation, which is crucial information for users working with batch operations.

internal/observability/metrics/grpc/grpc.go (4)

17-20: Added imports and constants for pool connection health monitoring

The new imports and constants establish the foundation for monitoring gRPC pool connection health, which is essential for observability in a distributed system.

Also applies to: 33-34


37-39: Enhanced grpcServerMetrics struct with pool target address tracking

The addition of the poolTargetAddrKey field to the grpcServerMetrics struct enables proper labeling of metrics with target addresses, allowing for more detailed monitoring of connection health by endpoint.

Also applies to: 41-45


69-77: Added metric view for pool connection health monitoring

The new metric view with AggregationSum aggregation type is appropriate for tracking connection counts and integrates well with the existing metrics framework.


81-104: Implemented connection health metrics registration and collection

The updated Register method now properly collects pool connection health metrics by target address. The callback function effectively retrieves metrics from the pool and reports them with the appropriate attributes.

These changes align with the PR objectives of adding health checks for the gRPC connection loop, providing valuable operational visibility.
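
For reference, here is the general shape of such a callback with the upstream OpenTelemetry metric API (Vald wraps this in its own observability packages; the instrument name, attribute key, and the poolMetrics accessor below are assumptions based on the description above):

package metricssketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// poolMetrics stands in for the pool package's Metrics accessor described above:
// it returns the number of healthy connections keyed by target address.
var poolMetrics func() map[string]int64

// registerPoolHealth registers an observable gauge that reports the healthy
// connection count per target address each time metrics are collected.
func registerPoolHealth() error {
	meter := otel.Meter("vald.internal.grpc.pool")
	gauge, err := meter.Int64ObservableGauge("grpc_pool_healthy_connections") // name is an assumption
	if err != nil {
		return err
	}
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		if poolMetrics == nil {
			return nil
		}
		for addr, healthy := range poolMetrics() {
			o.ObserveInt64(gauge, healthy,
				metric.WithAttributes(attribute.String("target_addr", addr)))
		}
		return nil
	}, gauge)
	return err
}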

internal/net/grpc/client.go (16)

93-93: Context-based method signature is a good addition.
Adding a context.Context parameter to ConnectedAddrs aligns with the rest of the client interface, allowing better control over cancellations and timeouts.


168-168: Informative debug logging.
Logging the start of the connection monitor helps diagnose initialization issues.


196-196: Helpful success log.
This log makes it clear how many clients are successfully connected, assisting debugging efforts.


394-394: Clear debug information.
The log straightforwardly indicates disconnection is in progress, which is beneficial for traceability.


417-417: Concise completion log.
Logging the successful start of the connection monitor confirms that monitoring is active without spamming output.


435-435: Correct usage of rangeConns with non-forced mode.
Invoking rangeConns(ctx, false, ...) ensures connections are only processed if healthy or reconnect attempts succeed.


498-498: Consistent usage of rangeConns.
Reusing rangeConns in concurrent flow maintains code consistency and reduces duplication.


585-585: Good defensive check.
Ensuring connection health before proceeding is essential to prevent runtime errors.


654-654: Consistent approach to checking health.
Mirroring the check introduced earlier keeps logic uniform across ordered range operations.


721-721: Extending rangeConns usage into RoundRobin is sound.
This approach helps maintain a single entry point for connection handling logic.


899-899: Verify retry behavior for unhealthy connections.
Returning p.IsHealthy(ctx) as the retry indicator means any non-healthy result triggers retriable logic. Ensure this aligns with how many reconnection attempts you want.


906-906: Confirm backoff logic with health checks.
Same as above: re-check your desired behavior for how many times an unhealthy connection is retried.


1086-1086: Straightforward disconnect call.
Returning (nil, p.Disconnect(ctx)) properly matches the function signature and ensures an error is returned if disconnection fails.


1105-1107: Context parameter for ConnectedAddrs is consistent.
The new signature improves cancellation handling in enumerating connected addresses.


1138-1140: Flexible approach with context & force parameter.
Allowing a forced traversal or a normal check extends the method’s applicability.


1143-1146: Handle nil connections in forced checks.
When force is true, there’s no additional nil check for p. This was previously flagged and still risks panics if p is nil.
