
feat(infra): enhance server infrastructure with Redis connection pool and observability stack #1259


Open
abretonc7s wants to merge 25 commits into main from redis-consolidation

Conversation

@abretonc7s abretonc7s commented Apr 9, 2025

Description

Overview

This PR introduces significant infrastructure improvements to our socket server architecture, focusing on enhanced reliability, scalability, and observability. The key changes include:

  1. Redis Connection Management Overhaul: Replaced singleton Redis client with a properly managed connection pool, improving resource utilization and resilience.
  2. Observability Stack: Integrated Prometheus, Grafana, Loki, and Promtail for comprehensive monitoring and logging.
  3. Architecture Split: Separated analytics functionality into a dedicated server to improve modularity and reduce coupling.
  4. Documentation Improvements: Enhanced setup documentation for both development and scalable deployment scenarios.

Key Implementation Details

Redis Improvements

  • Implemented Redis connection pooling with configurable minimum (default: 15) and maximum (default: 50) connections (see the sketch after this list)
  • Added automatic key migration for backward compatibility with legacy key formats
  • Enhanced connection retry logic with better error handling
  • Implemented health checks with monitoring dashboard integration
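Roughly, the pooling looks like this (a simplified sketch using ioredis and generic-pool; only the min/max defaults mirror the actual configuration, the rest of the wiring is illustrative):

import Redis from 'ioredis';
import { createPool } from 'generic-pool';

// Pool bounds mirror the defaults above (min 15, max 50); everything else is illustrative.
const redisPool = createPool<Redis>(
  {
    create: async () =>
      new Redis({
        host: process.env.REDIS_HOST ?? 'localhost',
        port: Number(process.env.REDIS_PORT ?? 6379),
        // Retry with a capped backoff instead of failing fast on transient errors.
        retryStrategy: (attempt) => Math.min(attempt * 100, 2000),
      }),
    destroy: async (client: Redis) => {
      await client.quit();
    },
  },
  { min: 15, max: 50 },
);

// Borrow a connection, run a command, and always return the connection to the pool.
export async function withRedis<T>(fn: (client: Redis) => Promise<T>): Promise<T> {
  const client = await redisPool.acquire();
  try {
    return await fn(client);
  } finally {
    await redisPool.release(client);
  }
}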

Monitoring Stack

  • Added Prometheus for metrics collection from all services in the local development environment
  • Configured Grafana with multiple dashboards for system metrics, Redis health, and application monitoring
  • Integrated Loki and Promtail for centralized log collection and visualization
  • Added custom metrics for tracking Redis pool usage, key migrations, and handler performance (see the sketch after this list)
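The custom metrics are standard prom-client instruments, along these lines (a simplified sketch; the metric names here are illustrative, not necessarily the exact ones registered by the server):

import { Counter, Gauge, Histogram, collectDefaultMetrics, register } from 'prom-client';

// Default Node.js process metrics (CPU, memory, event loop lag).
collectDefaultMetrics();

// How many pooled Redis connections are currently borrowed.
export const redisPoolInUse = new Gauge({
  name: 'redis_pool_connections_in_use',
  help: 'Number of Redis connections currently borrowed from the pool',
});

// Legacy keys migrated to the new format.
export const redisKeysMigrated = new Counter({
  name: 'redis_legacy_keys_migrated_total',
  help: 'Total number of legacy Redis keys migrated to the new key format',
});

// Per-handler latency, labelled by handler name.
export const handlerDuration = new Histogram({
  name: 'socket_handler_duration_seconds',
  help: 'Duration of socket event handlers in seconds',
  labelNames: ['handler'],
});

// Serialized registry, served from the /metrics endpoint that Prometheus scrapes.
export const renderMetrics = () => register.metrics();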

Architecture Changes

  • Extracted analytics API to a dedicated service (packages/analytics-server)
  • Added middleware for redirecting analytics traffic to the new dedicated service (see the sketch after this list)
  • Refactored socket server to support clustered Redis deployments
  • Implemented graceful shutdown procedures that properly clean up resources
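The redirect middleware is conceptually simple (a simplified sketch assuming an Express-style middleware chain; the paths and the analytics server port follow the description in this PR, the rest is illustrative):

import { Request, Response, NextFunction } from 'express';

const ANALYTICS_SERVER_URL =
  process.env.ANALYTICS_SERVER_URL ?? 'http://localhost:2002';

// Send analytics traffic (/evt and /debug) to the dedicated analytics server.
export function analyticsRedirect(req: Request, res: Response, next: NextFunction) {
  if (req.path === '/evt' || req.path === '/debug') {
    // 307 preserves the HTTP method and body of the original request.
    return res.redirect(307, `${ANALYTICS_SERVER_URL}${req.originalUrl}`);
  }
  return next();
}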

Docker Environment Enhancements

  • Created two deployment modes:
    1. Development Mode: Single instance with auto-reload for development
    2. Scalable Mode: Multiple app instances with load balancing for production-like testing
  • Added Redis cluster configuration with proper support for horizontal scaling

Breaking Changes

  • Analytics endpoints (/evt and /debug) are now redirected to a separate service
  • Upgraded Redis and Socket.IO dependencies with more robust connection handling
  • Changed default analytics server URL to point to the new separate service

Migration Steps

This PR is designed to maintain backward compatibility, so explicit migration steps shouldn't be necessary. The server will automatically:

  1. Detect and migrate legacy Redis keys to the new format when encountered (see the sketch below)
  2. Redirect analytics requests to the new service
  3. Handle both old and new client connection patterns
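For illustration, the migrate-on-access flow in step 1 looks roughly like this (a simplified sketch; the channel:<id> and channel:{<id>} key shapes below are hypothetical, with the {hash-tag} form chosen because it pins a key to a single cluster slot):

import Redis from 'ioredis';
import { redisKeysMigrated } from './metrics'; // counter from the metrics sketch above; path is illustrative

// Hypothetical key shapes: legacy "channel:<id>" vs. cluster-friendly "channel:{<id>}".
const toNewKey = (legacyKey: string): string => {
  const [prefix, id] = legacyKey.split(':');
  return `${prefix}:{${id}}`;
};

// Read the new key first, fall back to the legacy key, and copy the value forward
// so subsequent reads hit the new format directly.
export async function getWithMigration(client: Redis, legacyKey: string): Promise<string | null> {
  const newKey = toNewKey(legacyKey);
  const value = await client.get(newKey);
  if (value !== null) {
    return value;
  }
  const legacyValue = await client.get(legacyKey);
  if (legacyValue !== null) {
    await client.set(newKey, legacyValue);
    await client.del(legacyKey);
    redisKeysMigrated.inc();
  }
  return legacyValue;
}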

Testing Instructions

Development Mode

# Start background services
docker compose up -d cache prometheus grafana loki promtail

# Run the socket server with auto-reload
docker compose up appdev

Scalable Mode Simulation

# Initialize Redis Cluster
docker compose up redis-cluster-init

# Start the full stack
docker compose up -d redis-master1 redis-master2 redis-master3 app1 app2 app3 nginx prometheus grafana loki promtail

Benefits & Impact

  • Improved reliability through connection pooling and automatic recovery
  • Enhanced visibility with comprehensive dashboards for metrics and logs
  • Better scalability with proper Redis cluster support
  • Cleaner architecture with separation of analytics from core socket functionality
  • Improved developer experience with better documentation and development environment

@abretonc7s abretonc7s marked this pull request as ready for review April 9, 2025 15:27
@abretonc7s abretonc7s requested a review from a team as a code owner April 9, 2025 15:27

codecov bot commented Apr 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.73%. Comparing base (205959d) to head (768e716).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1259      +/-   ##
==========================================
- Coverage   74.05%   73.73%   -0.33%     
==========================================
  Files         182      182              
  Lines        4352     4355       +3     
  Branches     1066     1067       +1     
==========================================
- Hits         3223     3211      -12     
- Misses       1129     1144      +15     


Collaborator

@chakra-guy chakra-guy left a comment


please use prom for metrics

Collaborator

@chakra-guy chakra-guy left a comment


A more general note on this PR:

What is the goal of this PR? How do we know that this PR is successful? How can we measure it?

I think we should define these and add metrics for it.

@abretonc7s
Collaborator Author

A more general note on this PR:

What is the goal of this PR? How do we know that this PR is successful? How can we measure it?

I think we should define these and add metrics for it.

Obviously, to fix the existing Redis issue, or at least to give us more insight into the actual root cause... There is a monitoring component that comes with it. We fix the problem first.

@abretonc7s abretonc7s marked this pull request as draft April 10, 2025 03:02
- Introduced a connection pool for Redis to improve resource management and reduce connection churn.
- Added backward compatibility for Redis key formats, migrating legacy keys to new formats during access.
- Enhanced metrics for tracking key migration progress and connection pool health.
- Updated Redis client interactions across the application to utilize the new pooling mechanism.
- Implemented health checks and logging for Redis connections to improve operational visibility.
@abretonc7s abretonc7s force-pushed the redis-consolidation branch from 1cc7b64 to 779c38d Compare April 10, 2025 03:02
@chakra-guy
Collaborator

A more general note on this PR:
What is the goal of this PR? How do we know that this PR is successful? How can we measure it?
I think we should define these and add metrics for it.

Obviously, to fix the existing Redis issue, or at least to give us more insight into the actual root cause... There is a monitoring component that comes with it. We fix the problem first.

Totally. I just want to know: if the goal is to fix that specific Redis problem, how can we measure that it worked?

The metrics that come to mind that we could track are "redis command error" and "redis key slot error".
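Something along these lines, for example (a rough sketch with prom-client; the metric and label names are just illustrative):

import { Counter } from 'prom-client';

// Count Redis failures by type so we can alert on error-rate regressions.
export const redisErrors = new Counter({
  name: 'redis_errors_total',
  help: 'Total Redis errors observed by the socket server',
  labelNames: ['type'],
});

// e.g. redisErrors.inc({ type: 'command_error' }) or redisErrors.inc({ type: 'key_slot_error' })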

- Introduced new packages for analytics server and client.
- Updated package.json and yarn.lock to include dependencies for the new packages.
- Implemented basic server functionality with event tracking and logging.
- Added ESLint and Prettier configurations for code quality.
- Created README files for both packages.
- Added a new document outlining the need to split the `sdk-communication-layer` package to resolve dependency conflicts and optimize bundle size for mobile and dApp implementations.
- Implemented a new middleware for analytics redirection in the socket server, enhancing request handling for analytics events.
- Updated the socket server's entry point to include the new analytics middleware.
- Added a clean command for Redis in the socket server's package.json to facilitate easier development and testing.
@ecp4224
Collaborator

ecp4224 commented Apr 14, 2025

A more general note on this PR:
What is the goal of this PR? How do we know that this PR is successful? How can we measure it?
I think we should define these and add metrics for it.

Obviously, to fix the existing Redis issue, or at least to give us more insight into the actual root cause... There is a monitoring component that comes with it. We fix the problem first.

Totally. I just want to know: if the goal is to fix that specific Redis problem, how can we measure that it worked?

The metrics that come to mind that we could track are "redis command error" and "redis key slot error".

We can track the error rate using these logs. This is what I used to verify my previous Redis fixes (errors went from 10k-30k per 15-minute window down to ~1k).

…tics server URL

- Changed the parameter name from `socketServerUrl` to `analyticsServerUrl` in the `SendAnalytics` function for clarity.
- Added a new constant `DEFAULT_ANALYTICS_SERVER_URL` for the default analytics server URL.
- Updated various components to use `analyticsServerUrl` instead of `communicationServerUrl` for sending analytics events, ensuring consistency across the communication layer.
…figuration

- Updated the `initializeConnector` function to include `analyticsServerUrl` for improved clarity.
- Changed the default analytics server URL to match the primary server URL for consistency.
- Removed unused analytics-related code from the socket server, simplifying the analytics API and enhancing maintainability.
- Adjusted the cleanup process to eliminate unnecessary analytics flushing during server shutdown.
- Introduced a multi-stage Dockerfile for building and running the analytics server.
- Configured the build stage to install dependencies and build the project.
- Set up the runtime stage to install only production dependencies and expose the server on port 2002.
- Updated the default analytics server URL in the configuration to point to localhost for local development.
- Replaced singleton Redis client with a managed connection pool, improving connection efficiency.
- Integrated Socket.IO Redis adapter for better cluster support and resolved compatibility issues.
- Added monitoring metrics for Redis pool usage and introduced a health check API endpoint.
- Enhanced stability with graceful shutdown and automatic connection recovery.
- Refactored code to streamline Redis operations and improve maintainability.
- Added Prometheus service for metrics scraping with a pinned version for stability.
- Configured Grafana service for metrics visualization, including default datasource setup.
- Created a new Prometheus configuration file to define scrape targets for app1, app2, and app3.
- Updated README to include setup instructions for the new monitoring stack.
- Added Redis cluster configuration with multiple master nodes for improved scalability.
- Updated application services to depend on the Redis cluster and configured environment variables for Redis connections.
- Modified Grafana service to expose its UI on a new port and included provisioning for dashboards and data sources.
- Created a default dashboard for Grafana to visualize socket server metrics.
- Revised README to reflect changes in setup instructions and features for both development and scalable environments.
- Added Loki and Promtail services to the Docker Compose setup for centralized log management.
- Configured Promtail to scrape logs from Docker containers based on labels for better observability.
- Created a new Grafana dashboard for visualizing logs and metrics from the socket server and Redis.
- Updated README with instructions for starting the new logging stack alongside existing services.
- Removed the default Grafana dashboard in favor of a more comprehensive logging dashboard.
… and functionality

- Enhanced the README with clearer instructions for setting up the SDK socket server in development and scalable environments.
- Reformatted bullet points for better readability.
- Updated Grafana dashboard configurations to ensure proper integration with Prometheus and Loki, including adjustments to data source references and layout.
- Improved the handling of log dashboards to ensure consistent data visualization.
- Refactored socket server configuration to utilize a more efficient Redis client setup with expiration handling for channel occupancy.
- Revised the setup instructions in the README to specify the inclusion of Loki alongside Redis, Prometheus, and Grafana in the Docker Compose command for starting background services.
@abretonc7s abretonc7s changed the title feat(redis): enhance key migration and health monitoring for improved reliability feat(infra): enhance server infrastructure with Redis connection pool and observability stack Apr 15, 2025
@abretonc7s abretonc7s marked this pull request as ready for review April 15, 2025 12:42
- Deleted the migration status endpoint and its associated logic from the socket server codebase, streamlining the server functionality.
- Initialized Redis health status to false to ensure the first success is logged.
- Enhanced logging to provide clearer messages when transitioning between healthy and unhealthy states (see the sketch below).
- Consolidated the import of ClientType from a new socket-types module for better organization.
- Updated multiple protocol files to reflect the new import path for ClientType.
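A sketch of that transition logging (illustrative, not the exact implementation):

// Start as unhealthy so the very first successful check logs the transition.
let redisHealthy = false;

export function reportRedisHealth(
  isHealthy: boolean,
  logger: { info: (msg: string) => void; error: (msg: string) => void },
) {
  if (isHealthy && !redisHealthy) {
    logger.info('Redis connection is healthy');
  } else if (!isHealthy && redisHealthy) {
    logger.error('Redis connection became unhealthy');
  }
  redisHealthy = isHealthy;
}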
Collaborator

@ecp4224 ecp4224 left a comment


some comments and questions I had while going through the PR

Collaborator

@chakra-guy chakra-guy left a comment


While I believe these are good high-level changes that will make the service better, this PR presents significant challenges for review:

Primary concerns:

  • ~6K lines of changes across multiple disconnected functional areas
  • No tests included, making it difficult to verify this works correctly
  • The PR description doesn't clearly explain the rationale behind key architectural decisions or sometimes overstates things

Recommendation: Split into Focused PRs

Breaking this into multiple, focused PRs would significantly improve our ability to review and verify each change:

1. Local Development Environment PR

  • Docker environment enhancements for single and multi-instance mode
  • Local Prometheus/Grafana/Loki setup
  • Scope: Configuration changes only, no business logic modifications
  • Purpose: Improve local development experience

2. Analytics Service Extraction PR

  • Extract analytics API to a dedicated service (packages/analytics-server)
  • Add middleware for redirecting analytics traffic
  • Scope: Clean separation of the analytics functionality
  • Purpose: Improve modularity and separation of concerns

3. Redis Integration Removal from Analytics Service PR

  • Remove Redis dependencies from the extracted analytics service
  • Scope: Changes limited to the analytics service implementation
  • Purpose: Further decoupling of components

4. Redis Connection Pooling PR

  • Replace singleton Redis client with properly managed connection pool
  • Add connection retry logic and error handling
  • Scope: Redis connection management only
  • Purpose: Improve resilience and resource utilization

5. Redis Key Migration PR

  • Implement automatic key migration for backward compatibility
  • Add proper Prometheus metrics for monitoring the migration
  • Scope: Key format changes and migration logic only
  • Purpose: Support legacy key formats while moving to new architecture

Specific Implementation Concerns

  1. Monitoring Claims: The PR description suggests implementation of Prometheus, Grafana, and Loki as new additions, but these monitoring tools were already set up in production/dev environments months ago. The actual contribution appears limited to configuring a local development setup.

  2. Redis Key Migration Metrics: The implementation uses a global variable counter exposed via an API endpoint rather than Prometheus metrics, which is inconsistent with our existing observability stack. As already noted in earlier comments, we should be using Prometheus for all metrics.

  3. Redis Key Migration: Could you please also explain in the PR what the problem with the current key format is, and why the new format would fix the issue?

Thanks!

@abretonc7s
Collaborator Author

abretonc7s commented Apr 16, 2025

While I believe these are good high-level changes that will make the service better, this PR presents significant challenges for review:

Primary concerns:

  • ~6K lines of changes across multiple disconnected functional areas
  • No tests included, making it difficult to verify this works correctly
  • The PR description doesn't clearly explain the rationale behind key architectural decisions or sometimes overstates things

Recommendation: Split into Focused PRs

Breaking this into multiple, focused PRs would significantly improve our ability to review and verify each change:

1. Local Development Environment PR

  • Docker environment enhancements for single and multi-instance mode
  • Local Prometheus/Grafana/Loki setup
  • Scope: Configuration changes only, no business logic modifications
  • Purpose: Improve local development experience

2. Analytics Service Extraction PR

  • Extract analytics API to a dedicated service (packages/analytics-server)
  • Add middleware for redirecting analytics traffic
  • Scope: Clean separation of the analytics functionality
  • Purpose: Improve modularity and separation of concerns

3. Redis Integration Removal from Analytics Service PR

  • Remove Redis dependencies from the extracted analytics service
  • Scope: Changes limited to the analytics service implementation
  • Purpose: Further decoupling of components

4. Redis Connection Pooling PR

  • Replace singleton Redis client with properly managed connection pool
  • Add connection retry logic and error handling
  • Scope: Redis connection management only
  • Purpose: Improve resilience and resource utilization

5. Redis Key Migration PR

  • Implement automatic key migration for backward compatibility
  • Add proper Prometheus metrics for monitoring the migration
  • Scope: Key format changes and migration logic only
  • Purpose: Support legacy key formats while moving to new architecture

Specific Implementation Concerns

  1. Monitoring Claims: The PR description suggests implementation of Prometheus, Grafana, and Loki as new additions, but these monitoring tools were already set up in production/dev environments months ago. The actual contribution appears limited to configuring a local development setup.
  2. Redis Key Migration Metrics: The implementation uses a global variable counter exposed via an API endpoint rather than Prometheus metrics, which is inconsistent with our existing observability stack. As already noted in earlier comments, we should be using Prometheus for all metrics.
  3. Redis Key Migration: Could you please also explain in the PR what the problem with the current key format is, and why the new format would fix the issue?

Thanks!

The Redis key migration is to force each key to stay on the same host; it is a Redis Cluster-specific key format. These changes were actually started by Eddie in a previous PR, but without the migration and without covering all the keys.
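For context: the new format uses a hash tag, the part of the key between { and }. Redis Cluster hashes only that part, so keys sharing the same tag always map to the same slot (and therefore the same node), which is what multi-key operations need. A quick illustration (key names are hypothetical):

import Redis from 'ioredis';

async function example(client: Redis) {
  // Without a hash tag these keys can land on different nodes, so MGET may fail with CROSSSLOT.
  await client.mget('channel:abc:config', 'channel:abc:clients');

  // With the "{abc}" hash tag both keys hash to the same slot, so multi-key commands work.
  await client.mget('channel:{abc}:config', 'channel:{abc}:clients');
}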

ecp4224 previously approved these changes Apr 16, 2025

Quality Gate failed

Failed conditions
3 Security Hotspots

See analysis details on SonarQube Cloud

@michaelmccallam michaelmccallam requested a review from Copilot April 25, 2025 07:15

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enhances the server infrastructure by introducing a Redis connection pool and an observability stack while refactoring analytics event routing.

  • Replaces communication server URL with a dedicated analytics server URL for analytics events.
  • Disables the handleClientsConnectedEvent test pending analytics externalization.
  • Adds and configures a new analytics server with comprehensive logging, monitoring, and security improvements.

Reviewed Changes

Copilot reviewed 52 out of 61 changed files in this pull request and generated no comments.

Summary per file:
  • packages/sdk-communication-layer/src/services/RemoteCommunication/EventListeners/handleClientsConnectedEvent.test.ts: Temporarily disables analytics test for externalized analytics server
  • packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/rejectChannel.ts: Updates analytics event reporting URL to analyticsServerUrl
  • packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/disconnect.ts: Consistently updates disconnect analytics endpoint to analyticsServerUrl
  • packages/sdk-communication-layer/src/config.ts: Adds a default analytics server URL configuration
  • packages/sdk-communication-layer/src/RemoteCommunication.ts: Introduces analyticsServerUrl to the state and constructor parameters
  • packages/sdk-communication-layer/src/Analytics.ts: Updates analytics endpoint parameter and targetUrl assignment
  • packages/analytics-server/*: Introduces a new analytics server with proper logging and monitoring configuration
  • packages/analytics-server/README.md and .eslintrc.js: Provides documentation and linting configuration for the analytics server
Files not reviewed (9)
  • package.json: Language not supported
  • packages/analytics-client/package.json: Language not supported
  • packages/analytics-server/.eslintignore: Language not supported
  • packages/analytics-server/.gitignore: Language not supported
  • packages/analytics-server/.prettierignore: Language not supported
  • packages/analytics-server/.prettierrc: Language not supported
  • packages/analytics-server/Dockerfile: Language not supported
  • packages/analytics-server/package.json: Language not supported
  • packages/analytics-server/tsconfig.json: Language not supported
Comments suppressed due to low confidence (2)

packages/sdk-communication-layer/src/services/RemoteCommunication/EventListeners/handleClientsConnectedEvent.test.ts:13

  • The test for handleClientsConnectedEvent is disabled; please consider adding a TODO or note indicating when and under what conditions the test should be re-enabled.
// Disabled while checking externalizing analytics server.

packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/rejectChannel.ts:59

  • Ensure that replacing state.communicationServerUrl with state.analyticsServerUrl in analytics event dispatch is intentional and consistently supported across all modules handling analytics events.
state.analyticsServerUrl,
