
feat(infra): enhance server infrastructure with Redis connection pool and observability stack #1259


Open
abretonc7s wants to merge 25 commits into main from redis-consolidation

Conversation

@abretonc7s abretonc7s commented Apr 9, 2025

Description

Overview

This PR introduces significant infrastructure improvements to our socket server architecture, focusing on enhanced reliability, scalability, and observability. The key changes include:

  1. Redis Connection Management Overhaul: Replaced singleton Redis client with a properly managed connection pool, improving resource utilization and resilience.
  2. Observability Stack: Integrated Prometheus, Grafana, Loki, and Promtail for comprehensive monitoring and logging.
  3. Architecture Split: Separated analytics functionality into a dedicated server to improve modularity and reduce coupling.
  4. Documentation Improvements: Enhanced setup documentation for both development and scalable deployment scenarios.

Key Implementation Details

Redis Improvements

  • Implemented Redis connection pooling with configurable minimum (default: 15) and maximum (default: 50) connections (see the sketch after this list)
  • Added automatic key migration for backward compatibility with legacy key formats
  • Enhanced connection retry logic with better error handling
  • Implemented health checks with monitoring dashboard integration
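Roughly, the pooling looks like this (a simplified sketch using ioredis and generic-pool; only the min/max defaults mirror the actual configuration, the rest of the wiring is illustrative):

import Redis from 'ioredis';
import { createPool } from 'generic-pool';

// Pool bounds mirror the defaults above (min 15, max 50); everything else is illustrative.
const redisPool = createPool<Redis>(
  {
    create: async () =>
      new Redis({
        host: process.env.REDIS_HOST ?? 'localhost',
        port: Number(process.env.REDIS_PORT ?? 6379),
        // Retry with a capped backoff instead of failing fast on transient errors.
        retryStrategy: (attempt) => Math.min(attempt * 100, 2000),
      }),
    destroy: async (client: Redis) => {
      await client.quit();
    },
  },
  { min: 15, max: 50 },
);

// Borrow a connection, run a command, and always return the connection to the pool.
export async function withRedis<T>(fn: (client: Redis) => Promise<T>): Promise<T> {
  const client = await redisPool.acquire();
  try {
    return await fn(client);
  } finally {
    await redisPool.release(client);
  }
}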

Monitoring Stack

  • Added Prometheus for metrics collection from all services in the local development environment
  • Configured Grafana with multiple dashboards for system metrics, Redis health, and application monitoring
  • Integrated Loki and Promtail for centralized log collection and visualization
  • Added custom metrics for tracking Redis pool usage, key migrations, and handler performance (see the sketch after this list)
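The custom metrics are standard prom-client instruments, along these lines (a simplified sketch; the metric names here are illustrative, not necessarily the exact ones registered by the server):

import { Counter, Gauge, Histogram, collectDefaultMetrics, register } from 'prom-client';

// Default Node.js process metrics (CPU, memory, event loop lag).
collectDefaultMetrics();

// How many pooled Redis connections are currently borrowed.
export const redisPoolInUse = new Gauge({
  name: 'redis_pool_connections_in_use',
  help: 'Number of Redis connections currently borrowed from the pool',
});

// Legacy keys migrated to the new format.
export const redisKeysMigrated = new Counter({
  name: 'redis_legacy_keys_migrated_total',
  help: 'Total number of legacy Redis keys migrated to the new key format',
});

// Per-handler latency, labelled by handler name.
export const handlerDuration = new Histogram({
  name: 'socket_handler_duration_seconds',
  help: 'Duration of socket event handlers in seconds',
  labelNames: ['handler'],
});

// Serialized registry, served from the /metrics endpoint that Prometheus scrapes.
export const renderMetrics = () => register.metrics();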

Architecture Changes

  • Extracted analytics API to a dedicated service (packages/analytics-server)
  • Added middleware for redirecting analytics traffic to the new dedicated service (see the sketch after this list)
  • Refactored socket server to support clustered Redis deployments
  • Implemented graceful shutdown procedures that properly clean up resources
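The redirect middleware is conceptually simple (a simplified sketch assuming an Express-style middleware chain; the paths and the analytics server port follow the description in this PR, the rest is illustrative):

import { Request, Response, NextFunction } from 'express';

const ANALYTICS_SERVER_URL =
  process.env.ANALYTICS_SERVER_URL ?? 'http://localhost:2002';

// Send analytics traffic (/evt and /debug) to the dedicated analytics server.
export function analyticsRedirect(req: Request, res: Response, next: NextFunction) {
  if (req.path === '/evt' || req.path === '/debug') {
    // 307 preserves the HTTP method and body of the original request.
    return res.redirect(307, `${ANALYTICS_SERVER_URL}${req.originalUrl}`);
  }
  return next();
}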

Docker Environment Enhancements

  • Created two deployment modes:
    1. Development Mode: Single instance with auto-reload for development
    2. Scalable Mode: Multiple app instances with load balancing for production-like testing
  • Added Redis cluster configuration with proper support for horizontal scaling

Breaking Changes

  • Analytics endpoints (/evt and /debug) are now redirected to a separate service
  • Upgraded Redis and Socket.IO dependencies with more robust connection handling
  • Changed default analytics server URL to point to the new separate service

Migration Steps

This PR is designed to maintain backward compatibility, so explicit migration steps shouldn't be necessary. The server will automatically:

  1. Detect and migrate legacy Redis keys to the new format when encountered (see the sketch below)
  2. Redirect analytics requests to the new service
  3. Handle both old and new client connection patterns
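For illustration, the migrate-on-access flow in step 1 looks roughly like this (a simplified sketch; the channel:<id> and channel:{<id>} key shapes below are hypothetical, with the {hash-tag} form chosen because it pins a key to a single cluster slot):

import Redis from 'ioredis';
import { redisKeysMigrated } from './metrics'; // counter from the metrics sketch above; path is illustrative

// Hypothetical key shapes: legacy "channel:<id>" vs. cluster-friendly "channel:{<id>}".
const toNewKey = (legacyKey: string): string => {
  const [prefix, id] = legacyKey.split(':');
  return `${prefix}:{${id}}`;
};

// Read the new key first, fall back to the legacy key, and copy the value forward
// so subsequent reads hit the new format directly.
export async function getWithMigration(client: Redis, legacyKey: string): Promise<string | null> {
  const newKey = toNewKey(legacyKey);
  const value = await client.get(newKey);
  if (value !== null) {
    return value;
  }
  const legacyValue = await client.get(legacyKey);
  if (legacyValue !== null) {
    await client.set(newKey, legacyValue);
    await client.del(legacyKey);
    redisKeysMigrated.inc();
  }
  return legacyValue;
}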

Testing Instructions

Development Mode

# Start background services
docker compose up -d cache prometheus grafana loki promtail

# Run the socket server with auto-reload
docker compose up appdev

Scalable Mode Simulation

# Initialize Redis Cluster
docker compose up redis-cluster-init

# Start the full stack
docker compose up -d redis-master1 redis-master2 redis-master3 app1 app2 app3 nginx prometheus grafana loki promtail

Benefits & Impact

  • Improved reliability through connection pooling and automatic recovery
  • Enhanced visibility with comprehensive dashboards for metrics and logs
  • Better scalability with proper Redis cluster support
  • Cleaner architecture with separation of analytics from core socket functionality
  • Improved developer experience with better documentation and development environment

@abretonc7s abretonc7s marked this pull request as ready for review April 9, 2025 15:27
@abretonc7s abretonc7s requested a review from a team as a code owner April 9, 2025 15:27

codecov bot commented Apr 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.73%. Comparing base (205959d) to head (768e716).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1259      +/-   ##
==========================================
- Coverage   74.05%   73.73%   -0.33%     
==========================================
  Files         182      182              
  Lines        4352     4355       +3     
  Branches     1066     1067       +1     
==========================================
- Hits         3223     3211      -12     
- Misses       1129     1144      +15     


Collaborator

@chakra-guy chakra-guy left a comment


please use prom for metrics

Collaborator

@chakra-guy chakra-guy left a comment


A more general note on this PR:

What is the goal of this PR? How do we know that this PR is successful? How can we measure it?

I think we should define these and add metrics for it.

@abretonc7s
Collaborator Author

A more general note on this PR:

What is the goal of this PR? How do we know that this PR is successful? How can we measure it?

I think we should define these and add metrics for it.

Obviously, to fix the existing Redis issue, or at least to give us more insight into the actual root cause... There is a monitoring component that comes with it. We fix the problem first.

@abretonc7s abretonc7s marked this pull request as draft April 10, 2025 03:02
- Introduced a connection pool for Redis to improve resource management and reduce connection churn.
- Added backward compatibility for Redis key formats, migrating legacy keys to new formats during access.
- Enhanced metrics for tracking key migration progress and connection pool health.
- Updated Redis client interactions across the application to utilize the new pooling mechanism.
- Implemented health checks and logging for Redis connections to improve operational visibility.
@abretonc7s abretonc7s force-pushed the redis-consolidation branch from 1cc7b64 to 779c38d Compare April 10, 2025 03:02
@chakra-guy
Collaborator

A more general note on this PR:
What is the goal of this PR? How do we know that this PR is successful? How can we measure it?
I think we should define these and add metrics for it.

Obviously, to fix the existing Redis issue, or at least to give us more insight into the actual root cause... There is a monitoring component that comes with it. We fix the problem first.

Totally. I just want to know: if the goal is to fix that specific Redis problem, how can we measure that it worked?

The metrics that come to mind that we could track are "redis command error" and "redis key slot error".
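Something along these lines, for example (a rough sketch with prom-client; the metric and label names are just illustrative):

import { Counter } from 'prom-client';

// Count Redis failures by type so we can alert on error-rate regressions.
export const redisErrors = new Counter({
  name: 'redis_errors_total',
  help: 'Total Redis errors observed by the socket server',
  labelNames: ['type'],
});

// e.g. redisErrors.inc({ type: 'command_error' }) or redisErrors.inc({ type: 'key_slot_error' })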

- Introduced new packages for analytics server and client.
- Updated package.json and yarn.lock to include dependencies for the new packages.
- Implemented basic server functionality with event tracking and logging.
- Added ESLint and Prettier configurations for code quality.
- Created README files for both packages.
- Added a new document outlining the need to split the `sdk-communication-layer` package to resolve dependency conflicts and optimize bundle size for mobile and dApp implementations.
- Implemented a new middleware for analytics redirection in the socket server, enhancing request handling for analytics events.
- Updated the socket server's entry point to include the new analytics middleware.
- Added a clean command for Redis in the socket server's package.json to facilitate easier development and testing.
@ecp4224
Collaborator

ecp4224 commented Apr 14, 2025

A more general note on this PR:
What is the goal of this PR? How do we know that this PR is successful? How can we measure it?
I think we should define these and add metrics for it.

Obviously, to fix the existing Redis issue, or at least to give us more insight into the actual root cause... There is a monitoring component that comes with it. We fix the problem first.

Totally. I just want to know: if the goal is to fix that specific Redis problem, how can we measure that it worked?

The metrics that come to mind that we could track are "redis command error" and "redis key slot error".

We can track the error rate using these logs. This is what I used to verify my previous Redis fixes (errors went from 10k-30k per 15-minute window down to ~1k).

…tics server URL

- Changed the parameter name from `socketServerUrl` to `analyticsServerUrl` in the `SendAnalytics` function for clarity.
- Added a new constant `DEFAULT_ANALYTICS_SERVER_URL` for the default analytics server URL.
- Updated various components to use `analyticsServerUrl` instead of `communicationServerUrl` for sending analytics events, ensuring consistency across the communication layer.
…figuration

- Updated the `initializeConnector` function to include `analyticsServerUrl` for improved clarity.
- Changed the default analytics server URL to match the primary server URL for consistency.
- Removed unused analytics-related code from the socket server, simplifying the analytics API and enhancing maintainability.
- Adjusted the cleanup process to eliminate unnecessary analytics flushing during server shutdown.
- Introduced a multi-stage Dockerfile for building and running the analytics server.
- Configured the build stage to install dependencies and build the project.
- Set up the runtime stage to install only production dependencies and expose the server on port 2002.
- Updated the default analytics server URL in the configuration to point to localhost for local development.
- Replaced singleton Redis client with a managed connection pool, improving connection efficiency.
- Integrated Socket.IO Redis adapter for better cluster support and resolved compatibility issues.
- Added monitoring metrics for Redis pool usage and introduced a health check API endpoint.
- Enhanced stability with graceful shutdown and automatic connection recovery.
- Refactored code to streamline Redis operations and improve maintainability.
- Added Prometheus service for metrics scraping with a pinned version for stability.
- Configured Grafana service for metrics visualization, including default datasource setup.
- Created a new Prometheus configuration file to define scrape targets for app1, app2, and app3.
- Updated README to include setup instructions for the new monitoring stack.
- Added Redis cluster configuration with multiple master nodes for improved scalability.
- Updated application services to depend on the Redis cluster and configured environment variables for Redis connections.
- Modified Grafana service to expose its UI on a new port and included provisioning for dashboards and data sources.
- Created a default dashboard for Grafana to visualize socket server metrics.
- Revised README to reflect changes in setup instructions and features for both development and scalable environments.
- Added Loki and Promtail services to the Docker Compose setup for centralized log management.
- Configured Promtail to scrape logs from Docker containers based on labels for better observability.
- Created a new Grafana dashboard for visualizing logs and metrics from the socket server and Redis.
- Updated README with instructions for starting the new logging stack alongside existing services.
- Removed the default Grafana dashboard in favor of a more comprehensive logging dashboard.
… and functionality

- Enhanced the README with clearer instructions for setting up the SDK socket server in development and scalable environments.
- Reformatted bullet points for better readability.
- Updated Grafana dashboard configurations to ensure proper integration with Prometheus and Loki, including adjustments to data source references and layout.
- Improved the handling of log dashboards to ensure consistent data visualization.
- Refactored socket server configuration to utilize a more efficient Redis client setup with expiration handling for channel occupancy.
- Revised the setup instructions in the README to specify the inclusion of Loki alongside Redis, Prometheus, and Grafana in the Docker Compose command for starting background services.
@abretonc7s abretonc7s changed the title feat(redis): enhance key migration and health monitoring for improved reliability feat(infra): enhance server infrastructure with Redis connection pool and observability stack Apr 15, 2025
@abretonc7s abretonc7s marked this pull request as ready for review April 15, 2025 12:42
- Deleted the migration status endpoint and its associated logic from the socket server codebase, streamlining the server functionality.
- Initialized Redis health status to false to ensure the first success is logged.
- Enhanced logging to provide clearer messages when transitioning between healthy and unhealthy states (see the sketch below).
- Consolidated the import of ClientType from a new socket-types module for better organization.
- Updated multiple protocol files to reflect the new import path for ClientType.
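A sketch of that transition logging (illustrative, not the exact implementation):

// Start as unhealthy so the very first successful check logs the transition.
let redisHealthy = false;

export function reportRedisHealth(
  isHealthy: boolean,
  logger: { info: (msg: string) => void; error: (msg: string) => void },
) {
  if (isHealthy && !redisHealthy) {
    logger.info('Redis connection is healthy');
  } else if (!isHealthy && redisHealthy) {
    logger.error('Redis connection became unhealthy');
  }
  redisHealthy = isHealthy;
}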
Collaborator

@ecp4224 ecp4224 left a comment


some comments and questions I had while going through the PR

Collaborator

@chakra-guy chakra-guy left a comment


While I believe these are good high-level changes that will make the service better, this PR presents significant challenges for review:

Primary concerns:

  • ~6K lines of changes across multiple disconnected functional areas
  • No tests included, making it difficult to verify this works correctly
  • The PR description doesn't clearly explain the rationale behind key architectural decisions or sometimes overstates things

Recommendation: Split into Focused PRs

Breaking this into multiple, focused PRs would significantly improve our ability to review and verify each change:

1. Local Development Environment PR

  • Docker environment enhancements for single and multi-instance mode
  • Local Prometheus/Grafana/Loki setup
  • Scope: Configuration changes only, no business logic modifications
  • Purpose: Improve local development experience

2. Analytics Service Extraction PR

  • Extract analytics API to a dedicated service (packages/analytics-server)
  • Add middleware for redirecting analytics traffic
  • Scope: Clean separation of the analytics functionality
  • Purpose: Improve modularity and separation of concerns

3. Redis Integration Removal from Analytics Service PR

  • Remove Redis dependencies from the extracted analytics service
  • Scope: Changes limited to the analytics service implementation
  • Purpose: Further decoupling of components

4. Redis Connection Pooling PR

  • Replace singleton Redis client with properly managed connection pool
  • Add connection retry logic and error handling
  • Scope: Redis connection management only
  • Purpose: Improve resilience and resource utilization

5. Redis Key Migration PR

  • Implement automatic key migration for backward compatibility
  • Add proper Prometheus metrics for monitoring the migration
  • Scope: Key format changes and migration logic only
  • Purpose: Support legacy key formats while moving to new architecture

Specific Implementation Concerns

  1. Monitoring Claims: The PR description suggests implementation of Prometheus, Grafana, and Loki as new additions, but these monitoring tools were already set up in production/dev environments months ago. The actual contribution appears limited to configuring a local development setup.

  2. Redis Key Migration Metrics: The implementation uses a global variable counter exposed via an API endpoint rather than Prometheus metrics, which is inconsistent with our existing observability stack. As already noted in earlier comments, we should be using Prometheus for all metrics.

  3. Redis Key Migration: Could you please also explain in the PR what the problem with the current key format is, and why the new format would fix the issue?

Thanks!

@abretonc7s
Collaborator Author

abretonc7s commented Apr 16, 2025

While I believe these are good high-level changes that will make the service better, this PR presents significant challenges for review:

Primary concerns:

  • ~6K lines of changes across multiple disconnected functional areas
  • No tests included, making it difficult to verify this works correctly
  • The PR description doesn't clearly explain the rationale behind key architectural decisions or sometimes overstates things

Recommendation: Split into Focused PRs

Breaking this into multiple, focused PRs would significantly improve our ability to review and verify each change:

1. Local Development Environment PR

  • Docker environment enhancements for single and multi-instance mode
  • Local Prometheus/Grafana/Loki setup
  • Scope: Configuration changes only, no business logic modifications
  • Purpose: Improve local development experience

2. Analytics Service Extraction PR

  • Extract analytics API to a dedicated service (packages/analytics-server)
  • Add middleware for redirecting analytics traffic
  • Scope: Clean separation of the analytics functionality
  • Purpose: Improve modularity and separation of concerns

3. Redis Integration Removal from Analytics Service PR

  • Remove Redis dependencies from the extracted analytics service
  • Scope: Changes limited to the analytics service implementation
  • Purpose: Further decoupling of components

4. Redis Connection Pooling PR

  • Replace singleton Redis client with properly managed connection pool
  • Add connection retry logic and error handling
  • Scope: Redis connection management only
  • Purpose: Improve resilience and resource utilization

5. Redis Key Migration PR

  • Implement automatic key migration for backward compatibility
  • Add proper Prometheus metrics for monitoring the migration
  • Scope: Key format changes and migration logic only
  • Purpose: Support legacy key formats while moving to new architecture

Specific Implementation Concerns

  1. Monitoring Claims: The PR description suggests implementation of Prometheus, Grafana, and Loki as new additions, but these monitoring tools were already set up in production/dev environments months ago. The actual contribution appears limited to configuring a local development setup.
  2. Redis Key Migration Metrics: The implementation uses a global variable counter exposed via an API endpoint rather than Prometheus metrics, which is inconsistent with our existing observability stack. As already noted in earlier comments, we should be using Prometheus for all metrics.
  3. Redis Key Migration: Could you please also explain in the PR what the problem with the current key format is, and why the new format would fix the issue?

Thanks!

The Redis key migration is to force each key to stay on the same host; it is a Redis Cluster-specific key format. These changes were actually started by Eddie in a previous PR, but without the migration and without covering all the keys.
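For context: the new format uses a hash tag, the part of the key between { and }. Redis Cluster hashes only that part, so keys sharing the same tag always map to the same slot (and therefore the same node), which is what multi-key operations need. A quick illustration (key names are hypothetical):

import Redis from 'ioredis';

async function example(client: Redis) {
  // Without a hash tag these keys can land on different nodes, so MGET may fail with CROSSSLOT.
  await client.mget('channel:abc:config', 'channel:abc:clients');

  // With the "{abc}" hash tag both keys hash to the same slot, so multi-key commands work.
  await client.mget('channel:{abc}:config', 'channel:{abc}:clients');
}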

ecp4224 previously approved these changes Apr 16, 2025

Quality Gate failed

Failed conditions
3 Security Hotspots

See analysis details on SonarQube Cloud

@michaelmccallam michaelmccallam requested a review from Copilot April 25, 2025 07:15

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enhances the server infrastructure by introducing a Redis connection pool and an observability stack while refactoring analytics event routing.

  • Replaces communication server URL with a dedicated analytics server URL for analytics events.
  • Disables the handleClientsConnectedEvent test pending analytics externalization.
  • Adds and configures a new analytics server with comprehensive logging, monitoring, and security improvements.

Reviewed Changes

Copilot reviewed 52 out of 61 changed files in this pull request and generated no comments.

Summary per file:
  • packages/sdk-communication-layer/src/services/RemoteCommunication/EventListeners/handleClientsConnectedEvent.test.ts: Temporarily disables analytics test for externalized analytics server
  • packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/rejectChannel.ts: Updates analytics event reporting URL to analyticsServerUrl
  • packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/disconnect.ts: Consistently updates disconnect analytics endpoint to analyticsServerUrl
  • packages/sdk-communication-layer/src/config.ts: Adds a default analytics server URL configuration
  • packages/sdk-communication-layer/src/RemoteCommunication.ts: Introduces analyticsServerUrl to the state and constructor parameters
  • packages/sdk-communication-layer/src/Analytics.ts: Updates analytics endpoint parameter and targetUrl assignment
  • packages/analytics-server/*: Introduces a new analytics server with proper logging and monitoring configuration
  • packages/analytics-server/README.md and .eslintrc.js: Provides documentation and linting configuration for the analytics server
Files not reviewed (9)
  • package.json: Language not supported
  • packages/analytics-client/package.json: Language not supported
  • packages/analytics-server/.eslintignore: Language not supported
  • packages/analytics-server/.gitignore: Language not supported
  • packages/analytics-server/.prettierignore: Language not supported
  • packages/analytics-server/.prettierrc: Language not supported
  • packages/analytics-server/Dockerfile: Language not supported
  • packages/analytics-server/package.json: Language not supported
  • packages/analytics-server/tsconfig.json: Language not supported
Comments suppressed due to low confidence (2)

packages/sdk-communication-layer/src/services/RemoteCommunication/EventListeners/handleClientsConnectedEvent.test.ts:13

  • The test for handleClientsConnectedEvent is disabled; please consider adding a TODO or note indicating when and under what conditions the test should be re-enabled.
// Disabled while checking externalizing analytics server.

packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/rejectChannel.ts:59

  • Ensure that replacing state.communicationServerUrl with state.analyticsServerUrl in analytics event dispatch is intentional and consistently supported across all modules handling analytics events.
state.analyticsServerUrl,
