feat(infra): enhance server infrastructure with Redis connection pool and observability stack #1259
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##             main    #1259      +/- ##
==========================================
- Coverage   74.05%   73.73%   -0.33%
==========================================
  Files         182      182
  Lines        4352     4355       +3
  Branches     1066     1067       +1
==========================================
- Hits         3223     3211      -12
- Misses       1129     1144      +15
```

View full report in Codecov by Sentry.
Please use Prometheus for metrics.
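For context, the suggestion is to expose counters through the existing Prometheus scrape endpoint (in a Node service, typically via the `prom-client` package) rather than a bespoke API endpoint. A minimal, dependency-free sketch of the idea follows; the metric name `redis_keys_migrated_total` is hypothetical, and a real implementation would use `prom-client` rather than this hand-rolled class:

```typescript
// Minimal sketch of a Prometheus-style counter and its text exposition format.
// This stand-alone version only illustrates the format Prometheus scrapes;
// the real server would register the counter with prom-client instead.
class Counter {
  private value = 0;
  constructor(readonly name: string, readonly help: string) {}
  inc(by = 1): void {
    this.value += by;
  }
  // Render the counter in the Prometheus text exposition format.
  expose(): string {
    return [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
      `${this.name} ${this.value}`,
    ].join("\n");
  }
}

// Hypothetical metric for the key migration discussed in this PR.
const migratedKeys = new Counter(
  "redis_keys_migrated_total",
  "Legacy Redis keys migrated to the new format",
);
migratedKeys.inc();
migratedKeys.inc(2);
console.log(migratedKeys.expose());
```

The scrape endpoint (usually `/metrics`) would simply return the concatenated exposition of all registered metrics, which Prometheus then collects on its scrape interval.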
A more general note on this PR:
What is the goal of this PR? How do we know that this PR is successful? How can we measure it?
I think we should define these and add metrics for it.
Obviously, fix the existing Redis issue, or at least give us more insight into the actual root cause... There is a part of monitoring that comes with it, but we first fix the problem.
- Introduced a connection pool for Redis to improve resource management and reduce connection churn.
- Added backward compatibility for Redis key formats, migrating legacy keys to new formats during access.
- Enhanced metrics for tracking key migration progress and connection pool health.
- Updated Redis client interactions across the application to utilize the new pooling mechanism.
- Implemented health checks and logging for Redis connections to improve operational visibility.
Force-pushed from 1cc7b64 to 779c38d
Totally, I just want to know: if the goal is to fix that specific Redis problem, how can we measure that it worked? The metrics that come to mind that we could track are "redis command error" or "redis key slot error".
- Introduced new packages for analytics server and client.
- Updated package.json and yarn.lock to include dependencies for the new packages.
- Implemented basic server functionality with event tracking and logging.
- Added ESLint and Prettier configurations for code quality.
- Created README files for both packages.
- Added a new document outlining the need to split the `sdk-communication-layer` package to resolve dependency conflicts and optimize bundle size for mobile and dApp implementations.
- Implemented a new middleware for analytics redirection in the socket server, enhancing request handling for analytics events.
- Updated the socket server's entry point to include the new analytics middleware.
- Added a clean command for Redis in the socket server's package.json to facilitate easier development and testing.
We can track the error rate using these logs. This is what I used to verify my previous Redis fixes (went from 10k–30k in the past 15 minutes to ~1k in the past 15 minutes).
…tics server URL
- Changed the parameter name from `socketServerUrl` to `analyticsServerUrl` in the `SendAnalytics` function for clarity.
- Added a new constant `DEFAULT_ANALYTICS_SERVER_URL` for the default analytics server URL.
- Updated various components to use `analyticsServerUrl` instead of `communicationServerUrl` for sending analytics events, ensuring consistency across the communication layer.
…figuration
- Updated the `initializeConnector` function to include `analyticsServerUrl` for improved clarity.
- Changed the default analytics server URL to match the primary server URL for consistency.
- Removed unused analytics-related code from the socket server, simplifying the analytics API and enhancing maintainability.
- Adjusted the cleanup process to eliminate unnecessary analytics flushing during server shutdown.
- Introduced a multi-stage Dockerfile for building and running the analytics server.
- Configured the build stage to install dependencies and build the project.
- Set up the runtime stage to install only production dependencies and expose the server on port 2002.
- Updated the default analytics server URL in the configuration to point to localhost for local development.
- Replaced singleton Redis client with a managed connection pool, improving connection efficiency.
- Integrated Socket.IO Redis adapter for better cluster support and resolved compatibility issues.
- Added monitoring metrics for Redis pool usage and introduced a health check API endpoint.
- Enhanced stability with graceful shutdown and automatic connection recovery.
- Refactored code to streamline Redis operations and improve maintainability.
- Added Prometheus service for metrics scraping with a pinned version for stability.
- Configured Grafana service for metrics visualization, including default datasource setup.
- Created a new Prometheus configuration file to define scrape targets for app1, app2, and app3.
- Updated README to include setup instructions for the new monitoring stack.
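Based on the description above, the Prometheus configuration presumably defines one scrape job covering the three app instances. A rough sketch of what such a `prometheus.yml` could look like (the job name, metrics port, and path are assumptions, not taken from this PR):

```yaml
# prometheus.yml - sketch of the scrape config described above.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: socket-server          # assumed job name
    metrics_path: /metrics           # assumed metrics endpoint
    static_configs:
      # Targets assumed to match the docker-compose service names app1..app3;
      # the actual port is whatever the socket server exposes metrics on.
      - targets:
          - app1:4000
          - app2:4000
          - app3:4000
```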
- Added Redis cluster configuration with multiple master nodes for improved scalability.
- Updated application services to depend on the Redis cluster and configured environment variables for Redis connections.
- Modified Grafana service to expose its UI on a new port and included provisioning for dashboards and data sources.
- Created a default dashboard for Grafana to visualize socket server metrics.
- Revised README to reflect changes in setup instructions and features for both development and scalable environments.
- Added Loki and Promtail services to the Docker Compose setup for centralized log management.
- Configured Promtail to scrape logs from Docker containers based on labels for better observability.
- Created a new Grafana dashboard for visualizing logs and metrics from the socket server and Redis.
- Updated README with instructions for starting the new logging stack alongside existing services.
- Removed the default Grafana dashboard in favor of a more comprehensive logging dashboard.
… and functionality
- Enhanced the README with clearer instructions for setting up the SDK socket server in development and scalable environments.
- Reformatted bullet points for better readability.
- Updated Grafana dashboard configurations to ensure proper integration with Prometheus and Loki, including adjustments to data source references and layout.
- Improved the handling of log dashboards to ensure consistent data visualization.
- Refactored socket server configuration to utilize a more efficient Redis client setup with expiration handling for channel occupancy.
- Revised the setup instructions in the README to specify the inclusion of Loki alongside Redis, Prometheus, and Grafana in the Docker Compose command for starting background services.
- Deleted the migration status endpoint and its associated logic from the socket server codebase, streamlining the server functionality.
- Initialized Redis health status to false to ensure the first success is logged.
- Enhanced logging to provide clearer messages when transitioning between healthy and unhealthy states.
- Consolidated the import of ClientType from a new socket-types module for better organization.
- Updated multiple protocol files to reflect the new import path for ClientType.
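The "log only on state transitions" behavior this commit describes can be sketched as follows. The class, logger, and message strings here are illustrative, not the PR's actual implementation; the key point is that starting from `false` guarantees the first successful health check produces a log line:

```typescript
// Edge-triggered health logging: emit a log line only when the health state
// changes, not on every check. Initializing `healthy` to false means the
// very first successful check logs a "healthy" transition.
type Logger = (msg: string) => void;

class RedisHealthTracker {
  private healthy = false; // false so the first success is logged

  constructor(private readonly log: Logger) {}

  report(isHealthy: boolean): void {
    if (isHealthy === this.healthy) return; // no transition, stay quiet
    this.healthy = isHealthy;
    this.log(
      isHealthy
        ? "Redis connection is now healthy"
        : "Redis connection is now unhealthy",
    );
  }
}

const messages: string[] = [];
const tracker = new RedisHealthTracker((m) => messages.push(m));
tracker.report(true);  // first success -> logged
tracker.report(true);  // unchanged    -> silent
tracker.report(false); // transition   -> logged
console.log(messages.length); // 2
```

This keeps logs quiet during steady state while still surfacing every healthy/unhealthy flip.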
some comments and questions I had while going through the PR
While I believe these are good high level changes to make and will make the service better, this PR presents significant challenges for review:
Primary concerns:
- ~6K lines of changes across multiple disconnected functional areas
- No tests included, making it difficult to verify this works correctly
- The PR description doesn't clearly explain the rationale behind key architectural decisions or sometimes overstates things
Recommendation: Split into Focused PRs
Breaking this into multiple, focused PRs would significantly improve our ability to review and verify each change:
1. Local Development Environment PR
- Docker environment enhancements for single and multi-instance mode
- Local Prometheus/Grafana/Loki setup
- Scope: Configuration changes only, no business logic modifications
- Purpose: Improve local development experience
2. Analytics Service Extraction PR
- Extract analytics API to a dedicated service (packages/analytics-server)
- Add middleware for redirecting analytics traffic
- Scope: Clean separation of the analytics functionality
- Purpose: Improve modularity and separation of concerns
3. Redis Integration Removal from Analytics Service PR
- Remove Redis dependencies from the extracted analytics service
- Scope: Changes limited to the analytics service implementation
- Purpose: Further decoupling of components
4. Redis Connection Pooling PR
- Replace singleton Redis client with properly managed connection pool
- Add connection retry logic and error handling
- Scope: Redis connection management only
- Purpose: Improve resilience and resource utilization
5. Redis Key Migration PR
- Implement automatic key migration for backward compatibility
- Add proper Prometheus metrics for monitoring the migration
- Scope: Key format changes and migration logic only
- Purpose: Support legacy key formats while moving to new architecture
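To illustrate item 4, a bare-bones bounded pool with acquire/release might look like the following. This is a generic sketch rather than the PR's implementation, which would wrap real Redis clients (and typically queue callers instead of throwing when exhausted):

```typescript
// Generic bounded resource pool: acquire() hands out an idle resource or
// creates a new one up to `max`; release() returns it for reuse, avoiding
// the connection churn of opening a client per operation.
class Pool<T> {
  private idle: T[] = [];
  private created = 0;

  constructor(
    private readonly factory: () => T,
    private readonly max: number,
  ) {}

  acquire(): T {
    const existing = this.idle.pop();
    if (existing !== undefined) return existing;
    if (this.created >= this.max) {
      // A production pool would queue the caller until a resource is released.
      throw new Error("pool exhausted");
    }
    this.created += 1;
    return this.factory();
  }

  release(resource: T): void {
    this.idle.push(resource);
  }

  get size(): number {
    return this.created;
  }
}

// Usage with a stand-in for a Redis client:
let connections = 0;
const pool = new Pool(() => ({ id: ++connections }), 3);
const a = pool.acquire();
const b = pool.acquire();
pool.release(a);
const c = pool.acquire(); // reuses `a` instead of opening a new connection
console.log(pool.size); // 2
```

A real pool would also validate resources on acquire (dropping dead connections) and close idle ones on shutdown, which is where the retry logic and graceful-shutdown work in this PR would hook in.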
Specific Implementation Concerns
- Monitoring Claims: The PR description suggests that Prometheus, Grafana, and Loki are new additions, but these monitoring tools were already set up in production/dev environments months ago. The actual contribution appears limited to configuring a local development setup.
- Redis Key Migration Metrics: The implementation uses a global variable counter exposed via an API endpoint rather than Prometheus metrics, which is inconsistent with our existing observability stack. As already noted in earlier comments, we should be using Prometheus for all metrics.
- Redis Key Migration: Could you please also explain in the PR what the problem with the current key format is, and why the new format would fix the issue?
Thanks!
The Redis key migration is to force related keys to stay on the same host; it appears to be a Redis Cluster-specific key format. These changes were actually started by Eddie in a previous PR, but without the migration and without covering all the keys.
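For background, and this is my understanding rather than a statement about the exact format used in this PR: Redis Cluster assigns every key to one of 16384 hash slots by taking CRC16 of the key modulo 16384, and when a key contains a `{...}` hash tag, only the text inside the braces is hashed. Keys sharing the same tag are therefore guaranteed to land on the same slot, and thus on the same node, so multi-key operations on them cannot fail with cross-slot errors. A sketch (the `channel_*` key names are hypothetical):

```typescript
// Redis Cluster slot calculation: CRC16/XMODEM over the key (or over its
// {hash tag} if present) modulo 16384. Assumes ASCII key names.
function crc16(data: string): number {
  let crc = 0;
  for (let i = 0; i < data.length; i++) {
    crc ^= data.charCodeAt(i) << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

function hashSlot(key: string): number {
  // Only the substring inside the first non-empty {...} pair is hashed.
  const open = key.indexOf("{");
  if (open !== -1) {
    const close = key.indexOf("}", open + 1);
    if (close > open + 1) {
      key = key.substring(open + 1, close);
    }
  }
  return crc16(key) % 16384;
}

// With a shared {channelId} tag, all of a channel's keys are pinned to the
// same cluster node.
const slotA = hashSlot("channel_config:{abc-123}");
const slotB = hashSlot("channel_clients:{abc-123}");
console.log(slotA === slotB); // true
```

This would explain both the migration (rewriting legacy keys into the tagged format on access) and why it needs to cover all related keys to be effective.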
Pull Request Overview
This PR enhances the server infrastructure by introducing a Redis connection pool and an observability stack while refactoring analytics event routing.
- Replaces communication server URL with a dedicated analytics server URL for analytics events.
- Disables the handleClientsConnectedEvent test pending analytics externalization.
- Adds and configures a new analytics server with comprehensive logging, monitoring, and security improvements.
Reviewed Changes
Copilot reviewed 52 out of 61 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| packages/sdk-communication-layer/src/services/RemoteCommunication/EventListeners/handleClientsConnectedEvent.test.ts | Temporarily disables analytics test for externalized analytics server |
| packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/rejectChannel.ts | Updates analytics event reporting URL to analyticsServerUrl |
| packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/disconnect.ts | Consistently updates disconnect analytics endpoint to analyticsServerUrl |
| packages/sdk-communication-layer/src/config.ts | Adds a default analytics server URL configuration |
| packages/sdk-communication-layer/src/RemoteCommunication.ts | Introduces analyticsServerUrl to the state and constructor parameters |
| packages/sdk-communication-layer/src/Analytics.ts | Updates analytics endpoint parameter and targetUrl assignment |
| packages/analytics-server/* | Introduces a new analytics server with proper logging and monitoring configuration |
| packages/analytics-server/README.md and .eslintrc.js | Provides documentation and linting configuration for the analytics server |
Files not reviewed (9)
- package.json: Language not supported
- packages/analytics-client/package.json: Language not supported
- packages/analytics-server/.eslintignore: Language not supported
- packages/analytics-server/.gitignore: Language not supported
- packages/analytics-server/.prettierignore: Language not supported
- packages/analytics-server/.prettierrc: Language not supported
- packages/analytics-server/Dockerfile: Language not supported
- packages/analytics-server/package.json: Language not supported
- packages/analytics-server/tsconfig.json: Language not supported
Comments suppressed due to low confidence (2)
packages/sdk-communication-layer/src/services/RemoteCommunication/EventListeners/handleClientsConnectedEvent.test.ts:13
- The test for handleClientsConnectedEvent is disabled; please consider adding a TODO or note indicating when and under what conditions the test should be re-enabled.
// Disabled while checking externalizing analytics server.
packages/sdk-communication-layer/src/services/RemoteCommunication/ConnectionManager/rejectChannel.ts:59
- Ensure that replacing state.communicationServerUrl with state.analyticsServerUrl in analytics event dispatch is intentional and consistently supported across all modules handling analytics events.
state.analyticsServerUrl,
Description
Overview
This PR introduces significant infrastructure improvements to our socket server architecture, focusing on enhanced reliability, scalability, and observability. The key changes include:
Key Implementation Details
Redis Improvements
Monitoring Stack
Architecture Changes
- Analytics functionality extracted to a dedicated service (packages/analytics-server)

Docker Environment Enhancements
Breaking Changes
- Analytics endpoints (/evt and /debug) are now redirected to a separate service

Migration Steps
This PR is designed to maintain backward compatibility, so explicit migration steps shouldn't be necessary. The server will automatically:
Testing Instructions
Development Mode
Scalable Mode Simulation
Benefits & Impact