You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If usage reporting is enabled, Backend will send a sizable number of metrics in its usage report (almost 100). Some of the database queries behind those metrics touch every row in the largest tables (e.g., the audits, submissions, and client_audits tables), so those queries could take longer. Currently, we query all metrics in parallel. Especially as we continue to add metrics, it seems possible that the number of concurrent queries will lead to performance problems. The usage report is run at 3 a.m. UTC, but in some time zones, user activity is expected at that time.
In practice, I don't see much evidence that there have been performance problems. Looking at analytics from the past 31 days, the vast majority of usage reports were received within a minute of 3 a.m. A small number of usage reports were received after 1 minute past 3 a.m., including some after 5 minutes and one after 40 minutes (!). There are multiple variables that could affect the time of submission receipt, so those numbers don't necessarily speak to the performance of the analytics queries, though that's a possibility.
Side note: I'm actually confused about how a usage report could be successful after taking more than 1 minute, much less 5 minutes. I'm pretty sure that the number of analytics queries exceeds the size of the database connection pool. Doesn't Slonik time out fairly quickly if it can't acquire a database connection? Why isn't it timing out these queries? We've seen that transactions use a single database connection even if they make multiple queries, but I don't think that the analytic queries are run in a transaction. Maybe most of the queries are fast enough that it's not a problem.
If query timeout were an issue, I'd be concerned that we're not receiving analytics from large servers where the queries could take a long time. I see large servers in our analytics, so I don't have reason to believe that's happening. But like I said, I'm confused about why query timeout doesn't seem to be an issue here.
Even assuming query timeout isn't an issue, I think it makes sense to reduce the number of concurrent queries in order to reduce the risk of performance issues.
There's also an issue in testing that I think is related to the number of concurrent queries. When I run tests, I see a lot of MaxListenersExceededWarning. Many of those seem related to analytics. For example:
task: analytics
✔ should not compute analytics if not enabled
✔ should not compute analytics if explicitly disabled
✔ should not compute analytics if analytics sent recently
(node:80089) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
✔ should send analytics if enabled and time to send (111ms)
(node:80089) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
✔ should resend analytics if last attempt failed (92ms)
(node:80089) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
✔ should log event and full report if analytics sent successfully (87ms)
(node:80089) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 notice listeners added to [Client]. MaxListeners is 10. Use emitter.setMaxListeners() to increase limit
✔ should log request errors (85ms)
We've encountered this warning before and connected it to a Slonik issue. That issue was resolved in Slonik 23.6.3, but we're on 23.6.2 (so close!).
In at least one place in testing, we set defaultMaxListeners to more than 10. Instead of setting it in more tests, I think it'd probably be better to address the underlying cause behind the warning. However, I wanted to mention it as one workaround we've tried.
If usage reporting is enabled, Backend will send a sizable number of metrics in its usage report (almost 100). Some of the database queries behind those metrics touch every row in the largest tables (e.g., the
audits
,submissions
, andclient_audits
tables), so those queries could take longer. Currently, we query all metrics in parallel. Especially as we continue to add metrics, it seems possible that the number of concurrent queries will lead to performance problems. The usage report is run at 3 a.m. UTC, but in some time zones, user activity is expected at that time.In practice, I don't see much evidence that there have been performance problems. Looking at analytics from the past 31 days, the vast majority of usage reports were received within a minute of 3 a.m. A small number of usage reports were received after 1 minute past 3 a.m., including some after 5 minutes and one after 40 minutes (!). There are multiple variables that could affect the time of submission receipt, so those numbers don't necessarily speak to the performance of the analytics queries, though that's a possibility.
Side note: I'm actually confused about how a usage report could be successful after taking more than 1 minute, much less 5 minutes. I'm pretty sure that the number of analytics queries exceeds the size of the database connection pool. Doesn't Slonik time out fairly quickly if it can't acquire a database connection? Why isn't it timing out these queries? We've seen that transactions use a single database connection even if they make multiple queries, but I don't think that the analytic queries are run in a transaction. Maybe most of the queries are fast enough that it's not a problem.
If query timeout were an issue, I'd be concerned that we're not receiving analytics from large servers where the queries could take a long time. I see large servers in our analytics, so I don't have reason to believe that's happening. But like I said, I'm confused about why query timeout doesn't seem to be an issue here.
Even assuming query timeout isn't an issue, I think it makes sense to reduce the number of concurrent queries in order to reduce the risk of performance issues.
There's also an issue in testing that I think is related to the number of concurrent queries. When I run tests, I see a lot of
MaxListenersExceededWarning
. Many of those seem related to analytics. For example:We've encountered this warning before and connected it to a Slonik issue. That issue was resolved in Slonik 23.6.3, but we're on 23.6.2 (so close!).
In at least one place in testing, we set
defaultMaxListeners
to more than 10. Instead of setting it in more tests, I think it'd probably be better to address the underlying cause behind the warning. However, I wanted to mention it as one workaround we've tried.central-backend/test/integration/other/analytics-queries.js
Line 2148 in 63ca788
That said, even if we bumped Slonik and addressed the warning, I think we'd still want to think about reducing the number of concurrent queries.
The text was updated successfully, but these errors were encountered: