feat: improved cross platform metric collection #2834

NathanSavageKaimai · 2025-02-09T02:06:17Z

This PR improves CPU and RAM metric collection across multiple environments. The CPU metrics are now fully cGroup aware and report properly in containerised environments with CPU quota limits. The memory profiling method on Windows has been updated from the legacy WMIC to PowerShell. Finally, two new utility methods have been added to general.ts in @crawlee/utils: one determines whether the scraper is containerised (rather than just running in Docker), and the other whether cGroup is enabled.

Adds

@crawlee/utils

general.ts

isContainerised() an extension of isDocker() that also checks for the presence of a KUBERNETES_SERVICE_HOST environment variable for Kubernetes and a CRAWLEE_CONTAINERISED environment variable for manual control.
getCgroupsVersion() a method to determine the cGroup version in a cGroup controlled environment. It does this by checking for a file at /sys/fs/cgroup/memory/. If it is present, the cGroup version is 1, else version 2.
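
The two helpers described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the `Sketch` suffixes, the injectable `env` and `pathExists` parameters, and the `'V1' | 'V2'` return type are hypothetical and exist only so the logic can be exercised outside a real container. The `isDocker()` fallback is left out.

```typescript
import { existsSync } from 'node:fs';

type Env = Record<string, string | undefined>;

/**
 * Hypothetical sketch: true when the manual override or the Kubernetes
 * service variable is present. The real helper also falls back to
 * isDocker(), which is omitted here.
 */
function isContainerisedSketch(env: Env = process.env): boolean {
    if (env.CRAWLEE_CONTAINERISED) return true;
    return Boolean(env.KUBERNETES_SERVICE_HOST);
}

/**
 * Hypothetical sketch: cgroup v1 exposes a per-controller path such as
 * /sys/fs/cgroup/memory/, so its presence is treated as version 1 and
 * its absence as version 2.
 */
function getCgroupsVersionSketch(pathExists: (path: string) => boolean = existsSync): 'V1' | 'V2' {
    return pathExists('/sys/fs/cgroup/memory/') ? 'V1' : 'V2';
}
```

Passing the environment and the path check as parameters keeps both functions free of global state, which makes them easy to unit test without mocking `fs`.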

cpu-info.ts

Collects CPU information in a similar manner to memory-info.ts

getCurrentCpuTicks() The existing solution. Used in AWS lambda, containerised environments without a cGroup cpu limit and on bare metal.
getCpuQuota() Gets the cpu quota in cGroup controlled environments.
getCpuPeriod() Gets the cpu quota period in cGroup controlled environments.
getContainerCpuUsage() Gets the container's CPU usage.
getSystemCpuUsage() Gets the system's CPU usage.
getCpuInfo() The main method for collecting CPU load metrics. Determines the environment and calls the other functions accordingly.
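
To illustrate how a quota and a period combine into a CPU limit, here is a small sketch. It assumes the cgroup v2 `cpu.max` file format (`"<quota> <period>"` in microseconds, with `max` meaning no quota); the function names and the `CpuLimit` type are hypothetical, and the PR's own functions may read and combine these values differently.

```typescript
interface CpuLimit {
    quotaUs: number; // CPU time allowed per period, in microseconds
    periodUs: number; // length of the enforcement period, in microseconds
}

/** Parses the contents of a cgroup v2 `cpu.max` file, e.g. "200000 100000" or "max 100000". */
function parseCpuMax(contents: string): CpuLimit | null {
    const [quota, period] = contents.trim().split(/\s+/);
    if (quota === 'max') return null; // no quota configured
    const quotaUs = Number(quota);
    const periodUs = Number(period);
    if (!Number.isFinite(quotaUs) || !Number.isFinite(periodUs) || periodUs <= 0) return null;
    return { quotaUs, periodUs };
}

/** Load ratio (0..1): CPU time consumed over an interval relative to what the quota allows. */
function cpuLoadRatio(usedCpuUs: number, elapsedWallUs: number, limit: CpuLimit): number {
    const allowedCores = limit.quotaUs / limit.periodUs;
    const allowedCpuUs = elapsedWallUs * allowedCores;
    if (allowedCpuUs <= 0) return 0;
    return Math.min(1, usedCpuUs / allowedCpuUs);
}
```

For example, a quota of 200000 with a period of 100000 allows two cores' worth of CPU time, so one second of CPU time consumed during one second of wall time gives a load ratio of 0.5.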

Removes

@apify/ps-tree

A legacy package that checked memory usage using WMIC.exe (deprecated) on Windows or ps on *nix. Replaced by @crawlee\packages\utils\src\internals\psTree.ts which calculates the memory usage in a similar manner but using PowerShell and Get-CimInstance Win32_Process. Also adds type safety.
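
Whichever command produces the flat process list, the tree walk itself is platform independent. The sketch below shows that step only. It assumes each row carries a process ID, a parent process ID and a memory figure (on Windows these could come from the `ProcessId`, `ParentProcessId` and `WorkingSetSize` properties of `Win32_Process`); the `ProcessRow` type and `collectDescendants` name are hypothetical and are not taken from the PR.

```typescript
interface ProcessRow {
    pid: number;
    ppid: number;
    memoryBytes: number;
}

/** Collects all descendants of `rootPid` from a flat process list. */
function collectDescendants(rows: ProcessRow[], rootPid: number, includeRoot = false): ProcessRow[] {
    // Index the rows by parent PID so each lookup is a single map access.
    const childrenByParent = new Map<number, ProcessRow[]>();
    for (const row of rows) {
        const siblings = childrenByParent.get(row.ppid) ?? [];
        siblings.push(row);
        childrenByParent.set(row.ppid, siblings);
    }

    const result: ProcessRow[] = includeRoot ? rows.filter((row) => row.pid === rootPid) : [];
    const queue = [rootPid];
    const seen = new Set<number>([rootPid]); // guards against cycles, e.g. a PID listed as its own parent

    // Breadth-first walk from the root through children, grandchildren, and so on.
    while (queue.length > 0) {
        const parent = queue.shift()!;
        for (const child of childrenByParent.get(parent) ?? []) {
            if (seen.has(child.pid)) continue;
            seen.add(child.pid);
            result.push(child);
            queue.push(child.pid);
        }
    }
    return result;
}
```

Summing `memoryBytes` over the returned rows then gives the memory used by child processes.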

Fixes

Fixes: #2771

NathanSavageKaimai · 2025-02-11T11:59:20Z

hey all, thanks for agreeing to take a look. I haven't done much OSS before so I'm looking forward to hearing your thoughts :). Assuming it all looks good to you, when might it be incorporated into a crawlee release? At work we have a project that hinges on using crawlee in k8s, so the autoscaling issues in containers are causing a fairly significant problem for us.

When you come to review it, id be happy to hop on a discord call and discuss it. :)

Thanks for everything you do!

janbuchar · 2025-02-11T13:28:04Z

Hi @NathanSavageKaimai and thanks for your willingness to contribute! In the issue that this aims to close, you mentioned the possibility of using the ps-list package. If we decided that adding another dependency is fine, could the change be smaller? How much? Are there any other tradeoffs or possible disadvantages to using that library?

NathanSavageKaimai · 2025-02-11T13:37:11Z

hi @janbuchar, ps-list would serve the same purpose as the new packages/utils/src/internals/psTree.ts file so it would be a 170 line reduction. Another module that might be useful that i have found since is systeminformation. This module could likely replace cpu-info.ts, memory-info.ts and psTree.ts, leading to a ~540 line reduction. :)

Let me know if you would like me to explore these options.

NathanSavageKaimai · 2025-02-11T13:39:53Z

with ps-list, you are relying on a third party binary which doesnt provide its source code as far as i can tell. It could be a potential supply chain risk.

vladfrangu · 2025-02-11T13:41:21Z

I will +1 that worry, I'm not a fan of using a dependency that embeds a binary whose source code isn't directly open source / one we could build ourselves and embed

janbuchar · 2025-02-11T14:00:40Z

Um, as far as I can tell, ps-list uses fastlist, which seems open enough to me - am I missing anything?

NathanSavageKaimai · 2025-02-11T14:13:37Z

@janbuchar Ah yep, i hadnt found the cpp repo. still, being externally tracked, theres no automatic method to verify the authenticity of the binary beyond downloading from both sources and checking the hashes.

vladfrangu · 2025-02-11T14:14:18Z

> Um, as far as I can tell, ps-list uses fastlist, which seems open enough to me - am I missing anything?

The fact the binary is just embedded in instead of precompiled (like impit) or built at install time is a worry imo

janbuchar · 2025-02-11T14:20:43Z

You both make a good point. I'd still consider exploring systeminformation - if we can avoid maintaining code for reading low level system details, it might be worth the increased install size.

NathanSavageKaimai · 2025-02-11T14:22:02Z

> You both make a good point. I'd still consider exploring systeminformation - if we can avoid maintaining code for reading low level system details, it might be worth the increased install size.

cool will do. :)

NathanSavageKaimai · 2025-02-11T18:02:48Z

@janbuchar ive had a play around with systeminformation and unfortunately it isnt as useful as i hoped it would be. It seems that the "docker" functions arent generating the metrics themselves but sending requests to the docker socket for the data. This unfortunately means that short of mounting the socket within the container, it can only work on the host. I have done a good search and as far as i can tell, there is no universal, cross platform library to collect cpu and ram metrics on "bare metal", cgroup 1 and 2. There might even be scope here for an entirely new package for apify but for now, my approach seems to be the best one going forward. :)

janbuchar · 2025-02-12T09:20:51Z

Sounds reasonable, thanks! I will try to review the code in depth this week.

NathanSavageKaimai · 2025-02-14T11:12:35Z

hey @janbuchar have you been able to have a look yet? No worries either way. if you like, we can sit down for a call later. Just shoot me a message on Discord - crafty5064. Im free until 2pm UTC or all day Saturday. :)

NathanSavageKaimai · 2025-02-17T16:03:06Z

Hi all,

I hope you had a good weekend! Have you had a chance to review these changes? I was hoping they might be included in a release soon as these issues are blocking my company from deploying our product fully to k8s. If you would like a chat, im free until 1pm UTC tomorrow. :)

janbuchar · 2025-02-17T16:11:26Z

Hi, I'll look into it tomorrow. Sorry for the delay!

janbuchar

Good job on this one! I have a bunch of readability/code structure comments. More importantly though, the tests here are very narrowly scoped and use mocking heavily. Do you think you could add an E2E test that would verify that the system information detection works as expected? Feel free to suggest any other way to test this as a whole.

test/core/autoscaling/snapshotter.test.ts

test/browser-pool/browser-plugins/plugins.test.ts

test/utils/general.test.ts

packages/core/src/events/local_event_manager.ts

janbuchar · 2025-02-18T21:04:49Z

packages/utils/src/internals/general.ts

@@ -43,6 +43,35 @@ export async function isDocker(forceReset?: boolean): Promise<boolean> {
    return isDockerPromiseCache;
 }

+/**
+ * Returns a `Promise` that resolves to true if the code is running in a containerised environment.
+ * Returns true if the CRAWLEE_CONTAINERISED environment variable is set.


Who is supposed to set the CRAWLEE_CONTAINERISED environment variable?

that is meant to be a manual way to run the containerised resource checks in case the other heuristics dont catch it.

Got it. Could you add it to the documentation?

would you mind showing me where? im a little bit lost on that side of it, ta.

Im free for a call right now if you would like a chat. :)

I understand that 😁 https://github.com/apify/crawlee/blob/master/docs/guides/configuration.mdx this is the place

janbuchar · 2025-02-18T21:37:00Z

packages/utils/src/internals/psTree.ts

+ * @param includeRoot - Optional flag. When true, include the process with the given PID if found.
+ *                      Defaults to false.
+ */
+export async function psTree(pid: number | string, includeRoot: boolean = false): Promise<ProcessInfo[]> {


This is a very long and complex function. Possibly the main reason why it's hard to read is that it combines the implementation for UNIX and Windows in a single function. Could it be broken down into multiple smaller functions?

it was pretty much a copy paste from apify/pstree with the WMIC changes. I can reformat it. :)

Please do that, apify/pstree is super dated and I'm sure that if we don't refactor it now, we won't get back to it, ever.

cool will do :)

packages/utils/src/internals/memory-info.ts

packages/utils/src/internals/cpu-info.ts

NathanSavageKaimai · 2025-02-18T21:56:22Z

hi @janbuchar thanks for your insight! Most of the points were just me trying to follow conventions set out in the preexisting memory-info.ts file but i will definitely work on clarifying it. :)

janbuchar · 2025-02-19T13:50:50Z

@NathanSavageKaimai please look into refactoring of psTree so that it's more readable.

Also, any thoughts regarding this?

> More importantly though, the tests here are very narrowly scoped and use mocking heavily. Do you think you could add an E2E test that would verify that the system information detection works as expected? Feel free to suggest any other way to test this as a whole.

NathanSavageKaimai · 2025-02-19T14:03:29Z

> @NathanSavageKaimai please look into refactoring of psTree so that it's more readable.
>
> Also, any thoughts regarding this?
>
> More importantly though, the tests here are very narrowly scoped and use mocking heavily. Do you think you could add an E2E test that would verify that the system information detection works as expected? Feel free to suggest any other way to test this as a whole.

its a difficult one given that its so close to the metal. an e2e test would be dependent on the current state of the test runner unless i mocked the exec call and readline interface, but at that point it may as well be a unit test. Also, personally I am only set up to run tests on windows or linux through wsl so i cant verify macOS compatibility beyond "its a copy paste from a solution that presumably worked". What i will do is reimplement the unit tests in apify/pstree.

janbuchar · 2025-02-19T15:22:14Z

> What i will do is reimplement the unit tests in apify/pstree

Cool, that will at least give us some certainty that ps-tree works.

I guess we could make a script that would show the current CPU and memory usage ratio and compare it with the old implementation. Then we could at least test-drive this on several machines with different OS and see if it behaves reasonably. What do you think?
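
A comparison script along these lines could start from something like the sketch below. It is only an illustration: the helper names are hypothetical, and it measures host-wide figures through Node's `os` module rather than calling either the old or the new Crawlee implementation.

```typescript
import { cpus, freemem, totalmem } from 'node:os';

interface TickSnapshot {
    idle: number;
    total: number;
}

/** Sums idle and total ticks across all cores as reported by os.cpus(). */
function takeTickSnapshot(): TickSnapshot {
    let idle = 0;
    let total = 0;
    for (const cpu of cpus()) {
        idle += cpu.times.idle;
        total += cpu.times.user + cpu.times.nice + cpu.times.sys + cpu.times.idle + cpu.times.irq;
    }
    return { idle, total };
}

/** Fraction of non-idle ticks between two snapshots, in the range 0..1. */
function cpuRatioBetween(previous: TickSnapshot, current: TickSnapshot): number {
    const totalDelta = current.total - previous.total;
    if (totalDelta <= 0) return 0;
    const idleDelta = current.idle - previous.idle;
    return Math.min(1, Math.max(0, 1 - idleDelta / totalDelta));
}

/** Fraction of system memory currently in use, in the range 0..1. */
function memoryRatio(): number {
    return 1 - freemem() / totalmem();
}

// Usage: print both ratios after a one-second sampling window.
// const previous = takeTickSnapshot();
// setTimeout(() => console.log(cpuRatioBetween(previous, takeTickSnapshot()), memoryRatio()), 1000);
```

Running this next to the values reported by the old and new implementations on the same machine would show where they diverge, for instance inside a container with a CPU quota, where host-wide ticks are expected to understate the load.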

janbuchar · 2025-02-21T15:16:18Z

Hi @NathanSavageKaimai, we discussed this PR with @B4nan. Since this change impacts critical parts of Crawlee's functionality and testing it deeply enough requires time that we don't currently have (taking into account that you need this released soon), would you be open to making the new system info implementation opt-in using some kind of a feature flag?

If yes, we could probably release it and test it later, before we decide to make the new implementation the default.

NathanSavageKaimai · 2025-02-21T16:05:31Z

Hi @janbuchar, sounds good, do you want me to make it an experiment? Also, any preference on what i call the flag?

Ta

janbuchar · 2025-02-21T16:19:35Z

Yup, CrawlerExperiments sounds like the way to go. If you can't think of anything more descriptive, systemInfoNG or systemInfoV2 work just fine for me 🙂


NathanSavageKaimai · 2025-02-21T22:23:23Z

The original metric collection has been restored and the new solution made toggleable with the systemInfoV2 experimental feature flag. I ended up needing to add the systemInfoV2 flag to the configuration class as localEventsManager isn't tied to a particular scraper. The only potential issue with this is that if someone was running multiple scrapers, some with the experiment and others without, it would use whichever setting was on the crawler instantiated last.

NathanSavageKaimai · 2025-02-23T03:47:25Z

hi, just been doing some more tests and i found an issue, please dont merge yet.

ta


NathanSavageKaimai · 2025-02-24T16:24:12Z

hi again, @janbuchar. I have fixed the scaling error. If you are all happy about it, perhaps we could merge it in soon? Thanks. :)

janbuchar · 2025-02-24T20:46:51Z

Perfect! I think there are no major issues. We agreed with @B4nan that he'll quickly scan it and, if everything is fine, merge it. In the meantime, I'll try and test some risky changes we have waiting in the release branch so that we can make a stable release.

Thanks again for your contribution and for bearing with us 🙂

NathanSavageKaimai · 2025-02-24T20:59:42Z

sounds good, just one more thing i have noticed: currently, "TICKS_PER_SECOND" is hardcoded to 100. im going to introduce a quick check to read it from the kernel, as i have found out that on some systems it can, rarely, be different.
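
One common way to query the tick rate is the POSIX `getconf CLK_TCK` command. The sketch below assumes that approach and falls back to 100 when the command is missing or returns something unexpected. The thread does not say how the PR reads the value, so the function names and the fallback behaviour here are hypothetical.

```typescript
import { execSync } from 'node:child_process';

const DEFAULT_TICKS_PER_SECOND = 100;

/** Parses the output of `getconf CLK_TCK`, falling back to 100 when it is not a positive integer. */
function parseTicksPerSecond(output: string): number {
    const parsed = Number.parseInt(output.trim(), 10);
    return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_TICKS_PER_SECOND;
}

/** Reads the tick rate from the system, or returns the default if the command is unavailable. */
function getTicksPerSecond(): number {
    try {
        const output = execSync('getconf CLK_TCK', {
            encoding: 'utf8',
            stdio: ['ignore', 'pipe', 'ignore'], // keep stderr quiet on systems without getconf
        });
        return parseTicksPerSecond(output);
    } catch {
        return DEFAULT_TICKS_PER_SECOND;
    }
}
```

Since the value does not change while the process runs, it would make sense to call `getTicksPerSecond()` once and cache the result rather than spawning a process on every snapshot.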


NathanSavageKaimai · 2025-02-25T15:42:17Z

hi @B4nan, thanks for running the pull request toolkit, have you been able to have a look over the changes? No worries either way. :)

NathanSavageKaimai · 2025-02-26T15:21:34Z

Hi @janbuchar have you been able to talk with @B4nan? my boss is asking for a timeline.

Ta

B4nan · 2025-02-26T15:23:43Z

You will need to wait a bit, the PR is large, so it takes time. If you need to adopt this in your own project asap, I'd suggest you use something like https://github.com/ds300/patch-package instead of pushing us for merging it early.

NathanSavageKaimai · 2025-02-26T15:29:35Z

Apologies, i didnt mean to offend. My understanding of the situation was that you were happy with janbuchar's review and you were going to simply scan it.

Thanks for all you are doing.

B4nan

left a few comments, haven't seen the whole thing yet, but it's looking pretty good. i'd remove the experiments option in favor of the configuration one, having both of them feels weird, this is surely a system-wide option, so the config class is more appropriate for it

B4nan · 2025-02-26T12:40:16Z

packages/basic-crawler/src/internals/basic-crawler.ts

+    /**
+     * Enables the use of the new resource management system.
+     * It should improve autoscaling in containerized environments by respecting cGroup resource limits.
+     */
+    systemInfoV2?: boolean;


let's keep only the configuration option, it's weird to have two feature flags for one thing

B4nan · 2025-02-26T12:42:54Z

test/utils/memory-infoV2.test.ts

+    // TODO: check if this comment is still accurate
+    // this test hangs because we launch the browser, closing is apparently not enough?


i guess this is not relevant anymore, given the test doesn't hang?

B4nan · 2025-02-26T12:44:28Z

packages/utils/src/internals/general.ts

@@ -15,6 +15,8 @@ export const URL_NO_COMMAS_REGEX =
 export const URL_WITH_COMMAS_REGEX =
    /https?:\/\/(www\.)?([\p{L}0-9]|[\p{L}0-9][-\p{L}0-9@:%._+~#=]{0,254}[\p{L}0-9])\.[a-z]{2,63}(:\d{1,5})?(\/[-\p{L}0-9@:%_+,.~#?&/=()]*)?/giu;

+export const FALSY_REGEX = /^false$/giu;


do we really need to export this?

not really, i just did that since the other regexes were exported

B4nan · 2025-02-26T12:50:20Z

packages/core/src/events/local_event_manager.ts

+            log.exception(err as Error, 'Cpu snapshot failed.');
+            return {};


is this a good idea? downstream will expect values, and end up working with undefined instead

unless there is a good reason, i would remove the try/catch here

will do, i put that for parity with the existing createMemoryInfo function

packages/utils/src/internals/systemInfoV2/cpu-info.ts

B4nan · 2025-02-26T12:54:51Z

packages/utils/src/internals/systemInfoV2/cpu-info.ts

+ * @returns a number between 0 and 1 for the cpu load
+ */
+export async function getCurrentCpuTicksV2(): Promise<number> {
+    if (await isContainerized()) {


check for the negation and return early so the huge nested block doesnt end up nested

B4nan · 2025-02-27T09:35:31Z

packages/core/src/events/local_event_manager.ts

+            if (this.config.get('systemInfoV2')) {
+                const usedCpuRatio = await getCurrentCpuTicksV2();
+                return {
+                    cpuCurrentUsage: usedCpuRatio * 100,
+                    isCpuOverloaded: usedCpuRatio > options.maxUsedCpuRatio,
+                };
+            }


only this part needs to be in the try/catch if we want to keep it. but i would rather not have it here, if we want to have it somewhere, why not have it inside the getCurrentCpuTicksV2 method and return some meaningful defaults instead?

B4nan · 2025-02-27T09:36:36Z

packages/core/src/events/local_event_manager.ts

    }

    private async createMemoryInfo() {
        try {
-            const memInfo = await this._getMemoryInfo();
+            let memInfo = { mainProcessBytes: -1, childProcessesBytes: -1 };


whats the point of the default value here? its always overridden

it was to properly scope the variable since memInfo is used in the same manner by v1 and v2, i will reformat

B4nan · 2025-02-27T09:40:51Z

packages/utils/src/internals/general.ts

+    if (process.env.CRAWLEE_CONTAINERIZED) {
+        isContainerizedResult = !FALSY_REGEX.test(process.env.CRAWLEE_CONTAINERIZED);
+        return isContainerizedResult;
+    }


ideally this would be handled via configuration class too, which normalizes env vars

unfortunately, we cannot access the Configuration class in utils, it would create a circular dependency. I could refactor it into core?
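
Wherever the parsing ends up living, the normalisation of the environment variable can be a small pure function. The sketch below is a hypothetical helper, not Crawlee's Configuration logic. It uses a plain string comparison because a regular expression with the `g` flag keeps its `lastIndex` between `.test()` calls, so repeated calls on the same input can return different results.

```typescript
/**
 * Hypothetical helper: interprets an environment variable as a boolean flag.
 * Unset or empty -> undefined (no override); "false" in any letter case -> false;
 * any other value -> true.
 */
function parseBooleanEnv(value: string | undefined): boolean | undefined {
    if (value === undefined || value === '') return undefined;
    // Stateless comparison: calling this repeatedly always yields the same answer.
    return value.trim().toLowerCase() !== 'false';
}
```

A caller could then treat `undefined` as "fall through to the other heuristics" and use a defined result as the manual override.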

NathanSavageKaimai · 2025-02-27T12:31:29Z

hi @B4nan, working on those changes now. If we only want the configuration feature flag and not the experiment one, do you still want the experiment documentation page or should i move it somewhere else?

Ta

B4nan · 2025-02-27T12:51:07Z

yeah, let's keep the docs, that's fine

- removed systemInfoV2 experiment flag in favour of config flag
- added containerized / CRAWLEE_CONTAINERIZED config flag
- isContainerized no longer checks the 'containerized' flag, instead 'containerized' is used in place of isContainerized in LocalEventManager

B4nan requested review from janbuchar, B4nan and vladfrangu February 10, 2025 10:57

NathanSavageKaimai force-pushed the master branch from f7d7518 to 173630c Compare February 12, 2025 15:10

janbuchar requested changes Feb 18, 2025

NathanSavageKaimai requested a review from janbuchar February 19, 2025 01:52

NathanSavageKaimai added 2 commits February 21, 2025 21:33

feat(featureFlag): reimplemented improved metric collection under the 'systemInfoV2' experiment feature flag

ff2abed

chore: linting

ae4a502

NathanSavageKaimai force-pushed the master branch from c849613 to ae4a502 Compare February 21, 2025 22:16

fix(cpu status): cpu V2 was reporting values between 0 and 100 not 0 and 1 fixed implementation and test

395d450

NathanSavageKaimai added 2 commits February 24, 2025 21:21

fix(variable tick rate): added a check for the specific tickrate set in the linux kernel

135b655

fix(linting): missing semicolon

8bef305

B4nan requested changes Feb 27, 2025

NathanSavageKaimai added 2 commits February 27, 2025 15:07

fix(linting)

8fb81a1

NathanSavageKaimai requested a review from B4nan February 27, 2025 15:08

fix(tests): removed now redundant spies

e4a2d2f
