Nearform can no longer host machines #3615

Open
mhdawson opened this issue Jan 22, 2024 · 47 comments
@mhdawson
Member

Creating this to capture/track as opposed to email discussion which is harder to pull people into.

Nearform has let the Build WG know through email that they can no longer host the machines they have been hosting for us in their datacenter. These include:

  • 2 Windows on ARM machines
  • 3 OSX machines
  • 2 Large benchmarking machines.

They have proposed moving them to another hoster, which would cost 3856 Euros as a one-time move cost and then 850 Euros per month as an ongoing cost.

From informal discussion so far we believe we don't need the Windows on ARM machines as they have been replaced by machines in Azure. That may make the cost a bit lower.
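As a rough illustration of the quoted figures, the cumulative cost of the proposed move can be sketched as below. This assumes the numbers apply to all machines, before any reduction from dropping the Windows on ARM machines; the function name is made up for illustration.

```python
# Figures quoted in the proposal (all machines included).
MOVE_COST_EUR = 3856      # one-time move cost
MONTHLY_COST_EUR = 850    # ongoing hosting cost per month


def total_cost_eur(months: int) -> int:
    """Cumulative cost in Euros after the given number of months."""
    return MOVE_COST_EUR + MONTHLY_COST_EUR * months


print(total_cost_eur(12))  # 3856 + 850 * 12 = 14056
```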

The options going forward at a high level would be:

  • Move all the machines (minus Windows on ARM), per the proposal from NearForm
    outlined above, which would require Foundation funding.
  • Find a new host/company willing to sponsor hosting/or host the machines for free.
  • Find replacements (minus windows on ARM)
    • Possibly reach out to MacInCloud for OSX machines (1 ARM and two x86)
    • Find new sponsor for large benchmarking machines.

Initial discussion is that we don't believe we can/should just create the larger machines in existing hosters. As part of this process we should also confirm with the performance team what size machines are actually needed.

Given that there have been discussions with the Foundation/Linux IT team about them helping to manage machines, and their stated approach of "fully owning" what they manage, it would be good to see if Linux IT can take on solving this time-sensitive issue for the project.

@bensternthal could you take on getting Linux IT to give us a yes/no in terms of taking this on, ideally in the timeframe needed by Nearform?

@efrisby could you share what the required timeframe for a move is?

@efrisby

efrisby commented Jan 22, 2024

Hi Michael,

We have a Fibre line that needs to be removed that controls the fixed ip addresses that are currently on the servers. We can keep this in place for a period of time, but we would be hoping to shut down in under 2 months.

So if we could set the 1st of April as the deadline, would that give enough time to address the above?

Thanks,
Eamonn

@mhdawson
Member Author

@nodejs/performance FYI

@ryanaslett
Contributor

Hi Michael, Ryan Aslett from LF IT here.

I've been doing a bit of background research on this and wanted to make sure I understood what the requirements are.

If I understand the situation correctly, these are physical machines that Nearform is hosting for nodejs in their datacenter, which they can no longer continue to support having on their network.

My understanding is that the 2 Windows on ARM machines (Surface Pro)
https://github.com/nodejs/build/blob/main/ansible/inventory.yml#L227-L228 were donated/on loan from ARM (#2540 (comment) and #2540 (comment)) and have since been decommissioned in favor of resources at Azure. Looks like there is still some question as to what state they're in (#3286). It seems prudent that those either get returned to ARM or dealt with some other way.

The OSX machines:
It also seems like there are 4(?) OSX machines: two x86 ones, with 2 VMs running on each via VMware Fusion, and 2 ARM-based ones that I believe are just bare metal machines (I couldn't find any info about the state of any VMs on the ARM Mac Minis in the issue history, other than the fact that the IPs in the inventory match what @efrisby mentioned here: #3390 (comment)).

On one of the x86 Mac Minis, the VMs were split between release and test for 10.15, and the other has 2 VMs dedicated to test of 10.15-x64.

The release machine was retired because 10.15 isn't able to run Xcode 13 and notarize.

It looks like there were some recent experiments to get 10.15 x64 tests to run on orka: https://ci.nodejs.org/computer/test%2Dorka%2Dmacos10.15%2Dx64%2D1/builds .

For the 2 ARM-based Mac Minis, it seems like the test one has been unused for the last 11 days:
https://ci.nodejs.org/computer/test%2Dnearform%2Dmacos11.0%2Darm64%2D1/builds, and its jobs are being run on MacStadium nodes already: https://ci.nodejs.org/computer/test%2Dmacstadium%2Dmacos11.0%2Darm64%2D4/builds

I do not have access to the release Jenkins, so I'm not sure what the status is of the Nearform ARM release machine, other than it seems like we now have two functioning release machines for macos11-x64-1 #3179 (comment).

Given that, I wonder if we already have capacity at MacStadium and Orka to handle the roles these Nearform OSX machines are performing? (Though perhaps we might need an additional Orka test runner for 10.15/x64.)

Would the goal in pursuing MacInCloud or another provider (i.e. sponsored https://aws.amazon.com/pm/ec2-mac/ ) be mostly for redundancy and resiliency against provider outages?

Large Benchmarking Machines:
Other than the specs of the machines themselves,
#791 (comment), I'm not sure how these are used, and whether or not there is a requirement to keep those specific machines as the benchmark machines, or if moving to another resource is an option.

Would changing the benchmark infra be an option? Can those be virtual/cloud based machines, or is bare metal a requirement?

In any case I look forward to helping get this figured out.

@mcollina
Member

Would changing the benchmark infra be an option? can those be virtual/cloud based machines, or is bare metal a requirement?

Virtual is really not an option. However, any "bare metal" host would do. I personally use a Hetzner machine for similar purposes (it's significantly worse/slower, but we need the consistency of results, not the actual speed).

In terms of resources, we could do with similar specs (I don't have those handy), or even something a bit less powerful.

My 2 cents is that those machines are likely near end-of-life.

@Uzlopak

Uzlopak commented Jan 23, 2024

Is there something speaking against using github runners?

@mcollina
Member

Anything running on VMs has too much interference, and the standard deviation between runs is too high to measure bytecode-level optimizations, e.g. microbenchmarks.
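To illustrate why run-to-run noise matters, here is a minimal sketch, not the project's actual benchmark tooling. It computes Welch's t statistic for two sets of samples. The ops/sec numbers are invented: both pairs have the same 1% difference in means, but only the low-noise pair gives a statistic large enough to distinguish from noise.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / std_err


# Hypothetical ops/sec samples: same 1% gap in means, different noise levels.
quiet_base = [1000, 1001, 999, 1000, 1002, 998]
quiet_pr = [1010, 1011, 1009, 1010, 1012, 1008]
noisy_base = [1000, 1060, 940, 1020, 980, 1000]
noisy_pr = [1010, 1070, 950, 1030, 990, 1010]

print("low noise :", welch_t(quiet_base, quiet_pr))   # large statistic
print("high noise:", welch_t(noisy_base, noisy_pr))   # statistic below 1
```

With the noisy samples, many more runs would be needed before the same difference became detectable.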

@ryanaslett
Contributor

Is there something speaking against using github runners?

Anything running on VMs has too much interference, and the standard deviation between runs is too high to measure bytecode-level optimizations, e.g. microbenchmarks.

GitHub Actions supports self-hosted runners, even bare metal ones, but converting the Jenkins CI infrastructure to a GitHub Actions infrastructure is an ambitious undertaking that would be unlikely to succeed in the timeframe of this immediate need.

#2247 seems like a good place to continue discussing whether or not that's an eventual or possible outcome.

@targos
Member

targos commented Jan 23, 2024

2 Windows on ARM machines

I agree it seems we don't need them anymore.

The OSX machines

Currently almost unused because the current version of Node.js doesn't support macOS 10.15. These machines could be updated to macOS 12, 13, or even 14.
I'm not sure we have enough capacity to replace them at MacStadium (we already struggle with disk space), but I would be happier if we found other providers to donate resources (for example, Scaleway have bare metal M1 and M2 Pro Mac Minis).

The Intel benchmarking machines

The systems we have now are based on dual-CPU Intel Xeon E5-2699 v4.
They each have a total of 88 logical cores and 64 GB of RAM.
I'm also familiar with Hetzner machines at work and maybe we should try to ask them for sponsorship. They have machines with up to 64 cores, and datacenters in different countries.

@sxa
Member

sxa commented Jan 23, 2024

I personally use a Hetzner machine for similar purposes (it's significantly worse/slower, but we need the consistency of results, not the actual speed).

Thanks for confirming - that was one of the questions I had when discussing this with some members of build yesterday - I figured that was likely the case. Obviously the implication is that we won't be able to compare "old" runs with "new" runs without re-running them, but that shouldn't be too much of a problem (we can always re-run if required).

Does the performance team require two systems or would one be adequate for the capacity needs?

Would the goal in pursuing MacInCloud or another provider (i.e. sponsored https://aws.amazon.com/pm/ec2-mac/ ) be mostly for redundancy and resiliency against provider outages?

I believe that's the primary driver, yes. AWS should also be a viable option if they were willing to sponsor us.

@ryanaslett
Contributor

Regarding OSX testing:

Based on everything I've been able to glean from issues and meeting notes, it seems like a good path forward would be to lean into what we're doing with MacStadium for the short term, with an eye on having a secondary provider longer term.

  • What is the current status of our relationship with MacStadium?
  • How happy are we with the performance/reliability of their service?
  • Is it possible to leverage them entirely for the short term, are there any blockers?
  • Are there things on our backlog to do with them for OSX testing? (Things like leveraging their ephemeral instances, upgrading to Orka 3.0, maybe using their Jenkins plugin? I see a lot of recurring OSX disk space / node health issues, which might be alleviated by using ephemeral instances, provided those can spin up and be ready for testing fast enough for the build pipeline.)
  • If we, LFIT, wanted access to the MacStadium account to audit instance sizes and requirements, what would be the process for that? (So we have a gauge on what to ask from a future provider.)

Regarding Benchmark testing:

  • How time sensitive are benchmark tests? Are they blockers to an immediate commit? Or are they more like steps in a long release process?
  • How frequently are they used/needed?
  • One possibility is something like dedicated tenancy (metal) ec2 spot instances for the jenkins nodes to run those tests. They could spin up and down on demand, we could target a large and powerful enough size, but only have to pay for/use credits, for when we're actually running the tests, and we could work with AWS to find the sweet spot of "very available" and "performant".

@mcollina
Member

How time sensitive are benchmark tests?

In order to land any performance related PR, we run the benchmarks.
Usually benchmarks run on dev machines are not effective.
I don't have actual stats, but I'd say we run them weekly.

Some of those jobs last 6-8 hours, and in the most extreme cases days.

are they blockers to an immediate commit?

The lack of benchmarking machines would slow down progress on most things performance related.

or are they more like steps in a long release process?

They are not part of our release process.


How frequently are they used/needed?

I'd guess a few times per week.

One possibility is something like dedicated tenancy (metal) ec2 spot instances for the jenkins nodes to run those tests. They could spin up and down on demand, we could target a large and powerful enough size, but only have to pay for/use credits, for when we're actually running the tests, and we could work with AWS to find the sweet spot of "very available" and "performant".

One of the key strategies we employ is to rely on previous runs to compare.

I'd not really trust this setup, because the actual machine would change every time.

On top of that, the AWS spot instance cost for c5.metal (which seems a good choice in terms of resources) is likely 3x (or more) that of a provider like Hetzner.

@richardlau
Member

richardlau commented Jan 24, 2024

One of the key strategies we employ is to rely on previous runs to compare.

Do we? I thought each Benchmark CI job that is run goes through the requested benchmark(s) twice -- once with the base branch (i.e. what is being compared to) and once with the PR being tested.

benchmark-node-micro-benchmarks runs this shell script.
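The idea of comparing base and PR within a single job, on the same machine, can be sketched as follows. This is an illustration only and not the shell script referenced above: the function name, the stand-in callables, and the choice to alternate the two builds across several rounds are all assumptions made for the example.

```python
import statistics
from typing import Callable, List


def compare_on_same_machine(run_base: Callable[[], float],
                            run_pr: Callable[[], float],
                            rounds: int = 5) -> float:
    """Return the PR's mean throughput change relative to base, in percent."""
    base_samples: List[float] = []
    pr_samples: List[float] = []
    for _ in range(rounds):
        # Alternate the two builds so slow drift on the host affects both.
        base_samples.append(run_base())
        pr_samples.append(run_pr())
    base_mean = statistics.mean(base_samples)
    pr_mean = statistics.mean(pr_samples)
    return (pr_mean - base_mean) / base_mean * 100.0


# Stand-ins for "run the benchmark with this binary and return ops/sec".
def fake_base() -> float:
    return 2000.0


def fake_pr() -> float:
    return 2100.0


print(compare_on_same_machine(fake_base, fake_pr))
```

Because both builds are measured in the same job, the result does not depend on numbers recorded earlier on different hardware.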

@mhdawson
Member Author

Would the goal in pursuing MacInCloud or another provider (i.e. sponsored https://aws.amazon.com/pm/ec2-mac/ ) be mostly for redundancy and resiliency against provider outages?

Our goal across platforms has been to have at least two providers for any platform. So while we might be able to use 1 for a short period of time, the plan should be to find a second provider if at all possible.

@mhdawson
Member Author

mhdawson commented Jan 24, 2024

What is the current status of our relationship with MacStadium?
How happy are we with the performance/reliability of their service?

I'd say the relationship is good and we are happy with the machines they have provided. I believe most of the common issues we have relate to OSX itself versus the host. Many thanks to MacStadium for their continued support.

@mcollina
Member

Do we? I thought each Benchmark CI job that is run goes through the requested benchmark(s) twice -- once with the base branch (i.e. what is being compared to) and once with the PR being tested.

Yes. We typically run the benchmark across different commits as a PR evolves. I'm not convinced that those results would be comparable across different HW.

@Uzlopak

Uzlopak commented Jan 25, 2024

Are only benchmarks run on those machines?

@kgantchev

kgantchev commented Jan 25, 2024

I'm not sure we have enough capacity to replace them at Macstadium (we already struggle with disk space), but I would be more happy if we find other providers to donate resources (for example, Scaleway have bare metal M1 and M2 Pro mac minis).

Pardon me if I'm intruding here, but if there is a need for M1 or M2 runners for GitHub Actions, may I suggest giving FlyCI a try? We offer macOS M1 and M2 runners (ARM64). For public repos, we offer 500 mins/month of free M1 usage (4 vCPUs, 7 GB RAM, 28 GB storage).

The setup is super easy:

  1. Install the FlyCI GitHub app.
  2. Give the FlyCI app permissions to this repo.
  3. Change your runs-on flag whenever you implement the ARM64 MacOS workflow:
 jobs:
   ci:
-    runs-on: macos-latest
+    runs-on: flyci-macos-large-latest-m1
     steps:
       - name: 👀 Checkout repo
         uses: actions/checkout@v4

Do you think this might be a good option for nodejs / build?

Web: flyci.net

Update:

I apologize, I just realized you guys are using Jenkins, not GitHub Actions. Please ignore my comments above!

@mhdawson
Member Author

Issues with proposal from Linux IT for how to move forward on replacing NearForm OSX machines - #3638

@UlisesGascon FYI

@ryanaslett
Contributor

Update:

@efrisby we've selected some machines at Hetzner to act as a replacement for the benchmark machines, but are waiting on an internal fiscal process to complete so we can purchase them and get them set up, based on the machine sizing from @mcollina and @mhdawson.

I've reached out to @jasnell to see if he still had a contact at ARM so we can ask what to do with their Surface Pro machines.

I'm still trying to get access to MacStadium to assess what our options with them are.

One thing we have not resolved is that once the rack is decommissioned, what should happen to that hardware?

@mhdawson
Member Author

In terms of the hardware, I think we should try to see if there are any Node.js collaborators who are interested and could pick it up locally.

@mcollina
Member

In terms of the hardware, I think we should try to see if there are any Node.js collaborators who are interested and could pick it up locally.

I'm not sure we have many collaborators in Ireland, but definitely a few past collaborators and wide community members: @GlenTiki @No9 wdyt?

There might also be the option for them to go to a local charity.

@GlenTiki

If Nearform are decommissioning the hardware and no longer have use for it themselves, they may know of other good local philanthropic uses, so I’d trust @efrisby to take the lead there before me.

However, my thoughts…

I'm not sure we have many collaborators in Ireland, but definitely a few past collaborators and wide community members: @GlenTiki @No9 wdyt?

I would like to see the hardware go to someone that needs it - students, etc. 💯 No tech charities immediately come to mind except for coderdojo. I don’t know how active that is here in Waterford (where both NearForm and I are based).

I am a member of the organiser team of the local monthly Waterford tech meetup, and I would love to offer something like the Mac minis as a raffle prize - lots of students, researchers, college staff and local devs attend, so I could see them put to good use that way, if no other local collaborators need them or suggest anything.

@No9
Member

No9 commented Mar 20, 2024

@GlenTiki agree not many charities take desktop type machines now - mainly laptops as they are easier to manage for everyone in the charities and the end users.
I'm happy to see it go to a local in the Déisean.
That reminds me I owe you a talk 😄

@efrisby

efrisby commented Mar 22, 2024

Hi all,

If the equipment needs to be packaged up and delivered to a location, that is no problem at all. Nearform works with a charity that takes unused computers. The charity is based in the UK / Ireland and transports machines to schools in Vietnam. We have sent Mac Minis in the past, so I don't think there would be any issues providing them. We have also provided machines to Coderdojo in the past, so there is no reason not to reach out again.

The only ones that we may need to return are the Intel Xeon servers. If anyone has any suggestions, or if someone is in a position to host these, we can arrange to ship them.

Thanks,
Eamonn

@richardlau
Member

I've marked all of the Nearform hosted machines in ci.nodejs.org and ci-release.nodejs.org as offline so no new jobs will be scheduled on to them.

@efrisby We're no longer using the Nearform hosted machines in the Node.js CI. Thanks once again to Nearform for hosting these machines for us for all of these years!

@efrisby

efrisby commented Apr 8, 2024

@richardlau @mhdawson thanks for the kind words. I will pass that on to the team here.

Regarding the machines hosted here, are we now in a situation where we can turn all of these off?

2 x Intel Servers
4 x Mac Mini machines
2 x Surface Pro machines.

If you can confirm, I will power these down tomorrow and disconnect them. If anyone has suggestions for what to do with them, please get in contact with me to arrange it. If you wish to donate them to charity we can look at those options, or if you wish to send them to someone just let me know. We can also wipe and run hard drive cleans on them before any donations are made. The two Intel servers, however, might need to be sent back; alternatively, we work with a recycling company, https://vyta.com/, that can collect, recycle, and reuse old computer hardware. Wiping is also done to a certified standard.

Thanks again to everyone that helped the Nearform team here also with the support of these devices and to the community for the effort put in to move this whole area to a new solution, especially within the time we had.

All the best, Eamonn

@mhdawson
Member Author

mhdawson commented Apr 8, 2024

In terms of the Intel and Surface Pro machines, I don't think the Build WG was involved in the arrangement with those who provided them. Maybe @jasnell or @mcollina knows if there was any agreement on what to do with them when the project no longer needed them.

@richardlau
Member

@efrisby Yes, those listed machines can be disconnected and powered down.

@mhdawson
Member Author

mhdawson commented Apr 8, 2024

As far as the Mac Minis, the Foundation paid for those so I think we should be ok with whatever @GlenTiki and @No9 agree on.

@jasnell
Member

jasnell commented Apr 8, 2024

The Surface Pro machines were on loan from ARM to NearForm. Y'all would need to contact NearForm about those, as I have no details on what is happening with them.

@mhdawson
Member Author

mhdawson commented Apr 9, 2024

@efrisby sounds like figuring out what to do with the Surface Pro machines is back to you, based on the comment from @jasnell

@ryanaslett
Contributor

@efrisby - @jasnell had provided me with some contacts at ARM, and I emailed them on March 18th, but I did not get a reply from any of them.

@jasnell
Member

jasnell commented Apr 9, 2024

I'd maybe give it until the end of April to wait for a response from ARM, then maybe ping them again. If there's still no answer after that, then I'd suggest shipping the surface pros to either myself or @mcollina for storage (because we were both around when the agreement to lend the machines was made). I'll keep trying with Arm and if they don't respond, I'll donate the devices to a charity.

@mcollina
Member

@jasnell I think it might be better to ship those to me because of import duties.

@efrisby

efrisby commented Apr 10, 2024

@jasnell @mcollina thanks all, just let me know what you decide and drop me an email with your delivery address and a contact number and I can ship them to you. I will drop you both an email so you have my details. Thanks again.

@mhdawson
Member Author

@jasnell, @mcollina sounds like there is a plan for the surface pros, but what about the benchmark machines? As mentioned I don't think the Build WG was in the loop when they went into Nearform so I don't know what the agreement was in terms of end of life.

@jasnell
Member

jasnell commented Apr 23, 2024

It's been so long that I can't remember the details and everything about the benchmark machine was in my old NearForm email inbox that I no longer have access to. I know the machine was on loan only but that's all I remember.

@mhdawson
Member Author

@bensternthal, @ryanaslett maybe you can help out here in terms of the Intel machines. There seems to be no retained context in terms of the loan from Intel, as the people from Nearform who worked with Intel to bring the machines in are no longer at Nearform and there was nobody from the Build WG who was involved in setting up the loan. I don't think Intel is a Foundation member anymore, so I don't know who to reach out to.

Could you two handle figuring out what to do with the machines?

@mcollina
Member

I don't think they are on loan, more of a donation. I don't have access to those emails anymore.

@bensternthal

@mhdawson based on reading this thread I would say the intel machines can be donated.

@mhdawson
Member Author

@mhdawson based on reading this thread I would say the intel machines can be donated.

@efrisby if that is good enough for you then that's good enough for me. Is it possible for you to donate or dispose of them?

@efrisby

efrisby commented Apr 25, 2024

@mhdawson @mcollina @jasnell I will take out the Intel servers and see what we can do. I will contact a company we use that recycles old equipment, as it is harder to find a home for donated servers like these. If anyone has suggestions of anyone local in Ireland, let me know, as shipping outside of Ireland will be difficult due to the size and weight. Thanks

@GlenTiki

I'd take server hardware if it's going to just be stripped for parts - could use it in a home lab.

@efrisby

efrisby commented May 2, 2024

@GlenTiki If you can make it to Tramore tomorrow, I should be onsite; otherwise we can arrange another day. I have a support person calling to help with taking out some equipment, so we can hopefully take these out. It might be late in the day before we are ready, perhaps around 4.30. Thanks.

@GlenTiki

GlenTiki commented May 3, 2024

@efrisby I'm away for a bit and won't be around until after the 11th May - gimme a time that suits after that and I'll be out :)

@efrisby

efrisby commented May 3, 2024

@GlenTiki do you still have my email, the Nearform one? Just drop me an email when you're back and we can arrange a time that suits us. Catch up with you soon :)
