nix: ci: fit into the new limits #6346

Open
SomeoneSerge opened this issue Mar 27, 2024 · 3 comments
Labels
nix Issues specific to consuming flake.nix, or generally concerned with ❄ Nix-based llama.cpp deployment stale

Comments

@SomeoneSerge
Collaborator

SomeoneSerge commented Mar 27, 2024

Most (all) of the nix-build jobs are being cancelled in progress since the quotas have changed. Adjust the workflows to fit in the new limits.

Context: since #6243 the CI jobs are grouped by refs and cancelled together. The existing "Nix CI" job wasn't prepared for this, for several reasons:

  • It builds many variants of llama.cpp in a single job.
  • It only pushes the results to cachix after all of the builds have finished (not sure whether the push still happens in the "destructor" step after a cancellation).
  • PRs from forks don't have access to the repo secrets, so they don't push to cachix. However, it's plausible that these make up the majority of all jobs?
  • We're running pure nix-builds, meaning we can only cache store paths (results of complete and successful builds), not e.g. intermediate object files. This provides a strong guarantee that a passing CI means the build can be reproduced locally, but it also limits how much we can reuse between CI jobs.

References:

CC @philiptaron @Green-Sky

Potential solutions

  • Make onPush builds (.#checks) less pure
    • ccacheStdenv
    • check-pointing
    • Run pure builds onSchedule instead
  • More granular jobs: generate individual github jobs for individual attributes
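A minimal sketch of the "more granular jobs" option, assuming a build matrix with one GitHub job per flake attribute. The attribute names, action versions, cache name, and secret name below are assumptions for illustration and are not taken from the repository's workflows.

```yaml
# Sketch only: attribute names, action versions, cache name, and secret name
# are assumptions, not the repository's actual configuration.
name: Nix CI (per-attribute sketch)

on:
  pull_request:
  push:
    branches: [master]

jobs:
  nix-build:
    runs-on: ubuntu-latest
    strategy:
      # Let the remaining variants finish even if one of them fails.
      fail-fast: false
      matrix:
        # Hypothetical flake output names; replace with the real attributes.
        attr: [default, vulkan, rocm]
    steps:
      - uses: actions/checkout@v4
      - uses: cachix/install-nix-action@v27
      - uses: cachix/cachix-action@v14
        with:
          name: llama-cpp  # assumed cache name
          # Repo secrets are not available to PRs from forks.
          authToken: ${{ secrets.CACHIX_AUTH_TOKEN }}
      - name: Build a single attribute
        run: >
          nix --extra-experimental-features 'nix-command flakes'
          build '.#${{ matrix.attr }}' --print-build-logs
```

With one job per attribute, each variant is built and cached independently, so a cancellation or failure in one job should not discard the work already completed by the others.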

Questions

  • How effective is the caching right now?
    • PRs from forks aren't allowed to push to cachix
@SomeoneSerge SomeoneSerge added the nix Issues specific to consuming flake.nix, or generally concerned with ❄ Nix-based llama.cpp deployment label Mar 27, 2024
SomeoneSerge added a commit to SomeoneSerge/llama.cpp that referenced this issue Mar 27, 2024
SomeoneSerge added a commit that referenced this issue Mar 27, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024
@SomeoneSerge
Collaborator Author

SomeoneSerge commented Apr 1, 2024

Hi @mscheong01! Please advise: could we afford more lax limits for onSchedule jobs, and on the less frequent occasions when we change the infra (e.g. cmakelists or nix expressions)?

  • It would be nice to build the windows, cuda, and rocm stuff e.g. when updating the lock file (currently once a week): nix: update flake.lock #6402
  • The aarch64/qemu job is onSchedule too:
    schedule:
      # Rebuild daily rather than on every push because QEMU is expensive (e.g.
      # 1.5h instead of minutes with the cold cache).
      #
      # randint(0, 59), randint(0, 23)
      - cron: '26 12 * * *'
    # But also rebuild if we touched any of the Nix expressions:
    push:
      branches:
        - master
      paths: ['**/*.nix', 'flake.lock']
    pull_request:
      types: [opened, synchronize, reopened]
      paths: ['**/*.nix', 'flake.lock']

Even if we implement e.g. ccache, it would be nice to run the pure builds once in a while.

@mscheong01
Collaborator

> could we afford more lax limits for onSchedule jobs

This could be done by checking the github.event_name context and assigning a unique group to the workflow if the value is 'schedule'.
More about contexts: https://docs.github.com/ko/actions/learn-github-actions/contexts#github-context
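As a rough sketch of that suggestion (the exact group expression is an assumption, not the workflow's current setting), the concurrency group can fall back to the unique `github.run_id` for scheduled runs, so that no other run shares their group:

```yaml
# Sketch only: the group naming scheme is an assumption.
concurrency:
  # Scheduled runs get a per-run group, so no other run shares it and
  # nothing cancels them. All other events are still grouped by ref.
  group: ${{ github.workflow }}-${{ github.event_name == 'schedule' && github.run_id || github.ref }}
  cancel-in-progress: true
```

The `a && b || c` form is the usual way to write a conditional in GitHub Actions expressions; it works here because `github.run_id` is never empty.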

> and on the less frequent occasions when we change the infra (e.g. cmakelists or nix expressions)?

I can't think of a straightforward solution that wouldn't unnecessarily complicate the workflow settings 🤔. One way would be to add a manual trigger that doesn't get cancelled by other runs, but I don't think it's the best option.
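For reference, the manual-trigger option could look roughly like the sketch below. The trigger list and the group expression are assumptions for illustration: manually dispatched runs get their own group keyed on `github.run_id`, while other events stay grouped by ref.

```yaml
# Sketch only: triggers and group naming are assumptions.
on:
  workflow_dispatch:  # manual runs started from the Actions tab
  push:
    branches: [master]

concurrency:
  # Manual runs never share a group, so other runs cannot cancel them.
  group: ${{ github.workflow }}-${{ github.event_name == 'workflow_dispatch' && github.run_id || github.ref }}
  cancel-in-progress: true
```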

@slaren
Collaborator

slaren commented Apr 2, 2024

HIP is probably the most important build for nix, since it is not tested in any other workflow, as far as I know. Most of the other nix builds are redundant.

hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 3, 2024
tybalex pushed a commit to tybalex/function.cpp that referenced this issue Apr 17, 2024
@github-actions github-actions bot added the stale label May 3, 2024
3 participants