Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Killing Emacs while straight.el is cloning and buildng packages can leave things in a bad state? #1134

Open
ntc2 opened this issue Nov 29, 2023 · 4 comments
Labels

Comments

@ntc2
Copy link

ntc2 commented Nov 29, 2023

What's wrong

I'm in the process of porting my whole config to use-package+straight.el, with a focus on reproducibility (I had gotten into some weird state using use-package+package.el before, and want to avoid that). My test is to reinstall all the packages in my config in a sandbox, and this takes a long time. At one point I killed Emacs (Ctrl-C in the terminal I started Emacs from) while straight.el was installing packages, and I think it left things in a bad state: after that I was getting errors about "couldn't load package " (or similar, I don't remember the exact wording), and was eventually able to solve the problem by running M-x straight-pull-package on the failing package and confirming that I wanted to clean up its repo which was in a "dirty" state.

Directions to reproduce

Unfortunately, I don't have an easy way to reproduce this, as it must depend on exactly when you press Ctrl-C. So, this is not a very good bug report, and feel free to close it. In fact, I'm not even 100% sure that me killing Emacs with Ctrl-C was the source of the problem. However, my hope is that the maintainers might know if such a possibility of ending up in a bad state exists, and if so add a fix that prevents it, since it seems useful to be able to abort and restart a long install.

Searching through the existing issues, I found an issue where a user got confused that their package didn't get updated automatically when they changed the recipe, and the maintainers explained that this was because there's a simple check to see if the package's repo is already checked out, in which case nothing is done automatically (manually running something like straight-pull-package is needed, as in my case, but my case was more confusing, since I hadn't changed any recipe): #166 (comment) .

So, I'm wondering if this same check is to blame here: the repo exists, but is not fully cloned, but straight.el is assuming it was fully cloned. The fix could be to wrap calls to git pull (or whatever) in some kind of temporary "mark repo as dirty" state. E.g. something like (pseudo code):

touch <repo>.is-dirty
# If Emacs gets killed while `git pull` is running, 
# the `rm <repo>.is-dirty` will never happen.
git pull <repo>
rm <repo>.is-dirty

and then modify the startup check for the existence of <repo> to also check for the existence of <repo>.is-dirty file, and if so act as tho <repo> didn't exist.

Version information

  • Emacs version: 28.1
  • Operating system: Ubuntu 22.04
@ntc2 ntc2 added the bug label Nov 29, 2023
@progfolio
Copy link
Contributor

progfolio commented Nov 29, 2023

I think what you ran into is the result of the way straight-vc-git--clone-internal handles shallow clones:

straight.el/straight.el

Lines 2625 to 2630 in b3760f5

;; Do a shallow clone.
(if commit
(let ((straight--default-directory nil)
(default-directory repo-dir))
(make-directory repo-dir)
(straight--process-run "git" "init")

The repository is written to disk by Emacs, then initialized by git.
Killing Emacs during that period could leave the directory on disk.

When Emacs is killed via SIGINT, its subprocesses are sent SIGHUP (which Git should handle gracefully on its own. In this case, the directory would not be present):

https://www.gnu.org/software/emacs/manual/html_node/elisp/Signals-to-Processes.html

Emacs also sends signals automatically at certain times: killing a buffer sends a SIGHUP signal to all its associated processes; killing Emacs sends a SIGHUP signal to all remaining processes. (SIGHUP is a signal that usually indicates that the user “hung up the phone”, i.e., disconnected.)

@raxod502:

The signals we're interested in can be handled during kill-emacs-hook. A possible solution would be to temporarily add a handler to the hook which removes the repository directory.
This would happen only when shallow cloning, so no startup cost is incurred.

@raxod502
Copy link
Member

raxod502 commented Dec 2, 2023

Reasonable enough. Of course, it is impossible to make any atomic guarantee about cloning since a system may halt at any time. It may be better to clone to a temporary directory (say, the repository directly plus .tmp), and then rename it, which is atomic on most filesystems.

@ntc2
Copy link
Author

ntc2 commented Dec 2, 2023

I was able to verify that at least git clone handles SIGHUP gracefully, deleting the clone if it was incomplete, but I think @raxod502 idea of doing something like cloning to a temp dir and then moving is better than relying on how Git handles SIGHUP, and as mentioned is much more robust since it handles any possible way of killing Emacs.

My simple <repo>.is-dirty file solution was also motivated by these concerns: it's both simpler and more robust than relying on how Git handles signals, and in fact it should work uniformly for any backend (Git or otherwise), as long you can identify what directory the backend created the local package in: when the <repo>.is-dirty file is present on startup, you just delete the local package dir and try running the recipe again, indep of what backend is responsible for creating the local package. And it's trivially atomic on all systems.

Also, it's easy to increase the scope of the atomicity by just moving the rm <repo>.is-dirty part later: it can come after the package is built, so that you also handle the case where Emacs is killed while the package is being built, assuming their is some kind of build process for some packages (I actually have no idea how Emacs packages work in this respect; I don't have any intentional "byte compilation" in my config, just .el files that get reinterpreted on each startup AFAIK).

(EDIT: occurs to me that the "do everything in a temp dir and then move it" solution of @raxod502 probably also handles an increased scope of atomicity, by just doing the mv later, but then you need to be sure that there is no build process that computes and saves any paths based on the temp dir location, since those paths will be wrong after the temp dir gets moved.)

@raxod502
Copy link
Member

I don't think there need be any concern about a build system hardcoding things, since we'll only need this renaming solution to handle cloning the repository. It is possible for a build command to get interrupted partway through, but it will be re-run the next time the package is loaded, because the fact of the successful build will not be written to the build cache of straight.el. The reason this does not happen for the clone step is that the way straight.el determines completion of that step on a subsequent load is by checking if the directory exists, which will break if you kill the system halfway through (but won't break in that case if we clone-then-rename).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants