Fix unsafe pickle deserialization and CI workflow unauthorized execution#5236
Open
liu-shaojun wants to merge 2 commits into main from
Replace raw `cloudpickle.load()` with HMAC-verified `SafePickle` across all four distributed training backends (MPI, Horovod, PyTorch DDP, TensorFlow). The vulnerability allowed arbitrary code execution via crafted pickle files passed through CLI-controlled paths. Worker scripts deserialized these files without integrity verification, enabling cluster-wide RCE in distributed deployments.

Changes:
- Upgrade `SafePickle` to use per-session random HMAC-SHA256 keys instead of the hardcoded `b'shared-key'` with SHA1
- `SafePickle.dump()` now auto-generates `.sig` sidecar files for integrity
- `SafePickle.load()` mandates signature verification, rejecting unsigned or tampered pickle files
- Use timing-safe `hmac.compare_digest()` to prevent timing attacks
- Propagate the HMAC key to MPI workers via `mpiexec -genv` and to subprocesses via the inherited environment
- SCP `.sig` files alongside `.pkl` files for remote MPI nodes

Affected files:
- mpi_train.py, mpi_estimator.py, mpi_runner.py (MPI training)
- horovod_worker.py, multiprocs_backend.py (Horovod)
- worker.py, ddp_subprocess.py (PyTorch DDP)
- subprocess_worker.py, backend.py (TensorFlow)
- safepickle.py in nano, ppml, and friesian packages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
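The dump/load scheme described in this commit message can be sketched roughly as follows. This is a minimal illustration, not BigDL's actual `SafePickle` implementation: the function names and file handling here are assumptions; only the `BIGDL_SAFE_PICKLE_KEY` variable name, the `.sig` sidecar convention, and the HMAC-SHA256 + `compare_digest` choices come from the PR itself.

```python
import hashlib
import hmac
import os
import pickle

KEY_ENV = "BIGDL_SAFE_PICKLE_KEY"  # env var name from the PR description


def _get_key() -> bytes:
    # Per-session random 256-bit key; stored hex-encoded in the environment
    # so child processes inherit it automatically.
    key_hex = os.environ.get(KEY_ENV)
    if key_hex is None:
        key_hex = os.urandom(32).hex()
        os.environ[KEY_ENV] = key_hex
    return bytes.fromhex(key_hex)


def safe_dump(obj, path: str) -> None:
    # Serialize, then write an HMAC-SHA256 digest to a .sig sidecar file.
    data = pickle.dumps(obj)
    sig = hmac.new(_get_key(), data, hashlib.sha256).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sig", "w") as f:
        f.write(sig)


def safe_load(path: str):
    # Refuse to unpickle unless a sidecar signature exists and matches.
    with open(path, "rb") as f:
        data = f.read()
    try:
        with open(path + ".sig") as f:
            expected = f.read().strip()
    except FileNotFoundError:
        raise ValueError("missing .sig sidecar: refusing to unpickle")
    actual = hmac.new(_get_key(), data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(actual, expected):
        raise ValueError("HMAC mismatch: pickle file may have been tampered with")
    return pickle.loads(data)
```

Note that this only guarantees integrity against parties who lack the key; the pickle itself is still executed on load, so the key must never reach untrusted hands.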
…unner

The performance-regression-test workflow had two vulnerabilities that allowed any GitHub user to execute arbitrary code on Intel's self-hosted CI runner (Rohan):

1. The `issue_comment` trigger had no `author_association` check, so anyone could comment 'APRT' on any PR to trigger benchmark execution of fork PR code on the self-hosted runner.
2. The `pull_request` trigger caused the workflow to run the PR author's version of the workflow file on the self-hosted runner, giving full control over the execution environment.

Fix:
- Remove the `pull_request` trigger entirely; it allowed fork PRs to override the workflow definition itself
- Add an `author_association` guard requiring MEMBER, OWNER, or COLLABORATOR status to trigger via `issue_comment`
- Simplify the job-level condition to only handle the comment-based flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
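A minimal sketch of the hardened trigger described above, assuming standard GitHub Actions syntax; the job name, steps, and exact expression in the real `performance-regression-test.yml` may differ:

```yaml
# Illustrative sketch only, not the actual workflow file.
on:
  issue_comment:
    types: [created]

jobs:
  benchmark:
    # Run only for 'APRT' comments on PRs from trusted authors.
    if: >-
      github.event.issue.pull_request &&
      github.event.comment.body == 'APRT' &&
      contains(fromJSON('["MEMBER","OWNER","COLLABORATOR"]'),
               github.event.comment.author_association)
    runs-on: [self-hosted]
    steps:
      - uses: actions/checkout@v4
```

Because `issue_comment` workflows always run the workflow definition from the default branch, fork PRs can no longer substitute their own copy of the file.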
Description
Security fixes for two vulnerabilities in BigDL: unsafe pickle deserialization across distributed training workers, and unauthorized code execution via CI workflow on self-hosted runner.
1. Why the change?
Vulnerability 1 — Unsafe Pickle Deserialization (CVE-worthy)
All four distributed training backends (MPI, Horovod, PyTorch DDP, TensorFlow) accept CLI-controlled file paths and immediately deserialize pickle files using `cloudpickle.load()` without any integrity verification. In distributed deployments, a single malicious pickle payload achieves cluster-wide arbitrary code execution.

Affected worker scripts:
- python/orca/src/bigdl/orca/learn/mpi/mpi_train.py
- python/nano/src/bigdl/nano/deps/horovod/horovod_worker.py
- python/nano/src/bigdl/nano/pytorch/strategies/worker.py
- python/nano/src/bigdl/nano/utils/tf/subprocess_worker.py

Vulnerability 2 — CI Workflow Unauthorized Code Execution
The `performance-regression-test.yml` workflow allows any GitHub user to trigger code execution on the self-hosted runner (Rohan) by commenting `APRT` on any PR. No `author_association` check exists. Additionally, the `pull_request` trigger lets fork PRs override the workflow definition itself.

2. User API changes
No user-facing API changes. `SafePickle.dump()` and `SafePickle.load()` maintain the same interface but now always perform HMAC integrity verification. The HMAC key is auto-generated per session and propagated to subprocesses via the `BIGDL_SAFE_PICKLE_KEY` environment variable.

3. Summary of the change
Pickle deserialization fix:
- Replace `cloudpickle.load()`/`cloudpickle.dump()` calls in worker scripts and their callers with `SafePickle.load()`/`SafePickle.dump()`
- Upgrade `SafePickle` (in nano, ppml, friesian packages) from the hardcoded `b'shared-key'` + SHA1 to a per-session random 256-bit key + HMAC-SHA256
- `dump()` now auto-generates `.sig` sidecar files containing the HMAC digest
- `load()` mandates signature verification and rejects unsigned or tampered pickle files
- Use timing-safe `hmac.compare_digest()` to prevent timing attacks
- Propagate the HMAC key to MPI workers via `mpiexec -genv` and to subprocesses via the inherited environment
- SCP `.sig` files alongside `.pkl` files for remote MPI nodes

CI workflow fix:
- Remove the `pull_request` trigger that allowed fork PRs to override the workflow definition on the self-hosted runner
- Add an `author_association` guard (MEMBER, OWNER, or COLLABORATOR) to the `issue_comment` trigger

4. How to test?
- Tampered or unsigned pickle files: `SafePickle.load()` will raise an error
- CI workflow: the job-level `if` condition now requires `author_association` membership

5. New dependencies
No new dependencies introduced. All changes use Python standard library modules (`hmac`, `hashlib`, `os`) that were already imported.
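As a concrete illustration of the timing-safe comparison these stdlib modules provide (plain Python, independent of BigDL's code; key and payload values here are made up):

```python
import hashlib
import hmac

key = b"per-session-key"   # illustrative key, not BigDL's
payload = b"pickled bytes"

good_sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
forged_sig = hmac.new(key, b"tampered bytes", hashlib.sha256).hexdigest()

# compare_digest takes time independent of where the inputs first differ,
# so an attacker cannot recover a valid signature byte-by-byte via timing,
# which a naive `==` comparison on strings could allow.
assert hmac.compare_digest(good_sig, good_sig)
assert not hmac.compare_digest(good_sig, forged_sig)
```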