Fix unsafe pickle deserialization and CI workflow unauthorized execution #5236

Open

liu-shaojun wants to merge 2 commits into main from fix/unsafe-pickle-deserialization

Conversation

@liu-shaojun
Contributor

liu-shaojun commented Apr 20, 2026

Description

Security fixes for two vulnerabilities in BigDL: unsafe pickle deserialization across distributed training workers, and unauthorized code execution via CI workflow on self-hosted runner.

1. Why the change?

Vulnerability 1 — Unsafe Pickle Deserialization (CVE-worthy)

All four distributed training backends (MPI, Horovod, PyTorch DDP, TensorFlow) accept CLI-controlled file paths and immediately deserialize pickle files using cloudpickle.load() without any integrity verification. In distributed deployments, a single malicious pickle payload achieves cluster-wide arbitrary code execution.

Affected worker scripts:

  • python/orca/src/bigdl/orca/learn/mpi/mpi_train.py
  • python/nano/src/bigdl/nano/deps/horovod/horovod_worker.py
  • python/nano/src/bigdl/nano/pytorch/strategies/worker.py
  • python/nano/src/bigdl/nano/utils/tf/subprocess_worker.py

Vulnerability 2 — CI Workflow Unauthorized Code Execution

The performance-regression-test.yml workflow allows any GitHub user to trigger code execution on the self-hosted runner (Rohan) by commenting APRT on any PR. No author_association check exists. Additionally, the pull_request trigger lets fork PRs override the workflow definition itself.

2. User API changes

No user-facing API changes. SafePickle.dump() and SafePickle.load() keep the same interface but now always perform HMAC integrity verification. The HMAC key is auto-generated per session and propagated to subprocesses via the BIGDL_SAFE_PICKLE_KEY environment variable.
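The propagation mechanism can be sketched in a few lines of stdlib Python. This is an illustration of the described scheme, not the PR's actual implementation; `ensure_session_key` is a hypothetical helper name, while `BIGDL_SAFE_PICKLE_KEY` is the variable named in this PR:

```python
# Sketch: generate a per-session key once, export it via the environment,
# and let child processes (spawned with the inherited environment) reuse it.
import os
import secrets
import subprocess
import sys

KEY_ENV = "BIGDL_SAFE_PICKLE_KEY"  # variable name from this PR

def ensure_session_key() -> bytes:
    """Return the session HMAC key, generating a 256-bit one on first use."""
    hex_key = os.environ.get(KEY_ENV)
    if hex_key is None:
        hex_key = secrets.token_hex(32)  # 32 random bytes, hex-encoded
        os.environ[KEY_ENV] = hex_key    # inherited by subprocesses
    return bytes.fromhex(hex_key)

# A subprocess launched with the default (inherited) environment sees the
# same key, so it can verify pickles signed by the parent process.
ensure_session_key()
child_key = subprocess.run(
    [sys.executable, "-c", f"import os; print(os.environ['{KEY_ENV}'])"],
    capture_output=True, text=True, check=True,
).stdout.strip()
assert child_key == os.environ[KEY_ENV]
```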

3. Summary of the change

Pickle deserialization fix:

  • Replace all cloudpickle.load()/cloudpickle.dump() calls in worker scripts and their callers with SafePickle.load()/SafePickle.dump()
  • Upgrade SafePickle (in nano, ppml, friesian packages) from hardcoded b'shared-key' + SHA1 to per-session random 256-bit key + HMAC-SHA256
  • dump() now auto-generates .sig sidecar files containing the HMAC digest
  • load() mandates signature verification — rejects unsigned or tampered pickle files
  • Use timing-safe hmac.compare_digest() to prevent timing attacks
  • Propagate HMAC key to MPI workers via mpiexec -genv and to subprocesses via inherited environment
  • SCP .sig files alongside .pkl files for remote MPI nodes
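The dump/load scheme above can be sketched as follows. This is a minimal, self-contained illustration using stdlib `pickle` in place of cloudpickle; the actual SafePickle in safepickle.py may differ in method signatures and key handling:

```python
import hashlib
import hmac
import os
import pickle  # BigDL uses cloudpickle; stdlib pickle keeps this sketch self-contained
import secrets

_KEY_ENV = "BIGDL_SAFE_PICKLE_KEY"

def _session_key() -> bytes:
    # Reuse the session key if present, otherwise generate a random 256-bit one.
    return bytes.fromhex(os.environ.setdefault(_KEY_ENV, secrets.token_hex(32)))

class SafePickle:
    @staticmethod
    def dump(obj, f):
        data = pickle.dumps(obj)
        f.write(data)
        # Write the HMAC-SHA256 digest to a .sig sidecar next to the pickle.
        with open(f.name + ".sig", "wb") as sf:
            sf.write(hmac.new(_session_key(), data, hashlib.sha256).digest())

    @staticmethod
    def load(f):
        data = f.read()
        try:
            with open(f.name + ".sig", "rb") as sf:
                expected = sf.read()
        except FileNotFoundError:
            raise ValueError("missing .sig file: refusing to load unsigned pickle")
        actual = hmac.new(_session_key(), data, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, actual):  # timing-safe comparison
            raise ValueError("HMAC mismatch: pickle file was tampered with")
        return pickle.loads(data)
```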

CI workflow fix:

  • Remove the pull_request trigger that allowed fork PRs to override the workflow definition on the self-hosted runner
  • Add author_association guard (MEMBER, OWNER, or COLLABORATOR) to the issue_comment trigger
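The hardened trigger can be sketched as a GitHub Actions condition like the following (illustrative only; the job name is a placeholder, and the actual workflow file contains more than this fragment):

```yaml
# Sketch: issue_comment-only trigger with an author_association guard.
on:
  issue_comment:
    types: [created]

jobs:
  performance-regression-test:   # placeholder job name
    # Run only when a trusted user comments 'APRT' on a pull request.
    if: >-
      github.event.issue.pull_request &&
      contains(github.event.comment.body, 'APRT') &&
      contains(fromJSON('["MEMBER", "OWNER", "COLLABORATOR"]'),
               github.event.comment.author_association)
    runs-on: [self-hosted]
```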

4. How to test?

  • The pickle deserialization fix can be verified by attempting to load an unsigned or tampered pickle file — SafePickle.load() will raise an error
  • The CI workflow fix can be verified by checking that the if condition now requires author_association membership
  • Unit test
  • Application test
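The rejection behavior boils down to an HMAC comparison, which can be exercised with a standalone stdlib demo (this is not the project's test suite, just an illustration of the check):

```python
import hashlib
import hmac
import os

key = os.urandom(32)            # stand-in for the per-session key
payload = b"pickled-bytes"      # stand-in for serialized model state
sig = hmac.new(key, payload, hashlib.sha256).digest()

# The unmodified payload verifies; a tampered payload does not.
assert hmac.compare_digest(sig, hmac.new(key, payload, hashlib.sha256).digest())
tampered = payload + b"x"
assert not hmac.compare_digest(sig, hmac.new(key, tampered, hashlib.sha256).digest())
```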

5. New dependencies

No new dependencies introduced. All changes use Python standard library modules (hmac, hashlib, os) that were already imported.

liu-shaojun and others added 2 commits April 20, 2026 02:04
Replace raw cloudpickle.load() with HMAC-verified SafePickle across all
four distributed training backends (MPI, Horovod, PyTorch DDP, TensorFlow).

The vulnerability allowed arbitrary code execution via crafted pickle files
passed through CLI-controlled paths. Worker scripts deserialized these files
without integrity verification, enabling cluster-wide RCE in distributed
deployments.

Changes:
- Upgrade SafePickle to use per-session random HMAC-SHA256 keys instead of
  the hardcoded b'shared-key' with SHA1
- SafePickle.dump() now auto-generates .sig sidecar files for integrity
- SafePickle.load() mandates signature verification, rejecting unsigned or
  tampered pickle files
- Use timing-safe hmac.compare_digest() to prevent timing attacks
- Propagate HMAC key to MPI workers via mpiexec -genv and to subprocesses
  via inherited environment
- SCP .sig files alongside .pkl files for remote MPI nodes

Affected files:
- mpi_train.py, mpi_estimator.py, mpi_runner.py (MPI training)
- horovod_worker.py, multiprocs_backend.py (Horovod)
- worker.py, ddp_subprocess.py (PyTorch DDP)
- subprocess_worker.py, backend.py (TensorFlow)
- safepickle.py in nano, ppml, and friesian packages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…unner

The performance-regression-test workflow had two vulnerabilities that
allowed any GitHub user to execute arbitrary code on Intel's self-hosted
CI runner (Rohan):

1. The issue_comment trigger had no author_association check, so anyone
   could comment 'APRT' on any PR to trigger benchmark execution of
   fork PR code on the self-hosted runner.

2. The pull_request trigger caused the workflow to run the PR author's
   version of the workflow file on the self-hosted runner, giving full
   control over the execution environment.

Fix:
- Remove the pull_request trigger entirely — it allowed fork PRs to
  override the workflow definition itself
- Add author_association guard requiring MEMBER, OWNER, or COLLABORATOR
  status to trigger via issue_comment
- Simplify the job-level condition to only handle the comment-based flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
liu-shaojun requested a review from glorysdj as a code owner April 20, 2026 04:29
liu-shaojun changed the title from "Fix/unsafe pickle deserialization" to "Fix unsafe pickle deserialization and CI workflow unauthorized execution" Apr 20, 2026