PKGBUILD parser

A naive PKGBUILD parser library for Rust. It extracts package names, sources, dependency relationships and similar data from PKGBUILDs with little to no overhead.

Highlights

Naive

This is naive in the sense that it does not understand PKGBUILDs natively, nor does it care what the PKGBUILDs do.

Instead, it uses a Bash instance to run a dynamically generated, highly efficient script. The script does only the bare-minimum handling of in-PKGBUILD data structures and dumps them directly to its stdout, with minimal decoration, to be parsed by the library.

Being naive, this avoids a lot of hacks needed in the Rust world to try to understand a Bash script and a lot of pitfalls that come with them.
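The idea can be sketched with nothing but the standard library. This is not the library's actual script or output format: `dump_with_bash`, `parse_dump` and the `key=value` line format are assumptions made up for illustration. Bash evaluates the PKGBUILD and prints two scalar variables, and Rust only splits the resulting lines.

```rust
use std::io;
use std::process::Command;

/// Hypothetical helper (not part of the library): let Bash evaluate the
/// PKGBUILD and print selected scalar variables as `key=value` lines.
fn dump_with_bash(pkgbuild: &str) -> io::Result<String> {
    // The PKGBUILD path is passed as a positional argument, so it is never
    // interpolated into the script text itself.
    let script = r#"source "$1"; printf 'pkgname=%s\npkgver=%s\n' "$pkgname" "$pkgver""#;
    let output = Command::new("bash")
        .args(["-c", script, "bash", pkgbuild])
        .output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

/// Split the dump into `(key, value)` pairs; lines without `=` are skipped.
fn parse_dump(dump: &str) -> Vec<(String, String)> {
    dump.lines()
        .filter_map(|line| line.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

Calling `parse_dump(&dump_with_bash("./PKGBUILD")?)` would yield the pairs for that file. Note that sourcing a PKGBUILD executes it, so the isolation advice in the Security concern section applies to this sketch as well.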

High efficiency

The parser script is highly optimized. Its logic is assembled dynamically when the script is built, and then stays fixed for the lifetime of the parser. It wastes no time on data the user does not want to parse.

The whole parser script uses only Bash built-in functionality and does not spawn child processes other than the subshells used to extract package-specific variables, and even those are avoidable.
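As a rough illustration of what "dynamically assembled yet static" can mean, the sketch below builds a Bash script from a set of flags, so that disabled items never appear in the script at all. `ScriptSketch`, its fields and the emitted Bash lines are hypothetical and only loosely modelled on the builder fields shown under Usage; the real generated script is different.

```rust
/// Hypothetical stand-in for a script builder; each flag decides whether
/// the matching dump logic is emitted into the script at all.
struct ScriptSketch {
    depends: bool,
    provides: bool,
    pkgver_func: bool,
}

/// Append a Bash line that prints every element of array `name`,
/// one `name=value` line per element, using only the `printf` builtin.
fn dump_array(script: &mut String, name: &str) {
    script.push_str(&format!("printf '{name}=%s\\n' \"${{{name}[@]}}\"\n"));
}

impl ScriptSketch {
    /// Assemble the script text once; it stays unchanged afterwards.
    fn build(&self) -> String {
        let mut script = String::from("source \"$1\"\n");
        dump_array(&mut script, "pkgname");
        if self.depends {
            dump_array(&mut script, "depends");
        }
        if self.provides {
            dump_array(&mut script, "provides");
        }
        if self.pkgver_func {
            // `declare -F` is a builtin, so no child process is spawned.
            script.push_str("if declare -F pkgver >/dev/null; then printf 'pkgver_func=y\\n'; fi\n");
        }
        script
    }
}
```

Because the decisions are made while building the text, the script itself contains no per-run "is this enabled?" checks for the disabled parts.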

In a test against Arch Linux's 12575 official PKGBUILDs, the example benchmark executable took ~42.9 seconds single-threaded and ~5.9 seconds multi-threaded on an AMD Ryzen 5600X.

Strongly typed

All data parsed from PKGBUILDs is stored as strongly typed, native Rust types. These include version structures that can easily be compared with the built-in vercmp feature, dependencies with separate package name and version fields, hashes stored as byte arrays, sources with a protocol type and protocol-specific fields, etc.
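To make "separate package name and version fields" and "hashes stored as byte arrays" concrete, here is a minimal std-only sketch. The `Order` and `Dependency` types and the `decode_hex` helper are hypothetical names invented for this example; the library's own types differ.

```rust
use std::str::FromStr;

/// Comparison operator of a versioned dependency (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Order { Less, LessOrEqual, Equal, GreaterOrEqual, Greater }

/// A dependency split into a name and an optional version constraint.
#[derive(Debug, PartialEq, Eq)]
struct Dependency {
    name: String,
    constraint: Option<(Order, String)>,
}

impl FromStr for Dependency {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Everything before the first comparison character is the name.
        let pos = s.find(|c: char| matches!(c, '<' | '=' | '>')).unwrap_or(s.len());
        let (name, rest) = s.split_at(pos);
        if name.is_empty() {
            return Err(format!("missing package name in {s:?}"));
        }
        if rest.is_empty() {
            return Ok(Dependency { name: name.to_string(), constraint: None });
        }
        // Try two-character operators before their one-character prefixes.
        let (order, version) = if let Some(v) = rest.strip_prefix(">=") {
            (Order::GreaterOrEqual, v)
        } else if let Some(v) = rest.strip_prefix("<=") {
            (Order::LessOrEqual, v)
        } else if let Some(v) = rest.strip_prefix('>') {
            (Order::Greater, v)
        } else if let Some(v) = rest.strip_prefix('<') {
            (Order::Less, v)
        } else {
            // `rest` starts with one of '<', '=', '>', so only '=' remains.
            (Order::Equal, &rest[1..])
        };
        if version.is_empty() {
            return Err(format!("missing version in {s:?}"));
        }
        Ok(Dependency { name: name.to_string(), constraint: Some((order, version.to_string())) })
    }
}

/// Decode a hex checksum string into raw bytes; `None` on malformed input.
fn decode_hex(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}
```

With types like these, `"glibc>=2.38".parse::<Dependency>()` gives a value whose name and constraint can be inspected separately instead of re-splitting strings at every use site.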

Piping friendly

Although all data structures are strongly typed, the whole Pkgbuild still derives both serialization and deserialization from serde. This means you can run the parser in an isolated, safe container that cannot reach sensitive data on the host, let it write serialized data to its output, and have the outer process on the host deserialize it again. This is of great use when you want to avoid the security risk that comes from PKGBUILDs always being valid Bash scripts that can do whatever a Bash script can do. See the Security concern section below, and check out the jail example for how to implement this.

Examples

There are a few examples under examples. To run one, use a command like:

cargo run --example dump_all [path to pkgbuild]

You can also build and use some examples to replace part of the makepkg functionality, like:

cargo build --release --features srcinfo --example printsrcinfo 
strip target/release/examples/printsrcinfo -o ~/bin/printsrcinfo

From now on you can run ~/bin/printsrcinfo instead of makepkg --printsrcinfo. It is much faster (0.017s vs 4.65s on kodi-nexus-mpp-git) and helps greatly during PKGBUILD development.

Usage

A few structs in the library need to be created and used to parse PKGBUILDs.

Parser

A Parser is a combination of a ParserScript and ParserOptions that is ready to take PKGBUILDs as input. Calling parse_one() or parse_multi() on it uses the underlying ParserScript to parse the given PKGBUILD paths. The parse_one() method takes an optional path argument, which defaults to PKGBUILD if it is not set.

// Create a `Parser` instance
let parser = Parser::new().expect("Failed to create parser");
// Parse one
let pkgbuild = parser.parse_one(None).expect("Failed to parse PKGBUILD");
// Parse multi
let pkgbuilds = parser.parse_multi(["/tmp/PKGBUILD/ampart", "/tmp/ampart-git/PKGBUILD", "/tmp/chromium/PKGBUILD"]).expect("Failed to parse multiple PKGBUILDs");

The shortcut functions parse_one() and parse_multi() each create a temporary Parser object and call the corresponding method on it.

// Parse one
let pkgbuild = parse_one(None).expect("Failed to parse PKGBUILD");
// Parse multi
let pkgbuilds = parse_multi(["/tmp/PKGBUILD/ampart", "/tmp/ampart-git/PKGBUILD", "/tmp/chromium/PKGBUILD"]).expect("Failed to parse multiple PKGBUILDs");

Please note that the main method is parse_multi(), and parse_one() is only a wrapper around it. If you want to parse multiple PKGBUILDs, always use parse_multi(), as it spawns the script only once.

ParserScript

A ParserScript is a handle to a temporary or on-disk file that holds the content of the script. Usually you only want the temporary variant, unless you want to inspect the generated script.

// A temporary file, deleted when it goes out of scope
let script = ParserScript::new(None);
// An on-disk file, which persists after the handle's lifetime
let script = ParserScript::new(Some("/tmp/myscript"));

ParserOptions

A ParserOptions accompanies a ParserScript to construct a Parser. It holds options that determine the behaviour of the Parser but are not hardcoded into the ParserScript.

// The stream style creation
let mut options = ParserOptions::new();
options.set_interpreter("bin/mybash")
    .set_work_dir(Some("work"))
    .set_single_thread(true);
// The C style creation
let options = ParserOptions {
    intepreter: "bin/mybash".into(),
    work_dir: Some("work".into()),
    single_thread: true,
};

ParserScriptBuilder

A ParserScriptBuilder can be used to construct a fine-tuned ParserScript.

let mut builder = ParserScriptBuilder::new();
builder.provides = false;
builder.pkgver_func = false;
let script = builder.build().expect("Failed to construct script");
// Stream style is also supported
let script = ParserScriptBuilder::new()
        .set_makepkg_library("lib/makepkg")
        .set_makepkg_config("conf/makepkg.conf")
        .build(Some("work/my_parser.bash"))
        .expect("Failed to construct script");

Optional features

format: implements Display for all our data types, useful when you want to display them in logs in a pretty format.
- The Debug trait is always derived on all our data types regardless of this feature.
serde: implements serde::Serialize and serde::Deserialize for all our data types, useful when you want to pass Pkgbuilds between different programs, or to and from a sub-process in a container.
- Enabling this pulls in the serde and serde_bytes dependencies.
nothread: limits the parser implementation to a single thread.
- The list of PKGBUILDs is fed into the parser script's stdin. For minimum IO wait, when this feature is not enabled (the default), the library spawns two concurrent threads to write stdin and read stderr, while the main thread reads stdout.
- In some cases you might not want any thread to be spawned. When this feature is enabled, the library uses a dumber, page-by-page write and read behaviour in the same thread.
unsafe_str: skips some validation for maximum performance when creating &str and String.
- Namely, this allows the unsafe conversion from &[u8] to &str and String, so the UTF-8 check can be skipped.
- This IS unsafe, but you can make the tradeoff of performance vs security if you really prefer performance.
vercmp: supports version comparison between PlainVersion values.
- This uses a Rust native port of the rpmvercmp() function, just like in pacman. The result should be the same as pacman's vercmp CLI utility.
tempfile: supports creating the parser script as a tempfile::NamedTempFile. This is enabled by default.
- If disabled, this removes the whole dependency tree introduced by tempfile, but you will have to explicitly set paths for the parser script.
srcinfo: adds a srcinfo() method to Pkgbuild, which generates a Srcinfo struct that can be used to format a PKGBUILD into a format similar to the output of makepkg --printsrcinfo.
- The Srcinfo struct is only available when this feature is enabled.
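For readers unfamiliar with rpmvercmp-style ordering, the sketch below shows the core idea behind the vercmp feature: versions are split into numeric and alphabetic segments that are compared pairwise. It is a deliberately simplified approximation, not the library's port. It ignores epoch and pkgrel handling and the separator-length rule, and the names `Segment`, `segments` and `vercmp` are chosen for this example only.

```rust
use std::cmp::Ordering;

/// One run of digits or letters inside a version string.
#[derive(Debug, PartialEq)]
enum Segment<'a> {
    Num(&'a str),
    Alpha(&'a str),
}

/// Split a version into segments; any other character is a separator.
fn segments(version: &str) -> Vec<Segment<'_>> {
    let bytes = version.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        if bytes[i].is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() { i += 1; }
            // Leading zeros carry no weight: "007" compares like "7".
            out.push(Segment::Num(version[start..i].trim_start_matches('0')));
        } else if bytes[i].is_ascii_alphabetic() {
            while i < bytes.len() && bytes[i].is_ascii_alphabetic() { i += 1; }
            out.push(Segment::Alpha(&version[start..i]));
        } else {
            i += 1; // separators such as '.', '_' or '+' only split segments
        }
    }
    out
}

/// Simplified rpmvercmp-style comparison of two version strings.
fn vercmp(a: &str, b: &str) -> Ordering {
    let (sa, sb) = (segments(a), segments(b));
    for (x, y) in sa.iter().zip(sb.iter()) {
        let ord = match (x, y) {
            // More digits means a larger number; equal lengths compare lexically.
            (Segment::Num(m), Segment::Num(n)) => m.len().cmp(&n.len()).then_with(|| m.cmp(n)),
            (Segment::Alpha(m), Segment::Alpha(n)) => m.cmp(n),
            // A numeric segment is always newer than an alphabetic one.
            (Segment::Num(_), Segment::Alpha(_)) => Ordering::Greater,
            (Segment::Alpha(_), Segment::Num(_)) => Ordering::Less,
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    // One side ran out: leftover segments win, unless they start with
    // letters, so "1.0a" sorts before "1.0".
    match (sa.get(sb.len()), sb.get(sa.len())) {
        (Some(Segment::Alpha(_)), _) => Ordering::Less,
        (Some(Segment::Num(_)), _) => Ordering::Greater,
        (None, Some(Segment::Alpha(_))) => Ordering::Greater,
        (None, Some(Segment::Num(_))) => Ordering::Less,
        (None, None) => Ordering::Equal,
    }
}
```

Comparing digit strings by length first avoids parsing them into integers, so arbitrarily long numeric segments cannot overflow.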

Security concern

A Bash instance is created to execute the built-in script. It reads the list of PKGBUILDs from its stdin and writes the parsed result to its stdout, which the library then parses into native Rust data structures.
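The stdin-in, stdout-out flow described above might be sketched as follows. This is an assumption-laden illustration, not the library's implementation: `feed_paths` is a hypothetical function, and the program name is a parameter so that a harmless stand-in such as `cat` can take the place of the Bash script. Writing happens on a helper thread so that a full pipe on one side cannot stall the other.

```rust
use std::io::{self, Read, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Feed `paths` (one per line) to a single child process and collect
/// everything it writes to stdout.
fn feed_paths(program: &str, paths: &[&str]) -> io::Result<String> {
    let mut child = Command::new(program)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdin = child.stdin.take().expect("stdin was requested as piped");
    let mut stdout = child.stdout.take().expect("stdout was requested as piped");
    let input: String = paths.iter().map(|path| format!("{path}\n")).collect();
    // Write on a separate thread while this thread reads; `stdin` is dropped
    // when the closure returns, which signals EOF to the child.
    let writer = thread::spawn(move || stdin.write_all(input.as_bytes()));
    let mut output = String::new();
    stdout.read_to_string(&mut output)?;
    writer.join().expect("writer thread panicked")?;
    child.wait()?;
    Ok(output)
}
```

For example, `feed_paths("cat", &["/tmp/a/PKGBUILD", "/tmp/b/PKGBUILD"])` simply echoes the two paths back, one per line. However many paths are passed, only one child process is spawned.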

Shell injection should not be a problem on the library side, as the script does not read any variable from user input. However, as PKGBUILDs themselves are just plain Bash scripts under the hood, they can do a lot of dangerous things. You should therefore make sure that the part of your code which reads PKGBUILDs is isolated from the host environment.

This library does not come with any predefined security mechanism to lock the reader into a container. It is up to the caller to choose a containerization tool that limits the potential damage PKGBUILDs could cause.

As this library has an optional serde feature, you can use it to serialize the Pkgbuilds parsed in a child process spawned inside a safe container, and deserialize them in your main process. MessagePack is a highly efficient binary format that is well suited to passing this data around.
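Whatever serialization format is chosen, the host needs a way to know where one serialized message ends. A common approach is a length prefix, sketched below with the standard library only. This is not taken from the library or its jail example: `write_frame`, `read_frame` and the 16 MiB `MAX_FRAME` limit are assumptions for illustration, and the payload is treated as opaque bytes that would in practice come from a serde-based encoder.

```rust
use std::io::{self, Read, Write};

/// Upper bound for one frame, so an untrusted child cannot make the host
/// allocate an arbitrary amount of memory (the limit is an arbitrary choice).
const MAX_FRAME: u32 = 16 * 1024 * 1024;

/// Child side: write a 4-byte little-endian length followed by the payload.
fn write_frame<W: Write>(mut writer: W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    writer.write_all(&(payload.len() as u32).to_le_bytes())?;
    writer.write_all(payload)?;
    writer.flush()
}

/// Host side: read one frame, rejecting lengths above `MAX_FRAME`.
fn read_frame<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    reader.read_exact(&mut header)?;
    let len = u32::from_le_bytes(header);
    if len > MAX_FRAME {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len as usize];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

The child would call `write_frame` on its stdout, and the host would call `read_frame` on the child's stdout pipe before handing the bytes to the deserializer. Treating the length as untrusted input matters here, because the child is exactly the part that ran the PKGBUILD.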
