A naive PKGBUILD parser library for Rust. Useful to extract package name, sources, dependency relationships, etc from them with little to no bottleneck.
This is naive in the sense that it does not understand PKGBUILD
s natively, nor does it care what the PKGBUILD
s do.
Instead, it uses a Bash instance to run a dynamically generated, highly efficient script, which does only the bare-minimum handling of in-PKGBUILD data structures and dump them directly to its stdout with minimum decorating to be parsed by the library.
Being naive, this avoids a lot of hacks needed in the Rust world to try to understand a Bash script and a lot of pitfalls that come with them.
The parser script is highly optimized. The logic is dynamically assembled yet static during the parser lifetime. It wastes 0 time on stuffs the users do not want to parse.
The whole parser script uses only Bash
native logics and does not spawn child processes other than the subshells to extract package-specific variables, and even those are avoidable.
On a test against ArchLinux's 12575 official PKGBUILD
s, the example benchmark
executable took ~42.9 seconds when single-threaded and ~5.9 seconds when multi-threaded on an AMD Ryzen 5600X
All data parsed from PKGBUILD
s are stored as strongly typed Rust native types, these include version structures that could be easily compared with the built-in vercmp
feature, dependencies that include seperate package name and version fields, hashes that are stored as byte arrays, sources that have protocol type and protocol-specific fields, etc.
Nevertheless, while all data structures are strongly typed, the whole PKGBUILD
still derives serde
, both deserialization and serialization. This means you can run the parser in an encapsuled, isolated, safe container that cannot reach sensitive data on host, and let it write serialized data to its output, so the outer process that runs on host could deserialize it again. This would be of great use if the security concern brought by the fact the PKGBUILD
is always valid Bash script and they could do whatever a Bash script could do shall be avoided. See the Security concern section below, and check out the jail
example for how to implement this.
There are a couple few examples under examples, to run them, do like
cargo run --example dump_all [path to pkgbuild]
You can also build and use some examples to replace part of the makepkg functionality, like:
cargo build --release --features srcinfo --example printsrcinfo
strip target/release/examples/printsrcinfo -o ~/bin/printsrcinfo
From now on you can run ~/bin/printsrcinfo
instead of makepkg --printsrcinfo
, this is much much faster (0.017s vs 4.65s on kodi-nexus-mpp-git
) and would help you greatly on PKGBUILD development.
There're a few structs in the library that would need to be created and used to parse PKGBUILD
s.
A Parser
is a combination of a ParserScript
and ParserOptions
that is ready to take PKGBUILDs
as its input to parse. Calling parse_one()
and parse_multi()
on it would use the underlying ParserScript
to parse the defined list of paths of PKGBUILD
s. The parse_one()
method has an optional arg, and would default to PKGBUILD
if it's not set.
// Create a `Parser` instance
let parser = Parser::new().expect("Failed to create parser");
// Parse one
let pkgbuild = parser.parse_one(None).expect("Failed to parse PKGBUILD");
// Parse multi
let pkgbuilds = parser.parse_multi(["/tmp/PKGBUILD/ampart", "/tmp/ampart-git/PKGBUILD", "/tmp/chromium/PKGBUILD"]).expect("Failed to parse multiple PKGBUILDs");
The shortcut methods parse_one()
and parse_multi()
would each create a temporary Parser
object and call the corresponding methods on them.
// Parse one
let pkgbuild = parse_one(None).expect("Failed to parse PKGBUILD");
// Parse multi
let pkgbuilds = parse_multi(["/tmp/PKGBUILD/ampart", "/tmp/ampart-git/PKGBUILD", "/tmp/chromium/PKGBUILD"]).expect("Failed to parse multiple PKGBUILDs");
Please note the main method is parse_multi()
, and parse_one()
is only a wrapper around the parse_multi()
method. If you want to parse multiple PKGBUILD
s, always use the parse_multi()
method, as that would only spawn the script once.
A ParserScript
is a handle to a tamporary or on-disk file that holds the content of the script. Usually you would only want the temporary variant, unless you want to check the generated script.
// A temporary file, it would be deleted after it goes out of scope
let script = ParserScript::new(None);
// A on-disk file, the file would still persist after the lifetime
let script = ParserScript::new(Some("/tmp/myscript"));
A ParserOptions
accompanies a ParserScript
to construct a Parser
, which holds some options that could determine the behaviour of the Parser
that's not hardcoded into the ParserScript
// The stream style creation
let mut options = ParserOptions::new();
options.set_interpreter("bin/mybash")
.set_work_dir(Some("work"))
.set_single_thread(true);
// The C style creation
let options = ParserOptions {
intepreter: "bin/mybash".into(),
work_dir: Some("work".into()),
single_thread: true,
};
A ParserScriptBuilder
could be used to construct a fine-tuned ParserScript
let mut builder = ParserScriptBuilder::new();
builder.provides = false;
builder.pkgver_func = false
let script = builder.build().expect("Failed to construct script");
// Stream style is also supported
let script = ParserScriptBuilder::new()
.set_makepkg_library("lib/makepkg")
.set_makepkg_config("conf/makepkg.conf")
.build(Some("work/my_parser.bash"))
.expect("Failed to construct script");
format
: implDisplay
for all our data types, useful when you want to display them in logs in pretty format.- The
Debug
trait would always be derived on all our data types regardless of this feature.
- The
serde
: implserde::Serialize
andserde::Deserialize
for all our data types, useful when you want to pass thePkgbuild
s between different programs, or to and from your sub-process in containers.- Enabling this would pull in
serde
andserde_bytes
dependencies.
- Enabling this would pull in
nothread
: limit the parser implementation to only use a single thread.- As we would feed the list of PKGBUILDs into the parser script's
stdin
, for minimum IO wait, when this is not enabled (default), the library would spawn two concurrent threads to writestdin
and readstderr
, while the main thread readsstdout
. - In some cases you might not want any thread to be spawned. When this is enabled, the library to use a dumber, page-by-page write read behaviour in the same thread.
- As we would feed the list of PKGBUILDs into the parser script's
unsafe_str
: skip some validation for max performance when creating&str
andString
- Namely this allows the unsafe conversion from
&[u8]
to&str
andString
, soutf-8
check could be skipped. - This IS unsafe, but the tradeoff of performance vs security could be made if you really prefer performance.
- Namely this allows the unsafe conversion from
vercmp
: support version comparison betweenPlainVersion
- This uses a Rust native port of the
rpmvercmp()
function, just like inpacman
. The result should be the same aspacman
'svercmp
CLI utility.
- This uses a Rust native port of the
tempfile
: support creating parser script astempfile::NamedTempFile
, this is enabled by default.- If disabled, this would remove a whole dependency tree introduced by
tempfile
, but you'll have to explicitly set paths for the parser script.
- If disabled, this would remove a whole dependency tree introduced by
srcinfo
addssrcinfo()
method toPkgbuild
, which generates aSrcinfo
struct and could be used to format PKGBUILD into a format similiar to the output format ofmakepkg --printsrcinfo
- Only when this is enabled, would
Srcinfo
struct be available
- Only when this is enabled, would
A Bash instance would be created to execute the built-in script, it would read the list of PKGBUILD
s from its stdin
, and outputs the parsed result to its stdout
, which would then be parsed by the library into native Rust data structure.
Shell injection should not be a problem in the library side as the script would not read any variable from user input. However, as PKGBUILD
s themselved are just plain Bash scripts under the hood, there're a lot of dangerous things that could be done by them. You should thus make sure the part in your code which reads the PKGBUILD
s should be isolated from the host environment.
This library does not come with any pre-defined security methods to lock the reader into a container. It's up to the caller's fit to choose an containerization tool to limit the potential damage that could be caused by PKGBUILD
s.
As this library has an optional serde
feature, you could use that to serialize Pkgbuild
s you parsed in a child process you spawned in a safe container, and deserialize that into your main process. MessagePack
is a highly efficient binary format that's very suitable for the job when passing these data around.