Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use installed JEP DLL rather than packaged DLL #50

Closed
wants to merge 6 commits into from

Conversation

williballenthin
Copy link
Contributor

@williballenthin williballenthin commented Jun 13, 2023

This PR addresses #49 and makes Ghidrathon substantially easier to install: it relies on the jep.dll built during pip install jep and therefore doesn't require a Gradle build step by the user. Instead, everyone can use the same pre-built Ghidrathon.zip. An example is attached. In theory it should be a drop-in replacement for whatever Ghidrathon.zip you have already.

The PR updates the interpreter configuration to search for the jep.dll in three places:

  • when PYTHONPATH is set, which a user may do to specify a specific virtualenv,
  • when VIRTUAL_ENV is set, such as when a virtualenv is "activated",
  • all the entries from sys.path used by whatever python3 uses

This works on my Windows system, Ghidra 10.3. I need to test this further, especially on my m1 Mac Mini. We should also setup testing in CI/CD to demonstrate this works on those platforms. Finally, documentation updates are needed if this PR is approved.

TODO:

  • manual testing on linux
  • manual testing on macOS (x86)
  • manual testing on macOS (arm64)
  • CI/CD testing
  • docstrings on all functions
  • handle VIRTUAL_ENV only scenario

If merged, this supercedes #4 #46 and #48 and closes #1 #3 #10 #27 #34 and #49

ghidra_10.3_PUBLIC_20230613_Ghidrathon-50.zip


// configure Java names that will be ignored when importing from Python
for (String name : ghidrathonConfig.getJavaExcludeLibs()) {
private PathMatcher getJepDllPathMatcher() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • docstring comments needed for all new routines

Comment on lines +171 to +178
if (arch == "amd64") {
// x86
return FileSystems.getDefault().getPathMatcher("glob:**libjep.so");
} else if (arch == "arm64") {
// arm m1
// TODO: just guessing this arch name arm64
return FileSystems.getDefault().getPathMatcher("glob:**libjep.jnilib");
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • testing needed on m1 mac
  • testing needed on x86 mac

} else if (os.indexOf("win") >= 0) {
return FileSystems.getDefault().getPathMatcher("glob:**jep.dll");
} else if (os.indexOf("nux") >= 0) {
return FileSystems.getDefault().getPathMatcher("glob:**libjep.so");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • testing needed on linux

Comment on lines +300 to +315
Path pythonPathJep = findPythonPathJep();
if (pythonPathJep != null) {
log.info("found JEP dll via PYTHONPATH: " + pythonPathJep);
}

try {
// if this is set, it takes precedence over system python
Path virtualenvJep = findVirtualEnvJep();
if (virtualenvJep != null) {
log.info("found JEP dll via VIRTUAL_ENV: " + virtualenvJep);
}

MainInterpreter.setJepLibraryPath(nativeJep.getAbsolutePath());
// fall back to whatever python3 references
Path systemJep = findSystemJep();
if (systemJep != null) {
log.info("found JEP dll via python3: " + systemJep);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we collect all the paths first so that we can log out the status fully, then we actually set the config second.

@williballenthin
Copy link
Contributor Author

when only VIRTUAL_ENV is present then we'll need to update the python search path. i'm not sure if we can set the PYTHONPATH environment variable from java or we need to shim the interpreter loading to update sys.path.

@williballenthin
Copy link
Contributor Author

if the installation is not working, it would be nice to show a dialog to the user and explain how they can setup the dependencies. adding this doesn't block the merge here, though.

@williballenthin
Copy link
Contributor Author

one rough spot i encountered during development was when JEP crashed, such as when the python interpreter fails to load (missing JEP module), and it brings down all of Ghidra immediately. this was one of the reasons using JNI was discouraged by the Ghidra devs - that bugs make the whole program unreliable versus a single component crashing.

to avoid this, we should consider invoking python (and loading JEP) as a subprocess and asserting that it doesn't crash before trying to start the embedded interpreter. if it does crash, we can warn the user and bail safely.

@williballenthin
Copy link
Contributor Author

we should also consider asserting that the jep.dll architecture matches the running java architecture. especially on mac m1 systems, it's easy to mix up formats since the intel emulation via Rosetta is so seamless.

not a blocker, but a nice to have.

Copy link
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@williballenthin this is great! Thank you for taking the time to think this though along with the implementation.

I left a couple of comments for your review. Also, I have some additional thoughts/questions/comments:

1

How are we handling jep.jar? From Jep's documentation

If you would like to include jep as part of your application, you can place the files as necessary presuming the following conditions are met:

The jep.jar is accessible to the Java classloaders (typically through the Java classpath)

The shared library (jep.so or jep.dll) is accessible by the Java process (typically through -Djava.library.path or the environment variable LD_LIBRARY_PATH)

Are we going to package jep.jar with the Ghidrathon extension? I don't recall if I tried this in the past, but it would be interesting if this is possible. Jep makes the jep.jar available via Maven which could lead to a slick CI build process for the extension. I imagine we ship the Ghidrathon extension with jep.jar pulled via Maven and then require users to build the shared library on their system, which we then resolve at runtime as you've shown here.

Concerns with this approach include:

  • How do we handle version mismatches between jep.jar and the shared library?
  • Can we assume jep.jar works across OS and architectures for which users may build the Jep shared library?

2

Are we concerned with the overhead of locating the Jep shared library, especially with using subprocesses? Is this something we should resolve once and store for future use?


// whoops try Windows
nativeJep = Application.getOSFile(GhidrathonUtils.THIS_EXTENSION_NAME, "jep.dll");
private void setJepPaths() throws JepException, FileNotFoundException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider using Jep's LibraryLocator? LibraryLocator should cover PYTHON_HOME and VIRTUAL_ENV lending less code for us to maintain. It won't, however, cover your third solution for resolving the shared Jep library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't consider it because i didn't notice it, thank you! i think i can replace most of this PR with a call to that routine 🥳

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug into this some more and realized that MainInterpreter is coded to use the LibraryLocator if the Jep library path isn't manually set...https://github.com/ninia/jep/blob/056ce9907f5ecbf2364df1ec55755404b2e8a947/src/main/java/jep/MainInterpreter.java#L124-L135. Essentially, if we don't do anything then Jep attempts to find the Jep native library.

I tested this locally in a Linux environment and it worked great with the caveat that the LibraryLocator is naive in that it chooses the first Jep native library that it finds e.g. if you have two Python installs, both with Jep installed, then we don't have a choice (that I can find) as to which is selected. I don't think this is a major roadblock as it, so far, "just works" without additional code on our end and the multiple Python installs can be addressed by the user either by:

  • ensuring only 1 Python install also has Jep installed
  • using virtual environments

I'm going to test this on Windows next.

}

private Path findSystemJep() {
String output = execCmd("python3", "-c", "import sys; import base64; print((b' '.join(map(lambda s: base64.b64encode(s.encode('utf-8')), sys.path))).decode('ascii'))");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we assuming python3 is available on all systems?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. can you give a scenario in which we expect Ghidrathon to work but python3 is not available? i can't immediately think of one, but i dont feel reliable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably don't need this with LibraryLocator, so lets not spend a lot of time answering this question unless that other strategy fails.

@mike-hunhoff
Copy link
Collaborator

mike-hunhoff commented Jun 14, 2023

when only VIRTUAL_ENV is present then we'll need to update the python search path. i'm not sure if we can set the PYTHONPATH environment variable from java or we need to shim the interpreter loading to update sys.path.

We can use PyConfig to configure things like PYTHONHOME before the first Python interpreter is created.

@mike-hunhoff
Copy link
Collaborator

one rough spot i encountered during development was when JEP crashed, such as when the python interpreter fails to load (missing JEP module), and it brings down all of Ghidra immediately. this was one of the reasons using JNI was discouraged by the Ghidra devs - that bugs make the whole program unreliable versus a single component crashing.

to avoid this, we should consider invoking python (and loading JEP) as a subprocess and asserting that it doesn't crash before trying to start the embedded interpreter. if it does crash, we can warn the user and bail safely.

Was the crash occurring because Ghidrathon did not bail early enough after not being able to locate the Jep module?

@williballenthin
Copy link
Contributor Author

We can use PyConfig to configure things like PYTHONHOME before the first Python interpreter is created.

Initially I used this; however, I learned that PYTHONHOME is not the same thing as PYTHONPATH and is basically useless (it doesn't have a meaning on Windows and represents a fragment of a path to the library directory on Unix. its not what i thought). in fact, i don't know why JEP supports this because i can't find a use for this setting.

setting PYTHONPATH is probably needed. I think we can prep the new interpreters by running import sys; sys.path.append(...) if necessary.

@williballenthin
Copy link
Contributor Author

Was the crash occurring because Ghidrathon did not bail early enough after not being able to locate the Jep module?

I think the crash came from within JEP (java side) as it created the Python embedded interpreter, which looked for JEP (python side) that wasn't accessible (such as if I had PYTHONPATH wrong). then the interpeter raised an exception (python side) and JEP (java side) somehow crashed. im not exactly sure of the details since all of Ghidra would disappear from the screen and no logs are written. wrapping the JEP interpreter creation in try/catch Exception had no effect.

in the above scenario, jep.dll is located and invoked; however, the python configuration is incomplete. i think probably the java side tries to reach into the python side, but can't because the python side libraries don't load. ...or something.

it might be nice to triage this, and if its a JEP edge case, fix it upstream. in the meantime, i propose trying to detect a correctly install JEP (python side) via subprocess before creating the embedded interpreter.

@williballenthin
Copy link
Contributor Author

williballenthin commented Jun 15, 2023

Are we concerned with the overhead of locating the Jep shared library, especially with using subprocesses? Is this something we should resolve once and store for future use?

i expect the overhead to be a fraction of a second (walk a few file system directories, create a short lived python process), which I think is reasonable. python and java do many of the same operations as they start themselves up, too. we could do a bit better by short circuiting upon finding the first jep.dll (at the expense of less debugging details, no big deal).

im quite wary of caching the results because of the complexity involved in detecting changes to the environment. i think we should explore this option if we determine Ghidrathon startup time is too long - but not before then.

@williballenthin
Copy link
Contributor Author

williballenthin commented Jun 15, 2023

Are we going to package jep.jar with the Ghidrathon extension?

yes.

i agree there is then the potential for version mismatching. i'll need to research if JEP verifies the versions or if we need to do this. i hope and expect its the former.

in any case, we'll have to ask users to do something like: pip install jep==4.1.1 and emphasize that the version must match. in theory we could make multiple Ghidrathon builds each with a different jep.jar version, but i can't imagine that our users feel very strongly about which JEP version they're using, so its not useful. we could also potentially bundle many jep.jar versions into the same extension and pick at runtime, but the added complexity of figuring out which python version is present is non-trivial, and against doesn't solve a likely problem. so, lets just tell people to use the right version and show an error dialog with the same details when they do something wrong.

since jep.jar is available on maven, i suspect that jep.jar should be portable across OS, architectures, and java versions. i'll confirm by using the same Ghidrathon.zip across the different hardware i have available.

@mike-hunhoff
Copy link
Collaborator

Was the crash occurring because Ghidrathon did not bail early enough after not being able to locate the Jep module?

I think the crash came from within JEP (java side) as it created the Python embedded interpreter, which looked for JEP (python side) that wasn't accessible (such as if I had PYTHONPATH wrong). then the interpeter raised an exception (python side) and JEP (java side) somehow crashed. im not exactly sure of the details since all of Ghidra would disappear from the screen and no logs are written. wrapping the JEP interpreter creation in try/catch Exception had no effect.

in the above scenario, jep.dll is located and invoked; however, the python configuration is incomplete. i think probably the java side tries to reach into the python side, but can't because the python side libraries don't load. ...or something.

it might be nice to triage this, and if its a JEP edge case, fix it upstream. in the meantime, i propose trying to detect a correctly install JEP (python side) via subprocess before creating the embedded interpreter.

Gotcha' - in the past when Ghidra has crashed a file named along the lines of hs_err_pid<pid>.log is generated that may give us a better idea of the cause. I've experienced Jep crash Ghidra because of actions carried out after the interpreter has been created but not during/before its creation so this is definitely interesting.

@mike-hunhoff
Copy link
Collaborator

Are we going to package jep.jar with the Ghidrathon extension?

yes.

i agree there is then the potential for version mismatching. i'll need to research if JEP verifies the versions or if we need to do this. i hope and expect its the former.

in any case, we'll have to ask users to do something like: pip install jep==4.1.1 and emphasize that the version must match. in theory we could make multiple Ghidrathon builds each with a different jep.jar version, but i can't imagine that our users feel very strongly about which JEP version they're using, so its not useful. we could also potentially bundle many jep.jar versions into the same extension and pick at runtime, but the added complexity of figuring out which python version is present is non-trivial, and against doesn't solve a likely problem. so, lets just tell people to use the right version and show an error dialog with the same details when they do something wrong.

since jep.jar is available on maven, i suspect that jep.jar should be portable across OS, architectures, and java versions. i'll confirm by using the same Ghidrathon.zip across the different hardware i have available.

We should also consider situations where multiple Python installs exist on a system where each install includes Jep e.g. a user has Python 3.9 and Python 3.10 installed both w/ Jep version 4.1.1. How do we choose? I think ideally it would be up to the user, but this poses additional challenges. Do we prompt every time or cache the choice? What happens in Ghidra headless mode e.g. Ghidra is configured into an analysis pipeline?

@mike-hunhoff
Copy link
Collaborator

when only VIRTUAL_ENV is present then we'll need to update the python search path. i'm not sure if we can set the PYTHONPATH environment variable from java or we need to shim the interpreter loading to update sys.path.

Unfortunately virtual environments do not work on Windows due to the way Ghidra is invoked. We lose virtual environment env variables when a child process is created for Ghidra e.g. VIRTUAL_ENV does not exist once Ghidra has been started even if we invoke ghidraRun.bat from within the virtual environment.

@mike-hunhoff
Copy link
Collaborator

Thank you for your contribution - this has been superseded by 32f3969.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build suggestions
2 participants