Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SharedCache] Process the .symbols cache file #6210

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

WeiN76LQh
Copy link

@WeiN76LQh WeiN76LQh commented Nov 28, 2024

A significant number of symbols are not being defined because there is currently no support for parsing the .symbols cache file. This commit adds that support.

Whenever SharedCache::InitializeHeader is called, the symbols for the image that header corresponds to, are read from the .symbols cache file and define within the binary view. This is the same place that symbols found in the Mach-O load commands and export trie are applied to the binary view. Some of the code in that function was reworked to be more DRY.

m_symbolInfos has been modified to be a vector of references to Symbols which is similar to #6197. It makes more sense for the use case where m_symbolInfos is used as a symbol cache, otherwise a bunch of time is spent transforming between the old style of m_symbolInfos entries to Binary Ninja Symbols.

This commit does require a metadata version bump. I felt this was necessary to determine which symbols to load from the symbols cache. The problem is that the m_images container does not store the images in the order they are found in the DSC. The index they are at determines the location of their symbols in the symbols cache file. Rather than converting m_images to a vector and relying on its ordering being correct, it seemed more prudent to store the index of the image in the CacheImage structure. As this is serialized, the metadata version has to be bumped to accomodate the change.

Note: this PR is built on top of #6174 because I felt like it made sense for that fix to be included.

WeiN76LQh added 2 commits November 21, 2024 17:36
`SharedCache::m_symbolInfos` isn't being serialized but there is an attempt to deserialize it. This commit adds in the code to serialize it.

I slightly modified the format because I didn't really understand how it was expected to be serialized based on the deserialization code. The deserialization code looked wrong to me but its likely a misunderstanding on my part. I kept it similar to how `m_exportInfos` is serialized.
A significant number of symbols are not being defined because there is currently no support for parsing the .symbols cache file. This commit adds that support.

`m_symbolInfos` has been modified to be a vector of references to `Symbol`s which is similar to [this PR](Vector35#6197). It makes more sense for the use case where `m_symbolInfos` is used as a symbol cache, otherwise a bunch of time is spent transforming between the old style of `m_symbolInfos` entries to Binary Ninja `Symbol`s.

This commit does require a metadata version bump. I felt this was necessary to determine which symbols to load from the symbols cache. The problem is that the `m_images` container does not store the images in the order they are found in the DSC. The index they are at determines the location of their symbols in the symbols cache file. Rather than converting `m_images` to a vector and relying on its ordering being correct, it seemed more prudent to store the index of the image in the `CacheImage` structure. As this is serialized, the metadata version has to be bumped to accomodate the change.
@WeiN76LQh
Copy link
Author

Something I considered was reading out all the symbols in the symbols cache file during the initial load. Not only was this a bit slow but more importantly it results in something like 11 million symbols (at least on iOS 18+ DSCs). With the current way the DSC plugin is written, even with @bdash's major performance improvements (found here bdash#2), the cost of serializing all those symbols every call to SharedCache::SaveToDSCView means analysis becomes substantially slower (like 10x).

Ultimately I think that function probably needs to be re-worked at some point because it is a source of performance issues. Even though this PR does lazy reading of symbols, eventually m_symbolInfos will end up being populated with a lot of symbols as more images are loaded by the user. So the performance issues of SharedCache::SaveToDSCView will continue to be more problematic.

The previous commit broke compiling the macho view plugin due to duplicate definitions
@plafosse plafosse added this to the Gallifrey milestone Dec 4, 2024
@plafosse plafosse added the File Format: SharedCache Issue with the dyld_shared_cache plugin label Dec 10, 2024
@0cyn
Copy link
Member

0cyn commented Dec 10, 2024

Thanks for the PR! Something that might be worth considering when dealing w/ such a high volume of symbols is, finding a way to not require serialization of the whole list.

There may come a point when the Undo Action serialization for only local symbols becomes more performant than serializing everything and reapplying it on deser. I'll need to take a look at what that would affect.

I'm hoping to look at this sometime within the next week or two.

@WeiN76LQh
Copy link
Author

I'm not sure I fully understand your comment. Currently this PR lazily loads symbols so in theory any symbol that has been serialized is because that symbol is required due to it being in an image that has been loaded. That said as you've mentioned in another PR it might make sense to use user symbols so that you don't need to serialize them.

Something else that I'm not sure I fully understand the purpose of is the symbol search tab in the DSC triage view. To me the name would suggest that it provides a list of all symbols in the DSC but currently it is only symbols in the export tries. I guess that would make sense to some degree to just have a list of exported symbols because people maybe looking for symbols imported into some userland binary and therefore trying to find the corresponding implementation in the DSC. If the intention was for it to be a list of all the symbols in the DSC then I guess it would require serialization of the entire symbols list which then has the performance issues. I think then the approach would have to be figuring out how to reduce deserialization to once per binary view, on creation. But also find a way to improve serialization and deserialization performance, which I could see being entirely possible. Would not surprise me if there is a library out there that can update large JSON strings with only the changes that are made for the corresponding structures.

Also from an API perspective I can see value in being able to ask the DSC about all the symbols it has and their addresses. This seems to already partially exist, although its just export trie symbols. If the plan is to provide all symbols then I don't see how you can avoid not serializing them all because otherwise you'd have to regenerate the list everytime the API was asked for all symbols from the DSC and that is a slow process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
File Format: SharedCache Issue with the dyld_shared_cache plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants