-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SharedCache] Process the .symbols cache file #6210
base: dev
Are you sure you want to change the base?
Conversation
`SharedCache::m_symbolInfos` isn't being serialized but there is an attempt to deserialize it. This commit adds in the code to serialize it. I slightly modified the format because I didn't really understand how it was expected to be serialized based on the deserialization code. The deserialization code looked wrong to me but its likely a misunderstanding on my part. I kept it similar to how `m_exportInfos` is serialized.
A significant number of symbols are not being defined because there is currently no support for parsing the .symbols cache file. This commit adds that support. `m_symbolInfos` has been modified to be a vector of references to `Symbol`s which is similar to [this PR](Vector35#6197). It makes more sense for the use case where `m_symbolInfos` is used as a symbol cache, otherwise a bunch of time is spent transforming between the old style of `m_symbolInfos` entries to Binary Ninja `Symbol`s. This commit does require a metadata version bump. I felt this was necessary to determine which symbols to load from the symbols cache. The problem is that the `m_images` container does not store the images in the order they are found in the DSC. The index they are at determines the location of their symbols in the symbols cache file. Rather than converting `m_images` to a vector and relying on its ordering being correct, it seemed more prudent to store the index of the image in the `CacheImage` structure. As this is serialized, the metadata version has to be bumped to accomodate the change.
Something I considered was reading out all the symbols in the symbols cache file during the initial load. Not only was this a bit slow but more importantly it results in something like 11 million symbols (at least on iOS 18+ DSCs). With the current way the DSC plugin is written, even with @bdash's major performance improvements (found here bdash#2), the cost of serializing all those symbols every call to Ultimately I think that function probably needs to be re-worked at some point because it is a source of performance issues. Even though this PR does lazy reading of symbols, eventually |
The previous commit broke compiling the macho view plugin due to duplicate definitions
Thanks for the PR! Something that might be worth considering when dealing w/ such a high volume of symbols is, finding a way to not require serialization of the whole list. There may come a point when the Undo Action serialization for only local symbols becomes more performant than serializing everything and reapplying it on deser. I'll need to take a look at what that would affect. I'm hoping to look at this sometime within the next week or two. |
I'm not sure I fully understand your comment. Currently this PR lazily loads symbols so in theory any symbol that has been serialized is because that symbol is required due to it being in an image that has been loaded. That said as you've mentioned in another PR it might make sense to use user symbols so that you don't need to serialize them. Something else that I'm not sure I fully understand the purpose of is the symbol search tab in the DSC triage view. To me the name would suggest that it provides a list of all symbols in the DSC but currently it is only symbols in the export tries. I guess that would make sense to some degree to just have a list of exported symbols because people maybe looking for symbols imported into some userland binary and therefore trying to find the corresponding implementation in the DSC. If the intention was for it to be a list of all the symbols in the DSC then I guess it would require serialization of the entire symbols list which then has the performance issues. I think then the approach would have to be figuring out how to reduce deserialization to once per binary view, on creation. But also find a way to improve serialization and deserialization performance, which I could see being entirely possible. Would not surprise me if there is a library out there that can update large JSON strings with only the changes that are made for the corresponding structures. Also from an API perspective I can see value in being able to ask the DSC about all the symbols it has and their addresses. This seems to already partially exist, although its just export trie symbols. If the plan is to provide all symbols then I don't see how you can avoid not serializing them all because otherwise you'd have to regenerate the list everytime the API was asked for all symbols from the DSC and that is a slow process. |
A significant number of symbols are not being defined because there is currently no support for parsing the .symbols cache file. This commit adds that support.
Whenever
SharedCache::InitializeHeader
is called, the symbols for the image that header corresponds to, are read from the .symbols cache file and define within the binary view. This is the same place that symbols found in the Mach-O load commands and export trie are applied to the binary view. Some of the code in that function was reworked to be more DRY.m_symbolInfos
has been modified to be a vector of references toSymbol
s which is similar to #6197. It makes more sense for the use case wherem_symbolInfos
is used as a symbol cache, otherwise a bunch of time is spent transforming between the old style ofm_symbolInfos
entries to Binary NinjaSymbol
s.This commit does require a metadata version bump. I felt this was necessary to determine which symbols to load from the symbols cache. The problem is that the
m_images
container does not store the images in the order they are found in the DSC. The index they are at determines the location of their symbols in the symbols cache file. Rather than convertingm_images
to a vector and relying on its ordering being correct, it seemed more prudent to store the index of the image in theCacheImage
structure. As this is serialized, the metadata version has to be bumped to accomodate the change.Note: this PR is built on top of #6174 because I felt like it made sense for that fix to be included.