Skip to content

BUG: line events from non-profiled functions misattributed to profiled functions sharing bytecodes #424

@TTsangSC

Description

@TTsangSC

Short description

If a function not passed to the profiler compiles down to the same bytecode as another function that is, line events of the former may be misattributed to the latter.

Example

`dupe_test/dupe_test_module.py`
def func():
    print(0)


def main():
    func()
    print(1)
    print(2)
`dupe-test-script.py`
from dupe_test_module import func


def main():
    func()
    print(1)
    print(2)


if __name__ == '__main__':
    main()
Profiling (normal)
$ PYTHONPATH="${PYTHONPATH}:${PWD}/dupe_test" kernprof -l --outfile=out --prof-mod=dupe_test_module --prof-mod=dupe-test-script.py -vz dupe-test-script.py
0
1
2
Wrote profile results to 'out'
Timer unit: 1e-06 s

Total time: 1e-05 s
File: [...]/dupe_test_module.py
Function: func at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def func():
     2         1         10.0     10.0    100.0      print(0)

Total time: 1.4e-05 s
File: dupe-test-script.py
Function: main at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def main():
     5         1         11.0     11.0     78.6      func()
     6         1          2.0      2.0     14.3      print(1)
     7         1          1.0      1.0      7.1      print(2)
Profiling (bugged)
$ PYTHONPATH="${PYTHONPATH}:${PWD}/dupe_test" kernprof -l --outfile=out --prof-mod=dupe_test_module -vz dupe-test-script.py
0
1
2
Wrote profile results to 'out'
Timer unit: 1e-06 s

Total time: 1.2e-05 s
File: [...]/dupe_test_module.py
Function: func at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def func():
     2         1         12.0     12.0    100.0      print(0)

Total time: 1.6e-05 s
File: [...]/dupe_test_module.py
Function: main at line 5

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     5         1         13.0     13.0     81.2  def main():
     6         1          2.0      2.0     12.5      func()
     7         1          1.0      1.0      6.2      print(1)
     8                                               print(2)

Note how for the first invocation (1) dupe_test/dupe_test_module.py::main() has never actually been called, but has lines attributed thereto due to its bytecode twin dupe-test-script.py::main(), and (2) the line-event attributions seem to be offset by one (since the two function definitions are one line apart).

Analysis and fixes

line_profiler/_line_profiler.pyx::LineProfiler.add_function() pads the .co_code of code objects sharing the same bytecode with (copies of) the no-op bytecode, in order to quickly distinguish between dupes. Unfortunately, this only happens from the second code object that the profiler has seen onwards.

Hence, if during the Python runtime multiple functions compiling down to the same bytecode exist, but only one of them is passed to the profiler, the profiled function's .co_code is left untouched. When any of them is called, line_profiler/_line_profiler.pyx::inner_trace_callback() would then hash all of them to the same block hash, which may result in stray lines in the non-profiled functions hashing to the same line hashes as other lines in the profiled function, depending on their physical line numbers in their respective source files.

(In the second kernprof invocation, both of the functions sharing the same implementation (and thus bytecode) were passed to the profiler. Hence, one of them went through bytecode de-duplication and the two are thus not conflated.)

The naïve way to fix this would be to include more metadata (e.g. .co_name, (.co_qualname), .co_filename, and .co_firstlineno) in the block hash, or just hash the entire code object, but that is obviously a bad idea since the hashing happens for every single line event the interpreter reports. A more reasonable way would be to start the padding on the first "duplicate", i.e. to pad all code objects the profiler ever sees. This would probably incur a slight (if any) bookkeeping overhead in add_function(), but not during code execution.

Note

As described in the discussions for #422; made a separate issue for documentation purposes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions