Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context Manager - Taint Propagation Issue #797

Closed
giusepperaffa opened this issue Sep 26, 2023 · 6 comments
Closed

Context Manager - Taint Propagation Issue #797

giusepperaffa opened this issue Sep 26, 2023 · 6 comments

Comments

@giusepperaffa
Copy link

Description
I have been trying to use Pysa (Ubuntu 20.04 + virtual environment + Python 3.8) to perform a data flow analysis including the following code, which is part of a function. The source is included in a different function, whereas the sink is in the function updateDynamoDBTable. The inter-procedural nature of the analysis is not causing issues though, as the portion of the code where the taint propagation breaks is the one reported below:

# Output file generation
outputFileA = os.path.join('/tmp', 'outputFileA.txt')
with open(downloadPath, 'r') as inputFileObj, open(outputFileA, 'w') as outputFileObjA:
    inputFileContentsList = inputFileObj.readlines()
    outputFileObjA.write(inputFileContentsList[0])
# Update DynamoDB table
updateDynamoDBTable(outputFileA, outputFileB)

All seems to point to how Pysa deals with taint propagation within a context manager. I am struggling to understand how this can be solved, if at all. Any support would be greatly appreciated.

Pysa models
To support my analysis, I added the following Pysa models. Model 1 allows propagating a taint through the write method for a file (as far as I know, this is not the default Pysa behaviour), whereas models 2 and 3 ensure that the taint is propagated when the context manager performs the initialization steps. Note: model 2 and 3 were written after analysing the call graph with pyre_dump_call_graph().

  1. def io.TextIOBase.write(self, __s: TaintInTaintOut[Updates[self]]): ...
  2. def open(file: TaintInTaintOut[LocalReturn]): ...
  3. def io.IOBase.__enter__(self: TaintInTaintOut[LocalReturn]): ...

Taint analysis
Following a similar approach to #795, I have instrumented the code with the function reveal_taint. These are my findings:

  • reveal_taint called immediately after the initialization outputFileA = os.path.join('/tmp', 'outputFileA.txt') shows that outputFileA has the expected backward taint and no forward taint. To my mind, this is correct, as it confirms that Pysa is rightly identifying the final sink in the updateDynamoDBTable function. The variable outputFileA is initialized with untainted literals, so it cannot have a forward taint.
  • reveal_taint called immediately before the line outputFileObjA.write(inputFileContentsList[0]) shows that outputFileObjA has no taint.
  • reveal_taint called immediately after the line outputFileObjA.write(inputFileContentsList[0]) shows that outputFileObjA has the taint propagated via inputFileContentsList. This is the result that I expected.
  • reveal_taint called immediately after the context manager execution confirms that outputFileObjA still has the taint acquired within the context manager body.

Note: The original source (not visible in the code provided) has a taint that is propagated via downloadPath, then inputFileObj and finally inputFileContentsList.

Conclusion
The variable outputFileObjA acquires the taint from the intended source within the context manager body, and Pysa detects in outputFileA the back-propagated taint from the final sink. Therefore, after adding context manager-specific models (see above), I expected Pysa to be able to identify the entire data flow, but I have a false negative instead. Thank you very much.

@arthaud
Copy link
Contributor

arthaud commented May 1, 2024

Hi @giusepperaffa, is this still an issue? Sorry for the (very) late reply.

@giusepperaffa
Copy link
Author

Hi @arthaud

Yes, this is still an issue. Your help would be greatly appreciated.

Thank you very much.

@arthaud
Copy link
Contributor

arthaud commented May 2, 2024

Hi @giusepperaffa, I need more information to be able to help you here.

I'm not quite sure which flow we are supposed to find. It is from outputFileA to the second parameter of updateDynamoDBTable? How is initialized outputFileA? Is this a parameter to the function? is it the result of calling something else? Depending on that answer, it might be a big in the forward analysis or the backward analysis. It could also be due to a missing model, or wrong call graph.

Could you try to make the function with that code as small as possible, then add a pyre_dump() call in there (could be anywhere), run pysa with -n (e.g, pyre -n analyze) and send me the full logs? This will allow me to quickly see what is wrong. Note: it might write a lot of logs, so you probably want to redirect the output to a file.

Also, if possible, use the latest version of pyre. For that, you can install pyre-check-nightly instead of pyre-check.

@giusepperaffa
Copy link
Author

Hi @arthaud, the diagram below provides a simplified call graph of my application:

graph TD;
Function_With_Source-->Function_With_Context_Manager
Function_With_Context_Manager-->updateDynamoDBTable

The following observations should further clarify my issue:

  • The expected data flow starts with a source within Function_With_Source, goes through the Function_With_Context_Manager, then goes through the first parameter of the function updateDynamoDBTable, and finally ends in another function called by updateDynamoDBTable. For simplicity, the call graph above does not show what happens within updateDynamoDBTable, because my analysis suggests that there are no taint propagation issues within the latter function.
  • outputFileA is initialized with os.path.join in the first line of the code shared when I raised this issue. The input arguments of os.path.join are two untainted string literals, and outputFileA is a string too. outputFileA is used to initialize outputFileObjA, and it is then passed as first input argument to the function updateDynamoDBTable.
  • I emphasize that outputFileA has the expected backward taint, and outputFileObjA has the expected forward taint, but the two partial data flows are not merged by Pysa. This means that the complete data flow is not detected.

As suggested, I have re-run the analysis with pyre_dump() after the context manger, but before the call to updateDynamoDBTable. The log file, which was created with the command pyre -n analyze &> ./context_manager_issue.log, is attached.
context_manager_issue.log

Please let me know whether this allows you to understand the issue. Otherwise, I will share the code in a zip archive.

I would leave the option of upgrading to the latest version of pyre as last resort. I had several installation issues, and I would prefer keeping the virtual environment as it is for consistency with what I have been doing.

@arthaud
Copy link
Contributor

arthaud commented May 14, 2024

Hi @giusepperaffa,

From what I understand, this is the flow that you are expecting:

downloadPath = source()
#              ^^^^^ tainted
outputFileA = os.path.join('/tmp', 'outputFileA.txt')
with open(downloadPath, 'r') as inputFileObj, open(outputFileA, 'w') as outputFileObjA:
    #                           ^^^^^^^^^^^^ tainted
    inputFileContentsList = inputFileObj.readlines()
    # ^^^^^^^^^^^^^^^^^^ tainted
    outputFileObjA.write(inputFileContentsList[0])
     # ^^^^^^^^^^^ tainted
# outputFileA is tainted
updateDynamoDBTable(outputFileA, outputFileB)
#                   ^^^^^^^^^^^^ sink

Unfortunately, Pysa won't be able to catch this flow. outputFileObjA is tainted, but we are not able to taint outputFileA since it is treated as a different value at this point.

This would be similar to:

x = MyClass()
y = f(x)
y.append(source())
sink(x)

After y = f(x), x and y are treated as independent variables. tainting one won't taint the other one. The code would need to be structured differently to catch this flow. For instance, if you could eliminate outputFileA and only use outputFileObjA, this could work.

@giusepperaffa
Copy link
Author

Hi @arthaud,

Thank you very much for your reply and for investigating this issue.

I have now understood the root cause of this issue, which can now be closed.

Thank you very much again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants