Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: SessionExpired issue #1057

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

AlmogBaku
Copy link

Introducing a SessionWrapper that is responsible for renewing the Neo4J driver and session in case of SessionExpired exception.

fixes #987

Introducing a SessionWrapper that is responsible for renewing the Neo4J driver and session in case of SessionExpired exeption.
@ramonpetgrave64
Copy link
Collaborator

Hey! Thanks for this. This could be a good solution, until the neo4j client libraries become more flexible with timeouts. Can you into potential performance impacts?

@achantavy
Copy link
Contributor

Hi @AlmogBaku, thank you for putting this together. Couple of questions:

  1. Can you provide an example stack trace of experiencing Session expired while scanning aws #987 with cartography? Asking because I am wondering if this only occurs in cases where session.run() is called, and I would be very interested to see a stack trace where it fails with session.write_transaction, because write_transaction's automatic retries have solved similar-looking problems for me.

  2. The Neo4j docs you linked say that max-connection-lifetime should be configured to avoid this, and cartography does expose this setting as mentioned here: Session expired while scanning aws #987 (comment). Has playing with this setting helped with Session expired while scanning aws #987 at all?

  3. Do you have a console log trace showing that this code change fixes the problem in Session expired while scanning aws #987?

@AlmogBaku
Copy link
Author

@ramonpetgrave64 - the performance impact is relatively minor since I only keep one instance and invalidate it upon failure.

@achantavy -

  1. My stack is Neo4j Aura (GCP) + k8s job for cartography(on EKS). Apparently, the AWS gateway is dropping the connection after a while of idling. Neo4j supports claims it also happens in other use-cases.
    From my research, I know it also happens with other hosted neo4j solutions.

I don't know about the session.write_transaction, but I guess it's the same because the driver is responsible for the TCP connection and it gets dropped.

  1. Changing the timeout doesn't help because the gateway might drop it. Neo4j support team also stated that putting up a "retrying mechanism" such as this is their recommendation

  2. I can't share the login trace because it also shares some sensitive data, but I tried this patch in production, and it works :) I'd be happy to provide more information via Slack DM

@achantavy
Copy link
Contributor

My stack is Neo4j Aura

Ah, I meant can you provide a log trace showing where the crash happens? This way we might be able to identify if the issue is more likely to show up in certain parts of the code.

I don't know about the session.write_transaction, but I guess it's the same because the driver is responsible for the TCP connection and it gets dropped.

session.write_transaction does automatic retries when the TCP connection gets dropped and session.run does not. cartography has a lot of older code that still calls session.run that we are trying to refactor, so if we know that only session.run is problematic, then we can solve this issue without adding additional complexity by maintaining our own driver reisntantiation logic.

I can't share the login trace because it also shares some sensitive data

Can you redact out the sensitive parts?

It would be helpful to see a failing log trace where the issue happens, and a successful log trace to show that the issue is solved.

Just for additional reference, we implemented max-connection-lifetime here: #535 and it helped solve timeouts with our NLBs. If you haven't already, I'd definitely recommend setting it to something less than 350s if you know that the connection is going over an AWS NLB (I don't know what Aura uses for hosting). I'd also be very interested in seeing the session expired crash even with this setting.

To summarize, the logic table I'd want to know is

set --neo4j-max-connection-lifetime to less than 350 seconds
Did it crash? (yes/no)
If yes, provide log trace of the crash without confidential information

Transient crashes are frustrating and I definitely appreciate your help; I'm asking all these questions to gather data before deciding to own additional complexity (imo this kind of thing really should be handled by the neo4j driver :p)

@AlmogBaku
Copy link
Author

  1. Yes, I tried to reduce the conn lifetime, same error
  2. The error log is very similar to Session expired while scanning aws #987. If still needed - I'll be able to reproduce it next week and send you the full log
  3. I totally agree - it would have been better if the driver had handled it, but it seems like Neo4j team disagree 😖

@achantavy
Copy link
Contributor

Hi @AlmogBaku, please send the log over, or if that is not possible then please reply back with what exact cartography function was running when it crashed (e.g. was it running load() in EC2 instances, or load() in IAM, or...?).

It'd also be helpful to know if it always crashes in that same spot or if there are multiple places (if you have multiple crash dumps, then great, send them all if you can!).

I'd really like to know you're seeing this crash as a result of session.run() or write_transaction().

@AlmogBaku
Copy link
Author

AlmogBaku commented Jan 1, 2023

@achantavy
Copy link
Contributor

Thanks @AlmogBaku for sharing the log, this is helpful. I'd like a bit of time to play with this a bit before merging. Thanks for your paitience. cc: @ramonpetgrave64 @mpurusottamc FYI.

@harshagw just wondering, how difficult would it be to combine the neo4j session factory that you wrote with the session wrapper shown in this PR?

@AlmogBaku
Copy link
Author

any update?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Session expired while scanning aws
3 participants