Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG - sev-2 - [iOS/Android/All] - Increase in session refresh error rates (which forces users to re-authenticate) #9615

Closed
5 tasks
TKDickson opened this issue Sep 18, 2024 · 23 comments
Assignees
Labels
blocked Ticket is blocked and can't be worked at this time bug Used to identify a bug ticket that will be worked through the bug process front-end Ticket requires front-end work global Issues for the global team QA Tickets require QA work / review scrubbed Added to newly-written bug tickets after QA has reviewed them for clarity and completeness sev-2 Mid-tier bug severity level based on QA bug severity scale - QA to assign this

Comments

@TKDickson
Copy link
Contributor

TKDickson commented Sep 18, 2024

What happened?

We received a report from folks in the Identity DSVA channel that there's been an increase in session refresh error rates, since about mid-July.
image.png
https://vagov.ddog-gov.com/metric/explorer?fromUser=true&start=1717444052000&end=1725392852000&paused=false

The user experience for this is logging into the app, and then (within 45 days, with biometrics enabled) being logged out of the app and having to use UN/PW to log in again. This has been reported by at least one production user.

Specs:

  • Device:
  • OS:
  • User Account:
  • Accessibility State:

Steps to Reproduce

unknown

Desired behavior

Should be maintaining login session tokens for the full 45 days, so folks can log in easier with biometrics

Acceptance Criteria

Bug Severity - BE SURE TO ADD THE SEVERITY LABEL

See Bug Tracking for details on severity levels

  • Impact: High
  • Frequency: Low
  • 2 - High

Linked to Story

Screen shot(s) and additional information

Full JSON response for services related to issue (expand/collapse)

Ticket Checklist

  • Steps to reproduce are defined
  • Desired behavior is added
  • Labels added (front-end, back-end, global, Health, relevant feature, accessibility, etc)
  • Estimate of 1 added (for front-end tickets)
  • Added either to the relevant feature epic (if found during new feature implementation), or the relevant team's bug epic (Global, Health & Benefits, API, QART)
@TKDickson TKDickson added bug Used to identify a bug ticket that will be worked through the bug process global Issues for the global team QA Tickets require QA work / review sev-2 Mid-tier bug severity level based on QA bug severity scale - QA to assign this labels Sep 18, 2024
@timwright12
Copy link
Contributor

Slack thread for additional context: https://dsva.slack.com/archives/C04LFGY2SJK/p1725392845790499

@TKDickson TKDickson added the scrubbed Added to newly-written bug tickets after QA has reviewed them for clarity and completeness label Sep 19, 2024
@TKDickson TKDickson added the front-end Ticket requires front-end work label Oct 24, 2024
@Sparowhawk Sparowhawk self-assigned this Oct 25, 2024
@Sparowhawk
Copy link
Contributor

There is a spike of logNonFatalErrorToFirebase(err, startBiometricsLogin: ${authNonFatalErrorString})

So looking through that non fatal I see that the spike started with 2.30 at least compared with the previous versions. ~780 vs previous versions of it being mid 200s and then exploding at 2.32 where it jumps to 23k for Android and iOS it was a slow ramp up starting in 2.24 ~300 and steadily gaining more each release and again exploding in 2.32 to 38k.
I went through all the prs for 2.30, 2.31, 2.32, and nothing changed that would account for that spike.
I did find that the logging for logNonFatalErrorToFirebase(err, startBiometricsLogin: ${authNonFatalErrorString})
occurs if refreshToken = await retrieveRefreshToken() fails
which means this method failed

const retrieveRefreshToken = async (): Promise<string | undefined> => {
console.debug('retrieveRefreshToken')
const result = await Promise.all([
AsyncStorage.getItem(REFRESH_TOKEN_ENCRYPTED_COMPONENT_KEY),
Keychain.getInternetCredentials(KEYCHAIN_STORAGE_KEY),
])
const reconstructedToken = result[0] && result[1] ? ${result[0]}.${result[1].password}.V0 : undefined

if (reconstructedToken) {
await logAnalyticsEvent(Events.vama_login_token_get(true))
} else {
await logAnalyticsEvent(Events.vama_login_token_get(false))
}

return reconstructedToken
}

@Sparowhawk
Copy link
Contributor

Looking further I found there is also a non fatal for saving the refresh token, but it’s numbers didn’t spike so that likely indicates that saving is done correctly.
Therefore I think the issue is contained to the method I posted above, which makes no sense if it was saved correctly as that code hasn’t changed.
So currently waiting on the data dog analytics for this, but the only thing I can think of is maybe the storage size of the nonce is greater than this comment. “Biometric storage has a max storage size of 384 bytes.”

@Sparowhawk
Copy link
Contributor

so I guess my next question is do we have a spike for vama_login_token_get with a value of false?
depending on if there is a spike or not means there could be a different problem
one would be with the upstream, the other would be with Keychain or Async Storage

@Sparowhawk
Copy link
Contributor

also knowing if there was a spike in vama_login_token_store being false would also be useful

@Sparowhawk
Copy link
Contributor

Taking a look at the libraries for Async Storage and Keychain nothing jumps out in their issues that would be causing this problem, but it could still be possible.

@Sparowhawk
Copy link
Contributor

I checked the version history for both Keychain and Async Storage for which version we used to see if it was possible we updated to an unstable version. Looks like Keychain and Async Storage both didn't change versions from when we had stable releases with out the spike and immediately after the spike.

This leads me to further believe that it isn't a library problem

We have updated both since then and there are currently new versions available for both that haven't been upgraded yet.

@Sparowhawk
Copy link
Contributor

Looking further at the logs I think this might lead to more for Android

Wrapped error: javax.crypto.IllegalBlockSizeException

iOS was not as helpful The user name or passphrase you entered is not correct.

@Sparowhawk
Copy link
Contributor

with that error I was able to find this thread oblador/react-native-keychain#262

and it looks like a solution is to specify

storage: Keychain.STORAGE_TYPE.AES, in the keychain options as biometrics picks one based off the settings and RSA and the other types sometimes throws this error but AES seems to be reliable

@rbontrager
Copy link
Contributor

@Sparowhawk Android is no longer behaving as expected. If you login to a staging user, turn on biometrics, minimize the app, and re-open it it is no longer asking for your biometrics before signing you in again.

@Sparowhawk
Copy link
Contributor

@theodur Waiting on Theo's upgrade ticket for the library so we can specify the biometrics AES option instead of letting the device pick

@rbontrager
Copy link
Contributor

@Sparowhawk Your fix fixed the original issue but brought on another issue. It looks like the biometric authentication is occurring during the first time login as well (when it shouldn't). Still just for android. So if you uninstall/re-install the app, login to staging, and press yes to use biometrics a biometric authentication popup is appearing in front of the onboarding screen.

@theodur
Copy link
Contributor

theodur commented Nov 14, 2024

@theodur Waiting on Theo's upgrade ticket for the library so we can specify the biometrics AES option instead of letting the device pick

@Sparowhawk The PR with the upgrade changes was merged yesterday

Edit: Ironically, v9.2.0 of react-native-keychain came out after I finished the work in that PR, so that's why it only upgraded the library to 9.1 instead of 9.2.1

@jennb33
Copy link

jennb33 commented Jan 8, 2025

1/8/2025 - The MFS team is taking this ticket on for Sprint work.
cc: @ATeal @donmccaughey

@matt-guest-wilcore
Copy link
Contributor

matt-guest-wilcore commented Jan 28, 2025

Judging from the comments on this ticket it looks like there has been work done on this issue, I'd like to see if it's still persisting. I don't have access to DataDog but if I could get confirmation that this issue is still occurring then that would be helpful.

Also, I don't have a physical device to test on but I've been simulating biometrics with the Android emulator and haven't run into any issues yet. Will keep testing.

@matt-guest-wilcore
Copy link
Contributor

Heard back from Joe and he confirmed this is still an issue, with a spike in errors occurring November and December of last year. He also mentioned how he had first-hand experience with the issue recently.

Adding chart for reference.

Image

@jennb33
Copy link

jennb33 commented Feb 3, 2025

2/3/2025 - Moving the ticket to blocked, pending Joe Niquette's return on 2/5/2025 (per @matt-guest-wilcore)

@matt-guest-wilcore
Copy link
Contributor

Based on the discussions here in this ticket, it looks like react-native-keychain could be the root cause. I noticed PR #10040 was opened to address this, but was never merged. However, per this comment, it seems to be a follow-up to PR #10061, which has already been merged.

I thinking that if #10040 is merged, it’ll likely resolve this issue.

@jennb33
Copy link

jennb33 commented Feb 6, 2025

2/6/2025 - Joe Niquette reproduced this error and wants to review with his team, then @matt-guest-wilcore will run the reproduction of the issue with the #10040

@jason-dehaan
Copy link

2/10/25 Matt to pair with Alex to test Android

@jason-dehaan
Copy link

2/11/25 Blocked until Matt and Alex can pair

@jason-dehaan
Copy link

2/12/25 Alex to take a shot at reproducing this. Matt cannot get the error to reproduce.

@matt-guest-wilcore
Copy link
Contributor

matt-guest-wilcore commented Feb 13, 2025

2/13/25 - We haven’t been able to reproduce #9615 ourselves, but based on this PR and discussions, we believe the issue is still tied to react-native-keychain. We’re planning to move this to peer review, get it merged, and then monitor analytics to see if it resolves the problem over time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Ticket is blocked and can't be worked at this time bug Used to identify a bug ticket that will be worked through the bug process front-end Ticket requires front-end work global Issues for the global team QA Tickets require QA work / review scrubbed Added to newly-written bug tickets after QA has reviewed them for clarity and completeness sev-2 Mid-tier bug severity level based on QA bug severity scale - QA to assign this
Projects
None yet
Development

No branches or pull requests

8 participants