
AppView stops receiving events from ozone #107

Open
Bossett opened this issue Apr 27, 2024 · 7 comments


Bossett commented Apr 27, 2024

This is very hard to describe, but the AppView stops processing labels, seemingly at random.

https://bsky.app/profile/did:plc:i5blflq3dvovxl6ser4ptcnq should be labelled by https://bsky.app/profile/profile-labels.bossett.social, and you can see this is the case at https://ozone.bossett.social/xrpc/com.atproto.label.queryLabels?uriPatterns=did:plc:i5blflq3dvovxl6ser4ptcnq - however the label does not appear in the AppView.

https://bsky.app/profile/did:plc:kwuoz6uqipkpstarclulkei3, labelled in exactly the same way a minute beforehand, works (see https://ozone.bossett.social/xrpc/com.atproto.label.queryLabels?uriPatterns=did:plc:kwuoz6uqipkpstarclulkei3) - these were emitted one right after the other (note the cursors).

https://ozone.bossett.social/xrpc/com.atproto.label.queryLabels?uriPatterns=did:plc:k3yuswqkhhr2eqppalpt6i67 (i.e. https://bsky.app/profile/did:plc:k3yuswqkhhr2eqppalpt6i67) is also not working - it is cursor 16, right before the working one.
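A quick way to compare the two sides is to query the labeler's own queryLabels endpoint and the public AppView's getProfile for the same DID. This is a hypothetical debugging sketch, not anything shipped with Ozone - the AppView host, the use of the `atproto-accept-labelers` header, and `LABELER_DID` are assumptions/placeholders:

```ts
// Hypothetical helper to compare what Ozone reports vs. what the AppView shows.
// Assumptions: public AppView at public.api.bsky.app, and that third-party
// labels only appear in getProfile when the labeler is named in the
// atproto-accept-labelers header. LABELER_DID is a placeholder.

const OZONE = "https://ozone.bossett.social";
const APPVIEW = "https://public.api.bsky.app";
const LABELER_DID = "did:plc:..."; // placeholder: the labeler's service DID

async function labelsFromOzone(did: string): Promise<string[]> {
  const res = await fetch(
    `${OZONE}/xrpc/com.atproto.label.queryLabels?uriPatterns=${encodeURIComponent(did)}`
  );
  if (!res.ok) throw new Error(`queryLabels: HTTP ${res.status}`);
  const body = (await res.json()) as { labels: Array<{ val: string }> };
  return body.labels.map((l) => l.val);
}

async function labelsFromAppView(did: string): Promise<string[]> {
  const res = await fetch(
    `${APPVIEW}/xrpc/app.bsky.actor.getProfile?actor=${encodeURIComponent(did)}`,
    { headers: { "atproto-accept-labelers": LABELER_DID } }
  );
  if (!res.ok) throw new Error(`getProfile: HTTP ${res.status}`);
  const body = (await res.json()) as { labels?: Array<{ val: string }> };
  return (body.labels ?? []).map((l) => l.val);
}

async function compare(did: string): Promise<void> {
  const [ozone, appview] = await Promise.all([
    labelsFromOzone(did),
    labelsFromAppView(did),
  ]);
  console.log(did, { ozone, appview });
}

compare("did:plc:i5blflq3dvovxl6ser4ptcnq"); // labelled in Ozone, missing in the AppView
compare("did:plc:kwuoz6uqipkpstarclulkei3"); // labelled and visible in both
```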

This is in a new ozone install - I deleted the service record from my account, and started from an empty ozone database to reproduce.

There doesn't seem to be any way to troubleshoot how the AppView, etc. connects to Ozone - nothing in the network requests, logs, etc. indicates anything is wrong, and everything I do have access to or insight into seems to be pretty healthy.

This is a super-vague report because everything appears totally correct: the service is up and running at https://ozone.bossett.social/, all the endpoints are functional, the certificates are correct, etc., and yet some items are working and some are not.



Bossett commented Apr 27, 2024

These are all showing labels now - is there a delay of some kind, or some other polling setup? I had waited many minutes before reporting (60+ minutes before I nuked the old install and started over).


Bossett commented Apr 27, 2024

OK, more digging - a few things seem to be happening:

  1. When it stops, outgoing traffic from Ozone drops to zero - no data goes over the websocket, but the connection is still alive. I got this by just watching the port on the router here, so I don't have HTTP diagnostics.
  2. Restarting Ozone seems to reliably fix the issue, unless events are emitted during the restart - this may be hitting a back-off-and-retry, but it doesn't seem to reliably come back, resulting in nothing going over the websocket.

I'm starting to suspect this is an issue with some behaviour in the TypeScript ws library, as it mirrors issues I've had with the firehose getting slower and slower and then stopping - my fix there was just to detect that my lag to events was increasing and restart; maybe something like that is needed here?
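That firehose workaround was essentially a silence/lag watchdog - something along the lines of the sketch below (illustrative TypeScript against the ws library; the subscription URL, thresholds, and reconnect delay are placeholders, and this is not Ozone code):

```ts
import WebSocket from "ws";

// Rough sketch of a lag watchdog: if nothing arrives for too long, assume the
// socket is wedged (even though it still looks open) and force a reconnect.

const SUBSCRIBE_URL =
  "wss://ozone.example.com/xrpc/com.atproto.label.subscribeLabels?cursor=0"; // placeholder
const MAX_SILENCE_MS = 60_000; // assumption: a healthy stream emits within a minute
const RECONNECT_DELAY_MS = 5_000;

let socket: WebSocket | undefined;
let lastEventAt = Date.now();

function connect(): void {
  const ws = new WebSocket(SUBSCRIBE_URL);
  ws.on("message", () => {
    lastEventAt = Date.now();
    // ...decode and process the label event here...
  });
  ws.on("close", () => setTimeout(connect, RECONNECT_DELAY_MS)); // naive reconnect
  ws.on("error", (err) => {
    console.error("socket error:", err);
    ws.terminate();
  });
  socket = ws;
}

connect();

// Watchdog: terminate() fires the "close" handler above, which reconnects.
setInterval(() => {
  if (Date.now() - lastEventAt > MAX_SILENCE_MS) {
    console.warn(`no events for ${MAX_SILENCE_MS}ms, restarting socket`);
    lastEventAt = Date.now(); // reset so we don't terminate in a tight loop
    socket?.terminate();
  }
}, 10_000);
```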


ericvolp12 commented Apr 30, 2024

FWIW, this is what label throughput looks like for your labeler on our end. I see a few patches where it drops to zero for a while, but we have no errors around them, and the only errors I've seen from your labeler are from sockets closing in what look like Ozone instance restarts.

Our label ingester retries connecting to registered labelers indefinitely, with a maximum backoff of one minute - it will never give up attempting to connect as long as you're a valid registered labeler.
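Roughly, that reconnect behaviour amounts to something like the following sketch of capped exponential backoff (illustrative TypeScript only, not the actual ingester code):

```ts
// Retry indefinitely, backing off exponentially but never waiting more than a
// minute between attempts.

const MAX_BACKOFF_MS = 60_000;

async function connectForever(connectOnce: () => Promise<void>): Promise<void> {
  let attempt = 0;
  for (;;) {
    try {
      await connectOnce(); // returns or throws when the connection ends
      attempt = 0;         // a successful session resets the backoff
    } catch (err) {
      console.warn("labeler connection failed:", err);
    }
    const backoffMs = Math.min(MAX_BACKOFF_MS, 1_000 * 2 ** attempt);
    attempt += 1;
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
}
```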

The drops to zero in the graph below might be what you were experiencing, but they're not on our end afaict. Labeler rate limits are 5/sec, 5k/hr, and 50k/day, but you were nowhere near those afaict.

(Screenshot: label throughput graph, captured 2024-04-29 21:25:22)

Here are the errors we saw from your labeler in a similar window:
(Screenshot: errors seen from the labeler)

All times are in Pacific Daylight Time (PDT)


Bossett commented Apr 30, 2024

Thanks, Jaz.

I'll keep watching it - the time periods match (though I had thought I was connected for a bit longer in the block before the big backfill there), but it hasn't recurred in the last ~24h.

If it doesn't break (unintentionally) again, I suspect this issue is better closed, and I can just re-open it if I can reliably repro; at least there's some ticket history if other people hit the same issue and we can 'triangulate'.


Bossett commented May 6, 2024

OK, it's happening for me right now - bouncing Ozone, recreating the service record, etc. all do nothing.

Could this be a cursor-catching-up issue? I would have had a lot of events when backfilling from the firehose outage, and I also check (against my PDS w/ header) whether a label has already been applied, to try to minimise repeats. The combination means a bunch of records were getting put out every minute for a bunch of accounts, and I wonder if that's been detected as abuse, or if there's some limit on how fast the AppView will consume from the Ozone firehose.

(Edit: and it's now working again - total time out was about 90 minutes since first recreating the service record. I'm homing in on it - it seems to break when there are either a lot of events or very few, so it may be some kind of proxying issue along the way that's trying to be helpful? And I am using fly.io now - but wasn't when this started. Very frustrating; I'll update here if I discover something, if it goes away, or if it coincides with anything major like the firehose outage.)


Bossett commented May 24, 2024

OK - just want to round this out: this doesn't seem to be a bsky/ozone bug, though this is the only place I've seen it.

Cloudflare doesn't like 'bursty' websocket traffic - over a certain threshold, the connection just hangs. Using the VM directly (i.e. turning off Cloudflare proxying) does not exhibit the issue (~7 days running); with proxying on, I could reliably predict that the connection would drop whenever I saw >100 labels being applied at once.

My theory is that there's some pause on the keepalive (or something like that) while the other side is reading from the websocket, so the traffic shifts to being one-way for a period of time. My key evidence is only that 'I saw that before once', though.
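If that theory holds, one common mitigation (a hypothetical sketch, not something Ozone necessarily does today) is an application-level ping/pong heartbeat on the emitting side, so the stream never looks idle or one-way to an intermediary - roughly like this, using the ws library:

```ts
import WebSocket, { WebSocketServer } from "ws";

// Heartbeat sketch: ping every 30s so some traffic flows in both directions
// even when the peer is only reading, and drop peers that stop answering.

interface TrackedSocket extends WebSocket {
  isAlive?: boolean;
}

const wss = new WebSocketServer({ port: 8080 }); // placeholder port

wss.on("connection", (conn) => {
  const ws = conn as TrackedSocket;
  ws.isAlive = true;
  ws.on("pong", () => {
    ws.isAlive = true; // peer answered the previous ping
  });
});

setInterval(() => {
  for (const conn of wss.clients) {
    const ws = conn as TrackedSocket;
    if (ws.isAlive === false) {
      ws.terminate(); // never answered: assume the connection is dead
      continue;
    }
    ws.isAlive = false;
    ws.ping();
  }
}, 30_000);
```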

I am going to mess around with Ozone and see if a different ws library helps; if it doesn't, this can probably be closed as 'not the fault of the software'.

bnewbold (Collaborator) commented

Definitely good to be aware of! Thanks for persevering and leaving a clear debugging log here for others to find.
