
No source of entropy in _get_nearest_mirrors_by_network_data() #562

Open
PhirePhly opened this issue Jun 24, 2022 · 7 comments

Comments

@PhirePhly
Contributor

https://github.com/AlmaLinux/mirrors/blob/mirrors_service/src/backend/api/handlers.py#L73

When a client matches the network data service cone for any mirror, the offered mirror list is built using the following logic:

  1. Append mirrors matching on ASN or subnet.
  2. Build a list of mirrors sorted by distance, filter out mirrors already matched by the first criterion, and extend the list up to the desired minimum (LENGTH_CLOUD_MIRRORS_LIST).

The issue is that this process is completely deterministic: there is no load balancing for clients that match any mirror's network service cone. No shuffling happens in either the initial list of suitable_mirrors or the geographically based extension of the list up to the minimum number of mirrors to offer.
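The selection flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual handler code: `matches_asn_or_subnet`, `distance`, the dict shapes, and the constant's value are hypothetical stand-ins.

```python
LENGTH_CLOUD_MIRRORS_LIST = 5  # assumed value; stands in for the real constant


def matches_asn_or_subnet(mirror, client):
    # Hypothetical stand-in for the real ASN/subnet matching logic.
    return mirror["asn"] == client["asn"]


def distance(mirror, client):
    # Hypothetical stand-in for the real geographic distance calculation.
    return abs(mirror["km"] - client["km"])


def get_nearest_mirrors_by_network_data(client, mirrors):
    # Step 1: mirrors matching on ASN or subnet, in a fixed
    # (typically database-insertion) order.
    suitable = [m for m in mirrors if matches_asn_or_subnet(m, client)]
    # Step 2: top up to the minimum length with the closest remaining
    # mirrors, sorted by distance -- fully deterministic, no shuffle
    # at either step.
    nearby = sorted((m for m in mirrors if m not in suitable),
                    key=lambda m: distance(m, client))
    needed = max(LENGTH_CLOUD_MIRRORS_LIST - len(suitable), 0)
    suitable.extend(nearby[:needed])
    return suitable
```

Because every step is a pure sort or filter, two clients in the same service cone always receive the identical ordered list.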

Failure modes:

  1. A particularly large install base of hosts relies on a local mirror. When that mirror goes offline, and before the health checkers have disqualified it, ALL clients on the local subnet are steered to the same next-closest public mirror.
  2. An EXTREMELY large install base of hosts with two local mirrors is unable to load-balance between them. The list of mirrors is always returned in the same order (probably the order in which they happened to be added to the mirror database).

Recommendation:

  1. Shuffle the list of mirrors matching on network data before the geographically based extension of the list.
  2. Generate a longer geographic extension of the list, shuffle it, and then use it to extend the original list. I.e. find the 2 * LENGTH_CLOUD_MIRRORS_LIST closest mirrors, shuffle that list, then suitable_mirrors.extend(nearby_mirrors_shuffled[:LENGTH_CLOUD_MIRRORS_LIST - len(suitable_mirrors)])
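The two recommendations could be combined roughly as below. This is a sketch under assumptions, not the project's actual code: the function name, the constant's value, and the input shapes (`suitable_mirrors` already network-matched, `mirrors_by_distance` already sorted nearest-first) are illustrative.

```python
import random

LENGTH_CLOUD_MIRRORS_LIST = 5  # assumed value for illustration


def balanced_mirror_list(suitable_mirrors, mirrors_by_distance):
    # 1. Shuffle the network-matched mirrors so equally suitable local
    #    mirrors share the load across requests.
    result = list(suitable_mirrors)
    random.shuffle(result)
    # 2. Draw a pool of twice the target length from the nearest mirrors,
    #    shuffle it, and use it to top the list up to the minimum length.
    needed = max(LENGTH_CLOUD_MIRRORS_LIST - len(result), 0)
    pool = [m for m in mirrors_by_distance[:2 * LENGTH_CLOUD_MIRRORS_LIST]
            if m not in result]
    random.shuffle(pool)
    result.extend(pool[:needed])
    return result
```

Successive calls still place the network-matched mirrors first, but both their internal order and the geographic tail now vary per request.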
@jonathanspw
Member

We do have https://github.com/AlmaLinux/mirrors/blob/mirrors_service/src/backend/api/handlers.py#L145 to randomize the geo-based ones a bit. Are you also wanting to have the network-based ones randomized?

I don't think failure mode 1 that you cited is really an issue due to the randomized closest public mirrors.

As for 2: I'm not aware of anyone using more than one mirror to serve a single network. Not that it couldn't happen, just seems like an odd use-case...though also an easy thing to just fix anyway.

@PhirePhly
Contributor Author

PhirePhly commented Jun 24, 2022

We do have L145 to randomize the geo-based ones a bit

That is only used if the client doesn't match any network data for any mirrors. When a client matches a network service cone for any mirror, we don't use the _get_nearest_mirrors_by_geo_data() function and instead append additional geographic mirrors as backup options after the suitable_mirrors, and that extension to the list isn't randomized.

https://github.com/AlmaLinux/mirrors/blob/mirrors_service/src/backend/api/handlers.py#L110

soksanichenko pushed a commit that referenced this issue Jul 1, 2022
- No source of entropy in _get_nearest_mirrors_by_network_data()
- _get_nearest_mirrors_by_network_data() fails to exclude near-by private mirrors for extra options.
- Exclude the private mirrors from the mirrors list in the case of fallback behavior

- The Azure mirrors have allowed list of arches
- Decrease level of logging messages in some cases
- Cache subnets of Azure/AWS cloud
@soksanichenko
Member

  1. The first try
[ec2-user@ip-172-31-90-25 ~]$ curl -H "X-Forwarded-For: 77.79.198.14" http://localhost/mirrorlist/8/appstream
http://centos.corp.cloudlinux.com/almalinux/8/AppStream/$basearch/os/
http://almalinux.slaskdatacenter.com/8/AppStream/$basearch/os/
http://mirror.alwyzon.net/almalinux/8/AppStream/$basearch/os/
http://mirrors.hostico.ro/almalinux/8/AppStream/$basearch/os/
http://mirror.vsys.host/almalinux/8/AppStream/$basearch/os/
  2. Redis cache is cleared
  3. The second try
[ec2-user@ip-172-31-90-25 ~]$ curl -H "X-Forwarded-For: 77.79.198.14" http://localhost/mirrorlist/8/appstream
http://centos.corp.cloudlinux.com/almalinux/8/AppStream/$basearch/os/
http://almalinux.slaskdatacenter.com/8/AppStream/$basearch/os/
http://mirror.niif.hu/almalinux/8/AppStream/$basearch/os/
http://almalinux.ip-connect.info/8/AppStream/$basearch/os/
http://almalinux.mirrors.itworxx.de/8/AppStream/$basearch/os/

soksanichenko pushed a commit that referenced this issue Jul 4, 2022
- No source of entropy in _get_nearest_mirrors_by_network_data()
- _get_nearest_mirrors_by_network_data() fails to exclude near-by private mirrors for extra options.
- Exclude the private mirrors from the mirrors list in the case of fallback behavior

- The Azure mirrors have allowed list of arches
- Decrease level of logging messages in some cases
- Cache subnets of Azure/AWS cloud
@soksanichenko
Member

The patch is deployed to production

@jonathanspw
Member

There are reports of this being broken. We might need to re-review it.

@jonathanspw jonathanspw reopened this Nov 22, 2023
@Yenya
Contributor

Yenya commented Nov 22, 2023

That would probably be me (maintainer of ftp.linux.cz mirror). Currently I see two problems:

  1. The lifetime of the mirrorlist is too long. I would expect several minutes, but I am getting days or more. Should the first mirror be broken (as it was with ftp.sh.cvut.cz recently), the clients have no easy way to get a correct repomd.xml, as dnf fetches it from the first mirror only.

I think the long lifetime of the mirrorlist is caused by repeated access extending the cache lifetime: when I wget the mirrorlist every two minutes or so, I get the same one for days, but when I stop for at least an hour from that IP address, the next wget gets a new, different mirrorlist. So I guess large networks behind NAT could get the same mirrorlist almost indefinitely. I don't know how the caching front-end works, but it could help if the backend generated the mirrorlist with some kind of Expires: strftime(..., now() + 600) header or something, so that the front-end has to prune the cache even though it is being repeatedly accessed.
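One possible shape for such headers, sketched with the Python standard library. The function name, the `MIRRORLIST_TTL` constant, and the idea of returning a header dict are assumptions; the real backend's response API may differ.

```python
import time
from email.utils import formatdate

MIRRORLIST_TTL = 600  # seconds; the ~10-minute lifetime suggested above


def mirrorlist_cache_headers(ttl=MIRRORLIST_TTL):
    # Headers telling the caching front-end to discard the entry after
    # `ttl` seconds, regardless of how often it is re-requested.
    return {
        "Cache-Control": f"public, max-age={ttl}",
        "Expires": formatdate(time.time() + ttl, usegmt=True),
    }
```

With an absolute Expires: (and max-age) set by the backend, repeated hits can no longer keep refreshing the same cached mirrorlist indefinitely.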

  2. When there are multiple mirrors matching asn: and/or subnets:, I think the most-specific match should win (longest prefix match in subnets:, and a mirror matching both subnets: and asn: should win over a mirror matching only asn:). See ftp.linux.cz (my own mirror) and ftp.sh.cvut.cz: both have the same asn: but different subnets:. I would expect clients from 147.251.0.0/16 to be directed to ftp.linux.cz.
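The most-specific-match ranking described above could look roughly like this, using the stdlib ipaddress module. The candidate-dict shape (`name`, `subnets`, `asn_matches`) is hypothetical, purely for illustration.

```python
import ipaddress


def pick_most_specific(client_ip, candidates):
    """Rank candidate mirrors: any subnet match beats an ASN-only match,
    and among subnet matches the longest prefix wins.

    candidates: list of dicts with 'name', 'subnets' (CIDR strings), and
    'asn_matches' (bool) -- a hypothetical structure for illustration.
    """
    ip = ipaddress.ip_address(client_ip)
    best_name, best_rank = None, (-1, -1)
    for c in candidates:
        # rank = (1, prefix_len) for a subnet match, (0, 0) for ASN-only
        rank = (-1, -1)
        for cidr in c["subnets"]:
            net = ipaddress.ip_network(cidr)
            if ip in net:
                rank = max(rank, (1, net.prefixlen))
        if rank == (-1, -1) and c["asn_matches"]:
            rank = (0, 0)
        if rank > best_rank:
            best_rank, best_name = rank, c["name"]
    return best_name
```

Under this ranking, a client in 147.251.0.0/16 would be steered to the mirror declaring that subnet even when several mirrors share the same ASN.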

Should I open a new issue for each of these two topics? Thanks,

-Yenya

@jonathanspw
Member

For point 1 you were correct: #1075

Fix should hit prod soon.
