
Out of Host Capacity error on node activation #29

Open
thompsonphys opened this issue Jun 10, 2019 · 2 comments

Comments

@thompsonphys
When submitting jobs through Slurm, the AMD nodes we've specified in limits.yml are not activating automatically. We then follow the instructions on the elastic scaling page to bring up a node manually and receive the error:

2019-06-10 11:29:37,108 startnode  ERROR    bm-standard-e2-64-ad1-0003:  problem launching instance: {'opc-request-id': 'E3D3A2D1DEB14B9C84CBB7FD6F2CA7B3/90862EA821FF46290B355B89CAE3A926/B4D05FEBD75749F988B1C201434A2A1C', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}

After trying to launch three node instances, we also get this error:

2019-06-10 11:32:07,976 startnode  ERROR    bm-standard-e2-64-ad1-0001:  problem launching instance: {'opc-request-id': '07BF5FF7021E4BD5B7580DF99C44D23F/F122163E1641F3594B94D09F1EB83A9E/5077118583A2A8872A52AF2492160373', 'code': 'TooManyRequests', 'message': 'Too many requests for the user', 'status': 429}

We've actually had one success in activating a node this way, but can't figure out why it worked in that particular case and not in others. Otherwise we are well below the node limit for our availability domain (AD). Any ideas?

@milliams
Member

Hi @thompsonphys,

The 500 error you see happens when Oracle have run out of physical machines to provide you with, regardless of whether your service limit is high enough.

The second error you see should not happen if you created your cluster more recently than 31 May, as that is when we added backoff and retry on 429 errors (clusterinthecloud/ansible#40). If you made your cluster before that date, could you try recreating it?
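For illustration only, here's a minimal sketch of that kind of backoff-and-retry around the OCI SDK's launch call, assuming a simple exponential delay; the actual change lives in clusterinthecloud/ansible#40 and may differ in detail:

import time

import oci


def launch_with_backoff(compute_client, launch_details, max_attempts=5):
    """Launch an instance, backing off and retrying on 429 responses.

    A 500 'Out of host capacity' error is re-raised straight away, since
    retrying won't help until Oracle has physical hosts free again.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return compute_client.launch_instance(launch_details)
        except oci.exceptions.ServiceError as err:
            if err.status == 429 and attempt < max_attempts:
                time.sleep(2 ** attempt)  # wait 2, 4, 8, ... seconds
                continue
            raise  # capacity errors and exhausted retries bubble up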

@thompsonphys
Author

Hi @milliams,

Thanks for the quick response. After looking at our Ansible logs, we believe we're on the most up-to-date commit (we initialized the cluster last Friday, June 6):

Starting Ansible Pull at 2019-06-06 10:38:26
/usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git --checkout=3 --inventory=/root/hosts management.yml
 [WARNING]: Could not match supplied host pattern, ignoring: mgmt
 [WARNING]: Your git version is too old to fully support the depth argument.
Falling back to full checkouts.
mgmt.subnet.clustervcn.oraclevcn.com | CHANGED => {
    "after": "2b0a76bd523a37cd60e43c343b6b6e3569519210", 
    "before": null, 
    "changed": true
}

One thing to note, though: we were attempting to bring the nodes up "manually" using, e.g.,

sudo scontrol update NodeName=bm-standard-e2-64-ad1-0001 State=POWER_UP

since they weren't activating automatically when a job was submitted, and this could be the reason for the second error. Here's our elastic.log output that led to the error:

2019-06-10 10:58:02,536 startnode  INFO     bm-standard-e2-64-ad1-0002: Starting
2019-06-10 10:58:03,783 startnode  INFO     bm-standard-e2-64-ad1-0002:  No VNIC attachment yet. Waiting...
2019-06-10 10:58:08,840 startnode  INFO     bm-standard-e2-64-ad1-0002:  No VNIC attachment yet. Waiting...
2019-06-10 10:58:14,027 startnode  INFO     bm-standard-e2-64-ad1-0002:   Private IP 10.1.0.5
2019-06-10 10:58:14,042 startnode  INFO     bm-standard-e2-64-ad1-0002:  Started
2019-06-10 11:27:23,249 startnode  INFO     bm-standard-e2-64-ad1-0001: Starting
2019-06-10 11:27:27,248 startnode  INFO     bm-standard-e2-64-ad1-0003: Starting
2019-06-10 11:27:32,298 startnode  INFO     bm-standard-e2-64-ad1-0004: Starting
2019-06-10 11:29:14,584 startnode  ERROR    bm-standard-e2-64-ad1-0001:  problem launching instance: {'opc-request-id': '8CC851AF29E440FBAAA3E1AA977DAF39/3D8570EDE9F365F4A251F66AC7E5C69D/2EA1C20544804FDC9862EBC3D8079656', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:29:37,108 startnode  ERROR    bm-standard-e2-64-ad1-0003:  problem launching instance: {'opc-request-id': 'E3D3A2D1DEB14B9C84CBB7FD6F2CA7B3/90862EA821FF46290B355B89CAE3A926/B4D05FEBD75749F988B1C201434A2A1C', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:29:44,097 startnode  ERROR    bm-standard-e2-64-ad1-0004:  problem launching instance: {'opc-request-id': '3C6061CCA87E48AEB2EA6CD54337095A/91C3637438037BA0656DBBB4C6059853/A4EFA370B17C6DEA35473DF758CE1F1E', 'code': 'InternalError', 'message': 'Out of host capacity.', 'status': 500}
2019-06-10 11:30:44,308 startnode  INFO     bm-standard-e2-64-ad1-0001: Starting
2019-06-10 11:32:07,976 startnode  ERROR    bm-standard-e2-64-ad1-0001:  problem launching instance: {'opc-request-id': '07BF5FF7021E4BD5B7580DF99C44D23F/F122163E1641F3594B94D09F1EB83A9E/5077118583A2A8872A52AF2492160373', 'code': 'TooManyRequests', 'message': 'Too many requests for the user', 'status': 429}

The first node started with no difficulties since it was available on your end, but subsequent calls resulted in the 500 error and eventually the 429 error. This morning we were able to load up the additional nodes using the same approach over the same timescale (calling all three back-to-back) and didn't encounter the 429 error.
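In case it's useful, here's a rough sketch of how we might space out those manual power-up calls instead of issuing them back-to-back; the node names are from our cluster above, and the 60-second delay is just a guess on our part:

import subprocess
import time

# Nodes we want to bring up; spacing out the requests avoids hitting the
# OCI API with back-to-back launch calls.
nodes = [
    "bm-standard-e2-64-ad1-0001",
    "bm-standard-e2-64-ad1-0003",
    "bm-standard-e2-64-ad1-0004",
]

for node in nodes:
    subprocess.run(
        ["sudo", "scontrol", "update", f"NodeName={node}", "State=POWER_UP"],
        check=True,
    )
    time.sleep(60)  # give each launch a head start before the next request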
