Manuals

OpenStack Nova API for marking host down

What the API is for

This API gives an external fault monitoring system the ability to tell OpenStack Nova quickly that a compute host is down. This immediately enables evacuation of any VMs on the host and thus faster HA actions.

What this API does

In OpenStack, the state of the nova-compute service can represent the state of the compute host, and this new API is used to force this service down. It is assumed that the caller of this API has made sure the host is also fenced or powered down. This is important so that there is no chance the same VM instance appears twice after being evacuated to a new compute host. When the host is recovered by any means, the external system is responsible for calling the API again to disable the forced_down flag, letting the host's nova-compute service report the host as up again. If a network-fenced host comes up again, it should not boot the VMs it had if it finds out they have been evacuated to another compute host. The decision of whether to delete or boot the VMs that used to be on the host should later be made more reliable by the Nova blueprint: https://blueprints.launchpad.net/nova/+spec/robustify-evacuate

REST API for forcing down:

Parameter explanations:

  • tenant_id: Identifier of the tenant.
  • binary: Compute service binary name.
  • host: Compute host name.
  • forced_down: Compute service forced down flag.
  • token: Token received after successful authentication.
  • service_host_ip: IP address of the serving controller node.

request:

PUT /v2.1/{tenant_id}/os-services/force-down
{
  "binary": "nova-compute",
  "host": "compute1",
  "forced_down": true
}

response:

200 OK
{
  "service": {
    "host": "compute1",
    "binary": "nova-compute",
    "forced_down": true
  }
}

Example:

curl -g -i -X PUT http://{service_host_ip}:8774/v2.1/{tenant_id}/os-services/force-down \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "X-OpenStack-Nova-API-Version: 2.11" \
  -H "X-Auth-Token: {token}" \
  -d '{"binary": "nova-compute", "host": "compute1", "forced_down": true}'
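For illustration, here is a minimal Python sketch of the same call using the requests library; it is not part of Nova itself, and the SERVICE_HOST_IP, TENANT_ID, and TOKEN values are placeholders that must come from your own deployment and Keystone authentication. The same helper also clears the flag (used in the "disabling forced down" section below) by passing forced_down=False.

# Minimal sketch, assuming placeholder endpoint and credentials.
import requests

SERVICE_HOST_IP = "192.0.2.10"   # placeholder: serving controller node IP
TENANT_ID = "tenant"             # placeholder: tenant identifier
TOKEN = "token"                  # placeholder: Keystone token

def set_forced_down(host, forced_down):
    """Set or clear the forced_down flag of a nova-compute service."""
    url = ("http://%s:8774/v2.1/%s/os-services/force-down"
           % (SERVICE_HOST_IP, TENANT_ID))
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "X-OpenStack-Nova-API-Version": "2.11",  # force-down requires >= 2.11
        "X-Auth-Token": TOKEN,
    }
    body = {"binary": "nova-compute", "host": host, "forced_down": forced_down}
    resp = requests.put(url, headers=headers, json=body)
    resp.raise_for_status()
    return resp.json()["service"]["forced_down"]

# Mark compute1 as down after it has been fenced or powered down:
set_forced_down("compute1", True)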

CLI for forcing down:

nova service-force-down <hostname> nova-compute

Example: nova service-force-down compute1 nova-compute

REST API for disabling forced down:

Parameter explanations: same as for forcing down above.

request:

PUT /v2.1/{tenant_id}/os-services/force-down
{
  "binary": "nova-compute",
  "host": "compute1",
  "forced_down": false
}

response:

200 OK
{
  "service": {
    "host": "compute1",
    "binary": "nova-compute",
    "forced_down": false
  }
}

Example:

curl -g -i -X PUT http://{service_host_ip}:8774/v2.1/{tenant_id}/os-services/force-down \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "X-OpenStack-Nova-API-Version: 2.11" \
  -H "X-Auth-Token: {token}" \
  -d '{"binary": "nova-compute", "host": "compute1", "forced_down": false}'

CLI for disabling forced down:

nova service-force-down --unset <hostname> nova-compute

Example: nova service-force-down --unset compute1 nova-compute
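The same operations are also available programmatically through python-novaclient. The following is a hedged sketch; the Keystone credentials and URL are placeholders for your own deployment:

# Minimal sketch, assuming placeholder Keystone credentials.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client

auth = v3.Password(auth_url="http://192.0.2.10:5000/v3",  # placeholder
                   username="admin", password="secret",
                   project_name="admin",
                   user_domain_id="default", project_domain_id="default")
sess = session.Session(auth=auth)
nova = client.Client("2.11", session=sess)  # microversion 2.11 or later

# Equivalent of 'nova service-force-down compute1 nova-compute':
nova.services.force_down("compute1", "nova-compute", True)

# Equivalent of 'nova service-force-down --unset compute1 nova-compute':
nova.services.force_down("compute1", "nova-compute", False)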

Get valid server state

Problem description

Previously, when the owner of a VM queried the state of his VMs, he did not receive enough state information, states did not change fast enough in the VIM, and they were not accurate in some scenarios. This change closes that gap.

A typical case: when a host develops a fault, the user of a high-availability service running on top of that host needs to make an immediate switchover from the faulty host to an active standby host. If the compute host is forced down [1] as a result of that fault, the user has to be notified about this state change so that he can react accordingly. Similarly, a change of the host state to "maintenance" should also be communicated to the users.

What is changed

A new host_status parameter is added to the /servers/{server_id} and /servers/detail endpoints in microversion 2.16. With this new parameter, the user can get additional state information about the host.

Possible host_status values, where each value can override the ones before it in the list:

  • UP if nova-compute is up.
  • UNKNOWN if the nova-compute status was not reported by the servicegroup driver within the configured time period. The default is 60 seconds, but this can be changed with service_down_time in nova.conf.
  • DOWN if nova-compute was forced down.
  • MAINTENANCE if nova-compute was disabled. MAINTENANCE in the API directly means the nova-compute service is disabled. Different wording is used to avoid the impression that the whole host is down, as only the scheduling of new VMs is disabled.
  • An empty string indicates that the server has no host.

host_status is returned in the response if the policy permits. By default, the policy allows it for admin only in the Nova policy.json:

"os_compute_api:servers:show:host_status": "rule:admin_api"

For an NFV use case, this also has to be enabled for the owner of the VM:

"os_compute_api:servers:show:host_status": "rule:admin_or_owner"

REST API examples:

Case where nova-compute is enabled and reporting normally:

GET /v2.1/{tenant_id}/servers/{server_id}

200 OK
{
  "server": {
    "host_status": "UP",
    ...
  }
}

Case where nova-compute is enabled, but not reporting normally:

GET /v2.1/{tenant_id}/servers/{server_id}

200 OK
{
  "server": {
    "host_status": "UNKNOWN",
    ...
  }
}

Case where nova-compute is enabled, but forced_down:

GET /v2.1/{tenant_id}/servers/{server_id}

200 OK
{
  "server": {
    "host_status": "DOWN",
    ...
  }
}

Case where nova-compute is disabled:

GET /v2.1/{tenant_id}/servers/{server_id}

200 OK
{
  "server": {
    "host_status": "MAINTENANCE",
    ...
  }
}
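As a sketch of how a consumer could use these responses, the following polls host_status over the REST API and maps each value to a reaction; the URL pieces, token, and reaction messages are placeholders invented for this example:

# Minimal sketch, assuming placeholder endpoint and token.
import requests

def get_host_status(base_url, tenant_id, server_id, token):
    resp = requests.get(
        "%s/v2.1/%s/servers/%s" % (base_url, tenant_id, server_id),
        headers={"X-Auth-Token": token,
                 "X-OpenStack-Nova-API-Version": "2.16"})
    resp.raise_for_status()
    # host_status is present only if the policy permits it for this user.
    return resp.json()["server"].get("host_status", "")

def react(status):
    if status == "DOWN":
        print("host forced down: switch over to the standby instance")
    elif status == "UNKNOWN":
        print("host state unknown: treat the instance as at risk")
    elif status == "MAINTENANCE":
        print("scheduling disabled on host: no immediate action needed")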

Host Status is also visible in python-novaclient:

+-------+------+--------+------------+-------------+----------+-------------+
| ID    | Name | Status | Task State | Power State | Networks | Host Status |
+-------+------+--------+------------+-------------+----------+-------------+
| 9a... | vm1  | ACTIVE | -          | RUNNING     | xnet=... | UP          |
+-------+------+--------+------------+-------------+----------+-------------+
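Reading Host Status programmatically through python-novaclient can be sketched as follows; the nova client object is created as in the earlier force-down sketch, but with API version "2.16" so that host_status is included:

# Minimal sketch; 'nova' is a novaclient Client("2.16", ...) object.
for server in nova.servers.list():
    # host_status appears as a server attribute when the microversion
    # and policy permit; default to an empty string otherwise.
    print(server.name, getattr(server, "host_status", ""))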

Fault Correlation

In this section you can find some NFV use cases and understand how they can be addressed by the Doctor framework. The use cases describe different examples of fault correlation:

  • Between a physical resource and another physical resource
  • Between a physical resource and a virtual resource
  • Between a virtual resource and another virtual resource

Physical Switch Failure

This use case demonstrates fault correlation between two physical resources: a switch and a host. It also demonstrates the effect that this failure has on the virtual resources (instances).

A failure of a physical switch results in the attached host becoming unreachable. All instances running on this host also become unreachable, and the applications using these instances either fail or are no longer highly available. In the world of NFV, it is critical to identify this fault as fast as possible and take corrective actions. By the time the VNFM notices the problem in the application, it might be too late.

The Doctor architecture handles this use case by providing a fast alarm propagation from the switch to the host and to the instances running on it.

  • The Monitor detects a fault in the physical switch and sends an event to the Inspector.
  • The Inspector identifies the affected resources (in this case, the host) based on the resource topology.
  • The Inspector notifies the Controller (in this case, Nova) that the host is down.
  • The Controller sends a notification to the Notifier.
  • Optionally, the Inspector notifies the Notifier directly about all the affected resources (host and instances).
  • The Notifier notifies the Consumer/VNFM, which can take corrective actions such as performing a switchover or evacuating the failed host (a sketch of the Inspector step follows below).
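To make the flow concrete, here is an illustrative sketch of the Inspector step, reusing the set_forced_down() helper from the force-down sketch earlier in this document; the topology map and event format are assumptions invented for this example, not the Doctor reference implementation:

# Hypothetical switch -> hosts topology map; a real Inspector derives
# this from its resource topology database.
SWITCH_TO_HOSTS = {"switch1": ["compute1"]}

def on_monitor_event(event):
    # Assumed event format: {"type": ..., "resource": ...}
    if event["type"] != "switch.failure":
        return
    for host in SWITCH_TO_HOSTS.get(event["resource"], []):
        # The host must already be fenced or powered down before the
        # forced_down flag is set (see "What this API does" above).
        set_forced_down(host, True)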

The result of using the Doctor framework is that the Consumer/VNFM can prevent the end user from being affected by the fault, e.g., by performing a switchover from the active instance to the standby one in less than one second.

Physical Port Failure – HA Use Case

This use case demonstrates fault correlation between a physical resource (a port), a virtual resource (a bridge), and other virtual resources (instances). It also demonstrates how the Doctor framework can be used to support HA scenarios and avoid a single point of failure when one of two ports fails.

It is similar to the Physical Switch Failure use case, but a bit more complex. Here, the network type is used to determine the relationship between an instance and the bridges it uses. A failure of a physical port affects some of the instances, but not all of them.

A short description of the topology: a bridge bonds two physical ports. Several bridges may be connected to one another, and the traffic that goes through them depends on the network type of each instance (VLAN, VXLAN, etc.). In case of a physical port failure, the Inspector should warn that instances using this bridge are at risk of becoming unreachable. If both ports of the bridge fail, it is a critical error.

This would be the flow:

  • The Monitor detects a fault in the physical port and sends an event to the Inspector.
  • The Inspector identifies the affected resources (in this case, the bridge and the instances that use it for their network traffic) based on the resource topology.
  • The Inspector notifies the Notifier that these resources are at risk of becoming unreachable.
  • The Notifier notifies the Consumer/VNFM, which can take preventive actions such as performing a switchover to the standby instances (a minimal sketch of the correlation step follows below).
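The correlation step can be sketched as follows; the in-memory topology tables and names are placeholders invented for this example (a real Inspector derives them from its resource topology):

# Hypothetical topology: each bridge bonds two physical ports, and
# each instance is mapped to the bridge carrying its traffic.
BRIDGE_PORTS = {"br0": {"eth0", "eth1"}}
BRIDGE_INSTANCES = {"br0": ["vm1", "vm2"]}
failed_ports = set()

def on_port_failure(port):
    failed_ports.add(port)
    for bridge, ports in BRIDGE_PORTS.items():
        if port not in ports:
            continue
        # One failed port puts instances at risk; both failed is critical.
        severity = "critical" if ports <= failed_ports else "at risk"
        for instance in BRIDGE_INSTANCES.get(bridge, []):
            print("%s: instance %s via bridge %s" % (severity, instance, bridge))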

The result of using the Doctor framework is that the Consumer/VNFM can be proactive and prevent a failure before it happens, e.g., by performing a switchover from the active instance to the standby one as a precaution.