Design Documents

This directory stores design documents, which may include draft versions of blueprints written before they are proposed to upstream OSS communities such as OpenStack, in order to keep the original blueprint as reviewed in OPNFV. As a result, some blueprints may be out of date due to further refinement in the upstream OSS community. Please refer to the link in each document to find the latest version of the blueprint and the status of development in the relevant OSS community.

See also https://wiki.opnfv.org/requirements_projects .

Note

This is a specification draft of a blueprint proposed for OpenStack Nova Liberty. It was written by project member(s) and agreed within the project before being submitted upstream. No further changes to its content will be made here; please follow it upstream:

The original draft is as follows:

Report host fault to update server state immediately

https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately

A new API is needed to report a host fault and change the state of the instances and compute node immediately. This allows the evacuate API to be used without delay. The new API makes it possible for an external monitoring system to detect any kind of host failure fast and reliably and to inform OpenStack about it. Nova then updates the compute node state and the states of the instances, so that the states in the Nova DB stay in sync with the real state of the system.

Problem description

  • Nova state change for a failed or unreachable host is slow and does not reliably indicate whether the compute node is down or not. This might cause the same instance to run twice if an action is taken to evacuate the instance to another host.
  • Nova state for instances on a failed compute node will not change, but remains active and running. This gives the user false information about the instance state. Currently one would need to call “nova reset-state” for each instance to put them into error state.
  • An OpenStack user cannot take HA actions fast and reliably by trusting the instance state and compute node state.
  • As the compute node state changes slowly, one cannot evacuate instances.

Use Cases

The general use case is that, in case of a host fault, the compute node state should be changed fast and reliably when using the DB servicegroup backend. On top of this, the following use cases are not currently covered in terms of having instance states changed correctly:

  • Management network connectivity lost between controller and compute node.
  • Host HW failed.

Generic use case flow:

  • The external monitoring system detects a host fault.
  • The external monitoring system fences the host if it is not down already.
  • The external system calls the new Nova API to force the failed compute node into down state, as well as the instances running on it.
  • Nova updates the compute node state and the state of the affected instances in the Nova DB.

Currently the nova-compute state will change to “down”, but it takes a long time. The server state remains “vm_state: active” and “power_state: running”, which is not correct. By having an external tool detect host faults fast, fence the host by powering it down and then report the host down to OpenStack, all these states would reflect the actual situation. Also, if OpenStack does not implement automatic actions for fault correlation, an external tool can do that. This could easily be configured, for example, in server instance METADATA and be read by the external tool.

Project Priority

Liberty priorities have not yet been defined.

Proposed change

There needs to be a new API for the admin to state that a host is down. This API is used to mark the compute node and the instances running on it as down, to reflect the real situation.

An example for a compute node:

  • When the compute node is up and running: vm_state: active and power_state: running, nova-compute state: up, status: enabled
  • When the compute node goes down and the new API is called to state the host is down: vm_state: stopped, power_state: shutdown, nova-compute state: down, status: enabled

The vm_state values soft-delete, deleted, resized and error should not be touched. Whether task_state needs to be touched, and how, still needs to be worked out.

Alternatives

There is no attractive alternative for detecting all the different host faults other than having an external tool detect them. For such a tool to exist, there needs to be a new API in Nova to report the fault. Currently some kind of workaround must have been implemented, as one cannot trust or get the states from OpenStack fast enough.

Data model impact

None

REST API impact

  • Update CLI to report host is down

    nova host-update command

    usage: nova host-update [--status <enable|disable>]
                            [--maintenance <enable|disable>]
                            [--report-host-down] <hostname>

    Update host settings.

    Positional arguments

    <hostname> Name of host.

    Optional arguments

    --status <enable|disable> Either enable or disable a host.

    --maintenance <enable|disable> Either put or resume host to/from maintenance.

    --down Report host down to update instance and compute node state in db.

  • Update Compute API to report host is down:

    /v2.1/{tenant_id}/os-hosts/{host_name}

    Normal response codes: 200

    Request parameters

    Parameter   Style   Type         Description
    host_name   URI     xsd:string   The name of the host of interest to you.

    Request body:

    {
        "host": {
            "status": "enable",
            "maintenance_mode": "enable",
            "host_down_reported": "true"
        }
    }

    Response body:

    {
        "host": {
            "host": "65c5d5b7e3bd44308e67fc50f362aee6",
            "maintenance_mode": "enabled",
            "status": "enabled",
            "host_down_reported": "true"
        }
    }

  • A new method in the nova.compute.api module HostAPI class to mark the host-related instances and the compute node down: set_host_down(context, host_name)

  • The novaclient.v2.hosts.HostManager(api) method update(host, values) needs to handle reporting host down (see the sketch after this list).

  • The schema does not need changes, as only service and server states are to be changed in the db.
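
To illustrate how these pieces fit together, here is a minimal, hypothetical sketch of an external monitoring tool reporting a failed host through python-novaclient once the proposed change exists. The 'host_down_reported' value key is taken from this spec and is not part of the current client; authentication details are placeholders.

# Hypothetical sketch of an external monitoring tool reporting a failed
# host through the proposed API. The 'host_down_reported' value comes
# from this spec and does not exist in current python-novaclient.
from novaclient import client as nova_client

def report_host_down(session, hostname):
    # 'session' is an authenticated keystoneauth1 session (placeholder).
    nova = nova_client.Client('2.1', session=session)

    # Equivalent to: PUT /v2.1/{tenant_id}/os-hosts/{host_name}
    # with body {"host": {"host_down_reported": "true"}}
    nova.hosts.update(hostname, {'host_down_reported': 'true'})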

Security impact

API call needs admin privileges (in the default policy configuration).

Notifications impact

None

Other end user impact

None

Performance Impact

The only impact is that the user can get information about the instance and compute node state faster. This also makes faster evacuation possible. There is no impact that would slow anything down. Host down should be a rare occurrence.

Other deployer impact

A deployer can make use of any external tool to detect host faults and report them to OpenStack.

Developer impact

None

Implementation

Assignee(s)

Primary assignee: Tomi Juvonen
Other contributors: Ryota Mibu

Work Items

  • Test cases.
  • API changes.
  • Documentation.

Dependencies

None

Testing

Test cases that exist for enabling a host or putting it into maintenance should be altered, or similar new cases created, to test the new functionality.

Documentation Impact

The new API needs to be documented.

References

Notification Alarm Evaluator

Note

This is a spec draft of a blueprint for OpenStack Ceilometer Liberty. To see the current version: https://review.openstack.org/172893 To track development activity: https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator

https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator

This blueprint proposes to add a new alarm evaluator for handling alarms on events passed from other OpenStack services. It provides event-driven alarm evaluation, which introduces a new sequence in Ceilometer instead of the polling-based approach of the existing Alarm Evaluator, and realizes immediate alarm notification to end users.

Problem description

As an end user, I need to receive an alarm notification immediately once Ceilometer captures an event which would make an alarm fire, so that I can perform recovery actions promptly and shorten the downtime of my service. The typical use case is that an end user sets an alarm on “compute.instance.update” in order to trigger recovery actions once the instance status has changed to ‘shutdown’ or ‘error’. It would be nice if an end user could receive the notification within 1 second after the fault is observed, as other health-check mechanisms can do in some cases.

The existing Alarm Evaluator periodically queries/polls the databases in order to check all alarms, independently from other processes. This is a good approach for evaluating an alarm on samples stored over a certain period. However, it is not efficient for evaluating an alarm on events which are emitted by other OpenStack services once in a while.

The periodic evaluation leads to a delay in sending alarm notifications to users. The default evaluation cycle is 60 seconds. It is recommended that an operator set an interval longer than the configured pipeline interval for the underlying metrics, and also long enough to evaluate all defined alarms within that period, taking into account the number of resources, users and alarms.

Proposed change

The proposal is to add a new event-driven alarm evaluator which receives messages from the Notification Agent, finds the related alarms and then evaluates each alarm:

  • The new alarm evaluator could receive event notifications from the Notification Agent by adding a dedicated notifier as a publisher in pipeline.yaml (e.g. notifier://?topic=event_eval).
  • When the new alarm evaluator receives an event notification, it queries the alarm database by the Project ID and Resource ID written in the event notification.
  • The alarms found are evaluated against the event notification.
  • Depending on the result of the evaluation, those alarms are fired through the Alarm Notifier, in the same way as the existing Alarm Evaluator does.

This proposal also adds a new alarm type “notification” and a “notification_rule”. This enables users to create alarms on events. The separation from other alarm types (such as the “threshold” type) is intended to reflect the different timing of evaluation and the different format of the condition, since the new evaluator checks each event notification as soon as it is received, whereas a “threshold” alarm can evaluate the average of values over a certain period calculated from multiple samples.

The new alarm evaluator handles the notification type alarms, so the existing alarm evaluator has to be changed to exclude “notification” type alarms from its evaluation targets.
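
The evaluation flow described above can be sketched as follows. This is an illustration only, under the assumption of simplified helper interfaces ('storage' for the alarm database, 'notifier' for the Alarm Notifier); it is not Ceilometer code.

# Illustrative sketch of the proposed event-driven evaluation flow.
# 'storage' and 'notifier' stand for the alarm database and the alarm
# notifier; their interfaces here are assumptions, not Ceilometer APIs.

def evaluate_event(event, storage, notifier):
    project_id = event.get('project_id')
    traits = event.get('traits', {})
    resource_id = traits.get('resource_id')

    # 1. Find "notification" type alarms related to this event.
    alarms = storage.get_alarms(project_id=project_id,
                                resource_id=resource_id,
                                type='notification')

    # 2. Evaluate each alarm against the event notification.
    for alarm in alarms:
        rule = alarm['notification_rule']
        if event.get('event_type') != rule['event_type']:
            continue
        if all(_matches(traits, q) for q in rule['query']):
            # 3. Fire the alarm through the Alarm Notifier,
            #    as the existing evaluator does.
            notifier.notify(alarm, reason='event matched notification_rule')

def _matches(traits, query):
    # Only the 'eq' operator is shown for brevity.
    value = traits.get(query['field'].replace('traits.', ''))
    return query['op'] == 'eq' and str(value) == query['value']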

Alternatives

There was a similar blueprint proposal, “Alarm type based on notification”, but the approach is different. The old proposal was to add a new step (alarm evaluation) in the Notification Agent every time it receives an event from other OpenStack services, whereas this proposal intends to execute alarm evaluation in a separate component, which minimizes the impact on the existing pipeline processing.

Another approach is enhancing the existing alarm evaluator by adding a notification listener. However, there are two issues: 1) this approach could cause a stall of the periodic evaluations when it receives a bulk of notifications, and 2) it could break the alarm partitioning, i.e. when the alarm evaluator receives a notification, it might have to evaluate alarms which are not assigned to it.

Data model impact

Resource ID will be added to the Alarm model as an optional attribute. This helps the new alarm evaluator filter out non-related alarms while querying alarms; otherwise it would have to evaluate all alarms in the project.

REST API impact

The Alarm API will be extended as follows:

  • Add “notification” type into alarm type list
  • Add “resource_id” to “alarm”
  • Add “notification_rule” to “alarm”

Sample data of Notification-type alarm:

{
    "alarm_actions": [
        "http://site:8000/alarm"
    ],
    "alarm_id": null,
    "description": "An alarm",
    "enabled": true,
    "insufficient_data_actions": [
        "http://site:8000/nodata"
    ],
    "name": "InstanceStatusAlarm",
    "notification_rule": {
        "event_type": "compute.instance.update",
        "query" : [
            {
                "field" : "traits.state",
                "type" : "string",
                "value" : "error",
                "op" : "eq",
            },
        ]
    },
    "ok_actions": [],
    "project_id": "c96c887c216949acbdfbd8b494863567",
    "repeat_actions": false,
    "resource_id": "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
    "severity": "moderate",
    "state": "ok",
    "state_timestamp": "2015-04-03T17:49:38.406845",
    "timestamp": "2015-04-03T17:49:38.406839",
    "type": "notification",
    "user_id": "c96c887c216949acbdfbd8b494863567"
}

The “resource_id” will only be referred to when querying alarms; its permission and project ownership will not be checked.

Security impact

None

Pipeline impact

None

Other end user impact

None

Performance/Scalability Impacts

When Ceilometer receives a large number of events from other OpenStack services in a short period, this alarm evaluator can keep working since events are queued in a messaging queue system, but it can cause a delay in alarm notifications to users and increase the number of read and write accesses to the alarm database.

“resource_id” can be optional, but making it mandatory could reduce the performance impact. If a user creates a “notification” alarm without “resource_id”, that alarm will be evaluated every time an event occurs in the project. That may put a heavy load on the new evaluator.

Other deployer impact

A new service process has to be run.

Developer impact

When defining events and traits, developers should be aware that events could be notified to end users, and should avoid passing raw infrastructure information to end users.

Implementation

Assignee(s)

Primary assignee:
r-mibu
Other contributors:
None
Ongoing maintainer:
None

Work Items

  • New event-driven alarm evaluator
  • Add new alarm type “notification” as well as AlarmNotificationRule
  • Add “resource_id” to Alarm model
  • Modify existing alarm evaluator to filter out “notification” alarms
  • Add a new config parameter to the alarm request check that controls whether alarms without “resource_id” are accepted

Future lifecycle

This proposal is a key feature for providing information about cloud resources to end users in real time, which enables efficient integration with a user-side manager or orchestrator, whereas currently such information is considered to be consumed by admin-side tools or services. Based on this change, we will seek orchestration scenarios including fault recovery, and add useful event definitions as well as additional traits.

Dependencies

None

Testing

New unit/scenario tests are required for this change.

Documentation Impact

  • The proposed evaluator will be described in the developer documentation.
  • The new alarm type and how to use it will be explained in the user guide.

References

Neutron Port Status Update

Note

This document represents a Neutron RFE reviewed in the Doctor project before submitting upstream to Launchpad Neutron space. The document is not intended to follow a blueprint format or to be an extensive document. For more information, please visit http://docs.openstack.org/developer/neutron/policies/blueprints.html

The RFE was submitted to Neutron. You can follow the discussions in https://bugs.launchpad.net/neutron/+bug/1598081

The Neutron port ‘status’ field represents the current status of a port in the cloud infrastructure. The field can take one of the following values: ‘ACTIVE’, ‘DOWN’, ‘BUILD’ and ‘ERROR’.

At present, if a network event occurs in the data plane (e.g. a virtual or physical switch, or one of its ports, fails, a cable gets pulled unintentionally, the infrastructure topology changes, etc.), connectivity to logical ports may be affected and tenants’ services interrupted. When tenants/cloud administrators look up their resources’ status (e.g. Nova instances and the services running in them, network ports, etc.), they will wrongly see that everything looks fine. The problem is that Neutron will continue reporting the port ‘status’ as ‘ACTIVE’.

Many SDN Controllers managing network elements have the ability to detect and report network events to upper layers. This allows SDN Controllers’ users to be notified of changes and react accordingly. Such information could be consumed by Neutron so that Neutron could update the ‘status’ field of those logical ports, and additionally generate a notification message to the message bus.

However, Neutron lacks a way to receive such information, for example through an ML2 driver or the REST API (the ‘status’ field is read-only). There are pros and cons to both of these approaches, as well as to other possible approaches. This RFE intends to trigger a discussion on how Neutron could be improved to receive fault/change events from SDN Controllers or even from 3rd parties not in charge of controlling the network (e.g. monitoring systems, human admins).

Port data plane status

https://bugs.launchpad.net/neutron/+bug/1598081

Neutron does not detect data plane failures affecting its logical resources. This spec addresses that issue by means of allowing external tools to report to Neutron about faults in the data plane that are affecting the ports. A new REST API field is proposed to that end.

Problem Description

An initial description of the problem was introduced in bug #1598081 [1]. This spec focuses on capturing one (main) part of the problem described there, i.e. extending Neutron’s REST API to cover the scenario of allowing external tools to report network failures to Neutron. Out of the scope of this spec is the work to enable port status changes to be received and managed by mechanism drivers.

This spec also tries to address bug #1575146 [2]. Specifically, as argued by the Neutron driver team in [3]:

  • Neutron should not shut down the port completely upon detection of a physnet failure; connectivity between instances on the same node may still be possible. External tools may or may not want to trigger a status change on the port based on their own logic and orchestration.
  • Port down is not detected when an uplink of a switch is down;
  • The physnet bridge may have multiple physical interfaces plugged in; shutting down the logical port may not be needed in case network redundancy is in place.

Proposed Change

A couple of possible approaches were proposed in [1] (comment #3). This spec proposes tackling the problem via a new extension API to the port resource. The extension adds a new attribute ‘dp-down’ (data plane down) to represent the status of the data plane. The field should be read-only for tenants and read-write for admins.

Neutron should send out an event to the message bus upon toggling the data plane status value. The event is relevant for e.g. auditing.

Data Model Impact

A new attribute as extension will be added to the ‘ports’ table.

Attribute Name   Type      Access                    Default Value   Validation/Conversion   Description
dp_down          boolean   RO (tenant), RW (admin)   False           True/False
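
As an illustration only, the new column could be added with an Alembic migration along the following lines; this is a sketch under the assumption of a plain boolean column on the 'ports' table, not the actual Neutron change.

# Illustrative only: a possible Alembic migration adding the new column
# to the 'ports' table; revision identifiers and naming are placeholders.
import sqlalchemy as sa
from alembic import op

def upgrade():
    op.add_column(
        'ports',
        sa.Column('dp_down', sa.Boolean(), nullable=False,
                  server_default=sa.sql.false()))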

REST API Impact

A new API extension to the ports resource is going to be introduced.

EXTENDED_ATTRIBUTES_2_0 = {
    'ports': {
        'dp_down': {'allow_post': False, 'allow_put': True,
                    'default': False, 'convert_to': convert_to_boolean,
                    'is_visible': True},
    },
}
Examples

Updating port data plane status to down:

PUT /v2.0/ports/<port-uuid>
Accept: application/json
{
    "port": {
        "dp_down": true
    }
}
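
For illustration, the same call issued by an external tool using the Python 'requests' library; the endpoint URL and token handling are placeholders.

# Minimal sketch: an external tool marking a port's data plane as down
# through the proposed extension. URL and token are placeholders.
import requests

def set_port_dp_down(neutron_url, token, port_uuid, down=True):
    resp = requests.put(
        '{}/v2.0/ports/{}'.format(neutron_url, port_uuid),
        headers={'X-Auth-Token': token,
                 'Content-Type': 'application/json'},
        json={'port': {'dp_down': down}})
    resp.raise_for_status()
    return resp.json()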

Command Line Client Impact

neutron port-update [--dp-down <True/False>] <port>
openstack port set [--dp-down <True/False>] <port>

Argument --dp-down is optional. Defaults to False.

Security Impact

None

Notifications Impact

A notification (event) upon toggling the data plane status (i.e. ‘dp-down’ attribute) value should be sent to the message bus. Such events do not happen with high frequency and thus no negative impact on the notification bus is expected.

Performance Impact

None

IPv6 Impact

None

Other Deployer Impact

None

Developer Impact

None

Implementation

Assignee(s)

  • cgoncalves

Work Items

  • New ‘dp-down’ attribute in ‘ports’ database table
  • API extension to introduce new field to port
  • Client changes to allow the data plane status (i.e. the ‘dp-down’ attribute) to be set
  • Policy (tenants read-only; admins read-write)

Documentation Impact

Documentation for both administrators and end users will have to be considered. Administrators will need to know how to set/unset the data plane status field.

References

[1] RFE: Port status update, https://bugs.launchpad.net/neutron/+bug/1598081
[2] RFE: ovs port status should the same as physnet, https://bugs.launchpad.net/neutron/+bug/1575146
[3] Neutron Drivers meeting, July 21, 2016, http://eavesdrop.openstack.org/meetings/neutron_drivers/2016/neutron_drivers.2016-07-21-22.00.html

Inspector Design Guideline

Note

This is a spec draft of the design guideline for the inspector component. JIRA ticket to track the update and collect comments: DOCTOR-73.

This document summarizes the best practices for designing a high performance inspector to meet the requirements of the OPNFV Doctor project.

Problem Description

Some pitfalls have been detected during the development of the sample inspector, e.g. we suffered a significant performance degradation when listing the VMs on a host.

A patch set for caching the list has been committed to solve the issue. When a new inspector is integrated, it would be nice to have an evaluation of its existing design and give recommendations for improvements.

This document can be treated as a source of related blueprints in inspector projects.

Guidelines

Host specific VMs list

While the requirement in the Doctor project is to deliver a fault alarm to the consumer within one second, that is just a limit we have set in the requirements. When talking about fault management in Telco, the implementation needs to be optimal by all means, and one second is far from traditional Telco requirements.

One thing to be optimized in the inspector is to eliminate the need to read the list of host-specific VMs from the Nova API when it gets a host-specific failure event. The optimal implementation would be to initialize this list when the Inspector starts, by reading from the Nova API, and after that keep the list up to date via instance.update notifications received from Nova. Polling the Nova API can be used as a complementary channel to take snapshots of the host and VM list in order to keep the data consistent with reality.

This is an enhancement and perhaps not needed to stay under one second in a small system. Anyhow, this would be needed for production use.

This guideline can be summarized as follows (a minimal sketch is given after the list):

  • cache the host VMs mapping instead of reading it on request
  • subscribe and handle update notifications to keep the list up to date
  • make snapshot periodically to ensure data consistency
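
A minimal sketch of such a cache is given below. The notification handling and Nova API calls are simplified, and all names are illustrative; it is not the sample inspector implementation.

# Illustrative sketch of a host -> VMs cache kept up to date by
# notifications and refreshed periodically from the Nova API.
import collections
import threading

class HostVMCache(object):
    def __init__(self, nova):
        self.nova = nova                       # novaclient-like handle (admin)
        self.lock = threading.Lock()
        self.host_vms = collections.defaultdict(set)
        self.snapshot()

    def snapshot(self):
        """Rebuild the whole mapping from the Nova API (periodic task)."""
        mapping = collections.defaultdict(set)
        for server in self.nova.servers.list(search_opts={'all_tenants': True}):
            host = getattr(server, 'OS-EXT-SRV-ATTR:host', None)
            if host:
                mapping[host].add(server.id)
        with self.lock:
            self.host_vms = mapping

    def on_instance_update(self, payload):
        """Handle an instance.update notification payload (simplified)."""
        host, vm = payload.get('host'), payload.get('instance_id')
        with self.lock:
            for vms in self.host_vms.values():
                vms.discard(vm)
            if host:
                self.host_vms[host].add(vm)

    def vms_on_host(self, host):
        """Answer a host failure event without touching the Nova API."""
        with self.lock:
            return set(self.host_vms.get(host, ()))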

Parallel execution

In Doctor’s architecture, the inspector is responsible for setting the error state for the affected VMs in order to notify the consumers of such a failure. This is done by calling the Nova reset-state API. However, this action is a synchronous request with many underlying steps and typically costs hundreds of milliseconds. According to the discussion on the mailing list, this time cost will grow linearly if the requests are sent one by one. It will become a critical issue in a large scale system.

It is recommended to introduce parallel execution for actions like reset-state that take a list of targets.
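
A minimal sketch of such parallel execution using a thread pool is given below; nova.servers.reset_state is the python-novaclient call, while error handling and pool sizing are left out.

# Illustrative sketch: issue reset-state for many VMs in parallel
# instead of one by one.
from concurrent.futures import ThreadPoolExecutor

def reset_states(nova, vm_ids, max_workers=8):
    def _reset(vm_id):
        # Synchronous call typically costing hundreds of milliseconds.
        nova.servers.reset_state(vm_id, 'error')
        return vm_id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_reset, vm_ids))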

Shortcut notification

An alternative way to improve notification performance is to take a shortcut from the inspector to the notifier instead of triggering it from the controller. The difference between the two workflows is shown below:

Figure: Conservative Notification

Figure: Shortcut Notification

It is worth noting that the shortcut notification has a side effect: cloud resource states could still be out of sync by the time the consumer processes the alarm notification. This is out of the scope of the inspector design, but needs to be taken into consideration at the system level.

Also, the call of “reset servers state to error” is not necessary in the alternative notification case where “host forced down” is still called. “get-valid-server-state” was implemented to provide a valid server state, which earlier could not be obtained without calling “reset servers state to error”. When not calling “reset servers state to error”, states are less likely to be out of sync, while the notification and forcing the host down can run in parallel.

Appendix

A study evaluating the effect of parallel execution and shortcut notification was presented at the OPNFV Beijing Summit 2017.

Figure: Notification Time

Download the full presentation slides here.

Performance Profiler

https://goo.gl/98Osig

This blueprint proposes to create a performance profiler for doctor scenarios.

Problem Description

In the verification job for notification time, we have encountered some performance issues, such as:

  1. In an environment deployed by APEX, it meets the criteria, while in one deployed by Fuel the performance is much poorer.
  2. Significant performance degradation was spotted when we increased the total number of VMs.

It takes time to dig through the logs and analyse the reason. People have to collect the timestamp at each checkpoint manually to find the bottleneck. A performance profiler will make this process automatic.

Proposed Change

The current Doctor scenario covers the inspector and the notifier within the whole fault management cycle:

start                                          end
  +       +         +        +       +          +
  |       |         |        |       |          |
  |monitor|inspector|notifier|manager|controller|
  +------>+         |        |       |          |
occurred  +-------->+        |       |          |
  |     detected    +------->+       |          |
  |       |     identified   +-------+          |
  |       |               notified   +--------->+
  |       |                  |    processed  resolved
  |       |                  |                  |
  |       +<-----doctor----->+                  |
  |                                             |
  |                                             |
  +<---------------fault management------------>+

The notification time can be split into several parts and visualized as a timeline:

start                                         end
  0----5---10---15---20---25---30---35---40---45--> (x 10ms)
  +    +   +   +   +    +      +   +   +   +   +
0-hostdown |   |   |    |      |   |   |   |   |
  +--->+   |   |   |    |      |   |   |   |   |
  |  1-raw failure |    |      |   |   |   |   |
  |    +-->+   |   |    |      |   |   |   |   |
  |    | 2-found affected      |   |   |   |   |
  |    |   +-->+   |    |      |   |   |   |   |
  |    |     3-marked host down|   |   |   |   |
  |    |       +-->+    |      |   |   |   |   |
  |    |         4-set VM error|   |   |   |   |
  |    |           +--->+      |   |   |   |   |
  |    |           |  5-notified VM error  |   |
  |    |           |    +----->|   |   |   |   |
  |    |           |    |    6-transformed event
  |    |           |    |      +-->+   |   |   |
  |    |           |    |      | 7-evaluated event
  |    |           |    |      |   +-->+   |   |
  |    |           |    |      |     8-fired alarm
  |    |           |    |      |       +-->+   |
  |    |           |    |      |         9-received alarm
  |    |           |    |      |           +-->+
sample | sample    |    |      |           |10-handled alarm
monitor| inspector |nova| c/m  |    aodh   |
  |                                        |
  +<-----------------doctor--------------->+

Note: c/m = ceilometer

And a table of components sorted by time cost, from most to least:

Component   Time Cost   Percentage
inspector   160ms       40%
aodh        110ms       30%
monitor     50ms        14%
...         ...         ...
...         ...         ...

Note: data in the table is for demonstration only, not actual measurement

Timestamps can be collected from various sources:

  1. log files
  2. trace point in code

The performance profiler will be integrated into the verification job to provide detailed results of the test. It can also be deployed independently to diagnose performance issues in a specific environment.
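
A minimal sketch of how such a profiler could turn collected checkpoint timestamps into per-step costs is shown below; the checkpoint names follow the timeline above, and how the timestamps are gathered (log parsing or trace points) is abstracted away.

# Illustrative sketch: compute per-step cost from checkpoint timestamps.
# 'checkpoints' maps a checkpoint name to a UNIX timestamp, gathered
# either from log files or from trace points in the code.

CHECKPOINT_ORDER = [
    'hostdown', 'raw failure', 'found affected', 'marked host down',
    'set VM error', 'notified VM error', 'transformed event',
    'evaluated event', 'fired alarm', 'received alarm', 'handled alarm',
]

def profile(checkpoints):
    rows = []
    for earlier, later in zip(CHECKPOINT_ORDER, CHECKPOINT_ORDER[1:]):
        if earlier in checkpoints and later in checkpoints:
            cost_ms = (checkpoints[later] - checkpoints[earlier]) * 1000
            rows.append((earlier + ' -> ' + later, cost_ms))
    total = sum(cost for _, cost in rows) or 1
    # Print steps sorted by cost, from most to least.
    for step, cost in sorted(rows, key=lambda r: r[1], reverse=True):
        print('{:35s} {:7.1f} ms {:5.1f}%'.format(step, cost, 100 * cost / total))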

Working Items

  1. PoC with limited checkpoints
  2. Integration with verification job
  3. Collect timestamp at all checkpoints
  4. Display the profiling result in console
  5. Report the profiling result to test database
  6. Independent package which can be installed to specified environment

Planned Maintenance Design Guideline

Note

This is spec draft of design guideline for planned maintenance. JIRA ticket to track the update and collect comments: DOCTOR-52.

This document describes how one can implement planned maintenance by utilizing the OPNFV Doctor project framework and meet the set requirements.

Problem Description

A Telco application needs to know when planned maintenance is going to happen in order to guarantee zero downtime in its operation. It needs to be possible for the application to take its own actions to keep the application running on unaffected resources, or to give guidance for admin actions like migration. More details are defined in the requirement documentation: use cases, architecture and implementation. See also the discussion in the OPNFV summit planned maintenance session.

Guidelines

The cloud admin needs to send a notification about planned maintenance including all the details the application needs in order to make decisions about its affected service. This notification payload can be consumed by the application by subscribing to the corresponding event alarm through an alarming service like OpenStack AODH.

Before maintenance starts, the application needs to be able to switch over its affected ACT-STBY service, take action to move the service to an unaffected part of the infrastructure, or give a hint for an admin operation like migration that can be automatically issued by the admin tool according to the agreed policy.

Flow diagram:

admin alarming project  controller  inspector
  |   service  app manager   |           |
  |  1.   |         |        |           |
  +------------------------->+           |
  +<-------------------------+           |
  |  2.   |         |        |           |
  +------>+    3.   |        |           |
  |       +-------->+   4.   |           |
  |       |         +------->+           |
  |       |    5.   +<-------+           |
  +<----------------+        |           |
  |                 |   6.   |           |
  +------------------------->+           |
  +<-------------------------+     7.    |
  +------------------------------------->+
  |   8.  |         |        |           |
  +------>+    9.   |        |           |
  |       +-------->+        |           |
  +--------------------------------------+
  |                10.                   |
  +--------------------------------------+
  |  11.  |         |        |           |
  +------------------------->+           |
  +<-------------------------+           |
  |  12.  |         |        |           |
  +------>+-------->+        |    13.    |
  +------------------------------------->+
  +-------+---------+--------+-----------+

Concepts used below:

  • full maintenance: maintenance will take a longer time and the resource should be emptied, meaning containers or VMs need to be moved or deleted. The admin might need to test that the resource works after maintenance.
  • reboot: only a reboot is needed and the admin does not need separate testing after it. Containers or VMs can be left in place if so wanted.
  • notification: a notification to rabbitmq.

The admin creates a planned maintenance session in which he sets a maintenance_session_id, a unique ID for all the hardware resources that will be in maintenance at the same time. Mostly, maintenance should be done node by node, meaning a single compute node at a time would be in a single planned maintenance session with a unique maintenance_session_id. This ID is carried through the whole session in all places and can be used to query the maintenance in the admin tool API. A project running a Telco application should set a specific role so that the admin tool knows it cannot do planned maintenance unless the project has agreed on the actions to be taken for its VMs or containers. This means the project has configured itself to get alarms upon planned maintenance and is capable of agreeing on the needed actions. The admin is supposed to use an admin tool to automate the maintenance process partially or entirely.

The flow of a successful planned maintenance session, using OpenStack as an example (a minimal sketch of an application-side alarm handler is given after the list):

  1. Admin disables nova-compute in order to do planned maintenance on a compute host and gets an ACK from the API call. This action needs to be done to ensure nothing will be placed on this compute host by any user. The action is always done regardless of whether the whole compute host is affected or not.
  2. Admin sends a project-specific maintenance notification with state planned maintenance. This includes detailed information about the maintenance, such as when it is going to start and whether it is a reboot or full maintenance, including information about the project containers or VMs running on the host, or on the part of it that needs maintenance. The default action, like migration, that the admin will issue before maintenance starts if no other action is set by the project is also mentioned. In case the project has a specific role set, planned maintenance cannot start unless the project has agreed to the admin action. Available admin actions are also listed in the notification.
  3. Application manager of the project receives AODH alarm about the same.
  4. Application manager can switch over its ACT-STBY service, or delete and re-instantiate its service on an unaffected resource if so wanted.
  5. Application manager may call the admin tool API to give preferred instructions for leaving VMs and containers in place, or to request an admin action to migrate them. In case the admin does not receive this instruction before maintenance is to start, the pre-configured default action, like migration, is taken for projects without the specific role that requires the project to agree to the action. VMs or containers can be left on the host if the type of maintenance is just a reboot.
  6. Admin does possible actions to VMs and containers and receives an ACK.
  7. In case everything went ok, Admin sends admin type of maintenance notification with state in maintenance. This notification can be consumed by Inspector and other cloud services to know there is ongoing maintenance which means things like automatic fault management actions for the hardware resources should be disabled.
  8. If the maintenance type is reboot and the project still has containers or VMs running on the affected hardware resource, Admin sends a project-specific maintenance notification with the state updated to in maintenance. If the project does not have anything left running on the affected hardware resource, the state will be maintenance over instead. If maintenance cannot be performed for some reason, the state should be maintenance cancelled. In that case the last operation remaining for the admin is to re-enable the nova-compute service, ensure everything is running, and not proceed with any further steps.
  9. Application manager of the project receives AODH alarm about the same.
  10. Admin will do the maintenance. This is out of Doctor scope.
  11. Admin enables nova-compute service when maintenance is over and host can be put back to production. An ACK is received from API call.
  12. In case project had left containers or VMs on hardware resource over maintenance, Admin sends project specific maintenance notification with state updated to maintenance over.
  13. Admin sends admin type of maintenance notification with state updated to maintenance over. Inspector and other cloud services can consume this to know hardware resource is back in use.
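
For the application manager side (steps 3-5 and 9 above), a minimal, hypothetical sketch of a webhook receiving the project-specific maintenance alarm is given below. The payload field names (state, maintenance_session_id) follow this guideline and are assumptions, not a fixed API.

# Illustrative only: an application manager endpoint consuming the
# project maintenance alarm delivered by the alarming service (e.g. an
# AODH webhook). Payload field names follow this guideline, not a fixed API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class MaintenanceAlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get('Content-Length', 0)))
        payload = json.loads(body or '{}')
        state = payload.get('state')

        if state == 'planned maintenance':
            # Steps 4/5: switch over ACT-STBY, or instruct the admin tool
            # (e.g. "leave VMs in place" or "migrate") before it starts.
            print('maintenance session %s planned, preparing service'
                  % payload.get('maintenance_session_id'))
        elif state == 'maintenance over':
            print('maintenance over, resources back in use')

        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), MaintenanceAlarmHandler).serve_forever()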

POC

There was a maintenance POC for planned maintenance at the OPNFV Beijing summit to show the basic concept of using the framework defined by the project.
