6 VNF High Availability

6.1 Service Availability

In the context of NFV, Service Availability refers to the End-to-End (E2E) Service Availability, which includes all the elements in the end-to-end service (VNFs and infrastructure components) with the exception of customer terminals such as handsets, computers, and modems. The service availability requirements for NFV should be the same as those for legacy systems providing the same service.

Service Availability = total service available time / (total service available time + total service recovery time)
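As a minimal illustration of the formula above, the following sketch computes the availability ratio for a hypothetical one-year observation window with five minutes of accumulated recovery time. The function name and the sample figures are assumptions made for this example only.

```python
def service_availability(available_s: float, recovery_s: float) -> float:
    """Return available time / (available time + recovery time)."""
    total = available_s + recovery_s
    if total <= 0:
        raise ValueError("observation window must be positive")
    return available_s / total


# Hypothetical figures: one year observed, 5 minutes spent recovering.
YEAR_S = 365 * 24 * 3600
RECOVERY_S = 5 * 60
availability = service_availability(YEAR_S - RECOVERY_S, RECOVERY_S)
print(f"availability = {availability:.6f}")
```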

The service recovery time depends, among other factors, on the number of redundant resources provisioned and/or instantiated that can be used to restore the service.

In the E2E relation a Network Service is available only if all the necessary Network Functions are available and interconnected appropriately to collaborate according to the NF chain.
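Because every element of the chain must be available at the same time, the availability of a chain can be estimated as the product of the availabilities of its elements. The sketch below assumes a purely serial chain whose elements fail independently; both are simplifying assumptions that the text above does not state, and the sample figures are hypothetical.

```python
from math import prod


def chain_availability(element_availabilities):
    """Estimate E2E availability of a serial chain of independent elements."""
    values = list(element_availabilities)
    for a in values:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability out of range: {a}")
    # All elements must be up simultaneously, so the probabilities multiply.
    return prod(values)


# Hypothetical chain: two VNFs and the interconnection between them.
print(chain_availability([0.9999, 0.9999, 0.99999]))
```

Under these assumptions the chain is never more available than its weakest element, which is why the E2E requirement has to be decomposed into stricter per-element requirements.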

General Service Availability Requirements

It should be possible to define the E2E (V)NF chain, based on which the E2E availability requirements can be decomposed into requirements applicable to individual VNFs and their interconnections.
The interconnection of the VNFs should be logical and be maintained by the NFVI with guaranteed characteristics, e.g. in case of failure the connection should be restored within the acceptable tolerance time.
These characteristics should be maintained during VM migration, failover and switchover, scale in/out, and similar scenarios.
It should be possible to prioritize the different network services and their VNFs. These priorities should be used when pre-emption policies are applied, for example due to resource shortage.
The VIM should support policies to prioritize a certain VNF.
The VIM should be able to provide classified virtual resources to VNFs at different SALs.
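The prioritization and pre-emption requirements above can be sketched as a simple victim-selection routine. This is only an illustration under stated assumptions: the Vnf record, the use of vCPUs as the scarce resource, and the rule "pre-empt the lowest SAL first, smallest instance first" are all hypothetical and are not prescribed by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vnf:
    name: str
    sal: int    # 1 = highest service availability level, 3 = lowest
    vcpus: int  # hypothetical measure of the resources the VNF holds


def select_preemption_victims(running, needed_vcpus, requester_sal):
    """Pick strictly lower-priority VNFs until needed_vcpus would be freed.

    Returns an empty list when the shortage cannot be covered this way.
    """
    candidates = sorted(
        (v for v in running if v.sal > requester_sal),
        key=lambda v: (-v.sal, v.vcpus),  # lowest priority first
    )
    victims, freed = [], 0
    for vnf in candidates:
        if freed >= needed_vcpus:
            break
        victims.append(vnf)
        freed += vnf.vcpus
    return victims if freed >= needed_vcpus else []
```

For example, a Level 1 request that is 6 vCPUs short would pre-empt Level 3 instances before touching any Level 2 instance.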

6.1.1 Service Availability Classification Levels

[ETSI-NFV-REL] defines three Service Availability Levels (SAL), which are classified in Table 1. They are based on the relevant ITU-T recommendations and reflect the service types and the customer agreements a network operator should consider.

[ETSI-NFV-REL] ETSI GS NFV-REL 001 V1.1.1 (2015-01)

Table 1: Service Availability classification levels

SAL Type	Customer Type	Service/Function	Notes
Level 1	Network Operator Control Traffic; Government/Regulatory Emergency Services	Intra-carrier engineering traffic; Emergency telecommunication service (emergency response, emergency dispatch); Critical Network Infrastructure Functions (e.g. VoLTE functions, DNS servers, etc.)	Sub-levels within Level 1 may be created by the Network Operator depending on customer demands, e.g. 1A - Control; 1B - Real-time; 1C - Data. May require 1+1 Redundancy with Instantaneous Switchover.
Level 2	Enterprise and/or large-scale customers (e.g. Corporations, Universities); Network Operators (Tier 1/2/3) service traffic	VPN; Real-time traffic (voice and video); Network Infrastructure Functions supporting Level 2 services (e.g. VPN servers, corporate web/mail servers)	Sub-levels within Level 2 may be created by the Network Operator depending on customer demands, e.g. 2A - VPN; 2B - Real-time; 2C - Data. May require 1:1 Redundancy with Fast (maybe Instantaneous) Switchover.
Level 3	General Consumer Public and ISP Traffic	Data traffic (including voice and video traffic provided by OTT); Network Infrastructure Functions supporting Level 3 services	While this is typically considered to be "Best Effort" traffic, it is expected that Network Operators will devote sufficient resources to assure "satisfactory" levels of availability. This level of service may be pre-empted by those with higher levels of Service Availability. May require M+1 Redundancy with Fast Switchover, where M > 1 and the value of M is to be determined by further study.

Requirements

It shall be possible to define different service availability levels
It shall be possible to classify the virtual resources for the different availability class levels
The VIM shall provide a mechanism by which VNF-specific requirements can be mapped to NFVI-specific capabilities.

More specifically, the requirements and capabilities may or may not be made up of the same KPI-like strings, but the cloud administrator must be able to configure which HA-specific VNF requirements are satisfied by which HA-specific NFVI capabilities.
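One way to picture the administrator-configured mapping is a lookup table from VNF requirement strings to the set of NFVI capability strings that satisfy them. The sketch below is an assumption-laden illustration: every requirement and capability name in it is hypothetical, and no particular VIM exposes this exact interface.

```python
# Hypothetical, administrator-maintained mapping: each HA-specific VNF
# requirement lists the NFVI capabilities that together satisfy it.
CAPABILITY_MAP = {
    "ha.recovery.fast": {"nfvi.host_monitoring", "nfvi.local_spare_capacity"},
    "ha.storage.replicated": {"nfvi.replicated_storage"},
}


def unmet_requirements(vnf_requirements, nfvi_capabilities,
                       capability_map=CAPABILITY_MAP):
    """Return the VNF requirements the given NFVI capabilities cannot satisfy."""
    offered = set(nfvi_capabilities)
    unmet = []
    for requirement in vnf_requirements:
        needed = capability_map.get(requirement)
        # Unknown requirements are treated as unmet rather than silently ignored.
        if needed is None or not needed <= offered:
            unmet.append(requirement)
    return unmet
```

Keeping the mapping separate from both vocabularies reflects the point made above: the requirement strings and the capability strings need not be identical as long as the administrator can relate them.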

6.1.2 Metrics for Service Availability

[ETSI-NFV-REL] identifies four metrics relevant to service availability:

Failure recovery time,
Failure impact fraction,
Failure frequency, and
Call drop rate.

6.1.2.1 Failure Recovery Time

The failure recovery time is the time interval from the occurrence of an abnormal event (e.g. failure, manual interruption of service, etc.) until the recovery of the service, regardless of whether the event is scheduled or unscheduled. For the unscheduled case, the recovery time includes the failure detection time and the failure restoration time. Restoration covers both service recovery by restarting the failed provider(s) and failover, in which a redundant provider takes over the service. This redundant provider may be a standby (i.e. synchronizing the service state with the active provider) or a spare (i.e. having no state information). Accordingly, failover here also covers switchover, that is, an orderly takeover of the service from the active provider by the standby/spare.

Requirements

It should be irrelevant whether the abnormal event is due to a scheduled or unscheduled operation or it is caused by a fault.
Failure detection mechanisms should be available in the NFVI and configurable so that the target recovery times can be met
Abnormal events should be logged and communicated (i.e. notifications and alarms as appropriate)

The TL-9000 forum has specified a service interruption time of 15 seconds as an outage for all traditional telecom system services. [ETSI-NFV-REL] recommends setting different thresholds for the different Service Availability Levels. An example setting is given in Table 2. Note that for all Service Availability Levels, real-time services require the fastest recovery time, while data services can tolerate longer recovery times. These recovery times are applicable to the user plane. A failure in the control plane does not have to impact the user plane. The main concern should be simultaneous failures in the control and user planes, as the user plane typically cannot recover without the control plane. However, an HA mechanism in the VNF itself can further mitigate the risk. Note also that the impact on the user plane depends on which control plane service experiences the failure, since some control plane services are more critical than others.

Table 2: Example service recovery times for the service availability levels

SAL	Service Recovery Time Threshold	Notes
1	5 - 6 seconds	Recommendation: Redundant resources to be made available on-site to ensure fast recovery.
2	10 - 15 seconds	Recommendation: Redundant resources to be available as a mix of on-site and off-site as appropriate. On-site resources to be utilized for recovery of real-time services. Off-site resources to be utilized for recovery of data services.
3	20 - 25 seconds	Recommendation: Redundant resources to be mostly available off-site. Real-time services should be recovered before data services.
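The example thresholds in Table 2 can be checked against a measured recovery time, taken here as detection time plus restoration time for an unscheduled event. In this sketch the upper bound of each range in Table 2 is used as the limit, which is an interpretation made for the example; the function names are hypothetical.

```python
# Upper bounds of the example ranges in Table 2, in seconds (assumed limits).
RECOVERY_THRESHOLD_S = {1: 6.0, 2: 15.0, 3: 25.0}


def recovery_time(detection_s: float, restoration_s: float) -> float:
    """Recovery time of an unscheduled event: detection plus restoration."""
    return detection_s + restoration_s


def meets_threshold(sal: int, detection_s: float, restoration_s: float) -> bool:
    """Check a measured recovery against the example threshold for the SAL."""
    return recovery_time(detection_s, restoration_s) <= RECOVERY_THRESHOLD_S[sal]


# Hypothetical measurement: 2 s to detect the failure, 4.5 s to restore service.
for level in (1, 2, 3):
    print(level, meets_threshold(level, 2.0, 4.5))
```

The split into detection and restoration shows why configurable failure detection in the NFVI matters: a slow detector can consume most of a Level 1 budget before any recovery action starts.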

6.1.2.2 Failure Impact Fraction

The failure impact fraction is the maximum percentage of the capacity or user population affected by a failure compared with the total capacity or the user population supported by a service. It is directly associated with the failure impact zone which is the set of resources/elements of the system to which the fault may propagate.
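As a small worked example of this definition, the sketch below expresses the affected share of a user population as a percentage. The function name and the subscriber counts are hypothetical.

```python
def failure_impact_fraction(affected: float, total: float) -> float:
    """Percentage of the capacity or user population affected by a failure."""
    if total <= 0:
        raise ValueError("total capacity or population must be positive")
    if not 0 <= affected <= total:
        raise ValueError("affected must lie between 0 and total")
    return 100.0 * affected / total


# Hypothetical case: 2,500 of 100,000 subscribers are served by the failed zone.
print(failure_impact_fraction(2_500, 100_000))
```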

Requirements

It should be possible to define the failure impact zone for all the elements of the system
At the detection of a failure of an element, its failure impact zone must be isolated before the associated recovery mechanism is triggered
If the isolation of the failure impact zone is unsuccessful the isolation should be attempted at the next higher level as soon as possible to prevent fault propagation.
It should be possible to define different levels of failure impact zones with associated isolation and alarm generation policies
It should be possible to limit the collocation of VMs to reduce the failure impact zone as well as to provide sufficient resources
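The isolate-then-escalate behaviour described in these requirements can be sketched as a walk over nested failure impact zones, from the smallest zone containing the failed element to the largest. The zone names and the isolate callback are hypothetical; a real NFVI would supply its own isolation mechanism.

```python
def isolate_with_escalation(zones, isolate):
    """Try to isolate the failure, escalating to the next higher zone on failure.

    zones   -- zones containing the failed element, smallest first
    isolate -- callable taking a zone and returning True if isolation succeeded
    Returns the zone that was isolated, or None if every level failed.
    """
    for zone in zones:
        if isolate(zone):
            # Recovery may only be triggered once the zone is isolated.
            return zone
    return None


# Hypothetical hierarchy in which isolating the VM fails but the host succeeds.
attempted = []

def fake_isolate(zone):
    attempted.append(zone)
    return zone == "host"

print(isolate_with_escalation(["vm", "host", "rack"], fake_isolate))
```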

6.1.2.3 Failure Frequency

Failure frequency is the number of failures in a certain period of time.

Requirements

There should be a probation period for each failure impact zone within which failures are correlated.
The failure threshold and the probation period for each failure impact zone should be configurable.
It should be possible to define failure escalation policies for the different failure impact zones.
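A probation period with a configurable threshold can be modelled as a sliding time window over failure timestamps. The following sketch assumes one counter per failure impact zone and treats "threshold reached within the probation period" as the escalation trigger; the class and its parameters are illustrative assumptions, not a defined interface.

```python
from collections import deque


class FailureCounter:
    """Count failures of one failure impact zone within a probation period."""

    def __init__(self, probation_s: float, threshold: int):
        self.probation_s = probation_s
        self.threshold = threshold
        self._events = deque()

    def record(self, timestamp_s: float) -> bool:
        """Record a failure; return True if the escalation policy should fire."""
        self._events.append(timestamp_s)
        # Forget failures that fall outside the probation period.
        while timestamp_s - self._events[0] > self.probation_s:
            self._events.popleft()
        return len(self._events) >= self.threshold


# Hypothetical policy: escalate on the third failure within 60 seconds.
counter = FailureCounter(probation_s=60, threshold=3)
print([counter.record(t) for t in (0, 30, 50)])
```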

6.1.2.4 Call Drop Rate

Call drop rate reflects service continuity as well as system reliability and stability. The metric is internal to the VNF and is therefore not specified further for the NFV environment.

Requirements

It shall be possible to specify for each service availability class the associated availability metrics and their thresholds
It shall be possible to collect data for the defined metrics
It shall be possible to delegate the enforcement of some thresholds to the NFVI
Accordingly, it shall be possible to request virtual resources with guaranteed characteristics, such as guaranteed latency between VMs (i.e. VNFCs), between a VM and storage, and between VNFs.

6.2 Service Continuity

The determining factor with respect to service continuity is the statefulness of the VNF. If the VNF is stateless, there is no state information which needs to be preserved to prevent the perception of service discontinuity in case of failure or other disruptive events. If the VNF is stateful, the NF has a service state which needs to be preserved throughout such disruptive events in order to shield the service consumer from these events and provide the perception of service continuity. A VNF may maintain this state internally, externally, or in a combination of both, with or without the NFVI being aware of the purpose of the stored data.

Requirements

The NFVI should maintain the number of VMs provided to the VNF in the face of failures, i.e. failed VM instances should be replaced by new VM instances.
It should be possible to specify whether the NFVI or the VNF/VNFM handles the service recovery and continuity
If the VNF/VNFM handles the service recovery it should be able to receive error reports and/or detect failures in a timely manner.
The VNF (i.e. between VNFCs) may have its own fault detection mechanism, which might be triggered before the error report from the underlying NFVI is received. Therefore, the NFVI/VIM should not attempt to preserve the state of a failing VM unless configured to do so.
The VNF/VNFM should be able to initiate the repair/reboot of resources of the NFVI (e.g. to recover from a fault persisting at the VNF level, which amounts to a failure impact zone escalation).
It should be possible to disallow the live migration of VMs; when it is allowed, it should be possible to specify the tolerated interruption time.
It should be possible to restrict the simultaneous migration of VMs hosting a given VNF
It should be possible to define under which circumstances the NFV-MANO in collaboration with the NFVI should provide error handling (e.g. VNF handles local recoveries while NFV-MANO handles geo-redundancy)
The NFVI/VIM should provide virtual resources such as storage according to the needs of the VNF with the required guarantees (see virtual resource classification).
The VNF shall be able to define the information to be stored on its associated virtual storage
It should be possible to define HA requirements for the storage, its availability, accessibility, resilience options, i.e. the NFVI shall handle the failover for the storage.
The NFVI shall handle network/connectivity failures transparently to the VNFs.
The VNFs with different requirements should be able to coexist in the NFV Framework
The scale in/out is triggered by the VNF (VNFM) towards the VIM (to be executed in the NFVI)
It should be possible to define the metrics to monitor and the related thresholds that trigger the scale in/out operation
Scale in operation should not jeopardize availability (managed by the VNF/VNFM), i.e. resources can only be removed one at a time with a period in between sufficient for the VNF to restore any required redundancy.
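The last requirement, removing resources one at a time with a pause that lets the VNF restore its redundancy, can be sketched as the loop below. The callbacks remove and redundancy_restored, as well as the polling and timeout parameters, are hypothetical stand-ins for whatever the VNF/VNFM and VIM actually provide.

```python
import time


def scale_in(instances, target_count, remove, redundancy_restored,
             poll_s=1.0, timeout_s=60.0, sleep=time.sleep, clock=time.monotonic):
    """Remove instances one at a time until target_count remain.

    After each removal, wait until redundancy_restored() reports True.
    If that does not happen within timeout_s, stop scaling in and return
    the instances that are still running.
    """
    remaining = list(instances)
    while len(remaining) > target_count:
        victim = remaining.pop()
        remove(victim)
        deadline = clock() + timeout_s
        while not redundancy_restored():
            if clock() >= deadline:
                return remaining  # abort: availability would be at risk
            sleep(poll_s)
    return remaining
```

A caller can compare the length of the returned list with target_count to tell whether the operation completed or was aborted to protect availability.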