Requirements¶
Upgrade duration¶
As OPNFV end users are primarily telecom operators, the network services provided by the VNFs deployed on the NFVI should meet ‘carrier grade’ requirements:
In telecommunication, a "carrier grade" or "carrier class" refers to a
system, or a hardware or software component that is extremely reliable,
well tested and proven in its capabilities. Carrier grade systems are
tested and engineered to meet or exceed "five nines" high availability
standards, and provide very fast fault recovery through redundancy
(normally less than 50 milliseconds). [from wikipedia.org]
“Five nines” (99.999% availability) means the system may be out of service for at most about 5 minutes 15 seconds in ONE YEAR.
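The downtime budget follows directly from the availability target. The snippet below is a small illustrative calculation, not part of the original requirements; the function names are hypothetical and a 365-day year is assumed.

```python
# Downtime budget implied by an availability target over one 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(availability: float) -> float:
    """Return the allowed downtime per year, in seconds."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


def format_budget(seconds: float) -> str:
    """Render a duration as whole minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s"


five_nines = downtime_budget_seconds(0.99999)
print(format_budget(five_nines))  # 5 min 15 s
```

A single 10-minute outage therefore already exceeds the yearly budget of a five-nines service.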
We have learnt that a well-prepared upgrade of OpenStack needs 10
minutes. Most of the outage time is spent on synchronizing the
database. [from 'Ten minutes OpenStack Upgrade? Done!' by Symantec]
These 10 minutes of OpenStack service downtime, however, did not impact the users, i.e. the VMs running on the compute nodes; only the control plane was out of service. On the other hand, the preparation was manual: the upgrade was tailored to the particular deployment and to the versions of each OpenStack service.
The project aims at a more generic methodology, which however requires that the upgrade objects fulfil certain requirements. Since this is only possible in the long run, we first target the upgrade of the different VIM services from version to version.
Questions:
- Can we manage to upgrade OPNFV in only 5 minutes?
- Is a planned service interruption lasting more than ten minutes for a software upgrade acceptable to end users?
- Will the VNFs still work well when the VIM is down?
The maximum duration of an upgrade¶
The duration of an upgrade is related to and proportional to the scale and the complexity of the OPNFV platform as well as the granularity (in function and in space) of the upgrade.
The maximum duration of a roll back when an upgrade fails¶
The duration of a roll back is shorter than that of the corresponding upgrade. It depends on the time needed to restore the software and configuration data from the pre-upgrade backup / snapshot.
The maximum duration of a VNF interruption (Service outage)¶
Since not every part of a smooth upgrade affects the VNFs, the duration of the VNF interruption may be shorter than the duration of the upgrade. In some cases, it is acceptable for a VNF to run without control from the VIM.
Pre-upgrading Environment¶
The system is running normally. If there are any faults before the upgrade, it is difficult to distinguish faults introduced by the upgrade from those of the environment itself.
The environment should have redundant resources. Because the upgrade process is based on business migration, without resource redundancy it is impossible to migrate the business and therefore to achieve a smooth upgrade.
Resource redundancy exists at two levels:
NFVI level: This level is mainly about the redundancy of compute node resources. During the upgrade, the virtual machines carrying business can be migrated to another free compute node.
VNF level: This level depends on the HA mechanism in the VNF, such as active-standby or load balancing. In this case, as long as the business running on the VMs of the target node is migrated to other free nodes, the migration of the VMs themselves might not be necessary.
Which kind of redundancy to use depends on the specific environment. Generally speaking, during the upgrade, the VNF’s service level availability mechanism should be used with higher priority than the NFVI’s. This helps to reduce the service outage.
Release version of software components¶
This is primarily a compatibility requirement. You can refer to Linux/Python Compatible Semantic Versioning 3.0.0:
Given a version number MAJOR.MINOR.PATCH, increment the:
MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner,
PATCH version when you make backwards-compatible bug fixes.
Some internal interfaces of OpenStack will be used by Escalator indirectly, such as the VM migration related interface between VIM and NFVI. These interfaces are therefore required to be backward compatible. Refer to the “Interface” chapter for details.
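As an illustration of how the MAJOR.MINOR.PATCH rule above could be checked before an upgrade, here is a minimal sketch. The function names are hypothetical, and pre-release or build suffixes (which the versioning scheme also allows) are deliberately not handled.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string (no suffixes supported)."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {text!r}")
    major, minor, patch = (int(p) for p in parts)
    return Version(major, minor, patch)


def is_backward_compatible(old: str, new: str) -> bool:
    """True when `new` is expected to keep the interfaces exposed by `old`.

    Same MAJOR means no incompatible API change; `new` must not be older.
    """
    o, n = parse_version(old), parse_version(new)
    return n.major == o.major and n >= o


print(is_backward_compatible("2.1.0", "2.3.4"))  # True
print(is_backward_compatible("2.1.0", "3.0.0"))  # False
```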
Work Flows¶
This section describes the different types of requirements. A table is to be added to label the source of each requirement, e.g. Doctor, Multi-site, etc.
Basic Actions¶
This section describes the basic functions that may be required by Escalator.
Preparation (offline)¶
This is the design phase when the upgrade plan (or upgrade campaign) is being designed so that it can be executed automatically with minimal service outage. It may include the following work:
- Check the dependencies of the software modules and their impact, backward compatibilities to figure out the appropriate upgrade method and ordering.
- Find out if a rolling upgrade could be planned with several rolling steps to avoid any service outage caused by upgrading some parts/services at the same time.
- Collect the proper version files and check the integration for upgrading.
- The preparation step should produce an output (i.e. upgrade
campaign/plan), which is executable automatically in an NFV Framework
and which can be validated before execution.
- The upgrade campaign should not refer to scalable entities directly, but allow for adaptation to the system configuration and state at any given moment.
- The upgrade campaign should describe the ordering of the upgrade of different entities so that dependencies and redundancies can be maintained during the upgrade execution.
- The upgrade campaign should provide information about the applicable recovery procedures and their ordering.
- The upgrade campaign should consider information about the verification/testing procedures to be performed during the upgrade so that upgrade failures can be detected as soon as possible and the appropriate recovery procedure can be identified and applied.
- The upgrade campaign should provide information on the expected execution time so that hanging execution can be identified.
- The upgrade campaign should indicate any point in the upgrade when coordination with the users (VNFs) is required.
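The ordering requirement above can be sketched as a topological sort over a dependency map. This is only an illustration: the service names and the map are hypothetical, and the sketch assumes that an entity is upgraded after everything it depends on, which a real campaign may need to reverse for some components.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key depends on the entities in its set.
DEPENDS_ON = {
    "database": set(),
    "message-queue": set(),
    "identity": {"database"},
    "image": {"identity", "database"},
    "compute-api": {"identity", "image", "message-queue"},
}


def upgrade_batches(depends_on):
    """Group entities into ordered batches.

    Entities in the same batch have no dependency on each other, so they
    could be upgraded in parallel; batches must run one after another.
    A cyclic map raises graphlib.CycleError in prepare().
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for number, batch in enumerate(upgrade_batches(DEPENDS_ON), start=1):
    print(number, batch)
```

Computing the batches from the current dependency map, rather than hard-coding a node list, is one way to keep the campaign adaptable to the system configuration at execution time.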
Validating the upgrade plan / Checking the prerequisites of the system (offline / online)¶
The upgrade plan should be validated before execution by testing it in a test environment which is similar to the production environment.
Before the upgrade plan is executed, the health of the online production environment should be checked and confirmed to satisfy the requirements described in the upgrade plan. The sysinfo, e.g. system alarms, performance statistics and diagnostic logs, will be collected and analyzed. It is required to resolve all system faults or exclude the unhealthy parts before executing the upgrade plan.
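A pre-upgrade health gate along these lines might look like the following sketch. The `SysInfo` fields, the CPU load threshold and the node names are assumptions made for illustration; the document does not define what the collected sysinfo contains beyond alarms, statistics and logs.

```python
from dataclasses import dataclass, field


@dataclass
class SysInfo:
    """Hypothetical per-node summary of the collected sysinfo."""
    node: str
    active_alarms: list = field(default_factory=list)
    cpu_load: float = 0.0  # fraction of capacity, 0.0 to 1.0


def precheck(nodes, max_cpu_load=0.8):
    """Split nodes into healthy ones and excluded ones with reasons."""
    healthy, excluded = [], {}
    for info in nodes:
        reasons = []
        if info.active_alarms:
            reasons.append(f"{len(info.active_alarms)} active alarm(s)")
        if info.cpu_load > max_cpu_load:
            reasons.append(f"cpu load {info.cpu_load:.2f} above {max_cpu_load:.2f}")
        if reasons:
            excluded[info.node] = reasons
        else:
            healthy.append(info.node)
    return healthy, excluded


healthy, excluded = precheck([
    SysInfo("ctl-1"),
    SysInfo("cmp-1", active_alarms=["disk failure"]),
    SysInfo("cmp-2", cpu_load=0.95),
])
print(healthy)   # nodes the plan may proceed on
print(excluded)  # nodes to repair or leave out, with reasons
```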
Backup/Snapshot (online)¶
To avoid loss of data when an upgrade is unsuccessful, the data should be backed up and a snapshot of the system state should be taken before the execution of the upgrade plan. This should be considered in the upgrade plan.
Several backups/snapshots may be generated and stored before the individual steps of changes. The following data/files are required to be considered:
- running version files for each node.
- system components’ configuration file and database.
- image and storage, if it is necessary.
Although the upper layer, which includes VNFs and VNFMs, is out of the scope of Escalator, it is still recommended to prepare it for a smooth system upgrade. Escalator cannot guarantee the safety of VNFs. The upper layer should have some safeguard mechanism in its design, ready to avoid failures during a system upgrade.
Execution (online)¶
- The execution of the upgrade plan should be a dynamic procedure which is controlled by Escalator.
- It is required to support execution either in sequence or in parallel.
- It is required to check the result of the execution and take action according to the situation and the policies in the upgrade plan.
- It is required to execute properly on various configurations of the system object, e.g. stand-alone, HA, etc.
- It is required to execute on the designated different parts of the system, e.g. physical server, virtualized server, rack, chassis, cluster, even different geographical places.
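The sequential/parallel execution and result checking described above can be sketched as follows. `run_step`, `decide` and the policy names are hypothetical, and the `action` callable stands in for whatever actually upgrades a target.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(targets, action, parallel=False, max_workers=4):
    """Apply `action` to each target and return {target: (ok, detail)}.

    `targets` is a list; exceptions raised by `action` are captured as
    failures so that one bad target does not hide the other results.
    """
    def safe(target):
        try:
            return True, action(target)
        except Exception as exc:
            return False, str(exc)

    if parallel:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(safe, targets))  # keeps input order
    else:
        results = [safe(target) for target in targets]
    return dict(zip(targets, results))


def decide(results, policy="abort"):
    """Map step results to the next action according to a simple policy."""
    failed = [t for t, (ok, _) in results.items() if not ok]
    if not failed:
        return "continue"
    return "rollback" if policy == "rollback" else "abort"
```

For example, `decide(run_step(nodes, upgrade_node, parallel=True), policy="rollback")` would return `"rollback"` as soon as any node in the batch failed, where `nodes` and `upgrade_node` are supplied by the caller.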
Testing (online)¶
Testing is performed after upgrading the whole system or parts of the system to make sure the upgraded system (object) is working normally.
- It is recommended to run the prepared test cases to see if the functionalities are available without any problem.
- It is recommended to check the sysinfo, e.g. system alarms, performance statistics and diagnostic logs, to see if there is anything abnormal.
Restore/Roll-back (online)¶
When an upgrade fails, a quick system restore or system roll-back should be performed to recover the system and the services.
- It is recommended to support system restore from backup when an upgrade has failed.
- It is recommended to support graceful roll-back with steps in reverse order if possible.
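A graceful roll-back in reverse order can be sketched by recording each completed step together with its undo action. The `(name, apply, undo)` step shape is an assumption made for this example, not something the document prescribes.

```python
def run_with_rollback(steps):
    """Run steps in order; on failure undo the completed ones in reverse.

    `steps` is a list of (name, apply, undo) tuples where `apply` and
    `undo` are zero-argument callables. Returns (ok, log).
    """
    done = []
    log = []
    for name, apply, undo in steps:
        try:
            apply()
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            # Undo only what actually completed, newest first.
            for prev_name, prev_undo in reversed(done):
                prev_undo()
                log.append(f"rolled back {prev_name}")
            return False, log
        log.append(f"applied {name}")
        done.append((name, undo))
    return True, log
```

If a step cannot be undone cleanly, restoring from the backup or snapshot taken before the upgrade remains the fallback.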
Monitoring (online)¶
Escalator should continually monitor the upgrade process. It keeps the status of each module, each node and each cluster up to date in a status table during the upgrade.
- It is required to collect the status of every object being upgraded and to send alarms on abnormal conditions during the upgrade.
- It is recommended to reuse the existing monitoring system, such as alarms.
- It is recommended to support proactive queries.
- It is recommended to support passively waiting for notifications.
Two possible ways for monitoring:
Proactive query requires that the NFVI/VIM provides a proper API or CLI interface. If Escalator serves as a service, it should pass on these interfaces.
Passively waiting for notification requires that Escalator provides a callback interface, which can be used by the NFVI/VIM systems or an upgrade agent to send back notifications.
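Both monitoring styles can feed the same status table. The class below is a minimal in-memory sketch; the class name, method names and the `"failed"` status value are hypothetical, and `query` stands in for a real NFVI/VIM API or CLI call.

```python
import time


class StatusTable:
    """In-memory status table keyed by object id (module, node, cluster)."""

    def __init__(self):
        self._rows = {}

    def update(self, object_id, status, timestamp=None):
        """Record the latest status and when it was observed."""
        when = time.time() if timestamp is None else timestamp
        self._rows[object_id] = (status, when)

    def on_notification(self, object_id, status):
        """Passive mode: callback invoked by the NFVI/VIM or an upgrade agent."""
        self.update(object_id, status)

    def poll(self, object_ids, query):
        """Proactive mode: ask for the status of each object via `query`."""
        for object_id in object_ids:
            self.update(object_id, query(object_id))

    def status(self, object_id):
        return self._rows[object_id][0]

    def abnormal(self):
        """Object ids whose latest status should raise an alarm."""
        return sorted(oid for oid, (status, _) in self._rows.items()
                      if status == "failed")
```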
Logging (online)¶
Record the information generated by Escalator into log files. The log files are used for manual diagnosis of exceptions.
- It is required to support logging.
- It is recommended to include time stamp, object id, action name, error code, etc.
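One way to carry the recommended fields in every log record is Python's standard `logging` module with a custom format. The format string, logger name and helper functions below are assumptions for illustration; a real deployment would pass a file object or swap in `logging.FileHandler`.

```python
import logging

# Time stamp, level, object id, action name and error code in every record.
LOG_FORMAT = ("%(asctime)s %(levelname)s object=%(object_id)s "
              "action=%(action)s code=%(error_code)s %(message)s")


def make_logger(name="escalator", stream=None):
    """Create a logger writing to `stream` (stderr when None).

    Call once per logger name; each call adds another handler.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


def log_action(logger, object_id, action, error_code=0, message=""):
    """Log one action; a non-zero error code is logged at ERROR level."""
    level = logging.ERROR if error_code else logging.INFO
    logger.log(level, message, extra={
        "object_id": object_id,
        "action": action,
        "error_code": error_code,
    })
```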
Administrative Control (online)¶
Administrative control is used to restrict the privilege to start any of Escalator’s actions, in order to avoid unauthorized operations.
- It is required to support an administrative control mechanism.
- It is recommended to reuse the system’s own security system.
- It is required to avoid conflicts when the system’s own security system is being upgraded.
Requirements on Object being upgraded¶
Escalator focuses on smooth upgrade. In a practical implementation, it might be combined with an installer/deployer, or act as an independent tool/service. Either way, it requires that the target systems (NFVI and VIM) are developed/deployed in a way that lets Escalator perform upgrades on them.
On the NFVI system, live-migration is likely to be used to maintain availability because OPNFV would like to make HA transparent to the end user. This requires the VIM system to be able to put a compute node into maintenance mode and isolate it from normal service. Otherwise, new NFVI instances risk being scheduled onto the node being upgraded.
On the VIM system, availability is likely to be achieved by redundancy. This imposes fewer requirements on the system/services being upgraded (see PVA comments in early version). However, there should be a way to put the target system into standby mode, because starting the upgrade on the master node in a cluster is likely a bad idea.
- It is required for NFVI/VIM to support a service handover mechanism that limits interruption to 0.001% (i.e. 99.999% service availability). Possible implementations are live-migration, redundant deployment, etc. (Note: for VIM, the interruption requirement could be less restrictive.)
- It is required for NFVI/VIM to restore the earlier version in an efficient way, such as from a snapshot.
- It is required for NFVI/VIM to migrate data efficiently between the base and the upgraded system.
- It is recommended for the NFVI/VIM’s interface to support upgrade orchestration, e.g. reading/setting system state.
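The maintenance-mode handling described for the NFVI can be sketched with a toy in-memory model. Everything here is hypothetical: `Node`, its fields and `rolling_upgrade` only mimic the sequence (isolate the node, move its VMs, upgrade, return to service) and do not call any real VIM interface.

```python
class Node:
    """Toy model of a compute node; not a real VIM object."""

    def __init__(self, name, vms=()):
        self.name = name
        self.vms = list(vms)
        self.maintenance = False
        self.version = "old"


def rolling_upgrade(nodes, new_version):
    """Upgrade nodes one at a time, evacuating each before touching it."""
    for node in nodes:
        node.maintenance = True  # the scheduler must skip this node from now on
        targets = [n for n in nodes if n is not node and not n.maintenance]
        if node.vms and not targets:
            node.maintenance = False
            raise RuntimeError(f"no spare capacity to evacuate {node.name}")
        for index, vm in enumerate(list(node.vms)):
            # Stand-in for live migration: spread VMs over the other nodes.
            targets[index % len(targets)].vms.append(vm)
        node.vms.clear()
        node.version = new_version
        node.maintenance = False  # back in service, now on the new version
```

The `RuntimeError` branch reflects the resource redundancy precondition: without a spare node there is nowhere to move the VMs, so the upgrade of that node cannot be smooth.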
Functional Requirements¶
Availability mechanism, etc.