A Production Readiness Checklist

This appendix provides a checklist of tasks that must be accomplished before taking Oracle Communications Order and Service Management (OSM) into production.

About Using the Production Checklist

Use the OSM checklist to verify that operational best practices and key procedures exist before going into production or performing a major release upgrade. These best practices cover success factors such as performance, configuration, purging, and database statistics management.

Ensure that implementation project teams review this material well ahead of going into production to ensure that project plans incorporate the required activities.

Ensure that implementation project teams provide an explicit response to this checklist to Oracle. The production checklist can help Oracle clarify key success factors and risks so that Oracle can provide faster emergency responses in case of problems.

The implementation project should share additional information (for example, performance test results) with Oracle to provide better overall context on the deployment. Oracle maintains all such information confidentially according to the Oracle Information Protection Policy. Shared information can be deleted upon request.

Provide the checklist response and any additional information by opening an OSM service request with a title such as "Production Checklist" at least one month before going into production. Oracle may request clarifications through the service request, which can then be closed.

Checking for a Current OSM Patch Before Going into Production

Going into production on a recent OSM patch ensures that the OSM solution is not susceptible to known issues encountered by other customers, which may or may not manifest themselves in pre-production tests. It also facilitates the resolution of any new issues that may be encountered, because fixes are normally delivered on recent patches.

For OSM releases in Premier Support with active customers, Oracle continues to release patch sets (for example, 7.3.0.1.0 or 7.3.0.2.0) at regular intervals. The patch set interval depends on the level of customer activity for a given release; for example, it may be every six months. The majority of bug fixes are delivered in patch sets, which often also include important operational enhancements (for example, improved purging performance).

In addition, Oracle releases interim patches (for example, 7.3.0.1.2) to address specific problems that are blocking specific customers. These interim patches are released on a code branch for the given patch set. All interim patches are cumulative (a superset) of prior interim patches for that branch. Changes released in an interim patch are forward-ported into the next upcoming patch set.

Oracle normally releases interim patches for the last two patch sets of a given release.

The implementation project teams should plan to go into production or upgrade on a recent patch of the latest patch set (preferred) or the prior patch set (also acceptable). For example, if 7.3.0.1.0 was the last patch set issued, going into production on 7.3.0.1.5 (a recent patch of that patch set) would be the preferred option. Alternatively, it would also be acceptable to go into production on 7.3.0.0.14, a recent patch of 7.3.0.0.0 (the prior patch set).

In general, if the implementation project takes more than six months, Oracle recommends that the implementation project teams plan to upgrade to a recent patch on the same release prior to going into production. Oracle can provide advance notice of planned patch sets to assist in this planning. For example, if a project started three months ago using OSM 7.3.0.0.9 (on the prior patch set) and the implementation project teams plan to go into production in the next three months, Oracle would recommend that the teams check with Oracle on when a 7.3.0.1.0 patch set may be released. The implementation project teams should plan to upgrade prior to going into production, avoiding going into production on an older patch set.
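The "latest or prior patch set" rule above can be expressed as a simple version comparison. The following is a minimal sketch, assuming the five-digit scheme shown in the examples (7.3.0.1.5), where the first four components identify the patch set and patch sets increment only the fourth component; the function names are illustrative, not part of any OSM tooling.

```python
def patch_set(version):
    """Split a five-digit OSM version such as '7.3.0.1.5' into its
    patch-set prefix (7, 3, 0, 1) and interim patch number (5)."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts[:4]), parts[4]

def acceptable_go_live_version(target, latest_patch_set):
    """Return True if the target version is on the latest patch set
    (preferred) or the immediately prior patch set (also acceptable).

    Assumes consecutive patch sets differ only in the fourth
    component (for example, 7.3.0.1 follows 7.3.0.0); adapt the
    prior-patch-set computation if your release history differs."""
    target_ps, _ = patch_set(target)
    latest_ps, _ = patch_set(latest_patch_set)
    prior_ps = latest_ps[:3] + (latest_ps[3] - 1,)
    return target_ps in (latest_ps, prior_ps)
```

With 7.3.0.1.0 as the latest patch set, this accepts both 7.3.0.1.5 (preferred) and 7.3.0.0.14 (prior patch set), matching the examples above.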

Checklist:

  • The production system launch or upgrade should take place on a recent interim patch for the latest patch set (preferred) or the previous patch set.

  • If implementation project teams have a long-running project (more than six months), plan to upgrade to a patch of the most recent patch set prior to going into production or upgrading.

Useful information to share: What is the five-digit OSM patch level targeted for going into production or upgrading?

Checking for Deployment Architecture

Understanding the OSM deployment architecture is important in troubleshooting configuration-related production issues. Diagrams depicting deployment architecture information are helpful.

Useful information to share:

  • At a high-level, what are the external systems that OSM interacts with?

  • How many application servers are deployed? What are their hardware specifications?

  • Is Oracle RAC used? If so, what Oracle RAC configuration is used (for example, Active/Active)?

  • How many database servers are deployed and what are their hardware specifications?

  • Are virtual machines used? If so, which product and version?

  • Are multiple instances of OSM (for example, COM, SOM) deployed?

  • What are the version numbers of other key software (Oracle Application Integration Architecture (Oracle AIA), OSM Order-to-Activate cartridges, WebLogic Server, operating system, database, virtual machine)?

  • What RAID storage configuration is used (for example, RAID 10 or RAID 5)?

  • What type of storage is used for the shared WebLogic Server Persistent Store?

  • What security provider configuration (for example, internal or external LDAP directory) does OSM use?

Checking the OSM Production System Configuration

Optimal performance and system behavior under load or faults depend heavily on proper configuration of OSM, the database, and WebLogic Server. The compliance tool identifies configuration problems that may cause production issues. Run the compliance tool before doing performance tests to avoid costly delays. For more information about running the compliance tool, see "Verifying the OSM Installation".

Checklist:

  • The compliance tool was run on the OSM production system.

  • All configuration compliance issues of severity "Error" and "Warning" have been corrected, or reviewed and determined to be acceptable.

  • The size of log files and their rollover is correctly configured for application server and database components according to the amount of disk space available.

  • Manually review all configuration files to identify typos or inconsistent configurations (for example, between managed server startup parameters).

  • Ensure that all recommended patches have been installed. For more information, see "Software Requirements".

Useful information to share: Rerun the compliance tool when all corrections have been made and share the report with Oracle. Add an explanation for any "Error" or "Warning" entries that are ignored.

Checking for Performance Expectations

Ensure that you validate that the OSM solution performs at the full expected target order volume prior to going into production. For more information about running performance tests on the OSM solution, see "OSM Pre-Production Testing and Tuning".

Gathering this information ahead of time may save precious investigation time when resolving a production issue.

Checklist:

  • Performance tests have confirmed that the OSM production system can sustain the full expected maximum hourly order volume rate. These tests should observe the following:

    • Ensure that a representative mix of order types and sizes is used in the test workload.

    • Ensure that the average size of the order (line item count for orchestration orders) is aligned with expected production order workload.

    • Ensure that the test order mix includes enough revision orders and larger orders, if applicable.

    • Ensure that the production system has enough memory to handle the expected number of cartridge versions deployed.

    • Ensure that for long-running orders, enough in-progress orders are already loaded in the system to simulate the maximum workload of the system.

    • Validate that enough capacity remains to sustain peak volume if any one machine of the production system were to fail, or to deal with unexpected order volume peaks (for example, +20%).

  • Performance tests have confirmed that the production system can process the maximum expected order size.

  • Performance tests have confirmed that the OSM production system can process any expected large bulk operations.

  • Longevity performance tests (for example, 24 hours) have been performed without errors on the OSM production system. Longevity performance tests must validate the following:

    • The workload is distributed uniformly across the cluster.

    • JMS messages do not accumulate.

    • Java garbage collection is operating properly. Oracle recommends that you regularly analyze garbage collection logs (for example, with GCViewer) to detect changes in your memory usage patterns. In particular, frequently check the live data size (LDS), which is the amount of tenured heap still in use after a full garbage collection cycle. With the Parallel Old garbage collection algorithm, a healthy LDS is 50% or less of the tenured heap. Capture garbage collection logs for a period long enough for at least one full garbage collection to occur. Note that memory problems often present themselves as periods of high CPU utilization during which the system appears to become unresponsive; cluster instability is another frequent symptom.

    • Database locks are not occurring.

    • WebLogic Server thread locks are not occurring.

  • The I/O capacity of the database storage sub-system and of the shared WebLogic Server persistent store has been validated.
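The LDS check described in the longevity-test bullets can be automated against the garbage collection log. The following is a minimal sketch: the regular expression assumes a typical HotSpot Parallel Old full-GC log line (for example, `[ParOldGen: 819200K->409600K(1048576K)]`); the exact log format varies by JVM version and flags, so treat the pattern and the function names as assumptions to adapt to your own gc.log.

```python
import re

# Assumed HotSpot "Full GC" line fragment reporting old-generation
# occupancy: used-before -> used-after (capacity), all in KB.
FULL_GC = re.compile(r"\[ParOldGen: \d+K->(\d+)K\((\d+)K\)\]")

def live_data_ratios(gc_log_lines):
    """Yield tenured-heap occupancy after each full GC as a fraction
    of tenured capacity (the live data size, LDS)."""
    for line in gc_log_lines:
        m = FULL_GC.search(line)
        if m:
            used_after, capacity = int(m.group(1)), int(m.group(2))
            yield used_after / capacity

def lds_healthy(gc_log_lines, threshold=0.5):
    """True if at least one full GC was observed and every
    post-full-GC occupancy is at or below the 50% guideline
    for the Parallel Old collector."""
    ratios = list(live_data_ratios(gc_log_lines))
    return bool(ratios) and max(ratios) <= threshold
```

Running this over the logs captured during a longevity test gives an objective pass/fail signal for the memory-health bullet above; a rising trend in the ratios between full GCs is also worth flagging even when all values are under the threshold.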

Useful information to share:

  • What are the average and maximum OSM order creation volume targets per day and during the busy hour?

  • If orders coming into OSM (for example, COM) generate additional orders (for example, SOM), what is the expected generated order volume for each?

  • What is the expected line item count for average size orders and the expected line item count for large size orders? Indicate the approximate percentage of large orders.

  • Indicate the approximate percentage of expected revision orders, if applicable.

  • Indicate the average order duration time (number of days to complete an order).

  • Indicate the average number of cartridge types and versions planned to be deployed on the production system.

  • Describe any bulk operations that may occur during the day including the number of orders submitted and at what time they will be executed.

  • Indicate whether Oracle has previously provided hardware sizing recommendations for the deployment.

  • Share a summary of performance test results that specifies how the production system (for example, the application servers, the database, and the storage) responds (for example, the CPU, memory, and I/O) during the target peak load with the target number of cartridge versions deployed. This provides a baseline for comparison when investigating any future performance-related issue.

Checking for a Migration Strategy and Production Schedule

Many OSM projects require a migration phase. Some options include an in-place upgrade of an existing OSM database, or setting up a new parallel instance of OSM and cutting over all orders to the new instance at one time or progressively (for example, blocks of subscribers).

Checklist:

  • If an in-place OSM database upgrade is planned:

    • Ensure that completed partitions and orders have been purged to reduce the amount of data to be upgraded.

    • Ensure that the procedure has been tested with a full production database and includes a back-out strategy.

  • The schedule for migration, going into production, and key post production activities is defined and Oracle Support is notified.

Useful information to share:

  • What is the planned migration procedure, including the back-out procedure?

  • What is the schedule for going into production and post going into production activities?

Checking for Database Management Procedures

Determining partition size and deciding on a purging policy are key to ensuring that storage capacity is reclaimed in a timely fashion and that system performance is maintained. For more information, see the discussion about managing the OSM schema in OSM System Administrator's Guide.

Checklist:

  • Oracle recommends that OSM use a partitioned database schema to benefit from the ability to purge partitions and from other partition maintenance procedures.

  • Ensure that the average expected daily and weekly storage consumption rate has been measured.

  • Ensure that a procedure exists for regularly creating empty partitions before they are needed. This procedure should align with the expected rate at which partitions are consumed. It is typically performed after completed partitions are purged.

  • Ensure that you have defined a purging strategy: partition based purging, row based purging, or a hybrid of both. For more information, see OSM System Administrator's Guide.

  • Ensure that there is a schedule for backing up the system and collecting statistics and that these activities are included in any performance tests that include purging (for example, during longevity tests).

  • If you are using partition based purging, do the following:

    • Ensure that partition size and purge frequency are defined and documented. The purge frequency should provide enough time for most orders (for example, 98%) to complete in partitions to be purged while factoring in the retention period. This ensures that storage capacity is reclaimed in a timely and predictable fashion.

    • Ensure that performance tests are run with purge_partitions, drop_empty_partitions, or both to confirm that the database, storage, and OSM are tuned for maximum purging performance.

    • If you have inter-order dependencies, you may need to use special purge criteria; conduct tests for these criteria and ensure that subsequent processing of dependent orders works as expected.

    • Ensure that a backup is taken before running the partition purge in case an error occurs during the purge.

    • Ensure that optimizer statistics are gathered after running a partition purge so that the updated statistics take into account the consolidated partitions.

    • Ensure that the database administrator is trained to recover from common purging problems identified in the discussion about troubleshooting and error handling in OSM System Administrator's Guide.

  • If you are using row based purging, do the following:

    • Ensure that purge frequency is defined and documented. The daily or weekly purge frequency is primarily a function of the amount of CPU and I/O available throughout the day and week. You must run performance tests to determine whether you have enough CPU and I/O to meet the expected daily and weekly order volume.

    • Ensure that the partition size is large enough that space freed in a partition can be reused by new orders in the same partition. Generally, a partition should be able to accommodate at least two months' worth of order data.

    • Plan a schedule for when row based purging will run on a daily and weekly basis, and the degree of parallelism it will use. If you have a high order processing volume, use less parallelism; if you have a lower order processing volume, you can use more parallelism.

  • Ensure that the maximum expected number of OSM partitions multiplied by the number of sub-partitions (for example, 32) never exceeds 4800, which is the practical database server scalability limit on the number of partitions.

  • Ensure that enough storage capacity is available to persist all in-progress orders and pre-created empty partitions, including a margin of safety in case purging of completed partitions is delayed.

  • If the OSM solution includes additional database tables, ensure that you have database management strategies that include purging (if applicable) and backup.

  • Ensure that operational procedures exist for purging and dropping partitions, for undeploying unneeded cartridge versions, and for creating empty partitions to receive new orders for the upcoming production period (for example, the next month).
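Two of the sizing rules in this checklist reduce to simple arithmetic that is easy to get wrong under schedule pressure: the 4800 (sub)partition scalability limit, and the two-months-of-orders minimum partition size for row based purging. The following is a minimal sketch of both checks; the function names and the 30-day month are illustrative assumptions, not OSM tooling.

```python
def partition_plan_ok(max_partitions, subpartitions_per_partition,
                      hard_limit=4800):
    """Check the practical scalability limit described above: the
    maximum expected number of partitions multiplied by the number
    of sub-partitions must never exceed the hard limit."""
    return max_partitions * subpartitions_per_partition <= hard_limit

def min_partition_size_for_row_purge(orders_per_day, months=2,
                                     days_per_month=30):
    """Row based purging guideline: size a partition for at least
    two months' worth of order data so that space freed in a
    partition can be reused by new orders in that partition."""
    return orders_per_day * days_per_month * months
```

For example, with 32 sub-partitions, no more than 150 partitions can ever exist (150 x 32 = 4800), which in turn bounds how small partitions can be made for a given order retention plan.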

Useful information to share:

  • What is the partition size (number of orders), number of Oracle RAC nodes, and number of subpartitions configured?

  • If using partition based purging, what is the expected number of days (starting from first order creation) for most (for example, ~98%) of the orders in a partition to complete so that the partition is ready for purging?

  • If using row based purging, what is the row based purging schedule and purge rate (purged orders per minute)?

  • How long do you plan to retain orders (order retention policy) after orders have completed?

  • How frequently do you plan to add and drop partitions?

  • Are you planning to use purge_partitions, which can retain a small percentage of orders that are not yet completed, or drop_empty_partitions, which requires that all orders in a partition have completed (or been aborted)?

Checking for Database Optimizer Statistics Schedule

Because OSM performance depends on optimal database performance, you must properly gather database optimizer statistics on a daily basis. For more information about database optimizer statistics, see OSM System Administrator's Guide.

Checklist:

  • Ensure that the daily and weekly OSM production schedule is defined, including expected peak order processing hours and the time when OSM batch operations occur.

  • Ensure that a job is scheduled to gather database statistics for highly volatile OSM order tables during expected daily high-volume periods, and that incremental statistics are disabled for these tables.

  • Ensure that a strategy is defined for handling low to medium volatility OSM order tables according to your scenario and that the tables are included in the daily statistics gathering job if applicable.

  • Ensure that incremental statistics are enabled for other OSM order tables and that enough low-volume time exists during the day for the database to gather statistics for these tables automatically. Confirm when this normally occurs.

  • If batch OSM operations do not naturally occur before database statistics are automatically gathered daily, you may need to schedule a job to gather statistics explicitly.

  • Ensure that a backup and disaster recovery strategy is in place for this OSM deployment.
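A scheduled statistics job for the volatile tables typically wraps calls to the standard Oracle DBMS_STATS.GATHER_TABLE_STATS procedure. The following is a minimal sketch that generates those calls; the table names and the degree of parallelism are hypothetical placeholders (the actual list of highly volatile OSM order tables comes from OSM System Administrator's Guide), and the generator function is illustrative, not part of OSM.

```python
# Hypothetical table names for illustration only; substitute the
# volatile OSM order tables identified for your release.
VOLATILE_TABLES = ["OM_ORDER_INSTANCE", "OM_ORDER_FLOW"]

def gather_stats_commands(schema, tables, degree=4):
    """Build one DBMS_STATS.GATHER_TABLE_STATS PL/SQL block per
    volatile table, for scheduling during daily high-volume periods."""
    return [
        ("BEGIN DBMS_STATS.GATHER_TABLE_STATS("
         f"ownname => '{schema}', tabname => '{table}', "
         f"degree => {degree}, cascade => TRUE); END;")
        for table in tables
    ]
```

The generated blocks can then be registered with the database scheduler so that statistics for these tables are refreshed while the representative high-volume workload is present, as the checklist item above requires.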

Useful information to share:

  • What is the daily OSM production schedule (peaks, batch operations, backups, and database statistics gathering)?

  • What are the key elements of the OSM database statistics gathering procedure?

Checking for Outage and Order Failure Plans

It must be possible to stop incoming orders so that they are queued in an upstream system (for example, Oracle AIA) during a planned OSM outage or an unplanned incident.

Ensure that a strategy exists to identify and correct order-related failures.

  • Ensure that operational procedures exist for stopping the creation of new orders in OSM and for dealing with systems that may be sending responses to OSM.

  • Ensure that procedures exist for introducing queued orders into OSM in a gradual way (see OSM Warm-up Procedures in 1919049.1) when needed (for example, after major changes to the system).

  • Ensure that a strategy, procedures, and tools exist to manage orders in fallout, including orders that may be stuck. This includes procedures for dealing with JMS messages that go into fallout queues and that may need to be resubmitted into processing queues.
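The gradual reintroduction of queued orders described above (see OSM Warm-up Procedures in 1919049.1) amounts to stepping the inbound rate up over time rather than releasing the backlog at once. The following is a minimal sketch of such a linear ramp schedule; the function name, step count, and step duration are illustrative assumptions, not values from the warm-up procedure itself.

```python
def warmup_rate_schedule(target_rate, steps=5, step_minutes=10):
    """Produce (elapsed_minutes, orders_per_hour) pairs that ramp
    the inbound order rate linearly up to the normal target rate,
    so caches, JIT compilation, and connection pools warm up
    before the system sees full load."""
    return [
        (step * step_minutes, target_rate * step // steps)
        for step in range(1, steps + 1)
    ]
```

For a normal rate of 1,000 orders per hour, this yields a 50-minute ramp of 200, 400, 600, 800, and finally 1,000 orders per hour; the throttling itself would be applied at the upstream system that holds the queued orders.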

Checking for Change Control Management Plans

As with any mission-critical software, when OSM is in service you must make changes to the environment (for example, configuration changes or cartridge deployments) in a controlled manner. Ensure that detailed steps exist for the introduction of any change.

Provide the following:

  • Document and test a specific backup strategy so that, if anything planned fails, the changes can be reversed to bring the system back to its previous state. This reversal may need to happen at the database level, the solution level, the cartridge level, or even the core OSM application level.

  • Document potentially involved and affected systems.

  • Document who implemented the changes (including prerequisites), their roles, and how to contact them during the period for which the changes are scheduled.

  • Document the approximate duration for every step.

  • Document prerequisite steps to be performed on OSM and external systems with outcomes and responsibilities, for example:

    • Taking a snapshot or backup of various other systems in case rollback is required.

    • Stopping the upstream flow of orders into OSM.

    • Waiting until the OSM inbound queues are drained and all orders are started.

    • Stopping the various managed servers.

    • And so on.

  • Document the step-by-step plan for what is to be done on OSM and external systems, with outcomes and responsibilities.

  • Document the steps that must be performed after the change with outcomes and responsibilities.

  • Document validation steps to ensure that the changes are active and working as expected.

  • Document any hardware, operating system, database, application server, and OSM changes.

In the event of a serious production issue, Oracle Support may request change control documentation to understand whether any recent changes contributed to the issue. The availability of this information can greatly shorten the resolution of an issue. It is also important to retain the ability to test the system under volume after going into production, which requires a separate pre-production environment.

Checklist:

  • Ensure that a change control management strategy exists that defines how changes are tracked, applied one at a time (or in small groups), validated, approved, and monitored after implementation (in case of problems).

  • Ensure that an initial baseline of configuration is captured and change control documentation exists.

  • Ensure that procedures exist (and are tested) for introducing solution cartridge changes. The procedure should include how to version cartridge changes, a regression test strategy and plan, and the ability to validate in-progress order compatibility if un-versioned cartridge changes are planned (see 1612273.1).

  • Ensure that procedures exist for applying patches (OSM, WebLogic Server, and database), including a test strategy.

  • Ensure that procedures exist for gathering database statistics after major changes.

  • Ensure that a pre-production environment exists to test future changes and investigate product issues (including performance issues) that may occur.

Useful information to share: Describe the pre-production environment and how it compares (for example, in processing capacity) to the production environment. Is it sufficient to validate performance and volume-related issues or changes?

Checking for Performance Monitoring Procedures

It is important to monitor the OSM workload and key system metrics to quickly detect any changes in workload or performance characteristics. Oracle support may request performance monitoring information to investigate any performance-related issue that may arise in production.

  • Ensure that a method exists for capturing and tracking the daily and hourly volume of OSM orders created and completed and the volume of tasks executed.

  • Ensure that a method exists for capturing and tracking the creation and completion of large orders (if applicable).

  • Capture and track key system performance metrics such as application server and database server CPU utilization, memory consumption, and storage I/O and capacity.