Mongo CPU Overload Investigation and Logs#
Created: June 22, 2021 5:52 PM
Phenomenon#
In the production environment, the CPU utilization of the primary database periodically exceeds 80% within 5 minutes at specific times (04:05, 04:15, 07:05, 07:15, 11:05), as well as non-periodically (04:15-04:30, 11:05-11:15).
- Due to the periodic and non-continuous nature of the CPU spikes, it is suspected that there may be an issue with scheduled tasks.
- Cross-reference the alert times against the scheduled tasks configured on AWS EC2 to narrow down candidates.
Objective Causes of CPU Overload#
Full Table Scan (COLLSCAN)#
- What: COLLSCAN is a high CPU-consuming operation that scans all records in a collection or table, filtering them according to conditions. If the keyword COLLSCAN is found in the profile collection or logs and this operation is frequent, it is necessary to optimize by creating indexes for the query fields.
- How to locate: You can determine if a slow query full table scan has occurred by checking the planSummary parameter in the Profile.
- How to address: Create appropriate indexes.
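The COLLSCAN check above can be automated against profiler output. A minimal sketch, assuming the documents are shaped like system.profile entries (the samples here are hardcoded stand-ins, with field names taken from the real profiler output):

```python
# Flag full-scan operations in a batch of system.profile-style documents.
def find_collscans(profile_docs):
    """Return (namespace, millis) pairs for operations that did a full scan."""
    return [(d["ns"], d["millis"])
            for d in profile_docs
            if d.get("planSummary", "").startswith("COLLSCAN")]

# Hardcoded stand-ins for documents read from db.system.profile.find().
sample = [
    {"ns": "bill.transaction", "planSummary": "COLLSCAN", "millis": 15131},
    {"ns": "bill.merchant", "planSummary": "IXSCAN { _id: 1 }", "millis": 2},
]
print(find_collscans(sample))  # → [('bill.transaction', 15131)]
```

In practice you would feed this the result of a query against system.profile rather than hardcoded dicts; the filtering logic is the same.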
Improper Indexing#
- What: Improper indexing can cause excessive scanning of indexes during queries, resulting in high CPU overhead.
- How to locate: You can locate index queries by checking the planSummary value in system.profile or Query Plans, and determine if the index is appropriate based on keysExamined.
- How to address: Create appropriate indexes.
Sorting Large Data Sets#
- What: When a sorting request cannot be completed using indexes, MongoDB will sort the data in memory. If the data set is large, it can cause high CPU usage.
- How to locate: When you find the SORT keyword in the system.profile collection, consider optimizing the sorting by using indexes. The hasSortStage field in the system.profile collection will be true if the request includes a sorting stage.
- How to address: Create indexes for sorting purposes.
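The hasSortStage flag mentioned above makes in-memory sorts easy to pick out programmatically. A sketch, again over stand-in documents shaped like system.profile entries:

```python
# List namespaces whose operations sorted in memory: hasSortStage is true
# when the plan contains a SORT stage, i.e. no index satisfied the sort.
def in_memory_sorts(profile_docs):
    return [d["ns"] for d in profile_docs if d.get("hasSortStage")]

# Hardcoded stand-ins for system.profile documents.
sample = [
    {"ns": "bill.transaction", "hasSortStage": True, "millis": 900},
    {"ns": "bill.merchant", "hasSortStage": False, "millis": 3},
]
print(in_memory_sorts(sample))  # → ['bill.transaction']
```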
Exceeding Service Capacity#
- If all the above issues have been investigated and all queries and indexes are optimized, it indicates that the service capacity is insufficient for the business volume.
- Consider upgrading the configuration in this case.
How to Scientifically Create Indexes in MongoDB#
- Read the documentation (https://docs.mongodb.com/manual/indexes/)
- High selectivity: distinct value count / total value count
- No null values: do not create indexes for null values
- Frequently used for updates and query filters
- Ascending or descending order:
- In a single-field index, the order does not matter because MongoDB's indexes are sorted and can be queried in both directions.
- In a compound index, the sort direction of the second field can affect query performance. For example, if the second field is order time, a descending index lets the most recent orders be located faster when updating orders.
- Do not blindly create indexes, as there is a cost associated with index creation and insertion.
- This topic is too large to cover here, it can be expanded upon later.
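The selectivity rule above (distinct value count / total value count) can be sketched directly. The sample documents below are made up for illustration; in practice you would sample the collection rather than load it wholesale:

```python
# Selectivity = distinct non-null values / total non-null values.
# High selectivity (near 1.0) makes a field a good index candidate.
def selectivity(docs, field):
    values = [d[field] for d in docs if d.get(field) is not None]
    return len(set(values)) / len(values) if values else 0.0

# Made-up sample: a unique timestamp per document, one channel value.
docs = [{"channel": "A", "transaction_time": t} for t in range(1000)]
print(selectivity(docs, "transaction_time"))  # → 1.0 (good candidate)
print(selectivity(docs, "channel"))           # → 0.001 (poor candidate)
```

This mirrors the situation found later in this investigation: a timestamp field is highly selective, while a 5-value enumeration or a Boolean is not.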
Monitoring and Identifying the Above Issues#
CurrentOp - Current Operation Information#
During peak periods, use db.currentOp() to check the status of the overloaded database. The meaning of the returned values can be found at (https://docs.mongodb.com/manual/reference/command/currentOp/#currentop-output-fields)
Profiler - Performance Analysis Logs#
- The MongoDB query performance logging feature is disabled by default. It is recommended to only enable it in the pre-production environment due to its slight impact on database performance.
- Its main function is to store database commands and their execution details in the system.profile collection of each database. This is a capped (fixed-size) collection: once it exceeds the configured size, the oldest entries are overwritten automatically. At its current size it holds roughly the 950 most recent records (a consequence of our high slow-query volume).
- The profiler is essentially the log form of currentOp; the fields it records largely mirror currentOp's output.
- The profiler is configured with db.setProfilingLevel(level), which has three levels: 0 (disabled), 1 (record only operations slower than the slowms threshold, in milliseconds), 2 (record all operations).
- You can use db.getProfilingStatus() to check the current profiler settings for the db.
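The settings above can be applied per database in the mongo shell. A short sketch of the session (level 1 with a 200 ms slowms threshold, the value used later in this note; run it against the database you want to profile, e.g. after `use bill`):

```
// Enable level 1 profiling: only operations slower than slowms are recorded.
db.setProfilingLevel(1, 200)   // second argument sets slowms to 200 ms

// Verify the current settings for this db (returns the level and slowms).
db.getProfilingStatus()
```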
Pay special attention to the following parameters#
secs_running
- The running time of the operation, in seconds.
microsecs_running
- The running time of the operation, in microseconds.
- What is considered slow?
- By default, MongoDB considers queries taking more than 100ms as slow queries. In our case, we have a full table scan query that takes 30s. Therefore, I will set it to 200ms initially, and we can discuss the value based on the complexity of the business.
planSummary
- COLLSCAN: Full table scan
- IXSCAN { field_name: 1 }: Index scan, indicating which index is used internally
- IDHACK: Primary key index scan, skipping other query plans and providing a more optimized query based on the primary key. This appears in the query when there is a value for _id.
keysExamined
- Used to determine if the existing indexes are appropriate for this query.
- Indicates how many index keys were examined to complete the query. The larger the value relative to the number of documents returned, the higher the CPU overhead.
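A sketch of how keysExamined can be turned into a quick fitness check, assuming profiler documents with the fields described above. A ratio near 1 means the index matched the query tightly; a large ratio means many keys were scanned and discarded:

```python
# Ratio of index keys examined to documents returned for one profiler doc.
def index_efficiency(doc):
    returned = doc.get("nreturned", 0)
    examined = doc.get("keysExamined", 0)
    if returned == 0:
        # Examined keys but returned nothing: worst case for the index.
        return float("inf") if examined else 0.0
    return examined / returned

good = {"keysExamined": 100, "nreturned": 100}
bad = {"keysExamined": 500000, "nreturned": 10}
print(index_efficiency(good))  # → 1.0
print(index_efficiency(bad))   # → 50000.0
```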
Identifying Problematic Code#
Periodic#
04:05 (the alert is a 5-minute average, so the spike starts around 04:00; 04:05 UTC = 12:05 UTC+8)
- sync_transaction_data
- Task A (runs every 4 hours) (mainly reads from MySQL Membership and writes to Redis api db)
04:15 (12:15)
- Task B
07:05 (15:05)
- sync_transaction_data (second time in the afternoon)
- Task C (mainly operates on MySQL read replica)
11:05 (19:05)
- sync_transaction_data (third time in the evening)
- Task C
Non-periodic#
- (04:15-04:30, 11:05-11:15) Based on the time, it should be a follow-up issue with sync_transaction_data.
- But it's still worth checking with profile.
Other Possible Causes#
If the periodic and non-periodic issues have been ruled out and the problem still persists, further investigation is needed.
Practice#
- Enable Profile performance monitoring in the Freeze environment, set it to level 1, and set the slowms threshold to 200 ms.
- Monitor the execution of asynchronous tasks using Flower.
- Monitor CPU usage using AWS CloudWatch.
- Execute sync_transaction_data.delay() in the Freeze Shell.
- Open MongoDB and check the slow query logs in the profile collection under the Bill database.
- Found over 900 log entries like [1] at the end of this note, each indicating a full table scan.
- The query filters on three fields: transaction_time, channel, and settlement_sign. channel is an enumeration with only 5 distinct values and settlement_sign is a Boolean (0/1), so their selectivity is far too low; only transaction_time is a suitable index field.
- After adding an index, CPU usage dropped from 99.9% to 3%, and the asynchronous task time decreased from 900s to 19s.
Maintenance and Prevention#
- Enable Profile performance monitoring in the Freeze environment.
- Regularly check and handle slow queries.
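For the "regularly check slow queries" step, profiler output can be summarized so the worst offenders stand out. A sketch over stand-in documents shaped like system.profile entries (the namespace and plan mirror the sample log in this note):

```python
from collections import defaultdict

# Aggregate slow-query documents by (namespace, plan), totalling count
# and milliseconds, so recurring offenders surface immediately.
def summarize(profile_docs):
    totals = defaultdict(lambda: {"count": 0, "millis": 0})
    for d in profile_docs:
        key = (d["ns"], d.get("planSummary", "?"))
        totals[key]["count"] += 1
        totals[key]["millis"] += d.get("millis", 0)
    return dict(totals)

# Hardcoded stand-ins for system.profile documents.
docs = [
    {"ns": "bill.transaction", "planSummary": "COLLSCAN", "millis": 15131},
    {"ns": "bill.transaction", "planSummary": "COLLSCAN", "millis": 14980},
]
print(summarize(docs))
# → {('bill.transaction', 'COLLSCAN'): {'count': 2, 'millis': 30111}}
```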
[1]
{
  "op": "query",
  "ns": "bill.transaction",
  "keysExamined": 0,
  "query": {
    "find": "transaction",
    "filter": {
      "transaction_time": {
        "$gte": {"$date": "2018-05-20T23:00:00.000+0000"},
        "$lte": {"$date": "2018-05-21T22:59:59.999+0000"}
      },
      "channel": "非再惠渠道",
      "settlement_sign": {"$in": [1]}
    }
  },
  "docsExamined": 2882636,
  "cursorExhausted": true,
  "numYield": 22635,
  "nreturned": 0,
  "responseLength": 104,
  "protocol": "op_query",
  "millis": 15131,
  "planSummary": "COLLSCAN",
  "execStats": {
    "stage": "COLLSCAN",
    "nReturned": 0,
    "executionTimeMillisEstimate": 14727,
    "works": 2882638,
    "advanced": 0,
    "needTime": 2882637,
    "needYield": 0,
    "saveState": 22635,
    "restoreState": 22635,
    "isEOF": 1,
    "invalidates": 0,
    "direction": "forward",
    "docsExamined": 2882636
  },
  "ts": {"$date": "2018-05-23T12:30:42.539+0000"}
}