Monitoring and Sizing for Performance on AWS

Last updated May 07, 2025 13:54

Good performance of the Kion platform is key to giving your end users an excellent experience. There are several considerations to make when reviewing your environment for performance. This guide walks through those considerations and explains how to monitor and size your Kion environment when it is installed in AWS.

Kion Baseline Sizing

By default, all production instances of Kion should run with at least the following baseline resource sizes and quantities to avoid scenarios of poor performance regardless of conditions:

ECS: Baselines are provided in the template for each service within the cluster. These should not be adjusted down.
- With Daily Spend Data enabled (default in v3.12.0):
  - Financials Poller task should have a memory of at least 6144 (6GB).
  - Daily spend requires more processing power due to the increased amount of data being captured from financial reports.
  - You can verify the status of Daily Spend by checking Settings > System Settings > Financial Settings > Financial System Settings. If Daily Spend Data is enabled, you are processing Daily Spend granularity.
EC2:
- We recommend 2 nodes for production environments so that nodes can roll properly when changes within the cloud environment require it. Single-node production deployments are likely to become unavailable to users during these events.
- With Daily Spend Data enabled (default in v3.12.0):
  - 2 x t3.xlarge (or equivalent) nodes
    - Daily spend requires more processing power due to the increased amount of data being captured from financial reports.
    - You can verify the status of Daily Spend by checking Settings > System Settings > Financial Settings > Financial System Settings. If Daily Spend Data is enabled, you are processing Daily Spend granularity.
- Without Daily Spend Data enabled (v3.12.0 where it has been disabled or 3.11.x and prior):
  - 2 x t3.large (or equivalent) nodes
RDS: 1 x db.r5.large (or equivalent) node
- Because RDS snapshots are automatic, we do not typically recommend additional reader nodes for deployments as part of the baseline sizing.
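
The baseline above can be written down as a small lookup, which may be useful when checking deployment parameters in a script. This is a minimal sketch: the function name and dictionary keys are illustrative assumptions and are not part of Kion's templates.

```python
def baseline_sizing(daily_spend_enabled: bool) -> dict:
    """Return the baseline sizes listed above (key names are illustrative)."""
    sizing = {
        "ec2_node_count": 2,
        # t3.xlarge with Daily Spend Data enabled, t3.large without it.
        "ec2_instance_type": "t3.xlarge" if daily_spend_enabled else "t3.large",
        "rds_instance_count": 1,
        "rds_instance_type": "db.r5.large",
    }
    if daily_spend_enabled:
        # Financials Poller task memory of at least 6144 (6GB).
        sizing["financials_poller_min_memory"] = 6144
    return sizing


print(baseline_sizing(daily_spend_enabled=True))
```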

Key Performance Factors

The factors below are known to influence the performance of the Kion platform and may require you to increase resources beyond the baseline sizing:

The number of accounts that you manage in Kion.
- Each account managed by Kion has stored data as well as background activities that may take place in order to maintain those accounts.
- The baseline sizing handles a maximum of 200 cloud accounts, though your specific configuration may require additional resources. If you manage more than 200 cloud accounts, size your infrastructure using the steps in the remainder of this guide.
The use of Compliance.
- Kion’s Compliance engine is extremely powerful but can also impact the performance of the system heavily due to both the execution of scans and the posting of Compliance findings from accounts.
- The baseline sizing accounts for compliance checks that run no more often than once every 12 hours, for no more than 200 accounts and no more than 200 checks. If you exceed any of these limits, size your infrastructure using the steps in the remainder of this guide.
- Additionally, having a large number of resources with findings can also generally be a reason for slower performance in the platform. Monitoring your platform performance can help make this determination.
The size of your cloud provider billing reports.
- The number of accounts in your billing report (regardless of the number managed by Kion) will impact the performance of financials processing.
- The number of resources and the metadata for those resources will impact the performance of financials processing.
The number of resources in your cloud environment.
- Resource Inventory and Savings Opportunities directly scan your cloud resources if enabled and applicable. These services must process all of the resource data and store that information in the database and run on a daily basis.
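
The numeric limits in the list above can be checked mechanically. The sketch below is only an illustration: the Environment fields are hypothetical names for values you would gather yourself, and the check covers only the three limits that this guide states as numbers.

```python
from dataclasses import dataclass

BASELINE_MAX_ACCOUNTS = 200
BASELINE_MAX_CHECKS = 200
BASELINE_MIN_CHECK_INTERVAL_HOURS = 12


@dataclass
class Environment:
    managed_accounts: int
    compliance_checks: int
    shortest_check_interval_hours: float  # most frequent compliance schedule


def baseline_exceeded(env: Environment) -> list:
    """List the baseline limits from this guide that the environment exceeds."""
    reasons = []
    if env.managed_accounts > BASELINE_MAX_ACCOUNTS:
        reasons.append("more than 200 cloud accounts")
    if env.compliance_checks > BASELINE_MAX_CHECKS:
        reasons.append("more than 200 compliance checks")
    if env.shortest_check_interval_hours < BASELINE_MIN_CHECK_INTERVAL_HOURS:
        reasons.append("compliance checks more often than every 12 hours")
    return reasons


print(baseline_exceeded(Environment(350, 120, 6)))
```

An empty list only means these three limits are respected; billing report size and resource counts still need to be judged from monitoring.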

Indicators of a Performance Problem

These are some of the indications that you may have a performance problem with your instance of Kion:

Front-end operations are experiencing frequent 504 errors in a red toast message at the bottom right.
- This error is returned by the load balancer and indicates that the operation took too long to complete. The load balancer abandoned the operation. This does not indicate that the actual operation failed but that it was taking too long. See the section in this document about Kion Load Balancer Performance.
Financial data is repeatedly taking too long to load.
- Over a period of more than a week, the daily financial data has taken too long to come up-to-date in the system causing an error about financials being out-of-date multiple times. This may be an indication that the database is performing poorly during loads of large financial files. See the section in this document on Kion Database Performance.
Compliance scans are not completing in a timely fashion.
- A backlog of Compliance scans is an indicator that they may be scheduled too frequently or that additional CPU power is needed to complete these scans according to what you've requested. Consider adjusting the frequency of your Compliance scans or increasing the size of your Kion nodes if running EC2.

Kion Database Performance

Measuring and Analyzing Database Performance

The Kion Database is the primary cause of performance problems within the platform. For that reason, follow the steps below to sample your Kion Database performance for analysis:

Navigate to the RDS service where the database is hosted and select the cluster itself (not the writer only).
Select the Monitoring tab and then click on the CPU Utilization graph.
In the expanded graph, adjust these settings:
1. Set the Statistic to Maximum (do not use Average).
2. Set the Time Range to two weeks.
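
The console steps above can also be approximated with a CloudWatch query. The sketch below only builds the request parameters so that it runs without AWS access. It assumes the database is an Aurora cluster that reports under the DBClusterIdentifier dimension, and the cluster name is a hypothetical placeholder.

```python
from datetime import datetime, timedelta, timezone


def cpu_metric_request(cluster_id: str, now: datetime) -> dict:
    """Build GetMetricStatistics parameters for two weeks of maximum CPU."""
    return {
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        # Assumption: an Aurora cluster identified by DBClusterIdentifier.
        "Dimensions": [{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        "StartTime": now - timedelta(weeks=2),
        "EndTime": now,
        # 15-minute periods keep two weeks under the 1,440-datapoint limit
        # of a single GetMetricStatistics call.
        "Period": 900,
        "Statistics": ["Maximum"],  # Maximum, not Average
    }


request = cpu_metric_request("kion-db-cluster", datetime.now(timezone.utc))
print(request["MetricName"], request["Statistics"], request["Period"])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   datapoints = boto3.client("cloudwatch").get_metric_statistics(**request)["Datapoints"]
```

A 15-minute period is coarser than the console graph, so short spikes are still visible as maxima but their exact duration is not.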

When reviewing the graph produced by the steps above, consider the following:

Generally, your graph should look somewhat similar to this example graph for the writer:
The short spikes that occur multiple times per day are loads of financial data or potentially periods of heavy compliance activity.
If you see any periods on the graph where the database peaks at 100% CPU for longer than 5 minutes, this is an indication you may need to increase the database size because of the size of your financials processing jobs.
- The period of constraint at 100% CPU suggests potentially how overloaded your database may be. The longer the constraint, the worse the potential overload.
If you see that your database is not settling back to below 70% CPU regularly, this is an indication you may need to increase the database size as you’re not leaving enough headroom for financials processing amongst other tasks.
- In the example graph above, the database settles back to around 40% CPU regularly, which is considered healthy.
High CPU usage on the database writer is not always caused by query load alone. It can also be an indication that the instance is running low on memory and is swapping data to the disk for operations. This information from AWS provides some background.
- In particular, measuring the BufferCacheHitRatio metric can indicate whether or not the buffer cache is being used regularly. A low number here is undesirable as it means that the buffer cache is not being used efficiently and the database writer is potentially running low on memory.
If you’ve found that your database is consistently running at or near 100% CPU, you should check for long-running query operations before you proceed using the instructions in the next section.
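
The two CPU thresholds described above (100% for longer than 5 minutes, and settling back below 70%) can be applied to a list of exported samples. This is a rough sketch under stated assumptions: samples are evenly spaced maximum-CPU values, consecutive samples at the threshold are treated as continuous saturation, and "not settling regularly" is simplified to "never drops below 70%". The function names are illustrative.

```python
def longest_saturated_minutes(samples, period_minutes, threshold=100.0):
    """Longest run of consecutive samples at or above the threshold, in minutes."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest * period_minutes


def review_cpu(samples, period_minutes):
    """Return the warning signs from this guide that the samples show."""
    findings = []
    if longest_saturated_minutes(samples, period_minutes) > 5:
        findings.append("CPU at 100% for longer than 5 minutes")
    if samples and min(samples) >= 70.0:
        findings.append("CPU never settles below 70%")
    return findings


# Hypothetical 5-minute samples: a 15-minute plateau at 100%, then recovery.
samples = [40.0, 45.0, 100.0, 100.0, 100.0, 60.0, 40.0]
print(review_cpu(samples, period_minutes=5))
```

CloudWatch may report values such as 99.9 during saturation, so a slightly lower threshold argument may be more practical than the default.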

Check for Long-Running Queries

This check is only necessary if you find that your database is running at or near 100% CPU continuously. If you do not find these conditions in your graph, you may skip this section. These conditions are abnormal and could indicate a data-related issue that may not be corrected by adjusting your database size. Follow the steps below to check for long-running queries:

Connect to the Kion database.

Execute this SQL command:

select * from information_schema.processlist where command != "Sleep" order by time desc limit 10;

This command will output a table of the currently running queries on the database similar to the one below:

+--------+-------+------------------+------------+---------+------+-----------+---------------------------------------------------------------------------------------------------+
| ID     | USER  | HOST             | DB         | COMMAND | TIME | STATE     | INFO                                                                                              |
+--------+-------+------------------+------------+---------+------+-----------+---------------------------------------------------------------------------------------------------+
| 410061 | admin | 10.0.0.226:36096 | cloudtamer | Query   |    0 | executing | select * from information_schema.processlist where command != "Sleep" order by time desc limit 10 |
+--------+-------+------------------+------------+---------+------+-----------+---------------------------------------------------------------------------------------------------+

Because the results are ordered by time descending, the longest-running query appears at the top of the list, with its runtime in seconds shown in the TIME column. Under typical conditions, the Kion platform should not have queries that run longer than 300 seconds. If you see queries with a longer runtime than this, the longest-running query is likely to be what is most constraining the database instance at the time of observation.
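
If you capture the process list programmatically, the 300-second guideline can be applied as a simple filter. The sketch below assumes each row is a dictionary keyed by the column names shown in the table above. The sample rows are made up for illustration.

```python
LONG_RUNNING_SECONDS = 300


def long_running_queries(rows, limit_seconds=LONG_RUNNING_SECONDS):
    """Return non-sleeping rows whose TIME exceeds the limit, longest first."""
    active = [
        row for row in rows
        if row["COMMAND"] != "Sleep" and row["TIME"] > limit_seconds
    ]
    return sorted(active, key=lambda row: row["TIME"], reverse=True)


# Hypothetical rows; real rows come from information_schema.processlist.
rows = [
    {"ID": 1, "COMMAND": "Query", "TIME": 0, "INFO": "example query A"},
    {"ID": 2, "COMMAND": "Query", "TIME": 912, "INFO": "example query B"},
    {"ID": 3, "COMMAND": "Sleep", "TIME": 4000, "INFO": None},
    {"ID": 4, "COMMAND": "Query", "TIME": 450, "INFO": "example query C"},
]
for row in long_running_queries(rows):
    print(row["ID"], row["TIME"], row["INFO"])
```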

If you locate a long-running query, please reach out to Kion Support at support@kion.io before proceeding for assistance.

Increasing Database Size

Once you’ve analyzed your database performance, you may have found that you should increase the size of the database. The sections below provide guidance on selecting an increase in sizing.

Selecting the Correct RDS Instance Size

Kion generally recommends that you remain with the db.r-class instances for the increased memory capacity. We do not recommend using the db.t- or db.m-class instances. Refer to the AWS documentation for specifications on the RDS instance types and sizes.

Generally, we recommend that you take the following paths when increasing your size:

(current) → (current equivalent in Graviton) → (next step size in Graviton)

Here’s an explanation with embedded examples:

You start with your current size, which needs to be increased.
For smaller shifts where you are just on the cusp of needing an upgrade, we recommend that you shift to the same size as your current database but in Graviton.
- For example, a db.r5.large would equate to something like a db.r7g.large. The cost for this on AWS Commercial would be roughly the same but with an expectation of better performance.
If you’re already using Graviton or the shift you need to make is larger (your database is severely constrained), you would step to the next size in Graviton.
- For example, if you made the change above and you’re already on db.r7g.large but performance has not significantly improved, you would step to the db.r7g.xlarge.
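
The upgrade path above can be expressed as a small helper. This is a sketch under assumptions that are not stated in this guide: db.r7g is used as the Graviton target family (matching the examples), the size ladder is the list hard-coded below, and a family is treated as Graviton when its suffix contains "g" (for example r6g or r7g). Check the AWS documentation for what is actually available in your region.

```python
# Assumed size ladder for Graviton db.r-class instances.
SIZES = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge", "12xlarge", "16xlarge"]
GRAVITON_FAMILY = "r7g"


def next_instance_class(current: str, severely_constrained: bool = False) -> str:
    """Suggest the next db.r-class instance following the path described above."""
    prefix, family, size = current.split(".")
    if prefix != "db" or not family.startswith("r"):
        raise ValueError("only db.r-class instances are covered")
    if size not in SIZES:
        raise ValueError(f"size not covered by this sketch: {size}")
    on_graviton = "g" in family[2:]
    target_family = family if on_graviton else GRAVITON_FAMILY
    if on_graviton or severely_constrained:
        index = SIZES.index(size) + 1
        if index >= len(SIZES):
            raise ValueError("already at the largest size in this sketch")
        size = SIZES[index]
    return f"db.{target_family}.{size}"


print(next_instance_class("db.r5.large"))
print(next_instance_class("db.r7g.large"))
```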

You should make use of the AWS Calculator to help you understand the cost differences between what you purchase today and what you plan to move towards before making any changes.

Making the Adjustment in Size

Once you’ve selected your next target RDS instance size, you should make the change using the CloudFormation Template you deployed for your Kion database. Perform a stack update that keeps the existing template unchanged, and adjust only the Instance Type field to the desired instance type and size.

NOTE: This adjustment will result in downtime. We recommend that you plan this adjustment during a maintenance window as the application will become unavailable for a period of around 15 minutes during the change.

NOTE: You should take a manual snapshot before executing this change as a measure of caution. An automatic backup is not sufficient when making changes to the CloudFormation Template for an RDS cluster.

After you’ve made your update, submit the change and allow the template to coordinate the changes for your cluster. We do not recommend modifying your cluster directly as this will create drift with your database deployment in CloudFormation.
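
For teams that script their stack updates, the same change can be described as an UpdateStack request that reuses the deployed template. The sketch below only builds the request so that it runs without AWS access. The stack name, the parameter key "InstanceType", and the other parameter keys are hypothetical; use the names from your own template.

```python
def resize_stack_request(stack_name, instance_type, other_parameter_keys,
                         instance_type_key="InstanceType"):
    """Build UpdateStack arguments that change only the instance type."""
    parameters = [{"ParameterKey": instance_type_key, "ParameterValue": instance_type}]
    # Every other parameter keeps the value it already has in the stack.
    parameters += [
        {"ParameterKey": key, "UsePreviousValue": True}
        for key in other_parameter_keys
    ]
    return {
        "StackName": stack_name,
        "UsePreviousTemplate": True,  # update without changing the template
        "Parameters": parameters,
    }


request = resize_stack_request(
    "kion-database",                 # hypothetical stack name
    "db.r7g.large",
    ["VpcId", "SubnetIds"],          # hypothetical parameter keys
)
print(request)

# With boto3 installed and credentials configured, after taking a manual snapshot:
#   import boto3
#   boto3.client("cloudformation").update_stack(**request)
# Templates that create IAM resources also need a Capabilities argument.
```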

Post-Adjustment Monitoring

After making the adjustment to your database, you should repeat the steps in the Measuring and Analyzing Database Performance section to ensure that you’ve made a meaningful impact with your change. If you find that your database is still constrained, please reach out to the Kion support team for assistance at support@kion.io so that we can help you verify that no other problems exist in your environment.

Kion Load Balancer Performance

If you're frequently seeing 504 error messages on the front-end but do not regularly see a database problem, you may need to simply increase the timeout for your load balancer to allow an operation to complete. Instructions are provided below as a simple way to make this adjustment and see if this resolves the problem.

Adjusting User Load Balancer Timeout

Within the AWS account and region where Kion is installed, navigate to the EC2 service.
Select Load Balancers.
Select the load balancer that has "ulb" or "V2Use" in the name. This is your User Load Balancer.
Choose the Attributes tab and select Edit.
Under Connection idle timeout, set the value to no more than 2 minutes (120 seconds). We do not recommend exceeding 2 minutes.
Select Save Changes to complete the change.
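
The same setting can be changed through the Elastic Load Balancing API. The sketch below assumes the User Load Balancer is an Application Load Balancer, where the idle timeout is the attribute idle_timeout.timeout_seconds. It only builds the attribute list and enforces the 2-minute ceiling recommended above. The helper name is illustrative.

```python
MAX_IDLE_TIMEOUT_SECONDS = 120  # 2 minutes, the recommended ceiling


def idle_timeout_attributes(seconds: int) -> list:
    """Build the attribute list for an idle timeout within the recommended limit."""
    if not 1 <= seconds <= MAX_IDLE_TIMEOUT_SECONDS:
        raise ValueError("idle timeout must be between 1 and 120 seconds")
    # Attribute values are passed as strings.
    return [{"Key": "idle_timeout.timeout_seconds", "Value": str(seconds)}]


attributes = idle_timeout_attributes(120)
print(attributes)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("elbv2").modify_load_balancer_attributes(
#       LoadBalancerArn="<ARN of the User Load Balancer>",
#       Attributes=attributes,
#   )
```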

If this method does not resolve your problem, please reach out to the Kion support team for assistance at support@kion.io so that we can help you identify the cause of your problem.

Ongoing Monitoring

As usage of your Kion environment grows, as you add more accounts, or as you enable new features, you will need to revisit your sizing. This is an ongoing effort to ensure that the application performs well at all times. Here are some guidelines on monitoring performance:

Outside of any apparent performance issues, we recommend that you revisit sizing both before and after making any significant changes to the platform. This includes performing upgrades between major versions of Kion.
We recommend that you take a few minutes to monitor your environment quarterly to help you ensure that performance issues do not impact your environment unexpectedly.