Diagnosing Out of Memory Issues – AutoPilot

Introduction

This article covers how to identify and troubleshoot Out of Memory (OOM) errors in your AutoPilot environment.

Symptoms

If your jump server runs out of memory, the kernel is forced to kill processes in order to free memory. In some cases, this can cause the server to become unresponsive, resulting in timeout errors when you try to connect with SSH, or gateway timeout errors on your website.

Immediate Resolution

If your jump server becomes unresponsive, you can restart it in the AutoPilot interface in the Actions tab. This process should only take a few minutes.

Once the jump server is back online, follow the steps below to diagnose why it ran out of memory.

Viewing Available Memory

First, check how much available memory the server currently has. The amount of memory on your jump server depends on the type of AutoPilot deployment you have chosen. You can check your instance type(s) in the AutoPilot interface under the Adjust tab. For example, the t4g.large instance type has 8 GiB of memory.

You can verify how much memory is available and used with the following command:

free -h

You will see output similar to the following:

      total   used   free  shared  buff/cache  available
Mem:  7.6Gi  4.1Gi  2.5Gi   431Mi       1.0Gi      2.9Gi

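The important statistic is available memory, which estimates how much memory can be allocated to processes without swapping, including cache that the kernel can reclaim. Do not use free memory as a measure of available memory: free counts only memory that is completely unused, so it is usually lower.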

In this case, the server has 7.6 GiB total memory, and 2.9 GiB available. If the server has less than 1 GiB of available memory, you may need to upgrade your instance type to provide more memory.
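
The 1 GiB guideline above can also be checked from a script. The following is a minimal sketch, assuming a Linux server where /proc/meminfo exposes a MemAvailable line; the function names available_mib and is_memory_low are hypothetical helpers, not part of AutoPilot.

```shell
#!/usr/bin/env bash

# Print available memory in MiB, read from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument is an assumption
# added here so the function can be pointed at a sample file.
available_mib() {
    local meminfo="${1:-/proc/meminfo}"
    # MemAvailable is reported in kB; divide by 1024 to get MiB.
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "$meminfo"
}

# Succeed (exit status 0) when available memory is below the threshold.
# The 1024 MiB default mirrors the 1 GiB guideline in this article.
is_memory_low() {
    local threshold_mib="${1:-1024}"
    local meminfo="${2:-/proc/meminfo}"
    local avail
    avail="$(available_mib "$meminfo")"
    [ "$avail" -lt "$threshold_mib" ]
}

# Example usage:
#   if is_memory_low 1024; then
#       echo "Low memory: $(available_mib) MiB available"
#   fi
```

The value printed by available_mib should roughly match the available column of free -h, since both are derived from the same kernel statistic.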

Viewing Memory Usage by Process

To view a list of processes running on the server sorted by their memory consumption, use the following command:

ps aux --sort=-pmem | head

This will provide a sorted list of processes, starting with the highest memory usage at the top.
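
When many small processes share the same name, no single one may stand out in the list above. As a rough sketch, the resident memory (RSS) of processes can be summed per command name. The function name sum_rss_by_command is a hypothetical helper, and because RSS counts shared pages once per process, the totals are an upper estimate rather than exact usage.

```shell
#!/usr/bin/env bash

# Sum resident memory (RSS) per command name, largest total first.
# Reads lines of "rss_kib command" on stdin, as printed by
# "ps -eo rss=,comm=", and prints "total_mib process_count command".
sum_rss_by_command() {
    awk '{
             kib = $1              # first field: RSS in KiB
             $1 = ""               # drop it so $0 holds only the name
             sub(/^[ \t]+/, "")    # trim the leftover leading space
             rss[$0] += kib
             count[$0]++
         }
         END {
             for (name in rss)
                 printf "%d %d %s\n", rss[name] / 1024, count[name], name
         }' | sort -rn
}

# Example usage:
#   ps -eo rss=,comm= | sum_rss_by_command | head
```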

Viewing Out of Memory Logs

The systemd journal on the server records Out of Memory (OOM) errors, including a list of processes that were running when the OOM occurred. To list the processes that were killed because the server ran out of memory, use the following command:

journalctl --since "1 day ago" | grep "Out of memory"

You can change 1 day ago in the command to whichever timeframe you prefer.
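
If the server has had several OOM events, it can help to count how often each process name was killed. The following is a minimal sketch, assuming the kernel writes kill lines containing "Killed process PID (name)", which is the usual wording on recent Linux kernels; count_oom_kills is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Count OOM kills per process name from log lines on stdin.
# Assumes each kill is logged with the text "Killed process PID (name)".
count_oom_kills() {
    grep -o 'Killed process [0-9]* ([^)]*)' |   # keep only the kill record
        sed 's/.*(\(.*\))/\1/' |                # extract the name in parentheses
        sort | uniq -c | sort -rn               # count per name, most killed first
}

# Example usage:
#   journalctl --since "1 day ago" | count_oom_kills
```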

The "Out of memory" search above only shows which processes were killed, not a full list of processes that were running when the OOM occurred. To view all processes that were running at the time, use the following command:

journalctl --since "1 day ago" | grep "kernel: Tasks state (memory values in pages):" -A100

The -A100 flag at the end of the command tells grep to print the 100 lines that follow each match, which covers roughly the first 100 processes that were running at the time of the OOM. If more processes were running, increase this number to see all of them.

As an example, you may see output similar to the following:

In this example, we can see numerous php8.2 processes running when the OOM occurred. This is an unusual number of processes and usually indicates that overlapping Magento cron processes or other PHP-CLI scripts are running. If you are using Magento, check the Magento logs for further information.
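
To see at a glance how many instances of each process appear in a task dump, the entries can be counted by name. This is a sketch under two assumptions: each task line looks like "kernel: [ PID ] ..." and the process name is the last column. The columns in between vary between kernel versions, so only the name is used. count_oom_tasks is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Count task-dump entries per process name from journal lines on stdin.
# Matches lines such as "kernel: [   1234] ... php8.2" and uses the last
# whitespace-separated field as the name, so a name that contains spaces
# is counted under its last word only.
count_oom_tasks() {
    awk '/kernel: \[ *[0-9]+\]/ { n[$NF]++ }
         END { for (name in n) print n[name], name }' | sort -rn
}

# Example usage:
#   journalctl --since "1 day ago" \
#       | grep -A100 "Tasks state (memory values in pages):" \
#       | count_oom_tasks | head
```

If the chosen timeframe contains more than one OOM event, the counts are totals across all of the dumps, so narrow the --since window to look at a single event.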

Conclusion

Using the information you found with these commands, you should have a better idea of what is consuming memory on your system. If you need further assistance, please open a support ticket with us with as much detail as possible regarding the issue.