Redis troubleshooting pocket guide

Last updated 18 Apr 2024

Symptoms

Latency issues, other problems, or a routine health check.

Changes

Configuration changes to the software or to the system, as well as changes in the workload or dataset size, may introduce latency.

Identify issues on Redis hosts

  • Check that disk space is not excessively consumed using "df -h". Also check that the log directory has not grown unexpectedly using "du -sh /var/opt/redislabs/log/". If both look normal, proceed to check other possible causes
  • Check that RAM and CPU are not excessively consumed. It is recommended that RAM and CPU utilization each stay below 80%. The host resources must be exclusively available to the Redis software
  • Verify using "free" that swap is either not configured or not in use
  • It is recommended to keep the host clock in sync with a time server. Verify using "timedatectl", "ntpq -p", or "chronyc sources"
  • Check the output of "env" and remove the https_proxy/http_proxy variables if they exist, for example: "unset https_proxy"
  • Review system logs, including the syslog or journal, for any error messages, warnings, or critical events
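
The disk and swap checks above can be scripted. The following is a minimal sketch, not an official tool: the function names are hypothetical, and applying the 80% threshold to disk usage is an assumption (the guide states 80% only for RAM and CPU). Both functions read command output on stdin, so they can be tried with saved output as well as on a live host.

```shell
# Print mount points whose usage is at or above a threshold (default 80,
# an assumed value). Expects `df -P` output on stdin: the POSIX format keeps
# each filesystem on one line with capacity in column 5 and the mount point
# in column 6. Mount points containing spaces are not handled.
full_filesystems() {
  awk -v limit="${1:-80}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6 " " use "%"
    }'
}

# Print the "used" column of the Swap line from `free` output on stdin.
# A value of 0 means swap is idle or not configured.
swap_used() {
  awk '/^Swap:/ { print $3 }'
}

# Usage on a live host:
#   df -P | full_filesystems 80
#   free | swap_used
```

An empty result from full_filesystems means no filesystem crossed the threshold.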

Identify potential issues caused by security hardening

  • Temporarily disable any security or hardening software and check whether the problem is resolved. Examples: SELinux, Cylance, McAfee, Dynatrace, ...
  • The Linux user "redislabs" must have read/write access to the /tmp directory. Verify using "su - redislabs -s /bin/bash -c 'touch /tmp/test'"
  • A restrictive umask can cause issues. If the umask differs from the default 022, it might prevent normal operation. Consult your sysadmin and revert to the default umask
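
The umask and /tmp checks above can be sketched as two small helpers. The function names and the probe file name are hypothetical. Note that can_write_dir tests the user running the script, so it would need to be run as "redislabs" to reproduce the check described above.

```shell
# Succeed if the given umask value equals the default 022.
# Both the three-digit and four-digit spellings are accepted.
umask_is_default() {
  case "$1" in
    022|0022) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed if the current user can create and remove a file in the directory.
can_write_dir() {
  probe="$1/.write_probe.$$"
  touch "$probe" 2>/dev/null && rm -f "$probe"
}

# Usage:
#   umask_is_default "$(umask)" || echo "non-default umask: $(umask)"
#   can_write_dir /tmp || echo "cannot write to /tmp"
```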

Identify Redis cluster issues

  • Execute "supervisorctl status" and verify all processes are in a RUNNING state.
  • Execute "rlcheck" and verify no errors appear
  • Execute "rladmin status issues_only" and verify no issues appear
  • Execute "rladmin status shards" and verify that the used memory of shards participating in the same database is balanced and that each shard does not exceed 25GB
  • Execute "rladmin cluster running_actions" and verify no tasks appear
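
For the process check in this list, the output of "supervisorctl status" can be filtered so that only problem entries remain. This is a minimal sketch with a hypothetical function name; it relies on the standard supervisorctl layout, where each line starts with the process name followed by its state.

```shell
# Print every supervisor-managed process whose state is not RUNNING.
# Reads `supervisorctl status` output on stdin: name, state, then details.
not_running() {
  awk 'NF >= 2 && $2 != "RUNNING" { print $1 " " $2 }'
}

# Usage on a cluster node:
#   supervisorctl status | not_running
```

No output means every listed process reported RUNNING.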

Troubleshooting connectivity

  • Check whether the Redis endpoint can be resolved on the client machine using "dig <endpoint>". If the resolution fails, check whether the Redis endpoint can be resolved on one of the cluster nodes using "dig @localhost <endpoint>". If that resolution succeeds, the problem is with the organizational DNS.
  • To identify any issue with the client app, check connectivity from the client machine to the database using redis-cli: "redis-cli -h <endpoint> -p <port> -a <password> info" or, for TLS-enabled databases, "redis-cli -h <endpoint> -p <port> -a <password> --tls --insecure --cert <client_cert_file> --key <client_key_file> ping". If that fails, check connectivity to the database using redis-cli from one of the cluster nodes. If that fails, the issue is with the network. Consult your sysadmin.
  • Verify that the client connects using the database name and not an IP address
  • Verify that the database is configured with an eviction policy and key expiration to avoid out-of-memory (OOM) conditions
  • Verify that access to the database is not blocked by a firewall on the client side or the Redis side, using "iptables -L", "ufw status", or "firewall-cmd --list-all"
  • Additional details can be found in the related document about testing client connections.
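
The DNS decision steps at the top of this list can be written down as a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and the last branch (resolution failing on the cluster node as well) is not covered by this guide, so its message is only a placeholder. The two lookups run on different machines, so their results have to be collected separately.

```shell
# Succeed if `dig +short` returns at least one record.
# Extra arguments such as @localhost are passed through to dig.
resolves() {
  [ -n "$(dig +short "$@" 2>/dev/null)" ]
}

# Map the two lookup results to a conclusion.
#   $1: exit status of the lookup on the client machine (0 = resolved)
#   $2: exit status of the lookup on a cluster node (0 = resolved)
dns_verdict() {
  if [ "$1" -eq 0 ]; then
    echo "resolves on client"
  elif [ "$2" -eq 0 ]; then
    echo "organizational DNS problem"
  else
    # Case not covered by the guide; placeholder wording.
    echo "not resolvable on cluster node either"
  fi
}

# Usage:
#   on the client machine:  resolves "$ENDPOINT"; client=$?
#   on a cluster node:      resolves @localhost "$ENDPOINT"; node=$?
#   then:                   dns_verdict "$client" "$node"
```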

Troubleshooting latency

Server-side

  • Ensure that the memory used in the database does not reach the configured database max memory limit. More details can be found in the document about database memory limits.
  • Try to correlate the periods of high latency with any surge in the following metrics:
    • number of connections
    • used memory
    • evicted keys, expired keys
  • Check the output of "slowlog get <number of entries to display>" for slow commands such as KEYS or HGETALL. Where possible, use the incremental alternatives instead: SCAN, SSCAN, HSCAN, ZSCAN
  • Keys with large memory footprints can cause latency. To identify these keys, compare the key names that appear in the output of "slowlog get" with the big keys reported by the following commands:
    • redis-cli -h <endpoint> -p <port> -a <password> --memkeys
    • redis-cli -h <endpoint> -p <port> -a <password> --bigkeys
  • Additional diagnostic steps can be found at the following links:
    • https://redis.io/docs/latest/operate/oss_and_stack/management/optimization/latency/
    • https://redis.io/docs/latest/operate/rs/clusters/logging/redis-slow-log/
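
The first check in this list, used memory versus the configured limit, can be approximated from INFO output. This is a rough sketch: the function name is hypothetical, and it assumes the database reports both the used_memory and maxmemory fields in "info memory", which may not hold for every deployment.

```shell
# Print used memory as an integer percentage of the memory limit.
# Reads INFO output on stdin; INFO lines end in CRLF, so \r is stripped first.
memory_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "no limit reported"
      else printf "%d\n", used * 100 / max
    }'
}

# Usage:
#   redis-cli -h <endpoint> -p <port> -a <password> info memory | memory_pct
```

A value approaching 100 suggests the database is close to its limit, which is when evictions or out-of-memory errors become likely.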

Client-side

  • Check that there is no memory or CPU pressure on the client host
  • Check that the client uses a connection pool instead of frequently opening and closing connections
  • Check that the client does not erroneously open multiple connections, which can put pressure on the client or the server
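
One way to spot a client that opens too many connections is to count connections per client address. The sketch below uses a hypothetical function name and assumes the database accepts the CLIENT LIST command and returns the usual "addr=<ip>:<port>" field on each line.

```shell
# Count connections per client host, busiest first.
# Reads CLIENT LIST output on stdin, one connection per line.
connections_per_host() {
  tr ' ' '\n' | awk -F= '
    $1 == "addr" {
      sub(/:[0-9]+$/, "", $2)   # drop the client port, keep the host
      n[$2]++
    }
    END { for (h in n) print n[h], h }' | sort -rn
}

# Usage:
#   redis-cli -h <endpoint> -p <port> -a <password> client list | connections_per_host
```

A host with a count far above the size of its connection pool is a candidate for the connection checks listed above.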