Troubleshooting steps in a Docker-based cluster

In the event of an OpenIAM installation failure, users can troubleshoot and identify potential issues by collecting and examining logs. Below is a general guide on how to collect logs for failed installations and individual service failures, which is the first step in troubleshooting. By following the steps below, you can systematically identify and resolve issues in your Docker-based cluster.

  1. Check cluster status.

For Docker Swarm

  • Run the following command to check the status of nodes.
docker node ls

Ensure all nodes are in the Ready state. If a node shows Down or Unreachable, check network connectivity and the node’s status.
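As a quick sketch, the `docker node ls` output can be filtered to surface any node that is not Ready. The sample lines below are illustrative (hypothetical hostnames), not output from a real cluster; on a live manager node you would pipe the real command instead.

```shell
# Filter node listings for nodes whose STATUS column is not "Ready".
# On a live Swarm manager you would run:
#   docker node ls --format '{{.Hostname}} {{.Status}}' | awk '$2 != "Ready"'
# Here the same filter is demonstrated on sample (hypothetical) output.
down_nodes=$(printf '%s\n' \
  'manager-1 Ready' \
  'worker-1 Ready' \
  'worker-2 Down' |
  awk '$2 != "Ready" { print $1 " is " $2 }')
echo "$down_nodes"
```

Any node printed by this filter is a candidate for the network and status checks described above.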

For Kubernetes

  • Run the following command to verify node status.
kubectl get nodes

Ensure all nodes are in the Ready state. If a node is NotReady, check logs and system resources.
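The same idea applies to `kubectl get nodes`: filter for anything that is not Ready, then investigate those nodes individually. The sample output below is hypothetical; the column layout (NAME, STATUS, ROLES, AGE, VERSION) matches the default `kubectl get nodes` output.

```shell
# Print the names of nodes whose STATUS is not "Ready".
# On a live cluster you would run:
#   kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
# Sample (hypothetical) output is used here for demonstration.
not_ready=$(printf '%s\n' \
  'node-a Ready    control-plane 30d v1.29.0' \
  'node-b NotReady <none>        30d v1.29.0' |
  awk '$2 != "Ready" { print $1 }')
echo "$not_ready"
```

Each node this prints warrants a follow-up with `kubectl describe node <name>` to see its conditions and recent events.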

  2. Check container and service status.

For Docker Swarm

  • Check the running services by running the following command.
docker service ls
  • If needed, restart a service with the following command.
docker service update --force <service_name>

For Kubernetes

  • Run the following command to list all pods across all namespaces.
kubectl get pods -A
  • To check a specific pod's details, run the following.
kubectl describe pod <pod_name>
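When a cluster runs many pods, it helps to list only the unhealthy ones. A minimal sketch, demonstrated on sample (hypothetical) output with the default `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE):

```shell
# Print the names of pods whose STATUS is neither Running nor Completed.
# On a live cluster you would run:
#   kubectl get pods -A --no-headers | awk '$4 != "Running" && $4 != "Completed" { print $2 }'
# Sample (hypothetical) output is used here for demonstration.
bad_pods=$(printf '%s\n' \
  'default web-7f9c 1/1 Running          0  2d' \
  'default api-5b2d 0/1 CrashLoopBackOff 12 2d' |
  awk '$4 != "Running" && $4 != "Completed" { print $2 }')
echo "$bad_pods"
```

Pods surfaced this way are the ones to pass to `kubectl describe pod` and the log checks in the next step.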

  3. Check container logs.

For Docker Swarm

  • Check the logs for a specific container with the following command.
docker logs <container_id>

For Kubernetes

  • You can check the logs for a pod with the command below.
kubectl logs <pod_name>

To stream live logs as they are written, use the command below.

kubectl logs -f <pod_name>
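Collected logs are usually long, so a first pass is to count or extract lines with failure markers. The log lines below are illustrative samples, not real OpenIAM output:

```shell
# Count log lines containing failure markers (case-insensitive).
# On a live cluster you would pipe the real logs:
#   kubectl logs <pod_name> | grep -ci 'error'
# Sample (hypothetical) log lines are used here for demonstration.
error_count=$(printf '%s\n' \
  '2024-05-01 10:00:01 INFO  service started' \
  '2024-05-01 10:00:05 ERROR connection refused' |
  grep -ci 'error')
echo "$error_count"
```

Broadening the pattern, e.g. `grep -Ei 'error|exception|fatal'`, catches stack traces and fatal shutdowns as well.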
  4. Check network connectivity.
  • Verify connectivity between nodes with the command below.
ping <node_ip>
  • Check service reachability as follows.
curl http://<service_ip>:<port>
  5. Check system resources.
  • Monitor system usage using the commands below.
top # Check CPU and Memory
df -h # Check disk space
free -m # Check available memory
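Disk exhaustion is a common cause of container failures, so it is worth flagging any filesystem above a usage threshold rather than reading the full `df -h` table. A sketch using a threshold of 80%, demonstrated on sample (hypothetical) `df` output in `pcent target` order:

```shell
# Print mount points whose usage exceeds 80%.
# With GNU df you would run something like:
#   df --output=pcent,target | awk 'NR > 1 && int($1) > 80 { print $2 }'
# Sample (hypothetical) output is used here for demonstration.
full_fs=$(printf '%s\n' \
  '45% /' \
  '92% /var/lib/docker' |
  awk 'int($1) > 80 { print $2 }')
echo "$full_fs"
```

A nearly full `/var/lib/docker`, for example, can cause image pulls and container writes to fail in ways that only show up indirectly in service logs.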
  6. Check event logs.
  • In Docker Swarm, run the following.
docker events
  • In Kubernetes, use the following command.
kubectl get events -A
  7. Restart services or nodes (if needed).
  • Restart a failing container with the following command.
docker restart <container_id>
  • Restart a Kubernetes pod by deleting it; if the pod is managed by a Deployment or ReplicaSet, it is recreated automatically.
kubectl delete pod <pod_name>
  • Restart the Docker daemon on a node with the following command. Treat this as a last resort, since it briefly disrupts all containers on that node.
systemctl restart docker
  8. Debug further using shell access.
  • Access a container for debugging by running the following commands.
docker exec -it <container_id> /bin/sh   # for Docker Swarm
kubectl exec -it <pod_name> -- /bin/sh   # for Kubernetes
  9. Verify image versions and configurations.
  • For Docker, check the running image version.
docker inspect <container_id>
  • Check environment variables inside a container.
docker exec <container_id> env
  • For Kubernetes, check deployment configurations.
kubectl get deployment <deployment_name> -o yaml
  10. Check storage and volume issues.
  • List Docker volumes with the following commands.
docker volume ls
docker volume inspect <volume-name>
  • Check persistent volume claims in Kubernetes by running the following command.
kubectl get pvc -A
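A claim stuck in Pending or Lost often explains why a dependent service never starts, so it helps to filter for PVCs that are not Bound. The sample line below is hypothetical, using the default `kubectl get pvc -A` column order (NAMESPACE, NAME, STATUS, VOLUME, ...):

```shell
# Print the names of persistent volume claims whose STATUS is not "Bound".
# On a live cluster you would run:
#   kubectl get pvc -A --no-headers | awk '$3 != "Bound" { print $2 }'
# Sample (hypothetical) output is used here for demonstration.
unbound=$(printf '%s\n' \
  'openiam data-mariadb Bound   pvc-1  8Gi standard 10d' \
  'openiam data-es      Pending <none> -   standard 10d' |
  awk '$3 != "Bound" { print $2 }')
echo "$unbound"
```

An unbound claim usually points at the storage class or the underlying provisioner rather than at the pod itself.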

Note: If the problem persists, consider checking external logs (e.g., system logs or application logs) and reviewing any recent updates or changes that may have impacted the cluster.