Troubleshooting Bricks

On this page you’ll find troubleshooting tips for when bricks are down.

If bricks are down, try to follow the steps below:

Check if there is enough disk space available. Gluster core dumps or logs may have filled up the root filesystem or /var, causing Gluster to crash repeatedly (see the example below these steps).
If a brick is down, you can start it with: gluster volume start $volname force
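
A quick way to check disk space on all Gluster servers is via Ansible, reusing the mungg_gluster_server inventory group from the RAM check below. This is a minimal sketch; adjust the mount points to your layout:

ansible mungg_gluster_server -a "df -h / /var"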

Check for free RAM

ansible mungg_gluster_server -a "free -h"

Mass-start volumes

Force all bricks to start:

gluster volume list | xargs --max-procs=3 --max-args=5 bash -c '
for i; do
  # skip volumes that already pass the health check
  if /usr/lib64/nagios/plugins/check_gluster_volume --retries 1 "$i" > /dev/null; then
    continue
  fi

  gluster --mode=script volume start "$i" force
done
' --
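
To check a single volume by hand before or after a mass start, the same Nagios plugin can be run directly; the volume name here is only an example:

/usr/lib64/nagios/plugins/check_gluster_volume --retries 1 gluster-pv42
echo $?   # exit code 0 means the volume is healthy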

Start heal on all unhealthy volumes:

gluster volume list | xargs --max-procs=3 --max-args=5 bash -c '
for i; do
  # skip volumes whose heal status is already healthy
  if /usr/lib64/nagios/plugins/check_gluster_volume_heal --retries 1 "$i" > /dev/null; then
    continue
  fi

  gluster --mode=script volume heal "$i" enable && \
  gluster --mode=script volume heal "$i"
done
' --
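
To follow heal progress on a single volume, list the entries that still need healing (the volume name is only an example):

gluster volume heal gluster-pv42 info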

Free up log space

ansible mungg_gluster_server -m shell -a 'find /var/log/glusterfs -mtime +50 -delete; logrotate --force /etc/logrotate.conf'
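
If this does not free enough space, check which logs are the largest; the path is the default Gluster log directory:

ansible mungg_gluster_server -m shell -a 'du -sh /var/log/glusterfs/* | sort -h | tail'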

Rejoin a Gluster glusterfs process

This should only be done if the brick process has died and is not restarted by the management process glusterd.

# gluster volume status gluster-pvxxxx
Status of volume: gluster-pvxxxx
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick storage2:/data/shared_s
sd/gluster-pvxxxx/brick                     N/A       N/A        N       N/A
Brick storage3:/data/shared_s
sd/gluster-pvxxxx/brick                     49163     0          Y       13471
Brick storage1:/data/shared_s
sd/gluster-pvxxxx/brick                     49163     0          Y       26454
Self-heal Daemon on localhost               N/A       N/A        Y       25299
Self-heal Daemon on x.x.x.x                 N/A       N/A        Y       2478
Self-heal Daemon on x.x.x.x                 N/A       N/A        Y       10545

# ps -f --pid $(cat /run/gluster/vols/gluster-pvxxxx/*.pid)
23077 ? Ssl 1:33 /usr/sbin/glusterfs --read-only --log-file=/var/log/glusterfs/backup-gluster-pv1099.log --volfile-server=x.x.x.x --volfile-server=x.x.x.x --volfile-server=x.x.x.x --volfile-id=/gluster-pvxxxx /var/lib/gluster-backup/mnt/gluster-pvxxxx

Kill the process that is holding the brick PID (23077 in the example above):

# kill 23077
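
Putting it together, a minimal sketch of the whole procedure (the volume name is a placeholder; the glusterfsd check mirrors the PID file cleanup below):

vol=gluster-pvxxxx
# which process does the brick PID file point to?
pid=$(cat /run/gluster/vols/$vol/*.pid)
ps -f --pid "$pid"
# if it is not a glusterfsd brick process, kill it and force-start the volume
ps -f --pid "$pid" | grep -q glusterfsd || kill "$pid"
gluster volume start "$vol" force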

Remove obsolete PID file

If a glusterfs brick process crashes, it can leave its PID file behind, which prevents the brick process from starting again.

volume=gluster-pvxxxx   # the affected volume, e.g. gluster-pv42
# ps -f --pid $(cat /run/gluster/vols/$volume/*.pid)
<must be empty>

# rm /run/gluster/vols/$volume/*${volume}-brick.pid

# gluster volume start $volume force

Clean up all stale PID files:

find /run/gluster/vols/ -name '*.pid' | \
  while read -r pidfile; do
    # keep the PID file only if its PID still belongs to a running glusterfsd
    ps -f --pid "$(<"$pidfile")" | grep -q glusterfsd || rm -vf "$pidfile"
  done