Discussion:
nfs server not responding / is alive again
Marc G. Fournier
2004-10-04 03:22:30 UTC
I'm using an nfs mount to get at the underlying file system on a system
that uses unionfs mounts ... instead of using nullfs, which, last time I
used it over a year ago, caused the server to crash to no end ...

But, as soon as there is any 'load', I'm getting a whack of:

Oct 3 22:46:16 neptune /kernel: nfs server neptune.hub.org:/vm: not responding
Oct 3 22:46:16 neptune /kernel: nfs server neptune.hub.org:/vm: is alive again
Oct 3 22:48:30 neptune /kernel: nfs server neptune.hub.org:/vm: not responding
Oct 3 22:48:30 neptune /kernel: nfs server neptune.hub.org:/vm: is alive again

in /var/log/messages ...

I'm running nfsd with the standard flags:

nfs_server_flags="-u -t -n 4"

Is there something that I can do to reduce this problem? increase number
of nfsd processes? force a tcp connection?
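
To be concrete, I mean something like bumping this in /etc/rc.conf
(untested, just what I'm wondering about):

nfs_server_flags="-u -t -n 8"

and/or a 'tcp' option on the mount in /etc/fstab (paths as in the log
above):

neptune.hub.org:/vm /du nfs rw,tcp 0 0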

The issue is more prevalent when I have >4 processes trying to read from
the nfs mounts ... should there be one mount per process? the process(es)
in question are rsync, if that helps ... they tend to be a bit more 'disk
intensive' than most processes, which is why I thought of increasing -n
...

Thanks ...

Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: ***@hub.org Yahoo!: yscrappy ICQ: 7615664
Alex de Kruijff
2004-10-05 05:22:49 UTC
Post by Marc G. Fournier
I'm using an nfs mount to get at the underlying file system on a system
that uses unionfs mounts ... instead of using nullfs, which, last time I
used it over a year ago, caused the server to crash to no end ...
Oct 3 22:46:16 neptune /kernel: nfs server neptune.hub.org:/vm: not responding
Oct 3 22:46:16 neptune /kernel: nfs server neptune.hub.org:/vm: is alive again
Oct 3 22:48:30 neptune /kernel: nfs server neptune.hub.org:/vm: not responding
Oct 3 22:48:30 neptune /kernel: nfs server neptune.hub.org:/vm: is alive again
in /var/log/messages ...
nfs_server_flags="-u -t -n 4"
Is there something that I can do to reduce this problem? increase number
of nfsd processes? force a tcp connection?
You could try giving the nfsd processes more priority as root with
rtprio. If the file /var/run/nfsd.pid exists, then you could try something
like: rtprio 10 -`cat /var/run/nfsd.pid`.

You could also try giving the other processes less priority, e.g.
nice -n 2 rsync. But I'm not sure how this works at the other end.
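
For example (untested sketch; this assumes nfsd writes its master pid to
/var/run/nfsd.pid, and the rsync arguments are just placeholders):

rtprio 10 -`cat /var/run/nfsd.pid`         # as root: realtime priority 10 for nfsd
nice -n 2 rsync -a /du/src/ /backup/dst/   # run rsync at a lower priority
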
Post by Marc G. Fournier
The issue is more prevalent when I have >4 processes trying to read from
the nfs mounts ... should there be one mount per process? the process(es)
in question are rsync, if that helps ... they tend to be a bit more 'disk
intensive' than most processes, which is why I thought of increasing -n
...
I think your problem is not that your disk is heavily used but that
your NIC is (rsync kinda does that). The warnings you get indicate
that your computer can't get a response from your server. It acts
normally again as soon as it can.

Why do you have rsync syncing NFS-mounted disks?
--
Alex

Articles based on solutions that I use:
http://www.kruijff.org/alex/FreeBSD/
Bill Moran
2004-10-05 12:51:02 UTC
Post by Alex de Kruijff
Post by Marc G. Fournier
I'm using an nfs mount to get at the underlying file system on a system
that uses unionfs mounts ... instead of using nullfs, which, last time I
used it over a year ago, caused the server to crash to no end ...
Oct 3 22:46:16 neptune /kernel: nfs server neptune.hub.org:/vm: not responding
Oct 3 22:46:16 neptune /kernel: nfs server neptune.hub.org:/vm: is alive again
Oct 3 22:48:30 neptune /kernel: nfs server neptune.hub.org:/vm: not responding
Oct 3 22:48:30 neptune /kernel: nfs server neptune.hub.org:/vm: is alive again
In my experience, this is caused by the server responding unpredictably.

Someone smarter than me may correct me, but I believe the nfs client keeps
track of how quickly the NFS server usually responds, and uses that to
judge whether the server is still working or not. Any time the server's
response time varies too much from that estimate, the client will assume
the server is down, but if the server is not actually down, you'll see the
"is alive" message immediately after. Basically, during normal usage, the
server is responding very quickly, so the client assumes it will always
respond that fast. Then, under heavy load, the slower response makes the
client a little paranoid.

I've seen this when running NFS over WiFi, where the ping times are
usually not consistent.
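
(As an aside: if that estimator really is the culprit, mount_nfs(8) has a
-d flag to turn the dynamic retransmit timeout estimator off, for exactly
this kind of flaky-UDP situation. Untested here, but something like:

mount_nfs -d neptune.hub.org:/vm /du

might be worth a try before anything more drastic.)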

One option is to just ignore the messages and accept them as a natural
side effect of high loads. Another would be to use TCP mounts instead
of UDP mounts; TCP mounts don't have this trouble.
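
A TCP mount would look something like this (mount_nfs(8)'s -T flag, or
the 'tcp' fstab option; the server side is already covered since your
nfsd is started with -t):

mount_nfs -T neptune.hub.org:/vm /du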

What kind of network topology is between the two machines? Do you notice
a high load on the hub/switch/routers during these activities? You may
be able to improve the intervening network topology to alleviate the
problem as well.
Post by Alex de Kruijff
Post by Marc G. Fournier
in /var/log/messages ...
nfs_server_flags="-u -t -n 4"
Is there something that I can do to reduce this problem? increase number
of nfsd processes? force a tcp connection?
You could try giving the nfsd processes more priority as root with
rtprio. If the file /var/run/nfsd.pid exists, then you could try something
like: rtprio 10 -`cat /var/run/nfsd.pid`.
You could also try giving the other processes less priority, e.g.
nice -n 2 rsync. But I'm not sure how this works at the other end.
Post by Marc G. Fournier
The issue is more prevalent when I have >4 processes trying to read from
the nfs mounts ... should there be one mount per process? the process(es)
in question are rsync, if that helps ... they tend to be a bit more 'disk
intensive' than most processes, which is why I thought of increasing -n
...
Might help. I would look at networking before I looked at disk usage ...
check whether there are dropped packets and the like. But it could be either.

<snip>
--
Bill Moran
Potential Technologies
http://www.potentialtech.com
Marc G. Fournier
2004-10-05 14:34:29 UTC
Post by Bill Moran
What kind of network topology is between the two machines? Do you
notice a high load on the hub/switch/routers during these activities?
You may be able to improve the intervening network topology to alleviate
the problem as well.
My bad ... I thought I had mentioned it in the original ... the nfs mount
is from local machine to local machine, to do what nullfs normally would
provide were I to risk it ... namely, to get at the 'bottom layer' of a
unionfs-based storage system ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: ***@hub.org Yahoo!: yscrappy ICQ: 7615664
Bill Moran
2004-10-05 15:01:33 UTC
Post by Marc G. Fournier
Post by Bill Moran
What kind of network topology is between the two machines? Do you
notice a high load on the hub/switch/routers during these activities?
You may be able to improve the intervening network topology to alleviate
the problem as well.
My bad ... I thought I had mentioned it in the original ... the nfs mount
is from local machine to local machine, to do what nullfs normally would
provide were I to risk it ... namely, to get at the 'bottom layer' of a
unionfs-based storage system ...
Well ... that's just weird.

I guess the same problem could apply: if the loopback slows down when the
kernel is under load, it could cause the same effect.

Have you tried forcing TCP mounts? IIRC, that's what solved the problem
for me.
--
Bill Moran
Potential Technologies
http://www.potentialtech.com
Marc G. Fournier
2004-10-05 15:49:31 UTC
Post by Bill Moran
Post by Marc G. Fournier
Post by Bill Moran
What kind of network topology is between the two machines? Do you
notice a high load on the hub/switch/routers during these activities?
You may be able to improve the intervening network topology to alleviate
the problem as well.
My bad ... I thought I had mentioned it in the original ... the nfs mount
is from local machine to local machine, to do what nullfs normally would
provide were I to risk it ... namely, to get at the 'bottom layer' of a
unionfs-based storage system ...
Well ... that's just weird.
I guess the same problem could apply: if the loopback slows down when the
kernel is under load, it could cause the same effect.
Have you tried forcing TCP mounts? IIRC, that's what solved the problem
for me.
Haven't tried yet, but will ... thanks :)


----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: ***@hub.org Yahoo!: yscrappy ICQ: 7615664
Marc G. Fournier
2004-10-05 14:32:44 UTC
Post by Alex de Kruijff
I think your problem is not that your disk is heavily used but that
your NIC is (rsync kinda does that). The warnings you get indicate
that your computer can't get a response from your server. It acts
normally again as soon as it can.
Except, the nfs mount is from the local host to the local host ...
Post by Alex de Kruijff
Why do you have rsync syncing NFS-mounted disks?
I want to get at the underlying file system ... I have a real file system
mounted as /vm, with /vm mounted as /du via nfs ... over top of /vm, I
have several unionfs's mounted ... if I did a du of '/vm/dir', where dir
is a union mount, I'd see all files on both "layers" ... if I do a du of
'/du/dir', I only see the /vm layer ...
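
To make that concrete (dir is any directory with a union mount over it):

du -s /vm/dir   # sees both union layers
du -s /du/dir   # via the nfs loopback: sees only the bottom /vm layer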

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: ***@hub.org Yahoo!: yscrappy ICQ: 7615664