Discussion:
iscsi cmnd abort issue
g***@moll.cl
2010-10-15 18:56:41 UTC
Hi,

I'm using Openfiler (OF) as a storage server and it is connected to an ESXi
cluster. I am experiencing some disconnections between the ESXi servers
and the OF storage.

The messages log on the Openfiler server shows the following:

Oct 14 11:32:48 storage kernel: [ 6154.950871] iscsi_trgt: cmnd_abort(1144) 71a0bc00 1 0 42 512 0 0
Oct 14 11:32:48 storage kernel: [ 6154.951112] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:845524445757952 (Unknown Task)
Oct 14 11:32:48 storage kernel: [ 6155.218499] iscsi_trgt: cmnd_abort(1144) bac85200 1 0 42 512 0 0
Oct 14 11:32:48 storage kernel: [ 6155.218730] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:282574492336640 (Unknown Task)

and it also contains messages about ietd:

Oct 14 09:50:35 storage ietd: initiators.deny is depreciated and will be
obsoleted in the next release, please see README.initiators for more
information
Oct 14 09:50:35 storage ietd: /etc/ietd.conf's location is depreciated and
will be moved in the next release to /etc/iet/ietd.conf
Oct 14 09:50:35 storage iscsi-target: ietd startup succeeded
Oct 14 11:39:10 storage ietd: CHAP initiator auth.: No valid user/pass
combination for initiator iqn.1998-01.com.vmware:localhost-066cb67d found
Oct 15 08:30:32 storage ietd: initiators.deny is depreciated and will be
obsoleted in the next release, please see README.initiators for more
information
Oct 15 08:30:32 storage ietd: /etc/ietd.conf's location is depreciated and
will be moved in the next release to /etc/iet/ietd.conf
Oct 15 08:30:32 storage iscsi-target: ietd startup succeeded

and /etc/ietd.conf contains the following:

# cat /etc/ietd.conf
##### WARNING!!! - This configuration file generated by Openfiler. DO NOT MANUALLY EDIT. #####

IncomingUser vcenter vcenter

Target iqn.2006-01.com.openfiler:tsn.3d9c3d927c81
        HeaderDigest None
        DataDigest None
        MaxConnections 1
        InitialR2T Yes
        ImmediateData No
        MaxRecvDataSegmentLength 131072
        MaxXmitDataSegmentLength 131072
        MaxBurstLength 262144
        FirstBurstLength 262144
        DefaultTime2Wait 2
        DefaultTime2Retain 20
        MaxOutstandingR2T 8
        DataPDUInOrder Yes
        DataSequenceInOrder Yes
        ErrorRecoveryLevel 0
        IncomingUser openfiler openfiler
        OutgoingUser openfiler openfiler
        Lun 0 Path=/dev/corestorage/coreca,Type=blockio,ScsiSN=xvVmYp-yeSi-Otqp,ScsiId=xvVmYp-yeSi-Otqp,IOMode=wt

Target iqn.2006-01.com.openfiler:tsn.c1933bae483b
        HeaderDigest None
        DataDigest None
        MaxConnections 1
        InitialR2T Yes
        ImmediateData No
        MaxRecvDataSegmentLength 131072
        MaxXmitDataSegmentLength 131072
        MaxBurstLength 262144
        FirstBurstLength 262144
        DefaultTime2Wait 2
        DefaultTime2Retain 20
        MaxOutstandingR2T 8
        DataPDUInOrder Yes
        DataSequenceInOrder Yes
        ErrorRecoveryLevel 0
        IncomingUser openfiler openfiler
        OutgoingUser openfiler openfiler
        Lun 0 Path=/dev/corestorage/coreca,Type=blockio,ScsiSN=xvVmYp-yeSi-Otqp,ScsiId=xvVmYp-yeSi-Otqp,IOMode=wt

I was wondering if it is caused by a misconfiguration of the Time2Wait values
or by something that checks the health of the iSCSI targets.

Any idea how to solve this issue?

Thanks in advance.

Best regards,
--
Sin otro particular se despide,
Gabriel Möll Ibacache
Ingeniero Civil en Computación
Red Hat Certified Engineer
http://www.moll.cl
+56 9 79964684 - +56 2 8974302
***@moll.cl
Sunny
2010-10-15 19:41:29 UTC
As far as I can tell it's not an issue. This usually happens when you
live-migrate a VM on the datastore.

If you upgrade to the latest ietd version, it prints better task names
than Unknown, but still, it's not an issue.
Post by g***@moll.cl
<snip>
g***@moll.cl
2010-10-18 14:21:46 UTC
Hi Sunny,

The problem is that there is no live migration happening in the ESXi cluster,
and it's not a storage overload.
Post by Sunny
as far as I can tell It's not a issue. This usually happens when you
live-migrate a VM on the datastore.
if you upgrade to latest ietd version, it prints better task names
than Unknown, but still, it's not a issue.
Post by g***@moll.cl
<snip>
--
Sin otro particular se despide,
Gabriel Möll Ibacache
Ingeniero Civil en Computación
Red Hat Certified Engineer
http://www.moll.cl
+56 9 79964684 - +56 2 8974302
***@moll.cl
Ross S. W. Walker
2010-10-18 14:29:29 UTC
Post by g***@moll.cl
Hi Sunny,
The problem is that there is no live migration happening in the ESXi cluster,
and it's not a storage overload.
Something isn't working right for a SCSI command to time out, which is
what the aborts mean.

Can you post your network setup?

-Ross

g***@moll.cl
2010-10-18 15:31:45 UTC
Post by Ross S. W. Walker
Post by g***@moll.cl
Hi Sunny,
The problem is that there is no live migration happening in the ESXi cluster,
and it's not a storage overload.
Something isn't working right for a SCSI command to timeout, which is
what the aborts mean.
Post your network setup?
-Ross
Ross,

I see some interruptions on the eth0 and eth2 interfaces of my Openfiler
server. My network setup is as follows; eth2 is on the storage network.

ip addr ls shows:

3: eth0: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:16:17:82:6b:ab brd ff:ff:ff:ff:ff:ff
inet 192.168.1.250/24 brd 192.168.1.255 scope global eth0
inet6 fe80::216:17ff:fe82:6bab/64 scope link
valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:16:17:82:6b:aa brd ff:ff:ff:ff:ff:ff
inet 10.0.0.60/24 brd 10.0.0.255 scope global eth2
inet6 fe80::216:17ff:fe82:6baa/64 scope link
valid_lft forever preferred_lft forever

and ifconfig

eth0 Link encap:Ethernet HWaddr 00:16:17:82:6B:AB
inet addr:192.168.1.250 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::216:17ff:fe82:6bab/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:138123 errors:0 dropped:0 overruns:0 frame:0
TX packets:71311 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:18165735 (17.3 Mb) TX bytes:141405395 (134.8 Mb)
Interrupt:23 Base address:0x6000

eth2 Link encap:Ethernet HWaddr 00:16:17:82:6B:AA
inet addr:10.0.0.60 Bcast:10.0.0.255 Mask:255.255.255.0
inet6 addr: fe80::216:17ff:fe82:6baa/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:154297713 errors:0 dropped:59 overruns:0 frame:0
TX packets:193817846 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:135086066854 (128828.1 Mb) TX bytes:206246698556
(196692.1 Mb)
Interrupt:18


# ethtool eth2
Settings for eth2:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: g
Current message level: 0x000000ff (255)
Link detected: yes

# ethtool eth0
Settings for eth0:
Supported ports: [ MII ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 1
Transceiver: external
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Link detected: yes
--
Sin otro particular se despide,
Gabriel Möll Ibacache
Ingeniero Civil en Computación
Red Hat Certified Engineer
http://www.moll.cl
+56 9 79964684 - +56 2 8974302
***@moll.cl
Ross S. W. Walker
2010-10-15 21:30:58 UTC
Post by g***@moll.cl
<snip>
Newer versions of IET/OpenFiler would prevent ESXi from disconnecting, but
the timeout issue is most likely from backing storage that can't keep up
with the demand. Think about adding more disks, or changing RAID config
from RAID5/6 to RAID10.

-Ross

Dave Cundiff
2010-10-16 17:59:40 UTC
Post by Ross S. W. Walker
Newer versions of IET/OpenFiler would prevent ESXi from disconnecting, but
the timeout issue is most likely from backing storage that can't keep up
with the demand. Think about adding more disks, or changing RAID config
from RAID5/6 to RAID10.
-Ross
Hmm, I'm having a similar issue. I've been getting iSCSI ping timeouts
on the initiator which almost always end in 5 retries and a
connection timeout. Generally that requires a reboot to get my
filesystems back out of read-only after the SCSI error it causes. I've
been digging through my network trying to figure out if something is
weird but haven't been able to track anything down.

Is there any way on the target to tell if the disks are getting
overloaded? The only evidence of load I see is istiod processes
blocking in blockio_make_request. My array is pretty fast (a 21-disk
RAID50: 7 containers x 3 disks), and going by the flashy-light test it
doesn't appear to be that busy.

http://www.wbhs.tv/SanActivity.3GP

The above was taken while iostat was showing 100% utilization most of the
time. I've actually started to suspect the controller (an Areca
1680), but I've never had trouble with them before.
--
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com
Ross S. W. Walker
2010-10-16 19:04:25 UTC
Post by Dave Cundiff
<snip>
Hmm, I'm having a similar issue. I've been getting iSCSI ping timeouts
on the initiator which almost always ends in 5 retries and a
connection timeout. Generally that requires a reboot to get my
filesystems back out of read-only from the SCSI error it causes. I've
been digging through my network trying to figure out if something is
weird but haven't been able to track anything down.
Is there any way on the target to tell if the disks are getting
overloaded? The only evidence of load I see is istiod processes
blocking in blockio_make_request. My array is pretty high speed (21
disk RAID50 7 containers x 3) and going by the flashy light test
doesn't appear to be that busy.
http://www.wbhs.tv/SanActivity.3GP
The above was taken while IOStat was showing 100% usage most of the
time. I've actually been starting to suspect the controller (Areca
1680) but I've never had trouble with them before.
Are the drivers up to date?

I suppose you have looked at the network side as well? Have you used iperf to make sure the network is functioning properly?

If you use iostat -x it will show the service times and queue depths of the disks. You can use that to determine if the controller or drives are overloaded: svctm consistently greater than the disk spec, or queue depth constantly greater than the number of disks.
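
For example, something like this (just a sketch; the target address and the
device name are placeholders for your own setup). On the target:

# iperf -s
# iostat -x sdb 5

and on an initiator:

# iperf -c <target-ip> -t 30

If svctm stays well above the drive's rated service time, or avgqu-sz sits
above the spindle count for long stretches, the backing storage is the
bottleneck rather than the network.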

-Ross


Dave Cundiff
2010-10-17 02:52:07 UTC
On Sat, Oct 16, 2010 at 3:04 PM, Ross S. W. Walker
<snip>
Are the drivers up to date?
I suppose you have looked at the network side as well? Used iperf to make
sure the network is functioning properly?
Latest drivers on everything; I iperf'd the network and it gets gigabit
easily. The target has a 10G controller in it. In hindsight I probably
didn't need it. My IO pattern is so random I never get above 400
megabits total with 10 machines hitting it. It probably saves me some
latency though.
Post by Ross S. W. Walker
If you use iostat -x it will show the service times and queue depths of the
disks. You can use that to determine if the controller or drives are
overloaded, svc_tm greater then disk spec, queue depth constantly greater
then number of disks.
-Ross
In a RAID50, should a good queue depth be the total drives minus all the
parity drives? Or would it be the number of containers in the
RAID50? svctm is generally 4ms or less on my 10ms drives. I don't
think I've seen the queue size above 10.

One thing I thought was strange is that the svctm on my LVM devices on top
of the storage controller is always very high. The physical device
will be showing 2-5ms while I'm seeing 15-35ms on the LVM devices.
--
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com
Ross S. W. Walker
2010-10-17 15:09:08 UTC
Post by Dave Cundiff
<snip>
In a raid50 a good queue depth should be total drives minus all the
parity drives correct? Or would it be the number of containers in the
raid50? svc_tm is generally 4ms or less on my 10ms drives. I don't
think I've seen the queue size above 10.
One thing I thought was strange is the svc_tm on my lvm devices on top
of the storage controller is always very high. The physical device
will be showing 2-5ms and I'll be seeing 15-35ms on the lvm devices.
Now that is strange indeed!

Are you running drbd between the physical and logical layers?

Besides that though, IET/ESX shouldn't get themselves into a death spiral. Would you work with me on finding out why it does?

-Ross


Dan Barker
2010-10-17 16:00:05 UTC
<snip>
Post by Ross S. W. Walker
Now that is strange indeed!
Are you running drbd between the physical and logical layers?
Besides that though, IET/ESX shouldn't get themselves into a death spiral.
Would you work with me on finding out why it does?
Post by Ross S. W. Walker
-Ross
Is this the same problem as Dave is experiencing? I'm not sure it's the same
problem, but it sure smells similar and I'd be happy to work with you to
figure it out.

Target:
=======
SATA II Disk (500G, WD5000AAKS-00YGA0)
Intel MoBo (DG33TL, ICH9R)
ESXi 4.1
VMFS3 (2M blocksize to allow 454G vmdk)
Debian VM (5.0.6)
DRBD (8.3.8.1) Primary/Secondary
iSCSI-Target 1.4.20.2 (Blockio)

Initiator(s):
ESXi 4.1 (same host as the target), not very busy.
Logs lots of "lost/restored connectivity to device xxx" messages, usually
just a few seconds apart, but sometimes too long for the sensitive hosts to
handle cleanly.

ESXi 4.0u2, moderately busy, doing an install of SBS 2008 from an ISO on the
iSCSI-homed datastore to a vmdk on the same datastore, and not much else.
Logs lots of lost/restored ..., same as above.

I can stop the issue by simply stopping/starting the DRBD on the Secondary side.
I can FIX the issue by simply stopping the DRBD on the Secondary side, and
leaving it down, but I am "riding bareback" as to recoverability of my data
if a disk fails on the primary side before I reconnect the drbd's.

This leads me to believe it's simply some sort of race condition that will
be a real tough thing to debug, but is pretty easy to reproduce on my
specific setup.

The time DRBD uses to replicate to the other box appears to expose this
issue by changing the timing of things. The logs don't help me much, but are
available. They were a bit large to attach - so they are at
ftp.visioncomm.com. messages.gz is the ESXi log from the Primary side and
dmesg.gz is the drbd/iscsi-target log from the Secondary side. The dmesg on
the Primary showed zero messages. The ESXi log on the Secondary side was
similar to the one saved.

I'd be a lot happier if the ESXi were at the same level, but one of my
machines won't run 4.1 (Same MoBo/CPU/Disks/RAM as the one that works!). If
both were 4.1, I could use jumbo frames on the NICs that are dedicated to
DRBD traffic and worry less about the E1000 drivers or VMWare tools being
different. The iSCSI traffic is on a different pair of NICs.

Dan, in Atlanta

Some, from dmesg:
iscsi_trgt: Abort Task (01) issued on tid:1 lun:1 by sid:1973623375856128
(Function Complete)
iscsi_trgt: Abort Task (01) issued on tid:1 lun:1 by sid:2252899329311232
(Function Complete)
iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:1973623375856128
(Function Complete)
iscsi_trgt: Abort Task (01) issued on tid:1 lun:0 by sid:1973623375856128
(Function Complete)
iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:2252899329311232
(Function Complete)
... for days.
Dave Cundiff
2010-10-18 09:08:48 UTC
On Sun, Oct 17, 2010 at 11:09 AM, Ross S. W. Walker
<snip>
Now that is strange indeed!
Are you running drbd between the physical and logical layers?
Besides that though, IET/ESX shouldn't get themselves into a death spiral.
Would you work with me on finding out why it does?
-Ross
I think we're getting confused between me and the 2 other people in
the thread. :P If so, I'll break mine out into a different thread. My setup
is just plain IET <-> Open-iSCSI.

There's no DRBD on my system (yet), just Open-iSCSI initiators
attaching to exported LVM devices. My timeouts are pretty random, but
generally occur when a decent IO operation starts up on a box. Only
the box with the heavy IO operation will time out; the rest stay
connected. I've taken everything back to a very default setup: no
jumbo frames, no weird VLANs, everything's even plugged into the same
switch. The only weirdness I get is the high svctm on the LVM devices
while the physical device looks fine.

Device:         rrqm/s   wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.17  626.89  159.35  11799.00   4210.86    20.36     4.61    5.87   1.27  99.76
dm-0              0.00     0.00   30.29    6.20   1051.38    116.23    32.00     0.37   10.25   8.02  29.25
dm-1              0.00     0.00    0.70   31.29      9.60    604.87    19.21     0.01    0.40   0.31   1.00
dm-2              0.00     0.00  161.21   17.96   1335.82    389.74     9.63     1.03    5.74   5.33  95.50
dm-4              0.00     0.00   32.66    8.16   1034.86    149.28    29.01     0.47   11.52   5.25  21.45
dm-5              0.00     0.00   32.56    8.96   1066.58     99.17    28.08     0.39    9.52   6.46  26.83
dm-6              0.00     0.00  133.99   48.65   2760.68   1030.86    20.76     1.12    6.14   3.17  57.92
dm-7              0.00     0.00  221.53   30.69   4296.70   1630.92    23.50     1.12    4.44   2.71  68.29
dm-3              0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
dm-8              0.00     0.00    5.63    1.43    154.88     26.39    25.66     0.08   11.03   9.88   6.98
dm-9              0.00     0.00    0.60    0.43      9.06      5.33    13.94     0.01    9.74   9.10   0.94


The above was taken at 4am so it probably isn't a very good picture;
things are pretty quiet at the moment. I'm gonna grab one around
8-9ish when everyone's checking their mail. There is still quite a
difference between the svctm on the controller (sdb) and on the LVs.
Does LVM add that much latency into the mix?
--
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com
Ross S. W. Walker
2010-10-18 13:26:13 UTC
Post by Dave Cundiff
<snip>
The above was taken at 4am so probably isn't a very good picture.
Things are pretty quiet at the moment. I'm gonna grab one around
8-9ish when everyone's checking their mail. There is still quite a
difference between the svctm's on the controller(sdb) and the lvm's.
Does LVM add that much latency into the mix?
LVM generally doesn't add any latency, but if you have a VG comprised of several devices and one of those devices has high service times then it can affect total service time for the whole VG.

Also if you have multiple LVs and one LV is fully utilizing the resources of the VG then it leaves little left over for everyone else.
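
As an aside, mapping the dm-N names in that iostat output back to LV names
makes it easier to see which volume is eating the time. A quick sketch:

# dmsetup ls
# lvs -o lv_name,vg_name,devices

dmsetup ls prints each mapped device with its major/minor numbers, and the
minor number is the N in dm-N.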

Can you post details of your setup: the controller with its specs, the RAID setup, the LVM layout, and what volumes are on it?

I would consider breaking the storage of the VMs away from the storage of the applications if they are together on one big RAID set.

I have a patch set for IET I'm working on which enforces a command window. The idea is that when a session has used up its command window the target signals the initiator to pause until some commands have completed. I'm hoping this will prevent aborts from happening by throttling IO under load.

If it works as advertised then you will only need to reduce the command window until the aborts stop. On some controllers this may mean setting it to 1. I believe it defaults to 32, so you can try setting it to 1 now and see if it helps; IIRC the parameter is MaxCmds = X. If it does help then the command window patch would probably work too.

-Ross


Emmanuel Florac
2010-10-18 13:49:49 UTC
On Mon, 18 Oct 2010 09:26:13 -0400
Post by Ross S. W. Walker
If it works as advertised then you will only need to reduce the
command window until the aborts stop. On some controllers this may
mean setting it to 1, I believe it defaults to 32, so you can try
setting it to 1 now and see if it helps. IIRC the parameter is
MaxCmds = X. If it does help then the command window patch would
probably work too.
In VMWare "iSCSI initiator advanced settings", the parameter is
"MaxCommands".
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <***@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
Dave Cundiff
2010-10-18 17:13:31 UTC
On Mon, Oct 18, 2010 at 9:26 AM, Ross S. W. Walker
<snip>
LVM generally doesn't add any latency, but if you have a VG comprised of
several devices and one of those devices has high service times then it can
affect total service time for the whole VG.
Also if you have multiple LVs and one LV is fully utilizing the resources of
the VG then it leaves little left over for everyone else.
Can you post a detail of your setup, controller with it's specs, RAID setup,
LVM layout and what volumes are on it.
I would consider breaking the storage of the VMs from the storage of the
applications if they are together on one big raid set.
RAID setup:
Areca 1680, 512MB BBU cache, readahead disabled, write-back enabled.
21 1TB 7200rpm Seagate SAS drives in RAID 50: 7 three-disk RAID 5s in the
RAID 0, 64k stripes.
I went with the exotic RAID level to get it past budget. RAID 10
wasted too much space, and I decided on small RAID 5s inside the 0 so I'd
lose less performance on rebuilds.

The RAID presents one 14TB physical device that is in the VG; I create
the LVs on that and export them to my servers. The servers use the
LVs for /home, which is where the majority of my users' storage needs
are.
Post by Ross S. W. Walker
I have a patch set for IET I'm working on which enforces a command window.
The idea is that when a session has used up it's command window the target
signals to the initiator to pause until some commands have completed. I'm
hoping this will prevent aborts from happening by throttling IO under load.
If it works as advertised then you will only need to reduce the command
window until the aborts stop. On some controllers this may mean setting it
to 1, I believe it defaults to 32, so you can try setting it to 1 now and
see if it helps. IIRC the parameter is MaxCmds = X. If it does help then the
command window patch would probably work too.
-Ross
Is this the QueuedCommands parameter I see in ietd.conf? It's
currently at the default of 32.

Thanks!
--
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com
Ross S. W. Walker
2010-10-18 17:51:00 UTC
Post by Dave Cundiff
<snip>
Areca-1680 512Meg BBU Cache, readahead disabled, write-back enabled
21 1TB 7200 Seagate SAS drives, Raid 50, 7 - 3 disk Raid 5's in the
Raid 0. 64k stripes.
It's a sound configuration if the controller can keep up with it.

Write IOPS should be 7 x 80 = 560 IOPS
Read IOPS should be (2 x 7) x 80 = 1120 IOPS
(assuming roughly 80 random IOPS per 7200rpm drive)
Post by Dave Cundiff
I went with the exotic raid level to get it past budget. Raid 10
wasted to much space, I also decided on small Raid 5's in the 0 so I'd
lose less performance on rebuilds.
Understood, and the configuration is sound from my perspective.
Post by Dave Cundiff
The raid creates one 14TB physical device that is in the VG, I create
the LVM on that and export to my servers. The servers are using the
LVM for /home which is where the majority of my users storage needs
are.
What else is on there?

If you are just sharing /home I would seriously think about XFS over
NFS instead of iSCSI (at least for /home).
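
A minimal sketch of that (the LV name, mount point, network range and export
options here are only examples):

# mkfs.xfs /dev/san/home
# mount /dev/san/home /export/home
# echo '/export/home 10.0.0.0/24(rw,async,no_subtree_check)' >> /etc/exports
# exportfs -ra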

For VMFS you could go either NFS or iSCSI; NFS is definitely a lot
easier to deploy and maintain, and with ESX it performs identically.

The only caveat with NFS for ESX is that it doesn't support RDMs, but
you can overcome that with a small iSCSI-based VMFS datastore
just to hold RDMs.

For database applications, I would use iSCSI rather than NFS, to handle
the myriad of block sizes more efficiently and for lock management.

I might even think of breaking that one large RAID50 into two
or three RAID50s, one for VMs, one for /home, and one for
applications. Don't know what your size requirements vs growth
requirements are, but that would make more sense.

You could also just have it exported from the controller as
7 RAID5s, build a VG out of those, then create LVs striped
over the different RAID5 sets within the VG. Or create different
VGs out of the RAID5 sets and, if need be, re-allocate PVs
between VGs.
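
A rough sketch of that striped-LV variant (device names, VG name and sizes are
placeholders):

# pvcreate /dev/sd[b-h]
# vgcreate san /dev/sd[b-h]
# lvcreate -n vm -L 2T -i 7 -I 64 san
# lvcreate -n home -L 6T -i 7 -I 64 san

where /dev/sd[b-h] would be the 7 RAID5 sets exported by the controller, and
-i 7 -I 64 stripes each LV across all 7 PVs with a 64k stripe size.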
Post by Dave Cundiff
<snip>
Is this the QueuedCommands parameter I see in the ietd.conf? Its
currently at the default of 32.
Ah yes, you're right, QueuedCommands. Try setting it to 1 so ESX
only allows 1 outstanding IO at a time and see if the timeouts
disappear.
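
In ietd.conf that would look something like this (the target name is only a
placeholder; the rest of the stanza stays as it is):

Target iqn.2001-04.com.example:storage.home
        QueuedCommands 1
        ...

The value is applied per session, so the initiators need to log out and back
in after an ietd restart to pick it up.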

-Ross

g***@moll.cl
2010-10-18 15:17:54 UTC
Post by Ross S. W. Walker
<snip>
Are the drivers up to date?
I suppose you have looked at the network side as well? Used iperf to make
sure the network is functioning properly?
If you use iostat -x it will show the service times and queue depths of
the disks. You can use that to determine if the controller or drives are
overloaded: svctm greater than the disk spec, queue depth constantly
greater than the number of disks.
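Something like the following, as a rough sketch (the exact iostat
columns vary a little between sysstat versions, and the target IP is a
placeholder):

# extended device stats, repeated every 5 seconds
iostat -x 5
#   avgqu-sz - average queue depth; constantly above the spindle count
#              means the array cannot keep up
#   svctm    - average service time per request; compare to the drive spec
#   %util    - sustained 100% means the device is saturated
# for the network leg, run an iperf server on the target ...
iperf -s
# ... and point the initiator host at it
iperf -c <target-ip> -t 30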
-Ross
Hi Ross,

I have the following scsi modules loaded

# lsmod | grep scsi
iscsi_trgt 79336 5
iscsi_tcp 10832 0
libiscsi_tcp 11720 1 iscsi_tcp
libiscsi 29704 2 iscsi_tcp,libiscsi_tcp
scsi_transport_iscsi 33040 3 iscsi_tcp,libiscsi
scsi_dh 7152 1 dm_multipath
scsi_wait_scan 1352 0
scsi_mod 155328 10 iscsi_tcp,libiscsi,scsi_transport_iscsi,scsi_dh,sr_mod,sg,scsi_wait_scan,arcmsr,libata,sd_mod
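To answer the "are the drivers up to date?" question, modinfo will show
the version strings of the loaded arcmsr and tg3 modules (a quick check,
to compare against what Areca/Broadcom currently ship):

# report just the version field of each driver
modinfo -F version arcmsr
modinfo -F version tg3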


and reviewing the messages log file I found this

Oct 14 09:50:33 storage kernel: [ 17.479367] tg3 0000:01:06.0: firmware: requesting tigon/tg3_tso5.bin
Oct 14 09:50:33 storage kernel: [ 17.492837] eth2: Failed to load firmware "tigon/tg3_tso5.bin"
Oct 14 09:50:33 storage kernel: [ 17.492843] eth2: TSO capability disabled.
Oct 14 09:50:33 storage kernel: [ 17.492859] tg3 0000:01:06.0: PME# disabled

and Openfiler shows some dropped packets on eth2

Device Received Sent Err/Drop
eth2 122.54 GB 189.42 GB 0/56
Ross S. W. Walker
2010-10-18 15:27:32 UTC
Permalink
Post by g***@moll.cl
Post by Ross S. W. Walker
Post by g***@moll.cl
Post by Ross S. W. Walker
Newer versions of IET/OpenFiler would prevent ESXi from disconnecting, but
the timeout issue is most likely from backing storage that can't keep up
with the demand. Think about adding more disks, or changing RAID config
from RAID5/6 to RAID10.
-Ross
Hmm, I'm having a similar issue. I've been getting iSCSI ping timeouts
on the initiator which almost always ends in 5 retries and a
connection timeout. Generally that requires a reboot to get my
filesystems back out of read-only from the SCSI error it causes. I've
been digging through my network trying to figure out if something is
weird but haven't been able to track anything down.
Is there any way on the target to tell if the disks are getting
overloaded? The only evidence of load I see is istiod processes
blocking in blockio_make_request. My array is pretty high speed (21
disk RAID50 7 containers x 3) and going by the flashy light test
doesn't appear to be that busy.
http://www.wbhs.tv/SanActivity.3GP
The above was taken while IOStat was showing 100% usage most of the
time. I've actually been starting to suspect the controller (Areca
1680) but I've never had trouble with them before.
Are the drivers up to date?
I suppose you have looked at the network side as well? Used iperf to make
sure the network is functioning properly?
If you use iostat -x it will show the service times and queue depths of
the disks. You can use that to determine if the controller or drives are
overloaded: svctm greater than the disk spec, queue depth constantly greater
than the number of disks.
-Ross
Hi Ross,
I have the following scsi modules loaded
# lsmod | grep scsi
iscsi_trgt 79336 5
iscsi_tcp 10832 0
libiscsi_tcp 11720 1 iscsi_tcp
libiscsi 29704 2 iscsi_tcp,libiscsi_tcp
scsi_transport_iscsi 33040 3 iscsi_tcp,libiscsi
scsi_dh 7152 1 dm_multipath
scsi_wait_scan 1352 0
scsi_mod 155328 10 iscsi_tcp,libiscsi,scsi_transport_iscsi,scsi_dh,sr_mod,sg,scsi_wait_scan,arcmsr,libata,sd_mod
and reviewing the messages log file I found this
Oct 14 09:50:33 storage kernel: [ 17.479367] tg3 0000:01:06.0: firmware: requesting tigon/tg3_tso5.bin
Oct 14 09:50:33 storage kernel: [ 17.492837] eth2: Failed to load firmware "tigon/tg3_tso5.bin"
Oct 14 09:50:33 storage kernel: [ 17.492843] eth2: TSO capability disabled.
Oct 14 09:50:33 storage kernel: [ 17.492859] tg3 0000:01:06.0: PME# disabled
and Openfiler shows some dropped packets on eth2
Device Received Sent Err/Drop
eth2 122.54 GB 189.42 GB 0/56
Missing TCP Segmentation Offload (TSO) can cause dropped packets if
your CPU isn't strong enough to do the segmentation under
load, and those dropped packets, if they are frequent
enough, can translate into SCSI timeout issues.

I'm guessing this is probably a low-powered CPU, in which case
you need as much hardware acceleration from the network cards
and storage processors as you can get.

It is possible that either the setup doesn't have the necessary
Broadcom firmware, or your network card is mis-identified and
the firmware doesn't match it. Make sure the Broadcom firmware
package is installed, and/or swap the card for something like an
e1000-series Intel card, which, while not a screamer, is pretty
universally supported (a quick way to check is sketched below).
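A few commands to confirm what eth2 is actually doing (the firmware
path is the standard kernel lookup location; the package that ships
tg3_tso5.bin differs by distro):

# is TSO currently enabled on the interface?
ethtool -k eth2 | grep -i segmentation
# NIC-level drop and error counters
ethtool -S eth2
ifconfig eth2 | grep -i drop
# the tg3 firmware the kernel asked for should be present here
ls /lib/firmware/tigon/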

I would also seriously consider getting an Areca controller with
battery-backed write-back cache and setting up a RAID level
on it with as many drives as you can afford. 4 1TB drives are
WAY better than 2 2TB drives.

-Ross

g***@moll.cl
2010-10-18 18:54:41 UTC
Permalink
Post by Ross S. W. Walker
Post by Ross S. W. Walker
Post by Ross S. W. Walker
Post by g***@moll.cl
Post by Ross S. W. Walker
Newer versions of IET/OpenFiler would prevent ESXi from disconnecting, but
the timeout issue is most likely from backing storage that can't keep up
with the demand. Think about adding more disks, or changing RAID config
from RAID5/6 to RAID10.
-Ross
Hmm, I'm having a similar issue. I've been getting iSCSI ping timeouts
on the initiator which almost always ends in 5 retries and a
connection timeout. Generally that requires a reboot to get my
filesystems back out of read-only from the SCSI error it causes. I've
been digging through my network trying to figure out if something is
weird but haven't been able to track anything down.
Is there any way on the target to tell if the disks are getting
overloaded? The only evidence of load I see is istiod processes
blocking in blockio_make_request. My array is pretty high speed (21
disk RAID50 7 containers x 3) and going by the flashy light test
doesn't appear to be that busy.
http://www.wbhs.tv/SanActivity.3GP
The above was taken while IOStat was showing 100% usage most of the
time. I've actually been starting to suspect the controller (Areca
1680) but I've never had trouble with them before.
Are the drivers up to date?
I suppose you have looked at the network side as well? Used iperf to make
sure the network is functioning properly?
If you use iostat -x it will show the service times and queue depths of
the disks. You can use that to determine if the controller or drives are
overloaded: svctm greater than the disk spec, queue depth constantly
greater than the number of disks.
-Ross
Hi Ross,
I have the following scsi modules loaded
# lsmod | grep scsi
iscsi_trgt 79336 5
iscsi_tcp 10832 0
libiscsi_tcp 11720 1 iscsi_tcp
libiscsi 29704 2 iscsi_tcp,libiscsi_tcp
scsi_transport_iscsi 33040 3 iscsi_tcp,libiscsi
scsi_dh 7152 1 dm_multipath
scsi_wait_scan 1352 0
scsi_mod 155328 10 iscsi_tcp,libiscsi,scsi_transport_iscsi,scsi_dh,sr_mod,sg,scsi_wait_scan,arcmsr,libata,sd_mod
and reviewing the messages log file I found this
Oct 14 09:50:33 storage kernel: [ 17.479367] tg3 0000:01:06.0: firmware: requesting tigon/tg3_tso5.bin
Oct 14 09:50:33 storage kernel: [ 17.492837] eth2: Failed to load firmware "tigon/tg3_tso5.bin"
Oct 14 09:50:33 storage kernel: [ 17.492843] eth2: TSO capability disabled.
Oct 14 09:50:33 storage kernel: [ 17.492859] tg3 0000:01:06.0: PME# disabled
and Openfiler shows some dropped packets on eth2
Device Received Sent Err/Drop
eth2 122.54 GB 189.42 GB 0/56
Missing TCP Segmentation Offload (TSO) can cause dropped packets if
your CPU isn't strong enough to do the segmentation under
load, and those dropped packets, if they are frequent
enough, can translate into SCSI timeout issues.
I think the CPU is strong enough to do it; I have a 2.2GHz AMD Opteron(tm)
Processor 248.
Post by Ross S. W. Walker
I'm guessing this is probably a low-powered CPU, in which case
you need as much hardware acceleration from the network cards
and storage processors as you can get.
It is possible that either the setup doesn't have the necessary
Broadcom firmware, or your network card is mis-identified and
the firmware doesn't match it. Make sure the Broadcom firmware
package is installed, and/or swap the card for something like an
e1000-series Intel card, which, while not a screamer, is pretty
universally supported.
Thanks Ross, I'm checking the network cards and firmware ...
Post by Ross S. W. Walker
I would also seriously consider getting an Areca controller with
battery-backed write-back cache and setting up a RAID level
on it with as many drives as you can afford. 4 1TB drives are
WAY better than 2 2TB drives.
I'll check whether the Areca has a battery-backed cache.


Thanks again,

Best regards,
Post by Ross S. W. Walker
-Ross
Emmanuel Florac
2010-10-16 20:08:46 UTC
Permalink
Post by Dave Cundiff
The above was taken while IOStat was showing 100% usage most of the
time. I've actually been starting to suspect the controller (Areca
1680) but I've never had trouble with them before.
I've seen the same problem with VMWare ESX under high load, requiring
an iet restart. However the controller is a 3Ware.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <***@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
Ross S. W. Walker
2010-10-16 20:45:45 UTC
Permalink
Post by Emmanuel Florac
Post by Dave Cundiff
The above was taken while IOStat was showing 100% usage most of the
time. I've actually been starting to suspect the controller (Areca
1680) but I've never had trouble with them before.
I've seen the same problem with VMWare ESX under high load, requiring
an iet restart. However the controller is a 3Ware.
I was unable to view the attachment.

Is the controller locking up under load?

-Ross


Emmanuel Florac
2010-10-17 17:12:53 UTC
Permalink
Post by Ross S. W. Walker
Post by Emmanuel Florac
I've seen the same problem with VMWare ESX under high load,
requiring an iet restart. However the controller is a 3Ware.
I was unable to view the attachment.
Is the controller locking up under load?
I didn't send any attachment. The problem may also come from DRBD; the
system is under quite heavy load.
g***@moll.cl
2010-10-18 14:53:28 UTC
Permalink
Well, something similar happens in my infrastructure: when the storage
shows the cmnd_abort messages it loses connectivity with the ESXi cluster
and then all the VMs end up with read-only filesystems.
Post by Dave Cundiff
Post by Ross S. W. Walker
Newer versions of IET/OpenFiler would prevent ESXi from disconnecting, but
the timeout issue is most likely from backing storage that can't keep up
with the demand. Think about adding more disks, or changing RAID config
from RAID5/6 to RAID10.
-Ross
Hmm, I'm having a similar issue. I've been getting iSCSI ping timeouts
on the initiator which almost always ends in 5 retries and a
connection timeout. Generally that requires a reboot to get my
filesystems back out of read-only from the SCSI error it causes. I've
been digging through my network trying to figure out if something is
weird but haven't been able to track anything down.
Is there any way on the target to tell if the disks are getting
overloaded? The only evidence of load I see is istiod processes
blocking in blockio_make_request. My array is pretty high speed (21
disk RAID50 7 containers x 3) and going by the flashy light test
doesn't appear to be that busy.
http://www.wbhs.tv/SanActivity.3GP
The above was taken while IOStat was showing 100% usage most of the
time. I've actually been starting to suspect the controller (Areca
1680) but I've never had trouble with them before.
--
Dave Cundiff
System Administrator
A2Hosting, Inc
http://www.a2hosting.com
Ross S. W. Walker
2010-10-18 15:12:11 UTC
Permalink
Post by g***@moll.cl
Well, something similar happens in my infrastructure: when the storage
shows the cmnd_abort messages it loses connectivity with the ESXi cluster
and then all the VMs end up with read-only filesystems.
The aborts indicate a problem with the network or storage; ESXi disconnecting
because of the "Unknown Task" responses is due to a limitation in the
ESXi initiator.

We updated the IET code in 1.4.20 to respond with a "Function Complete"
on those tasks that were within the command window and thus most likely
completed before reception of the abort, which should fix ESXi
disconnecting, but this doesn't fix the root of the problem.

You said in another email you have a 1.8TB volume, but it's not on
RAID.

SCSI Devices
- Areca ARC-1230-VOL#00 (Direct-Access)
- Areca RAID controller (Processor)

Does this mean you have a single 2TB hard disk you're exporting?

Understand that these SATA drives only put out about 80 IOPS of random
IO before they reach 100% utilization, and hosting multiple VMs
on them can saturate them pretty quickly.

You would be much better served either with 4 1TB disks in
a RAID10, or even 4 1TB disks in a RAID5 with battery-backed
write-back cache (rough numbers below).
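As a back-of-the-envelope sketch, assuming roughly 80 random IOPS per
7200rpm SATA spindle (the mirroring/parity penalties are the usual
rules of thumb, not measurements of this particular array):

1 disk:                 ~80 IOPS
4-disk RAID10, reads:   ~4 x 80       = 320 IOPS
4-disk RAID10, writes:  ~(4 x 80) / 2 = 160 IOPS  (each write hits two mirrors)
4-disk RAID5, writes:   ~(4 x 80) / 4 =  80 IOPS  (read-modify-write costs ~4 disk ops)

A battery-backed write-back cache hides much of the RAID5 write penalty
by acknowledging and coalescing writes in cache.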

-Ross

g***@moll.cl
2010-10-18 14:47:59 UTC
Permalink
Post by Ross S. W. Walker
Post by g***@moll.cl
Hi,
I'm using Openfiler (OF) as a storage server and it is connected to a ESXi
cluster. I am experiencing some disconnections between the ESXi servers
and the OF storage.
The messages log in the openfiler server shows the following
cmnd_abort(1144) 71a0bc00 1 0 42 512 0 0
Oct 14 11:32:48 storage kernel: [ 6154.951112] iscsi_trgt: Abort Task
(01) issued on tid:2 lun:0 by sid:845524445757952 (Unknown Task)
cmnd_abort(1144) bac85200 1 0 42 512 0 0
Oct 14 11:32:48 storage kernel: [ 6155.218730] iscsi_trgt: Abort Task
(01) issued on tid:2 lun:0 by sid:282574492336640 (Unknown Task)
and It also contains messages about ietd
Oct 14 09:50:35 storage ietd: initiators.deny is depreciated and will be
obsoleted in the next release, please see README.initiators for more information
Oct 14 09:50:35 storage ietd: /etc/ietd.conf's location is depreciated and
will be moved in the next release to /etc/iet/ietd.conf
Oct 14 09:50:35 storage iscsi-target: ietd startup succeeded
Oct 14 11:39:10 storage ietd: CHAP initiator auth.: No valid user/pass
combination for initiator
iqn.1998-01.com.vmware:localhost-066cb67d found
Oct 15 08:30:32 storage ietd: initiators.deny is depreciated and will be
obsoleted in the next release, please see README.initiators for more information
Oct 15 08:30:32 storage ietd: /etc/ietd.conf's location is depreciated and
will be moved in the next release to /etc/iet/ietd.conf
Oct 15 08:30:32 storage iscsi-target: ietd startup succeeded
and the /etc/ietd.conf contains the following
# cat /etc/ietd.conf
##### WARNING!!! - This configuration file generated by
Openfiler. DO
NOT MANUALLY EDIT. #####
IncomingUser vcenter vcenter
Target iqn.2006-01.com.openfiler:tsn.3d9c3d927c81
HeaderDigest None
DataDigest None
MaxConnections 1
InitialR2T Yes
ImmediateData No
MaxRecvDataSegmentLength 131072
MaxXmitDataSegmentLength 131072
MaxBurstLength 262144
FirstBurstLength 262144
DefaultTime2Wait 2
DefaultTime2Retain 20
MaxOutstandingR2T 8
DataPDUInOrder Yes
DataSequenceInOrder Yes
ErrorRecoveryLevel 0
IncomingUser openfiler openfiler
OutgoingUser openfiler openfiler
Lun 0
Path=/dev/corestorage/coreca,Type=blockio,ScsiSN=xvVmYp-yeSi-O
tqp,ScsiId=xvVmYp-yeSi-Otqp,IOMode=wt
Target iqn.2006-01.com.openfiler:tsn.c1933bae483b
HeaderDigest None
DataDigest None
MaxConnections 1
InitialR2T Yes
ImmediateData No
MaxRecvDataSegmentLength 131072
MaxXmitDataSegmentLength 131072
MaxBurstLength 262144
FirstBurstLength 262144
DefaultTime2Wait 2
DefaultTime2Retain 20
MaxOutstandingR2T 8
DataPDUInOrder Yes
DataSequenceInOrder Yes
ErrorRecoveryLevel 0
IncomingUser openfiler openfiler
OutgoingUser openfiler openfiler
Lun 0
Path=/dev/corestorage/coreca,Type=blockio,ScsiSN=xvVmYp-yeSi-Otqp,ScsiId=xvVmYp-yeSi-Otqp,IOMode=wt
I was wondering if it is caused by a miss configuration of the time2wait
or something that check the health of iscsi targets.
Any idea how to solve this issue ?
Newer versions of IET/OpenFiler would prevent ESXi from disconnecting, but
the timeout issue is most likely from backing storage that can't keep up
with the demand. Think about adding more disks, or changing RAID config
from RAID5/6 to RAID10.
-Ross
_____
Hi Ross,

Well I have the following versions of the software:

Openfiler
Distro Release: Openfiler NSA 2.3
GUI Version: r1650-2-1

[2009 Dec 28 00:41:02] installed
arecacli=/***@ofns:2/v1.82_81103-1-3[is: x86_64]

[2009 Dec 28 00:41:02] installed
arecacli:runtime=/***@ofns:2/v1.82_81103-1-3[is: x86_64]

The hardware
SCSI Devices
- Areca ARC-1230-VOL#00 (Direct-Access)
- Areca RAID controller (Processor)

I'm not using RAID, and I have one 1.8 TB LVM filesystem.

Best regards,