Debian Bug report logs - #524553
openmpi-bin: mpiexec seems to be resolving names on server instead of each node


Package: openmpi-bin; Maintainer for openmpi-bin is Debian Open MPI Maintainers <pkg-openmpi-maintainers@lists.alioth.debian.org>; Source for openmpi-bin is src:openmpi.

Reported by: Micha Feigin <michf@post.tau.ac.il>

Date: Fri, 17 Apr 2009 22:48:02 UTC

Severity: normal

Tags: confirmed, upstream

Found in version openmpi/1.2.8-3



Report forwarded to debian-bugs-dist@lists.debian.org, michf@post.tau.ac.il, Debian OpenMPI Maintainers <pkg-openmpi-maintainers@lists.alioth.debian.org>:
Bug#524553; Package openmpi-bin. (Fri, 17 Apr 2009 22:48:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Micha Feigin <michf@post.tau.ac.il>:
New Bug report received and forwarded. Copy sent to michf@post.tau.ac.il, Debian OpenMPI Maintainers <pkg-openmpi-maintainers@lists.alioth.debian.org>. (Fri, 17 Apr 2009 22:48:04 GMT) Full text and rfc822 format available.

Message #5 received at submit@bugs.debian.org (full text, mbox):

From: Micha Feigin <michf@post.tau.ac.il>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: openmpi-bin: mpiexec seems to be resolving names on server instead of each node
Date: Sat, 18 Apr 2009 01:49:41 +0300
Package: openmpi-bin
Version: 1.2.8-3
Severity: important


As far as I understand the error, mpiexec resolves names -> addresses on the server
it is run on instead of on each host separately. This works in an environment where
each hostname resolves to the same address on every host (a cluster connected via a
switch) but fails where it resolves to different addresses (for example ring/star setups
where each computer is connected directly to all or some of the others).

I'm not 100% sure that this is the problem, as I'm seeing success in a single
case where this should probably fail, but it is my best bet given the error message.
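
For illustration, a small standalone resolver test along these lines (just a sketch, not
part of my actual setup) shows the point: resolving the same name on different hosts
consults each host's own /etc/hosts and DNS configuration, so the answer can differ per
node. Run it on hubert and on leela and compare the output for "fry":

/* resolve-test.c: illustrative sketch, not from the actual cluster setup.
 * Prints the IPv4 address(es) a hostname resolves to on the machine it
 * runs on. Compile with: gcc -o resolve-test resolve-test.c */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "fry";
    struct addrinfo hints, *res, *p;
    char buf[INET_ADDRSTRLEN];

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4 only, for simplicity */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(name, NULL, &hints, &res) != 0) {
        fprintf(stderr, "cannot resolve %s\n", name);
        return 1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        struct sockaddr_in *sin = (struct sockaddr_in *)p->ai_addr;
        printf("%s -> %s\n", name,
               inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)));
    }
    freeaddrinfo(res);
    return 0;
}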

Version 1.2.8 worked fine for the same simple program (a simple hello world that
just communicates the computer name for each process).
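
For reference, a minimal sketch of the kind of program I mean (reconstructed here for
illustration, not the exact test_mpi source): each rank sends its processor name to
rank 0, which prints them.

/* hello-world sketch: every rank reports its host name to rank 0.
 * Reconstruction for illustration; compile with mpicc, run with mpiexec. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len, i;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    if (rank == 0) {
        printf("Hello MPI from the server process of %d on %s!\n", size, name);
        for (i = 1; i < size; i++) {
            MPI_Recv(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, i, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Hello MPI from process %d on %s!\n", i, name);
        }
    } else {
        MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}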

An example output:

mpiexec is run on the master node hubert and is set to run the processes on two nodes,
fry and leela. As I understand from the error messages, leela tries to connect to
fry at address 192.168.1.2, which is fry's address as seen from hubert but not from
leela (where it is 192.168.4.1).

This is a four-node cluster, all nodes interconnected:

    192.168.1.1      192.168.1.2
hubert ------------------------ fry
  |    \                    /    | 192.168.4.1
  |       \              /       |
  |          \        /          |
  |             \  /             |
  |             /  \             |
  |          /        \          |
  |       /              \       |
  |    /                     \   | 192.168.4.2
hermes ----------------------- leela
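
For illustration (these hosts entries are only a sketch of the idea, not copied from the
actual machines): with per-link addressing like the above, the name fry has to map to a
different address on each node, e.g.

    # /etc/hosts on hubert (illustrative)
    192.168.1.2   fry

    # /etc/hosts on leela (illustrative)
    192.168.4.1   fry

so an address that is looked up on hubert and passed on to leela is unreachable from leela.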

=================================================================
mpiexec -np 8 -H fry,leela test_mpi
Hello MPI from the server process of 8 on fry!
[[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable

[[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable

[[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable

[leela:4436] *** An error occurred in MPI_Send
[leela:4436] *** on communicator MPI_COMM_WORLD
[leela:4436] *** MPI_ERR_INTERN: internal error
[leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable

--------------------------------------------------------------------------
mpiexec has exited due to process rank 1 with PID 4433 on
node leela exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[hubert:11312] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
=================================================================

This seems to be a directional issue: running the program with -H fry,leela fails
whereas -H leela,fry works. The behaviour is the same for all scenarios except those that
include the master node (hubert), where it resolves the external IP (from an external DNS)
instead of the internal IP (from the hosts file); thus one direction fails (there is no
external connection at the moment for all nodes but the master) and the other causes a lockup.
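
Not a fix for the resolution issue itself, but possibly useful for narrowing it down or
working around it: the TCP BTL can be limited to particular interfaces with the MCA
parameters btl_tcp_if_include / btl_tcp_if_exclude, e.g. (the interface names here are
only placeholders for whatever the nodes actually use; I have not verified that this
avoids the problem):

    # keep the TCP BTL off the loopback and the external interface
    mpiexec --mca btl_tcp_if_exclude lo,eth0 -np 8 -H fry,leela test_mpi

    # or list only the cluster-facing interfaces explicitly
    mpiexec --mca btl_tcp_if_include eth1,eth2 -np 8 -H fry,leela test_mpi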

I hope the explanation is not too convoluted.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.28.8 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages openmpi-bin depends on:
ii  libc6                         2.9-7      GNU C Library: Shared libraries
ii  libgcc1                       1:4.3.3-7  GCC support library
ii  libopenmpi1                   1.2.8-3    high performance message passing l
ii  libstdc++6                    4.3.3-7    The GNU Standard C++ Library v3
ii  openmpi-common                1.2.8-3    high performance message passing l

openmpi-bin recommends no packages.

Versions of packages openmpi-bin suggests:
ii  gfortran                      4:4.3.3-2  The GNU Fortran 95 compiler

-- no debconf information




Information forwarded to debian-bugs-dist@lists.debian.org, Debian OpenMPI Maintainers <pkg-openmpi-maintainers@lists.alioth.debian.org>:
Bug#524553; Package openmpi-bin. (Wed, 22 Apr 2009 13:51:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Manuel Prinz <manuel@debian.org>:
Extra info received and forwarded to list. Copy sent to Debian OpenMPI Maintainers <pkg-openmpi-maintainers@lists.alioth.debian.org>. (Wed, 22 Apr 2009 13:51:02 GMT) Full text and rfc822 format available.

Message #10 received at 524553@bugs.debian.org (full text, mbox):

From: Manuel Prinz <manuel@debian.org>
To: Micha Feigin <michf@post.tau.ac.il>, 524553@bugs.debian.org
Cc: Jeff Squyres <jsquyres@cisco.com>
Subject: Re: [Pkg-openmpi-maintainers] Bug#524553: openmpi-bin: mpiexec seems to be resolving names on server instead of each node
Date: Wed, 22 Apr 2009 15:48:46 +0200
[Message part 1 (text/plain, inline)]
Hi Micha!

I'm sorry for replying late! I was on holidays.

Your description sounds reasonable but I have no possibility to do tests
of my own at the moment. I CC'ed Jeff (upstream), maybe he can comment
on the issue.

BTW, did you also try the 1.3 series of Open MPI?

Best regards
Manuel


On Saturday, 18 Apr 2009 at 01:49 +0300, Micha Feigin wrote:
> [original bug report quoted in full; see message #5 above]

[signature.asc (application/pgp-signature, inline)]

Tags added: confirmed Request was from Manuel Prinz <debian@pinguinkiste.de> to control@bugs.debian.org. (Tue, 23 Jun 2009 21:27:03 GMT) Full text and rfc822 format available.

Severity set to 'normal' from 'important' Request was from Manuel Prinz <manuel@debian.org> to control@bugs.debian.org. (Mon, 12 Jul 2010 15:36:03 GMT) Full text and rfc822 format available.

Added tag(s) upstream. Request was from Manuel Prinz <manuel@debian.org> to control@bugs.debian.org. (Fri, 16 Jul 2010 09:21:10 GMT) Full text and rfc822 format available.

Information forwarded to debian-bugs-dist@lists.debian.org, Debian Open MPI Maintainers <pkg-openmpi-maintainers@lists.alioth.debian.org>:
Bug#524553; Package openmpi-bin. (Thu, 22 Jul 2010 18:57:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jeff Squyres <jsquyres@cisco.com>:
Extra info received and forwarded to list. Copy sent to Debian Open MPI Maintainers <pkg-openmpi-maintainers@lists.alioth.debian.org>. (Thu, 22 Jul 2010 18:57:07 GMT) Full text and rfc822 format available.

Message #21 received at 524553@bugs.debian.org (full text, mbox):

From: Jeff Squyres <jsquyres@cisco.com>
To: 524553@bugs.debian.org
Subject: Is this still happening?
Date: Thu, 22 Jul 2010 14:41:45 -0400
I just replied to this issue on the Open MPI user's list:

    http://www.open-mpi.org/community/lists/users/2010/07/13714.php

It would be good to know if this is still happening.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






