[IPython-User] engines dying due to excessive load

Robert Nishihara robertnishihara@gmail....
Fri Jun 29 11:22:47 CDT 2012


Ok, so numpy uses Intel's Math Kernel Library (MKL), which automatically
parallelizes many operations across multiple threads; on a cluster, those
extra threads can overload the nodes and cause problems for the scheduler.

Setting MKL_NUM_THREADS=1 on the engines appears to have completely fixed
the problem. In my script, I did this with the line

    dview.execute("os.environ['MKL_NUM_THREADS']='1'")

which stops the scheduler from suspending my jobs and also gives me a
performance increase (presumably because the scheduler was unable to
effectively handle the load).
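
For anyone who hits the same thing, the relevant part of my script now looks
roughly like the sketch below. The Client call and the task function are just
placeholders, not my actual code; the important part is importing os on the
engines and setting the variable there, ideally before numpy has been imported
on them.

    import os
    from IPython.parallel import Client

    rc = Client(profile='sge')
    dview = rc[:]  # direct view over all engines

    # Pin MKL to a single thread per engine, before numpy/MKL is first used there.
    dview.execute("import os; os.environ['MKL_NUM_THREADS'] = '1'", block=True)

    def work(x):
        # placeholder for the real numpy-heavy task
        import numpy as np
        a = np.random.rand(500, 500)
        return np.linalg.norm(a.dot(a))

    results = dview.map_sync(work, range(1000))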

-Robert

On Fri, Jun 29, 2012 at 12:19 AM, Robert Nishihara <
robertnishihara@gmail.com> wrote:

> I am using numpy all over the place, so I will investigate if that is the
> issue.
>
>
> On Thu, Jun 28, 2012 at 8:19 PM, MinRK <benjaminrk@gmail.com> wrote:
>
>>
>>
>> On Thu, Jun 28, 2012 at 5:04 PM, Bago <mrbago@gmail.com> wrote:
>>
>>>
>>>
>>> On Thu, Jun 28, 2012 at 3:28 PM, Robert Nishihara <
>>> robertnishihara@gmail.com> wrote:
>>>
>>>> I've been trying to figure this out for a couple days now, and I'm
>>>> curious if anyone has seen a similar problem.
>>>>
>>>> My setup is
>>>>
>>>>     ipcontroller --profile=sge
>>>>     ipcluster engines -n 100 --profile=sge
>>>>
>>>> My script uses map_sync with a direct view. After running my script for
>>>> a couple minutes, the load on the compute nodes grows excessively high and
>>>> the scheduler starts suspending jobs, so some of the engines get suspended.
>>>> This causes my script to terminate with an error like the one below
>>>>
>>>>     [Engine Exception]EngineError: Engine 1315 died while running task
>>>> '966abf73-3183-4db3-8cf2-96bd08c2312b'
>>>>
>>>> The engine is numbered 1315 because I sometimes restart the engines
>>>> without restarting the controller.
>>>>
>>>> Why would suspending an engine cause my script to terminate
>>>> instead of simply forcing it to wait?
>>>>
>>>> Why might the load be so high? Each node has 32 cores. At most twenty
>>>> engines are running on each node. Yet, sometimes several hundred processes
>>>> are vying for space on a given node (and I'm the only one using the
>>>> cluster). Could it be the queuing of messages or something?
>>>>
>>>
>>> This is a bit of a shot in the dark, but on our machines we need to set
>>> MKL_NUM_THREADS=1, otherwise some numpy functions (which I assume are
>>> calling MKL functions) try to use 16 threads. Is it possible some of your
>>> code, or some library you rely on, is multi-threaded?
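>>>
>>> A quick way to check on a single node is to watch a big matrix multiply in
>>> top. This is just an illustrative snippet (not from our setup); with the
>>> variable set before numpy is imported, the multiply should stay near 100%
>>> of one core instead of fanning out across all of them:
>>>
>>>     import os
>>>     os.environ['MKL_NUM_THREADS'] = '1'  # set before numpy first pulls in MKL
>>>     import numpy as np
>>>
>>>     a = np.random.rand(4000, 4000)
>>>     a.dot(a)  # watch CPU usage in top while this runs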
>>>
>>
>> The only library *IPython* uses that is multithreaded is zeromq, but
>> that's only one additional thread.  If *you* are using numpy, then the MKL
>> environment variable is relevant.