
Experiences with Several Issues When Using Condor


The following are some issues I encountered while using Condor; some of them may not have a proper solution yet.

 

1. When you condor_submit a job from a directory that cannot be used as the job's initial working directory (here, /root), the job is held:

 

[zhxue@osg root]$ condor_submit /opt/app/tmp/condor-ia64/hello-vanilla.cmd
Submitting job(s).
1 job(s) submitted to cluster 9503.
[zhxue@osg root]$ condor_q -analyze

-- Submitter: osg.cnic.cn : <159.226.3.188:55200> : osg.cnic.cn
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
---
9502.000:  Request is held.

Hold reason: Cannot access initial working directory /root: Permission denied

---
9503.000:  Request is held.

Hold reason: Cannot access initial working directory /root: Permission denied

 

When you first switch to a world-accessible directory (such as /tmp, whose permissions are rwxrwxrwt), it works.
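Alternatively, instead of cd-ing somewhere writable before submitting, you can point the job at an accessible directory explicitly with initialdir in the submit file. A minimal sketch, assuming a hello binary like the one referenced by hello-vanilla.cmd above (the executable path and file names here are illustrative, not taken from the original submit file):

# hello-vanilla.cmd (illustrative paths)
universe   = vanilla
executable = /opt/app/tmp/condor-ia64/hello
# run the job from a directory Condor can access, instead of
# whatever directory condor_submit happens to be called from
initialdir = /tmp
output     = hello.out
error      = hello.err
log        = hello.log
queue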

 

 

 

2. When you submit a job under your account and it is matched to a WN (worker node) that does not have your account, the job sits idle forever.

 

 

1) Submitter

 

9/25 11:51:21 (pid:5089) Activity on stashed negotiator socket
9/25 11:51:21 (pid:5089) Negotiating for owner: zhxue@*.cnic.cn
9/25 11:51:21 (pid:5089) Checking consistency running and runnable jobs
9/25 11:51:21 (pid:5089) Tables are consistent
9/25 11:51:21 (pid:5089) Rebuilt prioritized runnable job list in 0.000s.
9/25 11:51:21 (pid:5089) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
9/25 11:51:21 (pid:5089) Sent ad to central manager for zhxue@*.cnic.cn
9/25 11:51:21 (pid:5089) Sent ad to 1 collectors for zhxue@*.cnic.cn
9/25 11:51:23 (pid:5089) Starting add_shadow_birthdate(9550.0)
9/25 11:51:23 (pid:5089) Started shadow for job 9550.0 on "<192.168.21.241:20182>", (shadow pid = 4381)
9/25 11:51:23 (pid:5089) Shadow pid 4381 for job 9550.0 exited with status 107
9/25 11:51:23 (pid:5089) Sent RELEASE_CLAIM to startd at <192.168.21.241:20182>
9/25 11:51:23 (pid:5089) Match record (<192.168.21.241:20182>, 9550, 0) deleted
9/25 11:51:23 (pid:5089) Null parameter --- match not deleted
9/25 11:51:23 (pid:5089) Got VACATE_SERVICE from <192.168.21.241:20029>
9/25 11:51:26 (pid:5089) Sent ad to central manager for zhxue@*.cnic.cn
9/25 11:51:26 (pid:5089) Sent ad to 1 collectors for zhxue@*.cnic.cn

 

 

2) Executor

 

[root@c2401 root]# tail -f /opt/app/condor/condor/local.c2401/log/StartLog
9/25 11:52:27 slot1: State change: starter exited
9/25 11:52:27 slot1: Changing activity: Busy -> Idle
9/25 11:52:27 Aborting CA_LOCATE_STARTER
9/25 11:52:27 ClaimId (<192.168.21.241:20182>#1253849726#14#3999829961) and GlobalJobId ( osg.cnic.cn#1253849811#9550.0 ) not found
9/25 11:52:27 slot1: State change: received RELEASE_CLAIM command
9/25 11:52:27 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
9/25 11:52:27 slot1: State change: No preempting claim, returning to owner
9/25 11:52:27 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
9/25 11:52:27 slot1: State change: IS_OWNER is false
9/25 11:52:27 slot1: Changing state: Owner -> Unclaimed

 

 

Once you run useradd zhxue on the worker node, the idle job runs at once. The StartLog then looks like the following:

 

[root@c2401 root]# tail -f /opt/app/condor/condor/local.c2401/log/StartLog
9/25 11:57:25 slot1: Request accepted.
9/25 11:57:25 slot1: Remote owner is zhxue@*.cnic.cn
9/25 11:57:25 slot1: State change: claiming protocol successful
9/25 11:57:25 slot1: Changing state: Matched -> Claimed
9/25 11:57:27 ZKM: setting default map to condor_pool@*.cnic.cn
9/25 11:57:27 slot1: Got activate_claim request from shadow (<159.226.3.188:37436>)
9/25 11:57:27 slot1: Remote job ID is 9550.0
9/25 11:57:27 slot1: Got universe "VANILLA" (5) from request classad
9/25 11:57:27 slot1: State change: claim-activation protocol successful
9/25 11:57:27 slot1: Changing activity: Idle -> Busy
9/25 12:07:28 slot1: Called deactivate_claim_forcibly()
9/25 12:07:28 Starter pid 18644 exited with status 0
9/25 12:07:28 slot1: State change: starter exited
9/25 12:07:28 slot1: Changing activity: Busy -> Idle
9/25 12:07:28 slot1: State change: received RELEASE_CLAIM command
9/25 12:07:28 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
9/25 12:07:28 slot1: State change: No preempting claim, returning to owner
9/25 12:07:28 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
9/25 12:07:28 slot1: State change: IS_OWNER is false
9/25 12:07:28 slot1: Changing state: Owner -> Unclaimed

 

The above is the complete set of executor messages (our job runs for 10 minutes).
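In other words, for vanilla jobs the execute machine needs a local account matching the submitting user (otherwise it may fall back to nobody, see issue 3 below). A minimal sketch of the fix on the worker node, as root, assuming you only need a matching local account and do not care about the numeric UID:

# on the worker node (c2401), as root
useradd zhxue        # create the missing account
id zhxue             # verify the account now exists

As noted above, the idle job was matched and started running immediately after the account was created; no restart of the Condor daemons was needed.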

 

 

 

3. Why is the output file written on both the submitter and the executor?

 

When you set the following in your submit file:

 

output = /tmp/hello.out.$(Cluster).$(Process)

 

When the job is over, you will get the following info:

 

In the submitter box:  

 

[zhxue@osg tmp]$ cat /tmp/hello.out.9507.0
You have new mail in /var/spool/mail/root

while on the executor box:

 

[root@c2408 root]# cat /tmp/hello.out.9507.0
Hello World!
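It looks like, with these settings, the vanilla job writes its stdout to the absolute /tmp path on whichever machine runs it, while the submit side only ends up with an empty file. If you want the output back on the submit machine, one approach (these are standard submit-file commands; the file name is illustrative) is to enable Condor's file transfer and use a relative output path:

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# a relative path is created in the job's scratch directory on the
# executor and transferred back to the submit directory on exit
output = hello.out.$(Cluster).$(Process)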

 

Actually, if you set "output = /mnt/test2.out", the job fails. The job itself completes, but it cannot write its results to /mnt/test2.out, and the job pends forever. I traced this issue: when you submit the job as zhxue on the submit host, even though the executor has a zhxue account, the job is switched to nobody rather than zhxue. See the following on the executor:

                                                                                                                
20255 nobody    26  10 26000 1004  836 S 599.7  0.0   5:36.76 thread2 

 

Right after you submit the job, the file /mnt/test2.out is created, owned by zhxue:

 

ls -al test2.out

-rw-r--r-- 1 zhxue wheel     0 Mar 17 11:38 test2.out

 

The nobody user, however, has no permission to write to that file when the job finishes, so the job appears to have failed.

 

I consulted the Condor manual and found the solution: if the submit host and the execute host share the same UID_DOMAIN (set in the config file), the same user is used on both sides. For example, if you submit as zhxue on the submit host, the job also runs as zhxue on the execute host. Note that ClassAds do not seem to accept the * wildcard here: when I set UID_DOMAIN to *.cnic.cn it did not work, but osg.cnic.cn did. Likewise, Name == "*@c201" was not recognized, so I had to use Machine == "c201" instead. If the execute host has no zhxue account, the job is executed as nobody.
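For reference, a minimal sketch of the setting that worked for me, placed in the config of both the submit host and the execute hosts (the domain value is the one from this pool, so substitute your own):

# condor_config (or the local config) on both submit and execute hosts
UID_DOMAIN = osg.cnic.cn

Run condor_reconfig on each host after changing it.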

 

 

4. The difference between local.c2408 and local.c2401 that determines whether the clients are available

 

# This value of HOSTALLOW_WRITE overrides the invalid value in condor_config
#HOSTALLOW_WRITE = $(FULL_HOSTNAME)    # this setting makes the node unavailable
HOSTALLOW_WRITE = *                    # this works

 

The following are the messages concerning this issue:

 

1)  -analyze

[zhxue@osg tmp]$ condor_q -analyze

---
9511.000:  Run analysis summary.  Of 36 machines,
     35 are rejected by your job's requirements
      1 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        
2) submitter SchedLog 

[zhxue@osg tmp]$ tail -f /opt/osg/condor/local.osg/log/SchedLog --lines=500

9/25 08:57:46 (pid:5089) ZKM: setting default map to condor_pool@*.cnic.cn
9/25 08:57:46 (pid:5089) condor_write(): Socket closed when trying to write 2146 bytes to <192.168.21.242:24712>, fd is 14
9/25 08:57:46 (pid:5089) Buf::write(): condor_write() failed
9/25 08:57:46 (pid:5089) Couldn't send eom to startd.
9/25 08:57:46 (pid:5089) Match record (<192.168.21.242:24712>, 9511, 0) deleted
9/25 08:57:51 (pid:5089) Sent ad to central manager for zhxue@*.cnic.cn
9/25 08:57:51 (pid:5089) Sent ad to 1 collectors for zhxue@*.cnic.cn
9/25 08:58:03 (pid:5089) Sent ad to central manager for zhxue@*.cnic.cn
9/25 08:58:03 (pid:5089) Sent ad to 1 collectors for zhxue@*.cnic.cn

 

3) executor Log

 

[root@c2402 root]# tail -f --lines=100 /opt/app/condor/condor/local.c2402/log/SchedLog

There are no messages about this here, since this host is just a worker node and is not responsible for scheduling.

 

 

[root@c2402 root]# tail -f --lines=100 /opt/app/condor/condor/local.c2402/log/StartLog

9/25 08:50:28 ZKM: setting default map to condor_pool@*.cnic.cn
9/25 08:50:28 DaemonCore: PERMISSION DENIED to condor_pool@*.cnic.cn from host <159.226.3.188:46780> for command 442 (REQUEST_CLAIM), access level DAEMON
9/25 08:50:28 ZKM: setting default map to condor_pool@*.cnic.cn
9/25 08:50:28 slot2: match_info called
9/25 08:50:28 slot2: Received match <192.168.21.242:24712>#1253686678#2#...
9/25 08:50:28 slot2: State change: match notification protocol successful
9/25 08:50:28 slot2: Changing state: Unclaimed -> Matched
9/25 08:52:28 slot2: State change: match timed out
9/25 08:52:28 slot2: Changing state: Matched -> Owner
9/25 08:52:28 slot2: State change: IS_OWNER is false
9/25 08:52:28 slot2: Changing state: Owner -> Unclaimed
9/25 08:59:03 slot2: match_info called
9/25 08:59:03 slot2: Received match <192.168.21.242:24712>#1253686678#6#...
9/25 08:59:03 slot2: State change: match notification protocol successful
9/25 08:59:03 slot2: Changing state: Unclaimed -> Matched
9/25 08:59:03 DaemonCore: PERMISSION DENIED to condor_pool@*.cnic.cn from host <159.226.3.188:60141> for command 442 (REQUEST_CLAIM), access level DAEMON
9/25 08:59:03 DaemonCore: PERMISSION DENIED to condor_pool@*.cnic.cn from host <159.226.3.188:52786> for command 443 (RELEASE_CLAIM), access level DAEMON
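The PERMISSION DENIED lines above are the key: with HOSTALLOW_WRITE = $(FULL_HOSTNAME), only the worker node itself has WRITE access, so the REQUEST_CLAIM / RELEASE_CLAIM commands coming from the submit host (159.226.3.188) are rejected and the match eventually times out. Setting it to * works, but a slightly tighter sketch, assuming the submit host and central manager live under *.cnic.cn (adjust the names to your own pool), would be:

# local config on the worker nodes: allow the node itself plus the
# submit host / central manager to send WRITE-level commands
HOSTALLOW_WRITE = $(FULL_HOSTNAME), osg.cnic.cn, *.cnic.cn

Run condor_reconfig on the worker node afterwards.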

 

 

 

5. Validating MPI and Java jobs in both heterogeneous and homogeneous pools, using the vanilla and globus universes respectively

 

 

 

 

6. Validating the local.d229 config file line "CA certificate directory for Condor-G"

 

 

 

7. Why does it still not work on c2402 after changing WRITE to *?

 

1) on Submitter

[zhxue@osg tmp]$ tail -f /opt/osg/condor/local.osg/log/SchedLog --lines=500

9/25 09:40:30 (pid:5089) ZKM: setting default map to zhxue@*.cnic.cn
9/25 09:40:30 (pid:5089) Sent ad to central manager for zhxue@*.cnic.cn
9/25 09:40:30 (pid:5089) Sent ad to 1 collectors for zhxue@*.cnic.cn
9/25 09:40:30 (pid:5089) Called reschedule_negotiator()
9/25 09:40:30 (pid:5089) Activity on stashed negotiator socket
9/25 09:40:30 (pid:5089) Negotiating for owner: zhxue@*.cnic.cn
9/25 09:40:30 (pid:5089) Checking consistency running and runnable jobs
9/25 09:40:30 (pid:5089) Tables are consistent
9/25 09:40:30 (pid:5089) Rebuilt prioritized runnable job list in 0.000s.
9/25 09:40:30 (pid:5089) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
9/25 09:40:32 (pid:5089) Starting add_shadow_birthdate(9520.0)
9/25 09:40:32 (pid:5089) Started shadow for job 9520.0 on "<192.168.21.242:22978>", (shadow pid = 21643)
9/25 09:40:33 (pid:5089) ZKM: setting default map to zhxue@*.cnic.cn
9/25 09:40:33 (pid:5089) Shadow pid 21643 for job 9520.0 exited with status 100
9/25 09:40:33 (pid:5089) Checking consistency running and runnable jobs
9/25 09:40:33 (pid:5089) Tables are consistent
9/25 09:40:33 (pid:5089) Rebuilt prioritized runnable job list in 0.000s.  (Expedited rebuild because no match was found)
9/25 09:40:33 (pid:5089) match (<192.168.21.242:22978>#1253840973#3#...) out of jobs (cluster id 9520); relinquishing
9/25 09:40:33 (pid:5089) Sent RELEASE_CLAIM to startd at <192.168.21.242:22978>
9/25 09:40:33 (pid:5089) Match record (<192.168.21.242:22978>, 9520, -1) deleted
9/25 09:40:33 (pid:5089) Got VACATE_SERVICE from <192.168.21.242:23880>
9/25 09:40:35 (pid:5089) Sent owner (0 jobs) ad to 1 collectors

2) on Executor

 

[root@c2402 root]# tail -f --lines=100 /opt/app/condor/condor/local.c2402/log/StartLog

9/25 09:33:13 slot3: match_info called
9/25 09:33:13 slot3: Received match <192.168.21.242:22978>#1253840973#3#...
9/25 09:33:13 slot3: State change: match notification protocol successful
9/25 09:33:13 slot3: Changing state: Unclaimed -> Matched
9/25 09:33:13 slot3: Request accepted.
9/25 09:33:13 slot3: Remote owner is zhxue@*.cnic.cn
9/25 09:33:13 slot3: State change: claiming protocol successful
9/25 09:33:13 slot3: Changing state: Matched -> Claimed
9/25 09:33:15 ZKM: setting default map to condor_pool@*.cnic.cn
9/25 09:33:15 slot3: Got activate_claim request from shadow (<159.226.3.188:33488>)
9/25 09:33:15 slot3: Remote job ID is 9520.0
9/25 09:33:15 slot3: Got universe "VANILLA" (5) from request classad
9/25 09:33:15 slot3: State change: claim-activation protocol successful
9/25 09:33:15 slot3: Changing activity: Idle -> Busy
9/25 09:33:15 slot3: Called deactivate_claim_forcibly()
9/25 09:33:15 slot3: State change: received RELEASE_CLAIM command
9/25 09:33:15 slot3: Changing state and activity: Claimed/Busy -> Preempting/Vacating
9/25 09:33:15 Starter pid 15332 exited with status 0
9/25 09:33:15 slot3: State change: starter exited
9/25 09:33:15 slot3: State change: No preempting claim, returning to owner
9/25 09:33:15 slot3: Changing state and activity: Preempting/Vacating -> Owner/Idle
9/25 09:33:15 slot3: State change: IS_OWNER is false
9/25 09:33:15 slot3: Changing state: Owner -> Unclaimed

 

 

8. Using a shell script as the executable

 

The "vanilla" universe supports shell scripts as executables. My jobs kept failing because of the following:

 

#!/bin/bash            <-- I missed this line, and that caused the failure.

./bin/blastall -p blastn -d ./blast/db/test_na_db -i foo.fas -e 10 -v 100
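For completeness, a minimal sketch of the matching submit file for such a script in the vanilla universe (file names other than the blastall command line are made up for illustration):

# blast.cmd -- submit description for the script above
# (the executable is the shell script itself, starting with #!/bin/bash)
universe   = vanilla
executable = run_blast.sh
output     = blast.out.$(Cluster).$(Process)
error      = blast.err.$(Cluster).$(Process)
log        = blast.log
queue

Submit it with condor_submit blast.cmd; making the script executable (chmod +x run_blast.sh) does not hurt.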

 

 

9. "1 reject your job because of their own requirements"

 

Sure enough, no machines can run your job. The message means the machine's requirements do not match your job, because START is typically defined, in part, as a set of requirements about your job. In this particular case no job can satisfy a START expression of FALSE.

Set START back to TRUE. As root:

% echo "START=TRUE" >> /tmp/condor/var/condor_config.local
% condor_reconfig

In a bit your job should run and exit.
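If you are not sure which value is actually in effect, you can check it directly with the standard tools (c2401 is one of the worker nodes from the examples above; run the first command on that node):

% condor_config_val START
% condor_status -long c2401 | grep -i '^Start'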
