
osg-xsede.grid.iu.edu is one of OSG's popular submit hosts. You submit jobs locally on osg-xsede, and they are executed automatically on various OSG resources via the glidein mechanism.

How do I submit jobs?

SSH to osg-xsede.grid.iu.edu, then submit condor jobs in the vanilla universe.

Sample condor submit file

Put whatever you want to run in run.sh. As usual, make sure to chmod +x it and put a shebang at the top.

  • You should clean up the execution directory at the end of run.sh so that leftover execution directories do not fill up the cluster's disk space.
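
The setup described above can be sketched roughly as follows. This is a minimal example, not the original listing: the file name job.sub, the log file names, the scratch_tmp directory, and the body of run.sh are all assumptions made for illustration.

```shell
# Create a minimal run.sh (contents are illustrative only).
cat > run.sh <<'EOF'
#!/bin/bash
# Do the real work here.
echo "running on $(hostname)"

# Clean up scratch data created by this job (scratch_tmp is a
# hypothetical directory name) so it does not pile up on the cluster.
rm -rf ./scratch_tmp
EOF
chmod +x run.sh

# Create a vanilla-universe submit file (job.sub is a hypothetical name).
cat > job.sub <<'EOF'
universe   = vanilla
executable = run.sh
output     = job.$(Cluster).$(Process).out
error      = job.$(Cluster).$(Process).err
log        = job.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 1
EOF

# Submit with: condor_submit job.sub
```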

 

How can I make sure my jobs are resubmitted if they get stuck?

Jobs often get stuck running on some cluster forever (probably not an issue with the job itself; perhaps the workflow manager loses track of it).

Setting the following options in the condor submit file automatically holds (cancels) such a job and releases (resubmits) it to another site (or possibly the same one).

  • Change 9000 (seconds) to a value above your job's expected execution time.
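
A hold-and-release policy of this kind could look like the sketch below. The exact expressions are an assumption: they use the standard HTCondor submit commands periodic_hold and periodic_release with the 9000-second limit from the text and an illustrative 300-second wait before release. The file name retry_policy.sub is hypothetical.

```shell
# Write the retry policy to a fragment that can be pasted into a
# submit file before its "queue" line.
cat > retry_policy.sub <<'EOF'
# Hold a job that has been running (JobStatus == 2) for more than 9000 seconds.
periodic_hold = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > 9000)
# Release a held job (JobStatus == 5) after 300 seconds so it can be rematched.
periodic_release = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 300)
EOF
```

Note that this release expression releases jobs held for any reason, not only those held by periodic_hold, so a job that fails persistently may cycle until it is removed by hand.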

How can I blacklist sites that are known not to work with my jobs?

Add something like the following to your condor submit file.
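
One way to build such an exclusion is sketched below. It assumes sites are matched on the GLIDEIN_ResourceName attribute mentioned later on this page; the helper name blacklist_requirements and the site names are made up for illustration.

```shell
# Print a "requirements" line that excludes every site named on the
# command line. "=!=" also evaluates to true when the attribute is undefined.
blacklist_requirements() {
  expr=""
  for site in "$@"; do
    clause="(GLIDEIN_ResourceName =!= \"$site\")"
    if [ -z "$expr" ]; then
      expr="$clause"
    else
      expr="$expr && $clause"
    fi
  done
  printf 'requirements = %s\n' "$expr"
}

# Example with two hypothetical site names.
blacklist_requirements "ExampleSiteA" "ExampleSiteB"
```

Paste the printed line into the submit file before its queue statement.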

(Open a GOC ticket at https://ticket.grid.iu.edu if you are having issues with a specific site.)

You can use a similar expression to submit jobs only to certain sites.

 

You can add other conditions, such as a minimum amount of available memory (in this case 2000 MB).
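
A restriction to specific sites combined with a memory floor might be generated as in this sketch. It assumes the GLIDEIN_ResourceName attribute for the site and the standard machine attribute Memory (in MB) for the memory condition; the helper name and site names are hypothetical.

```shell
# Print a "requirements" line that allows only the named sites and
# requires at least the given amount of memory (in MB).
# Usage: whitelist_requirements MIN_MEMORY_MB SITE [SITE...]
whitelist_requirements() {
  min_mb="$1"
  shift
  sites=""
  for site in "$@"; do
    clause="(GLIDEIN_ResourceName == \"$site\")"
    if [ -z "$sites" ]; then
      sites="$clause"
    else
      sites="$sites || $clause"
    fi
  done
  printf 'requirements = (%s) && (Memory >= %s)\n' "$sites" "$min_mb"
}

# Example with two hypothetical site names and the 2000 MB floor from the text.
whitelist_requirements 2000 "ExampleSiteA" "ExampleSiteB"
```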

 

List glidein sites that my jobs could be submitted to

  • You can run this on other glidein-enabled submit hosts too, not just osg-xsede.
  • GLIDEIN_ResourceName (a name entered manually by the glidein admin) is not the same as the OIM resource name.
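
A listing like this can be produced with condor_status, as in the sketch below. The helper name list_glidein_sites is an assumption; the command only prints the GLIDEIN_ResourceName of each advertised slot and counts duplicates.

```shell
# Count the glidein slots currently advertised for each site.
# Usage: list_glidein_sites
list_glidein_sites() {
  condor_status -format '%s\n' GLIDEIN_ResourceName | sort | uniq -c
}
```

Each output line shows a slot count followed by a site name. Slots that do not define GLIDEIN_ResourceName are silently skipped.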

Available constraints

The glidein validation script sets the following constraints, which you can use via condor_status (see the examples below).

Feature                                    Constraint Names                             Description
OASIS availability and versioning          HAS_CVMFS_oasis_opensciencegrid_org
                                           CVMFS_oasis_opensciencegrid_org_REVISION
                                           CVMFS_oasis_opensciencegrid_org_TIMESTAMP
Numpy/Scipy                                HAS_NUMPY
Disk space available to the pilot          OSGVO_PILOT_DF                               (in bytes)
Standard HTCondor attributes               (Supported)
OSG_SQUID_LOCATION is set in WN env (?)    HAS_SQUID

Example

Run on sites where OASIS is installed
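
A submit file restricted to OASIS-enabled slots might look like the sketch below. It uses the HAS_CVMFS_oasis_opensciencegrid_org attribute from the table above; the file name oasis_job.sub and the use of run.sh as the executable are assumptions.

```shell
# Write a submit file that matches only slots advertising OASIS.
cat > oasis_job.sub <<'EOF'
universe     = vanilla
executable   = run.sh
# Match only slots whose validation script reported OASIS as available.
requirements = (HAS_CVMFS_oasis_opensciencegrid_org =?= True)
queue 1
EOF
```

The "=?=" operator treats an undefined attribute as a non-match instead of leaving the expression undefined.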

 

Show all sites that have blast

Show sites that have OASIS installed

Show all sites that don't have blast
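
Queries of this kind can be sketched as two small helpers around condor_status. The helper names are hypothetical, and the attribute to test is passed as an argument: use a name from the table above (for example HAS_CVMFS_oasis_opensciencegrid_org), or whichever attribute your pool advertises for blast, which is not listed on this page.

```shell
# List sites with at least one slot where the given attribute is True.
# Usage: sites_with HAS_CVMFS_oasis_opensciencegrid_org
sites_with() {
  condor_status -constraint "$1 =?= True" \
    -format '%s\n' GLIDEIN_ResourceName | sort -u
}

# List sites with at least one slot where the attribute is not True
# (either False or undefined).
# Usage: sites_without HAS_NUMPY
sites_without() {
  condor_status -constraint "$1 =!= True" \
    -format '%s\n' GLIDEIN_ResourceName | sort -u
}
```

Both helpers work per slot, so a site with a mix of slots can appear in both lists.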

Show which sites my jobs are running on

 

You can filter the list of jobs by JobStatus.

  
Running jobs    condor_q -constraint JobStatus==2
Idle jobs       condor_q -constraint JobStatus==1
Held jobs       condor_q -constraint JobStatus==5

For example, if you want to list all held jobs and show which site each one was submitted to, you can do the following.
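
A query along these lines is sketched below. It is an approximation, not the original command: it prints the standard LastRemoteHost job attribute, whose host name usually indicates the site but is not the site name itself. The helper name jobs_with_host is hypothetical.

```shell
# Print cluster ID, process ID, and last execution host for jobs in a
# given JobStatus (1 = idle, 2 = running, 5 = held). Defaults to held jobs.
# Usage: jobs_with_host [STATUS]
jobs_with_host() {
  status="${1:-5}"
  condor_q -constraint "JobStatus == $status" \
    -autoformat ClusterId ProcId LastRemoteHost
}
```

Jobs that have never started have no LastRemoteHost, and -autoformat prints "undefined" for them.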

 

Submitting from BOSCO 

If you are submitting from BOSCO to osg-xsede, here are some useful tips. 


Note: Run these commands on the BOSCO submit node, not on osg-xsede.

Show WallClockTime used by OSG-XSEDE

Step 1) Find the DAGMan Job ID

Step 2) Run condor_history with the DAG ID (example here is 177)
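
The two steps can be sketched as the helpers below. They assume the standard job attributes JobUniverse (7 is the scheduler universe used by DAGMan), DAGManJobId, and RemoteWallClockTime (in seconds); the helper names are hypothetical.

```shell
# Step 1: list DAGMan jobs in the queue to find the DAG ID (the ClusterId).
list_dagman_jobs() {
  condor_q -constraint 'JobUniverse == 7' -autoformat ClusterId Cmd
}

# Step 2: sum the wall clock time of all finished jobs that belong to
# the given DAG and print the total in hours.
# Usage: dag_wallclock_hours 177
dag_wallclock_hours() {
  condor_history -constraint "DAGManJobId == $1" \
    -autoformat RemoteWallClockTime |
    awk '{ total += $1 } END { printf "%.2f\n", total / 3600 }'
}
```

Only jobs that have already left the queue appear in condor_history, so the total grows as the DAG progresses.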

 

Troubleshooting

ssh to remote job

Once a job is submitted to a remote site, you can ssh to the remote site and troubleshoot any issues while the job is running (and even after it has completed) by doing the following on osg-xsede.

The parameter passed to the condor_ssh_to_job command is the cluster/process ID of the job you want to debug. Some sites, however, are not configured to allow ssh (usually due to an old condor version, misconfiguration, etc.), so contact the site admin and you might get lucky.
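
As a convenience, the lookup of the job ID and the ssh step can be combined as in this sketch. The helper name is hypothetical; the direct form is simply condor_ssh_to_job followed by the job ID, for example condor_ssh_to_job 1234.0 for a made-up job 1234.0.

```shell
# Open a shell next to the first running job owned by the current user.
ssh_to_first_running_job() {
  job_id=$(condor_q -constraint "JobStatus == 2 && Owner == \"$(id -un)\"" \
             -format '%d.' ClusterId -format '%d\n' ProcId | head -n 1)
  if [ -z "$job_id" ]; then
    echo "no running jobs found" >&2
    return 1
  fi
  condor_ssh_to_job "$job_id"
}
```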

 

Another fun thing to try is to run following ..

If you get lucky, condor will submit a random job and let you ssh to it (if the remote site supports it)

 

 
