osg-xsede.grid.iu.edu is one of OSG's popular submit hosts. You submit jobs locally on osg-xsede, and they are executed automatically on various OSG resources via the glidein mechanism.
How do I submit jobs?
SSH to osg-xsede.grid.iu.edu, then submit condor jobs in the vanilla universe.
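A minimal submit file for this could look like the sketch below. The file name job.sub and the output, error and log names are assumptions chosen for illustration, not names taken from the original setup.

```shell
# Write a minimal vanilla-universe submit file (job.sub is a hypothetical name).
cat > job.sub <<'EOF'
universe   = vanilla
executable = run.sh
output     = job.$(Cluster).$(Process).out
error      = job.$(Cluster).$(Process).err
log        = job.log
queue
EOF

# Then, on osg-xsede, submit it with:
#   condor_submit job.sub
```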
Put whatever you want to run in run.sh. As usual, make sure to chmod +x it and put a shebang at the top.
- You should clean up the execution directory at the end of run.sh so that leftover execution directories do not fill up the cluster's disk space.
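One way to get the cleanup right is to do all the work in a scratch directory and remove it with a trap, so it is deleted even if the script fails halfway. This is only a sketch: the work.XXXXXX directory and result.txt are placeholder names, and the echo line stands in for the real workload.

```shell
# Create run.sh with a shebang and make it executable.
cat > run.sh <<'EOF'
#!/bin/bash
# Work inside a scratch directory and remove it when the script exits.
workdir=$(mktemp -d "$PWD/work.XXXXXX")
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
echo "hello from $(uname -n)" > result.txt   # placeholder for the real work
cp result.txt ..                             # keep only the files you need
EOF
chmod +x run.sh
```

Running ./run.sh leaves result.txt in the starting directory and nothing else behind.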
How can I make sure my jobs will be resubmitted if they get stuck?
Often my jobs get stuck running on some cluster forever (probably not an issue with the job itself; maybe the workflow manager loses track of it).
I can set the following options in my condor submit file to automatically cancel (hold) the job and resubmit (release) it to another site (or to the same site...).
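A sketch of such options, using the standard periodic_hold and periodic_release submit commands, is shown here. The 60 second release delay and the job.sub file name are assumptions for illustration; note that this release rule also releases jobs that were held for other reasons.

```shell
# Submit file that holds long-running jobs and releases them again (job.sub is a hypothetical name).
cat > job.sub <<'EOF'
universe   = vanilla
executable = run.sh
log        = job.log

# Hold a job that has been running (JobStatus == 2) for more than 9000 seconds.
periodic_hold    = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > 9000)
# Release a held job (JobStatus == 5) after 60 seconds so it can be matched again.
periodic_release = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 60)

queue
EOF
```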
- Update 9000 (seconds) to some value above the expected execution time of your job.
How can I blacklist sites that are known not to work with my jobs?
Add something like the following to your condor submit file.
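For instance, a requirements expression on GLIDEIN_ResourceName can exclude sites by name. In this sketch SiteA and SiteB are placeholder site names and job.sub is a hypothetical file name; `=!=` is the ClassAd "is not identical to" operator, which also behaves sensibly when the attribute is undefined.

```shell
# Submit file that avoids two named sites.
cat > job.sub <<'EOF'
universe = vanilla
executable = run.sh
# Skip glideins running at these sites; SiteA and SiteB are placeholder names.
requirements = (GLIDEIN_ResourceName =!= "SiteA") && (GLIDEIN_ResourceName =!= "SiteB")
queue
EOF
```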
(Open a GOC ticket at https://ticket.grid.iu.edu if you are having any issue with a specific site.)
Obviously, you can use something similar to submit jobs only to certain sites.
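A whitelist version of the same idea might look like this sketch, where SiteA is again a placeholder site name and job.sub a hypothetical file name.

```shell
# Submit file that runs only at one named site.
cat > job.sub <<'EOF'
universe = vanilla
executable = run.sh
# Match only glideins at this site; SiteA is a placeholder name.
requirements = (GLIDEIN_ResourceName == "SiteA")
queue
EOF
```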
You can add other conditions, such as minimum available memory (in this case 2000MB).
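Conditions are combined with `&&` in the same requirements expression. The sketch below assumes the standard Memory machine attribute (reported in MB) and reuses the placeholder site name SiteA.

```shell
# Submit file that avoids one site and requires at least 2000 MB of memory.
cat > job.sub <<'EOF'
universe = vanilla
executable = run.sh
# SiteA is a placeholder name; Memory is the slot memory in MB.
requirements = (GLIDEIN_ResourceName =!= "SiteA") && (Memory >= 2000)
queue
EOF
```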
List glidein sites that my jobs could be submitted to
- You can actually run this on other glidein-enabled sites, not just osg-xsede.
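One way to produce such a list is to print GLIDEIN_ResourceName for every slot with condor_status and count the duplicates. The function name list_glidein_sites is a hypothetical helper, not part of HTCondor.

```shell
# Count glidein slots per site as seen from the submit host.
list_glidein_sites() {
    condor_status -af GLIDEIN_ResourceName | sort | uniq -c | sort -rn
}

# Usage on the submit host:
#   list_glidein_sites
```

The output has one line per site, with the number of slots currently advertising that site name in the first column.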
Note that GLIDEIN_ResourceName is a name entered manually by the glidein admin; it is not necessarily the same as the OIM resource name.
The glidein validation script sets the following attributes, which you can use as constraints with condor_status (see below for an example).
|OASIS availability and versioning|
|Disk space available to the pilot||OSGVO_PILOT_DF (in bytes)|
|Standard HTCondor attributes||(Supported)|
|OSG_SQUID_LOCATION is set in WN env||HAS_SQUID|
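As an example of using these attributes, the sketch below lists sites whose glideins advertise a Squid proxy and a minimum amount of free disk. It assumes HAS_SQUID is advertised as a boolean; the helper name sites_with_squid and the default threshold of about 1 GB are made up for illustration.

```shell
# List sites whose glideins advertise Squid and more than min_bytes of free disk.
sites_with_squid() {
    min_bytes=${1:-1000000000}   # example threshold in bytes, roughly 1 GB
    condor_status -constraint "HAS_SQUID == True && OSGVO_PILOT_DF > $min_bytes" \
        -af GLIDEIN_ResourceName | sort -u
}

# Usage on the submit host:
#   sites_with_squid 5000000
```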
Show which site my jobs are running on
You can filter the list of jobs based on JobStatus.
|Running Jobs||condor_q -constraint JobStatus==2|
|Idle Jobs||condor_q -constraint JobStatus==1|
|Held Jobs||condor_q -constraint JobStatus==5|
For example, if you want to list all held jobs and show which site each one was submitted to, you can do the following.
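A possible version of that query prints the standard LastRemoteHost and HoldReason job attributes for every held job. This is a sketch: LastRemoteHost is the glidein slot the job last ran on, so the site usually has to be read off the host name, and held_jobs_with_host is a hypothetical helper name.

```shell
# Show held jobs with the host they last ran on and the reason for the hold.
held_jobs_with_host() {
    condor_q -constraint 'JobStatus == 5' \
        -af ClusterId ProcId LastRemoteHost HoldReason
}

# Usage on the submit host:
#   held_jobs_with_host
```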
Submitting from BOSCO
If you are submitting from BOSCO to osg-xsede, here are some useful tips.
Show WallClockTime used by OSG-XSEDE
Step 1) Find the DAGMan Job ID
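DAGMan itself runs as a scheduler-universe job, so one way to find its ID is to filter the queue on JobUniverse == 7. The helper name find_dagman_jobs is an assumption for illustration.

```shell
# List scheduler-universe jobs (JobUniverse == 7), which is where DAGMan runs.
find_dagman_jobs() {
    condor_q -constraint 'JobUniverse == 7' -af ClusterId Cmd
}

# Usage:
#   find_dagman_jobs
```

The first column of the output is the cluster ID to use as the DAG ID.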
Step 2) Run condor_history with the DAG ID (example here is 177)
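A sketch of this step selects the finished jobs that belong to the DAG through the DAGManJobId attribute and sums their RemoteWallClockTime (in seconds). The helper name dag_wallclock_hours and the conversion to hours are choices made here, not part of the original recipe.

```shell
# Sum the wall clock time of all finished jobs of one DAG and print it in hours.
dag_wallclock_hours() {
    condor_history -constraint "DAGManJobId == $1" -af RemoteWallClockTime |
        awk '{ total += $1 } END { printf "%.2f\n", total / 3600 }'
}

# Usage with the example DAG ID:
#   dag_wallclock_hours 177
```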
ssh to remote job
Once a job is submitted to a remote site, you can ssh to the remote site and troubleshoot any issues while it's running (and even after the job is completed) by doing the following on osg-xsede.
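The command for this is condor_ssh_to_job. The sketch below adds a small hypothetical helper, first_running_job, that picks the ID of your first running job so it can be passed straight to that command.

```shell
# Print the cluster.process ID of the current user's first running job.
first_running_job() {
    condor_q "$(id -un)" -constraint 'JobStatus == 2' -af ClusterId ProcId |
        awk 'NR == 1 { print $1 "." $2 }'
}

# Usage on osg-xsede:
#   condor_ssh_to_job "$(first_running_job)"
# or, with an explicit (placeholder) job ID:
#   condor_ssh_to_job 1234.0
```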
The parameter passed to the condor_ssh_to_job command is the cluster/process ID of the job you want to debug. Some sites, however, are not configured to allow ssh (usually due to an old condor version, misconfiguration, etc.), so contact the site admin and you might get lucky!
Another fun thing to try is to run the following.
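The original command is not preserved here; a likely candidate is the interactive mode of condor_submit, which queues a job and opens a shell on whichever slot it lands on. The wrapper name interactive_job is hypothetical, and treating this as the intended command is an assumption.

```shell
# Ask HTCondor for an interactive session on whichever slot matches first.
interactive_job() {
    condor_submit -interactive "$@"
}

# Usage on osg-xsede:
#   interactive_job
```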
If you get lucky, condor will submit a random job and let you ssh to it (if the remote site supports it)