- What is OSG-Connect
- Available computing resources
- Sign up for an account
- Join or create a project
- Login to OSG-Connect
- Software modules
- Running jobs via HTCondor
- Use of the tutorial command
- Data Transfer
- Workflow managers - DAGMan and Pegasus
- HTC jobs - Best practices
- Workshops and Training
- How to get help
Welcome to the Open Science Grid (OSG) Connect computing service. This guide walks you through the basics of OSG computing resources and services. After reading it, you can check the ConnectBook and FAQ for additional details.
What is OSG-Connect
Broadly speaking, OSG-Connect is a computing service for the academic research community. Any researcher affiliated with a U.S. institution (college, university, national laboratory or research foundation) whose science application can benefit from distributed high-throughput computing resources is eligible to use OSG-Connect.
Available computing resources
OSG-Connect enables distributed computing on thousands of cores spread across multiple institutions. When a user submits work, the cores that are currently free in the shared pool are made available. These are opportunistic resources: the number of free cores varies at any given time. OSG-Connect can support large-scale computing demands of more than 2 million CPU hours per day.
Sign up for an account
To get an account, visit the OSG-Connect web site, then select Sign In/Sign Up ▸ Sign up as a new user. A new user can sign up with InCommon registration or Globus Online. The details of the sign-up process are given in the ConnectBook.
New account approval involves the following steps:
- User completes the sign up process
- User application is sent to OSG support staff
- OSG support staff checks the credentials of the applicant and contacts the user via phone or email
- OSG staff approves/denies the application
- Once approved, user can connect to login.osgconnect.net
Join or create a project
Account approval means you have joined OSG-Connect. Initially, new users are placed in the osg.ConnectTrain project for a short term to learn how to use OSG-Connect. After that, the next step is to create a new project or join one or more existing projects.
New projects are created and managed by the principal investigators (PIs) or their delegates within OSG-Connect. Details on setting up a new project in OSG-Connect are given in the ConnectBook.
You may already know which project(s) you need to join. To join an existing active project, you must work or collaborate with its group. Go to the OSG-Connect web site and select the Connect ▸ Join a Project menu item to see a list of currently active projects. Once you have found a project to join, click on its name at the left of your screen, and then on Join Now at the lower right. On receiving your request, OSG-Connect staff will request authorization from the PI or designated representative.
Login to OSG-Connect
You have to use Secure Shell (SSH) to log in to OSG-Connect. SSH clients are natively available on Linux and Mac machines. For Windows machines, consider installing PuTTY, a free SSH application.
login.osgconnect.net is the login node for OSG-Connect. To log in, open a terminal window and type ssh username@login.osgconnect.net, where username is your login ID.
You can avoid typing your password each time you log in to OSG by creating an SSH key, which tells the OSG login node to always trust your laptop or desktop. To set up SSH keys, check the lesson here or the ConnectBook login instructions.
Whenever you log in, your working directory is your home directory. The home directory is "/home/username" (username is your login ID) and is stored in the shell variable $HOME. If you are a new user on OSG-Connect, some files and directories are already created for you. To list them, type ls in your home directory; you should see two directories, "public" and "stash".
The content of the "public" directory is available to anyone via HTTP. To view it, go to http://stash.osgconnect.net/+username, where username is your login ID. The other directory, "stash", is useful for transferring large amounts of data to remote machines. We will see the usage of "stash" and "public" in a later section.
In addition to the two directories, there are several "standard" dot-files, visible when you type ls -a. You may see files such as .bash_profile, .bashrc, .cshrc, .kshrc, .login, .profile, .tcshrc, .zprofile, .bashrc.ext, .cshrc.ext, .kshrc.ext, .login.ext, .profile.ext, .tcshrc.ext, .zprofile.ext, and .zshrc.ext. You can customize your environment by changing the variables and aliases defined in these files.
To see the list of projects of which you are a member, use the connect command.
Software modules
OSG provides several applications, tools, libraries and utilities. This software is installed and maintained in a central repository known as OASIS. The advantage is that software on OASIS is also available on the remote worker machines. Popular software such as NAMD, GAMESS, BLAST, R and many more are available. Here is the full list of available software. If you would like some open software to be installed on OASIS, please send your request to email@example.com.
Commonly used software on OASIS is accessed via the module command. Modules are a convenient way to set up the execution environment for software, which includes the paths to its binaries, libraries and dependent files.
The first step in using modules on OSG is to initialize the module system. If you use bash, source the module system's bash initialization script.
For other shells such as sh, zsh, tcsh or csh, replace bash with the shell name (e.g. zsh). Once the module system is initialized, the module command can show the available software, load software or report what has been loaded.
To see the available software, type module avail.
To load specific software, for example R, type module load R.
To check the list of loaded software, type module list.
To unload a specific package from the list of loaded modules, for example R, type module unload R.
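The module workflow described above can be sketched as follows. The initialization path under /cvmfs is an assumption (this guide does not state it) and may differ on your system; avail, load, list and unload are standard module subcommands. The guard only skips the commands on machines where the module system is not mounted.

```shell
# Assumed location of the bash init script for the module system; adjust if needed.
MODULE_INIT="/cvmfs/oasis.opensciencegrid.org/osg/modules/lmod/current/init/bash"

if [ -r "$MODULE_INIT" ]; then
    . "$MODULE_INIT"      # initialize the module command for this shell
    module avail          # list the available software
    module load R         # set up the environment for R
    module list           # show what is currently loaded
    module unload R       # remove R from the environment again
else
    echo "Module system not found at $MODULE_INIT"
fi
```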
Running jobs via HTCondor
In OSG, the job scheduler is HTCondor. Jobs submitted via HTCondor wait in the queue until compute resources are available. The compute resources in OSG-Connect depend on the availability of remote worker machines spread across multiple locations. When resources become available, the jobs are sent to the remote machines along with the necessary files. A typical Condor job needs input files, a wrapper script and the execution binaries.
Let us walk through submitting an example compute job on OSG-Connect. Suppose we want to run a program called "my_program" that operates on input data and prints output after processing. This is a typical pattern when running several jobs. For the sake of simplicity, we define "my_program" as a short shell script that uses the Unix sort command. Open a file "my_program", write a script that runs sort on the input file, then save and close it.
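A minimal sketch of what "my_program" might contain is shown below. The exact script is not given in this guide, so the numeric sort (sort -n) and the convention of taking the input file as the first argument are assumptions. The last two lines are a quick local check with a hypothetical file "sample.dat".

```shell
# Write a minimal "my_program": sort the integers in the file named by $1.
cat > my_program <<'EOF'
#!/bin/bash
# Usage: ./my_program INPUT_FILE
# Prints the lines of INPUT_FILE in ascending numeric order.
sort -n "$1"
EOF
chmod +x my_program

# Quick local check with three numbers.
printf '42\n7\n19\n' > sample.dat
./my_program sample.dat
```

Running the check prints the three numbers in ascending order, one per line.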
Next, we write a wrapper script, "short_wrapper.bash", that runs "my_program" with the input and output file names.
The input file "input_randomINT.dat" contains a list of random integers.
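The wrapper and a sample input file could look like the sketch below. The argument order (input file first, output file second) and the way the random integers are generated are assumptions made for illustration; only the file names come from the text.

```shell
# Wrapper script: $1 is the input file, $2 the output file (an assumed convention).
cat > short_wrapper.bash <<'EOF'
#!/bin/bash
# Run my_program on the input file and save what it prints.
./my_program "$1" > "$2"
EOF
chmod +x short_wrapper.bash

# Sample input: 100 random integers between 0 and 9999, one per line.
awk 'BEGIN { srand(); for (i = 0; i < 100; i++) print int(rand() * 10000) }' > input_randomINT.dat
```

With "my_program" in the same directory, ./short_wrapper.bash input_randomINT.dat output.dat runs the whole chain locally, which is a useful check before submitting the job.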
Now we have to inform HTCondor about what to execute and what the input and output data are. This is done via the job submission file. Let us see how to write a simple HTCondor job file and then how to run the job. For our example, we are going to instruct Condor to do the following:
- Execute the wrapper script "short_wrapper.bash"
- Declare the arguments required for the wrapper script
- Transfer "my_program" to the worker machine since the wrapper script needs the program
- Transfer "input_randomINT.dat" to the worker machine since "my_program" needs the input file
- Specify the name of standard error and output files
- Transfer the output files upon job completion from the worker machine
Open a file "short_condorjob.submit" and insert the following information.
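The six instructions listed above map onto submit commands roughly as in this sketch, which writes the file from the shell. The names "output.dat", "job.output", "job.error" and "job.log" are hypothetical; the submit commands themselves are standard HTCondor ones.

```shell
cat > short_condorjob.submit <<'EOF'
# Run the wrapper script in the standard (vanilla) universe
universe   = vanilla
executable = short_wrapper.bash

# Arguments passed to the wrapper: input file, then output file
arguments  = input_randomINT.dat output.dat

# Files copied to the worker machine before the job starts
transfer_input_files = my_program, input_randomINT.dat

# Standard output, standard error and HTCondor's own event log
output = job.output
error  = job.error
log    = job.log

# Copy new output files back when the job exits
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Submit one instance of the job
queue 1
EOF
```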
The job file "short_condorjob.submit" is submitted to the HTCondor job scheduler using the command condor_submit, followed by the name of the job file.
On success, condor_submit reports the job ID, for example 2470320. Check the status of the job by typing condor_q followed by the job ID.
Note that ST is the status of the job: R for running, I for idle and H for held. Initially the job state is "I" until Condor finds a compute resource, and it changes to "R" when the job starts to run. "H" means the job is held. To learn more about a specific job, for example why it has not started, condor_q accepts additional options such as -better-analyze (see the error101 tutorial).
To check the status of all the jobs in your account, type condor_q followed by your username.
To remove a specific job from the queue, type condor_rm followed by the job ID.
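Put together, the job lifecycle commands above might be used as in the following sketch. The job ID is the example value from the text and must be replaced by the one condor_submit reports. The guard only prevents errors on machines without HTCondor, and condor_rm is left commented out so the example does not remove a job by accident.

```shell
JOB_FILE="short_condorjob.submit"
JOB_ID="2470320"    # example ID from the text; use the one condor_submit reports

if command -v condor_submit >/dev/null 2>&1; then
    condor_submit "$JOB_FILE"             # queue the job and print its ID
    condor_q "$JOB_ID"                    # status of that one job
    condor_q -better-analyze "$JOB_ID"    # why an idle job is not yet running
    condor_q "$(id -un)"                  # all jobs of the current user
    # condor_rm "$JOB_ID"                 # uncomment to remove the job from the queue
else
    echo "HTCondor commands not found; run this on login.osgconnect.net"
fi
```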
The example above serves as a good template for writing your own job submission file and wrapper script. In some cases one may be able to run a job without an execution script; even then, we recommend using a wrapper script. The example does not show the use of the distributed environment modules. Several examples that run calculations with R, Octave, NAMD and other packages are available in the ConnectBook and the tutorials. Next, we see how to use the tutorials.
Use of the tutorial command
The example above is an introduction to writing a simple job file that executes a shell wrapper script, which in turn contains the instructions for running a program. We highly recommend writing a wrapper script to run your own program or the packages installed on OASIS. Using the packages on OASIS involves sourcing the module initialization script and then running module load. Several examples are available to users in the form of tutorials.
The tutorial command is useful for getting started with one of the existing tutorials. Typing tutorial prints the following list of tutorials:
R ...................... Estimate Pi using the R programming language
cp2k ................... How-to for the electronic structure package CP2K
dagman-namd ............ Launch a series of NAMD simulations via Condor DAG
error101 ............... Use condor_q -better-analyze to analyze stuck jobs
exitcode ............... Use HTCondor's periodic_release to retry failed jobs
htcondor-transfer ...... Transfer data via HTCondor's own mechanisms
namd ................... Run a molecular dynamics simulation using NAMD
nelle-nemo ............. Running Nelle Nemo's goostats on the grid
oasis-parrot ........... Software access with OASIS and Parrot
octave ................. Matrix manipulation via the Octave programming language
pegasus ................ An introduction to the Pegasus job workflow manager
pegasus-namd ........... Pegasus workflow to run large scale simulations - NAMD examples
photodemo .............. A complete analysis workflow using HTTP transfer
quickstart ............. How to run your first OSG job
root ................... Inspect ntuples using the ROOT analysis framework
scaling ................ Learn to steer jobs to particular resources
scaling-up-resources ... A simple multi-job demonstration
software ............... Software access tutorial
stash-chirp ............ Use the chirp I/O protocol for remote data access
stash-http ............. Retrieve job input files from Stash via HTTP
stash-namd ............. Provide input files for NAMD via Stash's HTTP interface
swift .................. Introduction to the SWIFT parallel scripting language
The files for each tutorial are readily available to users. For example, to get the files for the NAMD tutorial, type tutorial namd, which creates a directory "tutorial-namd" with all necessary files in it.
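A short session with the tutorial command might look like the sketch below. It assumes that the tutorial name from the list is passed as the only argument, which is inferred from the "tutorial-namd" directory name rather than stated in this guide. The guard skips the commands on machines where tutorial is not installed.

```shell
TUTORIAL_NAME="namd"    # any name from the list printed by "tutorial"

if command -v tutorial >/dev/null 2>&1; then
    tutorial                        # list the available tutorials
    tutorial "$TUTORIAL_NAME"       # fetch the files for the chosen tutorial
    ls "tutorial-$TUTORIAL_NAME"    # inspect what was created
else
    echo "tutorial command not found; it is provided on login.osgconnect.net"
fi
```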
Data Transfer
Users can store data in their home directory and in the stash directory. The capacity of home is limited, so keep the data there below a few gigabytes. Anything larger belongs in the stash directory. For easy access, stash is mounted in your home directory. Use stash to pre-stage job input datasets or to write output files when the input/output data is larger than the home filesystem can handle. More details about stash are available here. To see how to use stash for your computing, follow the tutorial outlined here or another example here.
At present, the storage service is offered as a free scratch-like storage service and there are no user quotas. When space on the system becomes tight, files will be removed on a simple least-recently-used basis.
login.osgconnect.net <=> Desktop
Data may be transferred between login.osgconnect.net and a desktop (or laptop, or a backup machine), for example to back up data. Use scp, sftp or rsync for transfers of less than 10 GB. Globus transfer is highly recommended for transfers of more than 10 GB.
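For transfers below the 10 GB mark, scp and rsync can be wrapped in two small helpers, as in this sketch. The function names and the file names in the example calls are hypothetical; run the helpers on your desktop, not on the login node.

```shell
OSG_LOGIN="username@login.osgconnect.net"   # replace username with your login ID

# Copy one local file to a directory on the login node with scp.
push_to_osg() {
    scp "$1" "${OSG_LOGIN}:$2"
}

# Mirror a remote directory to the local machine with rsync
# (-a keeps permissions and timestamps, -z compresses in transit).
pull_from_osg() {
    rsync -az "${OSG_LOGIN}:$1" "$2"
}

# Example calls (hypothetical file names):
#   push_to_osg input.tar.gz stash/
#   pull_from_osg stash/results/ ./results_backup/
```

rsync only copies what has changed, so repeating pull_from_osg after a new batch of jobs is cheaper than copying the whole directory again with scp.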
login.osgconnect.net <=> Worker Machine
It is important to know how to transfer data between login.osgconnect.net and the remote worker node, because input and output files are transferred whenever a job executes on OSG-Connect. Two transfer mechanisms are used: HTTP and Condor file transfer. As outlined in the section on running jobs, Condor file transfer is controlled through submit-file keywords such as "transfer_input_files, transfer_output_files, when_to_transfer...".
HTTP transfer uses the wget command to fetch data from the public directory. Files located in ~/public or ~/data/public are available through HTTP. Go to http://stash.osgconnect.net/+username to view the files under ~/public. You have to copy or create files and directories under ~/public in order to access them via HTTP. You can then retrieve a file with wget, for example wget http://stash.osgconnect.net/+username/test_file.
To copy "test_file" from the public directory to the remote worker node, add this wget line to your wrapper script.
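A wrapper script that stages its input over HTTP could be written as in the sketch below. The script name "wget_wrapper.bash" and the processing step are hypothetical, and "username" is a placeholder for your login ID; the URL pattern is the one given above.

```shell
cat > wget_wrapper.bash <<'EOF'
#!/bin/bash
# Download the input from the public area; stop if the download fails.
wget --tries=3 "http://stash.osgconnect.net/+username/test_file" || exit 1
# Process the downloaded file (my_program is the example program from earlier).
./my_program test_file > output.dat
EOF
chmod +x wget_wrapper.bash
```

Because the job fetches "test_file" itself, the file does not need to appear in transfer_input_files.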
Workflow managers - DAGMan and Pegasus
In scientific computing, one may have to perform several computational tasks or data manipulations that are interdependent. Workflow management systems help to deal with such tasks. We highly recommend the DAGMan and Pegasus workflow managers.
DAGMan (Directed Acyclic Graph Manager) is a workflow management system developed for distributed high throughput computing. It handles computational jobs that are mapped as a directed acyclic graph: the nodes (jobs) are connected along a specific direction and, unlike in a cyclic graph, following the connections never leads back to a node already visited. For further details, check the DAGMan tutorial.
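A small DAG input file makes the idea concrete. In this sketch the job names and submit file names are hypothetical, while JOB and PARENT ... CHILD are DAGMan keywords and condor_submit_dag is the command that submits the workflow. The referenced submit files must exist before the DAG is actually submitted.

```shell
cat > example.dag <<'EOF'
# Four jobs: "prepare" runs first, the two "simulate" jobs follow in parallel,
# and "collect" runs only after both simulations have finished.
JOB prepare   prepare.submit
JOB simulate1 simulate1.submit
JOB simulate2 simulate2.submit
JOB collect   collect.submit
PARENT prepare CHILD simulate1 simulate2
PARENT simulate1 simulate2 CHILD collect
EOF

if command -v condor_submit_dag >/dev/null 2>&1; then
    condor_submit_dag example.dag
else
    echo "condor_submit_dag not found; submit example.dag from the login node"
fi
```

The dependencies only point forward (prepare, then simulate, then collect), so the graph has no loop and DAGMan can always decide which job to run next.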
Pegasus can handle millions of computational tasks and takes care of managing input/output files for you. It is built on DAGMan. Pegasus enables scientists to construct workflows in abstract terms without worrying about the details of the underlying execution environment or the particulars of the low-level specifications required by the middleware. Some of the advantages of using Pegasus include
- Performance - The Pegasus mapper can reorder, group, and prioritize tasks in order to increase the overall workflow performance.
- Scalability - Pegasus can easily scale both the size of the workflow and the resources that the workflow is distributed over. Pegasus runs workflows ranging from just a few computational tasks up to 1 million. The number of resources involved in executing a workflow can scale as needed without any impediments to performance.
- Portability / Reuse - User-created workflows can easily be run in different environments without alteration.
- Data Management - Pegasus handles replica selection, data transfers and output registrations in data catalogs. These tasks are added to a workflow as auxiliary jobs by the Pegasus planner.
- Reliability - Jobs and data transfers are automatically retried in case of failures. Debugging tools such as pegasus-analyzer help the user to debug the workflow in case of non-recoverable failures.
- Provenance - By default, all jobs in Pegasus are launched via the kickstart process, which captures runtime provenance of the job and helps in debugging. The provenance data is collected in a database, and the data can be summarized with tools such as pegasus-statistics and pegasus-plots, or directly with SQL queries.
- Error Recovery - When errors occur, Pegasus tries to recover when possible by retrying tasks, by retrying the entire workflow, by providing workflow-level checkpointing, by re-mapping portions of the workflow, by trying alternative data sources for staging data, and, when all else fails, by providing a rescue workflow containing a description of only the work that remains to be done.
Check the Pegasus tutorials for further details.
HTC jobs - Best practices
High throughput workflows with simple system and data dependencies are a good fit for OSG-Connect. The HTCondor manual has an overview of high throughput computing. Please consider the following best practices.
- Users commonly submit thousands of jobs on OSG-Connect. Each job should preferably be single-threaded, use less than 2 GB of memory and run for 4-12 hours. There is some support for jobs with longer run times, more memory or multi-threaded codes. Please contact the support staff for more information about these capabilities.
- Binaries should preferably be statically linked. However, dynamically linked binaries with standard library dependencies, built for a 64-bit Red Hat Enterprise Linux (RHEL) 5 machine, will also work. Interpreted languages such as Python or Perl will also work as long as there are no special module requirements.
Job Submission directory:
- For small-scale testing and development, users can use a directory in their home directory area. For production usage, where the returned output is likely to be large, users should use a directory in /stash or ~/stash to hold the logs and output from jobs.
- If users anticipate creating a large number of files (e.g. more than 50,000) within a single directory, please contact the OSG Connect administrators so that we can create a special area for this.
- The Condor file transfer mechanism should not be used if the total size of the files staged in or out of a job exceeds 1 GB. In these situations, the job should use an HTTP transfer to stage in data and/or a tool like connect-copy to stage data in or out.
- Use a tarball to aggregate multiple small files rather than transferring 100+ small files individually.
- Whenever possible, use a workflow manager to run the jobs.
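The tarball recommendation can be tried locally with the following sketch. The directory name "inputs", the file names and the count of 150 files are made up for illustration.

```shell
# Create 150 small example input files (stand-ins for real data).
mkdir -p inputs
i=1
while [ "$i" -le 150 ]; do
    echo "sample data $i" > "inputs/part_$i.dat"
    i=$((i + 1))
done

# Bundle them into one compressed archive that can be transferred as a single file.
tar -czf inputs.tar.gz inputs

# In the wrapper script on the worker machine, unpack before running:
#   tar -xzf inputs.tar.gz

# Count the data files in the archive to confirm none are missing.
tar -tzf inputs.tar.gz | grep -c '\.dat$'
```

The job then transfers one archive instead of 150 separate files, and the wrapper script restores the directory on the worker machine.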
Workshops and Training
We offer training and tutorials for scientists and researchers via online webinars and on-campus visits. As part of our training service, we offer joint Software Carpentry/Open Science Grid workshops at many campus sites. You can find the list of upcoming workshops and their details by following the link here.
How to get help
Feel free to ask any questions related to account sign-up, project creation, job submission, workflows suitable for large-scale simulations, or documentation and training. Direct your questions to us by email at firstname.lastname@example.org. Depending on the case, the support staff may resolve the issue through follow-up emails or arrange an in-person meeting, online chat or phone call.