This page is not intended to replace reading the Xgrid Admin Manual, but rather to supplement it. The additions here concisely hit the highlights and cover issues beyond what the frequently asked questions address.
Frequently Asked Questions
(note to editors: keep this section short and sweet; most folks coming here already know what Xgrid is).
Xgrid, with a capital X, is the name of Apple's grid computing system. With a lowercase x, xgrid is the name of the command-line utility that clients can use, among other means, to submit jobs to a controller.
An Xgrid divides into three logical components: Agent, Controller, and Client. Client computers submit jobs (a set of tasks) they want run to a Controller computer. The Controller queues the Client jobs and distributes tasks to Agent computers. Agent computers run the tasks and report their output and status back to the Controller, where they are stored until deleted by the Client. Clients can asynchronously query the Controller about the status of a job and its results. Any OSX computer can play any of these roles, and a single Mac can play more than one: it is possible to be Agent, Controller, and Client at the same time. There is one Controller per grid. Clients can submit jobs to Controllers of different grids, and Agents can work for more than one grid.
Xgrid's setup has a pleasantly small palette of choices. The first two decisions to make are what kind of authentication and authorization to use, and whether a shared file system is needed. A shared file system that all the agents can access can be very beneficial for many computing problems, but it is not appropriate for every network.
See More Details.
Related XGRID FAQ
- Do I need a powerful computer to run a controller?
- Do I need to install software?
- Do I need Mac OSX Server to be a controller?
Xgrid replacements for Linux and other platforms
Planning the Security Envelope
On any Xgrid network the Agents place a certain level of trust in the Clients, since the Clients consume resources on the Agents' computers. Different degrees of trust can be granted, but there is no purely zero-trust level. Authentication can be used to prevent uninvited bad guys from submitting jobs or reading results. A related issue is limiting the scope of the damage from accidents, broken trust, and party crashers.
A major factor in choosing a setup is whether the Xgrid manager will have routine admin access to the Agents. For example, on a dedicated cluster the Xgrid controller admin is probably also the admin of the cluster of agents, whereas on a 'scavenger' grid (a.k.a. 'distributed' grid) the Xgrid manager will likely have no admin or even user-level access to the agents at all.
Agents must make a choice: they can either advertise their availability to any nearby Controller via Bonjour, or they can choose one, and only one, Controller by a hostname or an IP address entered into their preferences. (While exquisitely convenient for self-forming, almost administration-free Xgrids, the Bonjour method, when combined with password identification, may be ill-advised on a non-trusted local network.)
There is a choice between two modes of authentication: password mode and Kerberos. One can also use a blank password, but usually this is only for debugging, since omitting the password does not importantly simplify the setup. For most setups password mode is the preferred choice for simplicity.
For each controller there are two passwords: one shared by all clients to authenticate to the controller for job submission and control ("the client password"), and one shared by all agents that the controller supplies to authenticate itself to the agent when sending out or controlling a task ("the agent password").
While it's casually said that Clients authenticate to the Controller and the Controller authenticates to the Agents, in fact the authentication is mutual. The controller and agent symmetrically prove to each other they both know a shared secret without ever revealing that secret to each other. Thus a rogue cannot masquerade as agent, client, or controller as a means to learn the password.
Password security on Xgrid should not be confused with that of robust login protocols like SSH. Authentication of the controller to the agent is a fairly weak form of security because the controller uses the same password for all the Agents. This password is stored as a thinly veiled Caesar cipher in a text file on all the Agents, and so can be read by anyone with admin access on any agent. The file is identical on all agents: it is not salted or varied from agent to agent, nor signed in any way; the cipher is not a secure hash and can be inverted to recover the password. If these files are backed up by Time Machine to an external drive, they may be readable even by people without admin access. Thus many people have potential access to the password file. The situation is analogous to automated scripts that do FTP using hard-coded passwords in the scripts. During authentication these passwords are validated in a reciprocal, encrypted manner and are never sent over the network, but unlike SSH, all other communication is not encrypted after authentication.
On the clients, there are different kinds of potential exposures of the client password. Again, all clients share a single password for a given controller. To use the xgrid command the client must place the password on the command line or in an environment variable. Thus these passwords are often in clear text on screen and persistently lurking in the shell history, environment, and .bashrc. Notably, the third-party client replacement app GridStuffer stores the client password in the keychain and is thus more secure.
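For example (a sketch; the hostname and password are placeholders), the xgrid man page documents environment variables that avoid retyping the password, though they remain readable in the shell environment:

```shell
# Exposed: any process run by this user, and the shell startup files, can see these.
export XGRID_CONTROLLER_HOSTNAME=controller.example.com
export XGRID_CONTROLLER_PASSWORD='client-password'
xgrid -job list

# Equivalent one-off form, with the password visible on screen and in history:
xgrid -h controller.example.com -p 'client-password' -job list
```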
The bottom line is that it is possible a bad guy could encounter either widely shared password and use it to submit jobs. With the client password, jobs could be submitted directly. With the agent password on a Bonjour setup, an attacker can impersonate the controller and send tasks to agents. With the agent password where a specific computer is designated as the controller, the bad guy would also have to spoof the hostname or IP address of the controller on the local network. This last threat is less likely: behind router networks using non-routable IPs, it is usually difficult to spoof an IP from outside the router, so this is only a threat from insiders on such a network.
Thus password authentication, while not requiring additional services like LDAP and Kerberos to be configured on the Agents, has risks if the network, clients, or agents are not trusted.
Kerberos is the alternative to password authentication; implemented properly, it is more secure, but it may have limited applicability in many network configurations. Compared to password authentication, Kerberos is more difficult to set up and requires other network services beyond Xgrid. Kerberos is a system by which computers can delegate authentication of allowed users to a trusted Kerberos server, typically part of a trusted LDAP or local DNS setup. Setting up security systems is considered a sophisticated sysadmin task, but it is made significantly easier by Mac OSX Server, which has a GUI process for LDAP and Kerberos. The agents, controller, and clients must also be set up by a qualified sysadmin to bind their authentication to the Kerberos server.
Under Kerberos, jobs execute on Agents with the userid (or credentials) of the client that submitted the job. They thus have some elevated privileges on the Agent computer that may grant the jobs enhanced capabilities through a looser resource sandbox (seatbelt) and file access permissions. If the Agents don't trust the Clients, it may be unwise to give them elevated privileges.
Moreover, on some local networks this might not be useful, or local IT policy might oppose Kerberos. Specifically, on common unmanaged local networks, individual computers authenticate their own user logins and passwords. For example, in scavenger grids composed of cycles donated by underutilized workstations, the workstation owners might not have a single trusted authority they would want to delegate user authentication to. Conversely, in highly managed networks the sysops may have already established a centralized Kerberos system and might be averse to having an ad hoc Kerberos or LDAP server on their network. Thus Kerberos authentication is likely to be the natural choice only when the Xgrid manager and the Kerberos manager are in the same IT department.
Under Kerberos when the Controller is authenticating to the Agent it is actually doing so with the Client's credentials.
Open question: how does this work in a hybrid case where clients authenticate by password and Agents authenticate by Kerberos? (Some online discussions seem to suggest that there is a mechanism to allow Agents to authenticate using the Controller's credentials instead, but this does not seem to be documented.)
As can be seen, the authentication choices are polar opposites: very weak or very invasive and difficult. Apple has not yet offered a middle ground such as ssh keys, where the passwords can differ on every agent and client and are not so exposed by use. Fortunately, in addition to authentication there is also an authorization layer to the security onion: IP filtering, unix permissions, and the sandbox.
Unix permissions and the sandbox
First, in password mode, Xgrid agents always run jobs as user "nobody", who has very few system privileges and little file access on properly configured computers. Thus one is taking advantage of the natural Unix permissions to limit capabilities by username. This introduces a high degree of safety as long as the unix permissions of the filesystem exist (e.g. FAT-formatted thumb drives lack unix permissions) and have not been misconfigured. Agent owners would be wise to run "repair permissions" before activating Xgrid.
Second and most important of all, xgrid jobs are run in a sandbox. The sandbox further lowers the system access privileges of the job. It can limit the job to seeing only certain parts of the file system and selectively remove write access regardless of unix permissions, and limit other system calls such as interprocess communications. In particular under password mode, the seatbelt file prevents access to file systems mounted in /Volumes which is where unprotected FAT32 thumb drives will be automatically mounted. Setuid executables cannot be executed, making privilege escalation security holes difficult.
The "seatbelt file" that defines the sandbox limits is more restrictive in password mode than in Kerberos mode. Agents may want to consider tightening the sandbox further.
Xgrid also provides some housekeeping. The Xgrid agent reaps all forked children of a task if they are left running when the task exits. It also removes the temporary working directory where the task executed.
Consequences of an attack
To summarize: to gain access to job submission, an attacker must learn passwords and in some cases appropriately spoof IP addresses to allow him to pretend to be a client or a controller. To gain higher-level access to an agent from a job would require some sort of privilege exploit that defeated both the unix permissions and the sandbox. To accomplish all of this simultaneously is a tall order, and no known attack allowing it exists.
However, gaining just the ability to submit jobs can be exploited for a denial-of-service attack (using up all CPU, memory, and network resources on the Agent to the point where the computer is effectively paralysed). Since the Apple seatbelt allows network privileges, a job could be used as a base to launch network attacks on other computers. The Apple default sandbox seatbelts also allow a job to scan files like hosts and auto_master for attack-planning reconnaissance: names of users, names of other services and computers on the local network, and sometimes passwords.
There are two firewalls to consider. One is the firewall on the individual agent/client/controller computers and the other is a possible network firewall surrounding all of them.
Restricting by IP who can access Xgrid can be the best way to define the circle of trust. If all components live inside a company firewall, then that confines the circle of trust. One can still get breaches if someone can tunnel a connection through the firewall, but that narrows the exposure to piggybacking on other security holes.
An important security feature is that all controller-agent communications are initiated by the agent, not the controller. One should not open firewall port 4111 for inbound communication on the agent; only the controller needs its firewall set for incoming connections. Similarly, there is a significant security improvement in not using Bonjour on agents, to prevent them from promiscuously offering themselves to any controller. This is also why it's a good idea for controllers not to also be agents.
Rather than relying solely on passwords, you may want to limit client submissions to specific subnets, hostnames, or IP addresses. On OSX Server and pre-10.5 OSX, the default firewall is IPFW, which easily allows tailoring of allowed IP addresses for each port. A similar approach can use TCP wrappers; this is useful for post-10.5 OSX, which replaced IPFW with an application firewall.
To use the built-in TCP wrappers, first check /etc/services to make sure xgrid is a named service for port 4111. Then add to the controller's /etc/hosts.deny :
xgridcontrollerd: ALL
and then add to the controller's /etc/hosts.allow :
xgridcontrollerd: <list of subnets or IP>
However, you must include the agents on this list as well, or they too will be cut off. Note: this also means that every agent could pretend to be a client if it knows the client password. (It's a shame that Apple did not use two ports to segregate clients and agents.) Nevertheless, on many grids the agents are managed locally along with the controller, and it is the possibility of rogue clients one is trying to protect against.
If you are willing to give clients a login userid on some proxy computer, then you can set up real client authentication with per-client passwords or ssh key files. Would-be submitters enable themselves as clients by ssh-ing to that proxy computer and tunneling the connection to the controller computer:
ssh -f -L 4111:controller.host.name:4111 proxy.host.name
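Once the tunnel is up (the -f flag backgrounds ssh after authenticating), submissions go to the client's own localhost port and ssh forwards them on to the controller; the password here is a placeholder:

```shell
# The local end of the tunnel (localhost:4111) now reaches the controller.
xgrid -h localhost -p 'client-password' -job submit /usr/bin/true
```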
On the controller, add a TCP wrappers rule denying all but the proxy computer and the list of agents. Again, note that any agent that knows the client password can act as a client without the need for an ssh login.
Tunneling for Agents is hard to automate, so it generally won't be practical.
Agents currently have a preference setting to either take all comers or accept one specific IP. If you want something in between, set it for Bonjour, but then limit the controllers that can connect with TCP wrappers on the agent.
Open question: what happens if you put a list of IPs into the specific-host configuration in the preferences?
Reported security holes
- The Xgrid configuration wizard in the Server Admin GUI can silently open an undocumented NFS share of the local hard drive directory /var/xgrid/controller/sfs that is open for read/write to the universe (no IP address restrictions). It appears this mount is intended to be used by Podcast Producer. This security hole was reported to Apple, who subsequently closed the report as "operates normally", so it is likely to persist in future OS releases. The hole is easily fixed by deleting the line for the share from /etc/exports. Fortunately, an unrelated major bug in the 10.6 Server Admin tools prevents NFS from working when there is a firewall, so this hole is accidentally masked.
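A hedged sketch of that fix (verify the exact export line on your own system before deleting anything):

```shell
# Show the offending export, if present:
grep '/var/xgrid/controller/sfs' /etc/exports

# Delete that line (sed keeps a backup as /etc/exports.bak), then tell nfsd to reread:
sudo sed -i.bak '\|/var/xgrid/controller/sfs|d' /etc/exports
sudo nfsd update
```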
Shared File Systems
Commonly an executable needs access to data files and libraries to do its job. On the agent these may not be available. Likewise, results may be file-based rather than just the standard out. The xgrid command offers limited migration facilities for input and output: specific files and even a whole directory tree can be encoded into the batch file and transiently transported to the agent for the duration of a task or job. To go beyond this requires the agents to mount a file system from another computer.
Why shared file systems are desirable
Shared file systems can relax many constraints. For example, they can offer a lightweight approach to inter-task communication and persistence. When used as a scoreboard they can permit more sophisticated reporting than just a single ART score. But the biggest issue is simply data transport efficiency.
The built-in transport sacrifices efficiency for simplicity. Input files transported to the agent for a task are not cached, even when the same agent is running tasks that all use the same input files; the data is retransmitted for every task. Binary data is sent as uncompressed base64-encoded XML, so it is 4/3 the original file size. And for single tasks the network traffic is doubled, since it all goes from client to controller and then from controller to agent. On the return trip, for output files, the same redundancies occur.
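The 4/3 base64 expansion is easy to verify directly; a quick sketch:

```shell
# Create a 3 MB file of random binary data...
dd if=/dev/urandom of=/tmp/sample.bin bs=1048576 count=3 2>/dev/null
# ...and compare its size to the size of its base64 encoding.
raw=$(wc -c < /tmp/sample.bin)
encoded=$(base64 < /tmp/sample.bin | wc -c)
echo "$raw $encoded"   # encoded is about 4/3 of raw (plus any line-wrap newlines)
```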
Moreover, that modest inefficiency can be dwarfed by another consideration. Consider a calculation needing just a few items from a very large database. If the items to be read cannot be known beforehand, then the whole database must be transported rather than just the records of interest. When using canned software, one may not have the luxury of optimizing the data storage and retrieval format for the idiosyncrasies of Xgrid. An analogous issue arises in dynamic loading of libraries in code: a given piece of code may want just a few modules from a very large library. Consider, for example, Python code that uses extensions at runtime not found natively on the Agent. You may not know which parts of a large distribution library are required and so have to migrate it all.
To accommodate these ad hoc run-time read/write needs, it is highly desirable to be able to mount a shared filesystem on all the agents. This also allows the agents to share intermediate results and otherwise be more tightly coupled.
Strategies for Shared file systems
Here we will ignore the two extreme cases: remotely distributed systems, where sharing a disk mount is prohibitively difficult, and dedicated clusters with complete control of the Agents, where there are no constraints on what can be done. Instead we examine the case of password authentication, where jobs run as user nobody and are severely limited by the sandbox. Even stronger constraints arise in grids of inhomogeneous workstations with limited admin access.
Can a job mount an external file system on demand?
Not easily. Xgrid has no provision for this on the agents, and user nobody cannot execute a mount command, so conventional approaches are out. The sandbox allows custom socket communication with a server, but this would not be a universal solution since many pre-existing programs expect a normal file system. You can't even fake this with a UNIX FIFO file, because the sandbox prohibits a pipe between different process IDs under password authentication.
Thus admin intervention is needed on the Agents to set up a file mount. The easiest way to make this persistent across both reboots and temporary network failures is to include it as an automount. Such a one-time configuration can even be done by the workstation owners themselves using the Directory Utility GUI (in Leopard). However, for simplicity and for a greater selection of NFS options, it may be better to run a script that directly edits /etc/auto_master.
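A minimal sketch of such an automount, with hypothetical server and export names, using the /tmp/.xsfs mount point discussed in the next section:

```shell
# Append a direct-map entry to /etc/auto_master (one-time, needs admin):
#     /-    auto_xgrid
# Then create /etc/auto_xgrid containing the single map line:
#     /tmp/.xsfs   -fstype=nfs,nosuid   fileserver.example.com:/Volumes/Data/xgrid
# Finally flush the automounter cache and re-read the maps:
sudo automount -vc
```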
Where in the agent filesystem should a shared file system be mounted?
Since the mount is not for the benefit of the workstation owners, the mount point is best kept out of plain sight, for various reasons:
- The mount point should be unseen in file dialogs to avoid sluggish response times from querying remote disks.
- Workstation users might mistakenly save their files on your remote disk.
- Avoid Time Machine backup and Spotlight indexing.
- Put up a speed bump against workstation users or Scooby Doo meddling with your file system.
- No intrusive icon on the desktop or finder sidebar for the remote disk mount.
First off, we can only consider places that user nobody and the sandbox allow access to. The most obscure place is perhaps in $TMPDIR, the per-user temp directory that Darwin creates (here specific to user nobody). A problem with this is assuring that the path will be the same on all computers: uniformity of the mount point is desired, since our executables often need to know the absolute path to access it. Another possible place is one node above where Xgrid creates the working folders for the tasks. This seems logical and consistent, but since it's not a documented path it might change or get overwritten on system updates.
Another possibility is /var/xgrid/sfs. Apparently this is a directory Apple uses as part of its Podcast Producer suite. When you configure an Xgrid controller using the "wizard" Apple provides, it silently creates an NFS export (a modest security hole, by the way). When you run the Podcast Producer installer, it tries to modify LDAP or Open Directory so that this directory gets pushed out as a mount on all the Kerberos clients. This behavior does not seem to be documented, and it also might be subject to change, or to collisions if another controller on your network is using the same mount point.
Charlie Strauss suggests /tmp/.xsfs, or any name you like in /tmp, as a worthy candidate. The Finder has /private/tmp on its list of hidden directories, and unix will hide the .xsfs name, making it hard for it to accidentally appear in a file dialog or directory listing. Because Apple NFS dynamically creates the mount directory on demand (unlike some other *NIXes), the folder ".xsfs" does not have to pre-exist, so there's no problem with the /tmp folder getting wiped every reboot. The absolute path is also nicely short, making it easy to type in arguments to commands. It has the added virtue that if a workstation owner did manage to accidentally save a precious document there and subsequently lost it, the owner can hardly blame the Xgrid manager, since /tmp is always a risky place to save something. It won't get indexed by Spotlight. And there's no need to uninstall it later, since /tmp gets wiped every reboot.
Should the shared file system be password protected?
While a password might seem desirable, as commonly implemented it offers only modest protection and added inconvenience. First, discriminating solely by password rather than IP can be more vulnerable, since it relies on keeping the password secret. That can be a hazard because auto_mount passwords are by default kept in plain-text configuration files readable by all users on the agent machine, including third-party Xgrid jobs. (See the sandbox section for a way to hide this from Xgrid jobs.) Second, one must be careful how the incoming user-ids are enabled on the file server, since simply creating standard user-ids would in turn grant undesired login access to the server. Third, once the file system is mounted, no password is needed for an admin on the agent to access the files.
Issues to consider when choosing between NFS, AFP, or SMB
From the server side, NFS export is attractive because it does not require a userid and password from the client, but easily restricts by IP address. It can enforce user-id squashing so that all written files are accessible by user nobody.
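A sketch of such an export in the server's /etc/exports (the path and subnet are hypothetical; see exports(5) for the option syntax):

```shell
# /etc/exports on the file server: map every remote uid to nobody,
# and allow mounts only from one subnet.
/Volumes/Data/xgrid  -mapall=nobody  -network 192.168.10.0 -mask 255.255.255.0

# Then start (or poke) the NFS server and verify the export list:
#   sudo nfsd enable        # or: sudo nfsd update, if already running
#   showmount -e localhost
```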
AFP and SMB authenticate by userid and password, which will need to be exposed in /etc/auto_master on every agent mounting the disk. They don't natively discriminate by the IP address of the agent, so by default anyone who learns the userid and password can access the server disk from any computer. You won't be able to remotely edit these credentials if you need to change them without again getting admin access on the Agent. To limit the scope of a password breach, it could be a good idea to also implement additional IP address filtering, such as TCP wrappers. You also need to make sure that user nobody on the agent is able to access the mount under the userid chosen. Note: if you simply choose nobody as the AFP mount userid, then all agents will be sharing the same password (for user nobody), reducing its security value. Moreover, that is a trick that can only be played once: one could not export a different filesystem to a different grid this way, since the nobody userid is unique. Finally, you need to make sure that files written by agents bear the nobody userid by default, not the userid of the disk mount, or else the agents won't be able to access each other's shared data. (Recipe for how to do this on AFP?)
One minor virtue of AFP/SMB over NFS is the ease of opening the firewall ports (1 for AFP and 6 for NFS). That is, if you are having to beg an overworked IT staff to open ports in company firewalls, you might find that AFP is open already, but because of the variability in RPC port assignment, NFS usually can't be opened by default.
Open questions: Is it possible to store an AFP or SMB password in a keychain to keep it out of the eyes of casual users on the agent? And since it needs to mount at boot, how can these keychains be unlocked? Don't know. There are also other issues with passwords stored in keychains. Is there some way to authenticate AFP by LDAP that avoids these issues? Don't know.
How do I make a mac into an NFS disk server?
How do I automount a shared file system on the Agent at boot time?
Xgrid Client Choices
The client is the person or machine submitting the jobs. While the agent and controller are specific software daemons, there are many different tools that fill the role of client. Apple has designed a set of APIs that any program can use to talk to the controller for job submission and information retrieval; thus many different forms of clients can exist for various special purposes. However, for use as a more traditional batch-job system, one has the client tools xgrid and MPI built into all OSX versions. There are also third-party queues like GridStuffer, and demo apps in Apple's development package.
All OSX machines have the xgrid command for submitting jobs from the command line. The man page documents most of the command, so we won't repeat that here. But it leaves out a few important details.
- MAJOR WARNING: Make sure you understand how xgrid interpolates paths before using it. The behavior is very cryptic, and it is surprisingly easy to cause accidental transfers of hundreds of gigabytes, runaway agent jobs, and unexpected environment-dependent client behavior.
- Paths for -art scripts on the command line must be relative to the current working directory. Absolute paths for -art scripts fail on the command line.
- When using -art, one must also specify -artid and a condition such as -artequal; otherwise it will be silently ignored.
- Arguments to art scripts are possible only in batch mode submissions.
- Scheduler hints are merely suggestions to the controller, not requirements.
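A minimal round trip with the xgrid tool looks like the following (hostname and password are placeholders; the numeric job identifier is printed by the submit step):

```shell
# Submit a trivial job; xgrid prints a jobIdentifier on success.
xgrid -h controller.example.com -p 'client-password' -job submit /bin/echo hello

# Later, asynchronously, poll and collect by that identifier (42 here is illustrative):
xgrid -h controller.example.com -p 'client-password' -job attributes -id 42
xgrid -h controller.example.com -p 'client-password' -job results -id 42
xgrid -h controller.example.com -p 'client-password' -job delete -id 42
```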
See the XGRID FAQ for more details.
Objective C API
Apple has a documented API and some examples of using it in the developer tool kit. There is also a third party API toolkit called EZGrid which purports to greatly simplify setting up an Objective C client.
Other clients, APIs, and batch file generators
There are projects in Ruby and Python that provide importable client functions. As of mid-2010, RxGrid appears to be in active development. The PyXG project has no online documentation, but there is documentation in the executable Python scripts. Both of these use a syntax and capability that mimics the xgrid client tool.
Additionally, there are dedicated batch file generators. These are not client replacements, but rather simplify generating sophisticated batch files that can then be submitted with the standard xgrid client tool. The Xgrid Batch Editor is a stand-alone GUI interface. Xgrid.rb is a Ruby-based batch file generator.
xgridstatus is a must-have single-purpose tool that accesses undocumented features of the API and offers some unique functions. It returns grid IDs, availability, and processor details of the agents. Currently no other tool, including Apple's own, offers those features. It is not used for submitting jobs.
The R package xgrid 0.1-11 is available on CRAN. The package provides functions to distribute and collate results from simulation studies and other computationally expensive tasks to Apple Xgrid clusters from within the interactive R programming and data analysis environment.
More specifically, the routines within the package facilitate access to Apple Xgrid clusters (which can be created using standard installations of Mac OS X, for example in a classroom or computer lab) to undertake independent simulations or other long-running jobs that can be divided into replicate runs. In addition to the documentation within the package, more information can be found at http://www.math.smith.edu/xgrid
Matlab has its own distributed toolkit that does not use Xgrid. However, you can run Matlab as an executable on Xgrid.
[MAJ] Matlab is a very powerful scientific calculation framework which, as of the last few releases, can take full advantage of multicore processors as well as multiprocessor clusters. Widely used in many research activities, its performance can be greatly improved by appropriate parallelization, provided the studied problem can benefit from it.
For users who have purchased the Parallel Computing Toolbox, Matlab can parallelize independent jobs in a script with the simple 'parfor' loop, which can replace the usual 'for' loop. The command
parfor (loopvar = initval:endval, M); statements; end
executes independent statements in a loop using a maximum of M local cores/processors to evaluate the statements in the body of the parfor loop. The input variable M must be a nonnegative integer. By default, Matlab uses up to as many cores/processors as it finds available, and M can't exceed a value of eight (these can be threads in place of hardware cores if the hyperthreading option is turned on for the local processors). This is certainly the simplest route to easy parallelization, since it requires no ab initio knowledge, and it turns out to be very effective. In addition, this toolbox allows parallel processing with GPUs (Graphics Processing Units), which can give a further serious boost to applications that have the appropriate structure.
When calculations require the use of more than eight local threads/cores, or must use the calculation capability of a remote cluster of processors that themselves have multiple cores, Matlab must then use an external task scheduler like PBS Pro®, TORQUE, Sun Grid Engine®, Condor®, or LSF®, which must be set up before it can be used by Matlab's Parallel Computing Toolbox. Since Xgrid is a task scheduler with the big advantage of being already installed on all Macintosh computers running OS X, it is possible to use it for this purpose. Although not officially supported, it can be used with the generic scheduler option in Matlab, which in principle allows any scheduler to be incorporated. Work is in progress in this direction, and when fully functional examples are available, they will be given in this section.
For users who haven't purchased the Parallel Computing Toolbox, it is nevertheless possible to perform powerful local and remote parallel calculations with the combination of Matlab and Xgrid. The starting point for running Matlab as an executable on Xgrid is given in this link, but the approach can be made more general by combining this hint with the ssh command, performing calculations on a set of distant Mac Pros while launching the jobs from anywhere with, for example, a laptop running Mac OS, Linux, or even Windows. The tutorial is given for remote calculations with Xgrid and Matlab, since the method is also valid for local ones with a straightforward modification.
Step 1: since the ssh command may be used within a script, it is necessary to set up ssh authentication to the dedicated machine on which Matlab will run parallel jobs, so that you do not have to enter your password each time. For test purposes, it is very convenient to do this not only for the remote computer but also for the local one (127.0.0.1) from which the command is launched. This turns out to be particularly helpful in the debugging phase for making self-ssh calls.
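Setting this up follows the standard OpenSSH key procedure. The sketch below uses a throwaway key path so it can be tried safely; in real use you would generate ~/.ssh/id_rsa, and the user/host names are placeholders to replace with your own.

```shell
# One-time passwordless-ssh setup sketch (standard OpenSSH tools).
# KEY is a throwaway demo path; in real use, generate ~/.ssh/id_rsa instead.
KEY=${KEY:-/tmp/xgrid_demo_key}
rm -f "$KEY" "$KEY.pub"
ssh-keygen -q -t rsa -N "" -f "$KEY"   # empty passphrase: no prompt in scripts
echo "public key written to $KEY.pub"
# Then install the public half on every machine you will ssh to,
# including 127.0.0.1 for the self-ssh debugging case, e.g.:
#   ssh-copy-id -i "$KEY.pub" yourremoteusername@remotehostIPaddress
#   ssh-copy-id -i "$KEY.pub" yourlocalusername@127.0.0.1
```

After this, `ssh yourremoteusername@remotehostIPaddress true` should return without asking for a password.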
Step 2: install the Matlab application in the /Applications/ directory of the remote machine (and possibly on the local machine, at least for the debugging phase). There is no need to modify any permissions to run Matlab successfully with Xgrid. Note that it is necessary to systematically use the absolute path to Matlab, because the global environment used to locate the Matlab application may not work correctly under Xgrid. So always use the following command:
/Applications/Matlab_Rxxxxy.app/bin/matlab where xxxx is the year, and y is the release letter (a,b)
Step 3: Xgrid can be run with password or Single Sign-On (SSO) authentication. For setting up Xgrid for remote parallel calculations, the mode without any authentication is recommended at first: though less safe in principle, it allows problems coming from authentication to be separated quite clearly from path problems. So it is assumed at this stage that Xgrid is in operation both on the local machine (your laptop, for example) and on the remote one; see the tutorial at tengrid.com for this purpose. You should see the number of agents available in Xgrid Admin, and also that attached agents are available.
Step 4: now, as pointed out in this link, it is impossible to call Matlab directly from Xgrid. A command like
> xgrid -h localhost -job submit /Applications/Matlab_Rxxxxy.app/bin/matlab < pathtotestfile/test.m
does not work at all. It is important to note that this problem is not specific to Xgrid; it has also been observed using the qsub command of PBS Pro® on a Linux cluster in place of xgrid. This is the main reason why it is absolutely necessary to go through a script that can be interpreted. It can be a shell script or a Python one; it does not matter. The important point is that the call to Matlab is made within an executable script. This script can be created dynamically from within Matlab on the local machine (very convenient) to pass a set of parallel jobs to the remote machine, where they will be executed in parallel with Xgrid; the built-in Matlab command unix, which opens a pseudo-tty terminal and launches an independent process, makes this easy. The script can also be static and written with any text editor; we will consider this case to illustrate remote parallel calculation with Matlab and Xgrid.
So let us suppose that we want to execute the following Matlab script test.m, placed in the directory /Users/yourlocalusername/Shared/matlabwithxgrid/
test.m (4 lines):
a = rand(1024);
b = fft2(a);
save /Users/yourlocalusername/Shared/matlabwithxgrid/output.mat
exit;
It is first necessary to check, in the Terminal application (do a cd /Users/yourlocalusername/Shared/matlabwithxgrid/ first to be placed in the correct directory), the command
/Applications/Matlab_Rxxxxy.app/bin/matlab < /Users/yourlocalusername/Shared/matlabwithxgrid/test.m
and output.mat should appear after a few seconds. You can check that it is not empty by running Matlab in interactive mode and inspecting its content.
Now it is time to jump into Xgrid and create a shell script named testxgrid.sh, to be placed in the directory /Users/yourlocalusername/Shared/matlabwithxgrid/
testxgrid.sh (1 line):
/Applications/Matlab_Rxxxxy.app/bin/matlab -nojvm -nodisplay -nosplash < /Users/yourlocalusername/Shared/matlabwithxgrid/test.m
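As mentioned above, the wrapper script can also be generated dynamically rather than written in a text editor. A minimal shell sketch follows; DIR defaults to a scratch location so it can be tried safely, and should point at your own Shared directory in real use.

```shell
# Generate the one-line wrapper script from the shell.
# DIR is a placeholder; in real use, point it at your Shared directory.
DIR=${DIR:-/tmp/matlabwithxgrid}
mkdir -p "$DIR"
cat > "$DIR/testxgrid.sh" <<'EOF'
/Applications/Matlab_Rxxxxy.app/bin/matlab -nojvm -nodisplay -nosplash < /Users/yourlocalusername/Shared/matlabwithxgrid/test.m
EOF
chmod +x "$DIR/testxgrid.sh"
echo "wrote $DIR/testxgrid.sh"
```

The same heredoc pattern can be driven from within Matlab via its unix command to create wrappers on the fly for each job.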
Once done, it is possible to check locally that Xgrid works fine with Matlab using the command
> xgrid -h localhost -job submit /bin/sh /Users/yourlocalusername/Shared/matlabwithxgrid/testxgrid.sh
and output.mat should again appear after a few seconds.
The same can be done transparently on the remote machine (on a local network only) by replacing localhost with the remote host's IP address
> xgrid -h remotehostIPaddress -job submit /bin/sh /Users/yourlocalusername/Shared/matlabwithxgrid/testxgrid.sh
provided the controller and the agents of the remote host are properly recognized (see the Xgrid Admin program for that).
Step 5: if all the previous steps have been successful, then it is possible to combine Xgrid, ssh and Matlab in the following manner:
> ssh yourlocalusername@localhost 'xgrid -h localhost -job submit /bin/sh /Users/yourlocalusername/Shared/matlabwithxgrid/testxgrid.sh'
and it should work transparently. Note that absolute paths are necessary everywhere, in the command but also inside the testxgrid.sh file!
The final step is to perform the same command on the remote machine:
> ssh yourremoteusername@remotehostIPaddress 'xgrid -h localhost -job submit /bin/sh /Users/yourremoteusername/Shared/matlabwithxgrid/testxgrid.sh'
and again absolute paths are necessary everywhere, including inside the testxgrid.sh file! You thus send a command via ssh that tells the remote machine to work locally with Xgrid, calling Matlab to execute a script. There are several advantages to this approach: (i) the ssh command can be issued from any local machine running any system (Linux, Windows, ...) and sent to a Mac cluster running Xgrid, which makes the approach more universal; (ii) security is provided by the ssh keychain, even if Xgrid on the cluster is not protected by a password or an SSO authentication procedure; (iii) the command can be sent dynamically, without logging in explicitly before launching Xgrid on the remote machine; (iv) it is always possible to follow the calculations using standard Xgrid commands. Since the jobs are executed on the remote machine, the data stay on the remote machine, and it is up to the user to retrieve them to the local machine or leave them where they are.
(i) the procedure listed above works very well in general, but there are some problems related to sandboxed execution that forbid some Matlab built-in commands, in particular those that look like Unix ones. Commands like unix('cmd') or dos('cmd') in a Matlab script will generally not work. There is a hint to bypass this problem: replace (deny default) by (allow default) in the files xgridagentd_task_nobody.sb and xgridagentd.sb, both placed in the directory /usr/share/sandbox. This is a kind of patch, since it opens a hole in the security, but the alternative is Kerberos SSO authentication, with no guarantee that it will work for Matlab!
(ii) from a laptop, ssh without a password can sometimes be an issue when the user is connected to a network different from the one used in the Step 1 procedure. It may then be necessary to use a dynamic ssh procedure, with a remapping of the hostname to the IP address. Of course, SSO authentication is likely the best method, but a Kerberos ticket server must be set up, which is not an obvious task on a non-server version of Mac OS X.
(iii) if the Matlab scripts are not in a Shared directory with the correct access rights, there can be execution problems in Matlab.
(iv) so far, the path saved in the Matlab desktop environment is the one used in the calculations. Dynamic paths must be handled with care.
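The sandbox edit mentioned in point (i) is a one-line sed substitution. The sketch below demonstrates it on a scratch copy so it can be tried harmlessly; on a real agent the files live in /usr/share/sandbox, editing them requires sudo, and, as noted above, the change weakens security, so back the originals up first.

```shell
# Demonstrate the (deny default) -> (allow default) edit on a scratch copy.
# On a real agent it would be (run with sudo, keeping .bak backups):
#   sudo sed -i.bak 's/(deny default)/(allow default)/' \
#     /usr/share/sandbox/xgridagentd_task_nobody.sb /usr/share/sandbox/xgridagentd.sb
mkdir -p /tmp/sbdemo
printf '(version 1)\n(deny default)\n' > /tmp/sbdemo/xgridagentd_task_nobody.sb
sed -i.bak 's/(deny default)/(allow default)/' /tmp/sbdemo/xgridagentd_task_nobody.sb
grep 'default' /tmp/sbdemo/xgridagentd_task_nobody.sb
```

The -i.bak form works with both BSD and GNU sed, which matters if you script this across platforms.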
In conclusion, it is entirely possible to run Xgrid with Matlab on a local or remote computer (or cluster) through the powerful combination with ssh. There is no need for OS X Server for this purpose. If problems occur, they are all related to path or permission issues, not to a fundamental incompatibility.
Can jobs submitted to openMPI use Xgrid?
As of 10.5, openMPI is built into every Mac and jobs can be launched from the command line using "mpirun". Conveniently, openMPI will autodetect an Xgrid and use it if the shell contains environment variables specifying the controller and password. What could be easier! Unfortunately, as of April 2010 the Xgrid support built into Leopard's openMPI has been broken for 17 months, due to a specific known bug that remains unfixed in v1.3 and v1.4 of openMPI according to this bug ticket. It is reported that the older release, openMPI 1.2, can be installed and will work. Some guidance on using openMPI 1.2 can be found at open-mpi.org here, with greater detail here, and in the FAQs.
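The environment variables in question are the same ones the xgrid(1) command-line client reads. With a working build (e.g. openMPI 1.2), a submission would look like the following sketch, where the controller host, password, and program are placeholders:

```shell
# Point openMPI's Xgrid support at your controller via the standard
# xgrid(1) environment variables; the values here are placeholders.
export XGRID_CONTROLLER_HOSTNAME=controller.example.com
export XGRID_CONTROLLER_PASSWORD=yourpassword
echo "controller set to $XGRID_CONTROLLER_HOSTNAME"
# mpirun then autodetects the grid and runs the tasks on it:
#   mpirun -np 4 ./my_mpi_program
```

If the variables are unset, mpirun simply falls back to launching local processes.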
How to install and configure OpenMPI 1.5rc1
As of June 2010, a pre-release openMPI 1.5 (release candidate 1) that allegedly supports Xgrid is available. However, there are no reports of successful xgrid-aware installs, and some reports indicate that it fails with an error saying -mpa unknown option.
How to install and configure OpenMPI 1.2
(needs to be filled in)
Can jobs submitted to xgrid launch openMPI themselves?
Apple provides a second, completely undocumented and unsupported, way to launch MPI jobs. Included in Xcode is a demo project called GridSampler, intended to be a GUI for job submission. Building it produces not one but three apps, one of which will boot MPI from jobs on the grid. When run, you will see a job entered into the Xgrid controller that executes multiple tasks on one or more machines. Inspecting the job specification (recoverable from the controller) reveals that it simply submits ordinary concurrent tasks with a wrapper that launches the desired executable. The wrapper binary bootstraps the MPI connection from each agent's task to establish the network links for MPI. Unfortunately, there is no documentation, so it is not known how to invoke any of the many options mpirun has to offer. Moreover, in use it seems to be unstable, sometimes failing to set up communications. But at least it does work somewhat. The GUI app is not required: just include the stand-alone wrapper binary in your own custom job-specification batch files submitted from the command line. This is easily done by copying the specification pattern found in any of the jobs submitted from the GUI app.
Resource Management and Etiquette
The XGRID FAQ covers topics on:
- how to measure load
- how to know what agents are on a grid
- how to know what logical grids are present
Forks, Threads and Interprocess Signals
The XGRID_FAQ has a detailed section on this as well.
Tasks can fork or use multiple threads (c.f. twinjob.pl). However, with great power comes great responsibility. The Xgrid controller queue has no provision for a job to hint that it will fork, or for a single task to request multiple processor slots on an agent. When the queue is full, the controller will assign additional tasks to an Agent until it reaches the maximum allowed number of tasks. If these tasks contain forks, the number of running processes may exceed what is desirable for good workflow on the agent. It may even impair a workstation's responsiveness to its desktop user: even though the xgrid jobs run at low priority (nice level 20), empirically, with roughly 4× as many purely computational full-time processes as cores, the workstation will be crippled. Processes that consume memory, disk, or network resources can have an impact even more quickly.
Other than micromanaging the queue, there are several blunt hacks that somewhat ameliorate this. First, the xgrid agent preferences plist allows setting the maximum number of tasks to accept. In 10.5 the default was 2× cores and in 10.6 it is 0.5× cores. If one's process flow has a predictable number of forks per task, then one can adjust this number so the resulting number of processes is acceptable. However, for process flows that use variable numbers of forks, this solution would lead to too conservative a processor restriction. To help somewhat with inhomogeneous task forking needs, one can divide the agent pool into several logical grids and set the agents on each to have a different cores/max_tasks ratio. Altering max_tasks in the xgridagent.plist cannot be done from xgrid itself, as it requires admin access to the Agent; thus it cannot be done dynamically as processing needs change.
Another approach would be to turn this down to a single process per node and have a pre-processing step before job submission that combines jobs into task sets that must start simultaneously. This approach breaks down because, in such a marriage of convenience, some of the unrelated tasks may finish sooner than others.
Another approach, for a task known to fork N times, is to submit the task along with N-1 bogus tasks that simply wait idle for the true task to complete. If we specify that all N tasks must start at the same time, this forces the accounting of processor slots to respect the fork count. The problem with this approach is that it only works if xgrid places all the tasks on a single agent rather than distributing them. Unfortunately, there is no way to instruct xgrid to place all the tasks on a single Agent without also specifying the exact agent the jobs must run on using a narrow -art command. That micromanagement subverts the queue handler's desirable feature of placing a job on the next available Agent. See drone_thread for an example of code for creating a job with bogus tasks.
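For concreteness, a hypothetical batch specification along these lines might look as follows (an old-style plist submitted via xgrid's batch job facility). The schedulerParameters key names are drawn from the Xgrid job specification format, and the paths and sleep-based drone tasks are illustrative placeholders only; a real drone should watch for the true task's completion rather than sleeping a fixed time, so treat this fragment as an unverified sketch and see drone_thread for a working example.

```
{
    name = "fork-aware-job";
    schedulerParameters = {
        tasksMustStartSimultaneously = YES;
        minimumTaskCount = 4;
    };
    taskSpecifications = {
        task0 = { command = "/bin/sh"; arguments = ( "/path/to/realtask.sh" ); };
        task1 = { command = "/bin/sleep"; arguments = ( "3600" ); };
        task2 = { command = "/bin/sleep"; arguments = ( "3600" ); };
        task3 = { command = "/bin/sleep"; arguments = ( "3600" ); };
    };
}
```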
Intriguing observation: normally, if an agent rapidly forks, all the children will have consecutive process IDs. On a 10.5 xgrid agent running as user Nobody, the process IDs will be two apart instead of consecutive. Speculatively, perhaps there is some intervening shadow process created on forks that is also not propagating all of the POSIX/BSD signals.
Inter process signaling
Two easy ways processes on the same machine can communicate efficiently are BSD/POSIX signals (i.e. sending kill events) and named pipes (fifos). Unfortunately, Xgrid on 10.5 (user Nobody) prevents forked processes from reading and writing to a named pipe. (Oddly, they can read from a named pipe and they can write to a named pipe, but not when both ends of the pipe are xgrid processes.) Second, BSD/POSIX signals only work to a limited extent. A process cannot send a signal to another process group. A parent process can send a signal to a child process provided the child has not created a new process group. In 10.5 a process cannot send a signal to itself.
One mechanism that does work between process groups is file locking. One xgrid process can flock a file and another process will respect the file system flock (warning: this can fail on NFS-mounted file systems). Thus one can create semaphores in any mutually accessible directory and poll them as a replacement for signalling.
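A common portable variant of the same idea, useful where flock is unavailable or unreliable (e.g. on NFS), relies on mkdir being atomic: exactly one caller succeeds in creating the directory. A minimal polling sketch follows; the semaphore path is a placeholder for any directory all the tasks can reach.

```shell
# Cooperative semaphore via atomic mkdir in a mutually accessible directory.
# SEM is a placeholder path; point it somewhere all cooperating tasks can see.
SEM=${SEM:-/tmp/xgrid_demo_semaphore}
if mkdir "$SEM" 2>/dev/null; then
    echo "acquired"
    # ... do the protected work here ...
    rmdir "$SEM"      # release, so peers polling with mkdir can proceed
else
    echo "busy"       # a peer holds the lock; sleep and poll again later
fi
```

Each cooperating task wraps its critical section in this acquire/release pair and retries after a short sleep when it sees "busy".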
See the XGRID FAQ for further details
First, you would have to be extremely foolish to monkey with the security sandbox that xgrid runs the tasks in. However, fools may wish to forge ahead. A couple of caveats first: the security sandbox is not documented, and it is machine generated. So at the next system update you might find your sandbox modifications either overwritten or, worse, causing a complete malfunction of some new sandbox system such that your jobs run unsandboxed. Yikes. So why try? Well, the default sandbox allows network connections. Do you really want your agents to have that privilege? It also allows agents to write anywhere in /var that user nobody can write. It lets the agents read all sorts of things in your /System, /Library, /var, and /etc folders whose details you might not want outsiders to know. Here is Charlie E.M. Strauss's suggestion for a more tightly restricted, yet still practically useful, replacement for the /usr/share/sandbox/xgridagentd_task_nobody.sb sandbox.
This advanced topic section has now moved to its own page: see Mastering Xgrid for the following topics:
What documentation is available for Xgrid?
Official Apple Docs
The man pages for the command-line tools contain useful examples.
The primary Xgrid documentation is the Xgrid Administration manual for Mac OS X Server:
- https://help.apple.com/advancedserveradmin/mac/10.7/#apd4556D358-0D30-4366-8C4B-67AC2FC2E2EC for 10.7
- http://images.apple.com/server/pdfs/Xgrid_Admin_v10.6.pdf for 10.6
The ADC Developer library contains a reference description of the Xgrid Foundation API for Cocoa developers:
Apple's FAQ has some good info, but it is often too terse and has not been groomed of stale info or broken links.
Apple Marketing docs.
Non-Apple web sites that deal with Xgrid
A cohesive introductory tutorial on basic xgrid client use can be found at Smith College.
- Smith College Tutorial (there are 3 parts there)
MacResearch has a superb set of tutorials by Charles Parnot.
- The Xgrid Tutorials (Part I): Xgrid Basics
- The Xgrid Tutorials (Part II): GridStuffer Basics
- The Xgrid Tutorials (Part III): Running Batch Jobs
- The Xgrid Tutorials (Part IV): Submit Jobs with Ruby
- Xgrid Leopard: the good, the bad, the ugly, and the new stuff
Another set of tutorials by Drew McCormack
- Distributed Tiger: Xgrid Comes of Age (2005)
- Sweetening Your Xgrid with Cocoa (2005)
- Integrating Xgrid into Cocoa Applications, Part 2 (from 2004)
- Getting Started with OpenMPI and Xgrid (from 2006)
- Writing your own xgrid job monitor in Python (2007)
Other helpful sites that discuss Xgrid:
- The agentidler Library
- Xgrid Batch Editor
- PyXG = Python + Xgrid
- xgridstatus A fast command line tool with XML output, providing features Apple seems to have overlooked with their XgridAdmin tool. Charlie Strauss says it's a must have.
Other Clustering Systems for OSX
- Ganglia (open source) is suited for dedicated clusters
- RemoteActivity.app A meta tool to overlook xgrid, LSF and other queues from a single tool. Some documentation can be found here from Drew McCormack.
References and Footnotes
- ↑ Java xgrid agent replacement at sourceforge http://sourceforge.net/projects/xgridagent-java/
- ↑ Xgrid for Linux was at http://unu.novajo.ca/simple/archives/000026.html but that link is sometimes flaky. If so, see the Google cache. An alternative set of links can be found at http://lists.apple.com/archives/xgrid-users/2005/Apr/msg00056.html
- ↑ http://www.macenterprise.org/articles/settinguppodcastproducerinanactivedirectoryenvironment
- ↑ 4.0 4.1 GridStuffer is available at http://cmgm.stanford.edu/~cparnot/xgrid-stanford/html/goodies/GridStuffer-info.html
- ↑ xgridFuse is available at http://cmgm.stanford.edu/~cparnot/xgrid-stanford/html/goodies/XgridFUSE-info.html
- ↑ RxGrid is the ruby client API http://www.tslab.se.shibaura-it.ac.jp/index.php/Software/rxgrid
- ↑ pyXG is the Python client API project http://pyxg.scipy.org/
- ↑ The PyXG project files are hosted at https://launchpad.net/pyxg
- ↑ The Xgrid Batch Editor is available for free from http://kellerfarm.com/kfsproducts/yesfree/xgridbatcheditor/
- ↑ Xgrid.rb is available at http://bitbucket.org/j05e/xgrid/src
- ↑ xgridstatus is a compiled binary available for free at http://cmgm.stanford.edu/~cparnot/xgrid-stanford/html/goodies/xgridstatus-info.html
- ↑ Run matlab code: http://www.macosxhints.com/article.php?story=20041125065000639
- ↑ http://www.seul.org/sys/doc/dev-faq/dynamic-ssh.html
- ↑ http://support.apple.com/kb/HT4020