Basic Information / Background Information

Overview Diagram

The diagram above shows a master (which holds the list of jobs and slaves), a number of slaves, the clients, and a shared storage that all or some of them may need.

Note: This shared storage is not part of DrQueue itself. It has to be provided by the underlying operating system as some kind of network filesystem (NFS, CIFS/SMB, AFS, GFS, GPFS, Lustre, OCFS2, ...).

There can be several shared storages rather than a single one. DrQueue relies on the operating system here, so you can use as many, and of as many different types, as your operating system can handle.

Master

The master is queried by slaves and clients. Slaves also update their own status information on the master (running jobs, load average, available CPUs, ...). The master holds all the information about jobs and slaves, so it needs to be running all the time.

There are many types of requests that can be issued to the master. For example, clients can request the job list stored on it, the list of slaves, the details of any single job, and so on.

The master keeps a TCP network port open to receive those requests.

Note: All requests (including those sent to slaves) are unauthenticated, so the port should not be reachable by untrusted parties.

The master writes its own log files to the shared storage so that they can be read and analyzed from any place with access to that location.

Slaves

Every slave holds its own information (status, limits, tasks, ...) and periodically updates it on the master.

When idle and available for rendering, a slave requests a task from the master. The task will belong to a job that fits that slave's profile (pool, operating system, memory limit, ...).
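The matching step described above can be pictured with a minimal sketch. The class names, fields, and first-fit policy below are assumptions made for illustration; they are not DrQueue's actual data structures or scheduling rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SlaveProfile:
    # Hypothetical profile fields, named after the criteria listed in the text.
    pool: str
    operating_system: str
    memory_mb: int


@dataclass
class Job:
    # Hypothetical job description; real DrQueue jobs carry far more data.
    name: str
    pool: str
    allowed_systems: Tuple[str, ...]
    min_memory_mb: int
    pending_frames: List[int] = field(default_factory=list)


def job_fits(job: Job, slave: SlaveProfile) -> bool:
    """Return True when the job's requirements match the slave's profile."""
    return (
        job.pool == slave.pool
        and slave.operating_system in job.allowed_systems
        and slave.memory_mb >= job.min_memory_mb
    )


def next_task(jobs: List[Job], slave: SlaveProfile) -> Optional[Tuple[str, int]]:
    """Hand out the first pending frame of the first job that fits (assumed policy)."""
    for job in jobs:
        if job.pending_frames and job_fits(job, slave):
            return job.name, job.pending_frames.pop(0)
    return None  # Nothing suitable: the slave stays idle and asks again later.
```

A slave in the "default" pool would skip any job restricted to another pool, even if that job comes first in the list.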

Once the master assigns a task to the slave, the slave receives all the information related to the job and task: jobscript location, type of job, task/frame number, and everything specific to that job's type, such as the Maya project directory, the scene, and so on.

With that information, the slave creates a new process with the environment variables set accordingly and directly executes the jobscript that the job points to.
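As a rough sketch of that step, the snippet below starts a child process with extra environment variables and sends its output to a log file. The function name and the variable names used in the usage note (EXAMPLE_FRAME, EXAMPLE_SCENE) are placeholders, not the names DrQueue actually sets.

```python
import os
import subprocess


def run_jobscript(command, task_env, log_path):
    """Start the jobscript as a child process with task data in its environment.

    command  -- argument list, e.g. ["/shared/jobs/render.sh"] (hypothetical path)
    task_env -- mapping of variable names to task values (placeholder names)
    log_path -- file that collects the child's stdout and stderr
    """
    env = os.environ.copy()  # Keep the slave's own environment.
    env.update({name: str(value) for name, value in task_env.items()})
    with open(log_path, "ab") as log:
        # The child inherits the log file descriptor, so the parent can close it.
        child = subprocess.Popen(
            command, env=env, stdout=log, stderr=subprocess.STDOUT
        )
    return child
```

A call such as `run_jobscript(["/shared/jobs/render.sh"], {"EXAMPLE_FRAME": 12, "EXAMPLE_SCENE": "/shared/scenes/shot.mb"}, "/shared/logs/job/task.log")` returns immediately; the caller keeps the returned process object to monitor it.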

The jobscript must be located somewhere on the shared storage so that all slaves can read it using the same path.

The newly created process starts running and writes the task log file to the DrQueue logs directory (inside the job's own directory).

As long as that process hasn't finished, the slave periodically checks for its existence to confirm that it has not died abnormally.

When the process finishes, the parent slave process receives the child's exit status and sends that information to the master. The master thus learns which task has finished and with what exit status. Depending on those values, the task can be requeued automatically or marked as "Finished" or "Error". That information is then available to all clients.
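The monitoring and reporting steps can be sketched as follows. The polling interval, the retry limit, and the rule that a zero exit status means success are assumptions chosen for this example; the text does not specify how DrQueue decides between requeueing and marking an error.

```python
import time


def wait_for_task(child, poll_seconds=1.0):
    """Poll the child process until it ends and return its exit status."""
    while child.poll() is None:  # None means the process is still running.
        time.sleep(poll_seconds)
    return child.returncode


def classify(exit_status, attempts, max_attempts=3):
    """Map an exit status to a task state (assumed policy, not DrQueue's)."""
    if exit_status == 0:
        return "Finished"
    if attempts < max_attempts:
        return "Requeued"  # Give the task another chance on some slave.
    return "Error"
```

Here `child` is expected to be a `subprocess.Popen` object, and `attempts` counts how many times the task has been run so far.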

The slaves also open a TCP port of their own to receive slave-specific requests. The reason is that only a slave can modify its own parameters. The master just receives them: if a slave parameter were changed on the master alone, the change would be overwritten and lost when the slave performs its next periodic update.

Slaves therefore accept a reduced set of requests that modify aspects of their own configuration, such as setting new limits or enabling/disabling the slave.

When one of those requests is performed, the master will receive the correct information on the next update.

(The master does change some slave information in various parts of the program when it needs to in order to execute subsequent commands properly. One example is when it is assigning tasks to slaves and then needs to check some limits. Even that information will be overwritten later; it is just a quick update to avoid inconsistencies in loops and similar situations. If any of those operations fails, the master will notice once the updated slave data is received and will react accordingly.)
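A toy model makes the overwrite behaviour easier to see. The classes and the "max_cpus" and "enabled" keys below are invented for this illustration and do not mirror DrQueue's internal structures.

```python
class Master:
    """Keeps the last status reported by each slave."""

    def __init__(self):
        self.slaves = {}

    def receive_update(self, name, status):
        # The stored copy is replaced wholesale by whatever the slave reports.
        self.slaves[name] = dict(status)


class Slave:
    """Owns its parameters and pushes them to the master periodically."""

    def __init__(self, name, max_cpus):
        self.name = name
        self.limits = {"max_cpus": max_cpus, "enabled": True}  # hypothetical keys

    def handle_request(self, key, value):
        # Slave-specific request: the only durable way to change a parameter.
        if key not in self.limits:
            raise KeyError(key)
        self.limits[key] = value

    def periodic_update(self, master):
        master.receive_update(self.name, self.limits)


master = Master()
slave = Slave("node01", max_cpus=4)
slave.periodic_update(master)

master.slaves["node01"]["max_cpus"] = 2  # Changed on the master alone...
slave.periodic_update(master)            # ...and lost on the next update.

slave.handle_request("max_cpus", 2)      # Changed on the slave itself...
slave.periodic_update(master)            # ...so the master now sees it.
```

After the second update the master's copy is back to 4 CPUs; only the request sent to the slave produces a value that survives.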

The Render Process

When the slave creates the child process that executes the jobscript, keep in mind that the jobscript runs locally on that slave, so all job variables have to be valid locally.

That includes paths, binaries that the script might run (such as the appropriate rendering engine executable), plugins, textures, scenes, output directories, etc.

Note: Everything that could be needed for a successful render has to be available to every node at the same location.

That is the main reason for the shared storage: everything saved there is available to all slaves, so they can run the jobscripts without any problem.
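One way to catch a missing resource early is a small check at the start of a jobscript. This is only a sketch: the function name is made up, and the paths in the usage note are examples, not locations DrQueue defines.

```python
import os


def missing_resources(paths):
    """Return the paths that do not exist on this node."""
    return [path for path in paths if not os.path.exists(path)]
```

A jobscript could call `missing_resources(["/shared/scenes/shot.mb", "/shared/textures", "/shared/output"])` and exit with a non-zero status if the result is not empty, so the failure shows up in the task log instead of halfway through a render.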

In a cross-platform rendering environment, the jobscript is responsible for "transforming" or "translating" the values it receives to match that slave's operating system and configuration.

Note: This path translation step is especially needed when you use Windows slaves alongside non-Windows ones.
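A jobscript written in Python could translate paths along these lines. The mount points in MOUNTS are hypothetical and would have to match how the shared storage is actually mounted on each platform.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical mount points of the same shared storage on each platform.
MOUNTS = {"posix": "/mnt/render", "windows": "R:\\"}


def translate_path(path, target):
    """Rewrite a shared-storage path for the target platform.

    target is "windows" or "posix"; the input path is assumed to use the
    other platform's convention and to live under that platform's mount point.
    """
    posix_root = PurePosixPath(MOUNTS["posix"])
    windows_root = PureWindowsPath(MOUNTS["windows"])
    if target == "windows":
        relative = PurePosixPath(path).relative_to(posix_root)
        return str(windows_root.joinpath(*relative.parts))
    relative = PureWindowsPath(path).relative_to(windows_root)
    return str(posix_root.joinpath(*relative.parts))
```

The pure path classes only manipulate strings, so the same code runs on any slave regardless of its own operating system. A path outside the configured mount point raises ValueError, which is preferable to silently rendering from the wrong location.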

On Unix-like systems you can easily create symbolic links to provide those paths when the shared storages are mounted at different locations.
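For example, a node whose storage is mounted somewhere unusual could be given the expected path with a symbolic link. The helper below is a sketch with made-up names; creating links in system directories normally requires administrator rights.

```python
import os


def link_shared_path(actual_mount, expected_path):
    """Make expected_path a symbolic link pointing at actual_mount.

    If expected_path already exists as a regular file or directory,
    os.symlink raises FileExistsError and nothing is changed.
    """
    if os.path.islink(expected_path):
        if os.readlink(expected_path) == actual_mount:
            return  # Already linked correctly.
        os.remove(expected_path)  # Replace a stale link.
    os.symlink(actual_mount, expected_path)
```

Calling `link_shared_path("/net/storage/projects", "/mnt/render")` on such a node would let jobscripts use `/mnt/render/...` paths unchanged; both paths here are examples only.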

Clients

Clients are all other programs that request information about the queue (from the master) or send requests to perform actions such as requeueing frames, enabling slaves, deleting jobs, and so on.

Drqman is one of those clients, but there are many others: all the command line tools (sendjob, jobinfo, ctask, etc.) as well as any other program that uses the DrQueue library (libdrqueue), such as Python or Ruby scripts using the DrQueue language bindings (for example in connection with DrKeewee and DrQueueOnRails).

Those clients can perform every type of action on the queue that the library offers. There is no restriction on the requests a client may send, but wrong requests can lead to unexpected results, like the previously mentioned issue of updating a slave's information on the master. After such a change you might expect the slave's information to have been updated, when in fact the change has to be made directly on the slave. (Some requests to the master do make it send new requests to the slaves, and so may produce the same result. Don't let those cases confuse you: the change is always requested directly from the slave, either by the master or by the client.)

DrQueue_Overview.png - Overview Diagram (38.5 KB) Redmine Admin, 08/27/2006 01:05 am