Configuring MRS
The configurable settings
MRS performs several tasks, each of which can be configured.
Updating
Each databank configured in this task is typically fetched from some location and then indexed by MRS. The scripts distributed with MRS can take care of these steps. To add a databank to the update process, you first have to decide how to download the data; the options are rsync and FTP. You can specify additional information for this download step. Next, you can specify which MRS files need to be created and which extra options to pass to mrs-build.
For the simplest databanks, only two lines are needed:
The first line adds the enzyme databank to the list of databanks we want to process. The second line specifies where to fetch the data for this databank.
As you can see in this example, the first step always defines a new databank name and adds it to the list contained in the variable DATABANKS. Adding is done with the += operator. If you use = here, you will throw away the current content of the list and replace it with the new databank, so don't do this.
The next steps define the details of what to do for this databank. Each databank has a set of variables that can be defined. Each of those variable names consists of a common name part, a dot, and then the name of the databank. Therefore the rsync location for the enzyme databank consists of the RSYNC_URL variable name and the enzyme databank name: RSYNC_URL.enzyme.
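The naming scheme can be illustrated with a short Python sketch. It is not part of MRS: the helper names and the rsync URL are hypothetical placeholders, and the make-style "NAME = value" spelling of the assignment is an assumption. Only the DATABANKS list, the += operator and the VARIABLE.databank pattern come from the text above.

```python
# Illustrative only: render databanks.info-style lines for one databank.
# Helper names and the URL are hypothetical; the assignment syntax is assumed.

def variable_name(variable: str, databank: str) -> str:
    """Join the common variable name and the databank name with a dot."""
    return f"{variable}.{databank}"


def databank_lines(databank: str, settings: dict[str, str]) -> list[str]:
    """Return the lines that register a databank and set its variables."""
    # Always append with +=; a plain = would replace the whole DATABANKS list.
    lines = [f"DATABANKS += {databank}"]
    for variable, value in settings.items():
        lines.append(f"{variable_name(variable, databank)} = {value}")
    return lines


# Placeholder URL; substitute the real mirror location.
example = databank_lines("enzyme", {"RSYNC_URL": "rsync://mirror.example.org/enzyme/"})
print("\n".join(example))
```

The first generated line registers the databank and the second one sets its download location, which mirrors the two-line minimum described above.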
Variables that can be set for each databank are:
- FTP_URL
- If you want to use FTP to fetch the data, use this variable to set the URL to use.
- FTP_OPTIONS
- The MRS update process uses mrs-mirror to transfer data using the FTP protocol. You can specify extra options to mrs-mirror here. Extra options can be used, for example, to limit the number of files to transfer or to fetch recursively.
- RSYNC_URL
- If you prefer to use rsync (which is much more efficient than FTP) you can specify a location here.
- RSYNC_OPTIONS
- As with FTP_OPTIONS, you can specify additional options to rsync here.
- DATA_POST_PROCESS
- An optional command to execute after successfully updating the data.
- MRS_FILES
- The list of MRS databanks to create for this databank. Usually this is not needed. An exception is, for example, UniProt, which consists of the two databanks SwissProt and TrEMBL.
- MRS_SCRIPT
- The MRS script to use. Scripts can be shared among databanks, so you may need to specify the name of the script here. For example, RefSeq uses the GenBank layout and therefore uses the genbank script.
- MRS_OPTIONS
- Extra options for mrs-build. For example, -I suppresses the creation of IDL data. See the mrs-build manual page for more information on the available options.
Configuration files
The goal is to have just one configuration file for MRS, but unfortunately we're not there yet. MRS-4 therefore has two configuration files: mrs-config.xml and databanks.info.
mrs-config.xml
The first, and eventually only, configuration file, mrs-config.xml, is located by default in /usr/local/etc. This file is XML based and contains several sections. For now this file has to be edited by hand, so it is a good idea to use a syntax-highlighting editor.
globals section
The file starts with a number of global settings. These include:
- datadir
- The directory where the MRS datafiles are located.
- rawdir
- The directory where MRS stores the so-called raw files. These are the files that are downloaded from the mirror site.
- scratchdir
- During indexing MRS needs to create scratch files to store temporary data. You can specify a separate directory here. If this variable is not specified, the datadir will be used instead. Scratch files are read in random order and so it is sensible to specify a partition on your fastest disks here. Scratch files can become up to half the size of the eventual MRS file.
- scriptdir
- The directory containing the MRS parser scripts.
- logfile
- The path to the logfile that will be created by the mrs-ws application. This file must be writable by the user running mrs-ws.
- user
- If specified, mrs-ws will drop privileges when starting as a daemon.
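The scratchdir fallback and the size guideline above can be summarised in a few lines of Python. This is a sketch under stated assumptions: the dictionary merely stands in for the parsed globals, the function names and paths are hypothetical, and none of this is an MRS API.

```python
# Illustrative only: resolve the scratch directory as described in the text.
# The dictionary models the global settings; paths below are made up.

def resolve_scratchdir(settings: dict[str, str]) -> str:
    """Return scratchdir if it is set, otherwise fall back to datadir."""
    return settings.get("scratchdir") or settings["datadir"]


def max_scratch_bytes(mrs_file_bytes: int) -> int:
    """Upper estimate: scratch files can reach half the final MRS file size."""
    return mrs_file_bytes // 2


without_scratch = {"datadir": "/data/mrs"}
with_scratch = {"datadir": "/data/mrs", "scratchdir": "/fast/scratch"}
print(resolve_scratchdir(without_scratch))
print(resolve_scratchdir(with_scratch))
```

The second helper is only a planning aid for sizing the scratch partition; the actual scratch usage depends on the databank being indexed.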
Servers section
What follows is a section that specifies the servers to start up. You can specify as many servers as you want. Each of them will run as a separate process. The options per server are:
- service
- The type of service to start. Valid options are: www, search, blast, clustal and admin.
- docroot
- This is the location on disk containing the XHTML files that act as templates for the web interface. Obviously, this option is useful for www services only.
- address
- The IP address to bind to. If you specify 0.0.0.0 here, you will bind to all available addresses.
- port
- The port to listen on. This should be a unique port number; you cannot share port numbers with other processes.
- threads
- This option is only valid for the search and blast servers. It specifies how many threads should run simultaneously.
- dbs
- This is the list of logical databanks the server can provide. Each databank is listed in a <db> tag. These db tags have the optional parameter ignore-in-all. When specified, it tells mrs-ws to ignore this particular databank when searching all databanks. This is useful when the databank is in fact a joined databank of which the parts are also listed. (E.g. UniProt consists of SwissProt and TrEMBL, both of which are separately searchable.)
- clustalw-exe
- This option is for clustal servers only, of course. It specifies where the clustalw executable is located.
- max-run-time
- This option is also for clustal servers. It specifies the maximum amount of time a clustal run may take. This is to protect your server.
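The restrictions on the server options listed above lend themselves to a simple consistency check. The following Python sketch models each server as a plain dictionary keyed by the option names from the list; the function name and the dictionary representation are assumptions made for illustration and say nothing about how mrs-ws itself reads its configuration.

```python
# Illustrative only: sanity checks for the per-server options listed above.
# Servers are modeled as plain dictionaries; this is not MRS code.

VALID_SERVICES = {"www", "search", "blast", "clustal", "admin"}

# Options that only apply to certain service types, as described in the text.
RESTRICTED_OPTIONS = {
    "docroot": {"www"},
    "threads": {"search", "blast"},
    "clustalw-exe": {"clustal"},
    "max-run-time": {"clustal"},
}


def check_servers(servers: list[dict]) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    seen_ports = set()
    for index, server in enumerate(servers):
        service = server.get("service")
        if service not in VALID_SERVICES:
            problems.append(f"server {index}: unknown service {service!r}")
        for option, allowed in RESTRICTED_OPTIONS.items():
            if option in server and service not in allowed:
                problems.append(f"server {index}: {option} does not apply to {service}")
        port = server.get("port")
        if port is not None:
            if port in seen_ports:
                problems.append(f"server {index}: port {port} is already used")
            seen_ports.add(port)
    return problems
```

Note that this only detects port clashes between the servers in one configuration; it cannot tell whether another process on the machine already uses a port.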
Databanks section
The rest of the configuration file contains a list of databanks. There are two kinds: regular MRS data files and joined data files. Each has an ID that can be used to specify the databank. This ID can also be used from the command line by tools like mrs-build and mrs-query. Joined databanks are groups of other databanks that are accessed under one ID.
Regular databanks are specified by the <db-file> tag. This tag takes two attributes: id specifies the ID for the databank and file specifies the file name.
Joined databanks are specified by the <db-join> tag. This tag has three attributes: id sets the ID, name sets a human-readable name for this logical databank, and update specifies that this databank is an update databank. The last attribute is optional. The <db-join> tag encloses a list of <db-part> tags specifying the parts contained in this joined databank.
Update databanks are special: in this case the order of the parts is important. If a query returns an entry whose ID is also contained in a databank part further down the list of db-parts, that entry is dropped. For example, EMBL consists of an EMBL release file and an EMBL updates file. When a query returns an entry from the release file that is also contained in the updates file, you want to discard the one from the release file since it is now out of date.
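The update rule can be made concrete with a small Python model. It is a sketch of the behaviour described above, not the MRS implementation: parts are represented as (name, IDs) pairs in configuration order, and the part names and entry IDs are invented for the example.

```python
# Illustrative only: the "update" rule for joined databanks on plain lists.
# Each part is (part_name, ids_in_that_part), listed in configuration order.

def apply_update_rule(
    parts: list[tuple[str, list[str]]],
    hits: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Drop a hit when its ID also occurs in a part further down the list."""
    position = {name: index for index, (name, _) in enumerate(parts)}
    kept = []
    for part_name, entry_id in hits:
        # Collect the IDs of all parts listed after the part this hit came from.
        later_ids = set()
        for _, ids in parts[position[part_name] + 1:]:
            later_ids.update(ids)
        if entry_id not in later_ids:
            kept.append((part_name, entry_id))
    return kept


# Hypothetical part names and IDs, mirroring the EMBL example.
parts = [
    ("embl_release", ["A1", "B2", "C3"]),
    ("embl_updates", ["B2", "D4"]),
]
hits = [("embl_release", "A1"), ("embl_release", "B2"), ("embl_updates", "B2")]
print(apply_update_rule(parts, hits))
```

In this example the release copy of B2 is discarded because the updates part, which is listed later, contains the same ID; the entries A1 and the updated B2 remain.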