MRS plugin tutorial
When designing MRS, one of the goals was to keep the complexity as low as possible. One way I tried to accomplish this was to adopt a well-known scripting language instead of inventing my own. Given its popularity (and my familiarity with it), I chose Perl. Perl is known as a fast and powerful language, especially for parsing and processing text, which is exactly what we use it for in MRS.
The indexing process is started by the mrs-build application. The main parameter is the -d option that supplies the databank name we want to create. Based on this parameter the application chooses a plugin script to provide the necessary information and to do the actual processing. This choice of plugin script can be overridden using the -s option.
A plugin script is a Perl object that inherits from the base object MRS::Script. It should provide some methods and can override some of the methods provided by the base class.
Since the title says this is supposed to be a tutorial, let's get started by inventing our own databank. Let's call this databank 'mydb'. mydb is a so-called flat-file databank: its records have a well-defined layout. Here's an example record:
    ID   A0001
    DE   This is a funny peptide
    AA   AABBCCBBAA
    //
As you can see, there are three fields in this databank, an ID field containing a unique value for each record, a DE field containing a description of the record and AA containing an actual protein sequence.
One remark at this point. Our mydb databank as described above contains records. In MRS we do not store records but documents. You should for now consider these terms as synonyms and perhaps skip the rest of this paragraph. The reason MRS stores documents and not records is that MRS is not a database tool, but a tool to index data and there is no limit to what you can store. No structure in the data is assumed.
Let's start writing an MRS plugin for mydb. We will call the plugin mydb and thus we save it as mydb.pm in the parser_scripts directory. The first lines in the script are very straightforward; they are here to tell Perl what this file actually contains:
    package MRS::Script::mydb;

    our @ISA = "MRS::Script";
The first line defines the name for the package and the third line tells Perl our package inherits from the MRS::Script package. We then continue with a 'new' subroutine. This one is required in all Perl plugins and initialises the plugin object. Here's the code:
    sub new
    {
        my $invocant = shift;
        my $self = new MRS::Script( ... );
        return bless $self, "MRS::Script::mydb";
    }
Again, this is pretty standard stuff for anyone who has written a Perl object before, with one exception: we have to call the base class constructor first to create our $self object. We will now fill in this method with actual information for our mydb databank. We do this by providing values for a couple of standard fields in this object. Remember, an object in Perl is typically a reference to a hash that has been blessed, and hashes are associative arrays. We will now set up a couple of values for this hash.
    sub new
    {
        my $invocant = shift;
        my $self = new MRS::Script(
            name      => 'MyDB, a databank containing funny peptides',
            url       => 'http://www.mydb.org/funny',
            section   => 'protein',
            meta      => [ 'title' ],
            raw_files => qr/mydb.txt.gz/,
            indices   => \%INDICES,
            @_
        );
        return bless $self, "MRS::Script::mydb";
    }
Let's see what we've got here. First of all there's the name field. This is the pretty name the user interface presents to the end user of MRS.
The url field contains the URL where users can find more information about this databank.
The section field is not used at the moment but will be later on. The idea is to use this information to group databanks in the user interface.
The meta field contains an array of the names of meta data fields. MRS needs to know in advance what meta data fields will be constructed. This means that you have to list all meta data fields here before you actually fill them. If you don't, you will get the error "Meta data field xxx not defined".
If you provide meta data fields, which is optional, you must make sure that the first meta data field contains the title (or description) of the document. All tools that use MRS files assume this convention is used.
raw_files is a compiled regular expression. It is used to determine which files to process. The mirror process might fetch more files into the 'raw' directory of this databank than just the data files; some databanks, for example, have files with release information. Sometimes there is no simple regular expression that describes all data files: in the PDB databank, for example, the data files are located in subdirectories. In that case you can provide a raw_files method to create a list of raw data file paths.
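To get a feeling for what such a pattern selects, here is a small stand-alone sketch. The file names are made up, and I am assuming the pattern is simply tested against each file name; how MRS applies it exactly may differ. It also shows that the unescaped dots in the pattern match any character, so a stricter, anchored variant can be worth the extra typing.

```perl
use strict;
use warnings;

# Hypothetical contents of the 'raw' directory after mirroring.
my @fetched = ('mydb.txt.gz', 'README', 'relnotes.txt', 'mydb_txt_gz.bak');

# The pattern used in this tutorial, and a stricter anchored variant.
my $loose  = qr/mydb.txt.gz/;
my $strict = qr/^mydb\.txt\.gz$/;

# grep keeps only the names that match the pattern.
my @loose_hits  = grep { $_ =~ $loose }  @fetched;
my @strict_hits = grep { $_ =~ $strict } @fetched;

print "loose:  @loose_hits\n";
print "strict: @strict_hits\n";
```

The loose pattern also picks up mydb_txt_gz.bak, because each dot matches the underscore; the strict one only accepts mydb.txt.gz.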
Then there is the indices field. This one contains a reference to a hash which maps a field identifier to a field name. In our case we use a global hash called %INDICES, but if we do, we have to define this hash itself as well of course. Here it is, this code should come before the new subroutine:
    our %INDICES = (
        'id' => 'Identifier',
        'de' => 'Description',
        'aa' => 'sequence'
    );
A close reading of this piece of code reveals that we use lowercase IDs while our databank has uppercase field codes. The reason is that MRS matches field names case-insensitively. Internally the IDs are stored in lowercase, so I opted for lowercase in the scripts as well. This is cosmetic, but be warned that using anything other than lowercase is untested.
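If you want to look up an uppercase field code from a record in this hash yourself, lowercasing it first does the trick. The helper index_name_for below is my own invention for this illustration and is not part of MRS:

```perl
use strict;
use warnings;

our %INDICES = (
    'id' => 'Identifier',
    'de' => 'Description',
    'aa' => 'sequence'
);

# Hypothetical helper (not an MRS function): map a field code as it
# appears in a raw record to the index name defined in %INDICES.
sub index_name_for
{
    my ($code) = @_;
    return $INDICES{lc $code};
}

print index_name_for('DE'), "\n";
```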
OK, our databank script object is ready to be initialised and mrs now knows most of the information it needs to start the parsing process. There are a couple of fields that we didn't provide. MRS will create a default value for these.
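If you'd like to play with this constructor pattern without having MRS installed, the following sketch may help. My::BaseScript is a stand-in I made up for this purpose; it is not the real MRS::Script, and the default value it sets is invented. The point is only to show how the trailing @_ lets the caller override the values the plugin supplies, and how the object ends up blessed into the plugin package.

```perl
use strict;
use warnings;

# Stand-in base class for this sketch only; NOT the real MRS::Script.
package My::BaseScript;

sub new
{
    my $invocant = shift;
    my $self = {
        section => 'other',   # made-up default value
        @_                    # later pairs overwrite earlier ones
    };
    return bless $self, "My::BaseScript";
}

package My::BaseScript::mydb;

our @ISA = "My::BaseScript";

sub new
{
    my $invocant = shift;
    # Call the base class constructor first, then re-bless the result.
    my $self = My::BaseScript->new(
        name    => 'MyDB, a databank containing funny peptides',
        section => 'protein',
        @_
    );
    return bless $self, "My::BaseScript::mydb";
}

package main;

my $plugin = My::BaseScript::mydb->new(section => 'test');

print ref($plugin), "\n";
print "$plugin->{name}\n";
print "$plugin->{section}\n";
```

Because a hash keeps the last value given for a key, the 'test' passed by the caller wins over both 'protein' and the base class default.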
Before we continue with the parsing, let me spend a couple of words on version information. If you don't provide any, mrs-build will store the date of the most recent raw file as version information. However, you can set the version field or you can provide a method with the same name to fetch this information.
Parsing
The only other method you're required to provide is the parse subroutine. This one does the actual parsing of the raw data. mrs.pl will open the files specified by the raw_files value and decompress them on the fly if needed. Whether a file is decompressed is decided by the extension of its name.
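You don't have to do any of this yourself in a plugin, but to make the idea concrete, here is a rough sketch of choosing a reader by file extension. The helper open_raw_file is hypothetical and only handles gzip; I'm not claiming this is how MRS does it internally. It uses the IO::Compress modules that ship with Perl.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper (not part of MRS): open a raw file for reading and
# decompress it on the fly when the name ends in .gz.
sub open_raw_file
{
    my ($path) = @_;

    if ($path =~ /\.gz$/)
    {
        my $fh = IO::Uncompress::Gunzip->new($path)
            or die "cannot gunzip $path: $GunzipError";
        return $fh;
    }

    open(my $fh, '<', $path) or die "cannot open $path: $!";
    return $fh;
}

# Create a small compressed test file in a temporary directory.
my $dir     = tempdir(CLEANUP => 1);
my $text    = "ID   A0001\n//\n";
my $gz_path = "$dir/mydb.txt.gz";
gzip(\$text => $gz_path) or die "gzip failed: $GzipError";

# Read it back line by line through the helper.
my $fh = open_raw_file($gz_path);
my @lines;
while (defined(my $line = <$fh>))
{
    push @lines, $line;
}
close($fh);

print @lines;
```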
The parse method is called with no parameters other than the object itself, and so we start our parse method as follows:
    sub parse
    {
        my $self = shift;
Again, if you know Perl you know what happens here. If you don't know Perl, you'd better simply copy over the lines since I know that Perl syntax can be a bit confusing.
And so we are in the parse method and we're almost ready to start parsing. The data we produce will be put into a new databank and the mrs-build application has already created a new temporary mrs file ready to accept this data.
Since our parser script is derived from MRS::Script, it inherits several functions that can be used. Among these is GetLine to read the raw files line by line, others are the various Store and Index methods. Their use will be explained here.
Let's start parsing. There are several ways to do this, but one of my favorites is to read the raw data line by line after we've declared some variables to use later on:
        my ($doc, $title);

        while (my $line = $self->GetLine)
        {
            $doc .= $line;
            chomp($line);
So we read the file line by line and add each line to the $doc variable to collect the data that belongs to the current document. After this we chop off the newline at the end of the line using chomp to make processing easier. Now if we look at the structure of our mydb again, we see that a record is always closed with a line containing two forward slashes.
            if ($line eq '//')
            {
                $self->StoreMetaData('title', $title);
                $self->Store($doc);
                $self->FlushDocument;

                $doc = undef;
                $title = undef;
            }
What happens here is we've finished a document and $doc contains the full text of the original record. $title contains the title for this document (see below) and we store both into the current document of the MRS databank. When we've done that we can flush the document and reset the variables $doc and $title.
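If you want to try this record-splitting logic outside of MRS, the sketch below does the same thing on an in-memory string. Reading from $in stands in for GetLine and pushing onto @documents stands in for the Store and FlushDocument calls; the second record is made up for the occasion.

```perl
use strict;
use warnings;

# Two records in the mydb layout; the second one is invented for this sketch.
my $raw = <<'END';
ID   A0001
DE   This is a funny peptide
AA   AABBCCBBAA
//
ID   A0002
DE   Another funny peptide
AA   CCBBAABBCC
//
END

# Read the string as if it were a file; this replaces $self->GetLine.
open(my $in, '<', \$raw) or die "cannot open in-memory data: $!";

my @documents;   # stands in for the MRS databank
my $doc;

while (my $line = <$in>)
{
    $doc .= $line;
    chomp($line);

    if ($line eq '//')
    {
        # Here the plugin would call Store and FlushDocument.
        push @documents, $doc;
        $doc = undef;
    }
}
close($in);

print scalar(@documents), " documents\n";
```

Each element of @documents now holds the full text of one record, including its closing '//' line.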
Now if a line does not contain the two forward slashes, it must contain a key/value pair. The key is two characters wide, followed by three spaces, and the rest of the line contains the value. We test for this and parse out the key and the value at the same time: