OELib Primer
An Introduction to Programming with the OpenEye Library
by Matthew T. Stahl
Table of Contents
It is fair to say that OELib is a direct result of Babel. Programmers have long known that application development is facilitated by building software on top of libraries rich in functionality. Babel was the first experience I had in working with a molecule library. In addition to developing Babel, Pat Walters and I developed `OBabel' at Vertex Pharmaceuticals. OBabel was the first attempt at developing an object oriented molecule library. Although OBabel was a successful project, my departure from Vertex Pharmaceuticals provided a great opportunity to include much of what I had learned in previous projects into a new molecule class library. OELib was designed to be flexible, extensible, portable, and efficient class library for processing small molecules. I hope that others find it as useful as I have.
I admire open-source projects. Linux and the Free Software Foundation are both great examples of how summing the spare time of Geeks and Hackers can result in useful projects. Although the potential audience is small, I hope to take advantage many of the great minds writing chemical software by making OELib available to others. Two things can be accomplished by releasing source code to OELib. First, development time can be shortened by basing projects on OELib. The less people have to reinvent the wheel (or the function) the better. Second, by releasing the source code hopefully other programmers can contribute to the project. Joe Corkery, Brian Goldman, Anthony Nicholls, Roger Sayle, and Pat Walters have already made significant contributions to OELib. As the list of contributors grows all the users of OELib benefit.
This primer is not intended to be an exhaustive manual. OELib is still too young and changing too rapidly to write a manual that will not be sorely lacking soon after the first draft. The primer will hopefully be sufficient to get developers started with the basic functionality of OELib. Writing a program to perform simple manipulations of molecules should possible after reading and understanding the material in the primer. More complex software based on OELib will require the developer to dig into the headers and source code to understand all of the classes and member functions.
The most important class in OELib is OEMol, or the molecule class. The OEMol class is designed to store all the basic information associated with a molecule, to make manipulations on the connection table of a molecule facile, and to provide member functions which automatically perceive information about a molecule. A guided tour of the OEMol class is a good place to start.
An OEMol class can be declared in either of the following ways
The second declaration type sets the input and output formats for a molecule. For example
will read in a
molecule in SD file format from stdin (or the C++ equivalent cin) and
write a MOL2 format file out to standard out. Additionally, The input
and output formats can be altered after declaring an OEMol by the
member functions
where the current
values of enum io_type are
The following
lines of code show how to set the input and output types of an OEMol
through the member functions
Once a molecule has been read into an OEMol the atoms and bonds can be accessed by the following methods
or
or
or
OEBond *bond; vector<OEBond*>::iterator i; for (bond = mol.BeginBond(i);bond;bond = mol.NextBond(i)) //iterator access
It is important
to note that atom arrays begin at one and bond arrays begin at zero.
Requesting atom zero
will result in an error, but
is perfectly valid.
The ambiguity of numbering issues and off-by-one errors led to the use of iterators in OELib. An iterator is essentially just a pointer, but when used in conjunction with Standard Template Library (STL) vectors it provides an unambiguous way to loop over arrays. OEMols store their atom and bond information in STL vectors. Since vectors are template based, a vector of any user defined type can be declared. OEMols declare vector<OEAtom*> and vector<OEBond*> to store atom and bond information. Iterators are then a natural way to loop over the vectors of atoms and bonds.
Before going on to the
OEAtom and OEBond classes in more detail it is important to point out
several things about the OEMol class. Much of the information
contained in a molecule can be accessed or modified by Get or Set
methods. For instance, the GetAtom() method has already been
discussed. The following member functions are examples of Get and Set
methods in OEMol
Of the
preceding member functions GetData and SetData deserve
special attention. The data structure of OEMol can be include all
types of data through the GetData() and SetData() routines.
Attribute/Value data can be added to an OEMol in the following manner
and retrieved
from the molecule by
The onus of
freeing the memory stored with GetData is on the user as the pointer
to the data inside OEMol is released when the Clear() member
function is called.
The following OEMol member functions are useful manipulation routines
The following are useful perception member functions for OEMol.
The basics of
the OEMol class have now been covered. For more details and the inner
workings of OEMol please look at mol.h and mol.cpp in the OELib
distribution. To understand more about what can be done with the
OEMol class it is necessary to look at the associated data classes.
To
understand the OEAtom class it is important to state a key decision
on which the design was based. In OBabel the atom class existed, but
it was only a data container. All data access and modification of
atoms was done through the molecule. The result was a molecule class
that was very large an unwieldy. I decided to make the OEAtom class
smarter, and separate out many of the atom specific routines into the
OEAtom thereby decentralizing and shrinking the OEMol class. As a
result the OEAtom class not only holds data, but facilitates
extraction of data perceived from both the atom and the molecule.
A number of data extraction methods perform what is called `Lazy Evaluation', which is essentially on the fly evaluation. For example, when an atom is queried as to whether it is cyclic or what it's hybridization state is the information is perceived automatically. The perception of a particular trait is actually performed on the entire molecule the first time it is requested of an atom or bond, and stored for subsequent requests for the same trait of additional atoms or bonds.The OEAtom class is similar to OEMol and the whole of OELib in that data access and modification is done through Get and Set methods. The following are Get and Set methods in OEAtom
//***methods to set atomic information*** void SetIdx(int idx); void SetHyb(int hyb); void SetAtomicNum(int atomicnum); void SetFormalCharge(int fcharge); void SetType(char *type); void SetType(string &type); void SetPartialCharge(float pcharge); void SetVector(Vector v); void SetVector(float x,float y, float z); void SetFlag(int flag); void SetParent(OEMol *ptr);
The following code demonstrates how to
print out the atom numbers, element numbers, and coordinates of a
molecule.
The OEAtom class also has a number of
convenience methods for extracting property information from an atom.
A number of the property member functions indicate that atoms have some knowlege of their covalently attached neighbor atoms. Bonding information is partly redundant within a molecule in that an OEMol has a complete list of bonds in a molecule, and an OEAtom has a list bonds of which it is a member. The following code demonstrates how an OEAtom uses its bond information to loop over atoms attached to itself.
should produce an output like
atom #1 is attached to atom #2
The OEBond class is straightforward in its data access and modification methods. OEBonds store pointers to the atoms on each end of the bond. In storing pointers to atoms instead of integer indices, the necessity of having to reorder bonds when atoms are shuffled, added, or delete is obviated.
//***bond modification methods*** void SetIdx(int idx); void SetBO(int order); void SetBegin(OEAtom *begin); void SetEnd(OEAtom *end); void SetFlag(int flag); void SetLength(OEAtom*,float); void Set(int,OEAtom*,OEAtom*,int,int); //***bond data request methods*** int GetBO(); int GetBondOrder(); int GetFlag(); int GetBeginAtomIdx(); int GetEndAtomIdx(); int GetIdx(); OEAtom *GetBeginAtom(); OEAtom *GetEndAtom(); OEAtom *GetNbrAtom(OEAtom *ptr); float GetEquibLength(); int GetNbrAtomIdx(OEAtom *ptr);
Several convenience property methods are provided
OEResidue
Residue information drawn from PDB or MOL2 files are stored in the OEResidue class. OEResidues are stored inside the OEAtom class. The residue information for an atom can be requested in the following way
Ring information beyond atom and bond membership is usually not necessary, but more information can be had about the rings in a molecule. The OERing class is used to store the information from a Smallest Set of Smallest Rings (SSSR) perception of a molecule. The OEMol member function GetSSSR() stores the information it perceives in a vector<OERing*> inside the molecule. Perception is only done once for a molecule unless the connection table is modified. The following code demonstrates how to extract the SSSR information.
OERings store the atom numbers of the atoms in each of the smallest set of smallest rings in both a vector<int> and an OEBitVec (described later in this document). An example of how to print out the atom numbers present in all SSSR rings is show below.
will produce
something like the following output for benzene
1 2 3 4 5 6
Ring information is automatically deleted from an OEMol when it goes out of scope or the Clear() member function is called.
Molecule operator classes add information to or discover information about molecule. Although some operator classes could have been made member functions, they are necessary in all applications using the OEMol class and therefore were broken out into operator classes to avoid excessive code bloat and complexity of the OEMol class.
Substructure search is an incredibly useful tool in the context of a small molecule programming library. Having an efficient substructure search engine reduces the amount of hard code needed for molecule perception, as well as increases the flexibility of certain operations. For instance, atom typing can be easily performed based on hard coded rules of element type and bond orders (or hybridization). Alternatively, atom typing can also be done by matching a set of substructure rules read at run time. In the latter case customization based on application (such as changing the pH) becomes a facile operation. Fortunately for OELib and it's users, Roger Sayle donated a SMARTS parser which became the basis for SMARTS matching in OELib.
The SMARTS matcher, or OESmartsPattern, is a separate object which can match patterns in the OEMol class. The following code demonstrates how to use the OESmartsPattern class.
OEMol mol(SDF,SDF); cin >> mol; OESmartsPattern sp; sp.Init("CC"); sp.Match(mol); vector<vector<int> > maplist; maplist = sp.GetMapList(); //or maplist = sp.GetUMapList(); //print out the results vector<vector<int> >::iterator i; vector<int>::iterator j; for (i = maplist.begin();i != maplist.end();i++) { for (j = i->begin();j != i->end();j++) cout << j << ' `; cout << endl; }The preceding code reads in a molecule, initializes a smarts pattern of two single-bonded carbons, and locates all instances of the pattern in the molecule. Note that calling the Match(OEMol&) function does not return the results of the substructure match. The results from a match are stored in the OESmartsPattern, and a call to GetMapList() or GetUMapList() must be made to extract the results. The function GetMapList() returns all matches of a particular pattern while GetUMapList() returns only the unique matches. For instance, the pattern [OD1]~C~[OD1] describes a carboxylate group. This pattern will match both atom number permutations of the carboxylate, and if GetMapList() is called, both matches will be returned. If GetUMapList() is called only unique matches of the pattern will be returned. A unique match is defined as one which does not cover the identical atoms that a previous match has covered.
The OEAtomTyper class is designed to read in a list of atom typing rules and apply them to molecules. The code that performs atom typing is not usually used directly as atom typing, hybridization assignment, implicit valence assignment and charge are all done automatically when their corresponding values are requested of atoms.
Atom types are defined in the file atomtyp.txt which is composed of two sections. The list of atom is at the head of the file and appears as
Each atom type line contains a reference number, the internal atom type, hybridization, formal charge, and implicit valence. The reference numbers for each of the atom types are used in the second section in association with SMARTS patterns.
The reference numbers following the SMARTS patterns correspond to the reference numbers given in the atom types section. Atoms matching the SMARTS patterns will be assigned to the type given by the reference number unless the reference number is zero. Only the first SMARTS pattern that covers an atom will be used to assign the atom type. All subsequent matches of other patterns do not affect an atom's type. SMARTS patterns are applied in order they appear in atomtyp.txt so specific patterns should be defined toward the top of the patterns section and more general patterns toward the bottom.
The atom typing system in OELib gives the user a great deal of control over customizing typing rules. Since SMARTS patterns determine how OELib perceives a molecule it is rather trival to write a rule set that reflects the state of functional groups in any type of media. Charge and protonation states can reflect a wide variety of pH's in aqueos environments, or organic solvents. Translating file formats that have low information content about atom types to those with high information content becomes a process with much higher certainty and specificity than a system with hardcoded rules.
Patty stands for programmable atom typer. The patty class was kindly donated by W. Patrick Walters. The patty class provides a more flexible means for atom typing than the OEAtomTyper. The behavior of patty is similar to the OEAtomTyper in that rules apply only to the first atom in the SMARTS pattern. The patty class can read any free format ASCII file which contains SMARTS patterns associated with user defined atom type. The following is an example of a valid patty rule
O=C hbacceptor
The following is a code sample that demonstrates the use of patty class.
The array indices in the vector<string> into which the result values are placed match the corresponding atom numbers. Since atoms are numbered beginning from one, the first element in the vector<string> is empty, and the values are placed in [1...mol.NumAtoms()].
OEGastChrg
The OEGastChrg class is responsible for the assignment of partial charges to a molecule according to the Gasteiger charge model. When the partial charge of an atom is requested and partial charges do not yet exist for the molecule the OEGastChrg class will automatically be called to assign charges for the molecule. If charges have been read in for a molecule the following code shows how to force the recalculation of partial charges.
Formal charges are used as seed values of the initial charge of atoms, and the partial charge is propagated to neighboring atoms. For example, quaternary amines would have a +1 charge, and the effect of the positive charge would be felt by neighboring atoms according to the Gasteiger model.
Utility classes provide functionality that is commonly desired in software in general. These classes facilitate the construction of larger end user programs based on OELib.
OEBitVec
The OEBitVec class is a fast and efficient bitstring class that is handy to use as a truth table. Truth tables are an easy way to store whether a list of items has a particular propery. Instances of OEBitVec can by dynamically resized, and have a number of overloaded operators that make code simple and readable. The following examples demonstrate uses of the OEBitVec class.
For a full
description of the OEBitVec class member functions please consult the
header file bitvec.h.
OEStopwatch
The OEStopwatch class makes timing the execution of blocks of code to microsecond accuracy very simple. The class effectively has two functions, Start() and Elapsed(). The usage of the OEStopwatch class is demonstrated by the following code.
Although based on
system specific function calls, OEStopwatch works on Unix flavored
systems as well as Microsoft Operating systems.
OEExtensionTable
A number of file extensions are common to certain formats of molecule file types. The OEExtensionTable class facilitates setting input and output types in the OEMol class based on input and output file names. The OEExtensionTable class is dependent on a file called extable.txt to provide the extension to type translation. The following is an example extable.txt file
The default constructor for the
OEExtensionTable class takes a single parameter specifying the name
of the environment variable which points to the default copy of the
extable.txt file. The OEExtensionTable class will look first at the
current working directory to see if a copy the extable.txt file
exists. If extable.txt is not found, the class with then look in the
directory pointed to by the environment variable for a global copy of
the file. An example of the initialization and usage follows
The input and output types taken from the extensions on the filenames and used in the molecule constructor.
CommandLine
The command line class provides a means of facile development of command line interfaces (CLI). The basic idea is that the CLI is predefined in a class which then parses what was typed on the command line, checks for errors, and stores the relevant data until retrieval is necessary. The following code demonstrates how to define a command line interface.
Void Usage() { cerr << "usage statement here" << endl; exit(0); }
It is important
to set a usage function in the command line class that alerts the
user to errors in the arguments given on the command line. If unknown
flags are used or arguments to flags are omitted the user can be
alerted to the error. The command line interface is defined by the
AddFlag() functions. As the command line arguments are parsed
the values associated with each flag are placed in the variable given
in the AddFlag() function. Default values can be given as a
third argument such as AddFlag("-val",val,10).
Booleans are set to false by default, and turned on if the associated
flag is given on the command line. Please see the file commandline.h
in the distribution for the full list of member functions of the
command line class.
Translating element data is a common task given that many file formats give either element symbol or atomic number information, but not both. The OEElementTable class facilitates conversion between textual and numeric element information. An instance of the OEElementTable class is declared as a global in data.cpp. Source files that include the header file mol.h automatically have an extern definition to the global in data.cpp. The following code sample demonstrates the use of the OEElementTable class.
OETypeTable
Molecular file formats frequently store information about atoms in an atom type field. Some formats store only the element for each atom, while others include hybridization and local environments, such as the mol2 atom type field. The OETypeTable class acts as a translation table to convert atom types between a number of different molecular file formats. The constructor for OETypeTable automatically reads the text file types.txt. Just as OEElementTable, an instance of OETypeTable is declared as a global in data.cpp and is referenced as extern OETypeTable ttab in mol.h. The following code demonstrates how to use the OETypeTable class to translate the internal representation of atom types in an OEMol to Sybyl Mol2 atom types.
Math Classes
Vector
The Vector class was designed to simplify operations with floating point coordinates. To this end many of the common operations have been overloaded for simplicity. Vector addition, subtraction, scalar multiplication, dot product, cross product, magnitude and a number of other utility functions are built in to the vector class. For a full description of the class member functions please consult the header file Vector.h. The following code demonstrates several of the functions of the vector class.
Matrix3x3
Rotating points in space can be performed by a vector-matrix multiplication. The Matrix3x3 class is designed as a helper to the Vector class for rotating points in space. The rotation matrix may be initialised by passing in the array of floating point values, by passing euler angles, or a rotation vector and angle of rotation about that vector. Once set, the Matrix3x3 class can be used to rotate Vectors by the overloaded multiplication operator. The following demonstrates the usage of the Matrix3x3 class.
Finding the largest ring system in a molecule is a useful example of how to loop over atoms and bonds in a molecule, as well as proper usage of the OEBitVec class. The FindLargestRingSystem function uses the method of expanding shells to identify ring systems. A seed atom is chosen which is part of a cycle, and all bonds of the seed are crossed if the bonds are part of a ring system. The same operations are performed to the new shell that has been created by following cyclic bonds. When no more shells are found a ring system has been identified, and it is stored if it is larger than any previously identified ring system. The bits corresponding to the atoms numbers in the largest ring system are turned on the OEBitVec bv which is passed as the second argument to the function.
A crude method of substructure searching can be accomplished in much the same way as the Unixtm grep command. The SMARTS grep program given below takes a SMARTS string and a smiles filename on the command line. All molecules that match the SMARTS pattern are then written to standard out. Note that the program can be modified to read from any file format that OELib supports, as well as write out any file format just by changing the declaration of OEMol.
#include "mol.h" #include "oeutil.h" #include "parsmart.h" using namespace std; using namespace OpenBabel; int main(int argc,char **argv) { if (argc != 3) { string err = "Usage: "; err += argv[0]; err += " \"PATTERN\" filename.smi"; ThrowError(err); exit(0); } OESmartsPattern sp; sp.Init(argv[1]); ifstream ifs; if (!SafeOpen(ifs,argv[2])) return(1); OEMol mol(SMI,SMI); for (;;) { mol.Clear(); ifs >> mol; if (mol.Empty()) break; if (sp.Match(mol)) cout << mol; } return(1); }Many thanks to those who have participated in the development of OELib.