mmCIF: A Chimera Developer Perspective

Tom Goddard
October 22, 2013

Chimera is a molecular visualization and analysis program similar to PyMOL and VMD. It has about 10,000 users.
Why use mmCIF files? Convenient for large structures that exceed PDB format limits: more than 62 chains or more than 100,000 atoms in one file.
Why not use mmCIF? Analysis uses many programs which all support PDB format, but only some support mmCIF format.
What would encourage software developers to support mmCIF? I'll give a Chimera developer perspective.
We want production quality code (fast, memory efficient, object oriented) for reading and writing mmCIF, and don't want to write it ourselves.
A significant technical obstacle in reading mmCIF or PDB format is figuring out which atoms are connected by bonds.

Two main points: First, a production quality mmCIF reading and writing library is needed to make mmCIF widely used. Second, mmCIF files should explicitly list the bonds -- which pairs of atoms to connect.

mmCIF is Seldom Used in Chimera

mmCIF reader added to Chimera 7 years ago, May 2006.
Estimated PDB and CIF use: 98% PDB format, 2% CIF format.
- Searching Chimera bug database, 1788 tickets include ".pdb" file names and 49 tickets include ".cif" file names. Counted by searching for exact text ".cif" and ".pdb".
- Very few bugs are related to file format problems.
- Some of the ".cif" files are small molecule files not from the PDB.
Chimera does not write mmCIF.
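
As a rough check of the usage estimate, the two ticket counts quoted in this section can be turned into percentages. This is only arithmetic on those two numbers; the variable names are illustrative.

```python
# Ticket counts quoted in this section (Chimera bug database text search).
pdb_tickets = 1788
cif_tickets = 49

total = pdb_tickets + cif_tickets
pdb_share = 100.0 * pdb_tickets / total
cif_share = 100.0 * cif_tickets / total

print("PDB: %.1f%%  CIF: %.1f%%" % (pdb_share, cif_share))
```

The result is close to the rounded 98% / 2% estimate, and some of the ".cif" tickets are small molecule files, so the mmCIF share is if anything lower.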

Chimera mmCIF Reader is Slow and a Memory Hog

Comparison of speed and memory use of 4 different file readers on molecular structures of different sizes.

File parsing speed in seconds

Structure	HIV RT	Proteasome	5 ribosomes	HIV capsid
Atom count	8,513	70,538	717,805	2,440,800
Chimera mmCIF	1.57	15.7	187	> 5000
RCSB CIFPARSE-OBJ	0.11	0.82	8.42	29
Chimera PDB	0.04	0.35	3.26	12.5
Chimera Next Gen mmCIF	0.006	0.05	0.62	1.8

Chimera PDB reader is about 50 times faster than its mmCIF reader.
The next generation Chimera mmCIF reader is about 5 times faster than the Chimera PDB reader, but it reads only the atom_site table and fills arrays rather than creating molecular objects.
The Protein Data Bank C++ library CIFPARSE-OBJ produces only tables, not molecule, residue and atom objects, and it does not convert strings to numbers.
Large structures are uncommon, but comparisons of many smaller structures (e.g. 190 HIV reverse transcriptase structures, 1.5 million atoms) are of interest to many researchers.
Memory use is important too -- a reader should not require more memory than a typical laptop or desktop computer has.

File parsing memory use in Mbytes

Structure	HIV RT	Proteasome	5 ribosomes	HIV capsid
mmCIF file size	1	8	110	266
Chimera mmCIF	115	960	9500	> 23000
RCSB CIFPARSE-OBJ	10.5	76	709	2330
Chimera PDB	7.6	60	560	1815
Chimera Next Gen mmCIF	1.9	18	279	423

Tests on 2013 iMac with 32 Gbytes of memory and 3 Tbyte fusion drive. PDB Identifiers: HIV RT 4b3o, Proteasome 4c0v, 5 ribosomes 1voq,1vor,...,1vp0 (10 ids), HIV capsid 1vu4,1vu5,...,1vut (25 ids), 3j3q.cif.
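
The speed ratios and the memory cost per atom can be recomputed from the two tables. The sketch below copies the table values for the first three structures (the HIV capsid column is left out because the Chimera mmCIF reader only has a lower bound there) and assumes 1 Mbyte = 10^6 bytes. The function names are illustrative.

```python
# Values copied from the speed and memory tables in this section.
# Columns: HIV RT, Proteasome, 5 ribosomes.
atoms = [8513, 70538, 717805]

seconds = {
    "Chimera mmCIF": [1.57, 15.7, 187.0],
    "RCSB CIFPARSE-OBJ": [0.11, 0.82, 8.42],
    "Chimera PDB": [0.04, 0.35, 3.26],
    "Chimera Next Gen mmCIF": [0.006, 0.05, 0.62],
}

mbytes = {
    "Chimera mmCIF": [115, 960, 9500],
    "RCSB CIFPARSE-OBJ": [10.5, 76, 709],
    "Chimera PDB": [7.6, 60, 560],
    "Chimera Next Gen mmCIF": [1.9, 18, 279],
}


def speed_ratios(slow, fast):
    """How many times faster 'fast' is than 'slow', per structure."""
    return [s / f for s, f in zip(seconds[slow], seconds[fast])]


def bytes_per_atom(reader):
    """Memory use per atom, assuming 1 Mbyte = 10**6 bytes."""
    return [1e6 * m / n for m, n in zip(mbytes[reader], atoms)]


print("PDB vs mmCIF reader:",
      ["%.0fx" % r for r in speed_ratios("Chimera mmCIF", "Chimera PDB")])
print("Next Gen vs PDB reader:",
      ["%.1fx" % r for r in speed_ratios("Chimera PDB", "Chimera Next Gen mmCIF")])
for reader in mbytes:
    print(reader, ["%.0f bytes/atom" % b for b in bytes_per_atom(reader)])
```

By this estimate the current Chimera mmCIF reader uses roughly 13,000 bytes per atom, which is why the largest structures do not fit in memory.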

Code Complexity

Chimera uses the Python Macromolecular Library (mmLib) to read mmCIF files.
- Written by Jay Painter in 2004 in the lab of Ethan Merritt at University of Washington.
- All in Python.
- Developed for viewing TLS (translation, libration, screw), domain motions in crystallography.
- Provides molecule, residue and atom objects, not tables, when parsing mmCIF.
- mmCIF parser is 2000 lines of code, plus 500 lines we wrote to convert to Chimera objects.
For comparison, the Chimera PDB format read/write code is 6000 lines of C++ code, handling all the standard PDB records, many common non-standard PDB file problems, bond connectivity, ....
Much PDB parsing complexity results from variant PDB formats created by different software packages.
mmCIF defines 352 table types (i.e. "Categories") and 4142 items as of version 4.034, Aug 8, 2013.
Estimate one programmer year needed to develop production mmCIF read/write code.
Chimera reads or writes 65 file formats handling molecules, sequences, density maps, 3d scenes, etc.

Parsing mmCIF Tables to create Atom, Residue, and Chain objects

Applications like Chimera need molecular objects, not relational database tables.
Significant code is needed to convert tables to molecule objects.
The exact definitions of atom, bond, residue, chain ... objects are not too important since each application will need to convert to its own preferred object definitions.
CIFPARSE-OBJ parser from RCSB produces tables.
mmLib produces molecular objects, which is why we chose to use it in Chimera.

Creating molecular objects involves matching corresponding names in different mmCIF tables, such as chain, residue name, residue number and atom name.

mmCIF atom table (atom_site):

  ATOM   1    N  N  PRO A 1 4  -62.315  -62.643 -5.519  1.00 100.20
  ATOM   2    C  CA PRO A 1 4  -61.373  -61.942 -4.649  1.00 110.90
  ATOM   3    C  C  PRO A 1 4  -61.730  -60.460 -4.495  1.00 108.42
  ATOM   4    O  O  PRO A 1 4  -60.863  -59.592 -4.628  1.00 100.29
  ATOM   5    C  CB PRO A 1 4  -60.037  -62.112 -5.380  1.00 108.66
  ...
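
A minimal sketch of turning atom_site rows like these into atom and residue objects, in Python 3. The column names are an assumption, since the excerpt does not show the table header, and the Atom and Residue classes are hypothetical stand-ins rather than Chimera or mmLib classes. Splitting on whitespace ignores mmCIF quoting rules, so this is not a general parser.

```python
# Assumed column names for the rows shown above (header not in the excerpt).
COLUMNS = ["group_PDB", "id", "type_symbol", "label_atom_id",
           "label_comp_id", "label_asym_id", "label_entity_id",
           "label_seq_id", "Cartn_x", "Cartn_y", "Cartn_z",
           "occupancy", "B_iso_or_equiv"]

ROWS = """\
ATOM   1    N  N  PRO A 1 4  -62.315  -62.643 -5.519  1.00 100.20
ATOM   2    C  CA PRO A 1 4  -61.373  -61.942 -4.649  1.00 110.90
ATOM   3    C  C  PRO A 1 4  -61.730  -60.460 -4.495  1.00 108.42
ATOM   4    O  O  PRO A 1 4  -60.863  -59.592 -4.628  1.00 100.29
ATOM   5    C  CB PRO A 1 4  -60.037  -62.112 -5.380  1.00 108.66
"""


class Atom:
    def __init__(self, row):
        self.name = row["label_atom_id"]
        self.element = row["type_symbol"]
        # Table values arrive as strings and must be converted to numbers.
        self.x = float(row["Cartn_x"])
        self.y = float(row["Cartn_y"])
        self.z = float(row["Cartn_z"])
        self.bfactor = float(row["B_iso_or_equiv"])


class Residue:
    def __init__(self, chain_id, name, number):
        self.chain_id = chain_id
        self.name = name
        self.number = number
        self.atoms = []


def build_residues(text):
    """Group atom rows into Residue objects keyed by (chain id, residue number)."""
    residues = {}
    for line in text.splitlines():
        fields = line.split()  # naive: no handling of quoted values
        if len(fields) != len(COLUMNS):
            continue
        row = dict(zip(COLUMNS, fields))
        key = (row["label_asym_id"], int(row["label_seq_id"]))
        if key not in residues:
            residues[key] = Residue(key[0], row["label_comp_id"], key[1])
        residues[key].atoms.append(Atom(row))
    return residues


residues = build_residues(ROWS)
for res in residues.values():
    print(res.chain_id, res.name, res.number, [a.name for a in res.atoms])
```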

mmCIF bond table (struct_conn):

  C DT 3 N3   D A 27 N1
  C DT 3 O4   D A 27 N6
  C DA 4 N1   D U 26 N3
  C DA 4 N6   D U 26 O4
  C DT 5 N3   D A 25 N1
  C DT 5 O4   D A 25 N6
  C DG 6 N1   D C 24 N3
  C DG 6 N2   D C 24 O2
  ...

If the file reader handles matching all the columns in the mmCIF tables to build molecular objects, application code can be simple. For example, printing names and positions of atoms connected to a given atom:

  for a in atom.bondedAtoms:
    print a.name, a.x, a.y, a.z
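
The matching step behind an attribute like bondedAtoms can be sketched as follows, in Python 3. Each bond row names two atoms by chain, residue name, residue number and atom name, and those four values are used as a lookup key. The Atom class and function names are hypothetical, and because the excerpt does not include atom_site rows for chains C and D, the atom index here is built from the bond rows themselves as a stand-in for atoms already read from atom_site.

```python
class Atom:
    def __init__(self, chain_id, res_name, res_num, name):
        self.chain_id = chain_id
        self.res_name = res_name
        self.res_num = res_num
        self.name = name
        self.bondedAtoms = []


# Rows copied from the struct_conn excerpt above.
BOND_ROWS = """\
C DT 3 N3   D A 27 N1
C DT 3 O4   D A 27 N6
C DA 4 N1   D U 26 N3
C DA 4 N6   D U 26 O4
"""


def parse_bond_rows(text):
    """Return a list of (key1, key2) pairs, one per bond row."""
    pairs = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 8:
            continue
        key1 = (f[0], f[1], int(f[2]), f[3])
        key2 = (f[4], f[5], int(f[6]), f[7])
        pairs.append((key1, key2))
    return pairs


def connect(atom_index, pairs):
    """Link atoms in both directions; skip rows naming atoms not in the index."""
    for key1, key2 in pairs:
        a1 = atom_index.get(key1)
        a2 = atom_index.get(key2)
        if a1 is None or a2 is None:
            continue
        a1.bondedAtoms.append(a2)
        a2.bondedAtoms.append(a1)


pairs = parse_bond_rows(BOND_ROWS)

# Stand-in for the atoms a reader would already have created from atom_site.
atom_index = {}
for pair in pairs:
    for key in pair:
        if key not in atom_index:
            atom_index[key] = Atom(*key)

connect(atom_index, pairs)

atom = atom_index[("C", "DT", 3, "N3")]
for a in atom.bondedAtoms:
    print(a.chain_id, a.res_name, a.res_num, a.name)
```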

Where are the Bonds?

It is difficult to determine which atoms are connected by covalent bonds from mmCIF files.
Using mmCIF files would be much easier if bonds were explicitly listed in the file.
Currently bond patterns for every possible residue type are kept in a chemical components file, components.cif.
This file is too large (150 Mbytes, 35 Mbytes compressed) to distribute with software.
New templates are added every week (2 Mbytes larger file size from August to October 2013).
Requiring internet connectivity to open an mmCIF file and show bonds is unreasonable.

How Chimera Figures out which Atoms are Bonded

mmCIF files

Include bond templates for 265 residue types (standard amino acids, nucleic acids, ...) with Chimera.
If a residue type is found that is not included with Chimera, call a web service (hosted by our lab) that provides the templates.
Our server updates the templates from the RCSB every week.
If the residue type is not in the complete PDB list of residue types, or there is no internet connectivity, or the server is down, then connect atom pairs that are close.
Atoms are close enough for a bond if they are closer than the sum of their element-dependent bond radii plus 0.4 Angstrom padding.
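
The distance rule can be sketched as follows, in Python 3. The radii in the table are illustrative covalent radii and are an assumption, not the values Chimera uses. The coordinates are the proline atoms from the atom_site excerpt earlier in this document. Testing all pairs is quadratic in the number of atoms, so a real reader would use a spatial grid or similar.

```python
import itertools
import math

# Illustrative bond radii in Angstroms (assumed values, not Chimera's table).
BOND_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
PADDING = 0.4  # Angstroms, as stated above


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def is_bonded(elem1, xyz1, elem2, xyz2, padding=PADDING):
    """True if the atoms are closer than the sum of bond radii plus padding."""
    limit = BOND_RADII[elem1] + BOND_RADII[elem2] + padding
    return distance(xyz1, xyz2) < limit


# (atom name, element, coordinates) for the PRO residue shown earlier.
atoms = [
    ("N", "N", (-62.315, -62.643, -5.519)),
    ("CA", "C", (-61.373, -61.942, -4.649)),
    ("C", "C", (-61.730, -60.460, -4.495)),
    ("O", "O", (-60.863, -59.592, -4.628)),
    ("CB", "C", (-60.037, -62.112, -5.380)),
]

bonds = [(n1, n2)
         for (n1, e1, p1), (n2, e2, p2) in itertools.combinations(atoms, 2)
         if is_bonded(e1, p1, e2, p2)]
print(bonds)
```

For these five atoms the rule recovers the N-CA, CA-C, CA-CB and C-O bonds and rejects the other pairs, which are all more than 2.3 Angstroms apart.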

PDB format files

Include about 100 templates with Chimera from Amber LEAP template files.
For unknown residue types, put bonds between close atoms as with mmCIF.

These methods produce incorrect bonds when distances between atoms are far from normal.
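
The lookup order described in this section (bundled templates, then the web service, then distances) can be sketched as a single function, in Python 3. All names here are hypothetical. The web service call and the distance fallback are passed in as functions so the sketch does not assume any particular server API.

```python
def bonds_for_residue(res_type, atom_names, bundled, fetch_remote, guess_by_distance):
    """Return a list of (atom name, atom name) bonds for one residue.

    bundled           -- dict: residue type -> list of template bonds
    fetch_remote      -- function(res_type) returning template bonds or None
    guess_by_distance -- function(atom_names) returning bonds from distances
    """
    template = bundled.get(res_type)
    if template is None:
        try:
            template = fetch_remote(res_type)
        except OSError:
            # No internet connectivity or the server is down.
            template = None
    if template is None:
        # Unknown residue type or no template available: use distances.
        return guess_by_distance(atom_names)
    # Keep only template bonds whose atoms are both present in the file.
    present = set(atom_names)
    return [(a, b) for a, b in template if a in present and b in present]


# Usage with stand-in data and stub functions.
bundled = {"GLY": [("N", "CA"), ("CA", "C"), ("C", "O"), ("C", "OXT")]}


def offline(res_type):
    raise OSError("no network")


def by_distance(atom_names):
    return [("distance-based", "bonds")]


print(bonds_for_residue("GLY", ["N", "CA", "C", "O"], bundled, offline, by_distance))
print(bonds_for_residue("XYZ", ["C1", "C2"], bundled, offline, by_distance))
```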

Bond Templates are Incomplete

Missing template bonds:

H1, H2 and H3 hydrogens are in NMR structures at the N-terminus of a protein, connected to the amide nitrogen.
There are no H1 or H3 bond templates -- only H2!
The Chimera mmCIF reader has special code to attach H1 and H3.
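
A sketch of what such special handling might look like, in Python 3. This is not the Chimera code; the function name and arguments are hypothetical. It adds a bond from N to each of H1, H2 and H3 that is present in an N-terminal residue and not already bonded by the template.

```python
def add_n_terminal_hydrogens(atom_names, bonds, is_n_terminal):
    """Return bonds plus N-H1/H2/H3 bonds missing from the template."""
    result = list(bonds)
    if not is_n_terminal or "N" not in atom_names:
        return result
    # Compare bonds without regard to atom order.
    bonded = {frozenset(b) for b in bonds}
    for h in ("H1", "H2", "H3"):
        if h in atom_names and frozenset(("N", h)) not in bonded:
            result.append(("N", h))
    return result


names = ["N", "CA", "H1", "H2", "H3"]
template_bonds = [("N", "CA"), ("N", "H2")]
print(add_n_terminal_hydrogens(names, template_bonds, is_n_terminal=True))
```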

Missing inter-residue templates:

Templates give the intra-residue bonds.
Bonds between residues are given by chem_link_bond records.
I wasn't able to find chem_link_bond records in any mmCIF files.
Apparently inter-residue bonds are hard-coded for proteins and nucleic acids.

Many residue types with no templates:

Our bond template web service averages only 3 successful queries per day, with 20 failures per day.
The failures are from unknown residue types, probably small-molecule CIF files.

Are chemical component bond templates used by other software?

How to Include Bonds Explicitly in mmCIF Files

Explicitly including the bonds in mmCIF files would let simple code produce correct bonds.
The existing mmCIF struct_conn table would list these.
File size would be increased by about 30% (including only mandatory columns: chain id, residue name, residue number and atom name of joined atoms).
Another option is to include the needed residue templates in each mmCIF entry. This would be harder to use.

loop_
_struct_conn.id 
_struct_conn.conn_type_id 
_struct_conn.ptnr1_label_asym_id 
_struct_conn.ptnr1_label_comp_id 
_struct_conn.ptnr1_label_seq_id 
_struct_conn.ptnr1_label_atom_id 
_struct_conn.ptnr2_label_asym_id 
_struct_conn.ptnr2_label_comp_id 
_struct_conn.ptnr2_label_seq_id 
_struct_conn.ptnr2_label_atom_id 
1  c C DT 3  N3 D A 27 N1
2  c C DT 3  O4 D A 27 N6
3  c C DA 4  N1 D U 26 N3
4  c C DA 4  N6 D U 26 O4
...
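
With bonds listed this way, reading them takes little code. The sketch below, in Python 3, parses the loop above into pairs of atom keys. It is a naive whitespace tokenizer written for this example only: it does not handle quoted values, multi-line values, or a residue number of "." as used for non-polymer residues. The function names are illustrative.

```python
LOOP_TEXT = """\
loop_
_struct_conn.id
_struct_conn.conn_type_id
_struct_conn.ptnr1_label_asym_id
_struct_conn.ptnr1_label_comp_id
_struct_conn.ptnr1_label_seq_id
_struct_conn.ptnr1_label_atom_id
_struct_conn.ptnr2_label_asym_id
_struct_conn.ptnr2_label_comp_id
_struct_conn.ptnr2_label_seq_id
_struct_conn.ptnr2_label_atom_id
1  c C DT 3  N3 D A 27 N1
2  c C DT 3  O4 D A 27 N6
3  c C DA 4  N1 D U 26 N3
4  c C DA 4  N6 D U 26 O4
"""


def parse_loop(text):
    """Return one dict per data row, keyed by the loop's item names."""
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_") and not rows:
            names.append(line)
            continue
        fields = line.split()
        if len(fields) == len(names):
            rows.append(dict(zip(names, fields)))
    return rows


def partner_key(row, n):
    """(chain id, residue name, residue number, atom name) for partner n."""
    prefix = "_struct_conn.ptnr%d_label_" % n
    return (row[prefix + "asym_id"], row[prefix + "comp_id"],
            int(row[prefix + "seq_id"]), row[prefix + "atom_id"])


def bond_pairs(rows):
    return [(partner_key(r, 1), partner_key(r, 2)) for r in rows]


pairs = bond_pairs(parse_loop(LOOP_TEXT))
for key1, key2 in pairs:
    print(key1, "--", key2)
```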

Conclusions

mmCIF format will only replace PDB format if most software supports it.
Open source, production quality mmCIF read/write code is needed.
All covalent bonds should be explicitly listed in mmCIF files.