The similarstructures command performs several functions related to the Similar Structures tool. The tool searches for similar protein structures and facilitates exploring large sets of results by efficiently showing them in 3D as backbone traces and in 2D as sequence alignment plots and conformation-based scatter plots.
• similarstructures fromblast [ blast-results-name ] [ save true | false ] [ saveDirectory results-folder ]
Run a BLAST search of the PDB or AlphaFold Database with a protein structure chain as the query, using the web service hosted by the UCSF RBVI. The Foldseek (Similar Structures) tool provides a graphical interface to running this command.
The showTable option (default true) indicates whether to show the results in the Similar Structures tool, which facilitates exploring large sets of protein structures by efficiently showing them in 3D as backbone traces and in 2D as sequence alignment schematics or scatter plots based on conformation. The related blastprotein command and Blast Protein tool are different in that they can use sequence only as the query, not just a structure chain, and they show results in a different table that may contain several additional features of the hits (such as crystallographic resolution, ligand residue names, etc.); if a structure chain was used as the query, such results can be converted to a Similar Structures table with the command similarstructures fromblast.
A BLAST search may take several minutes, during which ChimeraX cannot be used for other tasks. A much faster (~100X) MMseqs2 search can be performed instead with the sequence search command, and a fast 3D structure search with the foldseek command.
The name assigned to a set of results (bl1 or bl2) is reported in the Log when the search is run.
Choices of database to search:
- pdb (default) – PDB
- afdb – AlphaFold Database
The hits can be limited with the following options:
- evalueCutoff max-evalue – maximum E-value of a hit to be retained (default 1e-3)
- maxHits N – maximum number of hits retained (default 1000)
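As a rough illustration of how the two limits could combine, here is a minimal Python sketch. It is not ChimeraX's implementation: the `Hit` class and `filter_hits` function are hypothetical names, and keeping the lowest-E-value hits when truncating to maxHits is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    hit_id: str    # e.g. "6cmi_B" for a PDB chain
    evalue: float  # E-value reported by the search


def filter_hits(hits, evalue_cutoff=1e-3, max_hits=1000):
    """Keep hits with E-value <= evalue_cutoff, then at most max_hits of them.

    Assumption: when there are too many hits, the ones with the lowest
    E-values are the ones retained.
    """
    kept = [h for h in hits if h.evalue <= evalue_cutoff]
    kept.sort(key=lambda h: h.evalue)
    return kept[:max_hits]
```

With the defaults, a hit with E-value 0.5 would be discarded and one with E-value 1e-5 retained.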
The trim and alignmentCutoffDistance options do not act immediately when the search is run. Instead, they specify how to process the hit structures if later opened from the Similar Structures tool or with the command similarstructures open. The structures are fetched from the respective database (Protein Data Bank or AlphaFold Database) and processed as follows:
- The trim option indicates deleting all of the following (if true), or none of them (if false), or a comma-separated list of:
  - chains – for PDB entries, chains other than the hit chain
  - sequence – N- and C-terminal segments of the hit chain that were not included in the sequence alignment returned by the search method
  - ligands – ligands, solvent, and ions > 3 Å from the hit chain
  The trim default is the current setting in the Similar Structures options, otherwise (if the tool is not open) true.
- The hit chain is superimposed onto the query chain by least-squares fitting the α-carbons of the paired residues and iteratively pruning far-apart pairs, as described for the align command. The alignmentCutoffDistance d is the pruning distance: only α-carbon pairs within this distance are used in the final fit (default as per the Similar Structures options, otherwise 2.0 Å).
Search results are saved in a similar structures file (suffix .sms, a JSON file format specific to ChimeraX) with filename based on the query name and the database searched. The file will be listed in the File History for easy access, and simply opening it loads the set of results into the Similar Structures interface. The saveDirectory option allows specifying the save location, either directly or as the word browse to specify it interactively in a file browser window (default location ~/Downloads/ChimeraX/BLAST/).
Convert results from a previous search with the Blast Protein tool or blastprotein command (with structure chain as query) to show in the Similar Structures table. If more than one set of Blast Protein results is open, the blast-results-name (as shown in its title bar, e.g. bp1 or bp2) can be given. The save option indicates whether to also save the converted results as a similar structures file (default true). The saveDirectory option allows specifying the save location, either directly or as the word browse to specify it interactively in a file browser window (default location ~/Downloads/ChimeraX/BLAST/).
Fetch and open a hit structure.
Although only one set of results can be shown in the Similar Structures table at a time (the default set), other sets may be open and available for analysis. Another set can be specified with fromSet set-name. The name of a set of results is reported in the Log when a search is run, and the names of open sets can be listed in the Log with similarstructures list. However, the only way to get a set of results that is open but not shown in the table is to use the showTable false option of the search command. Results are closed when a new set of results replaces them in the Similar Structures table, when the tool is closed, or when the command similarstructures close is used to close them.
The hit-ID for a structure from the PDB is a combination of the PDB ID and chain ID (for example, 6cmi_B), and for a structure from the AlphaFold Database, its UniProt accession number. The hit-ID is shown in the first column of the Similar Structures table.
By default, the opened structure will be Cα-aligned with the query using the sequence alignment provided by the search method, with fit iteration as described for the command align. The trim and alignmentCutoffDistance options are as described above. If a large number of structures are opened, it may be useful to omit them from the File History with inFileHistory false, and omit their descriptions from the Log with log false.
• similarstructures sequences [ fromSet set-name ] [ showConserved true | false ] [ conservedThreshold fraction ] [ conservedColor color-spec ] [ identityColor color-spec ] [ lddtColoring true | false ] [ order evalue | cluster | identity | lddt ]
Show a sequence plot of all hits in the specified set. This plot provides an overview of which parts of the query sequence were matched and the depth of coverage. Each row of pixels in the image represents one hit sequence, and the columns correspond to the residues of the query. The plot is white where there is no residue aligned with the query.
• similarstructures traces [ fromSet set-name ] [ ofStructures hit-ID(s) ] [ alignWith residue-spec ] [ alignmentCutoffDistance adist ] [ show all | close ] [ breakSegmentDistance bdist ] [ minSegmentResidues N ] [ distance dist ] [ maxSegmentDistance mdist ] [ replace true | false ]
Several options specify how to color the positions with aligned residues:
- with showConserved true:
- use the specified identityColor (default ) for residues of the same amino acid type as the query in a column where at least conservedThreshold fraction (default 0.5) of the residues have that type (and the column contains at least 10 residues)
- use the specified conservedColor (default ) for residues of the same amino acid type as the query, but not meeting the column criteria above
- use black for residues of a different amino acid type than the aligned query residue
- with lddtColoring true: by LDDT of each aligned residue in each structure (color key at LDDT values 0, 0.2, 0.4, 0.6, 0.8)
- with both of the above true, only the positions that would otherwise be black are colored by LDDT instead
The LDDT (local distance difference test) indicates the similarity of a hit residue to the aligned query residue in a neighborhood of 15 Å from the query residue α-carbon.
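The coloring rules for the sequence plot can be summarized in a short Python sketch. This is only an illustration of the rules as stated, not ChimeraX code: `classify_position` is a hypothetical function, the returned strings are labels rather than actual colors, and computing the conserved fraction over the non-gap residues in a column is an assumption.

```python
def classify_position(hit_aa, query_aa, column, *, show_conserved,
                      lddt_coloring, conserved_threshold=0.5):
    """Return a label for how one aligned hit residue would be colored.

    column: residue letters of all hits aligned at this query position
            (gaps excluded; assumption about how the fraction is counted).
    """
    if not show_conserved:
        # The case with both options false is not described in the text.
        return "lddt" if lddt_coloring else "unspecified"
    if hit_aa != query_aa:
        # Mismatches are black, unless LDDT coloring takes over.
        return "lddt" if lddt_coloring else "black"
    fraction = column.count(query_aa) / len(column) if column else 0.0
    if len(column) >= 10 and fraction >= conserved_threshold:
        return "identityColor"
    return "conservedColor"
```

For example, a residue matching the query in a column of 10 residues where 6 share that type would get identityColor, while the same match in a column of only 5 residues would get conservedColor.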
The sequences (rows in the plot) can be in order of:
- evalue – lowest to highest E-value (default, if clustering by coverage only gives one cluster, see below)
- cluster – grouping the sequences by which part of the query they cover (default, if clustering by coverage gives >1 cluster)
- identity – percent sequence identity compared to the query
- lddt – average LDDT over all residues in a hit structure
Reissuing the command with different coloring options does not recolor the already open plot. To change the coloring, first close the plot and then reissue the command with the desired options.
Display “licorice” (spaghetti-like) ribbons superimposed on the query structure for hits from the specified set of results, either all hits or a subset specified as a comma-separated list of hit identifiers with the ofStructures option. These traces are meant to give an overview of the variability of a large number of structures and their coverage of the query. They are calculated from α-carbons only and do not show helix and strand assignments.
• similarstructures cluster reference-residues [ fromSet set-name ] [ ofStructures hit-ID(s) ] [ alignWith residue-spec ] [ alignmentCutoffDistance adist ] [ clusterCount N | clusterDistance cdist ] [ colorBySpecies true | false ] [ replace true | false ]
Search results from the structure-based Foldseek method automatically include the α-carbon coordinates of the hits. Search results from sequence methods (MMseqs2 and BLAST) do not automatically include α-carbon coordinates, but trying to show traces will raise a dialog asking the user whether to fetch them, which could take several minutes. (Alternatively, similarstructures fetchcoords can be used beforehand to fetch the α-carbon coordinates for the structures of interest.) All of the hit structure α-carbons are loaded as a single atomic model, one chain per structure, with chain ID set to the database ID of the structure. The residue types of the hit are retained, but the residues are renumbered according to the paired residues of the query structure.
Each hit is superimposed on the query structure by fitting the α-carbons of the residues paired by the search method. The fitting is done as described for the align command, with iteration so that only α-carbon pairs within the specified alignmentCutoffDistance adist are used in the final fit (default as per the Similar Structures options, otherwise 2.0 Å). The alignWith option can be used to specify a subset of the query residues to use for superposition, instead of all paired positions.
With show close (default, where “close” means “nearby”), the traces are displayed as follows:
- The ribbon is broken into segments where two consecutive aligned α-carbons are > breakSegmentDistance bdist apart (default 5.0 Å).
- Ribbons are shown for ≥ minSegmentResidues N contiguous α-carbons within a segment (default 5) and within distance dist of the corresponding query α-carbons (default 4.0 Å).
- Ribbons are shown for entire segments in which every α-carbon is within maxSegmentDistance mdist of its counterpart (default 10.0 Å).
With show all, the traces are instead shown for all residues regardless of how far they are from the query structure.
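One possible reading of the show close display rules is sketched below in Python. It is an illustration under stated assumptions, not the ChimeraX implementation: `split_segments` and `shown_residues` are hypothetical names, the hit and query α-carbons are given as index-matched lists of (x, y, z) tuples after superposition, and "within" is taken to mean less than or equal to the distance.

```python
import math


def split_segments(hit_ca, break_dist=5.0):
    """Split residue indices into segments wherever two consecutive
    hit alpha-carbons are more than break_dist apart."""
    if not hit_ca:
        return []
    segments, current = [], [0]
    for i in range(1, len(hit_ca)):
        if math.dist(hit_ca[i - 1], hit_ca[i]) > break_dist:
            segments.append(current)
            current = []
        current.append(i)
    segments.append(current)
    return segments


def shown_residues(hit_ca, query_ca, break_dist=5.0, min_residues=5,
                   dist=4.0, max_seg_dist=10.0):
    """Return the set of residue indices whose ribbon would be shown."""
    shown = set()
    for seg in split_segments(hit_ca, break_dist):
        d = [math.dist(hit_ca[i], query_ca[i]) for i in seg]
        if all(x <= max_seg_dist for x in d):
            shown.update(seg)  # every residue is near: show whole segment
            continue
        run = []  # current stretch of contiguous residues within dist
        for i, x in zip(seg, d):
            if x <= dist:
                run.append(i)
                continue
            if len(run) >= min_residues:
                shown.update(run)
            run = []
        if len(run) >= min_residues:
            shown.update(run)
    return shown
```

In this reading, a segment that drifts far from the query is shown only along stretches of at least minSegmentResidues residues that stay within the distance cutoff.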
The replace option indicates whether to overwrite a pre-existing trace model (true, default). If false, an additional model will be generated.
Create a scatter plot based on backbone conformations of hits from the specified set of results, either all hits or a subset specified as a comma-separated list of hit identifiers with the ofStructures option.
• similarstructures ligands [ fromSet set-name ] [ ofStructures hit-ID(s) ] [ warn true | false ] [ rmsdCutoff rmsd ] [ alignmentRange range ] [ minimumPaired fraction ] [ combine true | false ]
Search results from the structure-based Foldseek method automatically include the α-carbon coordinates of the hits. Search results from sequence methods (MMseqs2 and BLAST) do not automatically include α-carbon coordinates, but trying to show clusters will raise a dialog asking the user whether to fetch them, which could take several minutes. (Alternatively, similarstructures fetchcoords can be used beforehand to fetch the α-carbon coordinates for the structures of interest.)
The plot is generated as follows:
- Each hit is superimposed on the query structure by fitting the α-carbons of the paired residues as described above.
- After fitting, the x,y,z coordinates of the hit α-carbons paired with the query reference-residues are concatenated into a vector. Any hits without a residue aligned to one or more of the reference residues will be omitted from the plot, so it is best to use a relatively small set (<30) of query residues with highly populated columns in the sequence alignment of hits to the query.
- The vector is projected to a point in two dimensions with UMAP (Uniform Manifold Approximation and Projection). When similarstructures cluster is first run, it may take a minute to install the large UMAP Python package umap-learn.
- The points (circles) in the plot may be clustered and colored according to the additional parameters described below.
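The vector-building step can be illustrated with a small Python sketch. This is a hedged example, not ChimeraX code: `conformation_vectors` is a hypothetical function, and the input layout (a dict mapping each hit-ID to the fitted α-carbon coordinates keyed by query residue number) is an assumption made for the example. The UMAP projection itself is not shown.

```python
def conformation_vectors(hits, reference_residues):
    """Build one coordinate vector per hit from its alpha-carbons paired
    with the query reference residues.

    hits: dict of hit_id -> {query residue number: (x, y, z)} holding the
          hit alpha-carbon coordinates after superposition on the query.
    Hits lacking a residue aligned to any reference residue are omitted.
    """
    vectors = {}
    for hit_id, ca_by_query_res in hits.items():
        if any(r not in ca_by_query_res for r in reference_residues):
            continue  # incomplete coverage of the reference residues
        vec = []
        for r in reference_residues:
            vec.extend(ca_by_query_res[r])  # append x, y, z
        vectors[hit_id] = vec
    return vectors
```

Each retained vector has three numbers per reference residue, which is why a small reference set with well-populated alignment columns keeps more hits in the plot.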
Two methods of clustering the points are available. Either one (but not both) of the following can be used:
- If clusterCount N is given (say 5 clusters), the k-means algorithm will be used to produce that number of clusters.
- If clusterDistance cdist is given (say 1.5 Å), points within that distance of each other in the 2D projection will be clustered together.
If neither option is used, clustering will not be done (the options do not have default values).
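The text does not say exactly which algorithm implements distance-based clustering. As one plausible interpretation, the Python sketch below groups 2D points by single linkage: any two points within the cutoff end up in the same cluster, directly or through a chain of neighbors. The function name `cluster_by_distance` is hypothetical, and the choice of single linkage is an assumption.

```python
import math


def cluster_by_distance(points, cluster_distance):
    """Label 2D points so that points within cluster_distance of each other
    (directly or via intermediate points) share a cluster label."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cluster_distance:
                parent[find(i)] = find(j)

    # Number clusters 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]
```

Unlike k-means with clusterCount, this approach does not fix the number of clusters in advance; the count follows from the chosen distance.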
Colors are chosen randomly. If the coloring is unpleasant, simply reissuing the command may give a better set of colors. By default (colorBySpecies false), the circles in the plot are colored by cluster, if clustering was done. With colorBySpecies true, they are colored by source species and the color corresponding to each species is reported in the Log.
The replace option indicates whether to overwrite a pre-existing cluster plot (true, default). If false, an additional plot will be generated.
Copy ligands, ions, and solvent molecules (nonpolymer residues) from the hit structures onto corresponding locations on the query structure. Either all hits from the specified set of results can be used, or a subset specified as a comma-separated list of hit identifiers with the ofStructures option. Copying the residues requires fetching the full coordinates of each structure. With warn true (default), a dialog will appear to ask the user whether to proceed with the fetch, which could take several minutes to complete.
Each ligand (ion, solvent) residue is evaluated for mapping onto the query structure, as follows:
- Protein residues within alignmentRange range (default 5.0 Å) of the ligand are identified.
- If at least minimumPaired fraction of those nearby protein residues are paired with query residues (default 0.5), the α-carbons of those pairs are fitted.
- If the resulting RMSD is ≤ rmsdCutoff rmsd (default 3.0 Å), the ligand is copied to the corresponding position relative to the query structure.
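The decision made for each ligand can be sketched in Python as follows. This is an illustration of the criteria only, not ChimeraX's implementation: `should_copy_ligand` is a hypothetical function, the residue identifiers are arbitrary, and the α-carbon fit is represented by a caller-supplied `fit_rmsd` callable rather than a real least-squares fit.

```python
def should_copy_ligand(nearby, pairing, fit_rmsd,
                       minimum_paired=0.5, rmsd_cutoff=3.0):
    """Decide whether a ligand would be copied onto the query.

    nearby:   hit protein residues within alignmentRange of the ligand
    pairing:  dict of hit residue -> paired query residue
    fit_rmsd: callable taking a list of (hit, query) residue pairs and
              returning the RMSD of fitting their alpha-carbons
              (placeholder for the actual fit)
    """
    if not nearby:
        return False
    pairs = [(r, pairing[r]) for r in nearby if r in pairing]
    if len(pairs) / len(nearby) < minimum_paired:
        return False  # too few nearby residues are paired with the query
    return fit_rmsd(pairs) <= rmsd_cutoff
```

For instance, with four nearby residues of which two are paired and a fit RMSD of 1.2 Å, the ligand would be copied under the default settings.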
The number of residues copied and their residue types are reported in the Log. Often thousands of water molecules, ions, and crystallization adjuvants are found, and they can be hidden to get a better view of more interesting ligands.
With combine true (default), the copied ligand, ion, and solvent residues are loaded as a single atomic model, in which the chain ID of a residue is generated from the PDB ID and chain ID of its source structure (e.g., 2cml_B). Pausing the cursor over a residue in the graphics window shows its name and chain ID in a pop-up balloon. With combine false, a separate model is generated for each hit with mappable residues, containing the residues in their mapped positions with their original chain IDs.
• similarstructures scrollto hit-ID [ fromSet set-name ]
Scroll to the row for a specified hit in the Similar Structures table and highlight that row.
• similarstructures pairing model-spec [ fromSet set-name ]
Draw pseudobonds between the paired α-carbons of the (previously opened) atomic model of the hit and the query structure.
• similarstructures seqalign model-spec [ fromSet set-name ]
Show the sequence alignment between the (previously opened) atomic model of the hit and the query structure in the Sequence Viewer.
• similarstructures list
List the names of currently available sets of search results in the Log. The name is reported in the Log when the results are generated. Although only one set of results can be shown in the Similar Structures table at a time, additional sets may be open and available for analysis with similarstructures commands. However, the only way to get a set of results that is open but not shown in the table is to use the showTable false option of the search command.
• similarstructures close set-name
Close a specified set of results. The names of currently open sets of results can be listed in the Log with similarstructures list. Results are also closed when a new set of results replaces them in the Similar Structures table or when the tool is closed.