OTUs for amplicon analysis

Overarching Goal

  • This tutorial will contribute towards an understanding of microbial amplicon analysis

Learning Objectives

  • Define operational taxonomic units (OTUs)

  • Familiarize the class with different types of OTU assignment philosophies
    • phylotyping
    • closed reference
    • open reference
  • Execute a QIIME workflow, and understand the separate steps in the workflow

  • Align sequences, assign taxonomy, and build a tree with representative sequences from OTU definitions

2.1 Picking Operational Taxonomic Units, OTUs.

2.1.1 Preamble

Picking OTUs is sometimes called “clustering,” as sequences with some threshold of identity are “clustered” together into an OTU.

Important decision: Should I use a de novo method of picking OTUs, a reference-based method, or some combination? (Or neither?)

The answer to this will depend, in part, on what is known about the community a priori.

Reference Based OTU Picking

For instance, a human or mouse gut bacterial community will have lots of representatives in well-curated 16S databases, simply because these communities are relatively well-studied.

Therefore, a reference-based method may be preferred.

The limitation is that any taxa that are unknown or previously unidentified will be omitted from the community.

De Novo Based OTU Picking

As another example, a community from a lesser-known environment, like Mars, a cave, or another relatively unexplored habitat, will have fewer representative taxa in even the best databases, so de novo clustering may recover more of the community. However, de novo clustering is more computationally intensive, and exhaustive de novo clustering (a pairwise distance matrix followed by hierarchical clustering) does not scale (e.g., 1000 sequences require on the order of 1000x1000 comparisons).
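To get a feel for that scaling, here is a quick back-of-the-envelope calculation; n(n-1)/2 is the number of unique pairwise comparisons an all-against-all distance matrix requires:

```shell
# The number of unique pairwise comparisons for exhaustive de novo
# clustering grows quadratically with the number of sequences.
for n in 1000 10000 100000; do
  echo "$n sequences -> $(( n * (n - 1) / 2 )) unique pairwise comparisons"
done
```

At 100,000 sequences you are already near five billion comparisons, which is why greedy heuristic clusterers like uclust are used instead.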


Open Reference Based OTU Picking

The third option is to use a reference database but to set aside any sequences that do not have good matches to that database, and then to cluster those de novo.

For this tutorial, we are going to use an OTU-picking approach that uses a reference to identify as many OTUs as possible, but also includes any “new” sequences that do not hit the database.

It is called “open reference” OTU picking, and you can read more about it in this paper by Rideout et al. Discussion of the workflow by the QIIME developers is here.

We use the QIIME workflow command: pick_open_reference_otus.py for this step. Documentation is here. The default QIIME 1.9.1 method for OTU picking is uclust, which we use here for simplicity’s sake. However, we encourage you to explore different OTU clustering algorithms to understand how they perform. They are not created equal.

2.1.2 OTU picking

This next step will take about 45 minutes. Note: Use SGE_Batch instructions!

The link explains SGE_Batch on the CGRB infrastructure in more detail.

Before continuing, make sure you are in the “EDAMAME_16S” directory.

cd ~/EDAMAME_16S/

One way to speed up this program, which we know will take a long time, is to use a parallel option so it runs on multiple cores at the same time. For example, instead of running the program on just my Mac, I could ask it to run simultaneously on several Macs. This is how the infrastructure works: multiple computers are linked together and can subdivide your tasks into multiple simultaneous runs, which is called running jobs in “parallel”. Not every script or program has this option, so it is always good to check when you know that a program will take a long time to run.

We are going to run the script in “parallel” using a “batch” call facilitated by the SunGridEngine(SGE) scheduler built into the infrastructure.

pick_open_reference_otus.py --help

Scroll down to the descriptions of the options -a and -O

We are going to add -a -O ### (where ### is a number of jobs) to our batch command.

How do we determine what a good number for -O is?

For example, we can’t ask for 100 jobs when we only have 64 “cores” available; it will not work correctly.

Let’s find out how many cores the class has available to it.
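On a standard SGE installation, the qhost utility lists each execution host along with its core count (the NCPU column). The exact command available to you depends on how the CGRB cluster is set up, so treat this as a sketch:

```shell
# List SGE execution hosts and their core counts (NCPU column).
# Falls back to a message if the SGE client tools are not on this machine.
qhost 2>/dev/null || echo "qhost not available (SGE client tools not installed)"
```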




We have multiple machines containing multiple cores at our disposal. For the sake of argument, let’s use 4 cores each, so as not to be too greedy, and make sure we all get our runs done quickly. The batch caller will pick among the three machines (watson, franklin, or mcclintock) for the “-q teaching” option below. If you want to specify a machine, you can do so by changing the option to “-q teaching@mcclintock”, for example.

SGE_Batch -c 'pick_open_reference_otus.py -a -O 4 -i combined_seqs.fna -r /ACTF/mg_s17/data/share/97_otus.fasta -o uclust_openref/ -f' -P 4 -q teaching -r PICKOR

In the above script:

  • We tell QIIME to look for the input file with -i, “combined_seqs.fna”.
  • We specify that output files should go in a new folder, uclust_openref/ (-o).
  • We tell the program to overwrite already-existing files in the folder if we are running this program more than once (-f).
  • The reference file, which is provided with the QIIME packages for python, is designated by -r.

{NOTE: the default reference file was moved to our shared directory so we could take a look at it later, but usually you can ignore the -r option when using this command on the research side of the infrastructure.}

Other default parameters of interest:

  • Singletons are removed from the OTU table (the --min_otu_size flag, which defaults to 2)
  • Alignment is performed with PyNAST
  • Taxonomy is assigned with uclust
  • Default workflow values can be changed using a parameter file
  • We do not perform prefiltering, as per the recommendations of Rideout et al.
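A QIIME parameters file is plain text with one script_name:option value entry per line, passed to the workflow with -p to override the defaults. For example (hypothetical values, shown only to illustrate the format):

```
pick_otus:similarity 0.97
align_seqs:min_percent_id 75.0
```

You would then add -p qiime_parameters.txt to the pick_open_reference_otus.py call.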

To watch the progress of your batch file, there are two ways to monitor the program.

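The first is qstat, the standard SGE status command (assuming the SGE client tools are on your path):

```shell
# Show your pending and running jobs, with their JOBID and state columns.
# Falls back to a message if the SGE client tools are not on this machine.
qstat 2>/dev/null || echo "qstat not available (SGE client tools not installed)"
```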

This will show whether the job is waiting (“qw”), in transit to the machine (“t”), or running (“r”). If you need to stop the program, from waterman type “qdel JOBID”, using the number assigned by qstat and appended to the front of all of your log files. It is also possible to use “qdel PICKOR” in this case, but be careful when doing this if you have multiple jobs with a similar name. Using the JOBID is the safest method.

The log directory we created “PICKOR” will show the details of the job.

tail *.sh

The “.sh” file shows the command that we ran and the path settings used.

tail *.e*

The “.e” file keeps a record of what would be considered STDOUT to screen if you were running the program interactively. This is the first place to look for clues to failed runs, such as typos in the command line.

tail *.o*

Occasionally, some programs will have an output to STDERR instead of STDOUT that describes errors encountered during your run. This is program dependent. If no STDERR is written by your program of choice, then the .o file will contain the start and finish times of your batch call. This is useful information when planning a large job.

To read more about how to successfully input commands to batch, please refer to the CGRB wiki page on Submitting jobs to the CGRB infrastructure.


  • Batch calls can only be made from “crick” on the teaching side or “waterman” from the research side.
  • To delete a job that is in error, type qdel and the job number shown by the qstat, or the log file name {e.g. qdel PICKOR}.
  • If you are rerunning the job after making a change, it is okay to reuse the log directory name (choose option “d” if prompted about the existing directory).

2.1.3 Exploring results from OTU picking workflow

In uclust_openref/, we can see several new directories and files. Let’s explore them, starting with the “step1..” directories. As the documentation for pick_open_reference_otus.py explains:

  1. Step 1 picks OTUs based on a reference database, producing a file of successfully clustered OTUs and a file of sequences that failed to cluster based on the reference database.
  2. Step 2 performs computationally expensive de novo clustering for a subset of the failed sequences from step 1, and picks a representative sequence from each new cluster to add to the original database.
  3. Step 3 picks OTUs from all of the failed sequences, not just the subset used in step 2, based on the new database generated (of ref+ de novos) in step 2.
  4. Step 4 performs de novo clustering on all of the sequences that still failed to cluster in step 3.

A great overview of the steps of the open-reference process is provided by Figure 1 of Rideout et al. 2014 [which can be found in your Box Readings Directory].


If you navigate into one of the “step” directories, you will see a series of output files, including representative sequences of OTU clusters (“rep_set”). Take your time to explore these files using the head or less commands. Then, navigate back to the uclust_openref directory.

What are the other directories? The open-reference OTU picking also automatically takes the next steps towards building the OTU table. The pynast_aligned_seqs directory and the uclust_assigned_taxonomy directory each have outputs and logs from alignments and taxonomic assignment, respectively. Notice that the directories are named so that the algorithm/tool used to perform the task is provided (e.g., pynast was used for alignment, uclust was used for taxonomy). Very smart!

Navigate into the pynast_aligned_seqs directory. There are four files waiting there: one file of sequences that failed to align, one of sequences that did align, one of “pre-alignment” files, and a log file. Inspect each.

If you want to build a tree with some other out-of-QIIME software, this is where you would find the rep_set alignment. The log file provides a rep-sequence by rep-sequence report. If you needed to align sequences as an independent step, you would use align_seqs.py; documentation is here.
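As a sketch of that standalone step (assuming a QIIME 1.9 environment; -i and -o are the script’s input fasta and output directory options):

```shell
# Re-align representative sequences as a standalone step; PyNAST is the
# QIIME 1.9 default aligner. Prints a message instead if QIIME is absent.
if command -v align_seqs.py >/dev/null 2>&1; then
  align_seqs.py -i rep_set.fna -o pynast_realigned/
else
  echo "align_seqs.py not found (QIIME not installed on this machine)"
fi
```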

How many failed alignments were there?

count_seqs.py -i rep_set_failures.fasta

We see that there were 690 rep. sequences that failed to align, and approximately 22,206 that did. (Also, notice what short-read alignments generally look like...not amazing).

Sanity check? If you like, BLAST the top sequence that failed to align to convince yourself that it is, indeed, a pitiful failure.
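To grab the first failed record (header plus sequence) for pasting into NCBI web BLAST, from inside pynast_aligned_seqs/ (this assumes one sequence per line, which is how QIIME writes these fasta files):

```shell
# Print the first FASTA record (2 lines: header + sequence) from the
# failures file; prints a message instead if the file is not here.
head -n 2 rep_set_failures.fasta 2>/dev/null || echo "rep_set_failures.fasta not found in this directory"
```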

If, in the future, you ever have a large proportion of rep seqs that fail to align, it could be due to:

  • Hooray! These are novel organisms! (But, think about the novelty of the habitat before jumping to conclusions.)
  • The alignment parameters were too stringent for short reads, causing “real” sequences to fail alignment.
  • The paired-end merger algorithm (e.g., join_paired_ends.py) did not do a perfect job, and concatenated ends that do not belong together.
  • Some combination of the above, as well as some other scenarios.

The failed-to-align sequences are filtered automatically by this QIIME OTU-picking workflow (really, the entire OTU cluster that they represent is removed); the filtered results are in the otu_table_mc2_w_tax_no_pynast_failures.biom file. It’s generally good to use this filtered table; however, we will be using the non-filtered table, otu_table_mc2_w_tax.biom, for the rest of the workflow.

Move up a directory and then cd into the uclust_assigned_taxonomy directory.

less rep_set_tax_assignments.txt

In the “taxonomy” directory, you will find a log file and the specific taxonomic assignments given to each representative sequence, linked to the OTU ID of that representative sequence.
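The layout is tab-separated: the OTU ID, then the semicolon-delimited taxonomy string, then a confidence value. Here is a toy file in that layout (made-up assignments) plus a one-liner that tallies OTUs per phylum; the same pipeline works on the real rep_set_tax_assignments.txt:

```shell
# Build a toy taxonomy-assignment file (tab-separated, made-up data,
# mimicking the layout of rep_set_tax_assignments.txt).
printf 'denovo0\tk__Bacteria; p__Firmicutes; c__Bacilli\t1.00\ndenovo1\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria\t0.67\ndenovo2\tk__Bacteria; p__Firmicutes; c__Clostridia\t1.00\n' > toy_tax_assignments.txt

# Tally OTUs per phylum (second semicolon-delimited field of column 2):
cut -f2 toy_tax_assignments.txt | awk -F'; ' '{ print $2 }' | sort | uniq -c
```

This counts two Firmicutes OTUs and one Proteobacteria OTU in the toy data.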

2.1.4 Other Very Important Files in uclust_openref/

less -S final_otu_map.txt

Explore this file. It links the exact sequences in each sample to their OTU ID. You should see an OTU ID (starting with the number of the first OTU that was picked) in the left-most column. After that number, there is a list of sequence IDs that have been clustered into that OTU ID. The first part of each sequence ID is the SampleID from which it came, and the second part is the sequence number within that sample.
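To make that layout concrete, here is a toy map in the same tab-separated shape (made-up IDs) plus an awk one-liner that counts how many sequences landed in each OTU; the same command works on the real final_otu_map.txt:

```shell
# Build a toy final_otu_map.txt: OTU ID in column 1, then the IDs of
# every sequence clustered into that OTU (SampleID_SequenceNumber).
printf '0\tC01_1\tC01_7\tC02_3\n1\tC03_2\n' > toy_otu_map.txt

# Sequences per OTU = number of tab-separated fields minus the OTU ID:
awk -F'\t' '{ print $1, NF - 1 }' toy_otu_map.txt
```

In the toy data, OTU 0 contains 3 sequences and OTU 1 is a singleton.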


You will notice that some files have “mc2” appended to them. “mc2” designates that the minimum count of sequences for any particular OTU was 2. In other words, that file has singleton OTUs already filtered out.

Sanity check: How can you compare the OTUs in the full dataset versus the singletons-omitted dataset?

Biom-formatted OTU tables

These tables have the extension “.biom”. There are lots of important resources for understanding and working with BIOM-formatted tables, which were developed to handle very large, sparse datasets like those for microbial communities. There are several versions: some omit singletons (mc2), some contain taxonomic assignments of OTUs (w_tax), and some omit alignment failures (no_pynast_failures). BIOM-formatted tables are binary files, so looking at them with less or more won’t be informative.
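One way to inspect a BIOM table from the command line is the biom utility that ships with the biom-format package QIIME installs. This is a sketch to run where QIIME is available:

```shell
# Summarize per-sample sequence counts and totals from a BIOM table,
# then peek at the summary; prints a message if 'biom' is unavailable.
if command -v biom >/dev/null 2>&1; then
  biom summarize-table -i otu_table_mc2_w_tax.biom -o otu_table_summary.txt
  head otu_table_summary.txt
else
  echo "biom command not found (biom-format package not installed here)"
fi
```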

Representative sequences (one from each OTU)

less rep_set.fna

This is not an alignment, but the list of representative sequences used to assign taxonomy to the OTU, to make the alignment, and to build the tree.

Phylogenetic tree

less rep_set.tre

You can import this tree into any tree-visualization software that accepts the .tre extension (Newick format). This is made from an alignment of representative sequences (in the pynast directory). The OTU ID is given first, and then the branch length to the next node. This format is generally compatible with other tree-building and viewing software. For example, I have used it as input to the Interactive Tree of Life to build visually appealing figures. Topiary Explorer is another option for visualization, and is a QIIME add-on.
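If you have never looked at Newick before, here is a minimal, made-up tree in the same format: tip labels (standing in for OTU IDs) each followed by a branch length, with parentheses grouping nodes:

```shell
# Write and display a three-tip Newick tree; rep_set.tre uses the same
# label:branch_length syntax, just with thousands of OTU IDs.
echo "((OTU_1:0.1,OTU_2:0.2):0.05,OTU_3:0.3);" > toy.tre
cat toy.tre
```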

Log files

Open them up! You will be delighted! They have all of the information you ever wanted to know about the parameters and tools you’ve just used in your workflow analysis!

Hint: most of this information is needed when writing the methods sections of manuscripts using sequencing data.

Congratulations! You just had the QIIME of Your Life!


Resources and help


  • QIIME offers a suite of developer-designed tutorials.
  • Documentation for all QIIME scripts.
  • There is a very active QIIME Forum on Google Groups. This is a great place to troubleshoot problems; responses often come back within a few hours!
  • The QIIME Blog provides updates like bug fixes, new features, and new releases.
  • QIIME development is on GitHub.
  • Remember that QIIME is a workflow environment, and the original algorithms/software that are compiled into QIIME must be referenced individually (e.g., PyNAST, RDP classifier, FastTree etc...)

Biom format

Authored by Ashley Shade, with contributions by Sang-Hoon Lee, Siobhan Cusack, Jackson Sorensen, and John Chodkowski.
Adapted by Adelaide Rhodes, Center for Genome Research and Biocomputing for Metagenomics Workshop I, 2017

EDAMAME tutorials have a CC-BY license. Share, adapt, and attribute please!