BIT 815, Deep Sequencing Data Analysis
Course Materials, Spring 2013
HandyBashShellTricks.txt is a text file of useful
tricks for accomplishing specific tasks with Bash utilities. The file has Unix line endings, so commands
can be pasted directly into a Linux or Mac command line to see how they work.
A slide presentation with an overview of topics covered during the course is here.
Friday 1 March - Introduction to Linux and the command-line interface
Introductory slides present the course objectives in the first class session,
and a summary of Chapter 1 from Eric Raymond's book
The Art of Unix Programming, available here, is used as a framework for
discussing differences between the Linux command-line interface and graphical interfaces.
File globbing and regular expressions provide a basis for discussion of abstraction
and generalization as key parts of computational thinking; more information
on regular expressions is also available at The Linux Documentation Project webpage.
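To make the contrast between globbing and regular expressions concrete, here is a minimal sketch in Python using the standard `fnmatch` and `re` modules. The file names are invented for illustration and are not part of the course data; the shell itself applies glob patterns directly, so this only mimics that behavior.

```python
import fnmatch
import re

# Hypothetical file names, invented for this illustration.
names = ["sample1_R1.fastq", "sample1_R2.fastq", "sample10_R1.fastq", "notes.txt"]

# Glob pattern: '?' matches exactly one character and '*' matches any run of characters.
glob_hits = fnmatch.filter(names, "sample?_R1.fastq")

# Regular expression: '\d+' matches one or more digits and '\.' a literal dot.
# In a regex, '.' and '*' mean different things than they do in a glob.
regex = re.compile(r"sample\d+_R1\.fastq")
regex_hits = [name for name in names if regex.fullmatch(name)]

print(glob_hits)   # ['sample1_R1.fastq']
print(regex_hits)  # ['sample1_R1.fastq', 'sample10_R1.fastq']
```

The glob misses `sample10_R1.fastq` because `?` stands for a single character, while the regular expression generalizes to any number of digits.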
Instructions for setting up an Amazon Web Services account,
and preparing and running a Cloudbiolinux
instance on the Amazon Web Services Elastic Compute Cloud (AWS-EC2) are also available. The computers in
the teaching lab are already set up in this way; this information is provided for those who want to have
the same software installed on another computer.
A set of sample paired-end Fastq sequence files and a sample SAM alignment file are in smallfiles.zip.
The FileGlobbing.pdf and RegularExpressions.pdf documents
provide information on these pattern-matching tools.
Analysis of a SAM alignment file using command-line Linux tools is described in SAMformatAndCLtools.pdf.
document has information on how to log into the Amazon Web Services console, and how to start a cloud computing 'instance'.
This document also includes some exercises on the Linux command line, and some exercises with fastx-toolkit and FastQC programs.
Details on the FastQC output are provided in the FastQC_details.pdf document, or through
the FastQC website at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/.
The Unix Tutorial for Beginners site has a set of eight tutorials
that progress through use of different shell commands - these exercises provide practice in using shell commands,
to follow up on the Software Carpentry material. Another introductory
Unix tutorial is available through
the http://www.molecularevolution.org website. This website also provides a number of other learning resources, available from the home page
by moving the cursor over the "resources" heading in the upper right of the homepage, then moving the cursor over the
"learning activities" item in the drop-down menu that opens.
Monday 11 March and Friday 15 March – data exploration & QC; error detection; experimental design issues
Resources: Wikipedia has information on FASTA and
FASTQ sequence formats, and information on the
Sequence Alignment and Mapping (SAM) format is available here
and in the SAM format specification. The
fastx-toolkit webpage has information about the fastx-toolkit
package of programs for quality control and manipulation of FASTA and FASTQ files, and the
FastQC webpage has information about the FastQC
program. Follow links on those webpages to find more documentation about the functions each program provides.
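As a rough illustration of the FASTQ format, the sketch below reads four-line records and converts the quality string to Phred scores. It assumes the simple four-lines-per-record layout and the Sanger (Phred+33) encoding; older Illumina files may use a different offset, and the function name `parse_fastq` and the example read are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines.

    Assumes four lines per record and Phred+33 (Sanger) quality encoding.
    """
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Each quality character encodes a Phred score as its ASCII code minus 33.
        yield header[1:], seq, [ord(char) - 33 for char in qual]


# Invented single-record example.
record = ["@read1", "ACGT", "+", "II5!"]
for name, seq, quals in parse_fastq(record):
    print(name, seq, quals)  # read1 ACGT [40, 40, 20, 0]
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so the score of 40 above means roughly one expected error in 10,000 base calls.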
- The Sequence Alignment/Map format and SAMtools. Li et al,
Bioinformatics 25:2078-9, 2009 is the original paper on the SAM format.
- The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Cock et al,
Nucleic Acids Res. 38: 1767–1771, 2010
is the only formal publication I know of about the different versions of the FASTQ sequence format, and it is not as up-to-date as the
- GemSIM: general, error-model based simulator of next-generation sequencing data.
McElroy et al, BMC Genomics 13: 74, 2012
describes software for simulation of sequence data that is useful for testing effects of error frequency on alignment and assembly.
- NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets. Breese and Liu,
Bioinformatics 29: 494-496, 2013
describes a set of software tools for managing the process of data QC and format conversion, including tools for filtering
datasets of paired-end reads to find single reads where the paired-end read was removed by a quality-filtering step.
- AWS remote login and instance management; S3 bucket as a source of data for exercises
- Shell tutorial (Unix Tutorial for Beginners)
- Data structure and formats, exploratory analysis
- With fastx-toolkit and FastQC
- With command-line tools - grep, awk, cut, sort, uniq (revisit the SAMformatAndCLtools.pdf document).
- Using GemSIM to create simulated datasets of reads with different insert size and end orientation - download GemSIM.sh script here.
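The kind of summary usually produced on the command line by filtering a SAM file, cutting out one column, and then sorting and counting unique values can also be sketched in Python. This is an illustration only, not the procedure in SAMformatAndCLtools.pdf; the function name and the alignment lines are invented, and it relies on the SAM conventions that header lines start with `@`, column 2 is the bitwise FLAG, and column 3 is the reference name.

```python
from collections import Counter


def count_references(sam_lines):
    """Count mapped alignments per reference name (SAM column 3, RNAME)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 4:                  # bit 0x4 set: read is unmapped
            continue
        counts[fields[2]] += 1
    return counts


# Invented example records with the 11 mandatory SAM fields.
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr2\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
counts = count_references(sam)
for reference, n in sorted(counts.items()):
    print(reference, n)
```

With these four records the sketch reports two alignments on chr1 and one on chr2, and skips the unmapped read r4.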
Monday 18 March and Friday 22 March
Library construction and experimental design for genome and transcriptome assembly, read simulation with GemSIM,
read mapping with bwa, and read assembly with velvet.
A lecture outline is here.
A text file listing things to do in class is here.
Friday 22 March
FLASH: Fast Length Adjustment of Short Reads to improve genome assemblies
(PubMedCentral) describes a software tool for joining paired-end reads obtained from DNA fragments short
enough that the reads overlap at the ends. This is reported to improve the quality of assemblies created from the
An outline of an exercise with the FLASH assembler is available: FLASH_exercise.docx
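The idea of joining overlapping paired-end reads can be sketched as follows. This is a deliberately simplified illustration, not the FLASH algorithm: it accepts only exact matches in the overlap, ignores quality scores, and picks the longest overlap it finds. The function names, the `min_overlap` default, and the example fragment are all assumptions made for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Join a read pair whose ends overlap exactly; return None if they do not.

    read2 is assumed to come from the opposite strand, so it is
    reverse-complemented before the overlap search.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for size in range(longest, min_overlap - 1, -1):
        if read1[-size:] == mate[:size]:
            return read1 + mate[size:]
    return None


# Invented 28-base fragment sequenced with 20-base reads from each end,
# so the two reads overlap by 12 bases.
fragment = "ACGTTGCAAGGCTTACCGATGCATTAGC"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2)
print(merged == fragment)  # True
```

Real tools must tolerate sequencing errors in the overlap and choose among competing overlaps, which is where most of the difficulty lies.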
GAGE: A critical evaluation of genome assemblies and assembly algorithms
Publisher Web Site describes a set of experiments comparing different assembly programs on four genomes, and
provides useful insights into the challenges of genome assembly.
Comparison of requirements for assembly with ABySS
and Velvet, and comparison of assembly to whole-genome alignment with MUMmer v.3.
Exercises with ABySS are here.
Monday 25 March
Assembly quality metrics and Assemblathon-1: Outline and notes
SAM format definition and Tablet assembly viewer.
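One widely used assembly quality metric is N50, and it is assumed here to be among those covered in the outline. The sketch below computes it from a list of contig lengths; the lengths are invented for illustration.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Invented contig lengths: total 300, and the two longest contigs
# (100 + 80 = 180) already cover more than half of it.
print(n50([100, 80, 60, 40, 20]))  # 80
```

N50 rewards contiguity but says nothing about correctness, so a misassembled genome can still score well on this metric.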
Friday 29 March
No Class - Spring Holiday
Monday, Apr 1 and Friday, Apr 5
Aligning short reads to a reference genome sequence is the first step after quality control in many types of
sequence data analysis pipelines. BWA and
Bowtie2 are two of many alternative programs
available for mapping reads; we will also discuss programs that are specialized for mapping RNA-seq reads to genomic
DNA sequence when we discuss RNA-seq in a few weeks.
For this week, the focus will be on discovery of DNA sequence variants, including single-nucleotide polymorphisms (SNPs),
copy number variants (CNVs), structural variants (SVs) such as translocations or inversions,
and insertion-deletion events (indels).
- A survey of tools for variant analysis of next-generation genome sequencing data. Pabinger et al,
Brief Bioinform doi: 10.1093/bib/bbs086, 2013
- Genome structural variation discovery and genotyping. Alkan et al,
Nature Reviews Genetics 12:363-76, 2011
- Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data. Liu et al,
BMC Genomics 13(Suppl 8):S8, 2012.
- The variant call format and VCFtools. Danecek et al,
Bioinformatics 27: 2156–2158, 2011
- VCF (Variant Call Format) version 4.1 format specification page. The 1000 Genomes Project.
- PolyCat: A Resource for Genome Categorization of Sequencing Reads From Allopolyploid Organisms. Page et al,
G3 3:517–525, 2013.
- Mismatch and indel tolerance; speed vs accuracy vs completeness; handling repetitive sequences
- Variant discovery from aligned reads – SNPs, small indels, large rearrangements, and copy number variation
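To show how small variants are distinguished in a VCF file, the sketch below labels each alternate allele by comparing the lengths of the REF and ALT fields. It is a simplified illustration with invented records: it reads only the first five tab-separated columns and lumps symbolic alleles (such as those used for large structural variants) into an "other" class.

```python
def classify_variant(ref, alt):
    """Label one REF/ALT pair as SNP, MNP, insertion, deletion, or other."""
    if not set(ref + alt) <= set("ACGTN"):
        return "other"            # symbolic or missing allele, not handled here
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def classify_vcf(lines):
    """Return (chrom, pos, ref, alt, label) for every alternate allele."""
    results = []
    for line in lines:
        if line.startswith("#"):  # meta-information and column header lines
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            results.append((chrom, int(pos), ref, alt, classify_variant(ref, alt)))
    return results


# Invented example records.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr5\t100\t.\tA\tG\t50\tPASS\t.",
    "chr5\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr5\t300\t.\tC\tCGG,T\t50\tPASS\t.",
]
calls = classify_vcf(vcf)
for call in calls:
    print(call)
```

The third record carries two alternate alleles, so it yields both an insertion and a SNP at the same position.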
Monday, Apr 8 and Friday, Apr 12
ChIP-seq analysis: Chromatin immunoprecipitation to detect protein binding sites and DNA modification. A brief introduction is
presented in the IntroToChIPseq.pdf document.
- ChIP-seq: advantages and challenges of a maturing technology. Park,
Nat Rev Genet. 10:669-80, 2009
- ChIP-seq and Beyond: new and improved methodologies to detect and characterize protein-DNA interactions.
Furey, Nat Rev Genet 13: 840–852, 2012
- MuMoD: a Bayesian approach to detect multiple modes of protein–DNA binding from genome-wide ChIP data. Narlikar,
Nucleic Acids Res 41:21–32, 2013
- Systematic evaluation of factors influencing ChIP-seq fidelity. Chen et al.,
Nat Methods 9: 609–614, 2012.
The on-line supplementary data file for this publication contains a lot of additional information,
and is worth reading.
- Introduction to the Galaxy workspace environment: Tutorial_ChIP-seqOnGalaxy.pdf
- Introduction to R and Bioconductor: An R script to install Bioconductor installBioC.R;
slides from Thomas Girke at UC-Riverside with an exercise in analyzing ChIP-seq data in R using Bioconductor tools R-ChIPseq.pdf
Monday, Apr 15 and Friday, Apr 19
RNA-seq experiments are growing in popularity as a means of characterizing the transcriptome, detecting alternative
splicing events, and measuring differences in gene expression between samples of different types. The importance of
good experimental design in conducting RNA-seq experiments is emphasized in the first paper in the recommended reading,
by Auer and Doerge. Any experiment in which differences between samples are to be interpreted in a biological context
should take seriously the need for good experimental design. The most reliable conclusions will result from a
well-replicated design in which the experimental treatments are orthogonal to nuisance factors.
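A quick way to check a proposed design is to tabulate how samples fall across treatments and a nuisance factor such as sequencing lane. The sketch below tests for equal counts in every treatment-by-lane cell, which is a simple sufficient condition for the two factors to be orthogonal. The function name, factor levels, and sample layouts are invented for illustration.

```python
from collections import Counter
from itertools import product


def is_balanced(samples):
    """Return True when every treatment occurs equally often in every nuisance level.

    samples is a list of (treatment, nuisance_level) tuples, one per library.
    """
    counts = Counter(samples)
    treatments = {treatment for treatment, _ in samples}
    levels = {level for _, level in samples}
    cell_counts = {counts[(t, n)] for t, n in product(treatments, levels)}
    return len(cell_counts) == 1 and 0 not in cell_counts


# Confounded layout: all controls on one lane, all treated samples on another,
# so a lane effect cannot be separated from the treatment effect.
confounded = [("control", "lane1")] * 3 + [("treated", "lane2")] * 3

# Balanced layout: each lane holds one control and one treated sample.
balanced = [(t, lane) for lane in ("lane1", "lane2", "lane3")
            for t in ("control", "treated")]

print(is_balanced(confounded))  # False
print(is_balanced(balanced))    # True
```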
- Statistical design and analysis of RNA sequencing data. Auer and Doerge,
Genetics 185(2): 405–416, 2010 has useful recommendations regarding experimental design for RNA-seq experiments.
- A comparison of methods for differential expression analysis of RNA-seq data. Soneson and Delorenzi,
BMC Bioinformatics 14:91, 2013.
- A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.
Dillies et al, Brief Bioinform doi: 10.1093/bib/bbs046, 2012.
- Systematic comparison of RNA-Seq normalization methods using measurement error models. Sun and Zhu,
Bioinformatics 28:2584-2591, 2012.
- STAR: ultrafast universal RNA-seq aligner. Dobin et al,
Bioinformatics 29:15-21, 2013
- Introduction to R:
slides from a UC-Riverside short course, with accompanying R-script of exercises.
Thomas Girke of UC-Riverside and his co-workers have assembled an extensive manual on using R and Bioconductor
for analysis of deep sequencing data.
- RStudio is an integrated development environment (IDE) for R, with versions available for Windows, Mac OS X, and Linux. This shell script installs RStudio on an Ubuntu EC2 instance.
Download the script using a browser from an EC2 instance, make the script executable, and run it to install RStudio. The program icon will appear in the Applications menu, under the Programming heading.
- RNA-Seq: example analysis using Arabidopsis reads aligned to chr 5 - download RNAseq.sh, installBioC.R, and
RNAseqScript.R to an EC2 instance and execute the shell script - it invokes the two R scripts. The three scripts together describe the process of downloading data from the course S3 bucket,
followed by a series of steps:
- Scripting – combining the shell and R commands into scripts that execute the RNA-seq analysis as a pipeline
- CreatingRNAseq_SampleDataset.pdf describes the process of creating SAM files
of alignments of RNA-seq reads from 3 control samples and 3 treated samples to Arabidopsis chromosome 5. This illustrates uses of several
software tools for handling sequence data, including BWA, the SRA-Toolkit, and Picard.
- RNAseq_WithoutGenomeReference.sh is a shell script that describes the process
of analyzing the example RNA-seq datasets by alignment to a reference transcriptome only, without using a reference genome sequence.
Monday, Apr 22 and Friday, Apr 26
Genotyping by sequencing (GBS) and Restriction-site-Associated DNA sequencing (RAD-seq)
experimental design considerations, library preparation, and data analysis. Slides
with an overview of GBS methods are available as PDF; more information on experimental details is available in the papers listed below.
- A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. Elshire et al.,
PLoS One 6(5):e19379, 2011. This is the initial
publication describing genotyping by sequencing of single-digest restriction fragments without size selection.
- Development of high-density genetic maps for barley and wheat using a novel two-enzyme
genotyping-by-sequencing approach. Poland et al.,
PLoS One 7(2): e32253, 2012. This paper describes a two-enzyme system that yields better results with
larger and more complex plant genomes such as barley (5 Gb) and hexaploid wheat (16 Gb).
- Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species.
Peterson et al, PLoS One 7(5):e37135, 2012. This
paper describes the advantages of size-selection of a narrow size range of restriction fragments, and compares the yield
of genotype data across a population of individuals for RAD-seq (using DNA shearing) or restriction-fragment-based
approaches with or without size-selection. It also introduces the idea of combinatorial indexing to increase the number
of individuals that can be pooled without a linear increase in oligonucleotide synthesis.
- Switchgrass genomic diversity, ploidy, and evolution: novel insights from a network-based SNP discovery protocol.
Lu et al, PLoS Genet 9: e1003215, 2013. This
paper describes the UNEAK software pipeline for SNP discovery in species without a reference genome.
- Imputation of unordered markers and the impact on genomic selection accuracy.
Rutkoski et al, G3 3(3):427-39, 2013. This
paper describes methods for imputation of missing data in populations of related individuals, without the requirement
for a reference genome sequence or known physical order of the marker loci.
- Marker density and read depth for genotyping populations using genotyping-by-sequencing. Beissinger et al,
Genetics 193:1073-1081, 2013. This paper
describes the non-random nature of restriction fragments sampled from the maize B73 genome by single-digest GBS, and
shows that assumptions of random sampling following a Poisson distribution are not met.
- Inferring phylogenies from RAD sequence data. Rubin et al,
PLoS ONE 7(4): e33394, 2012. This paper
describes experiments to test the ability to recover known phylogenies from organisms with sequenced genomes using simulated
RAD-seq data sampled at random from regions flanking restriction sites in the reference genome sequences. Clustering methods
are used as a means of testing for orthology among similar sequences, and the specificity and sensitivity of inference are
tested by comparison with the known relationships among loci and taxa.
- Special features of RAD Sequencing data: implications for genotyping. Davey et al,
Mol Ecol doi: 10.1111/mec.12084, 2012.
This paper describes biases detected in RAD-seq data, based on the length of restriction fragments created in the initial
restriction digest and other factors that affect the relative efficiency of DNA shearing, end repair, adapter ligation,
and PCR amplification during the preparation of sequencing libraries.
- Stacks: building and genotyping loci de novo from short-read sequences. Catchen et al.,
G3: Genes, Genomes, Genetics, 1:171-182, 2011
Full Text, Home Page
- See the class outline for Monday Apr 22 for a list of class activities.
- Run Install_OracleJava.sh on an EC2 instance to install Oracle Java.
This is essential - the GBS pipeline tools in TASSEL will not run on the OpenJDK version of Java installed in Ubuntu.
- Download the GBS.R R script to the home directory of the EC2 instance.
- Run GBS_tutorial.sh on an EC2 instance with Oracle Java installed to download
and install the standalone TASSEL 3.0 GBS pipeline, download sample data, and analyze the sample data using the GBS pipeline and R.
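As background for the read-depth discussion in the Beissinger et al. paper listed above, the sketch below shows what the Poisson assumption would predict: the probability that a site is covered by at least k reads when reads are sampled at random with a given mean depth. The paper reports that real single-digest GBS data do not meet this assumption, so these values are only the idealized baseline. The function name and the mean depth of 5 are invented for illustration.

```python
import math


def prob_depth_at_least(k, mean_depth):
    """Return P(X >= k) for X ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Under the idealized model with a mean depth of 5 reads per site:
for k in (1, 2, 5):
    print(k, round(prob_depth_at_least(k, 5.0), 4))
```

Comparing such predictions with the observed distribution of depth across sites is one way to see how far a dataset departs from random sampling.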