BIT 815, Deep Sequencing Data Analysis

Course Materials, Spring 2013

Linux Command Line Reference

A brief list of useful shell commands

Instructions on how to create a Linux system on a flash drive

The course S3 bucket

HandyBashShellTricks.txt, a text file of collected tricks for accomplishing specific tasks using Bash utilities. The file has Unix line endings, so commands can be pasted directly on a Linux or Mac command line to see how they work.

Links to PDF and Word documents, PowerPoint presentations, and shell scripts are available here as well as throughout the course schedule under the appropriate days.

A slide presentation with an overview of topics covered during the course is here.

Friday 1 March - Introduction to Linux and the command-line interface

Introductory slides present the course objectives in the first class session, and a summary of Chapter 1 from Eric Raymond's book The Art of Unix Programming, available here, is used as a framework for discussing differences between the Linux command-line interface and graphical interfaces. File globbing and regular expressions provide a basis for discussing abstraction and generalization as key parts of computational thinking; more information on regular expressions is also available at The Linux Documentation Project webpage here. Instructions for setting up an Amazon Web Services account, and for preparing and running a CloudBioLinux instance on the Amazon Web Services Elastic Compute Cloud (AWS-EC2), are also available. The computers in the teaching lab are already set up in this way; this information is provided for those who want to have the same software installed on another computer.
A set of sample paired-end Fastq sequence files and a sample SAM alignment file are in
The FileGlobbing.pdf and RegularExpressions.pdf documents provide information on these pattern-matching tools.
Analysis of a SAM alignment file using command-line Linux tools is described in SAMformatAndCLtools.pdf.
The LoggingIntoAWS_StartingEC2instance.pdf document has information on how to log into the Amazon Web Services console, and how to start a cloud computing 'instance'. This document also includes some exercises on the Linux command line, and some exercises with the fastx-toolkit and FastQC programs. Details on the FastQC output are provided in the FastQC_details.pdf document, or through the FastQC website.
The Unix Tutorial for Beginners site has a set of eight tutorials that progress through the use of different shell commands; these exercises provide practice in using shell commands, to follow up on the Software Carpentry material. Another introductory Unix tutorial is available through the website. This website also provides a number of other learning resources, available from the home page by moving the cursor over the "resources" heading in the upper right of the homepage, then moving the cursor over the "learning activities" item in the drop-down menu that opens.
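
The difference between file globbing and regular expressions mentioned above can be illustrated with a short Python sketch. The file names are hypothetical and only resemble the sample paired-end files; the point is that the glob wildcard `?` matches exactly one character, while a regular expression can both match and extract parts of a name.

```python
import fnmatch
import re

# Hypothetical file names (assumed for illustration only).
names = ["sample1_R1.fastq", "sample1_R2.fastq", "sample2_R1.fastq",
         "sample10_R1.fastq", "aligned.sam"]

# Glob pattern: '?' matches exactly one character, so sample10 is excluded.
r1_files = fnmatch.filter(names, "sample?_R1.fastq")

# Regular expression: '\d+' matches one or more digits, and the
# parentheses capture the sample number and the read number.
pattern = re.compile(r"sample(\d+)_R([12])\.fastq")
parsed = [(int(m.group(1)), int(m.group(2)))
          for m in map(pattern.fullmatch, names) if m]

print(r1_files)
print(parsed)
```

The same glob pattern typed at a Bash prompt (for example with `ls`) would select the same two files, because the shell expands globs before the command runs.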

Monday 11 March and Friday 15 March – data exploration & QC; error detection; experimental design issues

Resources: Wikipedia has information on FASTA and FASTQ sequence formats, and information on the Sequence Alignment and Mapping (SAM) format is available here and in the SAM format specification. The fastx-toolkit webpage has information about the fastx-toolkit package of programs for quality control and manipulation of FASTA and FASTQ files, and the FastQC webpage has information about the FastQC program. Follow links on those webpages to find more documentation about the functions each software provides.
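
As a rough illustration of the FASTQ format described in those resources, the sketch below reads four-line records and converts the quality string to numeric scores. It assumes the Phred+33 encoding and unwrapped (single-line) sequences; the function name `read_fastq` and the example record are invented for this sketch.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        qual = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# A made-up single record for demonstration.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
name, seq, quals = records[0]
mean_q = sum(quals) / len(quals)
print(name, seq, quals, mean_q)
```

Older Illumina pipelines used a Phred+64 offset, so the offset should be checked before interpreting scores from archived data.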


  1. AWS remote login and instance management; S3 bucket as a source of data for exercises
  2. Shell tutorial (Unix Tutorial for Beginners)
  3. Data structure and formats, exploratory analysis
  4. Using GemSIM to create simulated datasets of reads with different insert size and end orientation - download script here.
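
For the SAM format listed under data structures and formats, a minimal sketch of exploratory parsing is shown below. The field order and flag bits follow the SAM specification; the alignment line itself and the helper names `decode_flag` and `parse_sam_line` are made up for illustration.

```python
# Bit meanings of the SAM FLAG field, per the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_sam_line(line):
    """Extract the first six mandatory fields of a SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {"qname": fields[0], "flag": int(fields[1]), "rname": fields[2],
            "pos": int(fields[3]), "mapq": int(fields[4]), "cigar": fields[5]}

# A hypothetical alignment record (not taken from the course data).
line = "read1\t99\tchr5\t1000\t60\t4M\t=\t1200\t204\tACGT\tIIII"
rec = parse_sam_line(line)
flags = decode_flag(rec["flag"])
print(rec["rname"], rec["pos"], flags)
```

Header lines in a real SAM file start with `@` and would need to be skipped before calling `parse_sam_line`.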

Monday 18 March and Friday 22 March
Library construction and experimental design for genome and transcriptome assembly, read simulation with GemSIM, read mapping with bwa, and read assembly with velvet.
A lecture outline is here.
A text file listing things to do in class is here.

Friday 22 March
FLASH: Fast Length Adjustment of Short Reads to improve genome assemblies (PubMedCentral) describes a software tool for joining paired-end reads obtained from DNA fragments short enough that the reads overlap at the ends. This is reported to improve the quality of assemblies created from the joined reads. An outline of an exercise with the FLASH assembler is available: FLASH_exercise.docx
GAGE: A critical evaluation of genome assemblies and assembly algorithms (Publisher Web Site) describes a set of experiments comparing different assembly programs on four genomes, and provides useful insights into the challenges of genome assembly.
Comparison of requirements for assembly with ABySS and Velvet, and comparison of assembly to whole-genome alignment with MUMmer v.3.
Exercises with ABySS are here.
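
The idea of joining paired-end reads whose ends overlap can be sketched in a few lines. This is a simplified illustration, not the FLASH algorithm: it requires an exact match in the overlap, whereas a real tool must tolerate sequencing errors and score candidate overlaps. The function names, the `min_overlap` parameter, and the test fragment are assumptions made for this sketch.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(read1, read2, min_overlap=5):
    """Merge a read pair on the longest exact overlap, or return None."""
    mate = reverse_complement(read2)  # put read 2 on the same strand as read 1
    for olap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-olap:] == mate[:olap]:
            return read1 + mate[olap:]
    return None

# A made-up 22 bp fragment sequenced with 15 bp reads from each end,
# so the two reads overlap by 8 bp in the middle.
fragment = "ACGTTGCAAGGCTTAACCGGTA"
read1 = fragment[:15]
read2 = reverse_complement(fragment[-15:])
merged = merge_pair(read1, read2)
print(merged == fragment)
```

Trying the longest overlap first is a design choice of this sketch; with repetitive sequence it can pick a spurious overlap, which is one reason real tools use a mismatch-aware score.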

Monday 25 March
Assembly quality metrics and Assemblathon-1: Outline and notes
SAM format definition and Tablet assembly viewer.
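
One widely used assembly contiguity metric is N50: the contig length such that contigs of that length or longer contain at least half of the total assembled bases. Whether the course notes use exactly this definition is an assumption; the sketch below simply shows the calculation on made-up contig lengths.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    return 0

# Hypothetical contig lengths: total 300, so half is 150.
contigs = [100, 80, 60, 40, 20]
print(n50(contigs))
```

N50 rewards long contigs but says nothing about their correctness, so it is usually read alongside measures of misassembly.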

Friday 29 March
No Class - Spring Holiday

Monday, Apr 1 and Friday, Apr 5
Aligning short reads to a reference genome sequence is the first step after quality control in many types of sequence data analysis pipelines. BWA and Bowtie2 are two of many alternative programs available for mapping reads; we will also discuss programs that are specialized for mapping RNA-seq reads to genomic DNA sequence when we discuss RNA-seq in a few weeks. For this week, the focus will be on discovery of DNA sequence variants, including single-nucleotide polymorphisms (SNPs), copy number variants (CNVs), structural variants (SVs) such as translocations or inversions, and insertion-deletion events (indels).


  1. Mismatch and indel tolerance; speed vs accuracy vs completeness; handling repetitive sequences
  2. Variant discovery from aligned reads – SNPs, small indels, large rearrangements, and copy number variation
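
Small indels in aligned reads are recorded in the CIGAR field of each SAM record, so a first look at indel content can be had by summarizing CIGAR strings. The sketch below uses the operation letters from the SAM specification; the helper names and the example CIGAR string are invented for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def summarize_cigar(cigar):
    """Return total bases per CIGAR operation, e.g. {'M': 93, 'I': 2}."""
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)
    return counts

def reference_span(cigar):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")

# A hypothetical 100-base read: 5 soft-clipped bases, one 2-base insertion
# and one 1-base deletion relative to the reference.
summary = summarize_cigar("5S30M2I10M1D53M")
span = reference_span("5S30M2I10M1D53M")
print(summary, span)
```

Note that `M` covers both matches and mismatches, so SNP discovery needs the read and reference bases themselves, not just the CIGAR string.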

Monday, Apr 8 and Friday, Apr 12
ChIP-seq analysis: Chromatin immunoprecipitation to detect protein binding sites and DNA modification. A brief introduction is presented in the IntroToChIPseq.pdf document.



Monday, Apr 15 and Friday, Apr 19
RNA-seq experiments are growing in popularity as a means of characterizing the transcriptome, detecting alternative splicing events, and measuring differences in gene expression between samples of different types. The importance of good experimental design in conducting RNA-seq experiments is emphasized in the first paper in the recommended reading, by Auer and Doerge. Any experiment in which differences between samples are to be interpreted in a biological context should take seriously the need for good experimental design. The most reliable conclusions will result from a well-replicated design in which the experimental treatments are orthogonal to nuisance factors.
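
To make the idea of treatments being orthogonal to nuisance factors concrete, the sketch below checks a sample layout in which sequencing lane is taken as the nuisance factor. It uses a simplified criterion (every treatment appears equally often in every lane); the sample layouts and the function name `is_orthogonal` are assumptions for illustration, not part of the recommended reading.

```python
from collections import Counter
from itertools import product

def is_orthogonal(samples):
    """True if every (treatment, lane) combination occurs equally often."""
    treatments = {t for t, _ in samples}
    lanes = {lane for _, lane in samples}
    counts = Counter(samples)  # missing combinations count as 0
    cell_counts = {counts[(t, lane)] for t, lane in product(treatments, lanes)}
    return len(cell_counts) == 1 and 0 not in cell_counts

# Confounded layout: all controls in one lane, all treated samples in another,
# so a lane effect cannot be separated from the treatment effect.
confounded = [("control", "lane1")] * 3 + [("treated", "lane2")] * 3

# Balanced layout: each lane holds one sample of each treatment.
balanced = [(t, lane) for t in ("control", "treated")
            for lane in ("lane1", "lane2", "lane3")]

print(is_orthogonal(confounded), is_orthogonal(balanced))
```

In practice a balanced layout like this is usually achieved by barcoding samples and multiplexing them across lanes.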



  1. Introduction to R: slides from a UC-Riverside short course, with accompanying R-script of exercises. Thomas Girke of UC-Riverside and his co-workers have assembled an extensive manual on using R and Bioconductor for analysis of deep sequencing data.
  2. RStudio is an integrated development environment (IDE) for R, with versions available for Windows, Mac OSX, and Linux. This shell script installs RStudio on an Ubuntu EC2 instance. Download the script using a browser from an EC2 instance, make the script executable, and run it to install RStudio. The program icon will appear in the Applications menu, under the Programming heading.
  3. RNA-Seq: example analysis using Arabidopsis reads aligned to chr 5 - download the shell script, installBioC.R, and RNAseqScript.R to an EC2 instance and execute the shell script, which invokes the two R scripts. The three scripts together describe the process of downloading data from the course S3 bucket, followed by a series of analysis steps.
  4. Scripting – combining the shell and R commands into scripts that execute the RNA-seq analysis as a pipeline

Monday, Apr 22 and Friday, Apr 26
Genotyping by sequencing (GBS) and Restriction-site-Associated DNA sequencing (RAD-seq): experimental design considerations, library preparation, and data analysis. Slides with an overview of GBS methods are available as PDF; more information on experimental details is available in the papers listed below.
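
Because GBS and RAD-seq sample the genome at restriction sites, one design consideration is how many fragments a given enzyme produces. The sketch below performs a simple in silico digest. The choice of EcoRI (recognition site GAATTC, cutting after the first G) and the toy sequence are illustrative assumptions, not the course protocol.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting at every occurrence of site.

    cut_offset is the position of the cut within the site
    (1 for EcoRI, which cuts between G and AATTC).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    edges = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(edges, edges[1:])]

# A made-up 26 bp sequence containing two EcoRI sites.
genome = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
fragments = digest(genome)
print(fragments)
```

Running the same function over a reference genome gives a rough estimate of the number of loci sampled, which feeds into decisions about multiplexing level and read depth per sample.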