ECE365: Biological Data Analytics (Labs)

The labs and quizzes for the course will be posted here. The labs will be assigned each Tuesday and will be due the following Wednesday. 

These labs require you to use R, a language for statistical computing. Instructions for installing it as a kernel for your Jupyter notebooks can be found on the R Setup page. Completing the final section of Lab 3.1 is a good test that plotting and package installation for R are working correctly in your notebook.

Assignment | Link | Assigned | Due
Lab 3.1: Introduction to R and Bioconductor | [link] | February 27 | Not Graded
Lab 3.2: Gene Annotations and EM algorithm | [link] | April 9 | April 17
Lab 3.3: Differential Gene Expression | [link] | April 16 | April 24
Lab 3.4: Gene Modules and Gene Set Enrichment | [link] | April 23 | May 1

 

Lab 3.4

  • Since we will not be covering gene modules until the final lecture, only Problems 1 and 2 (worth 75 points) are required for Lab 3.4. Problems 3 and 4 are optional and worth up to 45 bonus points, so a perfect score on Lab 3.4 with all bonus points would be 120 out of 75.
  • For Question 2.7, you will need the Entrez gene identifiers (not the human-readable names) from Question 1.4.

 

Lab 3.3

  • In Question 1.4, "z_i^n is an indicator whether read n maps to transcript i" means that z_i^n is 0 when read n does not map to transcript i and 1 when it does (whether the mapping is unambiguous or multi).
  • In Question 2.3, make sure you use the gm_mean() geometric mean function, which does the calculation in log space so that the product does not blow up to infinity.
  • To extract values from the t.test(x, y, …) function, make sure you save the result in a variable, such as tout <- t.test(x, y, …). Then
    • tout is a list with a number of components:
      • tout$estimate - the pair of estimated means for a two-sample t-test
      • tout$statistic - the t-statistic for the statistical test
      • tout$p.value - the p-value of the statistical test
  • When you do Question 3.1, make sure
    • you use only patients with expression and subtype information
    • the rows of the clinical colData are in the exact same order as the columns of the expr countData
    • the design is a formula of the form "~ FACTOR", where FACTOR is a factor type object with two levels, one for the Mesenchymal subtype and one for the other subtype
  • The best online guidance I found for DESeq2 functions is here: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input
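
The Lab 3.3 notes above can be sketched in base R as follows. This is only an illustration under stated assumptions: gm_mean_sketch() is a hypothetical stand-in and may differ from the gm_mean() supplied with the lab (for example, it does not handle zeros), and the data, patient IDs, and the subtype column name are made up.

```r
# --- Geometric mean in log space (Question 2.3) ---
# Hypothetical helper; the lab's own gm_mean() may be written differently.
gm_mean_sketch <- function(v) {
  # For positive v, exp(mean(log(v))) equals prod(v)^(1/length(v)),
  # but it never forms the product, so it cannot overflow.
  exp(mean(log(v)))
}

big <- rep(1e200, 10)
naive <- prod(big)^(1 / length(big))  # the product overflows to Inf
stable <- gm_mean_sketch(big)         # stays finite

# --- Extracting values from t.test() ---
set.seed(1)                           # made-up example data
x <- rnorm(20, mean = 0)
y <- rnorm(20, mean = 1)
tout <- t.test(x, y)                  # save the result in a variable
est <- tout$estimate                  # estimated means of x and y
tstat <- tout$statistic               # the t-statistic
pval <- tout$p.value                  # the p-value

# --- Matching clinical rows to expression columns (Question 3.1) ---
# Made-up patient IDs: rows of expr are genes, columns are patients.
expr <- matrix(1:12, nrow = 3,
               dimnames = list(paste0("gene", 1:3), c("P1", "P2", "P3", "P4")))
clinical <- data.frame(subtype = c("Mesenchymal", "Other", "Mesenchymal"),
                       row.names = c("P3", "P1", "P5"))

# Keep only patients that have both expression and subtype information
common <- intersect(colnames(expr), rownames(clinical))
expr <- expr[, common, drop = FALSE]
clinical <- clinical[common, , drop = FALSE]  # same order as expr columns

# The design variable must be a factor with two levels
clinical$subtype <- factor(clinical$subtype)
design <- ~ subtype

# expr, clinical, and design would then be passed as countData, colData,
# and design to DESeq2::DESeqDataSetFromMatrix() (not called here).
```

Indexing both objects by the same vector of shared IDs is what guarantees that the rows of the clinical data and the columns of the expression data end up in the exact same order.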

 

Lab 3.2

  • In the bullet list of notation between Question 3.3 and 3.4, the second item is the definition for the variance, not the standard deviation.  There should be a square root to calculate the standard deviation. 
  • For the standard deviation, if you are trying to match my output: I used the sd() function for Q 2.3 and Q 3.2, which computes the "sample" standard deviation with N-1 in the denominator. For the other questions, I calculated the standard deviation without a built-in function and used the "population" standard deviation, with N in the denominator.
  • In the bullet list of notation between Question 3.2 and 3.3, the second item of the list should be the mixing coefficient and say instead of .
  • The getSequence() command may return different results if your BioMart version is different. The queried sequences may also be shuffled between different runs. If you get an occasional error, the Ensembl server might be busy; wait a minute and try again.
  • If your gc_contents are in a shuffled order relative to the example outputs, you may not match the example outputs where the order matters, for example, the rows of the posterior probabilities. However, your vector and multiD implementations should still match each other.
  • Released assignment with sample outputs [link]
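
The difference between the two standard deviations mentioned above can be sketched as follows. The data vector is made up for illustration and is not taken from the lab.

```r
# Made-up example data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Built-in sd(): "sample" standard deviation, N-1 in the denominator
sd_sample <- sd(x)

# "Population" variance, N in the denominator
var_pop <- mean((x - mean(x))^2)

# The square root turns the variance into a standard deviation
sd_pop <- sqrt(var_pop)

# The two standard deviations differ by a factor of sqrt((N-1)/N)
same <- isTRUE(all.equal(sd_pop, sd_sample * sqrt((n - 1) / n)))
```

For small N the two values can differ noticeably, which is why an answer may not match the example output if the other convention is used.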

 

Lab 3.1

  • The solutions to Lab 3.1 are [here].
  • The barplot labels in the answer png for Check Your Understanding 5 are not correct. The bins should be labelled from 2 to 7, not from 1 to 6.
  • If you are having difficulty installing packages on a Mac with Jupyter from Anaconda Navigator, try doing a clean install with the homebrew instructions.
  • If install.packages("ggplot2") does not work for you, your default CRAN repository might not be set correctly. First run getOption("repos") to check which repository you are using. If it is not "https://cran.r-project.org", try installing the package with the repos argument set explicitly: install.packages("ggplot2", repos="https://cran.r-project.org"). If this works, consider changing your default CRAN repository with the following command: options(repos=structure(c(CRAN="https://cran.r-project.org"))).
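
The repository check in the last item can be sketched as a short script. It is a sketch only: it changes the CRAN entry for the current R session, keeps any other configured repositories, and does not install anything. The variable names are illustrative.

```r
# Inspect the repositories R is currently configured to use
repos <- getOption("repos")
print(repos)

cran_url <- "https://cran.r-project.org"

# If the CRAN entry is missing or points elsewhere, set it for this session
if (!identical(unname(repos["CRAN"]), cran_url)) {
  repos["CRAN"] <- cran_url
  options(repos = repos)
}

# After this, install.packages("ggplot2") would use cran_url by default.
```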