# Mini Project 2

#### Posted: 2018-02-22   |   Due: 2018-03-16

## Changelog

Changes so far:
* 3/10/2018 - Task 3 + Report deadline pushed back to 03/16/2018 11:59pm
* 3/05/2018 - Task 3 + Report deadline pushed back to 03/14/2018
* 3/04/2018 - Task 2 finalized
* 3/01/2018 - Task 3 released
* 2/28/2018 - Task 2 released
* 2/25/2018 - Reordered subparts of Task 1, sent out assigned cells & genes emails
* 2/23/2018 - Initial release

## Introduction

The goal of this project is to apply unsupervised learning approaches to identify genes with significant differential expression across single-cell subpopulations induced by therapeutic treatment. Links to datasets with gene expression samples have been provided below.

This is a solo project, i.e. you are not allowed to work on it with other students in your group/class.

## Project Deliverables

1. Jupyter notebook containing code for Task 1 by the first checkpoint (11:59pm on March 1, 2018)
2. Jupyter notebook containing code for Tasks 1 and 2 by the second checkpoint (11:59pm on March 7, 2018)
3. Jupyter notebook containing code for all Tasks (by 12 Noon on March 16, 2018)
4. A report containing summarized answers for each task of the mini project.
• Turn in a PDF of a PowerPoint presentation.
• No more than 30 slides.

## Project Materials

* Datasets: Baseline CSV, Metformin CSV
* Discussion Examples: Intro to Medical Dataset, Clustering Examples, Principal Component Analysis
* Task 1 Cells & Genes: By email
* Unsupervised Single-Cell Analysis in Triple-Negative Breast Cancer: A Case Study: Link (requires on-campus internet access or VPN)

## Task 0 - Biology Primer

You do not have to turn in anything for this Task.

Familiarize yourself with the following terms:
1. Gene
2. Cell
3. Gene Expression
4. RPKM

Hint: Look at Unsupervised Single-Cell Analysis in Triple-Negative Breast Cancer: A Case Study

## Task 1 - Getting familiar with the dataset

1. Import both datasets into Pandas.

• How many gene samples are present in each dataset?
• How many cells are present in each dataset?
• How many genes are common in both datasets?
2. Plot the variation of gene expression across different genes for your assigned cells.
You will have two plots - One for baseline cells and another for Metformin cells.

3. Plot the variation of gene expression across different cells for your assigned gene.

4. Perform the KS test on your Baseline and Metformin Variance distributions.

• How many genes are differentially expressed at $\alpha = 0.10, 0.05, 0.025, 0.01, 0.005$ and $0.001$?

Hint: Figure 2 in Unsupervised Single-Cell Analysis in Triple-Negative Breast Cancer: A Case Study covers Subtasks 2 & 3
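
Subtasks 1 and 4 can be sketched roughly as follows. This is a minimal sketch on small synthetic frames, not the course datasets. The column names (`gene`, `cell_*`), the single metadata column, and the helper `make_frame` are all hypothetical. Reading the KS test as a per-gene two-sample test between baseline and metformin cells (via `scipy.stats.ks_2samp`) is one possible interpretation of Subtask 4, not necessarily the intended one.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)


def make_frame(genes, n_cells, shift=0.0):
    """Build a toy genes-by-cells frame with one metadata column (assumed layout)."""
    values = rng.exponential(scale=10.0, size=(len(genes), n_cells)) + shift
    cells = [f"cell_{i}" for i in range(n_cells)]
    frame = pd.DataFrame(values, columns=cells)
    frame.insert(0, "gene", genes)
    return frame


baseline = make_frame(["g1", "g2", "g3", "g4"], n_cells=30)
metformin = make_frame(["g2", "g3", "g4", "g5"], n_cells=40, shift=5.0)

n_meta = 1  # number of leading metadata columns (assumption for this toy data)
n_genes_baseline = baseline.shape[0]
n_cells_baseline = baseline.shape[1] - n_meta
common_genes = sorted(set(baseline["gene"]) & set(metformin["gene"]))

# Per-gene two-sample KS test: baseline cells vs. metformin cells.
b = baseline.set_index("gene").loc[common_genes]
m = metformin.set_index("gene").loc[common_genes]
p_values = pd.Series(
    [ks_2samp(b.loc[g].to_numpy(), m.loc[g].to_numpy()).pvalue for g in common_genes],
    index=common_genes,
)

# Count genes whose p-value falls below each significance level.
alphas = [0.10, 0.05, 0.025, 0.01, 0.005, 0.001]
counts = {alpha: int((p_values < alpha).sum()) for alpha in alphas}
```

Because the thresholds are nested, the counts can only stay the same or shrink as $\alpha$ decreases, which is a quick sanity check on your own results.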

## Task 2 - Clustering

The goal of this task is to identify genes with significant differential expression in the metformin dataset.
This can be achieved by clustering the baseline and metformin datasets.

Some important constants:

    rpkmThreshold = 32
    numBaselineClustGMM = 2
    numMetforminClustGMM = 3

Step by step guide:

1. Extract your RPKM matrix from each dataset. This is simply the dataframe without the first few columns that contain metadata.
2. Find the transpose of each matrix. This will be the input to your clustering algorithms.
3. For each transposed matrix:
• Apply the Gaussian Mixture Model (GMM) clustering algorithm.
Since your input is the transposed matrix, you will be clustering by cells.
• Use N=2 clusters for the baseline matrix and N=3 clusters for the metformin matrix.
• The output of your clustering algorithm is stored in the ‘means_’ attribute.
REPORT What do these means represent?
REPORT How many cells do you have in each of the 5 clusters?
4. Using the results of your clustering algorithm, create new mean baseline and mean metformin datasets

• Preserve metadata from the original dataframes
• Store the means of each cluster in additional columns
An example for the baseline dataset has been provided:
    # baselineCells is the fitted GMM object from Step 3
    df_baseline_means = baseline.iloc[:, 0:5].copy()  # preserve metadata from first 5 columns; copy to avoid modifying a view
    df_baseline_means["1"] = baselineCells.means_[0]
    df_baseline_means["2"] = baselineCells.means_[1]

Repeat the same process for the metformin dataset, but with 3 columns instead (one per cluster)
5. Define clusters $B_u$ and $B_v$ such that $|B_u| < |B_v|$, where $|B|$ represents the number of cells in cluster $B$ in the baseline dataset
(i.e., let $B_u$ represent the smaller cluster and $B_v$ represent the larger one).
Similarly, define clusters $M_x$, $M_y$ and $M_z$ such that $|M_x| < |M_y| < |M_z|$, where $|M|$ represents the number of cells in cluster $M$ in the metformin dataset.
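
The clustering and size-ordering steps above can be sketched as follows. This is a minimal sketch on a synthetic cells-by-genes matrix, assuming scikit-learn's `GaussianMixture` is the GMM implementation (the `means_` attribute suggests this, but the handout does not name the library). The toy data and `covariance_type="diag"` are choices made only to keep the small example stable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy cells-by-genes matrix (already transposed): 3 well-separated groups
# of 5, 15 and 30 cells over 4 genes. Stands in for the metformin matrix.
sizes = [5, 15, 30]
centers = [0.0, 50.0, 100.0]
X = np.vstack(
    [rng.normal(loc=c, scale=1.0, size=(n, 4)) for c, n in zip(centers, sizes)]
)

# A fixed random_state keeps the component order the same between runs.
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X)
labels = gmm.predict(X)

# Number of cells assigned to each component, indexed by component id.
cluster_sizes = np.bincount(labels, minlength=gmm.n_components)

# Component ids sorted from the smallest to the largest cluster: (x, y, z).
x, y, z = np.argsort(cluster_sizes)

# gmm.means_ has shape (n_components, n_genes): one mean per gene per cluster.
mean_Mx, mean_My, mean_Mz = gmm.means_[x], gmm.means_[y], gmm.means_[z]
```

Sorting component ids by cluster size, rather than hard-coding which column is $M_x$, keeps the mapping valid even if the component order changes.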

6. Filter your mean metformin dataset to only include genes that meet the following condition
$\overline{M_x} < rpkmThreshold < \overline{M_y}$ and $rpkmThreshold < \overline{M_z}$ where $\overline{M}$ represents the mean of cluster $M$.
Call this new dataframe metformin_upMy_downMx_upMz_DF.
Hint: Cluster means are in the dataframe you created in Step 4. However, the order of your means may change with each run of the clustering algorithm. Hence, you will have to identify how to map columns “1”, “2” and “3” to $\overline{M_x}, \overline{M_y}$ and $\overline{M_z}$. You might want to pick a fixed random state for your clustering algorithm to ensure the ordering is always the same.

7. Similarly, filter your mean baseline dataset to only include genes that meet the following condition
$\overline{B_u} < rpkmThreshold < \overline{B_v}$ where $\overline{B}$ represents the mean of cluster $B$.
Call this new dataframe baseline_upBv_downBu_DF.
Hint: Cluster means are in the dataframe you created in Step 4. However, the order of your means may change with each run of the clustering algorithm. Hence, you will have to identify how to map columns “1”, “2” to $\overline{B_u}$ and $\overline{B_v}$. You might want to pick a fixed random state for your clustering algorithm to ensure the ordering is always the same.

8. Are there any genes common to your filtered datasets from Steps 6 & 7?
Low expression level genes in $M_x$ may come from $B_u$. These are genes with an inherently low expression level.
To account for inherently low expression levels, remove any genes that exist in baseline_upBv_downBu_DF from metformin_upMy_downMx_upMz_DF.
Call this filtered dataset newlyDownregulatedGenes_inMx_fromBu_df.
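
Steps 6 to 8 amount to boolean filtering followed by a set difference on gene identifiers. The sketch below uses made-up numbers and assumes the cluster-mean columns have already been renamed to `Bu`, `Bv`, `Mx`, `My` and `Mz` according to cluster size. Those column names and the `gene` metadata column are hypothetical.

```python
import pandas as pd

rpkmThreshold = 32

# Toy mean frames: a "gene" metadata column plus one column per cluster mean.
df_baseline_means = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4"],
    "Bu": [1.0, 40.0, 2.0, 50.0],
    "Bv": [60.0, 45.0, 70.0, 10.0],
})
df_metformin_means = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4"],
    "Mx": [0.5, 3.0, 35.0, 4.0],
    "My": [55.0, 48.0, 60.0, 20.0],
    "Mz": [65.0, 52.0, 80.0, 90.0],
})

# Step 6: Mx below the threshold, My and Mz above it.
m = df_metformin_means
metformin_upMy_downMx_upMz_DF = m[
    (m["Mx"] < rpkmThreshold) & (m["My"] > rpkmThreshold) & (m["Mz"] > rpkmThreshold)
]

# Step 7: Bu below the threshold, Bv above it.
b = df_baseline_means
baseline_upBv_downBu_DF = b[(b["Bu"] < rpkmThreshold) & (b["Bv"] > rpkmThreshold)]

# Step 8: drop genes that were already low in the small baseline cluster.
already_low = metformin_upMy_downMx_upMz_DF["gene"].isin(
    baseline_upBv_downBu_DF["gene"]
)
newlyDownregulatedGenes_inMx_fromBu_df = metformin_upMy_downMx_upMz_DF[~already_low]
```

In this toy data, `g1` passes both filters and is therefore removed in Step 8, leaving only `g2`.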

9. Find genes that are common to newlyDownregulatedGenes_inMx_fromBu_df and your mean baseline dataset.
Further filter your mean baseline and mean metformin dataframes to include only these overlap genes.
Your dataframes will now contain only downregulated genes.

10. Create a new dataframe with the following data:

• A column containing $\overline{B_u}$ for downregulated genes.
• A column containing $\overline{B_v}$ for downregulated genes.
• A column containing $\overline{M_y}$ for downregulated genes.
• A column containing $\overline{M_z}$ for downregulated genes.
You should have 4 columns with approx 230 genes/rows/entries.
11. Next, apply the log + 1 transformation to your dataframe in Step 10.
As long as your dataframe consists of all floats, this can be achieved by

    new_df = np.log(df + 1)

Then, find the row means and row standard deviation for each row in your new dataframe.
Hint: Use np.mean/np.std or dataframe.apply to create new columns containing means and standard deviations
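
A minimal sketch of Step 11 on a two-gene toy frame is shown below; the column names and values are made up. One detail worth noting: `DataFrame.std` defaults to the sample standard deviation (`ddof=1`), while `np.std` defaults to the population standard deviation (`ddof=0`), so the two hints above give different numbers unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame with one column per cluster mean and one row per gene.
df = pd.DataFrame(
    {"Bu": [0.0, 1.0], "Bv": [3.0, 7.0], "My": [7.0, 15.0], "Mz": [15.0, 31.0]},
    index=["g1", "g2"],
)

new_df = np.log1p(df)  # same result as np.log(df + 1)

# Compute row statistics over the cluster columns only, so that the new
# summary columns do not feed into each other.
cluster_cols = list(new_df.columns)
new_df["row_mean"] = new_df[cluster_cols].mean(axis=1)
new_df["row_std"] = new_df[cluster_cols].std(axis=1, ddof=0)  # ddof=0 matches np.std
```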

12. Finally, add a column containing $\overline{M_x}$ for downregulated genes to your dataframe in Step 11.
Apply the log+1 transformation to this column as well.

13. Visualize each cluster by providing a box plot. You will have 5 boxes, one each for $B_u, B_v, M_x, M_y$ and $M_z$.
One of your clusters should have zero expression.

14. Repeat Steps 1-13 with k-means clustering in Step 3.

• Perform silhouette analysis to pick the optimal value of $k$ (number of clusters).
Provide the silhouette score or line plot for each value of $k$.
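• Are you able to replicate the results obtained with GMM clustering? If silhouette analysis gives you more than 3 metformin clusters or more than 2 baseline clusters, filter on the smallest cluster so that $\overline{M_x} < rpkmThreshold < \overline{M}$ for every other metformin cluster $M$, or $\overline{B_u} < rpkmThreshold < \overline{B}$ for every other baseline cluster $B$.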

In your report, remember to explain what you did in each step and why.
Simply providing screenshots from Jupyter notebook is insufficient.
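
The silhouette analysis in Step 14 can be sketched as follows, assuming scikit-learn's `KMeans` and `silhouette_score`. The data are three synthetic, well-separated blobs, and the range of $k$ values tried is arbitrary; on the real datasets the scores will be far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy cells-by-genes matrix: three groups of 40 cells over 5 genes.
X = np.vstack(
    [rng.normal(loc=c, scale=1.0, size=(40, 5)) for c in (0.0, 20.0, 40.0)]
)

# Mean silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
```

Plotting `scores` against `k` as a line plot gives the figure requested in Step 14.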

## Task 3 - Principal Component Analysis

1. Perform Principal Component Analysis on your datasets.
• How many components did you use for your analysis? Provide a cumulative variance expression or knee curve to justify your answer.
• Provide a visualization of your analysis in the form of a biplot. Remember to include eigenvectors of the covariance matrix.
Your datapoint labels will originate from your GMM clustering predictions. You should have one biplot for baseline with 2 clusters and another for metformin with 3 clusters.

Hint: The datasets provided have already been standardized. You do not need to scale or normalize any features.
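
One way to justify the number of components is to look at the cumulative explained variance. The sketch below assumes scikit-learn's `PCA` and uses synthetic data generated from two latent factors. The 95% cutoff is an arbitrary choice for illustration, not a course requirement.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy cells-by-genes matrix: 60 cells, 10 genes driven by 2 latent factors plus noise.
latent = rng.normal(size=(60, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.05 * rng.normal(size=(60, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 95% of the variance.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Ingredients for a biplot of the first two components.
scores = pca.transform(X)[:, :2]      # cell coordinates (points)
gene_vectors = pca.components_[:2].T  # eigenvector entries per gene (arrows)
```

For the biplot, scatter `scores` colored by the GMM cluster predictions and draw one arrow per row of `gene_vectors`.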

Here are some example ideas for 4-credit-hour/graduate students. However, the project is unsupervised and open-ended.

You must make your own assumptions that are statistically correct and derived based on domain expertise. You may contact Saurabh Jha for ideas based on datasets from MP1 and Arjun Athreya for datasets based on MP2. Please contact Arjun only after pre-approval from course staff.

Graduate Hour Projects are due in the last week of classes (beginning of May). However, we may have checkpoints along the way.

### Idea 1

In Task 2, you found that k-means performs poorly on the given dataset. In In-Class Activity 2, you learned about ways to cluster non-hyper-ellipsoid-shaped clusters by using a combination of (a) data transformations and (b) distance measures.

Your goal is to repeat the k-means experiment at least 5 times with different combinations of (a) data transformation and (b) distance measure.

You must identify at least one parameter combination that yields clusters similar to the Gaussian Mixture Model-based clusters.

You may refer to kernel-based k-means here.
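
A possible starting point for this experiment is sketched below. It varies only the data transformation, because scikit-learn's `KMeans` supports Euclidean distance only; other distance measures or kernel k-means would need a different implementation. Using the adjusted Rand index to score agreement with the GMM labels is an assumption of this sketch, and the data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy skewed, non-negative data standing in for an RPKM cells-by-genes matrix:
# one small cluster of 10 cells and one large cluster of 60 cells.
X = np.vstack([
    rng.lognormal(mean=1.0, sigma=0.5, size=(10, 6)),
    rng.lognormal(mean=4.0, sigma=0.5, size=(60, 6)),
])

# Reference labels from GMM on the untransformed data.
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Candidate transformations to apply before k-means.
transforms = {
    "identity": lambda a: a,
    "log1p": np.log1p,
    "sqrt": np.sqrt,
}

# Agreement between k-means and GMM labels (1.0 means identical partitions).
agreement = {}
for name, transform in transforms.items():
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    km_labels = km.fit_predict(transform(X))
    agreement[name] = adjusted_rand_score(gmm_labels, km_labels)
```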

### Idea 2

In Task 2, you found that the number of observations in some of the identified GMM clusters is low (<10), which may be one of the reasons why k-means clustering fails to provide the right answer.

The goal of this project is to determine the robustness of the parameters of the GMM Model. You may do this in several ways. We provide one possible methodology below.

You can try to learn the distribution of the observations for each of the clusters found using GMM. Let’s call these observation generation functions OGs (observation generators). Using the OGs, you can generate more observations. Next, vary the parameters of the OGs and use GMM as well as k-means to perform clustering.

Your objective is to report if the clustering performance increases or decreases for each of the clustering methods (k-means and GMM) when parameters of the distribution are varied.
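
The methodology above could be prototyped along these lines. This is only a sketch under several assumptions: the OGs are taken to be the diagonal Gaussians fitted by scikit-learn's `GaussianMixture`, the varied parameter is a variance scale factor, 200 observations are drawn per cluster, and clustering performance is measured with the adjusted Rand index against the known generating cluster. None of these choices come from the course staff.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy data with one small cluster (<10 cells) and one large cluster.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(8, 3)),
    rng.normal(loc=6.0, scale=1.0, size=(80, 3)),
])

# Fit the observation generators (OGs): one diagonal Gaussian per cluster.
og = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)

results = {}
for scale in (0.5, 1.0, 2.0, 4.0):
    # Draw synthetic observations from each OG with scaled variance.
    samples, truth = [], []
    for comp in range(og.n_components):
        std = np.sqrt(og.covariances_[comp] * scale)
        samples.append(
            rng.normal(loc=og.means_[comp], scale=std, size=(200, X.shape[1]))
        )
        truth.append(np.full(200, comp))
    S, t = np.vstack(samples), np.concatenate(truth)

    # Re-cluster the synthetic data with both methods and score each one.
    gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
    gmm_pred = gmm.fit_predict(S)
    km_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(S)
    results[scale] = (
        adjusted_rand_score(t, gmm_pred),
        adjusted_rand_score(t, km_pred),
    )
```

Each entry of `results` holds the (GMM, k-means) scores for one variance scale, so you can report how each method's performance changes as the parameter is varied.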