clusterMaker App
clusterMaker is an app that provides the functionalities needed for clustering, dimensionality reduction and ranking. The clusterMaker app for Cytoscape is available from the App Store.
This tutorial demonstrates how the various clusterMaker algorithms can be used together to explore a data set, and how clusterMaker can be integrated with other Cytoscape apps and capabilities.
This workflow uses two types of data: protein-protein interaction data from the STRING database, and expression data from a yeast heat shock experiment. We will combine these two data types in Cytoscape.
Setup
- Install and launch the latest version of Cytoscape.
- Install the clusterMaker 2 app via Apps → App Store ... or directly from the App Store.
- Install the stringApp via Apps → App Manager or directly from the App Store.
- Install the Largest Subnetwork App from the Cytoscape App Store, or using the Cytoscape App Manager.
- Download the matrix data for the yeast heat shock experiment.
Importing the PPI network
- Launch Cytoscape, and go to File → Import → Network from Public Databases....
- Select STRING: protein query in the Data source menu, and select Saccharomyces cerevisiae from the Species list.
- Select All proteins of this species.
- Select the option for physical subnetwork and set the Confidence (score) cutoff to 0.50.
- Click Import to load. The network might take some time to load as it is large.
- Finally, let's continue our analysis with only the largest connected subnetwork by clicking Select → Nodes → Largest subnetwork. Next, select File → New Network → From Selected Nodes, All Edges.
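To make the last step concrete, here is a minimal Python sketch of what "largest connected subnetwork" means: the biggest group of nodes that can all reach each other through edges. It is not the Largest Subnetwork app's code; the function name `largest_component` and the toy network are hypothetical and exist only for illustration.

```python
from collections import deque

def largest_component(nodes, edges):
    """Return the set of nodes in the largest connected component."""
    # Build an undirected adjacency list.
    adjacency = {node: set() for node in nodes}
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    best = set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Hypothetical toy network: a 3-node component, a 2-node component, and an isolated node.
toy_nodes = ["A", "B", "C", "D", "E", "F"]
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
largest = largest_component(toy_nodes, toy_edges)
print(sorted(largest))
```

In the toy network, only A, B and C would be kept for the new network; D, E and F would be left behind.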
The network should look similar to this:
Importing the Expression Data
- Select File → Import → Table from File and select the text file with the matrix data.
- Set Where to Import Table Data: to To a Network Collection.
- Set Import Data as: to Node Table Columns.
- Set the Key column for the network to display name.
- In the Preview, click on the Gene symbol column header and then click the key symbol, to assign this column as the key column for the data.
- Under Advanced Options, set the delimiter to Tab and turn off COMMA.
- Click OK to import.
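The import steps above amount to a key-based join: each row of the tab-delimited table is attached to the node whose display name matches the row's Gene symbol. The Python sketch below shows that idea only. It does not use Cytoscape, and the expression values, the toy node table, and the helper name `import_node_columns` are made up for illustration.

```python
import csv
import io

# Hypothetical tab-delimited excerpt; the values are invented, and the real
# file has one column per time point.
MATRIX_TEXT = (
    "Gene symbol\tGPL51-01\tGPL51-02\n"
    "HSP104\t2.1\t3.4\n"
    "SSA1\t1.2\t1.9\n"
    "XYZ1\t0.3\t0.1\n"
)

# Hypothetical node table, keyed by display name.
node_table = {"HSP104": {}, "SSA1": {}, "RPL3": {}}

def import_node_columns(text, nodes, key_column):
    """Copy table columns onto nodes whose key matches; return the match count."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    matched = 0
    for row in reader:
        key = row.pop(key_column)
        if key in nodes:
            nodes[key].update({name: float(value) for name, value in row.items()})
            matched += 1
    return matched

matched_rows = import_node_columns(MATRIX_TEXT, node_table, "Gene symbol")
print(matched_rows, node_table["HSP104"])
```

Rows without a matching node (XYZ1 here) are skipped, and nodes without a matching row (RPL3 here) simply get no expression values, which is why some proteins in the network end up without fold change data.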
Now we have a protein-protein interaction network where each of the proteins is annotated with the 5 minute (GPL51-01), 10 minute (GPL51-02), 15 minute (GPL51-03), 20 minute (repeat) (GPL51-05), 40 minute (GPL51-06), 60 minute (GPL51-07), and 80 minute (GPL51-08) heat shock expression fold changes.
Clustering the PPI Network
Our network is too dense for easy interpretation, so the next step is to break the network up into clusters representing tightly connected groups of proteins such as complexes. We will use Leiden clustering to do this.
- Select Apps → clusterMaker Cluster Network → Leiden Clusterer (remote) to bring up the Leiden cluster options.
- Set the Resolution parameter to 0.5 and Number of iterations to 30.
- In the Source for array data section, select stringdb::score as the Attribute. This is the edge confidence score assigned by STRING.
- Select Create new clustered network and click OK.
The resulting network should look similar to this. Note that we have disabled the Glass ball effect and STRING style labels in the STRING results panel at the right.
Exploring Leiden Clusters
We can explore some of the clusters to confirm that Leiden has done a reasonable job. To do this, we will run functional enrichment analysis on the individual clusters.
- Select the first 5 clusters iteratively (top left) by click and drag, then select Functional enrichment for each one in the STRING results panel at the right.
- Select genome for Network to be used as background and click OK to continue.
- The results will open in the STRING Enrichment table, sorted by FDR value.
You will notice that the top 4 clusters represent the ribosome, mitochondrial ribosome, preribosome, and large subunit of the preribosome, respectively. Based on this, we can assume that Leiden clustering worked reasonably well.
Simplifying the Network
Now that we've clustered our network, there are a number of nodes that are not part of any cluster (singletons). By default, these are hidden, but before we do our analysis, we want to remove them. The easiest way to do that is to simply select all of our clusters, and create a new network with only those nodes:
- In the right-hand STRING results panel, select the Singletons button to show all of the singletons.
- Now drag-select all of the nodes that are part of the clusters.
- Create a new network with File → New Network → From Selected Nodes, All Edges.
NOTE: we could do the same thing by selecting all of the singletons and deleting them, but this approach allows us to go back to the original clustered network if we desire.
Focus on nodes that have fold change data
Several of the nodes don't have any fold change data, but might be important to understand the biological context of the cluster. For some of the steps we're going to do below, we only want to look at nodes that show differential expression under heat stress. We can easily build a filter that selects all nodes where at least one expression value is not 0.
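The logic of such a filter can be sketched in a few lines of Python. This is only an illustration of the selection rule, not the Cytoscape Filter interface. The column names come from the imported table, while the toy rows, the helper name `has_fold_change`, and the choice to treat a missing value like 0 are assumptions.

```python
EXPRESSION_COLUMNS = [
    "GPL51-01", "GPL51-02", "GPL51-03", "GPL51-05",
    "GPL51-06", "GPL51-07", "GPL51-08",
]

def has_fold_change(row, columns=EXPRESSION_COLUMNS):
    """True if at least one expression value is present and not 0."""
    # Assumption: a missing value (absent or None) counts as 0.
    return any((row.get(column) or 0) != 0 for column in columns)

# Hypothetical node rows for illustration.
rows = {
    "node1": {"GPL51-01": 0.0, "GPL51-02": 1.5},
    "node2": {column: 0.0 for column in EXPRESSION_COLUMNS},
    "node3": {},
}
selected = [name for name, row in rows.items() if has_fold_change(row)]
print(selected)
```

In the Filter interface, the equivalent would be one column condition per time point, combined so that a node passes if any one of them matches.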
Hierarchical Clustering of Expression Data
A classical analysis of an expression data set would involve performing a hierarchical clustering of the data and viewing it as a heat map with an associated dendrogram. We can do this in clusterMaker2 with the Cluster Attributes feature.
- Select Apps → clusterMaker Cluster Attributes → Hierarchical cluster.
- Select all of the heat shock columns (5 min, 10 min, 15 min, 20 min repeat, 40 min, 60 min, and 80 min).
- Select Only use selected nodes/edges for cluster.
- Select Show TreeView when complete and click OK.
A heat map with the associated dendrogram will open when the clustering is complete.
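For readers who want to see what the dendrogram encodes, the sketch below runs a tiny agglomerative clustering in plain Python. It assumes Euclidean distance and average linkage, which may differ from the settings chosen in the clusterMaker dialog, and the four expression profiles are invented. It is not clusterMaker's implementation.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for left, right in combinations(clusters, 2):
            # Average pairwise distance between the members of two clusters.
            total = sum(euclidean(profiles[a], profiles[b]) for a in left for b in right)
            distance = total / (len(left) * len(right))
            if best is None or distance < best[2]:
                best = (left, right, distance)
        left, right, _ = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
        merges.append(best)
    return merges

# Hypothetical fold-change profiles over two time points.
profiles = {
    "up1": [2.0, 3.0],
    "up2": [2.0, 3.4],
    "down1": [-2.0, -3.0],
    "down2": [-2.5, -3.0],
}
merges = average_linkage(profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 2))
```

Each merge corresponds to one join in the dendrogram: the two up-regulated profiles are joined first, then the two down-regulated ones, and the final join connects the two groups at a much larger distance.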
Coloring the PPI Network
To help understand the biological significance of these transcriptional changes at the protein level, we would like to map our hierarchical clustering onto the proteins in our PPI network. This could be useful, for example, to see whether particular complexes are especially affected by transcriptional changes. There are two ways of doing this. The first is to draw the expression values directly on the nodes as HeatStrips:
- In TreeView, select Map Colors Onto Network....
- Select all attributes in the Attribute List, making sure they are in the correct order, and click Create HeatStrips.
This will add bar charts showing the expression fold changes at the various timepoints on the nodes. Unfortunately, this is extremely hard to see when looking at the entire network.
The second way to color the network is to use the information from the hierarchical clustering to create a new attribute representing the overall change of a gene, and then use this for coloring. We will use the ability to select branches of the dendrogram to select the corresponding nodes in the network. Here is how you would go about doing this (but don't do it):
- In the Node Table, click the Create New Column... button and select New Single Column → Integer. Name the new column Colors.
- In TreeView, start by assessing the whole tree to find the most intensely colored yellow (up-regulated) section.
- By clicking and dragging in the heat map, select the most intensely colored yellow part.
- In the Node Table, enter the number 10 in the top cell of the Colors column. Next, right-click on the cell and select Apply to selected nodes.
- Repeat this process with the rest of the up-regulated section of the tree, adjusting the number entered to reflect the degree of up-regulation, with 10 being the highest.
- Repeat the same process again with the down-regulated section of the tree, this time entering numbers from -1 to -10.
Instead, download the Colors data and import it into your network.
Now that we have a new column with values reflecting the degree of regulation, we can use this to create a style mapping:
- In the Style panel, remove the existing mapping for Node Fill Color (from STRING), and add a new continuous mapping for the Colors column, using the standard Red-Blue ColorBrewer palette.
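A continuous mapping like the one in the step above interpolates a color from a numeric value. The Python sketch below shows the idea for the Colors column, which ranges from -10 to 10. The endpoint colors (pure blue, white, pure red), the midpoint at 0, and the function names are assumptions made for illustration. They are not the actual ColorBrewer values that Cytoscape uses.

```python
def lerp(start, end, fraction):
    """Linear interpolation between two channel values, rounded to an integer."""
    return round(start + (end - start) * fraction)

def color_for(value, low=-10, high=10):
    """Map a Colors value to a hex color: blue (low) to white (0) to red (high)."""
    value = max(low, min(high, value))  # clamp to the mapped range
    blue, white, red = (0, 0, 255), (255, 255, 255), (255, 0, 0)
    if value < 0:
        fraction, end = value / low, blue   # 0 at the midpoint, 1 at `low`
    else:
        fraction, end = value / high, red   # 0 at the midpoint, 1 at `high`
    channels = (lerp(s, e, fraction) for s, e in zip(white, end))
    return "#{:02X}{:02X}{:02X}".format(*channels)

for score in (-10, 0, 10):
    print(score, color_for(score))
```

Values between the breakpoints get proportionally lighter shades, so strongly regulated nodes stand out against weakly regulated ones.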
The nodes with data will now be colored based on their overall regulation:
UMAP analysis of Expression Data
We will now look at the same heat shock expression data used for the hierarchical cluster, but this time using the Uniform Manifold Approximation and Projection (UMAP) approach to explore a 2D embedding of this multidimensional data.
- Using the Filter interface, select all nodes with a Colors value not equal to 0 (or use your previously constructed filter).
- Select Apps → clusterMaker Dimensionality Reduction → UMAP (remote).
- Select all seven of the heat shock columns and set the Number of neighbors to 20 and the Minimum distance to 0.5. Check the Only use data from selected nodes option.
- Select Show scatter plot with results and click OK.
Once the UMAP scatter plot comes up, click on Get Colors to apply the red-blue coloring from the nodes to the UMAP.
- In the scatter plot, highlight the red group in the middle.
- In the stringApp interface in the Results Panel, select Functional enrichment. Leave the defaults as-is.
- Filter the results for GO Biological Process. The results indicate significant enrichment for protein folding.
Fuzzy Clustering
Fuzzy clustering can help us understand the relationships between clusters, and help find instances where proteins are shared between clusters. To apply fuzzy clustering, we must start with the fully connected network, but to facilitate interpretation of the results, we will select the nodes from several clusters that are of interest – for example, clusters with high ranking or that show consistently high over-expression or under-expression, rather than the whole network.
- Select Apps → clusterMaker Visualizations → Link selection across networks.
- Next, go to the clustered network. By using click and drag, select all nodes in clusters containing many over-expressed genes. Hold down the Shift key to select more than one cluster.
- The corresponding nodes will now be selected in the original network.
- In the original network, with the nodes selected, choose Apps → clusterMaker Cluster Network → Cluster Fuzzifier.
- Choose stringdb::score as the Array Source, and 1/value as the Edge weight conversion (to convert it to a distance). Also check Cluster only selected nodes and Create new clustered network to see the result.
- Click OK to continue.
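The 1/value conversion in the steps above turns a similarity (a higher STRING score means a more confident interaction) into a distance (a lower value means closer). A minimal Python sketch of that conversion follows. The edge scores are invented, the helper name is hypothetical, and rejecting non-positive scores is an assumption of this sketch, not a statement about how clusterMaker handles them.

```python
def score_to_distance(score):
    """Convert a confidence score into a distance using 1/value."""
    if score <= 0:
        # Assumption: non-positive scores have no meaningful reciprocal.
        raise ValueError("score must be positive")
    return 1.0 / score

# Hypothetical edges with stringdb::score values.
edge_scores = {("A", "B"): 0.5, ("B", "C"): 0.8, ("A", "C"): 1.0}
edge_distances = {edge: score_to_distance(s) for edge, s in edge_scores.items()}
print(edge_distances)
```

Since the network was imported with a confidence cutoff of 0.50, the resulting distances fall between 1 (score 1.0) and 2 (score 0.5).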
- To show the relationships between clusters, apply a force-directed layout using the score as an edge weight......
Cluster Ranking
The goal of ranking is to order the clusters based on some criteria (typically node attributes) to determine the most relevant or important clusters.
- Go back to the clustered network. Click anywhere in the network view window to remove the selection.
- Go to Apps → clusterMaker Ranking → Create rank from multiple nodes and edges (additive sum).
- Choose the same node attributes (GPL51-01 – GPL51-08) and select Basic normalization, but Only positive values for the Two-tailed values normalization.
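To give a feel for what an additive ranking does, the Python sketch below scores each cluster by summing the expression values of its member nodes, counting only positive values, and then sorts the clusters by score. This is a deliberately simplified reading: it skips the normalization step entirely, it is not clusterMaker's implementation, and the clusters, values, and function name are hypothetical.

```python
def rank_clusters(clusters, expression):
    """Score clusters by the sum of positive node values; highest score first."""
    scores = {}
    for cluster_id, members in clusters.items():
        total = 0.0
        for node in members:
            for value in expression.get(node, []):
                if value > 0:  # only positive values contribute
                    total += value
        scores[cluster_id] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical clusters and per-node fold changes.
clusters = {1: ["a", "b"], 2: ["c"]}
expression = {"a": [1.0, -2.0], "b": [0.5, 0.5], "c": [3.0, 1.0]}
ranking = rank_clusters(clusters, expression)
print(ranking)
```

Note that a plain sum like this favors large clusters, which is one reason a normalization option matters in the real dialog.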
The ranking panel can be opened from Apps → clusterMaker Visualizations → Show results from ranking clusters.
By calculating a ranking score for each cluster, we can analyze the relevance of the clusters in terms of the research question. In this case, a higher ranking score would imply that the cluster is more associated with yeast heat shock. We can focus on the highest-ranking cluster and try to assess the biological relevance of the genes in it in two ways:
- Manual lookup: Search the UniProt database to review function.
- Enrichment analysis: Perform functional enrichment analysis on the cluster using the stringApp.
This process can be repeated for other high-ranking clusters.