In these exercises, we will use the stringApp for Cytoscape to retrieve molecular networks from the STRING and STITCH databases. The exercises will teach you how to:
The original version of this tutorial was developed by Lars Juhl Jensen of the Novo Nordisk Center for Protein Research at the University of Copenhagen. We thank Professor Jensen for graciously allowing us to repackage the content for delivery as a Cytoscape tutorial.
To follow the exercises, please make sure that you have the
latest version of Cytoscape installed. Then start Cytoscape and go
to
If you are not already familiar with the STRING database, we highly recommend that you go through the short STRING exercises provided by the Jensen lab to learn about the underlying data before working with them in these exercises.
In this exercise, we will perform some simple queries to retrieve molecular networks based on a protein, a small-molecule compound, a disease, and a topic in PubMed.
Unless the name(s) you entered give unambiguous matches, a
disambiguation dialog will be shown next. It lists all the matches
that the stringApp finds for each query term and selects the first
one for each. Select the one(s) you meant and continue by
pressing the
How many nodes are in the resulting network? How does this
compare to the maximum number of interactors you specified? What
types of information does the
How is this network different from the protein-only network
with respect to node types and the information provided in the
Which additional attribute column do you get in the
Which attribute column do you get in the
In this exercise, we are going to use the stringApp to query the DISEASES database for proteins associated with epithelial ovarian cancer (EOC), retrieve a STRING network for them, and explore the resulting network.
Note that the retrieved network contains a lot of additional
information associated with the nodes and edges, such as the protein
sequence, tissue expression data, subcellular localization, disease score
(
Give an example for a node with the highest and lowest disease score.
The stringApp automatically retrieves information from the COMPARTMENTS database on the subcellular compartments in which the proteins are located. We will look at this information first to better understand the data.
What compartments is ARID1A present in with a confidence of 5? What source do these interactions come from? Hint: you can see what the abbreviations for different evidence types mean here.
Cytoscape allows you to map attributes of the nodes and edges to visual properties such as node color and edge width. Here, we will map the subcellular localization data for nucleus to the node color.
Many proteins are strongly associated with the nucleus – they will be purple.
Because many proteins are located in the nucleus, we will identify the
proteins with the highest confidence score of 5. One way to do this is to use the
COMPARTMENTS sliders in the
How many proteins are found in the nucleus with a confidence of 5? And in the mitochondrion? Hint: You can see the number of hidden nodes in the light grey panel bar at the bottom right of the network view panel, just above the Table panel.
Important: Move the filter back to 0 before continuing with the next exercise.
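The effect of a confidence slider can be sketched in a few lines of Python. This is only an illustration of the thresholding idea, not the stringApp implementation; the protein names, the scores, and the helper `filter_by_confidence` are all hypothetical.

```python
# Hypothetical node table: protein name -> compartment confidence scores (0 to 5).
# The names and values are placeholders, not data from the tutorial network.
nodes = {
    "protein_A": {"nucleus": 5.0, "mitochondrion": 1.2},
    "protein_B": {"nucleus": 4.3, "mitochondrion": 5.0},
    "protein_C": {"nucleus": 5.0, "mitochondrion": 0.0},
    "protein_D": {"nucleus": 2.1, "mitochondrion": 0.4},
}


def filter_by_confidence(table, compartment, minimum):
    """Return the names of nodes whose score for `compartment` is at least `minimum`."""
    return sorted(
        name
        for name, scores in table.items()
        if scores.get(compartment, 0.0) >= minimum
    )


# Moving the slider to 5 keeps only the highest-confidence nodes visible.
visible = filter_by_confidence(nodes, "nucleus", 5.0)
hidden_count = len(nodes) - len(visible)
print(visible, hidden_count)
```

Setting `minimum` back to 0 returns every node, which mirrors resetting the filter before the next exercise.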
In this exercise, we will work with a list of 541 proteins associated with epithelial ovarian cancer (EOC) as identified by phosphoproteomics in the study by Francavilla et al. An adapted, simplified version of their results table can be downloaded here. Download the file and open it in Excel or a similar tool.
How many nodes and edges are there in the resulting network? Do the proteins all form a connected network? Why?
Cytoscape provides several visualization options under the
Can you find a layout that allows you to easily recognize patterns in the network? What about the Edge-weighted Spring Embedded Layout with the attribute ‘score’, which is the combined STRING interaction score?
Cytoscape allows you to map attributes of the nodes
and edges to visual properties such as node color and edge
width. Here, we will map drug target family data from the Pharos database to the node
color. This data is contained in the node attribute called
This action will remove the rainbow coloring of the nodes and present you with a list of all the different values of the attributes that exist in the network.
Which target families are present in the network?
How many of the proteins in the network are ion channels or GPCRs?
There are many kinases in the network. We can avoid counting them manually by creating a selection filter.
How many kinases are in the network?
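Conceptually, a selection filter on a categorical column amounts to counting nodes per category. The sketch below assumes a hypothetical mapping from node names to target family labels; the labels and the variable names are placeholders and do not reflect the actual column name or values in the tutorial network.

```python
from collections import Counter

# Hypothetical node -> target family mapping; None marks nodes without a family.
node_family = {
    "protein_A": "Kinase",
    "protein_B": "Kinase",
    "protein_C": "GPCR",
    "protein_D": "Ion channel",
    "protein_E": None,
}

# Count how many nodes fall into each family, ignoring unannotated nodes.
family_counts = Counter(
    family for family in node_family.values() if family is not None
)

# A selection filter for one family is a simple equality test on the column.
kinases = sorted(name for name, family in node_family.items() if family == "Kinase")

print(family_counts)
print(kinases)
```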
Network nodes and edges can have additional information associated with them that we can load into Cytoscape and use for visualization. We will import the data from the text file.
Now you need to map unique identifiers between the entries in the data and the nodes in the network, so that Cytoscape knows which nodes in the network correspond to which entries in the table. This enables mapping of data values into visual properties like Fill Color and Shape. The mapping is typically done by comparing the unique identifier attribute value for each node (Key Column for Network) with the unique identifier value for each data entry (key symbol). By default, Cytoscape looks for an attribute value of ‘ID’ in the network and a user-supplied Key in the dataset.
The
If there is a match between the value of a Key in the dataset
and the value of the Key Column for Network field in the network, all
attribute-value pairs associated with the element in the dataset are
assigned to the matching node in the network. You will find the imported
columns at the end of the
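The key-based matching described above can be illustrated with a minimal Python sketch. It is not Cytoscape's import code; the function `import_table` and the column names `id`, `protein_id`, and `log_ratio` are hypothetical and chosen only for the example.

```python
def import_table(network_nodes, rows, network_key, table_key):
    """Copy all columns of each matching table row onto the corresponding node.

    Returns the number of nodes that received data.
    """
    # Index the table rows by their key value for constant-time lookup.
    index = {row[table_key]: row for row in rows}
    matched = 0
    for node in network_nodes:
        row = index.get(node.get(network_key))
        if row is None:
            continue  # no table entry for this node, so leave it unchanged
        for column, value in row.items():
            if column != table_key:
                node[column] = value
        matched += 1
    return matched


# Hypothetical network nodes and data rows.
network_nodes = [{"id": "P1"}, {"id": "P2"}, {"id": "P3"}]
rows = [
    {"protein_id": "P1", "log_ratio": 2.5},
    {"protein_id": "P3", "log_ratio": -1.0},
    {"protein_id": "P9", "log_ratio": 0.3},  # not in the network, so ignored
]

matched = import_table(network_nodes, rows, network_key="id", table_key="protein_id")
print(matched, network_nodes)
```

Nodes without a matching key keep their existing attributes, and table rows without a matching node are dropped, which is why choosing the right key columns matters.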
Now, we want to color the nodes according to the quantitative phosphorylation data (log ratio) between disease and healthy tissues for the most significant site for each protein.
Are the up-regulated nodes grouped together? Do you see any issues with the color gradient?
Can you improve the color mapping such that it is easier to see which nodes have a log ratio below -4 and above 4?
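One way to think about such a mapping is as a three-point gradient whose end points are fixed at -4 and 4, so that more extreme values are clamped to the end colors. The sketch below is an illustration under that assumption; the blue-white-red colors and the function names are hypothetical and are not Cytoscape defaults.

```python
def lerp(start, end, t):
    """Linearly interpolate between two channel values and round to an integer."""
    return round(start + (end - start) * t)


def log_ratio_to_rgb(value, low=-4.0, high=4.0):
    """Map a log ratio to an RGB color on a blue-white-red gradient.

    Assumes low < 0 < high. Values outside [low, high] are clamped,
    so outliers do not stretch the gradient.
    """
    low_color, mid_color, high_color = (0, 0, 255), (255, 255, 255), (255, 0, 0)
    clamped = max(low, min(high, value))
    if clamped < 0:
        t = clamped / low  # 0 at the midpoint, 1 at the lower bound
        target = low_color
    else:
        t = clamped / high  # 0 at the midpoint, 1 at the upper bound
        target = high_color
    return tuple(lerp(m, c, t) for m, c in zip(mid_color, target))


for ratio in (-7.0, -4.0, 0.0, 4.0, 10.0):
    print(ratio, log_ratio_to_rgb(ratio))
```

With fixed bounds, every node at or beyond -4 or 4 gets the full end color, which makes the strongly regulated nodes easy to spot.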
Next, we will use the MCL algorithm to identify clusters of
tightly connected proteins within the network. Go to the menu
How many clusters have at least 10 nodes?
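For readers curious about what MCL does, the following is a minimal pure-Python sketch of the Markov clustering idea: alternate expansion (matrix squaring) and inflation (element-wise power followed by column normalization), then read clusters off the rows that retain flow. It is not the implementation used in Cytoscape; the inflation value of 2.0, the fixed iteration count, and the toy graph are assumptions made for the example.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def multiply(a, b):
    """Plain square matrix multiplication."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50, tol=1e-6):
    """Cluster an undirected graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then make the matrix column-stochastic.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = multiply(m, m)  # expansion
        m = normalize_columns([[v ** inflation for v in row] for row in m])  # inflation
    # Each row that still holds flow defines a cluster: the columns it attracts.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > tol)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: two disconnected triangles (nodes 0-2 and 3-5).
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(adjacency)
large_clusters = [c for c in clusters if len(c) >= 3]
print(clusters, len(large_clusters))
```

Counting clusters above a size threshold, as in the question above, is then a matter of filtering the cluster list by length.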
We will work with the largest cluster in the network (it should be
in the upper left corner). Select the nodes of this cluster by holding
down the modifier key (Shift on Windows, Ctrl or Command on Mac) and
then left-clicking and dragging to select multiple nodes. Then, create
a new network by clicking on the
How many nodes and edges are there in this cluster?
Next, we will retrieve functional enrichment for the proteins in our
network of the largest cluster. First, we have to tell Cytoscape
that the new network we created is a STRING network. To do so, go to the menu
After making sure that no nodes are selected in the network, go
to the menu
Which are the four most statistically significant terms? Do the UniProt and GO Process terms agree with each other, i.e., annotate the same set of nodes?
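To build intuition for what a p-value in an enrichment table expresses, here is a generic over-representation test based on the hypergeometric distribution. The tutorial does not state which statistical procedure STRING uses, so this should be read as a textbook sketch and not as a description of the STRING method; the function name and the numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """Probability of seeing at least `overlap` annotated proteins in the sample.

    population: number of proteins in the background
    annotated:  number of background proteins carrying the term
    sample:     number of proteins in the network of interest
    overlap:    number of network proteins carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Toy numbers: 10 background proteins, 4 annotated, 3 in the network, 2 of them annotated.
p = enrichment_p_value(population=10, annotated=4, sample=3, overlap=2)
print(p)
```

In practice, p-values for many terms are also corrected for multiple testing, which this sketch leaves out.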
Next, we will visualize the top-5 enriched terms in the network using
split charts. Click the colorful chart icon to show the terms as
charts on the network. You can manually change the layout of the network
to improve the visualization. First apply the
To retrieve a list of publications that are enriched for the proteins
in the network, go to the menu
What is the title of the most recent publication?
To save the list of enriched terms and associated p-values as a text
file, go to
Cytoscape provides functionality to merge two or more networks,
building either their union, intersection or difference. We will now
merge the EOC network we have from the DISEASES query with the one we
have from the data, so that we can identify the overlap between them. Use
the Merge tool (
How many nodes are in the intersection?
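The three merge operations correspond to ordinary set operations on the node sets of the two networks. The sketch below uses hypothetical node names and a hypothetical `disease_score` attribute to show the idea; it does not model edges or Cytoscape's attribute-matching options.

```python
# Hypothetical networks: node name -> attributes.
disease_network = {
    "protein_A": {"disease_score": 4.2},
    "protein_B": {"disease_score": 3.1},
    "protein_C": {"disease_score": 2.0},
}
experimental_network = {
    "protein_B": {"log_ratio": 1.5},
    "protein_C": {"log_ratio": -2.2},
    "protein_D": {"log_ratio": 0.4},
}

disease_nodes = set(disease_network)
experimental_nodes = set(experimental_network)

intersection = disease_nodes & experimental_nodes  # nodes present in both networks
union = disease_nodes | experimental_nodes         # nodes present in either network
difference = disease_nodes - experimental_nodes    # nodes only in the first network

# In a union, attributes from both networks end up on the shared nodes.
merged = {
    name: {**disease_network.get(name, {}), **experimental_network.get(name, {})}
    for name in sorted(union)
}

print(sorted(intersection), sorted(union), sorted(difference))
print(merged["protein_B"])
```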
Now we will make the union of the intersection network, which
contains the disease scores, and the experimental network. Use the
Now, we can change the visualization of the merged network to be
able to identify high disease score proteins. Specifically, we will
change the size of the nodes as a function of their disease score. Select
In this exercise, we will retrieve virus-host networks for two closely related viruses, merge them into a single network, and then retrieve the functional enrichment for the host proteins in this network.
Go to the menu
How many virus proteins are encoded by this virus? What node information is imported along with the names of the proteins?
To retrieve interactions with host proteins, go to
The resulting network will be automatically re-styled such that the nodes representing virus proteins are red and host proteins are green-blue. These attributes can be changed from the Cytoscape Style menu.
Which human protein has the highest interaction score to one of
the virus proteins? What cellular functions is this protein involved
in? (Hint: open the results panel under
Additional viruses or hosts can be added to the network by repeating this procedure, but this will only add proteins that interact with the proteins that are already in the network. This works fine when adding new hosts, since all virus proteins are already in the network. However, to add new viruses, we recommend merging the expanded networks for each virus.
If a specific host protein is desired, it can also be included in
the network from the
Which HPV proteins does p53 interact with?
Note that p53 will be added to the network in the previous step if more proteins are imported or the selectivity is set to a lower value. Choosing a lower selectivity will include more hub proteins (such as p53) that are connected to many proteins, and that do not interact specifically with proteins in your network. Conversely, choosing a higher selectivity will include more proteins that are more specific to your network, but these interactions will have lower confidence (since any higher confidence hub proteins will be filtered out). Further, be aware that changing the selectivity parameter will change the enrichment results in step 4.5, since different proteins will be included in the host network.
Let us now compare the networks for HPV 16 and HPV 1a. Create a new
host-virus network for “Human papillomavirus type 1a (HPV 1a)” by
repeating steps 4.1 and 4.2. Merge the two networks using
The resulting network can be styled to give the nodes of each species a distinct color so that the proteins of the two viruses can be distinguished from each other.
How many host proteins interact with E6 from both HPV species?
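Answering this kind of question amounts to intersecting the neighbor sets of the two E6 nodes. The following sketch uses a hypothetical edge list with placeholder host names; the helper `hosts_of` is not part of Cytoscape or the stringApp.

```python
# Hypothetical virus-host edges: (virus, virus protein, host protein).
edges = [
    ("HPV 16", "E6", "host_A"),
    ("HPV 16", "E6", "host_B"),
    ("HPV 16", "E7", "host_C"),
    ("HPV 1a", "E6", "host_B"),
    ("HPV 1a", "E6", "host_D"),
]


def hosts_of(edge_list, virus, protein):
    """Return the set of host proteins that interact with one virus protein."""
    return {host for v, p, host in edge_list if v == virus and p == protein}


# Host proteins that interact with E6 from both viruses.
shared_hosts = hosts_of(edges, "HPV 16", "E6") & hosts_of(edges, "HPV 1a", "E6")
print(sorted(shared_hosts))
```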
We will now examine the human proteins to see what pathways are enriched in this network.
Next, we will retrieve functional enrichment for the human proteins. Go
to the menu
Which two KEGG pathways have the lowest p-values? Which host proteins are associated with the KEGG pathway “cell cycle”? (Hint: click on the associated row in the enrichment table to select the proteins with this term.)
Doncheva NT, Morris JH, Gorodkin J and Jensen LJ (2018). Cytoscape stringApp: Network analysis and visualization of proteomics data.
Preprint