In these exercises, we will use the stringApp for Cytoscape to retrieve molecular networks from the STRING and STITCH databases. The exercises will teach you how to:
The original version of this tutorial was developed by Lars Juhl Jensen of the Novo Nordisk Center for Protein Research at the University of Copenhagen. We thank Professor Jensen for his gracious willingness to allow us to repackage the content for delivery as a Cytoscape tutorial.
To follow the exercises, make sure that you have the latest version of Cytoscape installed. The exercises also require the following Cytoscape apps: stringApp, enhancedGraphics and clusterMaker2, as well as the yFiles layout algorithms.
If you are not already familiar with the STRING database, we highly recommend that you go through the short STRING exercises provided by the Jensen lab to learn about the underlying data before working with them in these exercises.
In this exercise, we will perform some simple queries to retrieve molecular networks based on a protein, a small-molecule compound, a disease, and a topic in PubMed.
Unless the name(s) you entered give unambiguous matches, a
disambiguation dialog will be shown next. It lists all the matches
that the stringApp finds for each query term and selects the first
one for each. Select the right one(s) you meant and continue by
pressing the
How many nodes are in the resulting network? How does this
compare to the maximum number of interactors you specified? What
types of information does the
How is this network different from the protein-only network
with respect to node types and the information provided in the
Which additional attribute column do you get in the
Which attribute column do you get in the
In this exercise, we are going to use the stringApp to query the DISEASES database for proteins associated with epithelial ovarian cancer (EOC), retrieve a STRING network for them, and explore the resulting network.
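The same kind of disease query can also be issued through Cytoscape's command interface instead of the search bar. The sketch below only builds the command string; it assumes the stringApp command `string disease query` with the arguments `disease`, `cutoff`, `limit` and `species`, and the disease name, confidence cutoff and protein limit shown are illustrative values, not the settings prescribed by this exercise.

```python
def string_disease_query(disease, cutoff=0.4, limit=100, species="Homo sapiens"):
    """Build a stringApp command string for a DISEASES-based network query.

    The argument names mirror the stringApp command as assumed in the text;
    check them against the command help in your installed stringApp version.
    """
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be between 0 and 1")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        f'string disease query disease="{disease}" '
        f'cutoff={cutoff} limit={limit} species="{species}"'
    )


# Illustrative values only (assumption, not taken from the exercise).
command = string_disease_query("ovarian cancer", cutoff=0.4, limit=100)
print(command)
```

The resulting string could then be pasted into Cytoscape's Command Line Dialog or sent through an automation client.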
Note that the retrieved network contains a lot of additional
information associated with the nodes and edges, such as the protein
sequence, tissue expression data, subcellular localization, disease score
(
Give an example of a node with the highest disease score and one with the lowest.
The stringApp automatically retrieves the subcellular compartments in which the proteins are located from the COMPARTMENTS database. We will look at this information first to better understand the data.
What compartments is ARID1A present in with a confidence of 5? What source do these interactions come from? Hint: you can see what the abbreviations for different evidence types mean here.
Cytoscape allows you to map attributes of the nodes and edges to visual properties such as node color and edge width. Here, we will map the subcellular localization data for nucleus to the node color.
Many proteins are strongly associated with the nucleus – they will be purple.
Because many proteins are located in the nucleus, we will identify the
proteins with the highest confidence score of 5. One way to do this is to use the
COMPARTMENTS sliders in the
How many proteins are found in the nucleus with a confidence of 5? And in the mitochondrion? Hint: You can see the number of hidden nodes in the light grey panel bar on the bottom-right part of the network view panel, just above the Table panel.
Important: Move the filter back to 0 before continuing with the next exercise.
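The effect of the confidence slider can be pictured as a simple threshold on a node attribute. The sketch below is a rough stand-in for that filter, not the stringApp implementation. The column names `compartment::nucleus` and `compartment::mitochondrion` are assumptions about how the scores are stored, and the node names and scores are made-up toy data.

```python
def nodes_at_confidence(nodes, column, minimum):
    """Return the names of nodes whose score in `column` is at least `minimum`.

    Nodes without a value in that column are treated as having a score of 0.
    """
    return [node["name"] for node in nodes if node.get(column, 0) >= minimum]


# Toy node table (hypothetical names and scores, for illustration only).
toy_nodes = [
    {"name": "protA", "compartment::nucleus": 5.0, "compartment::mitochondrion": 1.0},
    {"name": "protB", "compartment::nucleus": 3.2, "compartment::mitochondrion": 5.0},
    {"name": "protC", "compartment::nucleus": 5.0},
]

nuclear = nodes_at_confidence(toy_nodes, "compartment::nucleus", 5)
print(len(nuclear), nuclear)
```

Setting `minimum` back to 0 keeps every node, which corresponds to resetting the slider.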
In this exercise, we will work with a list of 541 proteins associated with epithelial ovarian cancer (EOC) as identified by phosphoproteomics in the study by Francavilla et al. An adapted, simplified version of their results table can be downloaded here. Download the file, and open it in Excel or a similar tool.
How many nodes and edges are there in the resulting network? Do the proteins all form a connected network? Why?
Cytoscape provides several visualization options under the
Can you find a layout that allows you to easily recognize patterns in the network? What about the Edge-weighted Spring Embedded Layout with the attribute ‘score’, which is the combined STRING interaction score?
Cytoscape allows you to map attributes of the nodes
and edges to visual properties such as node color and edge
width. Here, we will map drug target family data from the Pharos database to the node
color. This data is contained in the node attribute called
This action will remove the rainbow coloring of the nodes and present you with a list of all the different values of the attributes that exist in the network.
Which target families are present in the network?
How many of the proteins in the network are ion channels or GPCRs?
There are many kinases in the network. We can avoid counting them manually by creating a selection filter.
How many kinases are in the network?
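Counting nodes per category, which is what the selection filter does for kinases, amounts to tallying the values of one attribute column. The following is a minimal sketch on an exported node table; the column name `family` and the toy rows are hypothetical placeholders, not the actual attribute name or data from the network.

```python
from collections import Counter


def count_families(nodes, family_column):
    """Tally how many nodes carry each value in `family_column`.

    Nodes with a missing or empty value are skipped.
    """
    return Counter(node[family_column] for node in nodes if node.get(family_column))


# Toy rows with a hypothetical column name and made-up assignments.
toy_nodes = [
    {"name": "protA", "family": "Kinase"},
    {"name": "protB", "family": "GPCR"},
    {"name": "protC", "family": "Kinase"},
    {"name": "protD", "family": ""},
]

counts = count_families(toy_nodes, "family")
print(counts["Kinase"])
```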
Network nodes and edges can have additional information associated with them that we can load into Cytoscape and use for visualization. We will import the data from the text file.
Now you need to map unique identifiers between the entries in the data
and the nodes in the network. The key point of this is to identify which
nodes in the network are equivalent to which entries in the table. This
enables mapping of data values into visual properties like Fill Color
and Shape. This kind of mapping is typically done by comparing the unique
identifier attribute value for each node (
The
If there is a match between the value of a
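The key-based matching described above behaves like a left join from the node table onto the imported table. The sketch below illustrates that idea under stated assumptions: the key column names `query term` and `UniProt`, the data column `log_ratio`, and all values are hypothetical and only stand in for whatever columns you choose in the import dialog.

```python
def import_table(nodes, rows, node_key, table_key):
    """Copy columns from `rows` onto the nodes whose key values match.

    Returns the number of nodes that received data. Nodes without a
    matching row are left unchanged.
    """
    index = {row[table_key]: row for row in rows}
    matched = 0
    for node in nodes:
        row = index.get(node.get(node_key))
        if row is None:
            continue
        for column, value in row.items():
            if column != table_key:
                node[column] = value
        matched += 1
    return matched


# Hypothetical column names and values, for illustration only.
toy_nodes = [{"name": "n1", "query term": "ID1"}, {"name": "n2", "query term": "ID2"}]
toy_rows = [{"UniProt": "ID1", "log_ratio": 1.5}, {"UniProt": "ID9", "log_ratio": -2.0}]
print(import_table(toy_nodes, toy_rows, "query term", "UniProt"))
```

Rows whose key matches no node are simply ignored, which is why the choice of key column matters.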
Now, we want to color the nodes according to the quantitative phosphorylation data (log ratio) between disease and healthy tissues for the most significant site for each protein.
Are the up-regulated nodes grouped together? Do you see any issues with the color gradient?
Can you improve the color mapping such that it is easier to see which nodes have a log ratio below -4 and above 4?
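One way to think about the improved mapping is a diverging gradient between -4 and 4 with separate, fixed colors for values outside that range. The sketch below is one possible reading of the question, not the tutorial's reference answer; the specific colors (blue, white, red, plus darker shades for out-of-range values) are assumptions chosen for illustration.

```python
MID_RGB = (255, 255, 255)   # white at log ratio 0 (assumed)
LOW_RGB = (0, 0, 255)       # blue at the lower bound (assumed)
HIGH_RGB = (255, 0, 0)      # red at the upper bound (assumed)
BELOW = "#000080"           # dark blue for values below the lower bound
ABOVE = "#800000"           # dark red for values above the upper bound


def log_ratio_to_hex(value, low=-4.0, high=4.0):
    """Map a log ratio to a hex color on a diverging scale centered at 0."""
    if not low < 0.0 < high:
        raise ValueError("expected low < 0 < high")
    if value < low:
        return BELOW
    if value > high:
        return ABOVE
    # Fraction of the way from the midpoint to the relevant bound.
    if value < 0:
        target, t = LOW_RGB, value / low
    else:
        target, t = HIGH_RGB, value / high
    rgb = tuple(round(m + (c - m) * t) for m, c in zip(MID_RGB, target))
    return "#{:02X}{:02X}{:02X}".format(*rgb)


for example in (-6.2, -4.0, 0.0, 4.0, 9.0):
    print(example, log_ratio_to_hex(example))
```

Giving out-of-range values their own colors makes extreme sites stand out instead of blending into the ends of the gradient.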
Next, we will use the MCL algorithm to identify clusters of
tightly connected proteins within the network. Go to the menu
How many clusters have at least 10 nodes?
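For readers curious about what MCL does, here is a minimal pure-Python sketch of Markov clustering: alternate expansion (matrix squaring) and inflation (element-wise power followed by column normalization) until the matrix stops changing. It is a teaching sketch, not the clusterMaker2 implementation; the inflation value of 2.0, the self-loop weight of 1 and the toy graph are assumptions.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1 (all-zero columns stay zero)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Cluster a graph given as a square (optionally weighted) adjacency matrix."""
    n = len(adjacency)
    # Adding self-loops is a common MCL preprocessing step.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(max_iterations):
        # Expansion: take two steps of the random walk.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        inflated = normalise_columns([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if change < tolerance:
            break
    # Each non-empty row lists the members attracted to that node.
    clusters = []
    for row in m:
        members = sorted(j for j, v in enumerate(row) if v > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Toy graph: two disconnected triangles (nodes 0-2 and 3-5).
toy = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
clusters = mcl(toy)
print(clusters, sum(1 for c in clusters if len(c) >= 3))
```

The last expression counts clusters above a size threshold, which is the same kind of tally the question asks for with a threshold of 10.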
We will work with the largest cluster in the network (it should be
in the upper left corner). Select the nodes of this cluster by holding
down the modifier key (Shift on Windows, Ctrl or Command on Mac) and
then left-clicking and dragging to select multiple nodes. Then, create
a new network by clicking on the
How many nodes and edges are there in this cluster?
Next, we will retrieve functional enrichment for the proteins in our network of the largest cluster.
After making sure that no nodes are selected in the network, go
to the menu
Which are the four most statistically significant terms? Do the UniProt and GO Process terms agree with each other, i.e., annotate the same set of nodes?
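To build intuition for what "statistically significant" means here, the sketch below computes an over-representation p-value with a one-sided hypergeometric test, a common choice for term enrichment. Whether the enrichment retrieved by stringApp is computed exactly this way, and how it is corrected for multiple testing, is not stated in this tutorial, so treat this as an assumption. All counts in the example are made up.

```python
from math import comb


def enrichment_p_value(term_in_network, network_size, term_in_background, background_size):
    """P(X >= term_in_network) under a hypergeometric null model.

    X is the number of term-annotated proteins seen when drawing
    `network_size` proteins at random from a background of
    `background_size` proteins, `term_in_background` of which carry the term.
    """
    upper = min(term_in_background, network_size)
    favourable = sum(
        comb(term_in_background, i)
        * comb(background_size - term_in_background, network_size - i)
        for i in range(term_in_network, upper + 1)
    )
    return favourable / comb(background_size, network_size)


# Made-up counts: 8 of 40 network proteins carry a term that annotates
# 200 of 20000 background proteins.
print(enrichment_p_value(8, 40, 200, 20000))
```

In practice the raw p-values of all tested terms would additionally be corrected for multiple testing.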
Next, we will visualize the top-5 enriched terms in the network using
split charts. Click the colorful chart icon to show the terms as
charts on the network. You can manually change the layout of the network
to improve the visualization. First apply the
To retrieve a list of publications that are enriched for the proteins
in the network, go to the menu
What is the title of the most recent publication?
To save the list of enriched terms and associated p-values as a text
file, go to
Cytoscape provides functionality to merge two or more networks,
building either their union, intersection or difference. We will now
merge the EOC network we have from the DISEASES query with the one we
have from the data, so that we can identify the overlap between them. Use
the Merge tool (
How many nodes are in the intersection?
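At the node level, the three merge operations correspond to set operations on a shared identifier. The sketch below shows only that idea; Cytoscape's Merge tool also handles edges and attribute columns, which are left out here, and the identifiers are made-up placeholders.

```python
def merge_node_sets(first, second, operation):
    """Combine two collections of node identifiers.

    `operation` is one of "union", "intersection" or "difference"
    (nodes in `first` that are not in `second`).
    """
    a, b = set(first), set(second)
    if operation == "union":
        return a | b
    if operation == "intersection":
        return a & b
    if operation == "difference":
        return a - b
    raise ValueError(f"unknown operation: {operation}")


# Placeholder identifiers standing in for the two networks.
disease_network = ["ID1", "ID2", "ID3"]
data_network = ["ID2", "ID3", "ID4", "ID5"]

overlap = merge_node_sets(disease_network, data_network, "intersection")
print(len(overlap), sorted(overlap))
```

The merge only finds the overlap if both networks use the same identifier column as the matching key.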
Now we will make the union of the intersection network, which
contains the disease scores, and the experimental network. Use the
Now, we can change the visualization of the merged network so that
proteins with a high disease score are easy to identify. Specifically, we will
scale the size of the nodes according to their disease score. Select
Doncheva NT, Morris JH, Gorodkin J and Jensen LJ (2018). Cytoscape stringApp: Network analysis and visualization of proteomics data.
Preprint