What is small RNA (sRNA)?
50-400bp RNA molecules, abundant in bacteria, not translated to protein, but function directly as structural, regulatory or even catalytic RNA.
RNA processing and modification
Messenger RNA stability and translocation
RNA molecules may form secondary structures stabilized by base-pairing of two different parts of the molecule.
What do we know about sRNA?
First few were discovered by chance, and found to have regulatory function
In recent years computational approaches were employed to predict new sRNAs
During 2001, several works were published and 55 new sRNAs were predicted and identified
Binding to complementary sequences by basepairing (--> SRTF's focus)
Interacting with proteins
One of the known examples of sRNA regulation includes a sRNA molecule called oxyS which regulates the translation of a gene called fhlA in response to oxidative stress in E. Coli. Both oxyS and fhlA molecules fold independently to form secondary structures of stems and loops. When the two molecules are present in the cell at the same time, oxyS inhibits the translation of the fhlA gene by forming 2 links (loop to loop antisense basepairing) between the two molecules, capturing fhlA's Shine-Dalgarno and AUG region (which are vital for translation initiation).
What is SRTF?
SRTF stands for Small RNA Target Finder. It is a heuristic program, written in Perl, which grades all E. Coli genes by their relative quality as target genes for a given sRNA sequence. The program uses different parameters to score each gene, including free energy calculations, secondary structure of the given sRNA sequence and analysis of sRNA-target gene match pattern.
Because there are only few examples of translation regulation by antisense basepairing of sRNA, SRTF uses a heuristic approach, trying to find genes that best answer several criteria which characterizes the known examples. SRTF scans each gene for basepairing matches to the given sRNA, then it grades every match based on how well it can influence translation initiation (based on parameters like match length, match proximity to the gene translation initiation site, etc.). The final analysis phase of every gene, includes the detection of match combinations, that can jointly maximize the chances of the gene translation prevention by the given sRNA.
The complete E. coli genome file including annotations, downloaded from NCBI.
MFold output files
The MFold program (Zuker M. 1989) predicts the secondary structure of a given RNA molecule based on free energy considerations. SRTF uses MFold in order to get the predicted secondary structure of the given sRNA sequence. SRTF processes the MFold output in order to get the following information about the sRNA structure:
Basepairing Specifies whether a certain position in the sRNA sequence is basepaired to a different base of the molecule, or whether it is unbasepaired (free, and thus more easily available to form basepairing with bases of the target molecule).
Extension Specifies the "height" of a given base in the sRNA molecule. The extension value of a given sRNA position is calculated as its distance (in bp) from the beginning of the stem it is located on. All bases located in a loop will have the same height the length of the stem leading to that loop. SRTF assumes that sRNA regions which are located on a long extension would be more readily available to reach key regions on the target gene and attach to them, then sRNA regions buried deep within the molecule secondary structure.
Picture: An example of MFOLD graphic output for the sRNA oxyS.
SRTF scans all genes, looking for matches of a gene window to a sRNA window. SRTF currently uses a 5 base minimal window size. Every match found (at least 5 sRNA bases which are complement to same number of bases in the currently scanned gene) is being stored for later analysis. When this phase is completed, SRTF holds a list of hits found within each gene, when every hits includes its 4 basic properties (raw properties): sRNA position, target position, hit length and hit free energy.
HIT = (sRNA pos , target pos , length ,energy)
Thanks to phase 1 (where SRTF collected data about the sRNA secondary structure), once we have the sRNA position of a certain hit, we know if the hit is located on a basepaired region or on an extension of the sRNA molecule. That information enables us to evaluate the quality of the hit and grade it accordingly in the next phase.
Important Note: SRTF displays sRNA positions starting from the 3`. This way it's easier to immediately see the relative position of two or more hits. Therefore, base number "sRNA pos" (starting from 3`) on the sRNA sequence is complement to base number "Target pos" on the target gene (starting from 5`).
Hits whose "Target Position" located around the Shine-Dalgarno & AUG region of the target gene are more likely to influence the translation initiation of that gene, making them interesting to us as we are interested translation regulation.
Important Note: Target Position 0 is the first base of the AUG on the target gene.
Hit Length Number of bases included in the match. SRTF has a minimal match length of 5. Every hit includes the exact same number of bases on the sRNA molecule and on the Target gene mRNA molecule.
Hits which contain many G and C bases would usually have a much lower energy value (The lower the value the stronger the match is).
Hits with a very negative value of free energy would mean that the hit is stronger more energy is needed to break that bond. Strong matches are preferable energetically, thus more likely to occur.
analyzeHitsFromPos This SRTF parameter defines mRNA position lower bound of hits that shall be included in the gene grading process. Only hits which their target position is higher than that parameter will be included.
analyzeHitsUntilPos This SRTF parameter defines mRNA position upper bound of hits that shall be included in the gene grading process. Only hits which their target position is lower than that parameter will be included.
Note: Phases 3 & 4 are performed on all E. Coli genes sequentially. Gene after gene, SRTF first grades each hit found on that gene independently, and then SRTF gives a total grade for the gene by finding best hits combination and giving a score to that combination.
In this phase, SRTF gives an independent score to each hit it found on the currently graded gene. Hit score is calculated based on the 4 basic hit properties: sRNA position, Target position, Length and Energy.
Hit Score Hit score specifies the relative estimated hit quality as involved in an effective translation regulation of a gene by a given sRNA.
Long and energetically stable matches, located on extended sRNA loops and in close proximity to the gene's Ribosomal binding site, should get a high score because they have a higher chance to be involved in translation regulation. Therefore, Hit Scores are calculated based on 5 sub-scores: 2 of them are basic hit properties (length and free energy) and the other 3 are functions positionScore (translates Target gene position to Target Position sub-score), bpScore (translates sRNA position to basepairing sub-score) and extensionScore (translates sRNA position to extension sub-score).
Hit Score =
(length / lengthThreshold ) * l00 * lengthWeight +
(energy / energyThreshold ) * 100 * energyWeight +
(positionScore / positionThreshold ) * 100 * positionWeight +
(extensionScore / extensionThreshold ) * 100 * extensionWeight +
(bpScore / bpThreshold) * 100 * bpWeight
There are 3 sub-score functions included in the hit scoring process:
1. Target gene position score (Position score) As we are interested in hits which can efficiently prevent gene translation, we would give precedence to hits falling on the translation initiation region of genes. We would usually be interested in hits which fall on the Shine-Dalgarno or on the AUG of the gene, blocking access to the ribosome and avoiding it from translating the gene. We would also be interested in hits which are located in close proximity to that important region (the oxyS-fhlA example includes 2 hits on target positions -15 and 33 which capture the AUG of the target).
Hence, the Target position score function may seem like the following graph (it depends on the parameters set) giving the highest grade to hits which fall on the "important region" between the Shine Dalgarno to the AUG, a rather good grade to hits located in close proximity to the AUG downstream, and low grades to distant hits downstream.
There are 7 parameters, defining the position score function:
posMark1 - First downstream mRNA position mark which is used to score hits by mRNA position
posMark2 - First downstream mRNA position mark which is used to score hits by mRNA position
prePosMark1 - Grade which will be given to hits which their mRNA position falls between 0 (AUG) and posMark1
postPosMark2 - Grade which will be given to hits which their mRNA position falls upstream to posMark2
This function uses the secondary structure that SRTF got from the MFold program on phase 1. Hits falling on un-basepaired regions in the sRNA should get higher grade because they are more readily available to connect to complementary window on a target gene, as the sRNA-target basepairing doesn't have to "compete" with internal sRNA basepairing.
sRNA windows containing consecutive un-basepaired bases will get higher bpScore than windows containing same number of un-basepaired bases, but in a non-consecutive pattern.
3. extensionScore Extension score of a given sRNA window measures the height of that window on the sRNA molecule two dimensional topography. sRNA windows located on linker regions will get an Extension Score of 0, while windows located within extended loops will get the height of that loop (number of bases in the loop's stem) as the extension score.
The extension score is supposed to reflect the sRNA topology, giving precedence to hits located on extended branches of the sRNA molecule assuming that sRNA extension located windows will be able to reach their complementary target gene more easily (On the oxyS-fhlA example, both hits are located on sRNA extensions).
Hit Scoring Weights Assigning scoring weight to each of the hit sub-scores enables us to control how dominant each hit property would be in the total score. Setting a sub-score weight to 0 would actually make SRTF to ignore that property when scoring hits.
Thresholds Every hit sub-score is normalized by division by a threshold value. Lower threshold values will increase the sub-score influence on the total hit score. There are 5 threshold parameters: lengthThreshold, positionThreshold, energyThreshold, bpThreshold and extensionThreshold.
After scoring each hit found within the currently graded gene independently, this phase includes the calculation of a total gene score. SRTF tries to identify the best hits combination, so it generates all hit pairs, scores each pair, and then sets the total gene score to be the maximum over all pair scores and all single hit scores.
Gm= max ( Sij , Si)
i=1..K , j=1..K
K = number of hits found on gene m
Si = Score of single hit j
Sij = Score of hits pair (i,j)
Hits Pair Score is calculated as the sum of the two hit score, in addition to 3 per-pair penalties/bonuses (Distance penalty, Ratio Bonus and Cover Fraction Bonus).
Sij = Si + Sj DistancePenalty + RatioBonus + CoverFractionBonus
1. Distance penalty Pair Distance is calculated as difference between target positions of two hits. Distance penalty is reduced from pair score when the distance on the target gene, between two hits, exceeds a certain value. The idea is to give precedence to hit pairs in which the two hits are rather close one to the other on the target gene, as we expect two tight hits to be more stable than 2 hits, hundreds of bases apart (2 tight hits attach the target gene locally stabilizing each other, while allowing most of the mRNA molecule to retain it's original secondary structure).
2. Cover Fraction Bonus - As we are interested in hit pairs which avoid the ribosome from accessing the translation initiation site of the target gene, we can use this "sensor" to identify hit pairs which capture the mRNA's Shine-Dalgarno and AUG regions between them.
Cover Fraction is the fraction of mRNA "important region" which the pair covers. The "important region" would usually be defined as the mRNA region from the approximate Shine-Dalgarno location until the AUG.
100% percent of the Bonus will be given to pairs which capture this entire "mRNA important region" between them, while no bonus will be given to pairs which do not capture any fraction of the "important region" and therefore won't be able influence translation initiation in the most direct way of blocking the translation initiation site.
3. Ratio Bonus Hits Pair Ratio is defined as the distance between the target gene positions of the two hits, divided by the distance between the sRNA positions of the two hits (=The Distance between the two hits on the mRNA divided by the distance between the two hits on the sRNA).
The Ratio is one of the most important elements of the SRTF program as it is used to identify different binding models of sRNA molecules to target gene mRNA molecules.
The following drawing displays the relation between the ratio value and the sRNA-Target binding models. The only model verified in Lab is the "Classic Double Kissing Complex" which is the binding model of the oxyS-fhlA example (ratio value of 0.66). All other models are theoretical. The drawing presents the mapping of ratio value to the different models.
High ratio values are mapped to binding model of 2 hits in very close proximity on the sRNA sequence, as Low ratio values are mapped to binding model of 2 hits in very close proximity on the mRNA sequence. Moderate ratio values are mapped to the classic double kissing complex the oxyS-fhlA model.
Negative ratio values are mapped to CROSSED binding models.
The Ratio Bonus Function gives a bonus to pairs which their ratio is within a desired range. Using Ratio Bonus, SRTF can give precedence to pairs which implement a specific requested binding model.
Ratio Zones The Ratio Bonus Function distinguished between 6 ratio zones. Each Zone is mapped to a different theoretical binding model. Bonus will be given to pairs with ratio value which is found within one of the requested zones.
Example: "Ratio Zones" parameter of "111000" would mean give bonus to pairs with ratio values which are found on the 3 first zones (the 3 negative zones ->requesting only crossed models).
The Ratio Zones are defined according to two parameters (as described in the drawing ahead) :
RatioBonusMargins This parameter is a fraction which determines the margin width of the ratio bonus function. Value of 0 means sharp transitions from 0 points bonus to maximal bonus, while value of 1 means slow and moderate decay of the bonus function around the zone edges. Look at the example below.
enableTargetHitsOverlapping Normally, SRTF would not allow hit pairs which include two hits which overlap in their target gene windows (the score of these pair is set to 0 and thus they will never be chosen as best pair). Setting this Boolean parameter to 1 will enable overlapping of two hits on the target gene.
enableSRnaHitsOverlapping - Normally, SRTF would not allow hit pairs which include two hits which overlap in their sRNA windows (the score of these pair is set to 0 and thus they will never be chosen as best pair). Setting this Boolean parameter to 1 will enable overlapping of two hits on the sRNA.
SRTF Web Site Help
The SRTF web site enables you to execute analysis on pre-prepared data. SRTF data infrastructure includes hits file for all currently known sRNA genes (Minimal hit length: 5bp).
The Find Targets section is the main website module, used to grade all genes according to a given set of parameters.
This screen is used to define the SRTF execution parameters.
sRNA name - Choose sRNA name for which you want to find targets.
Deep scan SRTF pre-prepared data includes two type of hit files. Normal scan is faster, using hits file with minimal window size of 6, and includes hits only at target position range -150 to 500. If the "Deep scan" check box is checked, then SRTF will use hits file with minimal window size of 5, and will include hits at target position range -150 to the end of each gene.
Pre-defined binding models Each drawing presents a different (most are theoretical ) binding model. Clicking on any of the model buttons will set the execution parameters to a pre-defined parameters set, aim at identifying that model.
Show parameters Clicking the small '+' icon will open the parameters form where you can set any of the parameters. You can first choose a pre-defined model by clicking one of the buttons above, and then edit the parameters set to achieve the desired result.
Show external files - Here you can direct SRTF to use external files for its analysis. The first list box enables you to choose a configuration file name located on the server. In time, when more parameter sets are formed, more configuration files can be put on the server. Next on this section you can specify Hits file name, local configuration file (located on your computer) or MFOLD connect (output of the MFOLD program for the sRNA you want to analyze).
After clicking the engage button, SRTF will start grading all genes according to the parameters you have specified. SRTF execution time depends on the following factors:
1. sRNA sequence length
3. Minimal window size and target position hits range Deep scan hit files will take much longer time to run because there as their hit files contain many more hits.
4. Server workload.
If you do not wish to wait until the execution is over, you can either copy the results link and review them later, or specify your e-mail address and SRTF will send you an e-mail containing the results link.
Your execution results will remain on the server for several days.
Find Targets results page The results page is divided to two parts: On the left you can see a listing of all graded genes (including their rank number, name, grade and GI number which is linked to the NCBI gene record). On the right part, you can see a graph describing grades distribution and execution parameters. Clicking on a gene name from the list will open the gene analysis page.
Gene analysis page This page presents SRTF's analysis for a given gene. The Hits table contains all hits included in the analysis marked in pink are the hits selected as best combination on phase 4. Hit values and sub-scores are colored according to the slogan "The Pinker, The better".
Above the table you can see several grading details including ratio, distance and cover values for the selected hits pair.
Clicking on the "Graphic View" button will display a graphic representation of the current gene.
Graphic representation page This page enables you to view the MFOLD graphic output for the SRNA and current gene. On the bottom you can see the sequences of both the sRNA and the current gene, while the selected hits are marked in color.
SRTF express enables you to quickly grade a single gene according to defined set of parameters. First define the execution parameters on the left (including sRNA and gene names), hit the engage button and the gene analysis page for the requested gene should appear on the right.
Hits explorer enables you to query basic hit properties from a hits file. The basic hit properties include target position, sRNA position, hit length and hit energy. All of these parameters are basic and don't include any assumption made by SRTF during the grading process of the "Find Targets" and "SRTF Express" sections.
16. Tetart, F. Bouche, JP. (1992) Molecular Microbiology 6,615-620
21. Zuker, M. (1989) Science 244,48-51