29.10.2018
Ludwika Tomala

Researchers developed a statistical protein dictionary of nature's favourite words

Photo: Fotolia

Researchers from Warsaw analysed the language used by nature in the construction of proteins. They prepared a dictionary of its favourite and least favourite 5-letter "words". The study sheds light on the evolution of proteins.

Every cell of our body contains over ten thousand, or even several tens of thousands, of proteins. Each protein is a tangled chain of amino acids. A protein can be compared to a sentence written with 20 types of letters: the amino acids. Such a sentence can be, for example, several hundred or several thousand amino acid "letters" long, and each sentence has its function. For example, the haemoglobin protein would carry the instruction: "grab oxygen and hold it until you get a signal and then grab carbon dioxide". Researchers are trying to crack the code that evolution uses to put life in motion. It is still largely a foreign language for mankind.

To learn the secrets of this language, scientists around the world study organisms such as chickens, fruit flies, corn, tuberculosis bacilli or humans and sequence the proteins they find in their cells. As a result, over 100 million protein sequences - protein "sentences" - are already stored in publicly available databases, an enormous body of text to analyse.

In one of the stories, Sherlock Holmes used statistics to decipher a message encrypted with strange signs. In the encrypted English text, he searched for the three-letter cluster "the", a very frequently used article in that language. Then he looked for other clusters that are frequent in English. Similarly, Polish scientists used statistics to decipher nature's incomprehensible language.


It all started with the decision of Dr. Marcin Grynberg (Institute of Biochemistry and Biophysics PAS) to check whether there were any combinations of amino acids that nature did not use at all for the production of proteins. "I have always wondered why biology only studies what exists in nature, and not the things that are not there" - he told PAP. Out of curiosity, the biologist wanted to produce such "forbidden" pieces of proteins and check why nature did not like them. Earlier research by other teams indicated that amino acid conglomerates unknown to nature actually did exist.

The idea appealed to several other researchers: Dr. Anna Muszewska, Dr. Marta Hoffman-Sommer, Prof. Jarosław Poznanski, Dr. Krzysztof Pawłowski and Dr. Konrad Dębski - and not only to them. Jan Topiński, a philosopher by education and computer scientist by profession, also joined the project. The team worked after hours, alongside their daily duties, because grant agencies did not want to fund the work. The research was also supported by two private entities, in4mates and Fork Systems.

The research took over 3 years, but it paid off. In October, the results were published in the prestigious journal Scientific Reports (https://www.nature.com/articles/s41598-018-33433-8). And it's not easy to find a paper by an entirely Polish team in that journal.

Researchers collected data on all known proteins from two global databases. They divided the amino acid sequences into 5-letter fragments, called pentapeptides, and prepared a popularity ranking of these conglomerates.
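
As a rough illustration of such a ranking, the sketch below counts every overlapping 5-letter window in a few made-up sequences and lists the most frequent ones. The toy sequences, the function names and the choice of overlapping windows are assumptions made for this example, not details taken from the study.

```python
from collections import Counter

def pentapeptides(sequence, k=5):
    """Yield every overlapping k-letter window of a protein sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def rank_fragments(sequences, k=5):
    """Count k-letter fragments across all sequences, most common first."""
    counts = Counter()
    for seq in sequences:
        counts.update(pentapeptides(seq, k))
    return counts.most_common()

# Made-up toy sequences in one-letter amino acid code (not real data).
toy_sequences = ["MKTAYIAKQR", "AYIAKQLLMK", "GGAYIAKAAA"]
ranking = rank_fragments(toy_sequences)
print(ranking[:3])  # the three most frequent 5-letter fragments
```

On real data the same counting would run over more than 100 million sequences, so a practical version would likely stream the databases rather than hold them in memory.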

"If we treat proteins like sentences, we wanted to study fragments of protein sentences - the words they consist of. We wanted to check which words nature used most often" - says Jan Topiński.

There are 3.2 million possible 5-letter combinations of 20 amino acids. Fragments of two, three and four amino acids were also tested, but the combinations consisting of five amino acids proved long enough to reveal interesting relationships. At the same time, they were short enough to be easy to analyse.
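
The figure of 3.2 million follows from simple counting: each of the five positions can hold any of the 20 amino acids, giving 20 to the power of 5. A minimal sketch of that arithmetic, using the standard one-letter amino acid codes (the variable names are chosen here for illustration):

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Number of possible fragments for each length from 2 to 5.
for k in range(2, 6):
    print(k, len(AMINO_ACIDS) ** k)

total_pentapeptides = len(AMINO_ACIDS) ** 5  # 20 ** 5
```

The count grows twentyfold with every added letter, which is one way to see why longer fragments quickly become harder to analyse.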

Unfortunately, Dr. Grynberg's hopes did not come true: it turned out that there were no fragments "forbidden by nature". Every 5-letter conglomerate of amino acids is used by nature. The authors of previous studies reporting "forbidden peptides" were wrong, because the databases available to them contained much less biodiversity data than those available today.
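
Checking for "forbidden" fragments amounts to listing every possible combination and removing those seen in the data. The sketch below does this for 2-letter fragments on made-up sequences to keep it small; the function names and toy data are assumptions for illustration, and the study's own procedure may have differed.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes

def observed_fragments(sequences, k):
    """Collect every overlapping k-letter fragment seen in the sequences."""
    return {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}

def missing_fragments(sequences, k):
    """List all possible k-letter fragments that never occur in the data."""
    seen = observed_fragments(sequences, k)
    every_fragment = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [frag for frag in every_fragment if frag not in seen]

# Made-up toy sequences (not real data); k=2 keeps the example fast.
toy_sequences = ["MKTAYIAKQR", "AYIAKQLLMK"]
missing = missing_fragments(toy_sequences, k=2)
print(len(missing), "of", len(AMINO_ACIDS) ** 2, "possible fragments are missing")
```

With so little data almost everything looks "forbidden", which mirrors the article's point: the smaller the database, the more fragments appear to be absent.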

The research also had other interesting results. For example, researchers prepared a statistical dictionary of nature's favourite 5-letter peptide clusters. "It is certainly not the case that all peptides are equally popular; some fragments are used in proteins hundreds of times more often than others" - another project participant, Dr. Krzysztof Pawłowski from the Warsaw University of Life Sciences, told PAP.


The scientist explains that the statistics he prepared with his colleagues reflect the evolution of proteins. The scientists' hypothesis is that contemporary proteins evolved from short peptides. In the early stages of life, peptide sequences were very short, and yet they worked. The most popular short protein words are therefore great candidates for ancient proteins.

"Astronomers, who analyse radiation coming from the Universe, can look into the past and see the traces of the Big Bang. By analysing the amino acid sequence in proteins, we are able to see traces of ancient proteins" - Dr. Pawłowski compares.

For the analysis of the "protein" language, researchers used numerical methods previously used for language analysis. "If we divided the text of Pan Tadeusz into 5-character strings, maybe at first glance it would not make sense, but despite appearances, we would be able to notice something important" - says Jan Topiński. He points out that one of the characters is Father Robak. The five-letter string "ROBAK" would therefore probably appear in Pan Tadeusz more often than other strings consisting of the same letters, for example "KORBA". "That would give us food for thought. Why exactly is this fragment more needed than others?" - explains Jan Topiński.
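
The "ROBAK" versus "KORBA" comparison can be sketched in a few lines. The example below uses an invented English sentence rather than the text of Pan Tadeusz, and it strips spaces and punctuation before cutting the text into overlapping 5-letter strings; both choices, like the function names, are assumptions made for illustration.

```python
from collections import Counter

def five_letter_counts(text, k=5):
    """Count overlapping k-letter strings, ignoring case, spaces and punctuation."""
    letters = "".join(ch for ch in text.upper() if ch.isalpha())
    return Counter(letters[i:i + k] for i in range(len(letters) - k + 1))

def anagram_counts(counts, word):
    """Return the counts of all observed strings built from the same letters as word."""
    key = sorted(word.upper())
    return {frag: n for frag, n in counts.items() if sorted(frag) == key}

# Invented toy text (not a quotation from Pan Tadeusz).
toy_text = ("Robak came in. Robak sat down. "
            "Someone turned the korba once, and Robak left.")
counts = five_letter_counts(toy_text)
robak_family = anagram_counts(counts, "ROBAK")
print(robak_family)
```

Comparing a string only against rearrangements of its own letters removes the effect of some letters simply being more common than others, so any remaining difference in frequency points to the order of the letters.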

Even if the 5-letter conglomerates are not always the complete words used by nature, they can point scientists to important areas for research - for example, which proteins are worth a closer look, if evolution likes them so much. "Now we want to understand what these fragments do so well that nature chooses them more often" - smiles Dr. Grynberg.

Researchers also wanted to see whether evolution worked the same way throughout whole proteins. They compared which sequences were popular in the so-called domain fragments of proteins and which in the so-called non-domain fragments, and found that the two sets differ. Evolution has thus chosen different paths in different parts of proteins. This conclusion, too, came as a surprise to the scientists.

"If all pentapeptides were found in nature equally often, the world would be boring" - concludes Dr. Krzysztof Pawłowski from WULS-SGGW. He hopes that thanks to the publication in "Scientific Reports" it will be easier to convince grant committees to finance further research on the language of proteins. Because there is no shortage of questions that arose during this project.

PAP - Science in Poland, Ludwika Tomala

tr. RL

Copyright © Foundation PAP 2018