Shower      11/20/2023

Life deciphered. Jumping genes Comparative and functional genomics of plants

Jumping genes

In the middle of the last century, the American researcher Barbara McClintock discovered remarkable genes in corn that can change their position on the chromosomes on their own. Now they are called “jumping genes” or transposable (mobile) elements. The discovery went unrecognized for a long time, because mobile elements were considered a unique phenomenon characteristic only of corn. Nevertheless, it was for this discovery that McClintock was awarded the Nobel Prize in 1983, and today jumping genes have been found in almost all studied species of animals and plants.

Where did jumping genes come from, what do they do in a cell, are they useful? Why, with genetically healthy parents, can a Drosophila fruit fly family, due to jumping genes, produce mutant offspring with high frequency or even be childless? What is the role of jumping genes in evolution?

It must be said that the genes that ensure the functioning of cells are located on chromosomes in a certain order. Thanks to this, it was possible to construct so-called genetic maps for many species of unicellular and multicellular organisms. However, there is an order of magnitude more genetic material between genes than within them! What role this “ballast” part of DNA plays has not been fully established, but it is here that mobile elements are most often found, which not only move themselves, but can also take neighboring DNA fragments with them.

Where do jumping genes come from? It is assumed that at least some of them originate from viruses, since some mobile elements are capable of forming viral particles (for example, the mobile element gypsy in the fruit fly Drosophila melanogaster). Some mobile elements appear in the genome through so-called horizontal transfer from other species. For example, it has been established that the mobile hobo element (the name means “tramp”) has been repeatedly reintroduced into the genome of Drosophila melanogaster. There is a version that some regulatory sections of DNA may also have autonomy and a tendency to “vagrancy”.

Useful ballast

On the other hand, most of the jumping genes, despite the name, behave quietly, although they make up a fifth of the total genetic material of Drosophila melanogaster and almost half of the human genome.

The redundancy of DNA, which was mentioned above, has its advantage: ballast DNA (including passive mobile elements) takes the hit if foreign DNA is introduced into the genome. The likelihood that a new element will be integrated into a useful gene and thereby disrupt its function is reduced if there is much more ballast DNA than significant DNA.

Some redundancy of DNA is useful in the same way as the “redundancy” of letters in words: we write “Maria Ivanovna”, but say “Marivan”. Some letters are inevitably lost, but the meaning remains. The same principle works at the level of significance of individual amino acids in a protein-enzyme molecule: only the sequence of amino acids that forms the active center is strictly conserved. Thus, at different levels, redundancy turns out to be a kind of buffer that provides a reserve of system strength. This is how mobile elements that have lost mobility turn out to be not useless for the genome. As they say, “from a thin sheep at least a tuft of wool,” although perhaps another proverb would be better suited here - “every bast in a line.”

Mobile elements that have retained the ability to jump move along Drosophila chromosomes with a frequency of 10⁻² to 10⁻⁵ per gene per generation, depending on the type of element, genetic background and external conditions. At the upper end of this range, one out of a hundred jumping genes can change its position in the next generation. As a result, after several generations, the distribution of mobile elements along the chromosome can change very significantly.

It is convenient to study this distribution on polytene (multi-stranded) chromosomes from the salivary glands of Drosophila larvae. These chromosomes are many times thicker than usual, which greatly simplifies their examination under a microscope. How are such chromosomes obtained? In the cells of the salivary glands, the DNA of each chromosome is duplicated, as during normal cell division, but the cell itself does not divide. As a result, the number of cells in the gland does not change, but over 10-11 cycles of doubling, one to two thousand identical DNA strands (2¹⁰ to 2¹¹) accumulate in each chromosome.

It is partly due to polytene chromosomes that jumping genes in Drosophila are better studied than in other multicellular organisms. As a result of these studies, it turned out that even within the same Drosophila population it is difficult to find two individuals that have chromosomes with the same distribution of transposable elements. It is no coincidence that it is believed that most of the spontaneous mutations in Drosophila are caused by the movement of these “jumpers”.

The consequences may vary...

Based on their effect on the genome, active mobile elements can be divided into several groups. Some of them perform functions that are extremely important and useful for the genome. For example, the telomeric DNA located at the ends of chromosomes in Drosophila consists of special mobile elements. This DNA is extremely important: its loss entails the loss of the entire chromosome during cell division, which leads to cell death.

Other mobile elements are outright “pests”. At least that is what they are considered to be at the moment. For example, mobile elements of the R2 class insert specifically into arthropod genes encoding one of the ribosomal RNAs, components of the ribosomes, the cellular “factories” for protein synthesis. Individuals with such disruptions survive only because just a portion of the many copies of these genes in the genome is damaged.

There are also mobile elements that move only in reproductive tissues that produce germ cells. This is explained by the fact that in different tissues the same mobile element can produce enzyme protein molecules required for movement that differ in length and function.

An example of the latter is the P-element of Drosophila melanogaster, which entered natural populations of this species through horizontal transfer from another species of Drosophila no more than a hundred years ago. Yet there is now hardly a population of Drosophila melanogaster on Earth in which the P-element is not found. It should be noted that most of its copies are defective; moreover, the same version of the defect is found almost everywhere. The role of this defective copy in the genome is peculiar: it is “intolerant” of its fellows and acts as a repressor, blocking their movement. So the protection of the Drosophila genome from the jumps of the “stranger” may be partially carried out by the stranger's own derivatives.

The main thing is to choose the right parents!

Most of the jumps of mobile elements do not affect the appearance of the Drosophila, because they occur on ballast DNA, but there are other situations when their activity increases sharply.

Surprisingly, the most powerful factor inducing the movement of jumping genes is a poor choice of parents. For example, what happens if you cross females from a laboratory population of Drosophila melanogaster that lack the P-element (because their ancestors were caught in nature about a hundred years ago) with males carrying the P-element? In the hybrids, because of the rapid movement of the mobile element, a large number of different genetic disorders can appear. This phenomenon, called hybrid dysgenesis, arises because the maternal cytoplasm contains no repressor to block the movement of the transposable element.

Thus, if grooms from population A and brides from population B can create large families, then the opposite is not always true. A family of genetically healthy parents can produce a large number of mutant or infertile offspring, or even be childless if the father and mother have a different set of mobile elements in their genome. Especially many violations appear if the experiment is carried out at a temperature of 29° C. The influence of external factors, superimposed on the genetic background, enhances the effect of genome mismatch, although these factors themselves (even ionizing radiation) alone are not capable of causing such massive movements of mobile elements.

Similar events in Drosophila melanogaster may occur with the participation of other families of mobile elements.

"Mobile" evolution

The cellular genome can be considered as a kind of ecosystem of permanent and temporary members, where neighbors not only coexist, but also interact with each other. The interaction of host genes with mobile elements is still poorly understood, but many results can be given - from the death of the organism in the event of damage to an important gene to the restoration of previously damaged functions.

It happens that the jumping genes themselves interact with each other. Thus, a phenomenon resembling immunity is known, when a mobile element cannot penetrate in close proximity to an already existing one. However, not all mobile elements are so delicate: for example, P-elements can easily penetrate each other and take their fellow players out of the game.

In addition, there is a kind of self-regulation in the number of mobile elements in the genome. The fact is that mobile elements can exchange homologous regions with each other; this process is called recombination. As a result of such interaction, mobile elements may, depending on their orientation, lose (deletion) or flip (inversion) the fragment of host DNA located between them. If a significant piece of a chromosome is lost, the genome does not survive. In the case of an inversion or small deletion, chromosome diversity is created, which is considered a necessary condition for evolution.
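The dependence on orientation can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the chromosome is a list of labeled segments, each with a strand sign, and the function name `recombine` and the segment labels are invented for this example. Two copies of an element in the same orientation yield a deletion of everything between them (along with one copy), while copies in opposite orientation yield an inversion of the intervening segment.

```python
def recombine(chromosome, i, j):
    """Recombine two element copies at indices i < j of a toy chromosome.

    Each item is (name, strand), with strand '+' or '-'.
    Same orientation: the DNA between the copies, and one copy, is lost.
    Opposite orientation: the DNA between the copies is flipped.
    """
    left, right = chromosome[i], chromosome[j]
    if left[1] == right[1]:
        # Direct repeats: deletion, a single element copy remains.
        return chromosome[:i + 1] + chromosome[j + 1:]
    # Inverted repeats: reverse the order and the strand of the inner segments.
    flip = {"+": "-", "-": "+"}
    inner = [(name, flip[s]) for name, s in reversed(chromosome[i + 1:j])]
    return chromosome[:i + 1] + inner + chromosome[j:]


direct = [("a", "+"), ("TE", "+"), ("b", "+"), ("c", "+"), ("TE", "+"), ("d", "+")]
inverted = [("a", "+"), ("TE", "+"), ("b", "+"), ("c", "+"), ("TE", "-"), ("d", "+")]

deleted = recombine(direct, 1, 4)     # segments b and c are lost
flipped = recombine(inverted, 1, 4)   # segments b and c are reversed
print(deleted)
print(flipped)
```

In this toy model the deletion shortens the chromosome, while the inversion keeps all the material and only changes its order, which matches the different fates described above.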

If recombinations occur between mobile elements located on different chromosomes, the result is the formation of chromosomal rearrangements, which during subsequent cell divisions can lead to genome imbalance. And an unbalanced genome, like an unbalanced budget, is divided very poorly. So the death of unsuccessful genomes is one of the reasons why active mobile elements do not fill chromosomes indefinitely.

A natural question arises: how significant is the contribution of mobile elements to evolution? Firstly, most of the mobile elements insert, roughly speaking, wherever they happen to land, as a result of which they can damage or change the structure or regulation of the gene into which they are inserted. Then natural selection rejects unsuccessful options, and successful options with adaptive properties are fixed.

If the consequences of the introduction of a mobile element turn out to be neutral, then this variant can persist in the population, providing some diversity in the gene structure. This can come in handy in unfavorable conditions. Theoretically, with the massive movement of mobile elements, mutations can appear in many genes simultaneously, which can be very useful in case of a sharp change in living conditions.

So, to summarize: there are many mobile elements in the genome and they are different; they can interact both with each other and with host genes; they can do harm and they can be irreplaceable. Genome instability caused by the movement of mobile elements can end in tragedy for the individual, but the ability to change quickly is a necessary condition for the survival of a population or species. Thanks to this, diversity is created, which is the basis for natural selection and subsequent evolutionary transformations.

An analogy can be drawn between jumping genes and immigrants: some immigrants or their descendants become equal citizens, others are given residence permits, and others - those who do not comply with the laws - are deported or imprisoned. And mass migrations of people can quickly change the state itself.

Literature

Ratner V. A., Vasilyeva L. A. Induction of transpositions of mobile genetic elements by stress influences. Russian binding. 2000.

Gvozdev V. A. Mobile DNA of eukaryotes // Soros educational journal. 1998. No. 8.

The publishing house “BINOM. Knowledge Laboratory” is releasing a book of memoirs by the geneticist Craig Venter, Life Deciphered. Craig Venter is known for his work on reading and deciphering the human genome. In 1992, he founded The Institute for Genomic Research (TIGR). In 2010, Venter created the world's first artificial organism, the synthetic bacterium Mycoplasma laboratorium. We invite you to read one of the chapters of the book, in which Craig Venter talks about the work of 1999–2000 to sequence the genome of the Drosophila fly.

Forward and only forward

The fundamental aspects of heredity turned out, to our surprise, to be quite simple, and therefore there was hope that perhaps nature is not so unknowable, and its incomprehensibility, repeatedly proclaimed by various people, is just another illusion, the fruit of our ignorance. This makes us optimistic, because if the world were as complex as some of our friends claim, biology would have no chance of becoming an exact science.

Thomas Hunt Morgan. The Physical Basis of Heredity

Many people have asked me why, of all the living creatures on our planet, I chose the fruit fly; others wondered why I didn’t immediately move on to deciphering the human genome. The point is that we needed a basis for future experiments, we wanted to be sure of the correctness of our method before spending almost $100 million on sequencing the human genome.

The little fruit fly has played a huge role in the development of biology, especially genetics. The genus Drosophila includes various flies (vinegar, wine, apple, grape and fruit flies), about 2,600 species in total. But say the word “drosophila” and any scientist will immediately think of one specific species, Drosophila melanogaster. Because it reproduces quickly and easily, this tiny fly serves as a model organism for developmental biologists. They use it to shed light on the miracle of creation, from the moment of fertilization to the emergence of an adult organism. Thanks to Drosophila, many discoveries have been made, including the discovery of homeobox-containing genes that regulate the general structure of all living organisms.

Every student of genetics is familiar with the experiments on Drosophila performed by Thomas Hunt Morgan, the father of American genetics. In 1910, he noticed male mutants with white eyes among the usual red-eyed flies. He crossed a white-eyed male with a red-eyed female and found that their offspring were red-eyed: white-eyedness turned out to be a recessive trait, and we now know that for a female fly to have white eyes, she needs two copies of the white-eye gene, one from each parent. Continuing to cross the flies, Morgan discovered that in the next generation only males exhibited the trait of white eyes, and concluded that this trait was linked to a sex chromosome (the X chromosome). Morgan and his students studied heritable traits in thousands of fruit flies. Today, experiments with Drosophila are carried out in molecular biology laboratories around the world, where more than five thousand people study this small insect.
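Morgan's reasoning can be retraced with a small enumeration of the crosses. This is a minimal sketch, assuming the eye-color gene sits on the X chromosome and the Y chromosome carries no copy of it; the allele labels `W` (dominant, red) and `w` (recessive, white) and the function names are chosen here purely for illustration.

```python
from itertools import product


def eye_color(genotype):
    # 'W' is the dominant red-eye allele on an X chromosome,
    # 'w' the recessive white-eye allele, 'Y' a Y chromosome with no eye-color gene.
    return "red" if "W" in genotype else "white"


def cross(mother, father):
    """Return (sex, eye color) for the four equally likely egg/sperm combinations."""
    offspring = []
    for egg, sperm in product(mother, father):
        sex = "male" if sperm == "Y" else "female"
        offspring.append((sex, eye_color(egg + sperm)))
    return offspring


# First cross: red-eyed female (W W) with white-eyed male (w Y).
f1 = cross(("W", "W"), ("w", "Y"))
# Second cross: an F1 daughter (W w) with an F1 son (W Y).
f2 = cross(("W", "w"), ("W", "Y"))
print(f1)
print(f2)
```

Under these assumptions every first-generation fly is red-eyed, and in the second generation the only white-eyed combination is a male, which is the pattern that points to a sex-linked gene.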

I learned firsthand the importance of Drosophila when I used its cDNA libraries to study adrenaline receptors and discovered their equivalent in the fly, octopamine receptors. This discovery pointed to the shared evolutionary heritage of the nervous systems of fly and human. Trying to make sense of cDNA libraries of the human brain, I found genes with similar functions by computer comparison of human genes with Drosophila genes.

The Drosophila genome sequencing project began in 1991, when Jerry Rubin of the University of California, Berkeley, and Allen Spradling of the Carnegie Institution decided it was time to take on the task. By May 1998, 25% of the sequencing had already been completed, and I made a proposal that Rubin said was “too good to pass up.” My idea was quite risky: thousands of fruit fly researchers from different countries would closely examine every letter of the code we obtained, comparing it with Jerry's own high-quality reference data, and then draw a conclusion about the suitability of my method.

The original plan was to complete sequencing of the fly genome within six months by April 1999, and then begin the attack on the human genome. It seemed to me that this was the most effective and clear way to demonstrate that our new method works. And if we don’t succeed, I thought, then it would be better to quickly verify this using the example of Drosophila than by working on the human genome. But in truth, complete failure would be the most spectacular failure in the history of biology. Jerry was also putting his reputation on the line, so everyone at Celera was determined to support him. I asked Mark Adams to lead our part of the project, and since Jerry also had a top-notch team at Berkeley, our collaboration went swimmingly.

First of all, the question arose about the purity of the DNA that we had to sequence. Like people, flies vary at the genetic level. If there is more than 2% genetic variation in a population, and we have 50 different individuals in the selected group, then decoding turns out to be very difficult. Jerry's first step was to inbreed the flies as much as possible to give us a uniform DNA variant. But inbreeding was not enough to ensure genetic purity: when extracting fly DNA, there was a risk of contamination with genetic material from bacterial cells in the fly's food or in its intestines. To avoid these problems, Jerry preferred to extract DNA from fly embryos. But even from embryonic cells, we first had to isolate nuclei with the DNA we needed, so as not to contaminate it with extranuclear DNA of mitochondria - the “power plants” of the cell. As a result, we received a test tube with a cloudy solution of pure Drosophila DNA.

In the summer of 1998, Ham's team, having such pure fly DNA, began creating libraries of its fragments. Ham himself most liked to cut DNA and overlap the resulting fragments, turning down his hearing aid so that no extraneous sounds would distract him from his work. The creation of libraries was supposed to be the beginning of large-scale sequencing, but so far only the sounds of drills, hammers and saws were heard everywhere. A whole army of builders was constantly underfoot, and we continued to solve the most important problems, troubleshooting the sequencers, robots and other equipment, trying to create a real sequencing “factory” from scratch in a matter of months rather than years.

The first Model 3700 DNA sequencer was delivered to Celera on December 8, 1998, to great excitement and a collective sigh of relief. The device was removed from its wooden crate, placed in a windowless room in the basement, its temporary home, and testing began immediately. Once it started working, we got very high-quality results. But these early sequencers were quite unstable, and some were faulty from the start. There were also constant problems with their operation, sometimes almost daily. For example, a serious error appeared in the program controlling the robotic manipulator: sometimes the robot’s mechanical arm shot out over the device at high speed and crashed into the wall. As a result, the sequencer stopped, and a repair team had to be called in to fix it. Some sequencers failed because of stray laser beams. To protect against overheating, foil and tape were used, since at high temperatures the yellow-labeled Gs were lost from the sequences.

Although the devices were now being supplied regularly, about 90% of them were faulty from the start. On some days the sequencers did not work at all. I firmly believed in Mike Hunkapiller, but my faith was greatly shaken when he began to blame our failures on our employees, construction dust, the slightest temperature fluctuations, the phases of the moon, and so on. Some of us even turned gray from stress.

The dead 3700s were sitting in the cafeteria waiting to be sent back to ABI, and eventually it got to the point where we had to eat lunch practically in the sequencers' morgue. I was in despair: after all, I needed a certain number of working devices every day, namely 230! For about $70 million, ABI had promised to provide us with either 230 perfectly functional devices that would work all day without interruption, or 460 that would work for at least half a day. In addition, Mike was supposed to double the number of qualified technical personnel so that sequencers could be repaired immediately after a failure.

But what incentive was there to do all this for the same money? In addition, Mike now had another client, the government genome project, whose leaders had already begun purchasing hundreds of devices without any testing. The future of Celera depended on these sequencers, but Mike apparently did not realize that the future of ABI also depended on them. Conflict was inevitable, as became evident at an important meeting between ABI engineers and my team at Celera.

After we reported on the huge number of defective instruments and how long it took to fix sequencer failures, Mike again tried to place all the blame on my employees, but even his own engineers disagreed with him. Tony White eventually intervened. “I don’t care how much it costs or who needs to be killed for it,” he said. Then for the first and last time he really took my side. He ordered Mike to ensure delivery of the new sequencers as quickly as possible, even at the expense of other customers and even if it was not yet known how much it would cost.

Tony also ordered Mike to hire twenty more technicians to quickly repair and determine the cause of all problems. In reality, this was easier said than done because experienced workers were in short supply. To begin with, Eric Lander poached two of the most qualified engineers, and in Mike's opinion, we were also to blame. Turning to Mark Adams, Mike said, “You should have hired them before someone else did.” After such a statement, I completely lost all respect for him. After all, according to our agreement, I could not hire ABI employees, while Lander and other leaders of the government genome project had the right to do so, so very soon the best ABI engineers began working for our competitors. By the end of the meeting, I realized that the problems remained, but a ray of hope for improvement had dawned.

And so it happened, although not immediately. Our arsenal of sequencers increased from 230 to 300 devices, and if 20-25% of them failed, we still had about 200 working sequencers and somehow coped with the tasks. The technical staff worked heroically and steadily increased the pace of repair work, reducing downtime. All this time I thought about one thing: what we are doing is doable. Failures occurred for a thousand reasons, but failure was not part of my plans.

We began sequencing the Drosophila genome in earnest on April 8, around the time we should have completed this work. I, of course, understood that White wanted to get rid of me, but I did everything in my power to complete the main task. Tension and anxiety haunted me at home, but I could not discuss these problems with my “confidant”. Claire showed her disdain when she saw how preoccupied I was with Celera's affairs. She felt like I was repeating the same mistakes I made while working at TIGR/HGS. By July 1st I was feeling deeply depressed, just as I had in Vietnam.

Since the conveyor method had not yet worked for us, we had to do hard, exhausting work: to “glue” the genome fragments back together. To detect matches without being distracted by repeats, Gene Myers proposed an algorithm based on the key principle of my version of the shotgun method: sequence both ends of all resulting clones. Because Ham was getting clones of three precisely known sizes, we knew that the two end sequences were at a strictly defined distance from each other. As before, this method of “matching” would give us an excellent opportunity to reassemble the genome.

But since each end of a clone was sequenced separately, for this assembly method to work accurately it was necessary to keep careful records, to be absolutely sure that we could connect every pair of end sequences correctly: after all, if even one attempt in a hundred leads to an error and a sequence is left without its matching mate, everything will go down the drain and the method will not work. One way to avoid this is to use barcodes and sensors to track each step of the process. But at the beginning of the work, laboratory technicians did not have the necessary software and equipment, so they had to do everything manually. At Celera, a small team of fewer than twenty people processed a record 200,000 clones every day. We could anticipate some errors, such as misreading data from a 384-well plate, and then use the computer to find the clearly erroneous operation and correct the situation. Of course, there were still some shortcomings, but this only confirmed the team’s skill and our confidence that we could eliminate errors.

Despite all the difficulties, we were able to read 3.156 million sequences in four months, a total of about 1.76 billion nucleotide pairs contained between the ends of 1.51 million DNA clones. Now it was the turn of Gene Myers, his team and our computer: it was necessary to put all the sections together into Drosophila chromosomes. The longer the sections became, the less accurate the sequencing became. In the case of Drosophila, the sequences averaged 551 base pairs and the average accuracy was 99.5%. Given 500-letter sequences, almost anyone can locate the matches by sliding one sequence along another until a match is found.

To sequence Haemophilus influenzae, we had 26 thousand sequences. Comparing each of them with all the others would require 26 thousand squared, or 676 million, comparisons. The Drosophila genome, with its 3.156 million reads, would require about 9.9 trillion comparisons. In the case of human and mouse, where we produced 26 million sequence reads, about 680 trillion comparisons were required. It is therefore not surprising that most scientists were very skeptical about the possible success of this method.
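The arithmetic behind these figures is simply the number of reads squared. The short sketch below reproduces it; the function name `all_vs_all` is invented for this illustration, and the read counts are the ones quoted in the text.

```python
def all_vs_all(n_reads):
    # Counting scheme used in the text: every read against every read, n * n.
    return n_reads * n_reads


projects = [
    ("Haemophilus influenzae", 26_000),
    ("Drosophila", 3_156_000),
    ("human and mouse", 26_000_000),
]
for name, n in projects:
    print(f"{name}: {all_vs_all(n):,} comparisons")
```

Counting each unordered pair only once, n(n - 1)/2, would roughly halve these numbers, but the quadratic growth that worried the skeptics stays the same.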

Although Myers promised to fix everything, he constantly had doubts. Now he worked days and nights, looked exhausted and somehow gray. In addition, he had problems in his family, and he began to spend most of his free time with the journalist James Shreeve, who wrote about our project and, like a shadow, followed the progress of the research. Trying to somehow distract Gene, I took him with me to the Caribbean to relax and sail on my yacht. But even there he sat for hours, hunched over his laptop, knitting his black eyebrows and squinting his black eyes against the bright sun. And, despite incredible difficulties, Gene and his team were able to generate more than half a million lines of computer code for the new assembler in six months.

If sequencing results were 100% accurate, with no duplicate DNA, genome assembly would be a relatively simple task. But in reality, genomes contain a large number of repeated DNA of different types, lengths and frequencies. Short repeats of less than five hundred base pairs are relatively easy to deal with; longer repeats are more difficult. To solve this problem, we used a “pair finding” method, that is, we sequenced both ends of each clone and obtained clones of different lengths to ensure the maximum number of matches.

The algorithms, encoded in the half-million lines of computer code written by Gene's team, suggested a step-by-step scenario, from the most “harmless” actions, such as simply overlapping two sequences, to more complex ones, such as using detected pairs to merge islands of overlapping sequences. It was like putting together a puzzle, where small islands of assembled sections are put together to form larger islands, and then the whole process is repeated again. Only our puzzle had 27 million pieces. And it was very important that the sections came from high-quality sequence: imagine what would happen if you assemble a puzzle, and the colors or images of its elements are fuzzy and blurry. For long-range order of the genome sequence, a significant proportion of the reads must be in the form of matching pairs. Given that the results were still being tracked manually, we were relieved to find that 70% of the sequences we had were exactly like this. Computer modelers explained that with a lower percentage it would have been impossible to assemble our “Humpty Dumpty.”

And now we were able to use the Celera assembler to assemble the sequence. In the first step, the results were adjusted to achieve the highest accuracy; in the second step, the Screener program removed contaminating sequences from the plasmid or E. coli DNA. The assembly process can be disrupted by as little as 10 base pairs of “foreign” sequence. In the third step, the Screener program checked each fragment against the known repeat sequences of the fruit fly genome, data that Jerry Rubin “kindly” provided to us. The locations of repeats with partially overlapping regions were recorded. In the fourth step, another program (Overlapper) found the overlapping regions by comparing each fragment with all the others, a colossal exercise in processing a huge amount of numerical data. We compared 32 million fragments every second, aiming to find overlaps of at least 40 base pairs with less than 6% differences. When we discovered two overlapping fragments, we combined them into a larger fragment, the so-called “contig”, a set of overlapping fragments.
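The overlap criterion mentioned here (at least 40 base pairs with less than 6% differences) can be illustrated with a naive sliding comparison. This is only a toy sketch and not the Overlapper program: it counts substitutions only, ignores insertions and deletions, and the function name `best_overlap` and the test sequences are invented for the example.

```python
def best_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of a that matches a prefix of b
    with a mismatch fraction below max_diff, or 0 if there is none."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches / length < max_diff:
            return length
    return 0


shared = ("ACGT" * 13)[:50]          # 50 bases common to both reads
read_a = "C" * 60 + shared           # the shared region is at the end of read_a
read_b = shared + "G" * 60           # and at the start of read_b

# Same read_b with two substitutions in the shared region (4% differences).
noisy = list(read_b)
noisy[10], noisy[30] = "T", "A"
read_b_noisy = "".join(noisy)

print(best_overlap(read_a, read_b))
print(best_overlap(read_a, read_b_noisy))
print(best_overlap("A" * 100, "C" * 100))
```

Each call slides one read along the other, much as described above for 500-letter sequences. Doing this for every pair of millions of reads is what makes the step so expensive; a production assembler would rely on indexing rather than this brute-force scan.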

Ideally, this would be enough to assemble the genome. But we had to contend with stutters and repeats in the DNA code, which meant that one piece of DNA could overlap with several different regions, creating spurious connections. To simplify the task, we kept only uniquely connected fragments, the so-called “unitigs”. The program we used to perform this operation (Unitigger) essentially removed all the DNA sequence that we could not place with certainty, leaving only these unitigs. This step not only gave us the opportunity to consider other options for assembling the fragments, but also significantly simplified the task. After this reduction, the number of overlapping fragments fell from 212 million to 3.1 million, and the problem became 68 times simpler. The pieces of the puzzle gradually but steadily fell into place.

And then we could use the information about how the sequences from the two ends of the same clone were paired, applying a “scaffold” algorithm. All possible unitigs linked by such paired end sequences were combined into special frameworks, or scaffolds. To describe this stage in my lectures, I draw an analogy with the children's construction set Tinkertoys. It consists of sticks of different lengths, which can be inserted into holes in wooden key parts (balls and disks) to create a three-dimensional structure. In our case, the key parts are the unitigs. Knowing that paired sequences sit at the ends of clones 2 thousand, 10 thousand or 50 thousand base pairs long, that is, a known number of “holes” apart, we can line the unitigs up.
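The Tinkertoy idea can be made concrete with a little arithmetic: if one end of a clone lies in one assembled piece and the other end in a second piece, the known clone length fixes the distance between the two pieces. The sketch below is a simplified illustration and not the Celera scaffolding code; the function names and the example numbers are hypothetical.

```python
def implied_gap(clone_len, dist_to_end_a, dist_from_start_b):
    """Gap between piece A and piece B implied by one pair of end sequences.

    clone_len:          known clone length (e.g. 2,000 or 10,000 base pairs)
    dist_to_end_a:      bases from the start of the first read to the end of piece A
    dist_from_start_b:  bases from the start of piece B to the far end of the second read
    """
    return clone_len - dist_to_end_a - dist_from_start_b


def estimate_gap(links):
    """Average the gap implied by several independent pairs linking A and B."""
    gaps = [implied_gap(*link) for link in links]
    return sum(gaps) / len(gaps)


# Three hypothetical clones spanning the same two pieces.
links = [(2_000, 700, 900), (10_000, 4_800, 4_750), (2_000, 500, 1_080)]
print([implied_gap(*link) for link in links])
print(round(estimate_gap(links), 1))
```

When several clones of different lengths agree on roughly the same gap, the two pieces can be placed next to each other with some confidence, even though the sequence inside the gap is still unknown.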

Testing this technique on Jerry Rubin's sequence, which was about one-fifth of the fruit fly genome, resulted in only 500 gaps. Testing on our own data in August, we ended up with more than 800,000 small fragments. With a significantly larger amount of data to process, the technique worked poorly: the result was the opposite of what was expected. Over the next few days, panic grew and the list of possible mistakes lengthened. From the top floor of Building No. 2, an adrenaline rush seeped into the room jokingly called the “Tranquil Chambers.” However, there was no sense of peace or serenity there, especially during the couple of weeks when employees literally wandered around in circles, looking for a way out of the situation.

The problem was eventually solved by Arthur Delcher, who worked on the Overlapper program. He noticed something strange on line 678 of the 150,000 lines of code, where a minor discrepancy meant that an important part of the match was not recorded. The error was corrected, and on September 7 we had 134 scaffolds covering the working (euchromatic) fruit fly genome. We were delighted and breathed a sigh of relief. The time had come to announce our success to the whole world.

The Genome Sequencing Conference, which I had started hosting several years earlier, provided an excellent opportunity for this. I was sure that many people would be eager to find out whether we had kept our promise. I decided that Mark Adams, Gene Myers and Jerry Rubin should talk about our achievements, and above all about the sequencing process, the genome assembly and what it meant for science. Because so many people wanted to attend, I had to move the conference from Hilton Head to the larger Fontainebleau Hotel in Miami. It drew representatives of large pharmaceutical and biotech companies, genomics specialists from around the world, and a good many columnists, reporters and representatives of investment companies - everyone was there. Our competitors from Incyte spent a lot of money on a post-conference reception, corporate video filming and so on - they did everything to convince the public that they were offering “the most detailed information about the human genome.”

We gathered in a large conference room. Done in neutral colors and fitted with wall lamps, it was designed for two thousand people, but people kept coming, and soon the hall was filled to capacity. The conference opened on September 17, 1999, with presentations from Jerry, Mark, and Gene at the first session. After a short introduction, Jerry Rubin announced that the audience was about to hear about the best joint project of famous companies in which he had ever been involved. The atmosphere was heating up. The audience realized that he would not have spoken so grandly if we had not prepared something truly sensational.

In the ensuing silence, Mark Adams began to describe in detail the work of our “factory floor” at Celera and our new genome sequencing methods. However, he did not say a word about the assembled genome, as if teasing the audience. Then Gene came out and talked about the principles of the shotgun method, about sequencing Haemophilus, and about the main stages of the assembler. Using computer animation, he demonstrated the entire process of putting a genome back together. The time allotted for presentations was running out, and many had already decided that everything would be limited to an elementary PowerPoint presentation without any specific results. But then Gene noted with a mischievous smile that the audience would probably still want to see real results and would not be satisfied with an imitation.

It was impossible to present our results more clearly and expressively than Gene Myers did. He realized that the sequencing results alone would not make the right impression, so to make them more convincing he compared them with the results of Jerry's painstaking research using the traditional method. They turned out to be identical! Gene then compared the results of our genome assembly with all the known markers mapped to the fruit fly genome over previous decades. Out of thousands of markers, only six did not match our assembly. By carefully examining all six, we were convinced that Celera's sequencing was correct and that the errors lay in work done in other laboratories using old methods. Finally, Gene said that we had just started sequencing human DNA, and that the repeats would probably be less of a problem than with Drosophila.

Loud and prolonged applause followed. The roar that did not stop during the break meant that we had achieved our goal. One of the journalists noticed a participant in the government genome project shaking his head sadly: “It looks like these scoundrels are really going to do everything.” 1 We left the conference with a new charge of energy.

There were two important problems left to solve, both of which were familiar to us. The first was how to publish the results. Despite a memorandum of understanding we had signed with Jerry Rubin, our business team was not comfortable with the idea of transferring valuable Drosophila sequencing results to GenBank. They proposed placing the fruit fly sequencing results in a separate database at the National Center for Biotechnology Information, where everyone could use them under one condition - not for commercial purposes. The hot-tempered, chain-smoking Michael Ashburner of the European Bioinformatics Institute was extremely unhappy about this. He believed that Celera had “cheated everyone” 2. (He wrote to Rubin: “What the hell is going on at Celera?” 3) Collins was also unhappy, but more importantly, so was Jerry Rubin. In the end, I sent our results to GenBank anyway.

The second problem concerned Drosophila - we had the results of sequencing its genome, but we did not understand at all what they meant. We had to analyze them if we wanted to write a paper, just as we had done four years earlier with Haemophilus. Analyzing and characterizing the fly's genome could take more than a year - and I didn't have that time, because now I had to focus on the human genome. After discussing this with Jerry and Mark, we decided to involve the scientific community in the work on Drosophila, turning it into an exciting scientific challenge and so moving things forward quickly, making a festive event out of the tedious process of genome description - like an international scouting jamboree. We called it the Genomic Jamboree and invited leading scientists from around the world to come to Rockville for about a week or ten days to analyze the genome of the fly. Based on the results obtained, we planned to write a series of articles.

Everyone liked the idea. Jerry began sending out invitations to our event to groups of leading researchers, and Celera bioinformatics specialists decided what computers and programs would be needed to make the scientists' work as efficient as possible. We agreed that Celera would pay their travel and accommodation expenses. Among those invited were my harshest critics, but we hoped that their political ambitions would not affect the success of our venture.

In November, about 40 Drosophila specialists came to us, and even for our enemies the offer was too attractive to refuse. At first, when the participants realized that they had to analyze more than a hundred million base pairs of genetic code within a few days, the situation was quite tense. While the newly arrived scientists slept, my staff worked around the clock, developing programs to solve unforeseen problems. By the end of the third day, when it turned out that the new software tools allowed scientists, as one of our guests said, “to make amazing discoveries in a few hours that previously took almost a lifetime,” the situation calmed down. At midday each day, at the signal of a Chinese gong, everyone gathered to discuss the latest results, solve current problems and draw up a work plan for the next round.

Every day the discussions became more and more interesting. Thanks to Celera, our guests had the opportunity to be the first to look into a new world, and what was revealed exceeded expectations. It soon turned out that we did not have enough time to discuss everything we wanted and understand what it all meant. Mark threw a celebratory dinner, which didn't last very long as everyone quickly rushed back to the labs. Soon lunches and dinners were consumed right in front of computer screens with data about the Drosophila genome displayed on them. For the first time, long-awaited families of receptor genes were discovered, along with a surprising number of fruit fly genes similar to human disease genes. Each discovery was accompanied by joyful screams, whistles and friendly pats on the shoulder. Surprisingly, in the midst of our scientific feast, one couple found time to get engaged.

There was, however, some concern: during the work, scientists discovered only about 13 thousand genes instead of the expected 20 thousand. Since the “lowly” worm C. elegans has about 20 thousand genes, many believed that the fruit fly must have more of them, since it has 10 times more cells and even has a nervous system. There was one simple way to make sure there was no error in the calculations: take the 2,500 known genes of the fly and see how many of them we could find in our sequence. After careful analysis, Michael Cherry of Stanford University reported that he had found all but six genes. After discussion, these six genes were classified as artifacts. The fact that the genes were identified without errors inspired us and gave us confidence. A community of thousands of scientists dedicated to Drosophila research had spent decades tracking those 2,500 genes, and now as many as 13,600 were in front of them on the computer screen.

During the inevitable photo shoot at the end of the job, an unforgettable moment came: after the traditional pat on the shoulder and friendly handshakes, Mike Ashburner got down on all fours so that I could immortalize myself in the photo with my foot on his back. In this way he wanted - despite all his doubts and skepticism - to give our achievements their due. A famous geneticist and Drosophila researcher, he even came up with an appropriate caption for the photo: “Standing on the shoulders of a giant.” (He had a rather frail figure.) “Let us give credit to those who deserve it,” he later wrote 4. Our opponents tried to present the delays in transferring sequencing results to a public database as a departure from our promises, but they, too, were forced to admit that the meeting made “an extremely valuable contribution to global fruit fly research” 5. Having experienced what true “scientific nirvana” is, everyone parted as friends.

We decided to publish three large papers: one on whole-genome sequencing with Mark as first author, one on genome assembly with Gene as first author, and a third on comparative genomics against the worm, yeast and human genomes with Jerry as first author. The papers were submitted to Science in February 2000 and published in a special issue on March 24, 2000, less than a year after my conversation with Jerry Rubin in Cold Spring Harbor. 6 Before publication, Jerry arranged for me to speak at the annual Drosophila Research Conference in Pittsburgh, which was attended by hundreds of the most eminent people in the field. On every chair in the room, my staff placed a CD containing the entire Drosophila genome, as well as reprints of our papers published in Science. Jerry introduced me very warmly, assuring the crowd that I had fulfilled all my obligations and that we had worked well together. My talk ended with a report on some of the research done during the meeting and a brief commentary on the data on the CD. The applause after my speech was as surprising and pleasant as it had been five years earlier, when Ham and I first presented the Haemophilus genome at a microbiology convention. Subsequently, the papers on the Drosophila genome became among the most frequently cited papers in the history of science.

Although thousands of fruit fly researchers around the world were delighted with the results, my critics quickly went on the offensive. John Sulston called the attempt to sequence the fly's genome a failure, even though the sequence we obtained was more complete and more accurate than the result of his painstaking ten-year effort to sequence the genome of the worm, which took another four years to complete after the draft was published in Science. Sulston's colleague Maynard Olson called the Drosophila genome sequence a "disgrace" that the government's Human Genome Project would have to sort out, "by the grace" of Celera. In fact, Jerry Rubin's team was able to move quickly, closing the remaining gaps in the sequence and publishing a comparative analysis of the already sequenced genome in less than two years. These data confirmed that we had 1–2 errors per 10 kb in the entire genome and fewer than 1 error per 50 kb in the working (euchromatic) genome.

However, despite the general acclaim for the Drosophila project, tensions in my relationship with Tony White reached a fever pitch in the summer of 1999. White could not come to terms with the attention that the press paid to me. Every time he came to Celera, he passed copies of articles about our achievements hanging on the walls in the hallway next to my office. And now we had enlarged one of them - the cover of the Sunday supplement of the USA Today newspaper. On it, under the headline “Will this MAVERICK Unlock the Greatest Scientific Discovery of His Age?” 7, I was shown in a blue checkered shirt with my legs crossed, while Copernicus, Galileo, Newton and Einstein floated in the air around me - and there was no sign of White.

Every day, his press secretary called to see if Tony could take part in the seemingly endless stream of interviews taking place at Celera. He calmed down a little, and even then only briefly, when the following year she managed to get his photograph onto the cover of Forbes magazine as the man who had increased the capitalization of PerkinElmer from $1.5 billion to $24 billion 8. (“Tony White turned poor PerkinElmer into a high-tech gene catcher.”) Tony was also troubled by my public engagements.

I gave a talk about once a week, accepting a small fraction of the huge number of invitations that I constantly received because the world wanted to know about our work. Tony even complained to the board of directors of PerkinElmer, by then renamed the PE Corporation, that my trips and appearances violated corporate rules. While I was on a two-week vacation (at my own expense) at my home on Cape Cod, Tony flew to Celera with CFO Dennis Winger and Applera general counsel William Sauch to interview my top employees about “Venter's management effectiveness.” They hoped to gather enough dirt to justify my dismissal. White was shocked when everyone said that if I left, they would leave too. This caused a lot of tension within our team, but it also brought us closer together than ever. We were ready to celebrate every victory as if it were our last.

After the publication of the fly's genome sequence - by then the largest sequence in history - Gene, Ham, Mark and I drank a toast to having put up with Tony White long enough to have our success recognized. We had proven that our method would also work for sequencing the human genome. Even if Tony White stopped the funding the next day, we knew that our main achievement would remain with us. More than anything, I wanted to leave Celera and not have to deal with Tony White, but since I wanted to sequence the genome of Homo sapiens even more, I had to compromise. I tried as best I could to please White, just to continue the work and complete my plan.

Notes

1. Shreeve J. The Genome War: How Craig Venter Tried to Capture the Code of Life and Save the World (New York: Ballantine, 2005), p. 285.

2. Ashburner M. Won for All: How the Drosophila Genome Was Sequenced (Cold Spring Harbor Laboratory Press, 2006), p. 45.

3. Shreeve J. The Genome War, p. 300.

4. Ashburner M. Won for All, p. 55.

5. Sulston J., Ferry G. The Common Thread (London: Corgi, 2003), p. 232.

6. Adams M. D., Celniker S. E. et al. "The Genome Sequence of Drosophila Melanogaster", Science, no. 287, 2185–95, March 24, 2000.

7. Gillis J. “Will this MAVERICK Unlock the Greatest Scientific Discovery of His Age? Copernicus, Newton, Einstein and VENTER?”, USA Weekend, January 29–31, 1999.

8. Ross P. E. “Gene Machine”, Forbes, February 21, 2000.

Craig Venter


To the 50th anniversary of the discovery of the structure of DNA

A.V. Zelenin

PLANT GENOME

Alexander Vladimirovich Zelenin - Doctor of Biological Sciences,
Head of Laboratory, V.A. Engelhardt Institute of Molecular Biology, RAS.

The impressive achievements of the Human Genome program, as well as the success of work on deciphering the so-called ultra-small (viruses), small (bacteria, yeast) and medium-sized (roundworm, Drosophila) genomes, made it possible to move on to the large-scale study of large and extra-large plant genomes. The urgent need for a detailed study of the genomes of the most economically important plants was emphasized at a meeting on plant genomics held in 1997 in the USA. In the years since then, undoubted successes have been achieved in this area. In 2000, a publication appeared on the complete sequencing (establishment of the linear nucleotide sequence of all nuclear DNA) of the genome of the small mustard - Arabidopsis, and in 2001 on the preliminary (draft) sequencing of the rice genome. Work on sequencing large and ultra-large plant genomes (corn, rye, wheat) has been reported repeatedly, but these reports contained no specific information and were rather declarations of intent.

It is expected that deciphering plant genomes will open up broad prospects for science and practice. First of all, the identification of new genes and of the chain of their genetic regulation will significantly increase plant productivity through the use of biotechnological approaches. The discovery, isolation, reproduction (cloning) and sequencing of genes responsible for such important functions of the plant organism as reproduction and productivity, processes of variability, resistance to adverse environmental factors, and homologous pairing of chromosomes will open up new opportunities for improving the breeding process. Finally, isolated and cloned genes can be used to obtain transgenic plants with fundamentally new properties and to analyze the mechanisms that regulate gene activity.

The importance of studying plant genomes is also emphasized by the fact that so far the number of localized, cloned and sequenced plant genes is small and, according to various estimates, varies between 800 and 1200. This is 10-15 times less than, for example, in humans.

The United States remains the undoubted leader in the large-scale study of plant genomes, although intensive research on the rice genome is being carried out in Japan and, in recent years, in China. In addition to US laboratories, European research groups took an active part in deciphering the Arabidopsis genome. The evident leadership of the United States is causing serious concern among European scientists, which they clearly expressed at a meeting meaningfully titled “Prospects for Genomics in the Postgenomic Era,” held in France at the end of 2000. According to European scientists, the lead of American science in studying the genomes of agricultural plants and creating transgenic plant forms threatens that in the not too distant future (two to five decades from now), when population growth confronts humanity with a general food crisis, the European economy and science will become dependent on American technology. In this regard, the creation of a Franco-German scientific program for the study of plant genomes (Plantgene), with significant funds invested in it, was announced.

Obviously, the problems of plant genomics should attract the close attention of Russian scientists and science organizers, as well as governing bodies, since we are talking not only about scientific prestige, but also about the national security of the country. In one or two decades, food will become the most important strategic resource.

DIFFICULTIES IN STUDYING PLANT GENOMES

Studying plant genomes is a much more complex task than studying the genome of humans and other animals. This is due to the following circumstances:

Huge genome sizes, reaching tens and even hundreds of billions of nucleotide pairs (bp) for individual plant species: the genomes of the main economically important plants (except rice, flax and cotton) are either close in size to the human genome or exceed it many times (table);

Sharp fluctuations in the number of chromosomes in different plants - from two in some species to several hundred in others, and it is not possible to identify a strict correlation between the genome size and the number of chromosomes;

An abundance of polyploid (containing more than two genomes per cell) forms with similar but not identical genomes (allopolyploidy);

The extreme enrichment of plant genomes (up to 99%) with “insignificant” (non-coding, that is, not containing genes) DNA, which greatly complicates the joining (arrangement in the correct order) of sequenced fragments into a common large-sized DNA region (contig);

Incomplete (compared to the genomes of Drosophila, human and mouse) morphological, genetic and physical mapping of chromosomes;

The practical impossibility of isolating individual chromosomes in a pure form using methods usually used for this purpose for human and animal chromosomes (flow sorting and the use of cell hybrids);

The difficulty of chromosomal mapping (determining the location on the chromosome) of individual genes using in situ hybridization, due both to the high content of “insignificant” DNA in plant genomes and to the peculiarities of the structural organization of plant chromosomes;

The evolutionary distance of plants from animals, which seriously complicates the use of information obtained from sequencing the genome of humans and other animals to study plant genomes;

The long process of reproduction of most plants, which significantly slows down their genetic analysis.

CHROMOSOMAL GENOME STUDIES

Chromosomal (cytogenetic) studies of genomes in general and plants in particular have a long history. The term “genome” was proposed to denote a haploid (single) set of chromosomes with the genes they contain in the first quarter of the 20th century, that is, long before the role of DNA as a carrier of genetic information was established.

Description of the genome of a new, genetically unstudied multicellular organism usually begins with the study and description of the complete set of its chromosomes (karyotype). This, of course, also applies to plants, a huge number of which have not even begun to be studied.

Already at the dawn of chromosomal studies, genomes of related plant species were compared based on the analysis of meiotic conjugation (pairing of homologous chromosomes) in interspecific hybrids. Over the past 100 years, the capabilities of chromosomal analysis have expanded dramatically. Nowadays, more advanced technologies are used to characterize plant genomes: various forms of so-called differential staining, which makes it possible to identify individual chromosomes by their morphological characteristics; in situ hybridization, which makes it possible to localize specific genes on chromosomes; biochemical studies of cellular proteins (electrophoresis and immunochemistry); and, finally, a set of methods based on the analysis of chromosomal DNA, up to and including its sequencing.

Fig. 1. Karyotypes of cereals: a - rye (14 chromosomes), b - durum wheat (28 chromosomes), c - soft wheat (42 chromosomes), d - barley (14 chromosomes)

The karyotypes of cereals, primarily wheat and rye, have been studied for many years. It is interesting that the number of chromosomes differs between species of these plants but is always a multiple of seven. Individual cereal species can be reliably identified by their karyotype. For example, the rye genome consists of seven pairs of large chromosomes with intensely stained heterochromatic blocks at their ends, often called segments or bands (Fig. 1a). Wheat genomes have 14 and 21 pairs of chromosomes (Fig. 1, b, c), and the distribution of heterochromatic blocks in them differs from that in rye chromosomes. The individual genomes of wheat, designated A, B and D, also differ from each other. An increase in the number of chromosomes from 14 pairs to 21 pairs leads to a sharp change in the properties of wheat, which is reflected in the names: durum, or macaroni, wheat and soft, or bread, wheat. The D genome, which contains genes for gluten proteins, is responsible for the high baking quality of soft wheat, giving the dough its so-called rising ability. It is this genome that receives special attention in the breeding of bread wheat. Another 14-chromosome cereal, barley (Fig. 1, d), is not usually used to make bread, but it serves as the main raw material for such common products as beer and whiskey.

The chromosomes of some wild plants used to improve the quality of the most important agricultural species, for example the wild relatives of wheat - Aegilops, are being intensively studied. New plant forms are created through crossing (Fig. 2) and selection. In recent years, significant improvements in research methods have made it possible to begin studying the genomes of plants whose karyotype features (mainly small chromosome sizes) made them previously inaccessible for chromosomal analysis. Thus, only recently were all chromosomes of cotton, chamomile and flax identified for the first time.

Fig. 2. Karyotypes of wheat and wheat-Aegilops hybrid

a - hexaploid common wheat (Triticum aestivum), consisting of A, B and D genomes; b - tetraploid wheat (Triticum timopheevi), consisting of A and G genomes, which contains genes for resistance to most wheat diseases; c - hybrids Triticum aestivum x Triticum timopheevi, resistant to powdery mildew and rust; the replacement of part of the chromosomes is clearly visible

PRIMARY STRUCTURE OF DNA

As molecular genetics developed, the very concept of a genome expanded. Now this term is interpreted both in the classical chromosomal and in the modern molecular sense: the entire genetic material of an individual virus, cell and organism. Naturally, after studying the complete primary structure of genomes (as the complete linear sequence of nucleic acid bases is often called) of a number of microorganisms and humans, the question of sequencing plant genomes came up.

Of the many plant organisms, two were chosen for study - Arabidopsis, representing the class of dicotyledons (genome size 125 million bp), and rice from the class of monocotyledons (420-470 million bp). These genomes are small compared to other plant genomes and contain relatively few repeated sections of DNA. Such features gave hope that the selected genomes would be accessible for relatively rapid determination of their primary structure.

Fig. 3. Arabidopsis - small mustard - a small plant from the cruciferous family (Brassicaceae). In a space equal in area to one page of our magazine, up to a thousand individual Arabidopsis plants can be grown

The basis for choosing Arabidopsis was not only the small size of its genome, but also the small size of the organism, which makes it easy to grow in laboratory conditions (Fig. 3). Also taken into account were its short reproductive cycle, which makes it possible to carry out crossing and selection experiments quickly, its well-studied genetics, and the ease of manipulating growing conditions (changing the salt composition of the soil, adding different nutrients, etc.) and of testing the effect on the plants of various mutagenic factors and pathogens (viruses, bacteria, fungi). Arabidopsis has no economic value, therefore its genome, along with the mouse genome, was called a reference genome, or, less accurately, a model genome.*

* The appearance of the term “model genome” in Russian literature is the result of an inaccurate translation of the English phrase "model genome". The English word "model" can serve not only as an adjective but also as a noun meaning "sample", "standard" or "pattern". It would be more correct to talk about a sample genome, or a reference genome.

Intensive work on sequencing the Arabidopsis genome was begun in 1996 by an international consortium that included scientific institutions and research groups from the USA, Japan, Belgium, Italy, Great Britain and Germany. In December 2000, extensive information became available summarizing the determination of the primary structure of the Arabidopsis genome. Classical, or hierarchical, technology was used for sequencing: first, individual small sections of the genome were studied, then these were assembled into larger sections (contigs), and at the final stage into the structure of individual chromosomes. The nuclear DNA of the Arabidopsis genome is distributed among five chromosomes. In 1999, the results of sequencing two chromosomes were published, and the publication of information about the primary structure of the remaining three completed the sequencing of the entire genome.

Of 125 million nucleotide pairs, the primary structure of about 115 million has been determined, which is 92% of the entire genome. Only 8% of the Arabidopsis genome, containing large blocks of repeating DNA sections, turned out to be inaccessible for study. In terms of completeness and thoroughness of sequencing of eukaryotic genomes, Arabidopsis remains among the top three, along with the unicellular yeast Saccharomyces cerevisiae and the multicellular animal Caenorhabditis elegans (see table).

About 15 thousand individual genes encoding proteins were found in the Arabidopsis genome. Approximately 12 thousand of these are contained in two copies per haploid (single) genome, so the total number of genes is 27 thousand. The number of genes in Arabidopsis is not much different from the number in organisms such as humans and mice, but its genome is 25-30 times smaller. This circumstance is associated with important features in the structure of individual Arabidopsis genes and the overall structure of its genome.

Arabidopsis genes are compact, containing only a few exons (protein-coding regions), separated by short (about 250 bp) non-coding DNA stretches (introns). The gaps between individual genes average 4.6 thousand nucleotide pairs. For comparison, we point out that human genes contain many tens and even hundreds of exons and introns, and intergenic regions have sizes of 10 thousand nucleotide pairs or more. It is believed that the presence of a small compact genome contributed to the evolutionary stability of Arabidopsis, since its DNA became less of a target for various damaging agents, in particular, for the introduction of virus-like repeating DNA fragments (transposons) into the genome.

Other molecular features of the Arabidopsis genome include the enrichment of exons with guanine and cytosine (44% in exons and 32% in introns) compared to animal genes, as well as the presence of twice repeated (duplicated) genes. It is believed that this doubling occurred as a result of four simultaneous events, which consisted in the doubling (repetition) of part of the Arabidopsis genes, or the fusion of related genomes. These events, which took place 100-200 million years ago, are a manifestation of the general tendency towards polyploidization (a multiple increase in the number of genomes in an organism), characteristic of plant genomes. However, some facts show that in Arabidopsis the duplicated genes are nonidentical and function differently, which may be due to mutations in their regulatory regions.

Another object of complete DNA sequencing was rice. The genome of this plant is also small (12 chromosomes, giving a total of 420-470 million bp), only 3.5 times larger than that of Arabidopsis. However, unlike Arabidopsis, rice is of enormous economic importance, being the staple food of more than half of humanity. Improving its properties is therefore of vital interest not only to billions of consumers, but also to the army of many millions of people engaged in the very labor-intensive process of growing it.

Some researchers began studying the rice genome back in the 1980s, but this work reached a serious scale only in the 1990s. In 1991, a program to decipher the structure of the rice genome was created in Japan, combining the efforts of many research groups. In 1997, on the basis of this program, the International Rice Genome Project was organized. Its participants decided to concentrate their efforts on sequencing one of the rice subspecies (Oryza sativa japonica), in the study of which significant progress had already been achieved by that time. The Human Genome program became a serious incentive and, figuratively speaking, a guiding star for such work.

As part of this program, the strategy of “chromosomal” hierarchical division of the genome, which participants in the international consortium used to decipher the rice genome, was tested. However, whereas in the study of the human genome fractions of individual chromosomes were isolated using various techniques, material specific to individual rice chromosomes and their sections was obtained by laser microdissection (cutting out microscopic objects). On the microscope slide where the rice chromosomes are located, a laser beam burns out everything except the chromosome or its sections intended for analysis. The remaining material is used for cloning and sequencing.

Numerous reports have been published on the results of sequencing individual fragments of the rice genome, carried out with the high accuracy and detail characteristic of hierarchical technology. It was believed that determination of the complete primary structure of the rice genome would be completed between the end of 2003 and mid-2004, and that the results, together with data on the primary structure of the Arabidopsis genome, would be widely used in the comparative genomics of other plants.

However, in early 2002, two research groups - one from China, the other from Switzerland and the United States - published the results of a complete draft (rough) sequencing of the rice genome, performed using total cloning technology. In contrast to the step-by-step (hierarchical) approach, the total approach is based on the simultaneous cloning of the entire genomic DNA in one of the viral or bacterial vectors, which yields a significant (for medium and large genomes, huge) number of individual clones containing different DNA segments. By analyzing these sequenced segments and overlapping their identical end sections, a contig is formed - a chain of DNA sequences joined together. The general (total) contig represents the primary structure of the entire genome or, at least, of an individual chromosome.
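The idea of joining sequenced segments by their identical end sections can be illustrated with a toy example. The sketch below uses a simple greedy rule (always merge the pair with the longest overlap); this is an assumption chosen for clarity, not the procedure used by either rice sequencing group, and the names overlap and greedy_contig are hypothetical. It ignores sequencing errors and repeated regions, both of which make real assembly far harder.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contig(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest end overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads  # nothing left to join: each entry is a contig
        # Join the two sequences, writing the shared end section only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Three made-up overlapping fragments of one short toy sequence.
fragments = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_contig(fragments))
```

If two fragments share no end section, they stay separate, which corresponds to a gap between contigs.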

In such a schematic presentation, the strategy of total cloning seems uncomplicated. In fact, it encounters serious difficulties associated with the need to obtain a huge number of clones (it is generally accepted that the genome or region being studied must be covered by overlapping clones at least 10 times over), a gigantic volume of sequencing and the extremely complex work of joining clones, which requires the participation of bioinformatics specialists. A serious obstacle to total cloning is the variety of repetitive DNA regions, the number of which, as already mentioned, increases sharply as the genome size increases. Therefore, the total sequencing strategy is used primarily in studying the genomes of viruses and microorganisms, although it was successfully applied to the genome of a multicellular organism, Drosophila.
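The scale of the task follows from simple arithmetic: the number of clones is the required coverage multiplied by the genome size and divided by the length of one sequenced segment. The sketch below is a rough illustration; the segment length of 500 bp and the genome size of 430 million bp are assumptions, and the formula for the covered fraction is the standard Poisson approximation (usually attributed to Lander and Waterman), which assumes clones fall at random, uniformly along the genome.

```python
import math


def clones_needed(genome_bp: int, insert_bp: int, coverage: float) -> int:
    """Number of clones whose combined length equals `coverage` times the genome."""
    return math.ceil(coverage * genome_bp / insert_bp)


def expected_fraction_covered(coverage: float) -> float:
    """Expected share of the genome hit by at least one clone (Poisson approximation)."""
    return 1.0 - math.exp(-coverage)


# Assumed values for illustration only.
GENOME_BP = 430_000_000   # within the 420-470 million bp range given for rice
INSERT_BP = 500           # hypothetical length of one sequenced segment

print(clones_needed(GENOME_BP, INSERT_BP, coverage=10))
print(f"{expected_fraction_covered(10):.4%}")
```

Under these idealized assumptions tenfold coverage leaves almost nothing uncovered; in practice repeats and cloning bias keep the real figure lower.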

The results of total sequencing of this genome were “superimposed” on a huge array of information about its chromosomal, gene and molecular structure obtained over an almost 100-year period of studying Drosophila. And yet, in terms of the degree of sequencing, the Drosophila genome (66% of the total genome size) is significantly inferior to the Arabidopsis genome (92%), despite their fairly similar sizes - 180 million and 125 million nucleotide pairs, respectively. Therefore, it has recently been proposed to call the technology used to sequence the Drosophila genome mixed.

To sequence the rice genome, the above-mentioned research groups took two of its subspecies, the most widely cultivated in Asian countries - Oryza sativa L. ssp. indica and Oryza sativa L. ssp. japonica. The results of their research coincide in many ways, but also differ in many ways. Thus, representatives of both groups stated that their contigs covered approximately 92-93% of the genome. It has been shown that about 42% of the rice genome is represented by short DNA repeats consisting of 20 nucleotide pairs, and the majority of mobile DNA elements (transposons) are located in intergenic regions. However, information about the size of the rice genome varies significantly.

For the Japanese subspecies (japonica), the genome size is determined to be 466 million nucleotide pairs, and for the Indian subspecies (indica) - 420 million. The reason for this discrepancy is not clear. It may be a consequence of different methodological approaches to determining the size of the non-coding part of genomes, that is, it may not reflect the true state of affairs. But it is possible that a difference of roughly 10% in the size of the studied genomes really exists.

The second serious discrepancy was revealed in the number of detected genes: for the Japanese subspecies - from 46,022 to 55,615 genes per genome, and for the Indian subspecies - from 32,000 to 50,000. The reason for this discrepancy is not clear.

The incompleteness and inconsistency of the information received is noted in the comments to the published articles. It is also hoped that the gaps in knowledge of the rice genome will be eliminated by comparing the data from “rough sequencing” with the results of detailed, hierarchical sequencing carried out by participants in the International Rice Genome Project.

COMPARATIVE AND FUNCTIONAL GENOMICS OF PLANTS

The extensive data obtained, half of which (the results of the Chinese group) are publicly available, undoubtedly open up broad prospects both for the study of the rice genome and for plant genomics in general. A comparison of the Arabidopsis and rice genomes showed that most of the genes (up to 80%) identified in the Arabidopsis genome are also found in the rice genome; however, for approximately half of the genes found in rice, analogues (orthologs) have not yet been found in the Arabidopsis genome. At the same time, 98% of the genes whose primary structure has been established for other cereals have been identified in the rice genome.

The significant (almost twofold) discrepancy in the number of genes in rice and Arabidopsis is puzzling. At the same time, the draft rice genome sequence obtained by total sequencing has hardly been compared with the extensive results of studying the rice genome by hierarchical cloning and sequencing; that is, what was done for the Drosophila genome has not been achieved here. Therefore, it remains unclear whether the difference in the number of genes in Arabidopsis and rice reflects the true state of affairs or is explained by differences in methodological approaches.

Unlike for the Arabidopsis genome, no information about duplicated genes is provided for the rice genome. It is possible that their relative abundance is greater in rice than in Arabidopsis. This possibility is indirectly supported by data on the presence of polyploid forms of rice. Greater clarity on this issue can be expected after the completion of the International Rice Genome Project and the obtaining of a detailed picture of the primary DNA structure of this genome. Serious grounds for such hope are given by the fact that after the publication of the works on the draft sequencing of the rice genome, the number of publications on the structure of this genome increased sharply; in particular, information appeared on the detailed sequencing of its chromosomes 1 and 4.

Knowing, at least approximately, the number of genes in plants is of fundamental importance for comparative plant genomics. At first it was believed that since all flowering plants are very close to each other in their phenotypic characteristics, their genomes should also be close. And if we study the Arabidopsis genome, we will get information about most genomes of other plants. Indirect confirmation of this assumption is provided by the results of sequencing the mouse genome, which is surprisingly close to the human genome (about 30 thousand genes, of which only 1 thousand turned out to be different).

It can be assumed that the reason for the differences in the genomes of Arabidopsis and rice lies in their belonging to different classes of plants - dicotyledons and monocotyledons. To clarify this issue, it is extremely desirable to know at least the rough primary structure of some other monocot plant. The most realistic candidate may be corn, whose genome is approximately equal to the human genome, but still significantly smaller than the genomes of other cereals. The food value of corn is well known.

The enormous material obtained from sequencing the genomes of Arabidopsis and rice is gradually becoming the basis for a large-scale study of plant genomes using comparative genomics methods. Such studies have general biological significance, as they make it possible to establish the main principles of the organization of the plant genome as a whole and their individual chromosomes, to identify common features of the structure of genes and their regulatory regions, and to consider the relationship between the functionally active (gene) part of the chromosome and various non-protein-coding intergenic DNA regions. Comparative genetics is also becoming increasingly important for the development of human functional genomics. It is for comparative studies that the genomes of puffer fish and mice were sequenced.

No less important is the study of individual genes responsible for the synthesis of individual proteins that determine specific functions of the body. It is in the detection, isolation, sequencing and establishment of the function of individual genes that the practical, primarily medical, significance of the Human Genome program lies. This circumstance was noted several years ago by J. Watson, who emphasized that the Human Genome program will be completed only when the functions of all human genes are determined.

Fig. 4. Classification of Arabidopsis genes by function

1 - genes for growth, division and DNA synthesis; 2 - RNA synthesis genes (transcription); 3 - genes for protein synthesis and modification; 4 - genes for development, aging and cell death; 5 - genes of cellular metabolism and energy metabolism; 6 - genes for intercellular interaction and signal transmission; 7 - genes for supporting other cellular processes; 8 - genes with unknown function

When it comes to the function of plant genes, we know less than one-tenth of what we know about human genes. Even in Arabidopsis, whose genome is much more studied than the human genome, the function of almost half of its genes remains unknown (Fig. 4). Meanwhile, plants, in addition to genes common to animals, have a significant number of genes specific only (or at least predominantly) to them. We are talking about genes involved in water transport and the synthesis of cell walls, which are absent in animals, about genes that ensure the formation and functioning of chloroplasts, photosynthesis, nitrogen fixation and the synthesis of numerous aromatic products. This list can be continued, but it is already clear how difficult the task is facing plant functional genomics.

Complete genome sequencing provides close to true information about the total number of genes of a given organism, allows more or less detailed and reliable information about their structure to be placed in data banks, and facilitates the work of isolating and studying individual genes. However, genome sequencing does not mean establishing the function of all genes.

One of the most promising approaches of functional genomics is based on identifying working genes, that is, genes from which mRNA is transcribed (read). This approach, including the use of modern microarray technology, makes it possible to simultaneously identify up to tens of thousands of functioning genes. Recently, the study of plant genomes using this approach has begun. For Arabidopsis, it was possible to obtain about 26 thousand individual transcripts, which greatly facilitates determining the function of almost all of its genes. In potatoes, it was possible to identify about 20,000 working genes that are important for understanding both the processes of growth and tuber formation and the processes of potato disease. It is expected that this knowledge will improve the resistance of one of the most important food products to pathogens.

A logical development of functional genomics is proteomics. This new field of science studies the proteome, which typically refers to the complete set of proteins in a cell at a given time. This set of proteins, reflecting the functional state of the genome, changes all the time, while the genome remains unchanged.

The study of proteins has long been used to make judgments about the activity of plant genomes. As is known, enzymes found in all plants differ in the sequence of amino acids in individual species and varieties. Such enzymes, with the same function, but different sequences of individual amino acids, are called isoenzymes. They have different physicochemical and immunological properties (molecular weight, charge), which can be detected using chromatography or electrophoresis. For many years, these methods have been successfully used to study so-called genetic polymorphism, that is, differences between organisms, varieties, populations, species, in particular wheat and related forms of cereals. However, recently, due to the rapid development of DNA analysis methods, including sequencing, the study of protein polymorphism has been replaced by the study of DNA polymorphism. However, direct study of the spectra of storage proteins (prolamins, gliadins, etc.), which determine the basic nutritional properties of cereals, remains an important and reliable method for genetic analysis, selection and seed production of agricultural plants.

Knowledge of genes and of the mechanisms of their expression and regulation is extremely important for the development of biotechnology and the production of transgenic plants. It is known that impressive successes in this area cause mixed reactions from the environmental and medical communities. However, there is an area of plant biotechnology where these fears, if not completely groundless, at least seem insignificant. We are talking about creating transgenic industrial plants that are not used as food products. India recently harvested its first crop of transgenic cotton that is resistant to a number of diseases. There is information about the introduction of special genes encoding pigment proteins into the cotton genome and the production of cotton fibers that do not require artificial dyeing. Another industrial crop that may be subject to effective genetic engineering is flax. Its use as an alternative to cotton for textile raw materials has been discussed recently. This problem is extremely important for our country, which has lost its own sources of cotton raw materials.

PROSPECTS FOR STUDYING PLANT GENOMES

It is obvious that structural studies of plant genomes will be based on the approaches and methods of comparative genomics, using the results of deciphering the genomes of Arabidopsis and rice as the main material. A significant role in the development of comparative plant genomics will, without a doubt, be played by the information that sooner or later will be provided by total (draft) sequencing of the genomes of other plants. In this case, comparative plant genomics will be based on establishing genetic relationships between individual loci and chromosomes belonging to different genomes. We will talk not so much about the general genomics of plants as about the selective genomics of individual chromosomal loci. Thus, it was recently shown that the gene responsible for vernalization is located in the Vrn-A1 locus of chromosome 5A of hexaploid wheat and the Hd-6 locus of chromosome 3 of rice.

The development of these studies will be a powerful impetus for the identification, isolation and sequencing of many functionally important plant genes, in particular genes responsible for disease resistance, drought resistance, and adaptability to various growing conditions. Functional genomics, based on mass identification (screening) of genes functioning in plants, will be increasingly used.

We can foresee further improvements in chromosomal technologies, primarily the microdissection method. Its use dramatically expands the possibilities of genomic research without requiring the huge costs of, for example, total genome sequencing. The method of localizing individual genes on plant chromosomes using in situ hybridization will become more widespread. At the moment, its use is limited by the huge number of repetitive sequences in the plant genome, and possibly by the peculiarities of the structural organization of plant chromosomes.

In the foreseeable future, chromosomal technologies will also become of great importance for the evolutionary genomics of plants. These technologies, which are relatively inexpensive, make it possible to quickly assess intra- and interspecific variability and study complex allopolyploid genomes of tetraploid and hexaploid wheat and triticale; analyze evolutionary processes at the chromosomal level; investigate the formation of synthetic genomes and the introduction (introgression) of foreign genetic material; identify genetic relationships between individual chromosomes of different species.

The study of plant karyotype using classical cytogenetic methods, enriched by molecular biological analysis and computer technologies, will be used to characterize the genome. This is especially important for studying the stability and variability of the karyotype at the level of not only individual organisms, but also populations, varieties and species. Finally, it is difficult to imagine how one can estimate the number and spectra of chromosomal rearrangements (aberrations, bridges) without the use of differential staining methods. Such studies are extremely promising for monitoring the environment based on the state of the plant genome.

In modern Russia, it is unlikely that direct sequencing of plant genomes will be carried out. Such work, which requires large investments, is unsustainable for our current economy. Meanwhile, information about the structure of the genomes of Arabidopsis and rice, obtained by world science and available in international data banks, is sufficient for the development of domestic plant genomics. It is possible to foresee an expansion of research into plant genomes based on comparative genomics approaches to solve specific problems of breeding and crop production, as well as to study the origin of various plant species of economic importance.

It can be assumed that in domestic breeding practice and plant growing, genomic approaches such as genetic typing (RFLP, RAPD, AFLP analyses, etc.), which are quite affordable for our budget, will be widely used. In parallel with direct methods for determining DNA polymorphism, approaches based on the study of protein polymorphism, primarily of the storage proteins of cereals, will be used to solve problems of genetics and plant breeding. Chromosome technologies will be widely used. They are relatively inexpensive, and their development requires quite moderate investments. In the field of chromosome research, domestic science is not inferior to world science.

It should be emphasized that our science has made a significant contribution to the formation and development of plant genomics.

The fundamental role was played by N.I. Vavilov (1887-1943).

In molecular biology and plant genomics, the pioneering contribution of A.N. Belozersky (1905-1972) is obvious.

In the field of chromosome research, it is necessary to note the work of the outstanding geneticist S.G. Navashin (1857-1930), who first discovered satellite chromosomes in plants and proved that it is possible to distinguish individual chromosomes by the characteristics of their morphology.

Another classic of Russian science G.A. Levitsky (1878-1942) described in detail the chromosomes of rye, wheat, barley, peas and sugar beets, introduced the term “karyotype” into science and developed the doctrine of it.

Modern specialists, relying on the achievements of world science, can make a significant contribution to the further development of plant genetics and genomics.

The author expresses his heartfelt gratitude to Academician Yu.P. Altukhov for critical discussion of the article and valuable advice.

The work of the team headed by the author of the article was supported by the Russian Foundation for Basic Research (grants No. 99-04-48832; 00-04-49036; 00-04-81086), the Program of the President of the Russian Federation for the support of scientific schools (grants No. 00-115-97833 and NSh-1794.2003.4) and the Program of the Russian Academy of Sciences "Molecular genetic and chromosomal markers in the development of modern methods of selection and seed production."

LITERATURE

1. Zelenin A.V., Badaeva E.D., Muravenko O.V. Introduction to plant genomics // Molecular biology. 2001. T. 35. pp. 339-348.

2. Pen E. Bonanza for Plant Genomics // Science. 1998. V. 282. P. 652-654.

3. Plant genomics // Proc. Natl. Acad. Sci. USA. 1998. V. 95. P. 1962-2032.

4. Kartel N.A. et al. Genetics. Encyclopedic Dictionary. Minsk: Technologia, 1999.

5. Badaeva E.D., Friebe B., Gill B.S. Genome differentiation in Aegilops. 1. Distribution of highly repetitive DNA sequences on chromosomes of diploid species // Genome. 1996. V. 39. P. 293-306.

History of chromosome analysis // Biol. membranes. 2001. T. 18. pp. 164-172.

On 05.09.2011 at 09:36, Limarev said:

Limarev V.N.

Decoding the human genome.

Fragment from the book by L.G. Puchko: “Radiesthetic cognition of man”

To solve the problem of deciphering the genome, an international project “human genome” was organized with a budget of billions of dollars.

By 2000, the human genome was virtually mapped. Genes were counted, identified and recorded in databases. These are huge amounts of information.

Recording the human genome in digitized form takes about 300 terabytes of computer memory, which is equivalent to 3 thousand hard drives with a capacity of 100 gigabytes.

It turned out that a person has not hundreds of thousands of genes, as previously thought, but just over 30 thousand. The fruit fly Drosophila has only about half as many - roughly 13 thousand - and the mouse has almost the same number as a person. Only about 1% of the genes in the deciphered genome are unique to humans. Most of the DNA helix, as it turned out, is occupied not by genes but by so-called “empty sections”, in which no genes are encoded, as well as by duplicated fragments repeated one after another, whose meaning and significance are unclear.

In a word, genes turned out to be not even the building blocks of life, but only elements of the blueprint according to which the building of the body is built. Building blocks, as was generally believed before the rise of genetics, are proteins.

It has become absolutely obvious that 1% of genes unique to humans cannot encode such a huge amount of information that distinguishes a person from a mouse. Where is all the information stored? For many scientists, the fact becomes undeniable that without the Divine principle it is impossible to explain human nature. A number of scientists suggest that, within the framework of existing ideas about the human body, it is in principle impossible to decipher the human genome.

The world is not known - it is knowable (my comments on the article).

1) Consider the fragment: “Without the Divine principle, it is impossible to explain human nature.”

The information presented above does not in any way indicate this.

The genome indeed has a more complex structure than previously thought.

But, after all, the computer mentioned in the article does not consist only of memory cells.

A computer has two kinds of memory, long-term and operational (working), as well as a processor in which information is processed. The electromagnetic field is also involved in information processing. To decipher the information in the genome, it is necessary to understand not only how the information is stored but also how it is processed. I also admit the idea that some of the information is stored by means of an electromagnetic field, and also outside a person, as I already wrote, in special information centers of the Supreme Mind.

Just imagine a continuous text encoded in a binary code of 0s and 1s, like Morse code, when you do not know what language it is written in (English or French...) and do not know that this continuous text consists of words, sentences, paragraphs, chapters, volumes, shelves, cabinets, etc.

It’s almost the same in biology, only here everything is encoded with a four-letter code, and so far we have deciphered only the order of the elementary symbols (call them + - / *), but we do not know the language and, accordingly, its words, sentences, paragraphs, chapters, volumes, shelves, cabinets, etc. For us, the deciphered genome is still a solid text in a four-letter code, and it is almost impossible to study it all head-on.

But it turns out that in certain periods of time (in the individual, in its cohort of generations, and in the species and genus) some genes and their complexes (responsible for words, sentences, paragraphs, chapters, volumes, shelves, cabinets, etc.) are active, and in other periods of evolution they are passive, which I determined indirectly from various polygenic characteristics (as shown in the topic General Periodic Law of Evolution).

There are currently only two methods for studying genes: a simple laboratory count of the total amount of genes (DNA) in a sample, and a device that measures the amount of RNA that sticks to specific DNA placed on an electronic chip. But since at any given time a huge number of genes are active and, accordingly, a huge number of different proteins are produced through RNA, it is very difficult to separate “these noodles with a spoon, fork and Japanese chopsticks” in this soup and find what you are looking for: the cause-and-effect relationships between specific DNA (as a DNA complex) and its influence on a polygenic trait.

It seems that I have found a simple method of how to sort out this whole soup of DNA, RNA and their proteins that determine the degree of a polygenic trait.

As it turned out, each polygenic trait is periodic over the course of evolution of an individual (and of a cohort of generations, a species and a genus); therefore, the activity of RNA and DNA must be periodic as well. So one just needs to find (without first going into genetic details) the correlation between the metric change in the polygenic trait (in the individual, cohort of generations, species, genus...) and the corresponding activity of RNA and DNA, proportional to these periods.

completely defined. Therefore, the work to decipher the nematode genome should be considered very successful.

Even greater success is associated with deciphering the Drosophila genome, which is only about 20 times smaller than human DNA and about 2 times larger than nematode DNA. Despite the high degree of genetic knowledge of Drosophila, about 10% of its genes were unknown until this moment. But the most paradoxical thing is that Drosophila, which is much more highly organized than the nematode, has fewer genes than a microscopic roundworm! From a modern biological point of view, this is difficult to explain. The deciphered genome of Arabidopsis, a plant from the cruciferous family widely used by geneticists as a classic experimental object, also contains more genes than that of Drosophila.

The development of genomic projects was accompanied by intensive development in many areas of science and technology. Thus, bioinformatics received a powerful impetus for its development. A new mathematical apparatus was created for storing and processing huge amounts of information; supercomputer systems with unprecedented power have been designed; thousands of programs have been written that make it possible, in a matter of minutes, to carry out a comparative analysis of various blocks of information, to enter into computer databases daily the new data obtained in various laboratories around the world, and to adapt new information to that which was accumulated earlier. At the same time, systems were developed for the efficient isolation of various elements of the genome and for automatic sequencing, that is, determining the nucleotide sequences of DNA. On this basis, powerful robots were designed that significantly speed up sequencing and make it less expensive.

The development of genomics, in turn, has led to the discovery of a huge number of new facts. The significance of many of them remains to be assessed in the future. But even now it is obvious that these discoveries will lead to a rethinking of many theoretical positions concerning the emergence and evolution of various forms of life on Earth. They will contribute to a better understanding of the molecular mechanisms underlying the functioning of individual cells and their interactions; to the detailed decoding of many still unknown biochemical cycles; and to the analysis of their connection with fundamental physiological processes.

Thus, there is a transition from structural genomics to functional genomics, which in turn creates the prerequisites for research into the molecular basis of the functioning of cells and the organism as a whole.

The information accumulated now will be the subject of analysis within the next few decades. But every next step in the direction of deciphering the structure of genomes of different species gives rise to new technologies that facilitate the process of obtaining information. So, the use of data on the structure and function of genes of lower organized species of living beings can significantly speed up the search

are replacing rather labor-intensive molecular methods for searching for genes.

The most important consequence of deciphering the genome structure of a particular species is the ability to identify all its genes and, accordingly, to identify and determine the molecular nature of the transcribed RNA molecules and all its proteins. By analogy with the genome, the concepts of the transcriptome, which unites the pool of RNA molecules formed as a result of transcription, and the proteome, which includes the many proteins encoded by genes, were born. Thus, genomics creates the foundation for the intensive development of new sciences - proteomics and transcriptomics. Proteomics deals with the study of the structure and function of each protein; analysis of the protein composition of the cell; determination of the molecular basis of the functioning of an individual cell, which is the result of the coordinated work of many hundreds of proteins; and the study of the formation of a phenotypic trait of an organism, which results from the coordinated work of billions of cells.

Very important biological processes also occur at the RNA level. Their analysis is the subject of transcriptomics.

The greatest efforts of scientists from many countries working in the field of genomics were directed at carrying out the international “Human Genome” project. Significant progress in this area is associated with the implementation of the idea, proposed by J. C. Venter, of searching for and analyzing expressed DNA sequences, which can later be used as a kind of “tags” or markers of certain regions of the genome. Another independent and no less fruitful approach was used in the work of the group led by F. Collins. It is based on the primary identification of genes for hereditary human diseases.

Decoding the structure of the human genome has led to a sensational discovery. It turned out that the human genome contains only 32,000 genes, which is several times less than the number of proteins. At the same time, there are only 24,000 protein-coding genes; the products of the remaining genes are RNA molecules.

The percentage of similarity in DNA nucleotide sequences between different individuals, ethnic groups and races is 99.9%.

This similarity is what makes us human – Homo sapiens! All of our variability at the nucleotide level fits into a very modest figure - 0.1%.

Thus, genetics leaves no room for ideas of national or racial superiority.

But let's look at each other: we are all different. National, and even more so racial, differences are more noticeable still. So how many mutations determine human variability, not in percentage but in absolute terms? To get this estimate, we need to recall the size of the genome. The length of human DNA is 3.2 × 10^9 base pairs, and 0.1% of this is 3.2 million nucleotides. But remember that the coding part of the genome occupies less than 3% of the total length of the DNA, and mutations outside this region most often have no effect on phenotypic variability. Thus, to obtain an overall estimate of the number of mutations that affect the phenotype, we take 3% of 3.2 million nucleotides, which gives 96,000, a figure of the order of 100,000. That is, about 100 thousand mutations form our phenotypic variability. If we compare this figure with the total number of genes, it turns out that on average there are 3-4 mutations per gene.
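The estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the names are made up for this example, and the inputs are the rounded figures quoted in the text (3.2 × 10^9 base pairs, 0.1% variability, a coding share taken at its upper bound of 3%, and 24,000 to 32,000 genes).

```python
# Rounded figures quoted in the text; all names here are illustrative.
GENOME_SIZE_BP = 3.2e9        # length of the human genome, base pairs
VARIABLE_FRACTION = 0.001     # 0.1% difference between individuals
CODING_FRACTION = 0.03        # coding part, "less than 3%", taken as an upper bound
PROTEIN_CODING_GENES = 24_000
TOTAL_GENES = 32_000


def estimate_phenotypic_mutations(genome_size, variable_fraction, coding_fraction):
    """Return (variable positions, variable positions in the coding part)."""
    variable_positions = genome_size * variable_fraction
    return variable_positions, variable_positions * coding_fraction


variable, coding = estimate_phenotypic_mutations(
    GENOME_SIZE_BP, VARIABLE_FRACTION, CODING_FRACTION
)
# Dividing by the larger gene count gives the lower bound, and vice versa.
per_gene_low = coding / TOTAL_GENES
per_gene_high = coding / PROTEIN_CODING_GENES

print(f"variable positions: {variable:,.0f}")
print(f"in coding regions:  {coding:,.0f}")
print(f"per gene:           {per_gene_low:.1f} to {per_gene_high:.1f}")
```

With these inputs the coding-region figure comes to about 96,000, which the text rounds to 100,000, and the per-gene average falls between 3 and 4 depending on whether all genes or only protein-coding genes are counted.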

What are these mutations? The vast majority of them (at least 70%) determine our individual non-pathological variability: what distinguishes us from one another without making any of us worse than the others. This includes characteristics such as eye, hair and skin color, body type, height, weight, type of behavior (which is also largely genetically determined), and much more. About 5% of mutations are associated with monogenic diseases. The remaining mutations, about a quarter of the total, belong to the class of functional polymorphisms. They are involved in the formation of hereditary predisposition to widespread multifactorial pathology. Of course, these estimates are quite rough, but they make it possible to judge the structure of human hereditary variability.

Chapter 1.16. Molecular genetic basis of evolution

The revolution in the field of molecular biology that occurred at the turn of the millennium, culminating in the deciphering of the structure of the genomes of many hundreds of species of microorganisms, as well as some species of protozoa, yeast, plants, animals and humans, upended many traditional ideas of classical genetics and brought the possibility of studying the molecular mechanisms of evolution and speciation very close. A new science was born, comparative genomics, which makes it possible to register the appearance in various phylogenetic lines of evolutionarily significant events occurring at the level of individual molecules. It turned out that, in the general case, evolutionary progress is associated not only, and not so much, with an increase in the number, extent and even complexity of the structural organization of genes, but to a much greater extent with a change in the regulation of their work, which determines the coordination and tissue specificity of the expression of tens of thousands of genes. This ultimately led to the appearance in higher organisms of more complex, highly specific, multifunctional complexes of interacting proteins capable of performing fundamentally new tasks.

Let us consider the nature of changes occurring in the process of evolution at three information levels: DNA - RNA - protein, or genome - transcriptome - proteome. In general, we can say that as the complexity of the organization of life increases, the size of the genome increases. Thus, the DNA size of prokaryotes does not exceed 8 × 10^6 bp, it becomes twice as large in yeast and protozoa, 10-15 times larger in insects, and in mammals the increase reaches 3 orders of magnitude, that is, a thousand times (10^3).

However, this dependence is not linear. Thus, within mammals we no longer observe a significant increase in genome size. In addition, it is not always possible to observe the relationship between the size of the genome and the complexity of the organization of life. Thus, in some plants the genome size is an order of magnitude or even two orders of magnitude larger than that of humans. Let us recall that the increase in the genome size of eukaryotes compared to prokaryotes occurs mainly due to the appearance of non-coding sequences, that is, optional elements. We have already said that in the human genome exons total no more than 1-3%. This means that the number of genes in higher organisms can be only several times greater than in microorganisms.

The increasing complexity of eukaryotic organization is partly explained by the emergence of an additional regulatory system necessary for ensuring tissue specificity of gene expression. One of the consequences of the discontinuous organization of genes that emerged in eukaryotes was the widespread occurrence of alternative splicing and alternative transcription. This led to the emergence of a new property in a huge number of genes: the ability to encode multiple functionally different protein isoforms. Thus, the total number of proteins, that is, the size of the proteome, in higher organisms may be several times greater than the number of genes.

In prokaryotes, intraspecific variability in the number of genes is permissible, and such differences between different strains of many microorganisms, including pathogenic ones, can amount to tens of percent. Moreover, the complexity of the organization of various types of microorganisms directly correlates with the number and length of coding sequences.

Thus, in prokaryotes, phenotypic intra- and interspecific variability is strictly associated with the sizes of the transcriptome and proteome, which are very similar to each other. In eukaryotes, the number of genes is a strictly determined species characteristic, and the increase in evolutionary complexity is based on another principle: the differential multi-level use of various components of a limited and fairly stable proteome.

Sequencing the genomes of nematodes and Drosophila showed that the sizes of the proteomes in these very different species are very similar and only twice as large as those of yeast and some types of bacteria. This pattern, a significant increase in the complexity of the organization of various life forms while the size of the proteome stays the same or grows only slightly, is characteristic of all subsequent evolution up to humans. Thus, the proteomes of humans and mice practically do not differ from each other and are less than 2 times larger than the proteomes of the microscopic roundworm (nematode) or the fruit fly Drosophila. Moreover, the identity of the DNA nucleotide sequences of humans and great apes is 98.5%, and in coding regions it reaches 99%. These figures differ little from the value of 99.9% that characterizes intraspecific similarity in DNA nucleotide sequences between different individuals, peoples and races inhabiting our planet. So which changes, constituting no more than 1.5% of the entire genome, are key to the formation of a human being? The answer to this question, apparently, should be sought not only at the genomic and proteomic levels.

Indeed, along with the relative stability of the proteome, in the process of evolution there is a sharp increase in the size and complexity of the organization of the eukaryotic transcriptome due to the appearance in the genome of a huge amount of transcribed non-coding DNA, as well as a significant expansion of the class of RNA-coding genes. RNAs that do not code for proteins, the main source of which are introns, constitute the vast majority of the transcriptome of higher organisms, reaching 97-98% of all transcription units. The functions of these molecules are currently being intensively analyzed.

Thus, key evolutionary changes occur against the background of an increase in genome size, a fairly stable proteome, and a sharp increase in transcriptome size (Fig. 31).

Figure 31. Evolutionary changes occurring at three information levels

At the same time, the transition from simple forms of life to more complex ones clearly correlates with the emergence and widespread distribution in the genome of two fundamental and to some extent interrelated evolutionary acquisitions: non-coding DNA and repetitive elements. A direct consequence of these changes occurring at the genomic level is the appearance in the process of evolution of a huge number of non-protein-coding RNAs.

What is the structural basis of these evolutionary transformations?

All major evolutionary transitions: from prokaryotes to eukaryotes, from protozoa to metazoans, from the first animals to bilaterians, and from primitive chordates to vertebrates, were accompanied by a sharp increase in genome complexity. Apparently, such leaps in evolution are the result of rare cases of successful fusion of entire genomes of different species belonging to systematic classes that have diverged a considerable distance from each other. Thus, the symbiosis of Archaea and Bacteria marked the beginning of the transition from prokaryotes to eukaryotes. It is obvious that mitochondria, chloroplasts and some other cell organelles also appeared as a result of endosymbiosis. A fundamental property of higher eukaryotes, diploidy, arose as a result of well-regulated genomic duplication that occurred about 500 million years ago.

Genomic duplications within a species occurred quite frequently, and examples of this are numerous cases of polyploidy in plants, fungi and sometimes even in animals. However, the potential mechanisms leading to the emergence of fundamentally new forms of life in the process of evolution are not autopolyploidy, but hybridization and horizontal transfer or fusion of genomes. It is noteworthy that the most significant evolutionary transformations, accompanied by the fusion of entire genomes, occur under extraordinary conditions, during periods of major geological transitions, such as changes in the concentration of oxygen in the atmosphere, glaciation of the Earth, or the Cambrian Explosion.

In relatively calm geological conditions, duplications of individual genes or chromosomal segments with their subsequent divergence turn out to be more significant for evolution. A comparison of the nucleotide sequences of sequenced genomes shows that the frequency of gene duplications is quite high and, on average, is 0.01 per gene per million years. The vast majority of them do not manifest themselves over the next several million years, and only in rare cases can duplicated genes acquire new adaptive functions. However, a large class of “silent” gene duplications serves as a kind of reserve fund for the birth of new genes and the formation of new species. The human genome contains from 10 to 20 thousand copies of processed genes that arose through the retroposition of mRNA. Most of them belong to the class of pseudogenes, that is, they are not expressed either due to the presence of mutations or due to insertion into transcriptionally inactive regions of the genome. However, some of these genes are active, and the nature of their expression and even their functions may differ from those of the founder genes.
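To give a sense of scale for the duplication frequency quoted above (0.01 per gene per million years), here is a rough sketch. It assumes that the rate is constant and applies uniformly to every gene, which is a simplification, and it borrows the figure of 24,000 protein-coding genes quoted earlier for the human genome. The function name is hypothetical.

```python
DUPLICATION_RATE = 0.01   # duplications per gene per million years (from the text)
GENE_COUNT = 24_000       # protein-coding genes in the human genome (from the text)


def expected_duplications(rate, genes, million_years):
    """Expected number of gene duplication events under a constant, uniform rate."""
    return rate * genes * million_years


# Time spans in millions of years, chosen only for illustration.
for span in (1, 10, 35):
    events = expected_duplications(DUPLICATION_RATE, GENE_COUNT, span)
    print(f"{span} million years: about {events:,.0f} duplication events")
```

Under these assumptions a genome of this size would accumulate a few hundred duplication events per million years, which is consistent with the idea of a large reserve of “silent” duplicates, most of which never acquire a new function.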

Segmental duplications, which belong to the class of low copy repeats (LCR) and arose less than 35 million years ago, play a special role in the evolution of primates and humans. These sequences are highly identical blocks of DNA, varying in size from one to several hundred kilobases. Most often, segmental duplications are localized in the pericentromeric or telomeric regions of various chromosomes, and in total they occupy about 5% of the human genome.

No segmental duplications were found in other sequenced genomes.

The minimal module of segmental duplication, called a duplicon, contains fragments of unrelated unprocessed genes, and this distinguishes it from other known types of repeated sequences. Under certain conditions, duplicons can serve as sources for the creation of new chimeric transcribed genes or gene families from various combinations of the coding exons present in them. It is estimated that between 150 and 350 genes may distinguish the chimpanzee and human genomes.

Without diminishing the importance of the appearance of new and the disappearance of old coding sequences for speciation, we should emphasize the real possibility that other mechanisms exist which play a decisive role in the evolution of eukaryotes.

One of the driving mechanisms of evolution is mobile elements, found in all species studied in this regard.

Genomic changes accompanying the process of speciation may include extensive karyotype reorganizations, local chromosomal rearrangements, duplications of gene families, modifications of individual genes accompanied by their birth or loss, as well as differences in gene expression, regulated both at the level of transcription and at the levels of splicing or translation. Mobile elements are directly related to all these processes.

In some cases, transposable elements themselves carry sequences encoding enzymes whose presence is necessary for DNA transposition or RNA retroposition.

Similar sequences are present in the genomes of retroviruses, LTR elements and transposons. Retrotransposons also include the most numerous class of transposable elements: Alu repeats. Alu repeats first appear in primates about 50-60 million years ago, originating from a small RNA-coding gene. In the process of further evolution, divergence and powerful amplification of this family occur. The transition from primates to humans is accompanied by an explosive increase in the number of Alu repeats, the number of copies of which, according to some estimates, reaches 1.1 million. Alu repeats occupy about 10% of the human genome, but their distribution is uneven, since they are mostly associated with genes. These elements are rarely present in coding exons and are quite often found in introns and non-coding regions of mRNA, influencing the stability of these molecules and/or translation efficiency. The presence of Alu sequences in the intronic regions of genes may be accompanied by a change in the nature of pre-RNA processing, since these sequences contain regions homologous to donor and acceptor splice sites. When Alu elements are inserted into the regulatory regions of a gene, transcription may be disrupted, resulting in