Noisy Nucleotides Scientific American Sept 92
DNA sequences show fractal correlations
Commuter traffic, earthquakes and the selection of presidential candidates usually seem to take place in random ways. But investigators of chaos theory who turn to patterns called fractals manage to find order in the midst of such unpredictable events. Now add DNA to the catalogue of things fractal-like. "There is some magical phenomenon going on that we just do not understand," says H. Eugene Stanley, a physicist at Boston University. The orderly patterns of fractals emerge because an incident in the apparently chaotic system is actually correlated with a previous occurrence, much as a long-lasting pocket of slowly moving vehicles can result from one rubber-necking motorist briefly hitting the brake.
Calculations by Stanley's research group and by Wentian Li of the Rockefeller University and Richard F. Voss of the IBM Thomas J. Watson Research Center have shown that the position of each nucleotide (adenine, guanine, cytosine or thymine) in a DNA sequence depends to some extent on the nucleotides that preceded it. The patterns in nucleotide sequences are similar to flicker, or 1/f, noise (pronounced "one over eff"). These fluctuations are time-scale analogues to the shapes of fractals, such as snowflakes and coastlines, which have the property of self-similarity: the component parts resemble the structure as a whole. The patterns of 1/f noise are just as prevalent in nature as their geometric counterparts; they can be found in such diverse phenomena as electric circuits and flood records. "It is a special form of correlations found in natural phenomena and in human behavior," says Voss, who found its presence in music.
If signals are completely random, as are the outcomes of flipping a coin, the results would show a "white noise" signature. Unless the coin is weighted, the probability of heads will be one half regardless of what came before. The outcome would be a collection of random signals similar to those that produce the hissing sounds between FM stations. In fact, "if you want to store as much information as possible over some length, the best storage method will be like white noise," notes Chung-Kang Peng, a member of Stanley's group. That is because every signal would be completely independent and carry its own message. If genetic data were stored as white noise, the probability of finding the same information along a strand of DNA would decay exponentially the farther along the sequence one looked. "If the second nucleotide knows 50 percent of what the first knows, then the third knows only 25 percent," Peng says.
But base pairs in DNA do not seem to occur in a completely random fashion. The researchers applied different statistical techniques to DNA sequences catalogued in GenBank, a storehouse of genetic codes at Los Alamos National Laboratory. They found that the decay is much slower than an exponential decrease. The sequences show approximate 1/f patterns. Roughly speaking, the f here represents the number of bases over which a particular nucleotide repeats. Along with Stanley's group, Li, a physicist, found that correlations exist in the intron sequences of DNA. Molecular biologists have sometimes referred to introns as "junk DNA" because they do not encode structural information. The real information-carrying regions are located in sequences called exons. Yet unlike introns, exons lack long-range correlation and resemble white noise. Exactly why intron but not exon sequences show the correlations is not entirely known. The researchers think the long-range correlations, which extend over thousands of base-pair positions, represent a trade-off between efficient information storage and protection against error in the genetic code. Because changing one part forces a change in other parts, there is some redundancy in the code. Thus, the correlations "would give some immunity to error during transcription," Voss says. Exons, which hold the crucial data, would not exhibit long-range correlations because they need to carry as much information as possible.
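The contrast between the exponential decay in Peng's halving example and the much slower decay reported for DNA can be illustrated numerically. The sketch below is only an idealization: the function names are hypothetical, and the power-law exponent of 0.5 is an arbitrary assumption chosen for illustration, not a value measured by any of the groups mentioned here.

```python
# Illustrative sketch only: two idealized ways a correlation can fall off
# with distance along a sequence. Neither function is taken from the
# researchers' work; the default exponent 0.5 is an arbitrary assumption.


def exponential_decay(distance, ratio=0.5):
    """Correlation that shrinks by a fixed ratio at every step.

    With ratio=0.5 this matches Peng's example: 50 percent after one
    step, 25 percent after two.
    """
    return ratio ** distance


def power_law_decay(distance, exponent=0.5):
    """Correlation that falls off as a power of the distance (distance >= 1).

    This is the kind of slow decay associated with long-range correlations.
    """
    return distance ** -exponent


if __name__ == "__main__":
    print("distance  exponential  power law")
    for d in (1, 2, 10, 100, 1000):
        print(f"{d:>8}  {exponential_decay(d):.3e}  {power_law_decay(d):.3e}")
```

In this idealization the exponential curve is effectively zero after a few dozen positions, whereas the power-law curve is still appreciable thousands of positions away, which is the qualitative sense in which a correlation can be called long-range.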
Indeed, Voss has turned up some intriguing results based on evolutionary classification. He found that the sequences for organisms lowest on the evolutionary scale (bacteria and bacteriophages) were the least correlated. The correlations increased for higher organisms, reaching perfect 1/f patterns in invertebrates. Then the scaling correlations decreased for vertebrates, mammals, rodents and finally primates. Stanley and his colleagues will soon publish a slightly different result: the correlations increased as they moved up the entire evolutionary ladder. The group discovered that "as you evolve, the long-range correlations become stronger and stronger," Stanley says.
The results indicate that simpler organisms (that is, those with short sequences) would not need the error protection required to maintain the more complex DNA sequences. "There seems to be a general principle involved," Voss observes. "Nature has these fractal and 1/f type fluctuations," he remarks.
That may explain in part why music is pleasing. "One speculation is that music is trying to imitate nature and builds in 1/f noise," Voss explains. But a cohesive model to account for the ubiquity of 1/f and fractal phenomena eludes researchers. Like many a scientist and harried office worker before him, Li laments: "There is still a lot of work to be done." -Philip Yam
Word use in yeast DNA changes in a regular way in noncoding regions (top) but not in coding regions (bottom).
Talking Trash: What's in a Word? Scientific American Mar 95
What's in a word? Several nucleotides, some researchers might say. By applying statistical methods developed by linguists, investigators have found that "junk" parts of the genomes of many organisms may be expressing a language. These regions have traditionally been regarded as "useless" accumulations of material from millions of years of evolution. "The feeling is," says Boston University physicist Eugene Stanley, "that there's something going on in the noncoding region."
Junk DNA got its name because the nucleotides there (the fundamental pieces of DNA, combined into so-called base pairs) do not encode instructions for making proteins, the basis for life. In fact, the vast majority of genetic material in organisms from bacteria to mammals consists of noncoding DNA segments, which are interspersed with the coding parts. In humans, about 97 percent of the genome is junk. Over the past 10 years biologists began to suspect that this feature is not entirely trivial. "It's unlikely that every base pair in noncoding DNA is critical, but it is also foolish to say that all of it is junk," notes Robert Tjian, a biochemist at the University of California at Berkeley. For instance, studies have found that mutations in certain parts of the noncoding regions lead to cancer. Physicists backed the suspicions a few years ago, when those studying fractals noticed certain patterns in junk DNA. They found that noncoding sequences display what are termed long-range correlations. That is, the position of a nucleotide depends to some extent on the placement of other nucleotides. Their patterns follow a fractal-like property called 1/f noise, which is inherent in many physical systems that evolve over time, such as electronic circuits, the periodicity of earthquakes and even traffic patterns. In the genome, however, the long-range correlations held only for the noncoding sequences; the coding parts exhibited an uncorrelated pattern. Those signs suggested that junk DNA might contain some kind of organized information. To decipher the message, Stanley and his colleagues Rosario N. Mantegna, Sergey V. Buldyrev and Shlomo Havlin collaborated with Ary L. Goldberger, Chung-Kang Peng and Michael Simons of Harvard Medical School. They borrowed from the work of linguist George K. Zipf, who, by looking at texts from several languages, ranked the frequency with which words occur. Plotting the frequency of words against their rank in a text produces a distinct relation.
The most common word ("the" in English) occurs 10 times more often than the 10th most common word, 100 times more often than the 100th most common, and so forth. The researchers tested the relation on 40 DNA sequences of species ranging from viruses to humans. They then grouped pairs of nucleotides to create words between three and eight pairs long (it takes three pairs to specify an amino acid). In every case, they found that noncoding regions followed the Zipf relation more closely than did coding regions, suggesting that junk DNA follows the structure of languages. "We didn't expect the coding DNA to obey Zipf," Stanley notes. "A code is literal: one if by land, two if by sea. You can't have any mistakes in a code." Language, in contrast, is a statistical, structured system with built-in redundancies. A few mumbled words or scattered typos usually do not render a sentence incomprehensible. In fact, the workers tested this notion of repetition by applying a second analysis, this time from information theorist Claude E. Shannon, who in the 1950s quantified redundancies in languages. They found that junk DNA contains three to four times the redundancies of coding segments. Because of the statistical nature of the results, the researchers admit their findings are unlikely to help biologists identify functional aspects of junk DNA. Rather, the work may indicate something about efficient information storage. "There has to be some sort of hierarchical arrangement of the information to allow one to use it in an efficient fashion and to have some adaptability and flexibility," Goldberger observes.
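The two analyses described above, a Zipf rank-frequency test and a Shannon-style redundancy measure, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the researchers' actual procedure: the function names are hypothetical, words are taken from a sliding window that advances one base at a time, redundancy is computed with the textbook definition 1 - H/H_max, and the toy sequence is invented.

```python
import math
from collections import Counter


def count_words(sequence, word_length):
    """Count overlapping fixed-length 'words' in a nucleotide string.

    Assumption: the window slides one base at a time.
    """
    return Counter(
        sequence[i:i + word_length]
        for i in range(len(sequence) - word_length + 1)
    )


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    An ideal Zipf relation (frequency proportional to 1/rank) gives -1.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def redundancy(counts, word_length, alphabet_size=4):
    """Textbook redundancy 1 - H/H_max of the observed word distribution.

    H is the Shannon entropy in bits; H_max assumes every word of the
    given length over a four-letter alphabet is equally likely.
    """
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = word_length * math.log2(alphabet_size)
    return 1 - entropy / max_entropy


if __name__ == "__main__":
    toy_sequence = "ATGCGATATATATCGCGATATTTAGCGC" * 3  # invented, not real data
    words = count_words(toy_sequence, 3)
    print("Zipf slope:", zipf_slope(words))
    print("Redundancy:", redundancy(words, 3))
```

On real data the comparison of interest would be between the values obtained for coding and noncoding stretches of the same genome. A short toy string like the one here is far too small to say anything meaningful, and estimates for longer words need correspondingly longer sequences.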
Another speculation is that the junk sequences may be essential to the way DNA has to fold to fit into the nucleus.
Some researchers question whether the group has found anything significant. One of those is Benoit Mandelbrot of Yale University. In the 1950s the mathematician pointed out that Zipf's law is a statistical numbers game that has little to do with recognizable language features, such as semantics. Moreover, he claims the group made several errors. "Their evidence does not establish Zipf's law even remotely," he says. But such criticisms are not stopping the Boston workers from trying to decipher junk DNA's tongue. "It could be a dead language," Stanley says, "but the search will be exciting."