Short reminder:
The human genome, with all its magic, is about 3,117,275,501 base pairs long. Source: Wikipedia
If you encoded that data digitally and stored it on an SSD, it would take up < 1 GB.
So, if we can do so much magic with 1 GB, that should be an inspiration to all software to do more, with less space.
Thank you for coming to my talk.
1 GB is okay, can we — for the sake of this argument — compress it?
Yes, but I don’t know how much (and it would vary based on numerous factors).
An uncompressed format would need 3,117,275,501 × 2 bits to guarantee that it can encode any DNA sequence of 3,117,275,501 base pairs, since each base is one of four letters (caveats and nuances aside).
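The raw-size arithmetic is quick to check (Python; 2 bits per base because there are four bases):

```python
# Back-of-the-envelope: 2 bits per base (A, C, G, T) for the whole genome.
GENOME_LENGTH = 3_117_275_501  # base pairs, figure from above

bits = GENOME_LENGTH * 2
bytes_needed = bits / 8
print(f"{bytes_needed / 1e9:.2f} GB")  # ~0.78 GB, i.e. under 1 GB
```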
However, human DNA sequences aren’t completely random! There are constraints on what would actually be a valid human DNA sequence. That opens the possibility of compressing the data.
For example, you’ll never find someone whose genome is 3,117,275,501 copies of the exact same base (i.e. AAAAAAAAAA…AAAAAA); it’s impossible. Based on that alone, you don’t actually need all 3,117,275,501 × 2 bits of information. In fact, the set of valid human DNA sequences is probably considerably smaller than the set of all possible DNA sequences of the same length (I can’t find any specific data here, so you’ll just have to “trust me bro”). So a good/smart algorithm can exploit that to generate representations that require fewer bits of storage.
Another aspect of human DNA is that it contains a lot of repeated segments. A quick check of Wikipedia even suggests that 2/3rds of human DNA is composed of these repeating patterns. Repeating patterns like that, and particularly because they make up so much of our DNA, are ripe for compression.
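A quick illustration of that point, with zlib standing in for a real genome compressor and synthetic sequences standing in for real DNA (the motif and sizes are made up):

```python
import random
import zlib

random.seed(0)
# "Random" DNA: every base independent and uniform -- barely compressible.
random_seq = "".join(random.choice("ACGT") for _ in range(1_000_000))
# Repetitive DNA: one short motif over and over, like tandem repeats.
repeat_seq = "ACGTTAGCA" * (1_000_000 // 9)

random_packed = zlib.compress(random_seq.encode(), level=9)
repeat_packed = zlib.compress(repeat_seq.encode(), level=9)
print("random:    ", len(random_packed), "bytes")
print("repetitive:", len(repeat_packed), "bytes")  # far smaller than the random case
```

A general-purpose compressor already exploits repeats; dedicated DNA compressors do considerably better still.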
I’m sure there are other aspects at play here, but those two facts in and of themselves pretty much guarantee that we can compress otherwise uncompressed binary representations of human DNA sequences.
You! Yes, you. We need this type of approach. Can you now go make some useful software that technically won’t be bloatware, or a game that won’t exceed 50 gigs? Pretty please?
Yes, exactly.
And also, we probably don’t need most of our genome anyway. IIRC, 90% or so seems to have no apparent function. It’s just there as an artifact of evolution and never got removed.
So, if you left all that out, the actually useful genome would be much less than 3,117,275,501 base pairs.
The thing is, we have no idea which genes are useful or not. It is often very difficult to say, and any error would probably lead to disease. So we don’t mess with DNA.
It is compressed by histones, except during cell division and when specific sections are unpacked for reading and transcription.
Yes, though I don’t know how much.
Let’s just agree on lossless with solid error correction — we really don’t want to fumble this bag.
Windows XP requires 650 MB of disk space, and a lot of Linux distros use less than that. However, it is rather easy to use more, as storage is cheap, and it is usually better to spend a bit more storage and get away with a lower-end processor instead.
Well yes, but there are >8 billion humans out there. And a single human would be quite useless. So it’s more like 8 Exabytes. Quite a lot.
Of course, there is a heap of research on efficient lossless and lossy compressed representations of pangenomes, so my blind guess would be that we could probably losslessly store all human genomes in a petabyte or so.
I think it might be even less than that, considering that two humans have 99.9% the same genome.
So the differences are maybe 1 MB per person.
That would make for 8000 TB.
And then there’s compression. Obviously, there’s going to be a lot of redundancy if you take the genomes of all humans, since they are more or less just passed on with slight modifications.
So it’s definitely going to be 80 TB or less.
Right, peta is two steps above giga. Then I’ll go with one terabyte. Well, then there are roughly 125 bytes per genome. Hmm, that is a bit little. Maybe the 80 TB estimate is quite good. Then it would be about 10 KB per genome.
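A back-of-the-envelope check of the numbers in this subthread (all figures as rough as above):

```python
PEOPLE = 8_000_000_000
GENOME_BP = 3_117_275_501

# ~0.1% of bases differ between two humans, at 2 bits per base:
delta_bytes = GENOME_BP * 0.001 * 2 / 8
print(f"{delta_bytes / 1e6:.2f} MB per person")  # ~0.78 MB, so "1 MB" is fair

# Everyone's deltas, uncompressed:
print(PEOPLE * 1_000_000 / 1e12, "TB")  # 8000 TB

# Per-genome budget under the two total-size guesses:
print(1e12 / PEOPLE, "bytes/genome")   # 1 TB total  -> 125 bytes each
print(80e12 / PEOPLE, "bytes/genome")  # 80 TB total -> 10,000 bytes each
```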
You could probably build a phylogenetic tree by some heuristic, and then the differences along the edges would be very small.
Or, build an index of all variants, and then represent each genome as a compressed bitvector with a one for each variant it contains.
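A minimal sketch of that bitvector idea. The index size and per-genome variant count here are made up for illustration, and zlib stands in for a proper succinct bitvector structure:

```python
import random
import zlib

NUM_KNOWN_VARIANTS = 10_000_000  # hypothetical size of the shared variant index

def genome_to_bitvector(variant_ids):
    """One bit per known variant: set if this genome carries it."""
    bits = bytearray(NUM_KNOWN_VARIANTS // 8 + 1)
    for v in variant_ids:
        bits[v // 8] |= 1 << (v % 8)
    return bytes(bits)

random.seed(1)
# Pretend this genome carries 100,000 of the known variants (sparse).
my_variants = random.sample(range(NUM_KNOWN_VARIANTS), 100_000)
raw = genome_to_bitvector(my_variants)
packed = zlib.compress(raw, level=9)
print(len(raw), "->", len(packed), "bytes")  # sparse bitvectors compress well
```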
Well, now it seems that this would still be many variants, given that there are so many single bases that may differ. So maybe 80 TB is a bit too little.
Yeah, but nobody’s gonna encode all of humanity’s genomes at once. It’s like adding up the storage of all game save data of all users combined. It doesn’t make sense.
Normally, you look at the storage space for one individual at a time.
There is an entire research field about looking at sets of genomes. It’s called pangenomics. I think they’re at hundreds of thousands of available human genomes right now; ten thousand from a few years ago I know for sure.
Considering multiple genomes is one of the keys to understanding the effects of the different genomic variants on the individual. One can for example look at various chronic diseases and see if there is anything special about the genomes of the sick individuals compared to the healthy ones.
This requires a lot of samples, because just comparing one sick and one healthy individual will bring up a lot of false positives: variants that differ between the individuals but are not related to the disease.
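A toy simulation of why one-vs-one comparison fails (every number here is made up: 10,000 variants, a 30% carry rate, one artificial “causal” variant):

```python
import random

random.seed(42)
NUM_VARIANTS = 10_000
CAUSAL = 1234  # the one variant we plant as actually disease-linked

def genome(sick):
    # Each variant carried with 30% probability, independent of disease status,
    # except the causal one, which only sick individuals carry.
    carried = {v for v in range(NUM_VARIANTS) if random.random() < 0.3}
    if sick:
        carried.add(CAUSAL)
    else:
        carried.discard(CAUSAL)
    return carried

# One sick vs one healthy: thousands of differing variants, almost all noise.
diff_pair = genome(True) ^ genome(False)
print(len(diff_pair), "differing variants in a single pair")

# Many samples: keep only variants in every sick and no healthy individual.
sick = [genome(True) for _ in range(50)]
healthy = [genome(False) for _ in range(50)]
candidates = set.intersection(*sick) - set.union(*healthy)
print(candidates)  # with enough samples, essentially just the causal variant
```

Real studies use statistical association tests rather than exact set intersection, but the sample-size effect is the same.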
Thanks, I hadn’t thought of that.