Conrad Burden has a long connection with ANU, having completed his PhD in Theoretical Physics in 1983 before working as a particle physicist on campus for many years. In 2003, he joined the Mathematical Sciences Institute as a bioinformatician.
“The nice thing about maths is it’s very portable. I knew very little statistics before I moved here, but once you’ve got a good strong mathematics background you find that the same ideas in mathematics crop up everywhere. Sometimes the language is different but once you recognise the equations you are familiar with, albeit in a different guise, that’s the key.”
Burden’s work sits at the nexus of statistics and biology, a field known as bioinformatics.
“Eight years ago I’d never heard of bioinformatics. It’s a pretty new field,” explains Burden, who this year secured a three-year Discovery Grant with collaborator Professor Susan Wilson to investigate new statistical methods for uncovering DNA and protein matches.
“Biologists have well-established ways of measuring similarities between DNA sequences but these typically involve looking for very long alignments of hundreds or thousands of letters. This is not always an appropriate way to measure similarity, and even with fast computers, for large databases and long sequences, the calculation can get out of hand.”
In a novel approach, Burden and Wilson will look for similarities in ‘words’ formed from the four nucleotide bases that are the building blocks of the DNA double helix.
“As a measure of similarity between two sequences we are counting the number of matches of short words of some specified length, say eight letters, between the sequences. A computer can do that extremely quickly.”
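The word-match count Burden describes can be sketched in a few lines of Python. This is a minimal illustration rather than the researchers' own code: the function name `count_word_matches` and the choice to count every matching pair of positions (including overlapping words) are assumptions made for the example.

```python
from collections import Counter


def count_word_matches(seq_a, seq_b, k=8):
    """Count pairs (i, j) where the k-letter word starting at position i
    in seq_a is identical to the k-letter word starting at j in seq_b.

    Assumption for this sketch: overlapping words are all counted.
    """
    # Tally every k-letter word in each sequence.
    words_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    words_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    # A word seen x times in seq_a and y times in seq_b contributes x * y matches.
    return sum(count * words_b[word] for word, count in words_a.items())


print(count_word_matches("ACGTACGT", "ACGT", k=4))
```

Because each sequence is scanned only once to build its tally, the work grows roughly in proportion to the sequence lengths, which is consistent with the point that a computer can do this very quickly.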
The main focus of Burden’s work will be to determine the probability distribution of the number of word matches: “For instance if I had a big Scrabble bag containing only four types of letters - A, T, G and C in the same proportion as they occur in the genome - and I randomly pull out two long sequences, how many word matches between the sequences would I expect there to be simply by chance? If I see ‘too many’ matches in a pair of real-world DNA sequences, there’s a good chance they are biologically related.”
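The Scrabble-bag thought experiment can be made concrete with a short simulation. The sketch below is illustrative only and rests on stated assumptions: letters are drawn independently, the four letters are equally likely (real genomes differ), and the helper names `expected_matches` and `simulate_null` are hypothetical. Under independent draws, two given k-letter words match with probability (sum of squared letter probabilities) raised to the power k, so the expected count is that probability multiplied by the number of word pairs.

```python
import random
from collections import Counter


def count_word_matches(seq_a, seq_b, k):
    """Count matching pairs of k-letter words between two sequences."""
    words_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    words_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    return sum(count * words_b[word] for word, count in words_a.items())


def expected_matches(n, m, k, probs):
    """Expected match count for random sequences of lengths n and m,
    assuming letters are drawn independently with probabilities `probs`."""
    p_letter = sum(p * p for p in probs.values())  # chance two letters agree
    return (n - k + 1) * (m - k + 1) * p_letter ** k


def simulate_null(n, m, k, probs, trials=1000, seed=0):
    """Draw random sequence pairs and record the match count of each pair."""
    rng = random.Random(seed)
    letters = list(probs)
    weights = [probs[c] for c in letters]
    counts = []
    for _ in range(trials):
        a = "".join(rng.choices(letters, weights=weights, k=n))
        b = "".join(rng.choices(letters, weights=weights, k=m))
        counts.append(count_word_matches(a, b, k))
    return counts


# Assumed for illustration: equal letter frequencies.
uniform = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
null_counts = simulate_null(50, 50, 3, uniform, trials=2000)
print(expected_matches(50, 50, 3, uniform), sum(null_counts) / len(null_counts))
```

An observed count from a real pair of sequences could then be compared with `null_counts`: the fraction of simulated counts at least as large gives a rough empirical measure of how surprising ‘too many’ matches would be by chance. Simulation only approximates the distribution that the research aims to work out mathematically.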
Currently, Burden is testing his theory on family trees of proteins already known, by other means, to be related.
“It’s a typical procedure applied mathematicians follow when constructing a new mathematical model: start with a case where you already know the answer, and see if the model gives that answer – in this case, see if you can construct the correct family tree. If it does, you have the confidence to then apply it to cases where you don’t know the answer.”
It’s still early days for the research, but Burden hopes this new approach to DNA sequence matching will be less time- and resource-intensive: “If we can find places in the genome in which certain ‘words’ occur more than we would expect by chance, biologists can focus their attention on that area, rather than wasting time working through the whole genome.”