Homework 2.2: Exponential conjugate prior (55 pts)

Data set download

This problem was motivated by the discussion in section 2.10 of Holmes and Huber’s book.


a) Show that the Gamma distribution is the conjugate prior for the Exponential distribution. How are the parameters of the Gamma updated from the prior to the posterior? (You are welcome to look at Wikipedia’s table of conjugate priors to check your answer, but you should not look up the actual proof that the Gamma distribution is conjugate to the Exponential.)
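
As a starting point only, the two densities involved can be written as below. This sketch assumes the rate parametrization for both distributions; the symbols $\lambda$ (Exponential rate) and $\alpha$, $\beta$ (Gamma shape and rate) are notation chosen here, not fixed by the problem statement.

```latex
\begin{align}
f(x \mid \lambda) &= \lambda\,\mathrm{e}^{-\lambda x}, \quad x \ge 0, \\
g(\lambda \mid \alpha, \beta) &= \frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha - 1}\,\mathrm{e}^{-\beta\lambda}, \quad \lambda > 0.
\end{align}
```

Showing conjugacy then amounts to multiplying the likelihood of the observed data by $g$ and checking that the result, viewed as a function of $\lambda$, is again proportional to a Gamma density.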

b) Download the sequence of the chromosomal DNA of E. coli strain ATCC BAA-196 here in FASTA format. This strain is resistant to multiple drugs and is used in studies of antibiotic resistance. The sequence was published in this paper. Read the sequence in as a single string; you may use the function below if you like.

def read_fasta_single_record(filename):
    """Read a sequence in from a FASTA file containing a single sequence.

    We assume that the first line of the file is the descriptor and all
    subsequent lines are sequence.
    """
    with open(filename, 'r') as f:
        # Read in descriptor
        descriptor = f.readline().rstrip()

        # Read in sequence, stripping the whitespace from each line
        seq = ''
        line = f.readline().rstrip()
        while line != '':
            seq += line
            line = f.readline().rstrip()

    return descriptor, seq

c) Find the indices in the sequence of all Shine-Dalgarno motifs. The Shine-Dalgarno sequence, AGGAGGT, is an important motif in the initiation of protein synthesis. You can use the function below to find recognition sequences.

import re


def recognition_sites_with_re(seq, recog_seq):
    """Find the indices of all recognition sites in a sequence."""
    sites = []
    for site in re.finditer(recog_seq, seq):
        sites.append(site.start())

    return sites

Store the number of bases between successive occurrences of the motif as an array.
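
One way the distances could be computed is sketched below. The function name `inter_motif_distances` and its `between_only` flag are hypothetical, introduced only for illustration. The sketch assumes that "bases between" means the gap from the end of one motif to the start of the next; setting `between_only=False` gives start-to-start distances instead. The toy sequence is made up so the result can be checked by eye.

```python
import re

import numpy as np


def inter_motif_distances(seq, motif, between_only=True):
    """Return distances between successive occurrences of `motif` in `seq`.

    With between_only=True, the motif length is subtracted so that each
    entry counts only the bases lying strictly between two occurrences.
    """
    # Start index of every non-overlapping match, in order of appearance
    starts = np.array([m.start() for m in re.finditer(motif, seq)])

    # Differences of successive start positions
    gaps = np.diff(starts)

    if between_only:
        gaps = gaps - len(motif)

    return gaps


# Toy sequence (not real data): three motifs separated by 2 and 0 bases
toy_seq = "AGGAGGTCCAGGAGGTAGGAGGTTTTT"
print(inter_motif_distances(toy_seq, "AGGAGGT"))
print(inter_motif_distances(toy_seq, "AGGAGGT", between_only=False))
```

Whichever convention you choose, state it explicitly, since the two differ by a constant offset equal to the motif length.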

d) We will model the distances between successive motifs as Exponentially distributed. Explain why this may be a reasonable model.

e) Make an exploratory plot of the ECDF of inter-motif distances. Does it look Exponential?
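
A minimal sketch of such a plot follows, assuming NumPy and Matplotlib are available (any plotting library would do). The helper names `ecdf`, `exponential_cdf`, and `plot_ecdf_with_exponential` are hypothetical, and the data are synthetic draws standing in for the real inter-motif distances. Overlaying an Exponential CDF with rate equal to one over the sample mean is one informal way to judge the shape by eye, not a formal test.

```python
import numpy as np


def ecdf(data):
    """Return sorted values x and ECDF heights y, where y[i] = (i + 1) / n."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def exponential_cdf(x, rate):
    """CDF of the Exponential distribution with the given rate."""
    return 1.0 - np.exp(-rate * np.asarray(x, dtype=float))


def plot_ecdf_with_exponential(data):
    """Plot the ECDF of `data` over an Exponential CDF with rate 1 / mean."""
    # Imported here so the numerical helpers above work without Matplotlib
    import matplotlib.pyplot as plt

    x, y = ecdf(data)
    rate = 1.0 / np.mean(data)
    x_smooth = np.linspace(0, x.max(), 400)

    fig, ax = plt.subplots()
    ax.step(x, y, where="post", label="ECDF")
    ax.plot(x_smooth, exponential_cdf(x_smooth, rate), label="Exponential CDF")
    ax.set_xlabel("inter-motif distance (bases)")
    ax.set_ylabel("ECDF")
    ax.legend()
    return fig, ax


# Synthetic stand-in for the real distances (assumed scale, not measured)
rng = np.random.default_rng(0)
fake_distances = rng.exponential(scale=5000.0, size=500)

x, y = ecdf(fake_distances)
rate_guess = 1.0 / fake_distances.mean()
max_gap = np.max(np.abs(y - exponential_cdf(x, rate_guess)))
print(f"largest vertical gap between ECDF and Exponential CDF: {max_gap:.3f}")
```

Calling `plot_ecdf_with_exponential(distances)` on the real array draws the comparison; systematic departures between the two curves suggest where the Exponential model falls short.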

f) Use an Exponential likelihood and a Gamma prior distribution to compute and plot the posterior distribution of the rate parameter (the inverse of the characteristic distance between motifs) of the Exponential distribution.
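
The sketch below shows one way to evaluate the posterior, assuming the Gamma prior is written in the shape-rate parametrization and using the update listed in Wikipedia's conjugate prior table: shape goes to shape plus the number of observations, and rate goes to rate plus the sum of the observations. The function names, the prior values, and the synthetic data are all hypothetical choices for illustration; choosing and justifying a prior for the real data is part of the problem.

```python
import math

import numpy as np


def gamma_posterior_params(distances, alpha_prior, beta_prior):
    """Update a Gamma(alpha, beta) prior on the rate with Exponential data."""
    distances = np.asarray(distances, dtype=float)
    alpha_post = alpha_prior + len(distances)
    beta_post = beta_prior + distances.sum()
    return alpha_post, beta_post


def gamma_pdf(rate, alpha, beta):
    """Gamma PDF (shape alpha, rate beta), computed via logs for stability."""
    rate = np.asarray(rate, dtype=float)
    log_pdf = (
        alpha * math.log(beta)
        - math.lgamma(alpha)
        + (alpha - 1) * np.log(rate)
        - beta * rate
    )
    return np.exp(log_pdf)


def plot_posterior(alpha_post, beta_post, n_points=400):
    """Plot the posterior PDF within five standard deviations of its mean."""
    # Imported here so the numerical helpers above work without Matplotlib
    import matplotlib.pyplot as plt

    mean = alpha_post / beta_post
    sd = math.sqrt(alpha_post) / beta_post
    # Keep the grid strictly positive so that log(rate) is defined
    rate = np.linspace(max(mean - 5 * sd, mean * 1e-6), mean + 5 * sd, n_points)

    fig, ax = plt.subplots()
    ax.plot(rate, gamma_pdf(rate, alpha_post, beta_post))
    ax.set_xlabel("rate (1/bases)")
    ax.set_ylabel("posterior probability density")
    return fig, ax


# Synthetic stand-in for the real distances and an assumed, illustrative prior
rng = np.random.default_rng(0)
fake_distances = rng.exponential(scale=5000.0, size=500)
alpha_prior, beta_prior = 2.0, 10000.0

alpha_post, beta_post = gamma_posterior_params(
    fake_distances, alpha_prior, beta_prior
)
print(f"posterior shape: {alpha_post:.1f}, posterior rate: {beta_post:.1f}")
print(f"posterior mean of the rate: {alpha_post / beta_post:.3e} per base")
```

Calling `plot_posterior(alpha_post, beta_post)` draws the curve. Because the posterior is available in closed form, no sampling or numerical normalization is needed; `scipy.stats.gamma` with `scale=1/beta` would be an alternative to the hand-written PDF.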