There are many ways to align two protein sequences against each other. First, however, we must remember that an alignment generated by software represents only one of many possible alignments. The software ranks the candidate alignments by a calculated score and outputs the one with the highest score. The alignment score is therefore essential, and its calculation needs careful consideration. The most straightforward measure of how closely related two sequences are is the number of identical amino acids that align against each other. From this number, we can calculate the percentage of identical residues, called the percentage of sequence identity. In general, the higher this percentage, the more closely related the compared sequences are likely to be in terms of their evolutionary origin.
Many amino acids in a protein sequence can be invariant, but depending on the evolutionary distance between the proteins, there will always be a substantial number of residue substitutions caused by mutations. Many of the replaced residues will be chemically equivalent to the "original" ones. This type of conservation is called similarity, and it reflects the need to conserve structure and function. For example, L and V will be equally tolerated within a protein's hydrophobic core, assuming enough space is available to accommodate the slightly larger side chain of leucine. The same applies to the substitution of K with R, since both these residues are usually located on the surface and primarily interact with solvent or with the acidic side chains of E or D. On the other hand, substituting V with R may have a dramatic negative effect and destabilize or denature a protein.
The above suggests that we must consider both identities and similarities between the amino acids in calculating the alignment score. However, a question will arise: if we assign a score of 1 to each pair of identical residues, what score should we assign to a substitution like K with R or V with L compared to V with I or V with A? Our software will optimize the score for each possible alignment, and we will need to tell it how to count the contribution for each of the above and many other similar substitutions.
As an example, let us have a look at a simple alignment of a short segment of two sequences:
GCPFS-SPNVEA
GCPYGCSPEADA
GCPxx-SPxxxA
The identical (invariant) amino acids (matches) in the two sequences are shown in the third row (GCP, SP, and A), while the differences (mismatches) are marked by an x. The cysteine residue in the second sequence has no corresponding partner in the first, so a dash (a gap) marks this position. The percentage of identity for this alignment is simply 6 matches out of 12 aligned positions, 6/12, which is 50%. The score of the alignment can then be calculated by a simple expression:
Score S = number of matches - alignment length = 6 - 12 = -6
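The counting above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name identity_and_score is hypothetical, and dividing by the full alignment length (gap column included) is an assumption chosen to match the worked example rather than a fixed convention.

```python
def identity_and_score(seq_a, seq_b):
    """Return (matches, percent identity, simple score) for two aligned rows."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned rows must have the same length")
    # A column counts as a match only when both residues are identical
    # and the column is not a gap.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    length = len(seq_a)  # alignment length, gap columns included (assumption)
    percent_identity = 100.0 * matches / length
    score = matches - length  # the simple expression from the text
    return matches, percent_identity, score


row1 = "GCPFS-SPNVEA"
row2 = "GCPYGCSPEADA"
print(identity_and_score(row1, row2))  # (6, 50.0, -6)
```

Other definitions of percent identity exist, for example dividing by the length of the shorter sequence, so the denominator should always be stated alongside the number.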
One shortcoming of this expression is that it does not consider the number of conservative substitutions. For example, F in the first sequence is replaced by a chemically similar Y, and E is replaced by a chemically similar D. This shows that a more accurate calculation of an alignment score requires a score for each such replacement.
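To make the distinction concrete, the sketch below marks each column of the example alignment as identical, conservative, mismatched, or gapped. The similarity groups are an illustrative assumption built only from the pairs mentioned in this text (L/V, K/R, F/Y, E/D); they are not a published substitution matrix, and the names SIMILAR_GROUPS, classify_column, and annotate are hypothetical.

```python
# Illustrative groups only: the pairs this text describes as chemically
# similar. A real scoring scheme would use a full substitution matrix.
SIMILAR_GROUPS = [set("LV"), set("KR"), set("FY"), set("ED")]


def classify_column(a, b):
    """Return the residue for a match, '+' for a conservative
    substitution, 'x' for any other mismatch, and '-' for a gap."""
    if a == "-" or b == "-":
        return "-"
    if a == b:
        return a
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return "+"
    return "x"


def annotate(seq_a, seq_b):
    """Build the marker row for two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned rows must have the same length")
    return "".join(classify_column(a, b) for a, b in zip(seq_a, seq_b))


marks = annotate("GCPFS-SPNVEA", "GCPYGCSPEADA")
print(marks)             # GCP+x-SPxx+A
print(marks.count("+"))  # 2 conservative substitutions (F/Y and E/D)
print(marks.count("x"))  # 3 remaining mismatches
```

The sketch deliberately stops at counting. How much each '+' column should contribute to the score relative to an identical pair is exactly the question that a scoring scheme has to answer.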