Monday, 22 November 2010

Mini project part 2


Topic :
Sequence alignment

Research question and objectives:
Is one vaccine enough for all races of the world?

Use sequence alignment to analyze 2 sets of H3N2 HA sequences of 2 different races/ continents and see if there is any difference.

Method:

The higher the percentage of similarity between sequences, the more similar their functions and viral features are likely to be, and hence the more likely the viruses are to be targeted by the same vaccine. To find out whether influenza A H3N2 differs between human races, two groups of sequences from regions whose populations are of different races are used. Hemagglutinin (HA) is one of the relevant antigens in influenza. It is responsible for the entry of the virus into host cells.

Four H3N2 influenza A HA protein sequences from China and Australia were chosen and downloaded from the NCBI Influenza Virus Resource database: two from Australia, ADF 83752 (AUS1) and ADF83741 (AUS2), and two from China, ADK 87306 (CN1) and ADK87308 (CN2). All are full-length sequences with human as the host. To keep the chosen sequences uniform for comparison, they are all from the same year. Each sequence was then compared with the sequence from the same geographic location and with the sequences from the other location using BLAST from NCBI. A total of six comparisons were made. The algorithm and scoring parameters of BLAST were set as follows:
Gap cost: Existence 12, extension 1
Scoring matrix: BLOSUM62
Compositional adjustment: Conditional compositional score matrix adjustment
Word size: 3
Max target sequences: 100
Expect threshold: 10
Max matches in a query range: 0
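
The count of six comparisons follows from taking the four sequences two at a time. A minimal sketch, using the labels from this post (the variable names are only for illustration):

```python
from itertools import combinations

# Labels used in this post for the four HA sequences.
labels = ["AUS1", "AUS2", "CN1", "CN2"]

# Every unordered pair of distinct sequences: C(4, 2) = 6 comparisons.
pairs = list(combinations(labels, 2))

for a, b in pairs:
    print(f"{a} vs {b}")
```

Two of the pairs are within one location (AUS1 vs AUS2, CN1 vs CN2) and the other four are between locations.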


A multiple sequence alignment of the four selected sequences was also performed, using the software provided on the NCBI Influenza Virus Resource database, to show the mismatches between the sequences and their locations.


Result:

BLAST scores and percentage of similarity between sequences




Score / percentage of similarity

Protein sequence   AUS1        AUS2        CN1         CN2
AUS1               ---
AUS2               1242/99%    ---
CN1                1230/99%    1235/99%    ---
CN2                1239/99%    1244/99%    1232/99%    ---



 
The table above shows the BLAST score and percentage of similarity for each pair of sequences. The scores are close to one another: the highest is 1244 and the lowest is 1230. All pairs have a percentage of similarity of 99%. The highest-scoring pair is AUS2 and CN2, and the lowest-scoring pair is AUS1 and CN1. The CN1 and CN2 pair scores 1232, only two points above the lowest score.
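
One simple way to check whether location matters is to compare the average score of same-location pairs with that of cross-location pairs. The sketch below uses the six scores from the table; the `region` helper and the grouping are my own illustration, not part of BLAST.

```python
from statistics import mean

# BLAST scores copied from the table above, keyed by sequence pair.
scores = {
    ("AUS1", "AUS2"): 1242,
    ("AUS1", "CN1"): 1230,
    ("AUS2", "CN1"): 1235,
    ("AUS1", "CN2"): 1239,
    ("AUS2", "CN2"): 1244,
    ("CN1", "CN2"): 1232,
}

def region(label):
    """Strip the trailing digits: 'AUS1' -> 'AUS', 'CN2' -> 'CN'."""
    return label.rstrip("0123456789")

# Group the scores by whether both sequences come from the same location.
same = [s for (a, b), s in scores.items() if region(a) == region(b)]
cross = [s for (a, b), s in scores.items() if region(a) != region(b)]

print("mean same-location score:", mean(same))
print("mean cross-location score:", mean(cross))
```

With these six scores both means come out at 1237, so on this small sample the same-location pairs are no more similar than the cross-location pairs.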

Multiple sequence alignment of the four selected sequences



In the multiple sequence alignment (MSA) (please see the diagram above), positions with mismatches between sequences are highlighted in white, with the mismatched residue shown on the sequence concerned. Out of 566 positions, AUS1 has 4 mismatches, AUS2 has 2, CN1 has 7 and CN2 has 2. The mismatches are all at different positions along the sequence. Sequences with more mismatches have lower alignment scores with the other sequences, so the MSA result agrees well with the BLAST result.
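
Per-sequence mismatch counts like these can be obtained by comparing each sequence with the most common residue in every alignment column. The sketch below assumes that definition of a mismatch (I do not know exactly how the NCBI tool counts them), and the short sequences are made up for illustration; they are not the real HA data.

```python
from collections import Counter

def mismatches_per_sequence(alignment):
    """Count, per sequence, the columns where it differs from the column consensus.

    `alignment` maps a name to an aligned sequence; all sequences must have the
    same length. The consensus is the most common residue in the column (on a
    tie, the residue seen first in the column wins).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")

    counts = {name: 0 for name in alignment}
    for column in zip(*alignment.values()):
        consensus, _ = Counter(column).most_common(1)[0]
        for name, residue in zip(alignment, column):
            if residue != consensus:
                counts[name] += 1
    return counts

# Made-up toy alignment (10 columns), not real sequence data.
toy = {
    "toy1": "MKTIIALSYI",
    "toy2": "MKTIIALSYI",
    "toy3": "MKTIVALSHI",
    "toy4": "MKAIIALSYI",
}
print(mismatches_per_sequence(toy))
```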


Discussion:
The results suggest that the sequences are very similar and that the mismatches between them are not related to geographic location, and hence not to human race. The mismatches may be due to other factors, for example vaccine pressure, weather, lifestyle and so forth. A 2008 study by Rambaut et al. suggested that seasonal differences do have an effect on antigenic drift in influenza A. Further, these viruses mutate rapidly. These factors make any mismatches between sequences due to race less prominent.

From this result, I would conclude that one vaccine can protect all human races of the world against the influenza A H3N2 virus. Nevertheless, this has to be confirmed with a larger sample and requires the incorporation of sequences from more geographic regions. In addition, neuraminidase, the protein responsible for the release of new virions from host cells, needs to be taken into account.


References:
1.        Rambaut, A. et al. The genomic and epidemiological dynamics of human influenza A virus. Nature 453, 615-619 (2008)
2.        Xia, Z. et al. Using a mutual information-based site transition network to map the genetic evolution of influenza A/H3N2 virus. Bioinformatics 25, 2309-2317 (2009)
3.        Alexander, P. et al. An RNA conformational shift in recent H5N1 influenza A viruses. Bioinformatics 23, 272-276
4.        Answers.com. http://www.answers.com/topic/antigenic-drift


Sunday, 17 October 2010

Mini project part 1

Selected topic:
Sequence alignment

Review of Journal article:
Title: Using a mutual information-based site transition network to map the genetic evolution of influenza A/H3N2 virus
Authors: Zhen Xia, Gulei Jin, Jun Zhu and Ruhong Zhou
Bioinformatics, Vol. 25, no.18, 2009, pages 2309-2317

Mapping the antigenic and genetic evolution pathways of influenza A is important for vaccine development and for preventing outbreaks of the virus. A study of 4000 A/H3N2 hemagglutinin (HA) sequences from 1968 to 2008 was carried out to model the evolutionary path of the virus and to identify potential future mutations. The mutual information method was used to detect co-occurring mutations and correlations between the amino acid sites of the H3N2 HA sequences, and to build a site transition network (STN). The effectiveness and accuracy of the STN are comparable with those of phylogenetic trees and antigenic maps, at reduced cost.
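
To make the idea of mutual information between two sites concrete, here is a minimal sketch that computes it for two alignment columns. It uses the textbook definition, I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y))); the paper's exact formulation and any normalisation it applies are not reproduced here, and the example columns are made up.

```python
from collections import Counter
from math import log2

def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two equally long alignment columns."""
    if len(col_x) != len(col_y) or not col_x:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))

    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

# Made-up columns: each string holds one residue per sequence at a given site.
print(mutual_information("AABB", "CCDD"))  # the two sites always change together
print(mutual_information("AABB", "CDCD"))  # the two sites vary independently
```

Sites that mutate together give a high value, while sites that vary independently give a value near zero.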

The study indicates that 63 out of 312 sites of the HA sequences interact strongly with one another. The STN shows these 63 sites and clear trajectories that model the antigenic transitions during the evolution of influenza. The study shows that the probability of a site mutating to an amino acid it has not carried before is low, and that mutations strongly prefer certain sites. Studying historical mutations allows the authors to predict future mutations, which may produce the next antigenic change, and the resulting amino acids of the new strain. With this information, predictions were made for each year from 1999 to 2008. The accuracy of the predictions is approximately 70% on average. Clustering was performed to reveal information on simultaneous multi-site mutations in antigenic drift. A prediction for 2009-2010 was also made; it will take time to test its accuracy. In this study, location, season, new vaccines and other pressures were not taken into account. Future studies should consider these pressures.


Research question and objectives:
Is one vaccine enough for all human races of the world?

Use sequence alignment to analyze 2 sets of H3N2 HA sequences of 2 different human races/ continents and see if there is any difference in them.
                                                                              

Tuesday, 12 October 2010

Lecture 2

Sequence alignment is the comparison of 2 (pairwise) or more (multiple) DNA or protein sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

To produce an optimal alignment, gaps are introduced so that as many identical or similar characters as possible are brought into vertical register.

It is a powerful tool for exploring functional, structural and evolutionary information about DNA or protein sequences.

Global vs local

Global: compares the whole length of the sequences, up to both ends. Gaps are introduced to match as many characters as possible.

Local: concentrates on the region(s) of the sequences with the highest density of matches.

Three principal methods:
Dot matrix analysis
Dynamic programming (DP) algorithm
Word or k-tuple method

Dot matrix:
Uses a graph to display possible alignments. Possible alignment(s) are shown on the graph as diagonal lines running from top left to bottom right.

Advantages: Shows direct and inverted repeats easily.
            Shows the presence of insertions and deletions.
Disadvantage: Does not show the actual alignment.

Dot matrix:

2 approaches: basic and filtering

Basic:
Sequence A is listed along the top of the matrix and sequence B down the left side of the matrix. Starting with the first character of B, compare it with every single character of A, then repeat with the second character, the third character and so forth.

Filtering:
A sliding window can be used.
Windows from the 2 sequences are compared at the same time.
A dot is printed on the graph only if a certain minimum number of matches (the stringency) occurs when comparing these windows (of a given window size).



Direct repeats show as diagonal lines running from top left to bottom right.
Inverted repeats show as diagonal lines running from top right to bottom left, perpendicular to the direct-repeat diagonals.
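
The basic and filtering approaches can be sketched in one small function. This is only an illustration under my own assumptions: the function and parameter names (`dot_matrix`, `window`, `stringency`, `render`) are hypothetical, and with `window=1, stringency=1` it reduces to the basic character-by-character comparison.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Build a dot matrix with sequence A across the top and sequence B down the side.

    grid[i][j] is True when the windows starting at seq_b[i] and seq_a[j]
    contain at least `stringency` matching positions.
    """
    rows = len(seq_b) - window + 1
    cols = len(seq_a) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[j + k] == seq_b[i + k] for k in range(window))
            row.append(matches >= stringency)
        grid.append(row)
    return grid

def render(grid):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)

# Basic approach: a sequence compared with itself gives the main diagonal.
print(render(dot_matrix("ACGT", "ACGT")))

# Filtering: a dot only where at least 2 of 3 positions in the window match.
print(render(dot_matrix("ACGTACGT", "ACGAACGT", window=3, stringency=2)))
```

Raising the window size and stringency removes isolated chance matches, so the diagonals that correspond to real alignments stand out more clearly.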

NW (Needleman-Wunsch) algorithm: (for global sequence alignment)
The scoring system is important for an optimal alignment. 3 scores: match, mismatch and gap.

Score matrix and backtracking:

Need to know:
Match=?          MISMATCH=?         GAP=?

For each box, calculate the score from 3 directions: diagonal, from above and from the left.
Put down the highest score in the box.
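
The score matrix and backtracking steps can be sketched as below. The notes leave the three scores open, so the defaults here (match = 1, mismatch = -1, gap = -2) are my own assumption, as are the function and variable names; a linear gap penalty is used for simplicity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # Score matrix with an extra first row and column for leading gaps.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill each box with the highest score from the 3 directions.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + pair_score(i, j)
            from_above = score[i - 1][j] + gap
            from_left = score[i][j - 1] + gap
            score[i][j] = max(diagonal, from_above, from_left)

    # Backtrack from the bottom-right box to the top-left box.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))

total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

When several directions tie, this sketch prefers the diagonal, then the move from above; other tie-breaking rules give equally optimal alignments.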




     

Monday, 20 September 2010

Lecture 1

ASCII = plain text format for text editing. Don't use function keys when editing text: the converter won't be able to recognize and ignore those keys, and when performing sequence alignment they will introduce errors into the result. Do not use MS Word for text editing when performing sequence alignment and data management. Use Notepad instead.


Sequence formats: GenBank (American flat-file format), EMBL (European flat-file format), FASTA (the simplest format: each record starts with a header line beginning with >, followed by the sequence, which may end with an optional *).
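
Because FASTA is so simple, it can be read with a few lines of code. The sketch below is a minimal reader, assuming plain text input with one or more records; the function name `parse_fasta` and the example record are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    chunks = []

    def store():
        # Join the sequence lines and drop an optional trailing '*'.
        records[header] = "".join(chunks).rstrip("*")

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                store()
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)

    if header is not None:
        store()
    return records

# Made-up example records, not real database entries.
example = """>seq1 example protein
MKTIIALSYI
FCLALG*
>seq2 example DNA
ACGTACGT
"""
print(parse_fasta(example))
```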

Sequence databases: GenBank, EMBL, DDBJ. They are databases from different parts of the world, but they exchange data with each other.

Entrez:

- A collective resource of sequences from several sources, including GenBank, Reference Sequence (RefSeq: for non-redundant datasets), and the Protein Data Bank (PDB).
- Lets a user access and retrieve specific information from a single database, in addition to accessing integrated information from numerous NCBI databases, including GenBank records shared with EMBL and DDBJ.