Assignment 6
Select a sequence of interest from the database (it should be longer than 300 base pairs), run a BLAST search with it, and answer the following questions:
a. What are the differences between the 6 BLAST programs (blastn, blastp, blastx, tblastn, tblastx, PSI-BLAST)?
blastn: Search a nucleotide database using a nucleotide query
blastp: Search a protein database using a protein query
blastx: Search protein database using a translated nucleotide query
tblastn: Search translated nucleotide database using a protein query
tblastx: Search translated nucleotide database using a translated nucleotide query
PSI-BLAST (Position-Specific Iterated BLAST): Search a protein database using a protein query; the results of the first blastp round are used to build a PSSM (position-specific scoring matrix), which is then used in further search iterations.
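The list above can be summarised as a small lookup table. The following is only an illustrative sketch; the names `BlastProgram`, `PROGRAMS` and `programs_for` are made up for this example and are not part of any BLAST software.

```python
from typing import NamedTuple


class BlastProgram(NamedTuple):
    query: str             # type of sequence supplied by the user
    database: str          # type of database that is searched
    translate_query: bool  # query is translated in six frames first
    translate_db: bool     # database is translated in six frames first


# Summary of the six programs described above (illustrative only).
PROGRAMS = {
    "blastn": BlastProgram("nucleotide", "nucleotide", False, False),
    "blastp": BlastProgram("protein", "protein", False, False),
    "blastx": BlastProgram("nucleotide", "protein", True, False),
    "tblastn": BlastProgram("protein", "nucleotide", False, True),
    "tblastx": BlastProgram("nucleotide", "nucleotide", True, True),
    "PSI-BLAST": BlastProgram("protein", "protein", False, False),
}


def programs_for(query_type: str) -> list[str]:
    """Return the programs that accept a query of the given type."""
    return [name for name, p in PROGRAMS.items() if p.query == query_type]


print(programs_for("nucleotide"))  # programs usable with a nucleotide query
```

With a nucleotide query such as an mRNA sequence, only the programs returned by `programs_for("nucleotide")` apply.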
b. Use your sequence to run 3 of the 6 BLAST programs and discuss “What are the strengths and weaknesses of the BLAST programs you selected?”
Human hexose-6-phosphate dehydrogenase is chosen. Retrieve the nucleotide sequence from GenBank (http://www.ncbi.nlm.nih.gov/).
Search the Nucleotide database for Homo sapiens hexose-6-phosphate dehydrogenase (glucose 1-dehydrogenase) (H6PD). The NCBI accession number is NM_004285.2.
Click on “FASTA” and save the sequence as a text file. Then run the nucleotide sequence through the BLAST programs available at http://blast.ncbi.nlm.nih.gov/Blast.cgi.
The BLAST programs chosen are blastn, blastx, and tblastx.
Click on “blastn” on this page and paste the nucleotide sequence in FASTA format.
Under “Choose Search Set”, select the “others” option to search a nucleotide database that includes every organism.
Under “Program Selection”, select “somewhat similar sequence (blastn)”. Then click on the “BLAST” button, and the result page appears.
blastx is run in a similar way to blastn:
Paste the nucleotide sequence.
Choose “nr” as the database.
Then click on the “BLAST” button.
The result page appears.
tblastx: paste the nucleotide sequence and choose “nr/nt” as the database.
An error occurs: the program cannot finish within the time allowed because the search is too large, so no result is returned.
The strengths and weaknesses of the chosen BLAST programs are:
blastn: it compares a nucleotide query with a nucleotide database. No translation is needed, so it runs quickly. Only identical nucleotides score as matches, which makes it rather specific, but a position with a nucleotide polymorphism is penalised as a mismatch even when both codons encode the same amino acid. As a result, the total score can be lower than the biological similarity warrants.
blastx: it compares a nucleotide query, translated in all six reading frames, with a protein database. The translation takes some extra time. Because the same amino acid can be encoded by different codons, comparing at the amino acid level increases the chance of identifying a protein whose coding sequence differs from the query through synonymous codon variation. Most of the six translations are in the wrong frame, but the frame that produces the significant hit shows which reading frame corresponds to the real reading frame of the gene.
tblastx: it compares a nucleotide query translated in all six frames with a nucleotide database translated in all six frames. It therefore takes a great deal of time and CPU. Aligning every reading frame of both the query and the database maximises the chance of finding a possible hit; hits in incorrect reading frames may appear, but every possibility is covered.
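To make the six-frame translation step concrete, here is a minimal sketch using the standard genetic code. It is not the code BLAST itself uses; the function names `translate` and `six_frames` are chosen only for this illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict[str, str]:
    """Translate the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Two synonymous codons differ as nucleotides but not as amino acids.
print(translate("CAC"), translate("CAT"))
print(six_frames("ATGGCCTAA"))
```

The first `print` shows why a translated search tolerates synonymous changes that blastn counts as mismatches; the second shows the six candidate translations, only one of which is normally the real reading frame.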
c. Show the first hit from each BLAST search with its identity and/or similarity scores.
blastn: NM_004285.3 Homo sapiens hexose-6-phosphate dehydrogenase, E-value 0.0, Maximum identity 100%
blastx: NP_004276.2 hexose-6-phosphate dehydrogenase precursor, E-value 0.0
tblastx: no result is obtained.
d. Summarize the results from the 3 BLAST programs you selected.
blastn and blastx gave the same result, human hexose-6-phosphate dehydrogenase, with an E-value of 0.0. An E-value of 0.0 means that a hit with this score is essentially never expected to occur by chance, so the match is highly reliable; the 100% maximum identity reported by blastn shows that the query is identical to the hit over the aligned region. tblastx could not complete the request because translating a long nucleotide query and locally aligning it against the translated nucleotide database requires too much CPU time. blastn is a useful tool here since it is fast and accurate.
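The relationship between score and E-value can be sketched with the standard formula E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the database length. The numbers below are hypothetical, and the real BLAST calculation uses effective lengths that are somewhat shorter than the raw ones.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Simplified form E = m * n * 2**(-S'); raw lengths are used here
    instead of BLAST's effective lengths (an assumption of this sketch).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: a 1,000-nt query against a 10**9-nt database.
for score in (30, 60, 120):
    print(score, expect_value(score, 1_000, 10**9))
```

Each additional bit of score halves the E-value, so a long, near-perfect alignment drives the value so low that it is reported as 0.0.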
Assignment 5
Please use bioinformatics tools to design the following items:
1. The real-time PCR primer and probe set(s) which can be used to distinguish 2009 Swine-Origin Influenza A (H1N1) from other influenza subtypes.
Please also describe which gene(s)/region(s) you chose, and give the reason why.
To distinguish 2009 swine-origin influenza A (H1N1) from other subtypes, real-time PCR is a promising approach provided that a specific region is targeted. The virus is characterised by haemagglutinin 1 and neuraminidase 1, which are present on the viral envelope. The region(s) in which the 2009 swine-origin virus differs from other subtypes are identified by sequence alignment.
Retrieve the hemagglutinin (HA) mRNA and amino acid sequences of swine influenza from GenBank (http://www.ncbi.nlm.nih.gov). Search the Nucleotide database for H1N1 AND HA. The amino acid sequence is given under the “translation” heading. HA nucleotide sequences are available from many countries. HA nucleotide sequences of H1N1 influenza that is not of swine origin are also retrieved for comparison with the HA of swine-origin influenza.
GenBank accession number of HA:
NWS: U08903.1
Alberta: U47310.1
Ws: U08904.1
Swine influenza from Rio de Janeiro: CY054281.1
Swine influenza from Nebraska: S67220.1
The amino acid sequences of these 5 H1N1 strains are aligned with the ClustalW program (http://www.ebi.ac.uk/clustalw). Paste the sequences in FASTA format and run the program.
An asterisk marks an amino acid that is identical in all strains. Regions marked with asterisks are ignored, while regions in which the 2 swine influenza sequences agree with each other but differ from the rest are considered. Such a region is subsequently used to identify swine-origin influenza. The region chosen from the amino acid alignment is position 201-254; a nucleotide alignment is also done to locate the corresponding nucleotide sequence.
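The selection rule described above (columns where the swine strains agree with each other but differ from every other strain) can be sketched as follows. The aligned sequences in the example are short made-up strings, not the real HA alignment, and `discriminating_columns` is a hypothetical helper name.

```python
def discriminating_columns(
    target: dict[str, str], others: dict[str, str]
) -> list[int]:
    """Return 1-based alignment columns where all target sequences share
    one residue that none of the other sequences has at that column."""
    length = len(next(iter(target.values())))
    columns = []
    for i in range(length):
        target_residues = {seq[i] for seq in target.values()}
        other_residues = {seq[i] for seq in others.values()}
        if (
            len(target_residues) == 1          # swine strains agree
            and "-" not in target_residues     # and it is not a gap
            and not target_residues & other_residues  # and others differ
        ):
            columns.append(i + 1)
    return columns


# Made-up aligned fragments, for illustration only.
swine = {"swine_1": "MKAILVV", "swine_2": "MKAILVV"}
non_swine = {"strain_a": "MKTIIVV", "strain_b": "MKTIIVV", "strain_c": "MKVIIVV"}
print(discriminating_columns(swine, non_swine))
```

Runs of adjacent discriminating columns are the natural candidates for a strain-specific primer or probe region.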
The chosen nucleotide region is 623-761. This region is used for real-time PCR primer and probe design in the Primer3 program (http://frodo.wi.mit.edu/primer3). Paste the sequence of this region into the program. The checkboxes for picking the left primer, right primer and hybridization probe are ticked to design the forward and reverse primers and a single probe.
The primer parameters are set as follows:
Product size ranges: 50-90, 80-120 bp
Primer size: min 19 bp, opt 20 bp, max 23 bp
Primer Tm: min 60 °C, opt 64 °C, max 68 °C
Primer GC%: min 35, max 65
The probe (Hyb Oligo) parameters are as follows:
Hyb Oligo excluded region: 47,4 98,8 (these regions will not be considered for the probe since they are conserved region in all strains)
Hyb Oligo size: min 20 bp, opt 23 bp, max 26 bp
Hyb Oligo Tm: min 68 °C, opt 70 °C, max 70 °C
Hyb Oligo GC%: min 20, opt 60, max 80
Then, click on “Pick primers”
Forward primer: TCAACAAGCTCTCTACCAGAACG, Tm 60.95 °C, %GC 47.83
Reverse primer: TCGTTGCTATTTCTGGCTTGAAC, Tm 62.75 °C, %GC 43.48
Probe: TGCCTATGTTTTTGTGGGGTCATCA, Tm 68.29 °C, %GC 44.00
The product starts at position 651 and ends at position 762 (positions refer to the HA gene). The probe binds at position 678-703. The product size is 89 bp.
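The reported %GC values can be checked directly from the oligo sequences. This is a small sanity-check sketch; it does not recompute Tm, because Primer3 derives Tm from a thermodynamic model whose settings are not given here.

```python
def gc_percent(oligo: str) -> float:
    """Percentage of G and C bases in an oligonucleotide."""
    oligo = oligo.upper()
    return 100.0 * sum(base in "GC" for base in oligo) / len(oligo)


OLIGOS = {
    "forward": "TCAACAAGCTCTCTACCAGAACG",
    "reverse": "TCGTTGCTATTTCTGGCTTGAAC",
    "probe": "TGCCTATGTTTTTGTGGGGTCATCA",
}

for name, seq in OLIGOS.items():
    print(f"{name}: {len(seq)} nt, GC {gc_percent(seq):.2f}%")
```

The same helper can be used to screen candidate oligos against the GC% limits entered in Primer3 before running the design.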
2. The conventional PCR and sequencing primer set which can be used to identify oseltamivir resistance associated NA gene mutations: N1: H274Y
Sequencing of NA gene to identify mutation that leads to oseltamivir resistance in H1N1 virus
The nucleotide sequence of neuraminidase is retrieved from GenBank. The accession number of the swine-origin influenza neuraminidase gene is GU371257.1, while that of the oseltamivir-resistant virus is GU371269.1. These sequences are aligned with the ClustalW program (http://www.ebi.ac.uk/Tools/clustalw2/index.html) in order to locate the mutated region.
The nucleotide sequences are given in FASTA format. Click on “Run”. The alignment will appear.
The neuraminidase 1 gene has a nucleotide transition from cytosine to thymine at position 827 in segment 6, which leads to the amino acid substitution of histidine by tyrosine (H274Y). To identify oseltamivir resistance, the nucleotide at this position should be determined.
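As a quick check of the terminology, a C to T change swaps one pyrimidine for another, which makes it a transition. The sketch below classifies a substitution and confirms that changing the first base of either histidine codon (CAT, CAC) from C to T yields a tyrosine codon (TAT, TAC). Which of the two histidine codons is present at position 274 is not stated here, so both are shown.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-nucleotide substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("reference and alternative bases are identical")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


HISTIDINE_CODONS = {"CAT", "CAC"}
TYROSINE_CODONS = {"TAT", "TAC"}

print(substitution_type("C", "T"))
for codon in sorted(HISTIDINE_CODONS):
    mutated = "T" + codon[1:]  # C -> T at the first codon position
    print(codon, "->", mutated, mutated in TYROSINE_CODONS)
```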
This mutation can be sequenced by using primers that flank the region. The nucleotide sequence of the wild-type NA gene is submitted to the Primer3 program (http://frodo.wi.mit.edu/primer3).
Tick the “Pick left primer” and “Pick right primer” checkboxes. The parameters are set as follows:
- Targets (to indicate the nucleotides that we want to include in the product, in this case the mutation is at position 827): 825,5 (starts to include at position 825 for 5 bases)
- Product size ranges: 150-250, 100-300 bp
- General primer picking conditions:
Primer size: min 20 bp, opt 23, max 25
Primer Tm: min 55 °C, opt 60 °C, max 75 °C
Primer GC%: min 40, opt 50, max 60
Then, click on “Pick primers”. The result page will appear.
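For readers who prefer the command-line `primer3_core` program to the web form, the same settings could be written as a Boulder-IO record. The sketch below only builds the text of such a record. The tag names are the ones I understand Primer3 version 2 to use and should be checked against the Primer3 manual; the sequence identifier and the template placeholder are hypothetical and must be replaced with the real wild-type NA sequence.

```python
def boulder_record(tags: dict[str, object]) -> str:
    """Format TAG=VALUE lines followed by the '=' record terminator."""
    lines = [f"{key}={value}" for key, value in tags.items()]
    return "\n".join(lines) + "\n=\n"


sequencing_settings = {
    "SEQUENCE_ID": "NA_wild_type",              # hypothetical label
    "SEQUENCE_TEMPLATE": "<wild-type NA sequence goes here>",
    "SEQUENCE_TARGET": "825,5",                 # start 825, length 5
    "PRIMER_PICK_LEFT_PRIMER": 1,
    "PRIMER_PICK_INTERNAL_OLIGO": 0,            # no probe for sequencing
    "PRIMER_PICK_RIGHT_PRIMER": 1,
    "PRIMER_PRODUCT_SIZE_RANGE": "150-250 100-300",
    "PRIMER_MIN_SIZE": 20,
    "PRIMER_OPT_SIZE": 23,
    "PRIMER_MAX_SIZE": 25,
    "PRIMER_MIN_TM": 55.0,
    "PRIMER_OPT_TM": 60.0,
    "PRIMER_MAX_TM": 75.0,
    "PRIMER_MIN_GC": 40.0,
    "PRIMER_OPT_GC_PERCENT": 50.0,
    "PRIMER_MAX_GC": 60.0,
}

print(boulder_record(sequencing_settings))
```

Keeping the settings in a text record like this makes a primer design easier to repeat than re-entering values in the web form.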
The sequencing primers obtained are as follows:
The forward primer: CAGGCCTCATACAAGATCTTCAG, Tm 60.27 °C, %GC 47.83
The reverse primer: CCAGATTCTGATTGAAAGACACC, Tm 59.99 °C, %GC 43.48
The alignment of the primers with the sequence shows that the product starts at position 753 and ends at position 936. The product size is 184 bp. Asterisks mark the included target nucleotides; the mutated nucleotide is at position 827 of the gene, or position 74 in the sequenced product.
Conventional PCR of NA 1 gene
The parameters for the conventional PCR primers differ from those for the sequencing primers:
- Product size ranges: 250-300 300-400 400-500 bp
- General primer picking conditions:
Primer size: min 18 bp, opt 20 bp, max 22 bp
Primer Tm: min 52 °C, opt 55 °C, max 58 °C
Primer GC%: min 40, opt 50, max 60
Forward primer: GCTTTACTGTAATGACCGATG, 21-mer, Tm 55.02 °C, %GC 42.86
Reverse primer: TGCCTGTCTTATCATTAGGG, 20-mer, Tm 55.35 °C, %GC 45.00
The product spans positions 718-1005 of the NA gene. Product size: 288 bp
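The product sizes and the location of the mutation inside each product follow from simple coordinate arithmetic, sketched here with the positions reported above. The helper names are made up for this illustration.

```python
def amplicon_length(start: int, end: int) -> int:
    """Length of a product given inclusive start and end positions."""
    return end - start + 1


def offset_in_amplicon(position: int, start: int) -> int:
    """Number of bases between the product's first base and `position`."""
    return position - start


MUTATION = 827  # position of the C to T change in the NA gene

for name, (start, end) in {
    "sequencing": (753, 936),
    "conventional PCR": (718, 1005),
}.items():
    inside = start <= MUTATION <= end
    print(name, amplicon_length(start, end), "bp, covers mutation:", inside,
          "offset:", offset_in_amplicon(MUTATION, start))
```

Both products contain position 827, so either one can be sequenced or analysed to read the H274Y site.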
Assignment 4
Structural bioinformatics
The function of a protein can be predicted from its 3D structure, since a particular domain may serve a particular function. Many protein structures have therefore been solved by X-ray crystallography or NMR. Still, it has not been practicable to obtain the structure of every protein through these approaches. Instead, the 3D structure of a protein can be modelled on a protein with a similar amino acid sequence whose 3D structure is available.
1. BLAST the nucleotide sequence with the “blastx” program that is available at http://blast.ncbi.nlm.nih.gov/Blast.cgi
This BLAST program will translate the nucleotide query to amino acid sequences in all reading frames and search in protein database.
Paste the nucleotide sequence in FASTA format. Then, choose database as “Non-redundant protein sequences (nr)”.
Then click on the “BLAST” button on the left. The program may take a few minutes to finish, after which the result page appears.
2. Identify the unknown nucleotide sequence
Scroll down to the description box. The best matching protein to the query will be on the top of the list based on the lowest E-value.
After translation into an amino acid sequence, the unknown nucleotide sequence corresponds to the BRC1 protein.
3. 3D structure of BRC1
Click on the accession number of the top protein in the list. The information page for this protein will appear, so we now know what kind of protein this nucleotide sequence encodes. We can then look for the structure of this protein by scrolling down the information page and going to “LinkOut” under the “All links for this record” box on the right.
The following page will appear. Click on “MODBASE” to go to the comparative modeling of this protein.
The 3D structure of BRC1 protein is generated based on comparative modeling.
The template is an experimentally determined structure of the BRCA1/BARD1 RING heterodimer, which can be accessed under PDB code 1JM7A in the RCSB database.
The 3D structure of the BRC1 protein helps us understand its function by reference to its template, BRCA1, whose structure contains domains such as the DNA binding domain. The structure can be investigated further by in silico simulation to observe how it interacts with other compounds, etc.
Assignment-3
Phylogenetic tree construction by using BioEdit program
Common ancestors of organisms can be investigated by means of phylogenetic analysis. The mitochondrial cytochrome b nucleotide sequence shown below was found to be from a dinosaur that lived 80 million years ago:
cccttctattattcattctcattctattcgttattcttgtactccacacatccaaacaac
aaagcataatattccacccattgagtccattcctatcctgattcttagtccccgaacctt
ttacactcacatg
A phylogenetic tree of the dinosaur cytochrome b is constructed to find which organism(s) share common ancestors with it.
1. Retrieve cytochrome b nucleotide sequences from the Entrez Nucleotide database (http://www.ncbi.nlm.nih.gov/sites/entrez). For each organism, choose Nucleotide and enter the organism's scientific name followed by “AND cytochrome b”, e.g. Homo sapiens AND cytochrome b. When the result page appears, click on the RefSeq tab.
Click on CYTB, which is an abbreviation of cytochrome b, under the search in Gene section. The gene page will appear.
Scroll down to “Genomic regions, transcripts, and products”. Click on the NCBI reference sequence number, NC_012920.1, and choose “FASTA” from the Nucleotide link. The nucleotide sequence will appear in FASTA format; save it as a text file. The organisms used in the phylogenetic tree construction are listed with their NCBI accession numbers:
Human: Homo sapiens, NC_012920.1
Dog: Canis lupus, NC_002008.1
Rabbit: Oryctolagus cuniculus, NC_001913.1
Rhinoceros: Rhinoceros unicornis, NC_001779.1
Dugong: Dugong dugon, NC_003314.1
Mouse: Mus musculus, NC_010339.1
Whale: Balaenoptera edeni, NC_007938.1
Bovine: Bos taurus, NC_006853.1
Sicklebill: Epimachus fastuosus, GQ334244.1
Chicken: Gallus gallus, NC_001323.1
Magpie: Pica hudsonia, AY030114.1
Frog: Rana plancyi, NC_009264.1
The nucleotide search step can be shortened by using “CYTB” as the keyword instead of the full name “cytochrome b”. The full name returns several unrelated results because it is not specific, i.e. it also matches cytochrome b reductase, etc. Then go to the RefSeq tab, where the target sequence will be listed.
The nucleotide sequences of all these organisms, in FASTA format, are placed in a single text file. The nucleotide sequence of the dinosaur cytochrome b is also added to this text file.
2. Open BioEdit, which can be downloaded from http://www.mbio.ncsu.edu/BioEdit/bioedit.html. Start an alignment by going to the menu bar and choosing “File > New Alignment”
A new alignment page will appear.
3. Import the nucleotide sequences in FASTA format with “File > Import > Sequence alignment file” on the menu bar.
The nucleotide sequences will appear on the right panel and the names of the sequence will appear on the left panel.
4. Align the sequences by highlighting all the sequence names. Then, choose “Accessory Application > ClustalW Multiple Alignment”
A popup window will appear; click on the “RunClustalW” button.
The program will run in DOS window.
The alignment will be shown in a new window.
5. Calculate the distances among these nucleotide sequences by selecting all sequences. Then, choose “Accessory Application > DNAmlk DNA Maximum Likelihood program with molecular clock”
A small window of the program will appear. Click on “Run Application”.
The distances among these organisms will appear.
Save the output as a text file; it contains the tree.
6. View the phylogram by using TreeViewX, which can be downloaded from http://darwin.zoology.gla.ac.uk/~rpage/treeviewx/download.html
The program displays the tree graphically, and it can be set to show a slanted cladogram, rectangular cladogram, or phylogram.
Open the program and go to “File > New…” on the menu bar.
Open the text file of the tree from BioEdit with “File > Open…” and the graphic view of the tree will appear. The type of tree view can be chosen under “Trees” on the menu bar.
In this phylogram there are 2 major groups of animals, corresponding to the 2 clades that begin on the left. Magpie and Sicklebill are closely related since both are passerines, and Chicken falls in the same group as all three are birds (Class Aves). They share an ancestor with Frog, which belongs to a different class, the amphibians. Branch length represents how closely related the organisms are, and Frog is very far from these birds. In the other group, Human and the dinosaur are placed together even though the distance between them is rather large. Based on the phylogram, human is the closest relative of the dinosaur sequence among the organisms compared. Whale is close to bovine and rhinoceros. Dog and mouse are closely related.
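The idea that branch length reflects sequence difference can be illustrated with the simplest distance measure, the p-distance (the proportion of aligned sites that differ). This is a deliberately simpler technique than the maximum likelihood method used by DNAmlk, and the aligned fragments below are made up for the example.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_table(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances for every pair of sequences in an alignment."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


# Made-up aligned fragments, for illustration only.
toy_alignment = {"taxon_a": "ACGTAC", "taxon_b": "ACGTAT", "taxon_c": "AC-TGT"}
for pair, distance in distance_table(toy_alignment).items():
    print(pair, round(distance, 3))
```

Pairs with small distances end up close together on a tree; model-based methods such as DNAmlk additionally correct for multiple substitutions at the same site.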
Assignment 2
1. What is the name of the Haploview format used in this analysis?
Haploview, which can be downloaded from http://www.broadinstitute.org/haploview/haploview-downloads, accepts several input file formats, such as linkage format, completely or partially phased haplotypes, HapMap project data dumps, PHASE format, and PLINK. In this analysis, the input is in HapMap format. It is opened by choosing “HapMap Format” on the left and browsing to the text file containing the haplotypes on chromosome X.
The haplotypes will be loaded as shown:
2. Please show us the marker and individual quality control of the genotype data used in the analysis.
- The marker and individual quality control of the genotype data are shown under the “Check Markers” tab. Quality is assessed through the following metrics:
– ObsHET is the marker’s observed heterozygosity
– PredHET is the marker’s predicted heterozygosity
– HWpval is the Hardy-Weinberg equilibrium p value, which is the probability that its deviation from H-W equilibrium could be explained by chance
– %Geno is the percentage of non-missing genotypes for this marker
– FamTrio is the number of fully genotyped family trios for this marker
– MenErr is the number of observed Mendelian inheritance errors
– MAF is the minor allele frequency for this marker
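Several of these metrics can be computed from simple genotype counts for a biallelic marker. The sketch below is an illustration, not Haploview's own code; it assumes PredHET is the Hardy-Weinberg expectation 2 × MAF × (1 − MAF), and the genotype counts are hypothetical.

```python
def marker_qc(n_aa: int, n_ab: int, n_bb: int, n_missing: int = 0) -> dict[str, float]:
    """Basic quality metrics for one biallelic marker.

    n_aa, n_ab, n_bb: counts of the three genotypes
    n_missing: number of individuals with no genotype call
    """
    genotyped = n_aa + n_ab + n_bb
    if genotyped == 0:
        raise ValueError("no genotyped individuals")
    freq_a = (2 * n_aa + n_ab) / (2 * genotyped)
    maf = min(freq_a, 1 - freq_a)
    return {
        "ObsHET": n_ab / genotyped,
        "PredHET": 2 * maf * (1 - maf),  # assumed Hardy-Weinberg expectation
        "%Geno": 100 * genotyped / (genotyped + n_missing),
        "MAF": maf,
    }


# Hypothetical counts: 50 AA, 40 AB, 10 BB, 5 missing.
print(marker_qc(50, 40, 10, 5))
```

A large gap between ObsHET and PredHET is what drives a small HWpval, which is why the two columns are usually read together.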

3. Please show us the LD map, then explain what you learn from it.
- The LD map is displayed under the “LD plot” tab. An LD score is calculated for each pair of markers. If the score is high, the pair of markers is said to be in strong linkage disequilibrium. The colour of a cell becomes redder as the score increases, and neighbouring high-scoring cells can be grouped to designate a haplotype block. In this LD map, there are 3 regions of grouped red cells, implying 3 haplotype blocks, as shown by the triangles outlining the regions. The labels above the map show the marker numbers and names.
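The pairwise LD scores are commonly expressed as D' and r². The sketch below computes both from haplotype and allele frequencies using the textbook definitions; the frequencies in the example are hypothetical, and the function name `ld_stats` is made up.

```python
def ld_stats(p_ab: float, p_a: float, p_b: float) -> tuple[float, float]:
    """Return (D', r^2) for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A at marker 1
          and allele B at marker 2
    p_a, p_b: frequencies of allele A and allele B
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both markers must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r_squared


print(ld_stats(0.5, 0.5, 0.5))   # alleles always inherited together
print(ld_stats(0.25, 0.5, 0.5))  # alleles independent of each other
```

Pairs with D' close to 1 correspond to the red cells in the plot, and runs of such pairs are what the block-finding step groups into haplotype blocks.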

4. How many haplotype blocks are there in this region of chromosome X, and how are they interpreted?
- There are 3 haplotype blocks in this region of chromosome X based on 95% confidence intervals, illustrated as Block 1, 2, and 3. The tag SNPs can be displayed by clicking on “Display” on the menu bar and choosing “Show tags in blocks”. The tag SNPs are indicated by ticks under the marker numbers. There are 2 tag SNPs in each haplotype block. The crossing lines between adjacent blocks show how often a haplotype in one block occurs together with a haplotype in the next block; the thicker the line, the more common that combination, and the value shown in the crossing region indicates the level of recombination between the 2 blocks.

5. Could you find the tagging SNPs in each haplotype block, and explain what tagging SNPs are?
- The tag SNPs are the SNPs that can represent other SNPs because all of them are in linkage disequilibrium.
On the “Haplotypes” page, the tag SNPs are indicated by ticks beneath the marker numbers. There are 2 tag SNPs in each of the 3 haplotype blocks. Alternatively, they can be shown on the “Tagger” page by choosing the “Configuration” tab. Then select all alleles by clicking on the “Include All” button and choose “pairwise tagging only”. Every allele is paired and tested. Click the “Run Tagger” button and go to the “Results” page.
On the “Alleles captured by Current Selection” panel, 6 alleles are shown, and they include the tag SNPs.
Assignment1_2
Separate the sequences in a FASTA file
1. Open Taverna, and type “split” in the search box. Under “Local Services”, “Split string into string list by regular expression” is shown in red. The FASTA file from http://www.cs.manchester.ac.uk/~katy/taverna/fastaFile.txt contains several nucleotide sequences, and these sequences will be separated individually by using the split service.
2. Right click on “Split string into string list by regular expression” and choose “Add to model”. The whole file is the input ‘string’, and the regular expression is the pattern that separates it into individual sequences.
3. This service requires 2 workflow inputs: the string and regex (regular expression). Right click on “Workflow inputs” in the lower left panel and choose “Create New Input…”
4. Enter “FASTA sequence” in the “Name for the new workflow input” box. This input will be the nucleotide sequence file. After that, add another workflow input named “pattern”, which will later be given the pattern that separates each nucleotide sequence.
5. Then, right click on “Workflow outputs” to add the output for the split FASTA sequences.
6. On the right panel, the workflow inputs and output are shown as graphical boxes. These boxes need to be connected together. Right click on “FASTA sequence” and choose “Processors > Split_string… > string”. This box is then linked to the processor. The FASTA file will be added to this input later on.
7. Connect “pattern” by right clicking on it and choosing “Processors > Split_string… > regex”.
8. Connect the split processor to the output by right clicking on “split” under the Processors category.
9. The workflow for this process is now complete in the right panel.
10. The workflow can now be run by choosing File > Run workflow… in the top left corner.
11. A popup window will appear that allows values to be added for the inputs. Right click on “FASTA_sequence” and choose “New input value”.
12. The content of the FASTA file containing the nucleotide sequences is added in the right panel.
13. For pattern, only “>” is entered since each nucleotide sequence begins with it. The service finds this symbol and separates the sequences at it.
14. Then, click on the “Run workflow” button. The result will appear in the main program window under the “Result” tab.
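Outside Taverna, the same split can be sketched in a few lines with Python's `re.split`, using “>” as the pattern as in step 13. The example text is made up; note that a regular-expression split removes the delimiter, so this sketch puts the “>” back in front of each record.

```python
import re


def split_fasta(text: str, pattern: str = ">") -> list[str]:
    """Split multi-sequence FASTA text into one string per sequence.

    The pattern is used the same way as the workflow's regex input;
    empty pieces (such as the one before the first '>') are dropped.
    """
    pieces = re.split(pattern, text)
    return [">" + piece.strip() for piece in pieces if piece.strip()]


# Made-up FASTA text, for illustration only.
example = ">seq1 first record\nACGTACGT\n>seq2 second record\nGGCCTTAA\n"
for record in split_fasta(example):
    print(record)
    print("---")
```

This simple pattern assumes that “>” never appears inside a description line; a file that breaks this assumption would need a stricter pattern such as one anchored to line starts.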
Assignment 1
Taverna
How to get nucleotide sequence from NCBI database
1. Open the Taverna program. Type “get nucleotide” in the search box, and the results will appear in red letters. Right click on “Get Nucleotide FASTA” under the NCBI folder and choose “Add to model”.
2. To retrieve a nucleotide sequence, the workflow input must be the accession number of the query sequence. Right click on “Workflow inputs” in the lower left panel and choose “Create New Input…” to enter the name of the input. In this example, the accession number is given as “ACC37599.1”.
3. In the same way, an output is created under “Workflow outputs”. As a result, graphical boxes for the workflow input and output appear in the right panel.
4. After the workflow input and output have been created, the workflow is built by connecting the input to the processor. Right click on the input name and choose “Processors” and “Get Nucleotide FASTA”.
5. The input box is now connected to the processor. Next, the output box should be connected to the processor by right clicking on “output” under the “Get Nucleotide FASTA” processor.
6. The workflow is now established and ready to be run. Click on “File” in the top left corner and choose “Run workflow…”
7. A popup window will appear. The GenBank accession number is entered for the input named “AAC3799.1” by right clicking on it and choosing “New input value”. The accession number is typed in the right panel.
8. After the accession number has been given, click on the “Run workflow” button.
9. The main program window will show the result. The “Status” tab reports that the process has completed.
10. The nucleotide sequence is shown under the “Result” tab in FASTA format.
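The same retrieval can also be scripted against NCBI's E-utilities `efetch` service instead of a Taverna workflow. The sketch below builds the request URL and defines, but does not call, a small download helper. The accession used in the example is the H6PD record from Assignment 6, and the helper names are made up for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build an efetch URL that returns the record as FASTA text."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def fetch_fasta(accession: str, db: str = "nuccore") -> str:
    """Download the FASTA record (needs network access; not run here)."""
    with urlopen(fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


print(fasta_url("NM_004285.2"))
```

For a protein accession, `db="protein"` would be passed instead of the default nucleotide database.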