Wang Jun

Thesis title: A Working Draft of the Indica Rice Genome

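About the author: Wang Jun, male, born in June 1976. He began his studies under Professor Wu Caihong of Peking University in September 1997 and received his doctorate in July 2002.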

                                      

             

 

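The basic information provided by a genome sequence gives a more complete picture of a species' biological characteristics and of its evolutionary relationships. Genomics and bioinformatics are the sciences of collecting, analyzing, processing, and storing this information.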

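In 1985, American scientists first proposed a project to decipher the human genome in full. It was formally launched in 1990 with US$3 billion in funding from the US National Institutes of Health (NIH) and the Department of Energy (DOE). China formally joined the International Human Genome Project consortium as a participating country on June 26, 1999, and took on 1% of the task. On June 26, 2000, the United States, the United Kingdom, Germany, France, Japan, and China simultaneously announced the completion of the working draft of the human genome, which was published in the British journal Nature in February 2001. In March 2001, China completed the finished map and the analysis of its 1% share ahead of schedule and beyond the assigned target.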

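The principal institution responsible for the 1% project, the Beijing Genomics Institute (also the Center of Genomics and Bioinformatics, Chinese Academy of Sciences), went on to launch the Chinese Super-Hybrid Rice Genome Project in May 2000. The project took indica, the main rice subspecies cultivated in China, as its primary sequencing target and adopted sequencing and analysis techniques different from those of the Human Genome Project. Sequencing was completed on September 18, 2001, and the assembly and analysis of the rice working draft were completed in January 2002. The results were published as a Research Article in the American journal Science and featured as the cover story. Two methodological papers on the rice genome were also accepted by Genome Research, a leading journal in genomics. This series of results was the first of its kind in China.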

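I was responsible for the bioinformatics analysis in this major project. I was supervised by an advisory group whose members included Yang Huanming, Yu Jun, Gane Ka-Shu Wong, Li Songgang, Hao Bailin, Zheng Weimou, Chen Runsheng, and many other well-known scholars in China and abroad. Two key advisors for the thesis, Yu Jun and Gane Ka-Shu Wong, were based far away in Seattle. The ideas in the thesis came out of discussions between the advisory group and me, and I was responsible for carrying out the bioinformatics data analysis. A project of this kind is the collective contribution of a team, and its large volume of repetitive work required everyone's effort; personally, I was a principal participant, contributor, and organizer. Accordingly, I am a co-first author of the Science paper and the first and second author, respectively, of the two methodological papers in Genome Research.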

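Starting from these two projects, the thesis discusses their bioinformatics results in detail, compares them with each other, and points to directions in bioinformatics research where progress is likely.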

(1) Genome Sequencing, Assembly, and Sequence Evaluation

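The Human Genome Project used the hierarchical sequencing (HS) strategy. The basic idea is to break the 3 billion base pairs of the human genome down into pieces of about 150,000 base pairs, the size of a bacterial artificial chromosome (BAC). BAC clones are positioned on the chromosomes using a dense genetic map and BAC clone fingerprinting maps, an optimal tiling path is selected, and each BAC clone on the path is sequenced by random shotgun sequencing. The positioning is verified with FISH and other techniques, and the reads are finally assembled and analyzed with phred/phrap/consed. Because the distribution of repeats within a single BAC clone is not complex, the demands on software algorithms and computing power are modest, but a great deal of biological work is required, such as building genetic maps and fingerprinting maps and performing FISH verification.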

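China's portion of the Human Genome Project (the "Beijing region") runs from 3pter to D3S3610 on 3p. BAC clones were positioned in the Beijing region using STS marker information and BAC fingerprinting information, and then sequenced. The sequences were assembled with the internationally standard phred/phrap/consed software package.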

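The rice genome project used the whole-genome shotgun (WGS) sequencing strategy, in which the entire rice genome is randomly fragmented, sequenced, and assembled. Because the distribution and composition of repeats across a whole genome are extremely complex, the demands on the assembly algorithm and on computer hardware are very high. Once the computational bottleneck is overcome, however, the time-consuming and laborious mapping work can be skipped.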

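To assemble sequencing reads from the whole genome, we developed our own assembler, RePS (Repeat-masked Phrap with Scaffolding). The software first searches the whole-genome shotgun reads for repeats at the level of 20 bases and masks them in the reads before assembly. During assembly, the established assembler Phrap is used to compute the error rate of each assembled base, and the forward and reverse read information from the sequenced clones is used to build scaffolds in which the order and orientation of the contigs are determined. To test the accuracy and feasibility of the software, we present assemblies of 11.9 Mb of real human genome data at 4x and 6x coverage, and we also assembled the whole rice genome.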
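The repeat-masking step described above can be illustrated with a minimal Python sketch. It is not the RePS implementation: the function names, the threshold value, the mask character, and the choice to ignore reverse complements are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k=20):
    """Count how often every k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mask_repeats(read, counts, k=20, threshold=2, mask_char="X"):
    """Mask every base covered by a k-mer seen more than `threshold` times.

    `threshold` and `mask_char` are hypothetical parameters; the source only
    says that a predefined threshold is used.
    """
    masked = list(read)
    for i in range(len(read) - k + 1):
        # Look up the k-mer in the original read, not in the masked copy.
        if counts[read[i:i + k]] > threshold:
            masked[i:i + k] = mask_char * k
    return "".join(masked)


# Toy example with k=4 so the effect is visible on short strings.
reads = ["ACGTACGTTT", "GGACGTAA", "CCACGTCC"]
counts = count_kmers(reads, k=4)
masked_reads = [mask_repeats(r, counts, k=4, threshold=3) for r in reads]
```

In the toy data, `ACGT` occurs four times and is the only 4-mer above the threshold, so only the bases it covers are masked. Masked bases would then be hidden from the downstream assembler, which is what reduces both the computational load and the risk of false joins.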

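Assembling the 3,570,000 actual rice sequencing reads produced 127,550 contigs, which were linked into 103,044 scaffolds, with N50 lengths of 6.69 kb and 11.76 kb, respectively. Coverage of the functional regions exceeded 92%.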
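For readers unfamiliar with the N50 statistic quoted above, the following Python sketch shows one common way to compute it: sort the sequence lengths from longest to shortest and return the length at which the running total first reaches half of all bases. The function name and the toy lengths are illustrative only and do not come from the rice assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example: total is 20, so the running sum 6 + 5 = 11 crosses 10 at length 5.
example = n50([2, 3, 4, 5, 6])
```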

 

(2) Analysis of Basic Genome Composition

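A genome is composed of the four bases A, T, C, and G. Analyzing the properties of this seemingly simple sequence is nonetheless very important; it involves not only variation in GC content but also variation in purine/pyrimidine content, among other things. Studying composition reveals basic information about each kind of genome and also gives deeper insight into evolutionary relationships.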

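Take GC content as an example: the human genome contains many islands whose GC content differs sharply from their surroundings. GC-rich regions are generally regarded as gene-dense and AT-rich regions as gene-poor; that is, high GC implies a higher exon content and low GC a higher intron content.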

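Differences in GC content at the genome level, the gene level, and the exon and intron level reflect the biological mechanisms of different species, such as differences in transcription. In the rice genome (and, by extension, in monocots), we found a very distinctive gradient effect that is not seen in eudicots: starting from the start codon at the 5' end, GC content declines gradually from high to low along the direction of transcription. This gradient affects the codon usage patterns and codon frequencies of rice genes and, at the amino-acid level, the frequencies of the different amino acids. A similar phenomenon is found in maize. This property of grass (Gramineae) genes, which may be a monocot trait that distinguishes monocots from eudicots, complicates the annotation of the rice genome and makes protein-level homology harder to detect in monocot-eudicot comparisons.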
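One simple way to look for such a gradient is to measure GC content in consecutive windows along a coding sequence, starting at the start codon, and fit a straight line to the result. The Python sketch below does this under stated assumptions: the window size, the least-squares slope as a summary, and the toy sequence are all illustrative choices, not the method used in the thesis.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_profile(seq, window=129, step=None):
    """GC fraction in windows taken from the 5' end toward the 3' end.

    `window` and `step` are hypothetical parameters; windows do not overlap
    unless a smaller `step` is given.
    """
    step = step or window
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def slope(values):
    """Least-squares slope of values against their window index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two windows")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Toy sequence that is GC-rich at the 5' end and AT-rich at the 3' end.
toy_cds = "GGCC" * 3 + "GCAT" * 3 + "ATAT" * 3
profile = gc_profile(toy_cds, window=12)
gradient = slope(profile)
```

A negative slope indicates that GC content falls along the direction of transcription, which is the pattern described for rice genes. On real data the profile would be averaged over many genes aligned at the start codon, since single genes are noisy.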

 

(3) Analysis of Repeat Composition

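As genome data accumulate, the role of repeat sequences is attracting growing interest, and they are no longer regarded simply as "junk" DNA. Repeats carry important information about evolution and, as "active elements", change the overall GC content of the genome and can give rise to entirely new genes. They provide powerful tools for the study of chromosome structure and dynamics and for medicine and genetics.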

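Existing definitions of repeats are mostly BDRs (biologically defined repeats). In studying the rice genome, we redefined repeats at the 20-mer level as MDRs (mathematically defined repeats), which cover simple sequence repeats (SSRs), transposons, multi-copy genes (gene duplication), and other cases.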

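Species differ greatly in their properties at the MDR level. For example, MDRs cluster very differently in the rice and human genomes: in the human genome the MDR clusters are spread fairly evenly, whereas in the rice genome they tend to be concentrated in blocks. Where, then, are most of these MDRs located: within genes or between genes? Our study shows that exact 20-base repeats make up 42.2% of the rice genome, that the great majority of transposons lie in the intergenic regions, and that introns contain almost no transposon sequence, which resembles Arabidopsis thaliana. In the human genome, by contrast, the introns of genes contain large amounts of transposon sequence.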

 

(4) Gene Annotation

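Most existing gene-prediction programs are based on hidden Markov models (HMMs), with genscan and fgenesh as the main representatives. Because of shortcomings in the models themselves, each runs into different problems on different species. For the human genome, the intron length distributions in existing models do not handle large introns, so predictions for large genes are very inaccurate. For the rice genome, the change in codon usage caused by the 5'-to-3' gradient makes the prediction efficiency of existing models vary greatly along the direction of transcription. Building new HMM-based gene-prediction models is therefore the concrete next step, and it is equally necessary to develop non-HMM models that break through the limitations of hidden Markov models.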

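For the assembled rice genome, 466 Mb in total, we ran FgeneSH, the best-performing of the existing gene-prediction programs, which predicted about 53,398 to 64,529 genes. After correcting for prediction accuracy, the gene count is 46,022 to 55,615, more than in either the Arabidopsis thaliana genome or the human genome. Across the assembly, the mis-assembly rate estimated from full-length cDNA alignments is 1.1%, and the total coverage of functional regions estimated from existing STS, UniGene, and full-length cDNA data is about 92.0%.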

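After classification with InterPro and the Gene Ontology Consortium, 15.9% and 20.4% of rice genes, respectively, could be classified. The percentages of classified genes in rice and Arabidopsis thaliana are similarly distributed across the InterPro functional categories.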

 

(5) Comparative Analysis Between Genomes

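Comparative analysis between genomes involves comparison at several levels, including the genome, gene, and protein levels. In the rice study it focused mainly on protein-level comparison with Arabidopsis thaliana. Because of the GC-content gradient, rice and Arabidopsis differ greatly at the nucleotide level, so comparisons are routinely made at the protein-sequence level. The comparison shows that only 49.4% of rice genes have a homolog in Arabidopsis, whereas 80.6% of Arabidopsis genes have a homolog in rice. That roughly half of the rice genes have no detected homolog is not only because rice has many new genes but also because the gradient affects even the amino acids, which makes small genes in particular harder to align to Arabidopsis. An important problem we intend to address is therefore to modify the BLOSUM62 matrix used by BLAST to suit the alignment characteristics of specific monocots and eudicots, remove the effect of the gradient, and functionally classify more rice genes.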

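Comparative genomics also involves comparison at other levels, such as building platforms for finding and identifying syntenic regions, comparing the two rice subspecies (japonica and indica), and comparing them with Arabidopsis. As more model genomes are sequenced, such as mouse, pufferfish, pig, and gorilla, a great deal of comparative analysis also remains to be done for the human genome.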

 

(6) Polymorphism Studies

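A single nucleotide polymorphism (SNP) is a single-base difference between DNA on different homologous chromosomes. Studying SNPs is of considerable importance for finding markers in the genome, for understanding differences between individuals, and for disease-related research.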

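In the rice genome, about 16% of the sequence cannot be aligned between the two rice subspecies. In the regions that do align, there is about 1 SNP per 100 bases, roughly ten times the SNP frequency in the human genome. The SNP frequency in repeat regions is twice that in non-repeat regions, and substitutions are twice as frequent as insertions/deletions.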

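In which regions of the genome do these SNPs occur? How do SNPs in those regions change protein function or the regulation of expression and give rise to disease or other specific differences? Studying these questions is extremely important for a deep understanding of each species.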

 

 

Keywords:

Indica rice, genome, working draft, whole-genome assembly, GC-content gradient

 

Abstract

 

Rice is the most important crop for human consumption, providing staple food for more than half the world’s population. The euchromatic portion of the rice genome is estimated to be 430 Mb in size, which is the smallest of the cereal crops. It is 3.7 times larger than that of A. thaliana, and 6.7 times smaller than that of the human. The well-established protocols for high-efficiency genetic transformation, widespread availability of high-density genetic and physical maps, and high degrees of synteny among cereal genomes combine to make rice a unique organism for studying the physiology, developmental biology, genetics, and evolution of plants. We have produced a draft sequence of the rice genome for 93-11, which is a cultivar of Oryza sativa L. ssp. indica, the major rice subspecies grown in China and many other Asia-Pacific regions. It is the paternal cultivar of a super-hybrid rice, Liang-You-Pei-Jiu (LYP9), which has 20 to 30% more yield per hectare than the other rice crops in cultivation. Our discussion will focus largely on the genome landscape of rice, how it differs from that of the other sequenced plant, A. thaliana, and how both plant genomes differ from that of the human.

For the assembly, we developed a sequence assembler, RePS (repeat-masked Phrap with scaffolding), that explicitly identifies exact 20-mer repeats from the shotgun data and removes them prior to the assembly. The established software Phrap is used to compute meaningful error probabilities for each base. Clone-end-pairing information is used to construct scaffolds that order and orient the contigs. We show with real data for human that reasonable assemblies are possible even at coverages of only 4× to 6×.

We start by computing the number of times that any 20-bp sequence (20-mer) appears in the data set. 20-mers that appear more often than a predefined threshold are flagged as Mathematically-Defined Repeats (MDRs). They may be attributed to micro-satellites, transposable elements (TEs), gene families, recently-duplicated chromosomal segments, or pseudogenes. RePS makes no effort to identify Biologically-Defined Repeats (BDRs), since if a 20-mer is repeated in the MDR sense, it will cause problems for the sequence assembly, regardless of the biological context. Instead, it masks the MDRs, so they become invisible to the sequence assembler Phrap. This reduces the computational load by orders of magnitude, and it reduces the likelihood of making a false join (i.e., a mis-assembly). However, it also introduces a new class of gaps, Repeat-Masked Gaps (RMGs), distinct from the other class of gaps, Lander-Waterman Gaps (LWGs), that is normally encountered in sequencing. In an RMG, the gap sequence is actually in the data set, but it was not usable because it was made invisible to Phrap by the masking. In an LWG, the gap sequence is truly missing, as a result of sampling statistics. RMGs can be closed using the clone-end pairing information, provided that the two clone ends are not both fully masked. After repeat-gap closure, and regardless of the nature of the remaining gaps, RePS analyzes the clone-end pairing information to construct scaffolds: non-overlapping contigs linked together in the correct order and orientation. Smaller gaps, as most LWGs tend to be, are subsequently easy to close by PCR. Larger gaps, those above a few Kb, tend to be RMGs from the nested retrotransposons in the intergenic regions between genes. Whether they should be closed or not remains an open question. We benchmarked our mis-assembly detection on two recently completed genomes: A. thaliana, which is of finished quality, and D. melanogaster, from the Celera 13x whole-genome shotgun. All of the mis-assembly rates are worse than 10⁻⁴, the standard for single-base error probabilities (although this is a case of comparing apples and oranges). For A. thaliana, we detected problems in 0.2% of 4804 genes, and for D. melanogaster, we detected problems in 1.1% of 1889 genes. For our 93-11 contigs, we detected problems in 1.1% of 907 genes, which is impressive when one considers that it was assembled from only a 4.2x data set and that this data set had a 42.2% MDR content. After assembly, the indica rice genome was 466 megabases in size.
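The Lander-Waterman gaps mentioned above arise purely from sampling statistics, and their scale can be estimated with the standard idealized model in which reads land uniformly at random, so that the expected fraction of the genome left uncovered at coverage c is e^(-c). The Python sketch below applies that textbook approximation; it ignores minimum overlap requirements and cloning bias, and it is offered as an illustration rather than as the calculation used in the thesis.

```python
import math


def uncovered_fraction(coverage):
    """Expected fraction of bases hit by no read under the idealized
    Lander-Waterman (Poisson) model with uniformly random read placement."""
    return math.exp(-coverage)


def expected_uncovered_bases(genome_size, coverage):
    """Expected number of bases missing from the data set for a given genome size."""
    return genome_size * uncovered_fraction(coverage)


# Values taken from the text: a 4.2x data set and a 466 Mb assembly.
fraction_at_4_2x = uncovered_fraction(4.2)
missing_bases = expected_uncovered_bases(466_000_000, 4.2)
```

Under these assumptions about 1.5% of the genome would be expected to be absent from a 4.2x data set, which is why gaps of this kind remain even after all repeat-masked gaps have been closed.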

We describe a property of Gramineae genes, and perhaps all monocot genes, that is not observed in eudicot genes. Basically, there is a gradient in the compositional properties of rice genes, so that the 5’ and 3’-ends of the gene have rather different properties. Along the direction of transcription, beginning at the 5’-end, there are gradients in GC content, codon usage, and amino-acid usage. Almost every gene is affected. The phenomenon is most noticeable at the DNA sequence level, but there are consequences even at the protein sequence level. Ultimately, the compositional gradients make it difficult to distinguish paralogs and orthologs, let alone to test the hypotheses that rice and other plants, including A. thaliana, are hybrid and allopolyploid in their origin. The magnitude of these compositional gradients is large enough to have introduced serious complications in the annotation of the rice genome, and in the detection of protein homologies across the monocot-eudicot divide.

Gradients in the GC content (and codon usage) of rice genes present a unique problem for gene annotation, which can best be described as “squeezing a balloon”. Although certain ab initio gene-prediction programs can be instructed to employ different codon-usage statistics for different genes, none can employ different codon-usage statistics at different positions along the same gene. Since the 5’ and 3’-ends of a rice gene have different compositional properties, it is impossible to train any existing gene-prediction program to perform equally well at both ends. We depict the performance of all of the ab initio gene-prediction programs that have been trained for rice: FGeneSH, GeneMark, GenScan, GlimmerM, and RiceHMM. Some perform better at the 5’-end, while others perform better at the 3’-end. The only one that does reasonably well at both ends is FGeneSH, which considers protein secondary structure in addition to codon usage statistics. Using FGeneSH to annotate the assembly, there are an estimated 46,022 to 55,615 genes.

Functional coverage in the assembled sequences was 92.0%. About 42.2% of the genome was in exact 20-nucleotide oligomer repeats, and most of the transposons were in the intergenic regions between genes. In summary, 80.6% of A. thaliana genes have a homolog in rice. The average extent of the homology is 80.1% of the protein length, and there is 60.0% identity at the amino acid level. If instead of the full set of annotated genes, we use only those genes in SwissProt, 94.9% of the genes would have a homolog, for 86.7% of the protein length, at 72.9% amino acid identity. There are more homologs in the SwissProt data simply because they are more biased toward highly-conserved genes, but the quality of the hits is not significantly better, which indicates that the quality of the A. thaliana annotations is not a hindrance. The asymmetry in the monocot-eudicot analysis is particularly striking. Although 80.6% of predicted A. thaliana genes had a homolog in rice, only 49.4% of predicted rice genes had a homolog in A. thaliana.

Differences between subspecies or cultivars of rice must be described at two levels, gross and nucleotide. At the gross level, we find kilobase-sized regions of high similarity interspersed with kilobase-sized regions of zero similarity, which is based on a comparison of two overlapping BACs from indica and japonica. Every un-alignable region coincides with a cluster of MDRs, traceable to length differences of 0.7~25-Kb between the two source sequences, and distributed in almost equal proportions between insertions and deletions. To the extent that BDRs could be identified, in roughly half the un-alignable regions, they were of the class of nested retrotransposons that inhabit the intergenic regions between genes. It is another confirmation of the observation that genome size can change rapidly in grasses. Based on 259-Kb of overlapping BAC sequences, we estimate that 16% of the indica to japonica map is un-alignable.

At the nucleotide level, excluding the un-alignable regions, we define polymorphism rates for repeated and unique sequence, partitioned into single-base substitutions (SNPs) and insertion-deletion polymorphisms (indels). By repeated sequence, we mean MDRs. Of the three comparisons, two are based on a comparison of the 93-11 contigs to finished BAC sequence from Nipponbare (japonica) and GLA (indica), totaling 11.8-Mb and 0.9-Mb, respectively. The other is from a comparison of our 93-11 and PA64s contigs. Overall, there is about twice as much variation in the repeated regions as in the unique regions. The substitution rates are about 2 to 3 times larger than the indel rates. Remarkably, there is little difference amongst the 3 rice comparisons. For the 93-11 to PA64s comparison, averaged over the repeated and unique regions, the SNP and indel rates are 1/231 bp⁻¹ and 1/428 bp⁻¹, respectively. Combining the SNP and indel rates, we get 1/150 bp⁻¹. These numbers generally agree with, but are not directly comparable to, the 1/104 bp⁻¹ number in maize. We did not feel that it was necessary to correct for sequencing errors since the polymorphism rates were so much larger than the sequencing error rates.
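The combined rate quoted above follows from adding the two per-base rates, since rates expressed as events per base pair are additive. A short Python check using exact fractions is shown below; the variable names are illustrative.

```python
from fractions import Fraction

# Per-base rates for the 93-11 to PA64s comparison, as given in the text.
snp_rate = Fraction(1, 231)
indel_rate = Fraction(1, 428)

# Rates per base pair add; the reciprocal gives bases per polymorphism.
combined_rate = snp_rate + indel_rate
bases_per_polymorphism = 1 / combined_rate

# Ratio of substitutions to indels implied by the same two rates.
snp_to_indel_ratio = snp_rate / indel_rate
```

The reciprocal of the sum is just over 150 bases per polymorphism, matching the 1/150 figure, and the implied substitution-to-indel ratio of about 1.85 is close to the lower end of the stated 2 to 3 range.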

Single nucleotide polymorphisms (SNPs) are useful in genetic mapping, and are either directly applicable to phenotypes, or indirectly applicable through association studies. Polymorphisms in the unique regions are particularly useful because, unlike those in repeated regions, they are more reliably genotyped. We expect that genome-wide SNP mapping for plants will become more popular as new technologies are available, especially as some are tuned for plant applications. To begin, though, the community needs a genome-wide collection of rice SNPs, and that we now have.

 

 

Keywords:

Indica, Genome, Draft Sequence, Whole Genome Shotgun Assembler, GC Gradient
