Think Twice Before Starting a Genome Project

2022-05-05 14:19:05

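I recently read a review article on genome projects, aimed mainly at researchers who are about to launch one. The sequencing technologies it covers are somewhat dated (the current standard is third-generation long-read sequencing + optical mapping + Hi-C), but the rest of the article is still well worth borrowing from.

The article is "A field guide to whole-genome sequencing, assembly and annotation", published in 2014.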

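1. Abstract

As is well known, genome projects used to demand a huge investment of labor, materials and money, yet today even a small lab can afford one. Why? Because sequencing technology has advanced and costs have dropped.

Before launching a genome project, however, the first question to ask is how much a genome sequence would actually help with the problem at hand.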

the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand.

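If a genome really would help a lot with the biological question, then the project should go ahead. The next step is careful planning to make sure the result is a high-quality reference genome.

The article describes the steps of a genome project in detail, in the hope of helping traditional geneticists who want to move into genomics.

It then covers, in turn, the background, the preparation before starting, sequencing, assembly and annotation.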

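2. Background

For traditional geneticists the usual toolkit is molecular markers of various kinds. Advances in sequencing have changed marker-based work in two ways.

First, one can now look at variation across the genome and at differences between individuals, rather than being limited to a set of markers.

Second, even when the study is still about molecular markers, the scale can be much larger: markers can be studied across the whole genome. Many analyses become possible at that scale, such as QTL mapping.

Analyses like these usually do not need a reference genome. But if you want to investigate the regions of interest they identify, or study differential RNA expression, methylation and the like, you do need a high-quality reference genome.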

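3. Considerations before starting a genome project

Before starting, think carefully, and then think carefully again, about whether a good reference genome would really benefit your current biological question.

If it would not help much, do not do it. If it would help a lot, go ahead.

Even if you decide to go ahead, if you cannot afford a high-quality genome assembly you are better off doing GBS, SLAF or transcriptome sequencing instead.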

Bonus

Box: Before you start

Some important points to consider
● Availability of appropriate computational resources
● Collaboration with sequencing facility and bioinformatics groups
● Plan for amount and type of sequencing data needed
● Does funding allow to produce sufficient sequence coverage? If not, alternative approaches should be considered rather than producing a poor, low coverage, assembly
● Familiarization with data handling pipelines and file formats (see below)
● High-quality DNA sample (with individual metadata)
● Plan for analyses and publication

Some useful resources

Internet forums for discussions related to genome sequencing
● http://seqanswers.com/
● http://www.biostars.org/
● http://www.biosupport.se/

Entry points to genome sequencing, assembly and exemplary downstream analyses
● Library preparation and Sequencing: Mardis (2008, 2013)
● Quality filtering/preprocessing: Patel and Jain (2012), Zhou and Rokas (2014), Smeds and Künstner (2011)
● Genome assembly: Nagarajan and Pop (2013), Pop (2009), Flicek and Birney (2009)
● Assembly evaluation: Earl et al. (2011), Bradnam et al. (2013), Bao et al. (2011)
● Genome annotation: Yandell and Ence (2012)
● Mapping: Li and Durbin (2009), Trapnell and Salzberg (2009), Bao et al. (2011)
● Data handling: Li et al. (2009), Quinlan and Hall (2010)
● Variant calling: Nielsen et al. (2011), DePristo et al. (2011), Van der Auwera et al. (2013)
● Haplotype-based approaches: Browning and Browning (2011), Tewhey et al. (2011), Lawson et al. (2012)
● Population genomic summary statistics: Nielsen et al. (2012b), Danecek et al. (2011)

Web resources
● Galaxy (http://galaxyproject.org/)
● Amazon cloud (http://aws.amazon.com/ec2/)
● Windows Azure (http://www.windowsazure.com/)
● Magellan: Cloud Computing for Science (http://www.alcf.anl.gov/magellan)
● Web Apollo (http://genomearchitect.org/)
● NCBI BioProject (http://www.ncbi.nlm.nih.gov/bioproject/)
● Genomes OnLine Database (http://genomesonline.org/cgi-bin/GOLD/index.cgi)
● ENSEMBL genome database (http://www.ensembl.org/index.html)
● UCSC Genome Browser (http://genomebrowser.wustl.edu/)
● fastQCtoolkit for data preprocessing (http://www.bioinformatics.babraham.ac.uk/projects/fastqc)

Genome size databases
● Plants: http://data.kew.org/cvalues/
● Animals: http://www.genomesize.com/

Common file formats
● FASTA Nucleotide sequence (file extension .fas or .fa)
● FASTQ Nucleotide sequence including quality scores
● SAM Sequence alignment
● BAM Binary version of SAM
● GFF3 Annotation
● GTF Annotation
● BED Annotation
● VCF Variant calling
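The box asks whether funding allows sufficient sequence coverage. As a rough planning aid, average coverage can be estimated as total sequenced bases divided by genome size (the genome size databases listed above are one place to look that up). The sketch below is only a back-of-the-envelope calculation; the function names and the example numbers (a 1 Gb genome, 150 bp reads) are my own assumptions, not values from the article.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose combined length reaches the target depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 400 million reads of 150 bp on a 1 Gb genome.
depth = expected_coverage(400_000_000, 150, 1_000_000_000)
print(f"expected depth: {depth:.1f}x")
print("reads for 60x:", reads_needed(60, 150, 1_000_000_000))
```

This ignores reads lost to quality filtering, duplicates and contamination, so the real usable depth will be somewhat lower than the estimate.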

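People have different expectations of a genome assembly. If you care about multi-copy regions such as MHC or olfactory receptor (OR) genes, you need very high contig quality. For example, if a gene spans 100 kb, the contig N99 must exceed 100 kb. Otherwise the whole genome is useless for the downstream work.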
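To make the contig criterion concrete, here is a minimal sketch of how an Nx statistic (N50, N90, N99 and so on) can be computed from a list of contig lengths: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches x% of the assembly. The function name `nx` and the contig lengths are made up for illustration.

```python
def nx(lengths, x):
    """Return the Nx length: contigs at least this long hold >= x% of all bases."""
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical assembly of five contigs, 1 Mb in total.
contigs = [500_000, 300_000, 120_000, 50_000, 30_000]
for x in (50, 90, 99):
    print(f"N{x} = {nx(contigs, x):,} bp")

# The rule of thumb from the text: for a 100 kb gene, N99 should exceed 100 kb.
print("N99 > 100 kb:", nx(contigs, 99) > 100_000)
```

In this toy assembly N50 looks fine, but N99 is far below 100 kb, which is exactly the situation the paragraph above warns about.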

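Another case is when a well-assembled local region matters more than the genome as a whole. Here the approach is to isolate those regions with BACs and then sequence them. This is an important option: if that is all you need, there is no reason to do whole-genome sequencing.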

What does it mean to ‘sequence a genome’?

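Genome assembly mostly refers to building a physical map, as opposed to a genetic map.

We should also recognize that sequencing one individual yields just one representative of the species; it should not be taken as the sequence of the species.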

there is not one true sequence for a species because of individual genomic variation.

Even within a single individual there are heterozygous sites, indels, CNVs, SVs and so on.

Even different sampled tissues carry some variation.

The assembled genome sequence of an individual will also be only one representation of the total variation present in a species.

In addition, the assembly cannot resolve haplotypes.

however, that in diploid and polyploid organisms, the genome assembly already reflects a consensus sequence of several chromosome sets and fails to capture haplotypic variation.

The principle of genome sequencing and assembly

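Assembly is essentially the process of breaking DNA into fragments, sequencing them, and putting them back together. With second-generation sequencing, contigs are first assembled from overlapping reads, scaffolds are then built with graph-based methods, and, if a genetic map is available, the scaffolds can be anchored to chromosomes.

Nowadays, contigs are assembled from third-generation reads, optical mapping is used to extend them and build scaffolds, and Hi-C is used to anchor scaffolds to chromosomes.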
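The overlap idea behind contig assembly can be illustrated with a toy greedy merger: repeatedly find the pair of sequences with the longest suffix-to-prefix overlap and join them. This is only a teaching sketch under strong assumptions (error-free reads, a single strand, no repeats); it is not how production assemblers work, and the function names and reads are invented for the example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Three made-up reads sampled from the sequence ATGCGTACCTGA.
print(greedy_assemble(["ACCTGA", "ATGCGTA", "CGTACCT"]))
```

Reads that share no sufficiently long overlap are left as separate contigs, which is the toy equivalent of a gap in a real assembly.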

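4. Species selection

Because heterozygosity degrades assembly quality, it is best to choose an individual from a line inbred over many generations, or one produced by uniparental reproduction or parthenogenesis.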

Attached to the genome individual should be metadata that might be important for future referencing, such as the identity, age and sex of the individual, time and exact place of sampling, etc.

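5. Tissue

The article deals mainly with vertebrates, so it focuses on how to sample animal tissue. It recommends avoiding muscle where possible, because muscle contains large numbers of mitochondria. Filtering out mitochondrial sequences before assembly also helps. Gut tissue and surface skin tissue are not recommended.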

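6. Sequencing and assembly

Because sequencing technology has moved on, this part of the article no longer offers much to borrow, so I skip it.

Genome assembly now also relies on third-generation sequencing, so I skip that part too.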

7. Genome annotation

The most important part of genome annotation is gene annotation.

Repeats should be masked before gene prediction is run.
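As a small illustration of what masking means at the file level: by common convention, soft-masked FASTA marks repeats in lowercase, while hard-masked FASTA replaces them with N. The sketch below assumes that convention and a FASTA string already in memory; the function names and the example sequence are hypothetical.

```python
def masked_fraction(fasta_text: str) -> float:
    """Fraction of bases written in lowercase (soft-masked) in a FASTA string."""
    masked = total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


def hard_mask(sequence: str) -> str:
    """Turn soft-masked (lowercase) bases into N, leaving other bases unchanged."""
    return "".join("N" if base.islower() else base for base in sequence)


# Made-up soft-masked record: 6 of 16 bases are lowercase.
example = ">chr1\nACGTacgtAC\nGGttNN\n"
print(f"soft-masked fraction: {masked_fraction(example):.3f}")
print(hard_mask("ACGTacgtAC"))
```

Checking the masked fraction before running gene prediction is a cheap sanity check that the masking step actually ran on the assembly you are about to annotate.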

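Gene prediction mainly combines three approaches: ab initio, homology-based and transcriptome-based. Third-generation sequencing can now read a transcript from end to end, so gene prediction, which used to be difficult, has become much more accurate.

Software prediction alone does not get everything right in one pass; some genes may need manual curation with dedicated tools.

Recommended software: WebApollo and others.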

8. Publishing the genome

Finally, upload the genome and the raw data to NCBI, ENSEMBL and similar databases.
