Cite this article: | Xu Lanxin, Yang Haile, Liu Zhigang, Du Hao. The impacts of reference database selection, indicator threshold determination and target data preparation on the sequence data analysis of eDNA monitoring-Taking fish as the target in the middle Yangtze River. J. Lake Sci., 2024, 36(6): 1843-1852. DOI:10.18307/2024.0631 |
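Effects of reference database selection, indicator threshold selection, and target data preparation in the analysis and annotation of eDNA monitoring sequencing data: fish in the middle Yangtze River as the monitoring target |
Xu Lanxin1,2, Yang Haile1, Liu Zhigang1, Du Hao1
|
1. Key Laboratory of Freshwater Biodiversity Conservation, Ministry of Agriculture and Rural Affairs, Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Wuhan 430223; 2. Wuxi Fisheries College, Nanjing Agricultural University, Wuxi 214000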
|
|
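Abstract: |
In eDNA monitoring based on meta-barcoding, the analysis and annotation of eDNA sequencing data determine whether monitoring results can be judged and assessed accurately, and reference database selection, indicator threshold selection, and target data preparation are the three most critical technical steps in that analysis and annotation. To clarify how the handling of these three steps affects the outcome, this study took two sets of COI gene sequencing data from eDNA monitoring in the middle Yangtze River and ran three groups of experiments on fish detection to test: 1) the effect of different reference databases and species annotation algorithms on annotation results; 2) the effect of different OTU clustering sequence similarity and species annotation classification confidence (sequence consistency and sequence coverage) on annotation results; 3) the effect of the sequence richness of each species in the target data on annotation results. The results showed: 1) Under the Blast algorithm, the species annotated with three versions of the nt library were largely consistent (72%-78%), the species annotated with two local sequence reference libraries were also largely consistent (91%-96%), and the species annotated with all five sequence reference libraries were 52%-68% consistent. With the nt library, the species annotated by the RDP Classifier algorithm covered more than 95% of those annotated by Blast and included 151%-443% more species than Blast, most of the additional species being misannotations. With the local reference databases, the species annotated by the RDP Classifier algorithm covered 66%-85% of those annotated by Blast, and several results were annotated only to family or genus level. 2) An OTU clustering sequence similarity threshold of 0.999 yielded 154%-209% more OTUs than a threshold of 0.99, and 240%-490% more OTUs annotated as fish. With the annotation classification confidence threshold (Blast algorithm, sequence consistency and sequence coverage) between 0.8 and 0.99, the annotated species composition (over 94%) and OTU composition (over 83%) were largely consistent, whereas at a threshold of 0.7 the species and OTU compositions differed considerably from those obtained at 0.8 and above. 3) With an OTU clustering sequence similarity threshold of 0.999 and an annotation classification confidence threshold of 0.9, multi-sequence data yielded the largest numbers of fish species and OTUs and the highest species annotation accuracy (81.49%): 7% more species, 215% more OTUs, and 5% higher accuracy than single-sequence data. In the analysis and annotation of a given eDNA sequencing data set, annotation accuracy can be improved by building and improving local reference databases, optimizing the values of OTU clustering sequence similarity and species annotation classification confidence (sequence consistency and sequence coverage), and increasing the richness of the target data. However, owing to the limitations of species annotation algorithms, annotation errors and omissions are likely to persist for a long time, and species annotation accuracy is typically below 85% (for eDNA monitoring based on the COI gene). |
Keywords: Environmental DNA; fish; meta-barcoding; reference database; OTU clustering sequence similarity; species annotation classification confidence; middle Yangtze River |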
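The abstract reports how "consistent" the species lists from different reference libraries are, and how much of the Blast result the RDP Classifier result covers, without stating the exact formulas. The sketch below shows one plausible reading, assuming consistency means shared species over the union of two lists (the Jaccard index) and coverage means the share of one list found in the other. The function names and the placeholder species labels are hypothetical and are not taken from the paper.

```python
def shared_fraction(a, b):
    """Share of species in the union of two annotation results that
    appear in both (Jaccard index). Two empty results count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def coverage_of(reference, other):
    """Share of the species in `reference` that also appear in `other`."""
    reference, other = set(reference), set(other)
    return len(reference & other) / len(reference) if reference else 1.0


def relative_increase(baseline, other):
    """How many more species `other` contains than `baseline`,
    as a fraction of the baseline size (1.0 means 100% more)."""
    return (len(set(other)) - len(set(baseline))) / len(set(baseline))


# Placeholder species labels, for illustration only.
blast = {"sp1", "sp2", "sp3", "sp4"}
rdp = {"sp1", "sp2", "sp3", "sp5", "sp6", "sp7", "sp8", "sp9"}

print(shared_fraction(blast, rdp))    # 3 shared out of 9 in the union
print(coverage_of(blast, rdp))        # 3 of the 4 Blast species are covered
print(relative_increase(blast, rdp))  # 8 species versus 4
```

Keeping coverage and relative increase as separate quantities mirrors the way the abstract reports them: a result can cover nearly all of another list while still adding many extra, possibly erroneous, species.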
DOI:10.18307/2024.0631 |
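Classification number: |
Funding: Jointly supported by the Central Public-interest Scientific Institution Basal Research Fund (YFI202201) and the agricultural fiscal special project "Regular Monitoring after the Yangtze River Fishing Ban" (CJJC-2023-01). |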
|
The impacts of reference database selection, indicator threshold determination and target data preparation on the sequence data analysis of eDNA monitoring-Taking fish as the target in the middle Yangtze River |
Xu Lanxin1,2, Yang Haile1, Liu Zhigang1, Du Hao1
|
1.Key Laboratory of Freshwater Biodiversity Conservation, Ministry of Agriculture and Rural Affairs, Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Wuhan 430223, P. R. China;2.Wuxi Fisheries College, Nanjing Agricultural University, Wuxi 214000, P. R. China
|
Abstract: |
In meta-barcoding-based eDNA monitoring, the analysis and annotation of eDNA sequence data are the foundation for obtaining accurate and reliable monitoring results. The selection of reference databases, the determination of analysis and annotation indicator thresholds, and the preparation of target data are the most critical technical steps in eDNA sequence data analysis and annotation. To clarify the impacts of these three technical steps and provide scientific support for the standardization of eDNA monitoring technology, this study used two sets of COI gene sequence data from eDNA monitoring in the middle reaches of the Yangtze River and designed three sets of experiments to test 1) the impacts of different reference databases and species annotation algorithms on the annotation results, 2) the impacts of different OTU clustering sequence similarity and species annotation classification confidence (sequence consistency and sequence coverage) on the annotation results, and 3) the impacts of the sequence richness of each species in the target data on the annotation results. The results showed that: 1) under the Blast algorithm, the species annotated with three versions of the nt library from NCBI were generally consistent (72%-78%); those annotated with two local sequence reference libraries were also generally consistent (91%-96%); and the species annotated with all five sequence reference libraries were 52%-68% consistent. With the nt libraries, the species annotated by the RDP Classifier algorithm covered over 95% of the species annotated by the Blast algorithm and included 151%-443% more species, but most of the additional species were misannotations. With the local sequence reference libraries, the species annotated by the RDP Classifier algorithm covered 66%-85% of the species annotated by the Blast algorithm, and several results were annotated only to family or genus level.
2) Setting the OTU clustering sequence similarity threshold to 0.999 yielded 154%-209% more OTUs than setting it to 0.99, and 240%-490% more OTUs annotated as fish. The classification confidence threshold (Blast algorithm) had little effect on species composition between 0.8 and 0.99, with over 94% consistency, but the results differed considerably when it was set to 0.7. 3) When the OTU clustering sequence similarity threshold was 0.999 and the classification confidence threshold was 0.9, multiple-sequence data annotation yielded the largest numbers of fish species and OTUs and the highest species annotation accuracy (81.49%), with 7% more fish species, 215% more OTUs, and 5% higher accuracy than single-sequence data annotation. In eDNA sequence data analysis and annotation, accuracy can be improved by establishing and improving local reference databases, optimizing the OTU clustering sequence similarity and species annotation classification confidence thresholds (sequence consistency and sequence coverage), and increasing the sequence richness of the target data. However, owing to the limitations of species annotation algorithms, species annotation errors and omissions are likely to persist in eDNA sequence data analysis and annotation, and the species annotation accuracy of eDNA monitoring (based on the COI gene) is typically lower than 85%. |
Key words: Environmental DNA; fish; meta-barcoding; reference database; OTU clustering sequence similarity; species annotation classification confidence; middle Yangtze River |
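To illustrate how a classification confidence threshold on sequence consistency and sequence coverage can change which OTUs receive a species annotation, here is a minimal sketch. It assumes a single threshold applied to both measures and a best-hit rule per OTU; the abstract does not give these details, so both are assumptions. The `Hit` record, the `annotate` function, and the example hits are hypothetical and do not come from the study's data or pipeline.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    otu: str            # OTU identifier (hypothetical)
    species: str        # species label of the matched reference sequence
    consistency: float  # sequence consistency (identity), fraction in [0, 1]
    coverage: float     # sequence coverage, fraction in [0, 1]


def annotate(hits, threshold):
    """For each OTU, keep the best hit whose consistency and coverage both
    reach `threshold`. OTUs without such a hit stay unannotated."""
    best = {}
    for h in hits:
        if h.consistency < threshold or h.coverage < threshold:
            continue  # hit fails the confidence threshold
        current = best.get(h.otu)
        # Rank hits by consistency first, then by coverage.
        if current is None or (h.consistency, h.coverage) > (
            current.consistency,
            current.coverage,
        ):
            best[h.otu] = h
    return {otu: h.species for otu, h in best.items()}


# Placeholder hits, for illustration only.
hits = [
    Hit("OTU1", "Species A", 0.995, 1.00),
    Hit("OTU1", "Species B", 0.92, 0.98),
    Hit("OTU2", "Species C", 0.85, 0.90),
    Hit("OTU3", "Species D", 0.75, 0.95),
]

for t in (0.7, 0.8, 0.9, 0.99):
    print(t, annotate(hits, t))
```

In this toy input, lowering the threshold to 0.7 admits a weak match for OTU3 that is rejected at 0.8 and above, which is the kind of shift in composition the abstract describes between 0.7 and the higher thresholds.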
|
|
|
|