python提取GFF3檔案資訊

前言

因為最近使用repet這個程式對基因組進行重複序列的注釋，但是最後輸出的結果是gff3格式的檔案，缺少統計資訊。因此用python寫了個指令碼，對gff3的資訊進行提取並統計。

gff3檔案

想要從gff3檔案中提取的資訊為重複序列的種類，數量，以及bp數。

gff3檔案分為9列，這次用到是第2、4、5、9列，分別代表**，序列起始位置，結束位置，以及屬性。

重複序列的種類可以從第9列屬性中提取，bp數以（結束位置-起始位置+1）進行計算，數量則在過程中統計。

思路不同的**，字尾不同，通過判斷相應的字串是否在第2列字串中來判斷是哪種**。再通過第9列屬性字串，判斷相應的分類**字串是否在其中，如果在其中，則將分類**放入乙個列表，然後以分類**為鍵，bp數為值生成字典。

for迴圈不斷讀入，列表增長，字典合併，最後將列表唯一化，通過counter類生成相應的字典，這個字典的值就是各分類的數量。

**生成字典是通過事先定義乙個函式的方式實現的：

def
return_tes
(lines):
tes=['ryx', 'rxx-lard', 'rix', 'rlx', 'rpx', 'rsx', 'rxx-trim', 'dyx', 'dhx',
'dxx-mite', 'dmx', 'dtx', 'rxx-chim', 'dxx-chim', 'nocat']
(seqid, source, seqtype, start, end, score, strand, phase, attributes) =\
lines.split('\t')
te_len = int(end) - int(start) + 1
for te in tes:
if te in attributes:
return

字典合併的函式：

#字典合併的函式，即相同的鍵的值進行相加
defunion_dict
(*objs):
_keys = set(sum([obj.keys() for obj in objs],))
_total = {}
for _key in _keys:
_total[_key] = sum([obj.get(_key,0) for obj in objs])
return _total

附上完整**：

#!/usr/bin/env python
#-*- coding: utf-8 -*-
"""根據repet輸出的gff3檔案統計重複序列的資訊
"""import os
from collections import counter
#使用字典返回tedenovo的結果
defreturn_tes
(lines):
tes=['ryx', 'rxx-lard', 'rix', 'rlx', 'rpx', 'rsx', 'rxx-trim', 'dyx', 'dhx',
'dxx-mite', 'dmx', 'dtx', 'rxx-chim', 'dxx-chim', 'nocat']
(seqid, source, seqtype, start, end, score, strand, phase, attributes) =\
lines.split('\t')
te_len = int(end) - int(start) + 1
for te in tes:
if te in attributes:
return 
#使用字典返回ssr的結果
defreturn_ssrs
(lines):
(seqid, source, seqtype, start, end, score, strand, phase, attributes) =\
lines.split('\t')
ssr_len = int(end) - int(start) + 1
return ['ssr', ssr_len]
#使用字典返回blastx和tblastx的結果
defreturn_blast
(lines):
blsat_te= #通過字典將repet中的分類**與repbase的分類**對應起來
(seqid, source, seqtype, start, end, score, strand, phase, attributes) =\
lines.split('\t')
blast_len = int(end) - int(start) + 1
for te in blsat_te:
if te in attributes:
return 
#字典合併的函式，即相同的鍵的值進行相加
defunion_dict
(*objs):
_keys = set(sum([obj.keys() for obj in objs],))
_total = {}
for _key in _keys:
_total[_key] = sum([obj.get(_key,0) for obj in objs])
return _total
os.system('cat *.gff3 > genome.gff3') #合併repet生成的所有gff3檔案
te = 'repet_tes'
ssr = 'repet_ssrs'
blast = 'repet_blastx'
tblast = 'repet_tblastx'
genome_size=0
ssr_num, ssr_len=[0]*2
tes=
tes_dict={}
rep=
rep_dict={}
with open('genome.gff3', 'r') as gff: #genome.gff3即前面合併後生成的檔案
for line in gff:
if line.startswith('##g'):
pass
elif line.startswith('##s'):
(t1, t2, t3, length) = line.split(' ')
genome_size += int(length)  #計算基因組總長度
else:
if te in line:
a = return_tes(line)
tes = tes + a.keys()  #種類存入列表
tes_dict = union_dict(tes_dict, a) #將每一行的結果放入字典
if ssr in line:
b = return_ssrs(line)
ssr_num += 1
#計算ssr的個數
ssr_len += b[1]  #計算ssr的總長度
if blast in line or tblast in line:
c = return_blast(line)
rep = rep + c.keys()
rep_dict = union_dict(rep_dict, c)
zero=  #建立乙個包含所有型別**的字典，然後將之前的結果與其相加，避免空值
tes_type = union_dict(dict(counter(tes)), zero) # 列表裡的重複值唯一化，並計算重複的次數（相當於每種型別的個數)
rep_type = union_dict(dict(counter(rep)), zero)
tes_dict = union_dict(tes_dict, zero)
rep_dict = union_dict(rep_dict, zero)
all_type = union_dict(tes_type, rep_type, zero)
all_len = union_dict(tes_dict, rep_dict, zero)

gff檔案用什麼開啟 GFF3格式檔案

gff3是gff注釋檔案的新標準。檔案中每一行為基因組的乙個屬性，分為9列，以tab分開。依次是 1.reference sequence 參照序列指出注釋的物件。如乙個染色體，轉殖或片段。可以有多個參照序列。該id的取名不能以開頭，不能包含空格。2.source 注釋的如果未知，則用點代替...

3 python3 檔案操作

python 檔案方法 1 開啟檔案 open 方法常用形式 open 檔名，開啟方式其中檔名是必須的是檔案的路徑開啟方式有多種這裡引用菜鳥教程的總結 mode 引數有模式描述 t文字模式預設 x寫模式，新建乙個檔案，如果該檔案已存在則會報錯。b二進位制模式。開啟乙個檔案進行更新可讀...

python 3 檔案管理

import os,tempfile,glob,shutil 建立目錄 os.mkdir r home rain test filedir 建立目錄以及所有path中包含的上級目錄 os.makedirs r home rain test test filedir 切換當前工作目錄 os.chdir...

python提取GFF3檔案資訊

gff檔案用什麼開啟 GFF3格式檔案

3 python3 檔案操作

python 3 檔案管理

相關推薦