Extracting the run of consecutive digits from a given column

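In bioinformatics a mutation site is usually written like c.110A>G. To convert it back to an absolute coordinate, the c.110 part is usually matched against refGene. Suppose the data look like this: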

OTC     NM_000531       8.7Mb
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185dup67(described
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101delCTGAACGAGCGCAAGGCC
NAT2    NM_000015       c.857G>A

It has to be converted to:

OTC     NM_000531       8.7
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101
NAT2    NM_000015       c.857

-----------------------------------------------------------------------

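From the third column I only want the leading run of digits (keeping an optional "c." prefix and allowing ".", "-" and "_"). How can it be extracted?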
--------------------------------------------------------------------

cat i
OTC     NM_000531       8.7Mb
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185dup67(described
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101delCTGAACGAGCGCAAGGCC
NAT2    NM_000015       c.857G>A
sed -r 's/(.*\s)(c?[0-9._-]*).*/\1\2/' i
OTC     NM_000531       8.7
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101
NAT2    NM_000015       c.857
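
The sed command above works in three parts: the greedy (.*\s) captures everything up to the last whitespace character, (c?[0-9._-]*) captures an optional "c" followed by digits, ".", "_" or "-", and the final .* swallows the rest of the line so it is dropped. Both -r and \s are GNU extensions. Below is a minimal sketch of the same substitution using -E and [[:space:]], assuming a sed that accepts -E (recent GNU and BSD versions do). The function name trim_last_col is made up for illustration.

```shell
# Hypothetical helper: trim the last column to its leading "c." + digit run.
trim_last_col() {
    # \1 = everything up to the last whitespace character (greedy .*)
    # \2 = optional "c" followed by digits, ".", "_" or "-"
    # the trailing .* matches whatever follows and is not put back
    sed -E 's/^(.*[[:space:]])(c?[0-9._-]*).*$/\1\2/' "$@"
}
```

It can be called as trim_last_col i, or used at the end of a pipe. Lines with no whitespace at all do not match and pass through unchanged.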
---------------------------------------------------------------------------------

 

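# Needs GNU awk (three-argument match). The dot after "c" is escaped, and "." is
# added to the bracket class so that 8.7Mb yields 8.7 rather than 8.
awk '{match($3, /^(c\.)?[-_.0-9]+/, a); print $1"\t"$2"\t"a[0]}' i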
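
The three-argument match() used above is a GNU awk extension, so the one-liner fails on other awks such as mawk. A minimal portable sketch, assuming any POSIX awk and the same whitespace-separated three-column input, takes the matched text with substr(), RSTART and RLENGTH instead. The function name extract_col3 is made up for illustration, and "." is included in the bracket class so that 8.7Mb yields 8.7.

```shell
# Hypothetical helper: keep only the leading "c." + digit run of column 3.
extract_col3() {
    awk '{
        # match() returns 0 on no match; otherwise RSTART/RLENGTH locate the hit.
        hit = (match($3, /^(c\.)?[-_.0-9]+/) ? substr($3, RSTART, RLENGTH) : "")
        print $1 "\t" $2 "\t" hit
    }' "$@"
}
```

It can be called as extract_col3 i. When column 3 does not start with a matching run, the third output field is left empty.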
----------------------------------------------------------------------------------
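# A related pipeline (GNU awk again): rebuild the columns, pull the change itself
# (ins..., del..., or a substitution such as G>A) out of column 4, and keep only
# the rows where something matched.
awk -F"\t" '{print $(NF-1)"\t"$NF"\tHet\t"$4}' "$i.for_fr" |
  awk '{match($4, "(ins[a-z]*)|(del[a-z]*)|([A-Z]>)?[A-Z]*$", a); print $1"\t"$2"\t"$3"\t"a[0]}' |
  awk 'NF>3' > "$i.use.for_py"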

 

posted on 2013-12-13 10:45  三川