awk - group adjacent rows by identical columns

Liang always brings me interesting quiz questions. Here is one:

If i have a table like below:

chr1	113438	114495	1	chr1	114142	114143

chr1	113438	114495	2	chr1	114171	114172

chr1	170977	174817	1	chr1	171511	171512

chr1	170977	174817	2	chr1	171514	171515

chr1	170977	174817	2	chr1	173545	173546

and I would like to collapse the rows if the first 3 columns are identical to make the following output:

chr1	113438	114495	114142,114143,114171,114172

chr1	170977	174817	171511,171512,171514,171515,173545,173546

Is there any easy awk approach to do it?

Since I am so rusty at awk, I had to google around to find the solution:

awk -F '\t' '

$1FS$2FS$3==x{

    printf ",%s,%s", $6, $7

    next

}

{

    x=$1FS$2FS$3

    printf "\n%s\t%s,%s", x, $6, $7

}

END {

    printf "\n"

}' test.txt

Assuming the input file is test.txt. Note that the input and output are both tab-separated.

Explanation:

x=$1FS$2FS$3: variable x stores the value of columns 1, 2, and 3 separated by field separator FS.

Print the first part of an output line (columns 1, 2, 3, 6, 7).

For next line, if columns 1, 2, and 3 equal x, print columns 6 and 7.

Group and then count:

https://stackoverflow.com/questions/14916826/awk-unix-group-by

have this text file:

name, age
joe,42
jim,20
bob,15
mike,24
mike,15
mike,54
bob,21

Trying to get this (count):

joe 1
jim 1
bob 2
mike 3

awk -F, 'NR>1{arr[$1]++}END{for (a in arr) print a, arr[a]}' file.txt

References:

http://azaleasays.com/2014/10/06/awk-group-adjacent-rows-by-identical-columns/

Group rows in text file and aggregate corresponding rows to column

keeping last record among group of records with common fields (awk)

awk - group adjacent rows by identical columns的更多相关文章

openpyxl使用sheet.rows或sheet.columns报TypeError: 'generator' object is not subscriptable解决方式
解决方案: 因为新版本的openpyxl使用rows或者columns返回一个生成器所以可以使用List来解决报错问题 >>> sheet.columns[0] Traceback ...
shell awk命令
语法: awk '{command}' filename 多个命令以分号分隔. awk 'BEGIN {command1} {command2} END{command3}' 注意:BEGIN , ...
MySQL如何优化GROUP BY ：松散索引扫描 VS 紧凑索引扫描
执行GROUP BY子句的最一般的方法:先扫描整个表,然后创建一个新的临时表,表中每个组的所有行应为连续的,最后使用该临时表来找到组并应用聚集函数.在某些情况中,MySQL通过访问索引就可以得到结果 ...
Crazy Rows
Problem You are given an N x N matrix with 0 and 1 values. You can swap any two adjacent rows of the ...
2009 Round2 A Crazy Rows （模拟）
Problem You are given an N x N matrix with 0 and 1 values. You can swap any two adjacent rows of the ...
awk 输出前 N 列的最简单方法
最近遇到一种场景,需要输出一个文本信息的前 N 列. 众所周知 cut 可以指定分隔符并指定列的范围,如 cut -d' ' -f-4 就是以空格为分隔符输出前 4 列.但是 cut 的分隔符只能是一 ...
Oracle 10gR2分析函数
Oracle 10gR2分析函数汇总 (Translated By caizhuoyi 2008‐9‐19) 说明: 1. 原文中底色为黄的部分翻译存在商榷之处,请大家踊跃提意见: 2. 原文中淡 ...
[转帖]Introduction to text manipulation on UNIX-based systems
Introduction to text manipulation on UNIX-based systems https://www.ibm.com/developerworks/aix/libra ...
R2—《R in Nutshell》读书笔记(连载)
R in Nutshell 前言例子(nutshell包) 本书中的例子包括在nutshell的R包中,使用数据,需加载nutshell包 install.packages("nutshe ...

随机推荐

Python之函数&参数&参数解构
1.1函数定义 def 函数名(参数列表): 函数体(代码块) [return 返回值] p 函数名就是标识符,命名要求一样语句块必须缩进,约定4个空格 Python的函数没有return语句,隐式 ...
爬取笔下wenxue小说
import urllib.request from bs4 import BeautifulSoup import re def gethtml(url): page=urllib.request. ...
python中使用rabbitmq消息中间件
上周一直在研究zeromq,并且也实现了了zeromq在python和ruby之间的通信,但是如果是一个大型的企业级应用,对消息中间件的要求比较高,比如消息的持久化机制以及系统崩溃恢复等等需求,这个时 ...
CSS position &居中（水平，垂直）
css position是个很重要的知识点: 知乎Header部分: 知乎Header-inner部分: position属性值: fixed:生成绝对定位的元素,相对浏览器窗口进行定位(位置可通过: ...
SpringMVC探究-----从HelloWorld开始
1.SpringMVC简介 Spring MVC框架是有一个MVC框架,通过实现Model-View-Controller模式来很好地将数据.业务与展现进行分离. 它的设计是围绕Dispatch ...
刨根究底字符编码之—UTF-16编码方式
在网上已经转悠好几天了, 这篇文章让我知道了UTF-16的前世今生, 感谢作者https://cloud.tencent.com/developer/article/1384687 1. UTF-16 ...
python 将字节写入文本文件
想在文本模式打开的文件中写入原始的字节数据将字节数据直接写入文件的缓冲区即可 >>> import sys >>> sys.stdout.write(b'Hell ...
flask 数据库操作（增删改查）
数据库操作现在我们创建了模型,生成了数据库和表,下面来学习常用的数据库操作,数据库操作主要是CRUD,即Create(创建).Read(读取/查询).Update(更新)和Delete(删除). S ...
GUI界面修饰
function varargout = GUI20(varargin) % GUI20 MATLAB code for GUI20.fig % GUI20, by itself, creates a ...
Oracle执行计划 explain plan
Rowid的概念:rowid是一个伪列,既然是伪列,那么这个列就不是用户定义,而是系统自己给加上的. 对每个表都有一个rowid的伪列,但是表中并不物理存储ROWID列的值.不过你可以像使用其它列那样 ...

awk - group adjacent rows by identical columns

awk - group adjacent rows by identical columns的更多相关文章

随机推荐

热门专题