Processing Large CSV Files with Ruby
Processing large files is a memory-intensive operation that can exhaust a server's RAM and push it into swapping to disk. Let's look at a few ways to process CSV files with Ruby and measure the memory consumption and speed of each.
Prepare CSV data sample
Before we start, let's prepare a CSV file data.csv with 1 million rows (~75 MB) to use in the tests.
```ruby
require 'csv'
require_relative './helpers'

headers = ['id', 'name', 'email', 'city', 'street', 'country']
name = "Pink Panther"
email = "pink.panther@example.com"
city = "Pink City"
street = "Pink Road"
country = "Pink Country"

print_memory_usage do
  print_time_spent do
    CSV.open('data.csv', 'w', write_headers: true, headers: headers) do |csv|
      1_000_000.times do |i|
        csv << [i, name, email, city, street, country]
      end
    end
  end
end
```
Memory used and time spent
The script above requires the helpers.rb script, which defines two helper methods for measuring and printing out the memory used and the time spent.
```ruby
require 'benchmark'

def print_memory_usage
  memory_before = `ps -o rss= -p #{Process.pid}`.to_i
  yield
  memory_after = `ps -o rss= -p #{Process.pid}`.to_i

  puts "Memory: #{((memory_after - memory_before) / 1024.0).round(2)} MB"
end

def print_time_spent
  time = Benchmark.realtime do
    yield
  end

  puts "Time: #{time.round(2)}"
end
```
The results from generating the CSV file are:
```
$ ruby generate_csv.rb
Time: 5.17
Memory: 1.08 MB
```
Output can vary between machines, but the point is that the Ruby process did not spike in memory usage while building the CSV file, because the garbage collector (GC) was reclaiming memory as it went. The process's memory increased by only about 1 MB, while it created a CSV file 75 MB in size.
```
$ ls -lah data.csv
-rw-rw-r-- 1 dalibor dalibor 75M Mar 29 00:34 data.csv
```
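One rough way to see the GC keeping the process flat is to watch GC.stat while streaming rows out. This is a sketch, not part of the original benchmarks: the file name sample.csv and the row count are illustrative, and the exact GC.stat keys can vary between Ruby versions.

```ruby
require 'csv'

# Count how many GC runs happen while we stream 100,000 rows to disk.
# GC.stat(:count) is the total number of GC runs so far in this process.
gc_runs_before = GC.stat(:count)

CSV.open('sample.csv', 'w') do |csv|
  100_000.times { |i| csv << [i, 'row'] }
end

puts "GC runs during generation: #{GC.stat(:count) - gc_runs_before}"
```

A non-zero count here is exactly what we want: the GC repeatedly reclaims the short-lived row data, so the process never holds the whole file in memory.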
Reading a CSV file all at once (CSV.read)
Let's build a CSV object from a file (data.csv) and iterate over it with the following script:
```ruby
require_relative './helpers'
require 'csv'

print_memory_usage do
  print_time_spent do
    csv = CSV.read('data.csv', headers: true)
    sum = 0

    csv.each do |row|
      sum += row['id'].to_i
    end

    puts "Sum: #{sum}"
  end
end
```
The results are:
```
$ ruby parse1.rb
Sum: 499999500000
Time: 19.84
Memory: 920.14 MB
```
The important thing to note here is the big memory spike to 920 MB. That happens because we build the whole CSV object in memory: the CSV library creates lots of String objects, so the memory used ends up much higher than the actual size of the CSV file.
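To get a feel for where that memory goes, we can count String allocations around a parse. This is an illustrative sketch rather than one of the benchmarks above: the small inline dataset is made up, and GC activity between the two counter reads can skew the numbers, so treat the output as a rough indication only.

```ruby
require 'csv'

# Build a small 1,000-row CSV document as a String.
data = "id,name\n" + (1..1000).map { |i| "#{i},row#{i}" }.join("\n")

# Snapshot the live String object count before and after parsing.
strings_before = ObjectSpace.count_objects[:T_STRING]
table = CSV.parse(data, headers: true)
strings_after = ObjectSpace.count_objects[:T_STRING]

puts "Rows parsed: #{table.size}"
puts "String objects allocated: roughly #{strings_after - strings_before}"
```

Every field and every header lookup key is its own String, which is why the parsed table costs far more memory than the raw text it came from.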
Parsing CSV from an in-memory String (CSV.parse)
Let's build a CSV object from content already in memory and iterate over it with the following script:
```ruby
require_relative './helpers'
require 'csv'

print_memory_usage do
  print_time_spent do
    content = File.read('data.csv')
    csv = CSV.parse(content, headers: true)
    sum = 0

    csv.each do |row|
      sum += row['id'].to_i
    end

    puts "Sum: #{sum}"
  end
end
```
The results are:
```
$ ruby parse2.rb
Sum: 499999500000
Time: 21.71
Memory: 1003.69 MB
```
As the results show, the memory increase is roughly that of the previous example plus the size of the file content we read into memory (75 MB).
Parsing CSV line by line from a String in memory (CSV.new)
Now let's see what happens if we load the file content into a String and parse it line by line:
```ruby
require_relative './helpers'
require 'csv'

print_memory_usage do
  print_time_spent do
    content = File.read('data.csv')
    csv = CSV.new(content, headers: true)
    sum = 0

    while row = csv.shift
      sum += row['id'].to_i
    end

    puts "Sum: #{sum}"
  end
end
```
The results are:
```
$ ruby parse3.rb
Sum: 499999500000
Time: 9.73
Memory: 74.64 MB
```
The results show that the memory used is about the file size (75 MB), because the file content is loaded into memory, and the processing time is about twice as fast. This approach is useful when the content doesn't come from a file and we just want to iterate over it line by line.
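As a side note, CSV.new accepts any IO-like object, so the same shift-based loop also works over a StringIO, which is handy when the content arrives from the network rather than a file. A minimal sketch with made-up data:

```ruby
require 'csv'
require 'stringio'

# Wrap an in-memory string in an IO-like object and parse it row by row.
io = StringIO.new("id,value\n1,10\n2,20\n3,30\n")
csv = CSV.new(io, headers: true)

sum = 0
while (row = csv.shift)
  sum += row['value'].to_i
end

puts "Sum: #{sum}"  # prints "Sum: 60"
```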
Parsing a CSV file line by line from an IO object
Can we do any better than the previous script? Yes, if the CSV content lives in a file. Let's use an IO file object directly:
```ruby
require_relative './helpers'
require 'csv'

print_memory_usage do
  print_time_spent do
    File.open('data.csv', 'r') do |file|
      csv = CSV.new(file, headers: true)
      sum = 0

      while row = csv.shift
        sum += row['id'].to_i
      end

      puts "Sum: #{sum}"
    end
  end
end
```
The results are:
```
$ ruby parse4.rb
Sum: 499999500000
Time: 9.88
Memory: 0.58 MB
```
In this last script we see less than 1 MB of memory increase. The time is slightly slower than in the previous script because there is more IO involved. The CSV library has a built-in mechanism for exactly this, CSV.foreach:
```ruby
require_relative './helpers'
require 'csv'

print_memory_usage do
  print_time_spent do
    sum = 0

    CSV.foreach('data.csv', headers: true) do |row|
      sum += row['id'].to_i
    end

    puts "Sum: #{sum}"
  end
end
```
The results are similar:
```
$ ruby parse5.rb
Sum: 499999500000
Time: 9.84
Memory: 0.53 MB
```
Imagine you need to process a large CSV file of 10 GB or more. Choosing the last strategy seems like the obvious decision.
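For files that size it often also helps to process rows in fixed-size batches, for example to feed bulk database inserts. One way to sketch this (the file name batches.csv, the row count, and the batch size of 500 are all illustrative choices): CSV.foreach called without a block returns an Enumerator, so each_slice can group rows without ever loading the whole file.

```ruby
require 'csv'

# Generate a small illustrative file with 1,250 rows.
CSV.open('batches.csv', 'w', write_headers: true, headers: ['id']) do |csv|
  1250.times { |i| csv << [i] }
end

# Stream the file in batches of 500 rows; only one batch is held in memory
# at a time, so this scales to files of any size.
CSV.foreach('batches.csv', headers: true).each_slice(500) do |batch|
  # e.g. bulk-insert `batch` into a database here
  puts "Batch of #{batch.size} rows"
end
```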