Reference: https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python

The most elegant approach:

file.readlines() takes an optional size-hint argument: instead of reading all the way to EOF, it stops once the total size of the lines read so far reaches approximately that many bytes, so each call returns a manageable batch of complete lines.

    bigfile = open('bigfilename', 'r')
    tmp_lines = bigfile.readlines(BUF_SIZE)
    while tmp_lines:
        process([line for line in tmp_lines])
        tmp_lines = bigfile.readlines(BUF_SIZE)

Or:

To write a lazy function, just use yield:

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    f = open('really_big_file.dat')
    for piece in read_in_chunks(f):
        process_data(piece)
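
A common variant of the same idea uses the two-argument form of iter(), which keeps calling a function until it returns a sentinel value. The sketch below is my own paraphrase rather than a verbatim quote from the thread; it assumes the file is opened in binary mode (so the sentinel is the empty bytes object b'') and that process_data is the same placeholder as above.

    from functools import partial

    with open('really_big_file.dat', 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read(1024) and stops
        # as soon as it returns the sentinel b'' (end of file).
        for piece in iter(partial(f.read, 1024), b''):
            process_data(piece)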

 

Read a file in chunks in Python

This article is just to demonstrate how to read a file in chunks rather than all at once.

This is useful in a number of cases, such as chunked uploading or encryption, or where the file you want to work with is larger than your machine's memory capacity.
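
As a quick illustration of one such use case (my own sketch, not part of the original article), the snippet below computes a SHA-256 digest of an arbitrarily large file while holding only one 128 KB chunk in memory at a time; the file name is just a placeholder.

    import hashlib

    def sha256_of_file(path, chunk_size=0x20000):
        """Hash a file of any size while keeping only one chunk in memory."""
        digest = hashlib.sha256()
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                digest.update(chunk)
        return digest.hexdigest()

    print(sha256_of_file('some-file.gif'))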

    # chunked file reading
    from __future__ import division
    import os

    def get_chunks(file_size):
        chunk_start = 0
        chunk_size = 0x20000  # 131072 bytes, default max ssl buffer size
        while chunk_start + chunk_size < file_size:
            yield (chunk_start, chunk_size)
            chunk_start += chunk_size

        final_chunk_size = file_size - chunk_start
        yield (chunk_start, final_chunk_size)

    def read_file_chunked(file_path):
        # binary mode, so chunk sizes are counted in bytes
        with open(file_path, 'rb') as file_:
            file_size = os.path.getsize(file_path)

            print('File size: {}'.format(file_size))

            progress = 0

            for chunk_start, chunk_size in get_chunks(file_size):

                file_chunk = file_.read(chunk_size)

                # do something with the chunk, encrypt it, write to another file...

                progress += len(file_chunk)
                print('{0} of {1} bytes read ({2}%)'.format(
                    progress, file_size, int(progress / file_size * 100))
                )

    if __name__ == '__main__':
        read_file_chunked('some-file.gif')

Also available as a Gist (https://gist.github.com/richardasaurus/21d4b970a202d2fffa9c)

The above will output:

    File size: 698837
    131072 of 698837 bytes read (18%)
    262144 of 698837 bytes read (37%)
    393216 of 698837 bytes read (56%)
    524288 of 698837 bytes read (75%)
    655360 of 698837 bytes read (93%)
    698837 of 698837 bytes read (100%)

Hopefully this is handy to someone. It is of course not the only way; you could also use the file object's `seek` method to jump directly to the chunk you want.
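
For example, here is a minimal sketch of that seek-based approach (my own illustration, not from the article); the offset arithmetic simply targets the third 128 KB chunk of the same placeholder file.

    def read_chunk(path, offset, length):
        """Read `length` bytes starting at byte `offset` of the file."""
        with open(path, 'rb') as f:
            f.seek(offset)         # jump straight to the start of the chunk
            return f.read(length)  # may return fewer bytes near the end of the file

    # e.g. fetch the third 128 KB chunk
    chunk = read_chunk('some-file.gif', 2 * 0x20000, 0x20000)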

Processing large files using Python

In the last year or so, and with my increased focus on ribo-seq data, I have come to fully appreciate what the term big data means. Ribo-seq studies in their raw form can easily reach into hundreds of GBs, which means that processing them in both a timely and efficient manner requires some thought. In this blog post, and hopefully those following, I want to detail some of the methods I have come up with (read: pieced together from multiple Stack Exchange posts) that help me take on data of this magnitude. Specifically, I will be detailing methods for Python and R, though some of the methods are transferable to other languages.

My first big-data tip for Python is learning how to break your files into smaller units (or chunks) in a manner that lets you make use of multiple processors. Let's start with the simplest way to read a file in Python.


  1. with open("input.txt") as f:
  2. data = f.readlines()
  3. for line in data:
  4. process(line)
  5.  

The mistake made above, with regard to big data, is that it reads all the data into RAM before attempting to process it line by line. This is likely the simplest way to make the memory overflow and an error be raised. Let's fix this by reading the data line by line, so that only a single line is stored in RAM at any given time.


    with open("input.txt") as f:
        for line in f:
            process(line)

This is a big improvement, namely that it doesn't crash when fed a big file (and it's also shorter!). Next we should attempt to speed this up a bit by making use of all those otherwise idle cores.


    import multiprocessing as mp

    # init objects
    cores = mp.cpu_count()  # one worker per core
    pool = mp.Pool(cores)
    jobs = []

    # create jobs
    with open("input.txt") as f:
        for line in f:
            jobs.append(pool.apply_async(process, (line,)))  # note: args must be a tuple

    # wait for all jobs to finish
    for job in jobs:
        job.get()

    # clean up
    pool.close()

Provided the order in which you process the lines doesn't matter, the above generates a set (pool) of workers, ideally one for each core, before creating a bunch of tasks (jobs), one for each line, for the workers to do. I tend to use the Pool object provided by the multiprocessing module due to its ease of use; however, you can spawn and control individual workers using mp.Process if you want finer control, as in the sketch below. For mere number crunching, the Pool object is very good.
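
As a rough illustration of that finer-grained route (my own sketch, not the author's code), each mp.Process below pulls lines off a shared queue until it receives a None sentinel; process is still the placeholder work function used throughout this post.

    import multiprocessing as mp

    def worker(queue):
        """Consume lines from the queue until a None sentinel arrives."""
        while True:
            line = queue.get()
            if line is None:
                break
            process(line)  # placeholder work function

    if __name__ == '__main__':
        cores = mp.cpu_count()
        queue = mp.Queue(maxsize=1000)  # bounded, so the reader cannot outrun the workers
        workers = [mp.Process(target=worker, args=(queue,)) for _ in range(cores)]
        for w in workers:
            w.start()

        with open("input.txt") as f:
            for line in f:
                queue.put(line)
        for _ in workers:
            queue.put(None)  # one sentinel per worker

        for w in workers:
            w.join()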

While the above is now making use of all those cores, it sadly runs into memory problems once again. We specifically use the apply_async function so that the pool isn't blocked while each line is processed. However, in doing so, all the data is read into memory once again; this time stored as individual lines associated with each job, waiting in line to be processed. As such, the memory will again overflow. Ideally the method would only read a line into memory when it is its turn to be processed.


    import multiprocessing as mp

    def process_wrapper(lineID):
        with open("input.txt") as f:
            for i, line in enumerate(f):
                if i != lineID:
                    continue
                else:
                    process(line)
                    break

    # init objects
    cores = mp.cpu_count()
    pool = mp.Pool(cores)
    jobs = []

    # create jobs
    with open("input.txt") as f:
        for ID, line in enumerate(f):
            jobs.append(pool.apply_async(process_wrapper, (ID,)))  # args must be a tuple

    # wait for all jobs to finish
    for job in jobs:
        job.get()

    # clean up
    pool.close()

Above we've now changed the function fed to the pool of workers to include opening the file, locating the specified line, reading it into memory, and then processing it. The only input now stored for each job spawned is the line number, thereby preventing the memory overflow. Sadly, the overhead involved in having to locate the line by reading iteratively through the file for each job is untenable, getting progressively more time consuming as you get further into the file. To avoid this we can use the seek function of file objects, which skips you to a particular location within a file. Combining it with the tell function, which returns the current location within a file, gives:


    import multiprocessing as mp

    def process_wrapper(lineByte):
        with open("input.txt") as f:
            f.seek(lineByte)
            line = f.readline()
            process(line)

    # init objects
    cores = mp.cpu_count()
    pool = mp.Pool(cores)
    jobs = []

    # create jobs
    with open("input.txt") as f:
        nextLineByte = f.tell()
        # use readline() rather than "for line in f": Python 3 text files
        # disable tell() while they are being iterated over
        line = f.readline()
        while line:
            jobs.append(pool.apply_async(process_wrapper, (nextLineByte,)))
            nextLineByte = f.tell()
            line = f.readline()

    # wait for all jobs to finish
    for job in jobs:
        job.get()

    # clean up
    pool.close()

Using seek we can move directly to the correct part of the file, whereupon we read a line into memory and process it. We have to be careful to correctly handle the first and last lines, but otherwise this does exactly what we set out to do, namely using all the cores to process a given file while not overflowing the memory.

I'll finish this post with a slight upgrade to the above, as there is a reasonable amount of overhead associated with opening and closing the file for each individual line. If we process multiple lines of the file at a time as a chunk, we can reduce these operations. The biggest technicality when doing this is noting that when you jump to a location in a file, you are likely not located at the start of a line. For a simple file, as in this example, this just means you need to call readline, which reads up to the next newline character. More complex file types likely require additional code to locate a suitable location to start/end a chunk.


    import multiprocessing as mp, os

    def process_wrapper(chunkStart, chunkSize):
        # binary mode, so the byte offsets produced by chunkify() are valid seek targets;
        # the resulting lines are bytes objects (decode them if process() expects str)
        with open("input.txt", 'rb') as f:
            f.seek(chunkStart)
            lines = f.read(chunkSize).splitlines()
            for line in lines:
                process(line)

    def chunkify(fname, size=1024*1024):
        fileEnd = os.path.getsize(fname)
        with open(fname, 'rb') as f:
            chunkEnd = f.tell()
            while True:
                chunkStart = chunkEnd
                f.seek(size, 1)  # jump roughly one chunk forward...
                f.readline()     # ...then read on to the next newline so chunks end on line boundaries
                chunkEnd = f.tell()
                yield chunkStart, chunkEnd - chunkStart
                if chunkEnd >= fileEnd:
                    break

    # init objects
    cores = mp.cpu_count()
    pool = mp.Pool(cores)
    jobs = []

    # create jobs
    for chunkStart, chunkSize in chunkify("input.txt"):
        jobs.append(pool.apply_async(process_wrapper, (chunkStart, chunkSize)))

    # wait for all jobs to finish
    for job in jobs:
        job.get()

    # clean up
    pool.close()

Anyway, I hope that some of the above was either new or perhaps even helpful to you. If you know of a better way to do things (in Python), then I'd be very interested to hear about it. In another post coming in the near future, I will expand on this code, turning it into a parent class from which to create multiple children to use with various file types.
