面试题-python 如何读取一个大于 10G 的txt文件?
前言
用python 读取一个大于10G 的文件,自己电脑只有8G内存,一运行就报内存溢出:MemoryError
python 如何用open函数读取大文件呢?
读取大文件
首先可以自己先制作一个大于10G的txt文件
a = '''
2021-02-02 21:33:31,678 [django.request:93] [base:get_response] [WARNING]- Not Found: /http:/123.125.114.144/
2021-02-02 21:33:31,679 [django.server:124] [basehttp:log_message] [WARNING]- "HEAD http://123.125.114.144/ HTTP/1.1" 404 1678
2021-02-02 22:14:04,121 [django.server:124] [basehttp:log_message] [INFO]- code 400, message Bad request version ('HTTP')
2021-02-02 22:14:04,122 [django.server:124] [basehttp:log_message] [WARNING]- "GET ../../mnt/custom/ProductDefinition HTTP" 400 -
2021-02-02 22:16:21,052 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login HTTP/1.1" 301 0
2021-02-02 22:16:21,123 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login/ HTTP/1.1" 200 3876
2021-02-02 22:16:21,192 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/main_bg.png HTTP/1.1" 200 2801
2021-02-02 22:16:21,196 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/style.css HTTP/1.1" 200 1638
2021-02-02 22:16:21,229 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/bg.jpg HTTP/1.1" 200 135990
2021-02-02 22:16:21,307 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/fonts/icomoon.ttf?u4m6fy HTTP/1.1" 200 6900
2021-02-02 22:16:23,525 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/login/ HTTP/1.1" 302 0
2021-02-02 22:16:23,618 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/index/ HTTP/1.1" 200 18447
2021-02-02 22:16:23,709 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/commons.js HTTP/1.1" 200 13209
2021-02-02 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/admin.css HTTP/1.1" 200 19660
2021-02-02 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/common.css HTTP/1.1" 200 1004
2021-02-02 22:16:23,714 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/app.js HTTP/1.1" 200 20844
2021-02-02 22:16:26,509 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/report_list/1/ HTTP/1.1" 200 14649
2021-02-02 22:16:51,496 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/test_list/1/ HTTP/1.1" 200 24874
2021-02-02 22:16:51,721 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/add_case/ HTTP/1.1" 200 0
2021-02-02 22:16:59,707 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/test_list/1/ HTTP/1.1" 200 24874
2021-02-03 22:16:59,909 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/add_case/ HTTP/1.1" 200 0
2021-02-03 22:17:01,306 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/edit_case/1/ HTTP/1.1" 200 36504
2021-02-03 22:17:06,265 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/add_project/ HTTP/1.1" 200 17737
2021-02-03 22:17:07,825 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/project_list/1/ HTTP/1.1" 200 29789
2021-02-03 22:17:13,116 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/add_config/ HTTP/1.1" 200 24816
2021-02-03 22:17:19,671 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/config_list/1/ HTTP/1.1" 200 19532
'''
while True:
with open("xxx.log", "a", encoding="utf-8") as fp:
fp.write(a)
循环写入到 xxx.log 文件,运行 3-5 分钟,pycharm 打开查看文件大小大于 10G
于是我用open函数 直接读取
f = open("xxx.log", 'r')
print(f.read())
f.close()
抛出内存溢出异常:MemoryError
Traceback (most recent call last):
File "D:/2021kecheng06/demo/txt.py", line 35, in <module>
print(f.read())
MemoryError
运行的时候可以看下自己电脑的内存已经占了100%, cpu高达91% ,不挂掉才怪了!
这种错误的原因在于,read()方法执行操作是一次性的都读入内存中,显然文件大于内存就会报错。
read() 的几种方法
1.read() 方法可以带参数 n, n 是每次读取的大小长度,也就是可以每次读一部分,这样就不会导致内存溢出
f = open("xxx.log", 'r')
print(f.read(2048))
f.close()
运行结果
2019-10-24 21:33:31,678 [django.request:93] [base:get_response] [WARNING]- Not Found: /http:/123.125.114.144/
2019-10-24 21:33:31,679 [django.server:124] [basehttp:log_message] [WARNING]- "HEAD http://123.125.114.144/ HTTP/1.1" 404 1678
2019-10-24 22:14:04,121 [django.server:124] [basehttp:log_message] [INFO]- code 400, message Bad request version ('HTTP')
2019-10-24 22:14:04,122 [django.server:124] [basehttp:log_message] [WARNING]- "GET ../../mnt/custom/ProductDefinition HTTP" 400 -
2019-10-24 22:16:21,052 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login HTTP/1.1" 301 0
2019-10-24 22:16:21,123 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login/ HTTP/1.1" 200 3876
2019-10-24 22:16:21,192 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/main_bg.png HTTP/1.1" 200 2801
2019-10-24 22:16:21,196 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/style.css HTTP/1.1" 200 1638
2019-10-24 22:16:21,229 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/bg.jpg HTTP/1.1" 200 135990
2019-10-24 22:16:21,307 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/fonts/icomoon.ttf?u4m6fy HTTP/1.1" 200 6900
2019-10-24 22:16:23,525 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/login/ HTTP/1.1" 302 0
2019-10-24 22:16:23,618 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/index/ HTTP/1.1" 200 18447
2019-10-24 22:16:23,709 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/commons.js HTTP/1.1" 200 13209
2019-10-24 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/admin.css HTTP/1.1" 200 19660
2019-10-24 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/common.css HTTP/1.1" 200 1004
2019-10-24 22:16:23,714 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/app.js HTTP/1.1" 200 20844
2019-10-24 22:16:26,509 [django.server:124] [basehttp:log_message] [I
这样就只读取了2048个字符,全部读取的话,循环读就行
f = open("xxx.log", 'r')
while True:
block = f.read(2048)
print(block)
if not block:
break
f.close()
2.readline():每次读取一行,这个方法也不会报错
f = open("xxx.log", 'r')
while True:
line = f.readline()
print(line, end="")
if not line:
break
f.close()
- readlines():读取全部的行,生成一个list,通过list来对文件进行处理,显然这种方式依然会造成:MemoyError
真正 Pythonic 的方法
真正 Pythonci 的方法,使用 with 结构打开文件,fp 是一个可迭代对象,可以用 for 遍历读取每行的文件内容
with open("xxx.log", 'r') as fp:
for line in fp:
print(line, end="")
yield 生成器读取大文件
前面一篇讲yield 生成器的时候提到读取大文件,函数返回一个可迭代对象,用next()方法读取文件内容
def read_file(fpath):
BLOCK_SIZE = 1024
with open(fpath, 'rb') as f:
while True:
block = f.read(BLOCK_SIZE)
if block:
yield block
else:
return
if __name__ == '__main__':
a = read_file("xxx.log")
print(a) # generator objec
print(next(a)) # bytes类型
print(next(a).decode("utf-8")) # str
运行结果
<generator object read_file at 0x00000226B3005258>
b'\r\n2019-10-24 21:33:31,678 [django.request:93] [base:get_response] [WARNING]- Not Found: /http:/123.125.114.144/\r\n2019-10-24 21:33:31,679 [django.server:124] [basehttp:log_message] [WARNING]- "HEAD http://123.125.114.144/ HTTP/1.1" 404 1678\r\n2019-10-24 22:14:04,121 [django.server:124] [basehttp:log_message] [INFO]- code 400, message Bad request version (\'HTTP\')\r\n2019-10-24 22:14:04,122 [django.server:124] [basehttp:log_message] [WARNING]- "GET ../../mnt/custom/ProductDefinition HTTP" 400 -\r\n2019-10-24 22:16:21,052 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login HTTP/1.1" 301 0\r\n2019-10-24 22:16:21,123 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/login/ HTTP/1.1" 200 3876\r\n2019-10-24 22:16:21,192 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/img/main_bg.png HTTP/1.1" 200 2801\r\n2019-10-24 22:16:21,196 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/style.css HTTP/1.1" 200 1638\r\n2019-10-24 22:16:21,229 [django.server:124] '
[basehttp:log_message] [INFO]- "GET /static/assets/img/bg.jpg HTTP/1.1" 200 135990
2019-10-24 22:16:21,307 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/iconfont/fonts/icomoon.ttf?u4m6fy HTTP/1.1" 200 6900
2019-10-24 22:16:23,525 [django.server:124] [basehttp:log_message] [INFO]- "POST /api/login/ HTTP/1.1" 302 0
2019-10-24 22:16:23,618 [django.server:124] [basehttp:log_message] [INFO]- "GET /api/index/ HTTP/1.1" 200 18447
2019-10-24 22:16:23,709 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/commons.js HTTP/1.1" 200 13209
2019-10-24 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/admin.css HTTP/1.1" 200 19660
2019-10-24 22:16:23,712 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/css/common.css HTTP/1.1" 200 1004
2019-10-24 22:16:23,714 [django.server:124] [basehttp:log_message] [INFO]- "GET /static/assets/js/app.js HTTP/1.1" 200 20844
2019-10-24 22:16:26,509 [django.server:124] [basehtt
面试题-python 如何读取一个大于 10G 的txt文件?的更多相关文章
- 如何用 php 读取一个很大的 excel 文件。
这个程序是用php 读取一个很大的excel文件, 先将 excel 文件保存成csv 文件, 然后利用 迭代器 逐行读取 excel 单元格的值, 拿到值以后 做相应处理,并打印结果. <?p ...
- C#读取固定文本格式的txt文件
C#读取固定文本格式的txt文件 一个简单的C#读取txt文档的程序,文档中用固定的格式存放着实例数据. //判断关键字在文档中是否存在 ] == "设备ID:107157061" ...
- Java读取CSV数据并写入txt文件
读取CSV数据并写入txt文件 package com.vfsd; import java.io.BufferedWriter; import java.io.File; import java.io ...
- Python读取一个目录下的所有文件
#!/usr/bin/python # -*- coding:utf8 -*- import os allFileNum = 0 def printPath(level, path): global ...
- Python:遍历一个目录下所有的文件及文件夹,然后计算每个文件的字符和line的小程序
编写了一个遍历一个目录下所有的文件及文件夹,然后计算每个文件的字符和line的小程序,先把程序贴出来. #coding=utf-8 ''' Created on 2014年7月14日 @author: ...
- 【Python】读取各种文档(txt、csv、excel、pdf)方法
1.读取txt文件 注意事项: 1..txt文件同下方脚本所在的.py文件需要在同一个文件夹下 # coding=utf-8 txt读取 with open("1233.txt") ...
- matlab读取内容为二进制的TXT文件
本方法同样适合读取十六进制和二进制以外的其他进制文件,txt使用一个最简单的命令就可以读取 textread 这是一个十分有用,简便的函数(对于fopen fscanf而言)读取二进制txt文件:假如 ...
- python学习笔记(1)--遍历txt文件,正则匹配替换文字
遍历一个文件夹,把里面所有txt文件里的[]里的朗读时间删除,也就是替换为空. import os import re import shutil #os文件操作,re正则,shutil复制粘贴 pa ...
- python的OS模块生成100个txt文件
#!/user/bin/env/python35 # -*-coding:utf-8-*- # author:Keekuun """ 问题:生成一个文件夹,文件夹下面生成 ...
随机推荐
- 【noi 2.6_6049】买书(DP)
题意:有N元,有无限多本10.20.50和100元的书,问有几种购买方案. 解法:f[i]表示用 i 元的方案数.还有一个 j 循环这次买多少元的书. 注意--要先 j 循环,再 i 循环.因为要先考 ...
- 网易云音乐JS逆向解析歌曲链接
Request URL: https://music.163.com/weapi/song/enhance/player/url?csrf_token= FormData : params: BV ...
- 01、mysql安装配置
1.下载mysql软件安装包 MySQL版本:5.7.17 mysql下载地址:http://rj.baidu.com/soft/detail/12585.html?ald 2.配置mysql数据库与 ...
- ElasticSearch 搜索引擎概念简介
公号:码农充电站pro 主页:https://codeshellme.github.io 1,倒排索引 倒排索引是一种数据结构,经常用在搜索引擎的实现中,用于快速找到某个单词所在的文档. 倒排索引会记 ...
- 洛谷P1522 [USACO2.4]牛的旅行 Cow Tours
洛谷P1522 [USACO2.4]牛的旅行 Cow Tours 题意: 给出一些牧区的坐标,以及一个用邻接矩阵表示的牧区之间图.如果两个牧区之间有路存在那么这条路的长度就是两个牧区之间的欧几里得距离 ...
- LeetCode刷题笔记 - 12. 整数转罗马数字
学好算法很重要,然后要学好算法,大量的练习是必不可少的,LeetCode是我经常去的一个刷题网站,上面的题目非常详细,各个标签的题目都有,可以整体练习,本公众号后续会带大家做一做上面的算法题. 官方链 ...
- mysql-画图
目录 阿里数据库产品rds 淘宝数据库架构 数据库下载 Mysql3种安装方法 mysql_install_db安装数据库命令脚本中有生成初始mysql数据 也可以把mysql_install_db集 ...
- 在kubernetes集群里集成Apollo配置中心(1)之交付Apollo-adminservice至Kubernetes集群
1.部署apollo-adminservice软件包 apollo-adminservice软件包链接地址:https://github.com/ctripcorp/apollo/releases/d ...
- 修改jpg的图片大小
using System.Drawing.Imaging; public void ResizePic(string oldFilePath, int thumbnailImageWidth, int ...
- Chinese Parents Game
Chinese Parents Game <中国式家长>是一款模拟养成游戏. 玩家在游戏中扮演一位出生在普通的中式家庭的孩子. https://en.wikipedia.org/wiki/ ...