Recently we were building a Shiny App in which we had to load data from a very large dataframe. It was directly impacting the app initialization time, so we had to look into different ways of reading data from files to R (in our case customer provided csv files) and identify the best one.

The goal of my post is to compare:

  1. read.csv from utils, which was the standard way of reading csvfiles to R in RStudio,
  2. read_csv from readr which replaced the former method as a standard way of doing it in RStudio,
  3. load and readRDS from base, and
  4. read_feather from feather and fread from data.table.

Data

First let’s generate some random data

set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))

and save the files on a disk to evaluate the loading time. Besides thecsv format we will also need featherRDS and Rdata files.

path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'
library(feather)
library(data.table)
write.csv(df, file = path_csv, row.names = F)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)

Next let’s check our files sizes:

files <- c('../assets/data/fast_load/df.csv', '../assets/data/fast_load/df.feather', '../assets/data/fast_load/df.RData', '../assets/data/fast_load/df.rds')
info <- file.info(files)
info$size_mb <- info$size/(1024 * 1024)
print(subset(info, select=c("size_mb")))
##                                       size_mb
## ../assets/data/fast_load/df.csv 1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData 285.4836
## ../assets/data/fast_load/df.rds 285.4837

As we can see both csv and feather format files are taking much more storage space. Csv more than 6 times and feather more than 4 times comparing to RDS and RData.

Benchmark

We will use microbenchmark library to compare the reading times of the following methods:

  • utils::read.csv
  • readr::read_csv
  • data.table::fread
  • base::load
  • base::readRDS
  • feather::read_feather

in 10 rounds.

library(microbenchmark)
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
readrCSV = readr::read_csv(path_csv, progress = F),
fread = data.table::fread(path_csv, showProgress = F),
loadRdata = base::load(path_rdata),
readRds = base::readRDS(path_rds),
readFeather = feather::read_feather(path_feather), times = 10)
print(benchmark, signif = 2)
##Unit: seconds
## expr min lq mean median uq max neval
## readCSV 200.0 200.0 211.187125 210.0 220.0 240.0 10
## readrCSV 27.0 28.0 29.770890 29.0 32.0 33.0 10
## fread 15.0 16.0 17.250016 17.0 17.0 22.0 10
## loadRdata 4.4 4.7 5.018918 4.8 5.5 5.9 10
## readRds 4.6 4.7 5.053674 5.1 5.3 5.6 10
## readFeather 1.5 1.8 2.988021 3.4 3.6 4.1 10

And the winner is… feather! However, using feather requires prior conversion of the file to the feather format.
Using load or readRDS can improve performance (second and third place in terms of speed) and has a benefit of storing smaller/compressed file. In both cases you will have to convert your file to the proper format first.

When it comes to reading from csv format fread significantly beatsread_csv and read.csv, and thus is the best option to read a csv file.

In our case we decided to go with feather file since conversion fromcsv to this format is just a one time job and we didn’t have a strict limitation on a storage space to consider usage of Rds or RDataformat.

The final workflow was:

  1. reading a csv file provided by our customer using fread,
  2. writing it to feather using write_feather, and
  3. loading a feather file on app initialization using read_feather.

First two tasks were done once and outside of a Shiny App context.

There is also quite interesting benchmark done by Hadley here on reading complete files to R. Unfortunately, if you use functions defined in that post, you will end up with an character type object, and you will have to apply string manipulations to obtain a commonly and widely used dataframe.

转自:http://blog.appsilondatascience.com/rstats/2017/04/11/fast-data-load.html

Fast data loading from files to R的更多相关文章

  1. pytorch例子学习-DATA LOADING AND PROCESSING TUTORIAL

    参考:https://pytorch.org/tutorials/beginner/data_loading_tutorial.html DATA LOADING AND PROCESSING TUT ...

  2. Redisql: the lightning fast data polyglot【翻译】 - Linvo's blog - 博客频道 - CSDN.NET

    Redisql: the lightning fast data polyglot[翻译] - Linvo's blog - 博客频道 - CSDN.NET Redisql: the lightnin ...

  3. 安装mysql时出现initialize specified but the data directory has files in in.Aborting.该如何解决

    eclipse中写入sql插入语句时,navicat中显示的出现乱码(???). 在修改eclipse工作空间编码.navicate中的数据库编码.mysql中my.ini中的配置之后还是出现乱码. ...

  4. The multi-part request contained parameter data (excluding uploaded files) that exceeded the limit for maxPostSize set on the associated connector.

    springboot 表单体积过大时报错: The multi-part request contained parameter data (excluding uploaded files) tha ...

  5. Springboot 上传报错: Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The multi-part request contained parameter data (excluding uploaded files) that exceede

    Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The ...

  6. MYSQL常见安装错误集:[ERROR] --initialize specified but the data directory has files in it. Abort

    1.[ERROR] --initialize specified but the data directory has files in it. Abort [错误] -初始化指定,但数据目录中有文件 ...

  7. Data Manipulation with dplyr in R

    目录 select The filter and arrange verbs arrange filter Filtering and arranging Mutate The count verb ...

  8. 启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting.

    启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting 解决方法: vim /etc/m ...

  9. STM32 GPIO fast data transfer with DMA

    AN2548 -- 使用 STM32F101xx 和 STM32F103xx 的 DMA 控制器 DMA控制器 DMA是AMBA的先进高性能总线(AHB)上的设备,它有2个AHB端口: 一个是从端口, ...

随机推荐

  1. AndroidStudio升级后出现Refresh gradle project和connection timed out的原因和解决方法

    笔者发现现在升级AndroidStudio不需要FQ了,于是在看到了升级提醒后手贱点击了升级.可悲剧的一幕发生了, 正在写的一个项目从上到下密密麻麻的错误,看了一下提示要求升级Gradle 那就升级吧 ...

  2. [Paxos] Paxos Made Simple 读后感

    Paxos 由著名图灵奖获得者Leslie Lamport提出,该算法是分布式一致性算法中的奠基之作,今天初读此文仅将相关学习心得予以记录. 1.Paxos 是什么?主要用来解决什么问题? Paxos ...

  3. jquery template.js前端模板引擎

    作为现代应用,ajax的大量使用,使得前端工程师们日常的开发少不了拼装模板,渲染模板 在刚有web的时候,前端与后端的交互,非常直白,浏览器端发出URL,后端返回一张拼好了的HTML串.浏览器对其进行 ...

  4. HibernateTemplate的使用

    HibernateTemplate 提供了非常多的常用方法来完成基本的操作,比如增加.删除.修改及查询等操作,Spring 2.0 更增加对命名 SQL 查询的支持,也增加对分页的支持.大部分情况下, ...

  5. 在Oracle中添加用户登录名称

    第一步,打开Oracle客户端单击 “帮助”-->"支持信息"-->”TNS名“,加入红色部分.页面如下: 第二步,再次打开Oracle客户端时,就会显示数据库了,只需 ...

  6. C++学习笔记1(扩充:C++中的格式控制)

    前一章,我们了解了再C++中的标准的输入输出问题,那么肯能就有人会问了再C语言中我们可以灵活的控制输出和显示,那么再再C++中可以实现吗?我的回答是当然可以的,只不过再C++中的控制可能相比较而言要比 ...

  7. CSS新内容

     margin 外边距                                 * margin  属性值最多有4个                 * ① 只写一个值:四个方向的margin ...

  8. Oracle空间查询 ORA-28595

    可使用数据库管理系统 (DBMS) 的结构化查询语言 (SQL).数据类型和表格式来处理地理数据库或安装了 ST_Geometry 类型的数据库中所存储的信息. 例如,在ArcMap中我们使用&quo ...

  9. DirectFB环境搭建

    一.下载安装包 http://www.directfb.org/index.php?path=Main%2FDownloads git clone git://git.directfb.org/git ...

  10. 【原创】bootstrap框架的学习 第五课

    一.Bootstrap 中定义了所有的 HTML 标题(h1 到 h6)的样式. <!DOCTYPE html> <html> <head> <title&g ...