Recently we were building a Shiny App in which we had to load data from a very large dataframe. It was directly impacting the app initialization time, so we had to look into different ways of reading data from files to R (in our case customer provided csv files) and identify the best one.

The goal of my post is to compare:

  1. read.csv from utils, which was the standard way of reading csvfiles to R in RStudio,
  2. read_csv from readr which replaced the former method as a standard way of doing it in RStudio,
  3. load and readRDS from base, and
  4. read_feather from feather and fread from data.table.

Data

First let’s generate some random data

set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))

and save the files on a disk to evaluate the loading time. Besides thecsv format we will also need featherRDS and Rdata files.

path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'
library(feather)
library(data.table)
write.csv(df, file = path_csv, row.names = F)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)

Next let’s check our files sizes:

files <- c('../assets/data/fast_load/df.csv', '../assets/data/fast_load/df.feather', '../assets/data/fast_load/df.RData', '../assets/data/fast_load/df.rds')
info <- file.info(files)
info$size_mb <- info$size/(1024 * 1024)
print(subset(info, select=c("size_mb")))
##                                       size_mb
## ../assets/data/fast_load/df.csv 1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData 285.4836
## ../assets/data/fast_load/df.rds 285.4837

As we can see both csv and feather format files are taking much more storage space. Csv more than 6 times and feather more than 4 times comparing to RDS and RData.

Benchmark

We will use microbenchmark library to compare the reading times of the following methods:

  • utils::read.csv
  • readr::read_csv
  • data.table::fread
  • base::load
  • base::readRDS
  • feather::read_feather

in 10 rounds.

library(microbenchmark)
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
readrCSV = readr::read_csv(path_csv, progress = F),
fread = data.table::fread(path_csv, showProgress = F),
loadRdata = base::load(path_rdata),
readRds = base::readRDS(path_rds),
readFeather = feather::read_feather(path_feather), times = 10)
print(benchmark, signif = 2)
##Unit: seconds
## expr min lq mean median uq max neval
## readCSV 200.0 200.0 211.187125 210.0 220.0 240.0 10
## readrCSV 27.0 28.0 29.770890 29.0 32.0 33.0 10
## fread 15.0 16.0 17.250016 17.0 17.0 22.0 10
## loadRdata 4.4 4.7 5.018918 4.8 5.5 5.9 10
## readRds 4.6 4.7 5.053674 5.1 5.3 5.6 10
## readFeather 1.5 1.8 2.988021 3.4 3.6 4.1 10

And the winner is… feather! However, using feather requires prior conversion of the file to the feather format.
Using load or readRDS can improve performance (second and third place in terms of speed) and has a benefit of storing smaller/compressed file. In both cases you will have to convert your file to the proper format first.

When it comes to reading from csv format fread significantly beatsread_csv and read.csv, and thus is the best option to read a csv file.

In our case we decided to go with feather file since conversion fromcsv to this format is just a one time job and we didn’t have a strict limitation on a storage space to consider usage of Rds or RDataformat.

The final workflow was:

  1. reading a csv file provided by our customer using fread,
  2. writing it to feather using write_feather, and
  3. loading a feather file on app initialization using read_feather.

First two tasks were done once and outside of a Shiny App context.

There is also quite interesting benchmark done by Hadley here on reading complete files to R. Unfortunately, if you use functions defined in that post, you will end up with an character type object, and you will have to apply string manipulations to obtain a commonly and widely used dataframe.

转自:http://blog.appsilondatascience.com/rstats/2017/04/11/fast-data-load.html

Fast data loading from files to R的更多相关文章

  1. pytorch例子学习-DATA LOADING AND PROCESSING TUTORIAL

    参考:https://pytorch.org/tutorials/beginner/data_loading_tutorial.html DATA LOADING AND PROCESSING TUT ...

  2. Redisql: the lightning fast data polyglot【翻译】 - Linvo's blog - 博客频道 - CSDN.NET

    Redisql: the lightning fast data polyglot[翻译] - Linvo's blog - 博客频道 - CSDN.NET Redisql: the lightnin ...

  3. 安装mysql时出现initialize specified but the data directory has files in in.Aborting.该如何解决

    eclipse中写入sql插入语句时,navicat中显示的出现乱码(???). 在修改eclipse工作空间编码.navicate中的数据库编码.mysql中my.ini中的配置之后还是出现乱码. ...

  4. The multi-part request contained parameter data (excluding uploaded files) that exceeded the limit for maxPostSize set on the associated connector.

    springboot 表单体积过大时报错: The multi-part request contained parameter data (excluding uploaded files) tha ...

  5. Springboot 上传报错: Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The multi-part request contained parameter data (excluding uploaded files) that exceede

    Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The ...

  6. MYSQL常见安装错误集:[ERROR] --initialize specified but the data directory has files in it. Abort

    1.[ERROR] --initialize specified but the data directory has files in it. Abort [错误] -初始化指定,但数据目录中有文件 ...

  7. Data Manipulation with dplyr in R

    目录 select The filter and arrange verbs arrange filter Filtering and arranging Mutate The count verb ...

  8. 启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting.

    启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting 解决方法: vim /etc/m ...

  9. STM32 GPIO fast data transfer with DMA

    AN2548 -- 使用 STM32F101xx 和 STM32F103xx 的 DMA 控制器 DMA控制器 DMA是AMBA的先进高性能总线(AHB)上的设备,它有2个AHB端口: 一个是从端口, ...

随机推荐

  1. Android 布局(线性布局、相对布局)

    一.线性布局(LinearLayout) <LinearLayout****</LinearLayout>1. orientation(布局方向)value=0 horizontal ...

  2. Oracle wm_concat()函数

    oracle wm_concat(column)函数使我们经常会使用到的,下面就教您如何使用oraclewm_concat(column)函数实现字段合并 如: shopping:   ------- ...

  3. Anaconda配置多spyder多python环境

    作者:桂. 时间:2017-04-17  22:02:37 链接:http://www.cnblogs.com/xingshansi/p/6725298.html  前言 最近在看<统计学习方法 ...

  4. 串口屏Modbus协议,串口屏的modbus协议资料,串口屏modbus通讯协议开发,串口屏之modbus协议使用技巧

    串口屏Modbus协议,串口屏的modbus协议资料,串口屏modbus通讯协议开发,串口屏之modbus协议使用技巧 本例程中用51单片机作为Modbus从机,从机的设备地址为2,从机有4个寄存器, ...

  5. Unity -JsonUtility的使用

    今天,为大家分享一下unity上的Json序列化,应该一说到这个词语,我们肯定会觉得,这应该是很常用的一个功能点:诚然,我们保存数据的时候,也许会用到json序列化,所以,我们有必要快速了解一下它的简 ...

  6. 笔记整理:计算CPU使用率 ----linux 环境编程 从应用到内核

    linux 提供time命令统计进程在用户态和内核态消耗的CPU时间: [root@localhost ~]# time sleep real 0m2.001s user 0m0.001s sys 0 ...

  7. Executor框架学习笔记

    Java中的线程即是工作单元也是执行机制,从JDK 5后,工作单元与执行机制被分离.工作单元包括Runnable和Callable,执行机制由JDK 5中增加的java.util.concurrent ...

  8. Docker 架构详解 - 每天5分钟玩转容器技术(7)

    Docker 的核心组件包括: Docker 客户端 - Client Docker 服务器 - Docker daemon Docker 镜像 - Image Registry Docker 容器 ...

  9. python13_day4

    上周复习 1,python基础 2,基本数据类型 3,函数式编程 函数式编程.三元运行.内置函数.文件处理 容易出问题的点 函数默认返回值为none,对于列表字典,传入引用. 1 2 3 4 5 6 ...

  10. .NET遇上Docker - Docker集成Cron定时运行.NETCore(ConsoleApp)程序.md

    配置项目的Docker支持 对于VS中Docker的配置,依旧重复一些废话. 给项目添加Docker支持,VS2015可以直接使用Docker for VS插件,VS2017在安装时选择容器支持.VS ...