Scalaz (58) - scalaz-stream: fs2 parallel processing demonstration

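On the surface, a Stream represents a potentially infinite sequence of data elements. Being a sequence means the elements have a fixed order, so operations on them must follow that order: the operation on one element finishes before the next one begins. Seen this way, a Stream does not look like a good tool for parallel computation. However, fs2 takes the Stream, not the individual element, as its unit of parallelism: it can run several Streams at the same time, for example with the merge function. This is what makes parallel computation over Streams possible in fs2.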

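In general, we may want parallel computation at several stages of a Stream:

1. Running several data sources at the same time to produce elements in no particular order

2. Processing the retrieved elements concurrently, e.g. map (update), filter and so on

3. Writing the elements to the endpoint (Sink) concurrently, in no particular order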

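We can build an example to demonstrate parallel computation in fs2: simulate reading strings from three files, then count how often vowels occur in them. Assuming that both file reading and vowel counting have arbitrary latency, we will see how to run them in parallel and how much efficiency parallelism gains. First we define some helper functions for tracing and for simulating delays: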

 def log[A](prompt: String): Pipe[Task,A,A] = _.evalMap { a => Task.delay{ println(s"$prompt>"); a }}
                                                   //> log: [A](prompt: String)fs2.Pipe[fs2.Task,A,A]
 def randomDelay[A](max: FiniteDuration): Pipe[Task,A,A] = _.evalMap { a =>
   val delay: Task[Int] = Task.delay { scala.util.Random.nextInt(max.toMillis.toInt) }
   delay.flatMap {d => Task.now(a).schedule(d.millis) }
 }                                                 //> randomDelay: [A](max: scala.concurrent.duration.FiniteDuration)fs2.

log is a tracing function; randomDelay simulates latency by delaying each element for a random duration below max.

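Unlike scalaz-stream-0.8, fs2 reimplements its file operations: it no longer relies on Java's string handling, nor on scodec for binary data conversion. The following shows how to read files with fs2: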

 val s1 = io.file.readAll[Task](java.nio.file.Paths.get("/Users/tiger-macpro/basic/BasicBackend.scala"),)
   //> s1  : fs2.Stream[fs2.Task,Byte] = evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>)
 val s2 = io.file.readAll[Task](java.nio.file.Paths.get("/Users/tiger-macpro/basic/DatabaseConfig.scala"),)
   //> s2  : fs2.Stream[fs2.Task,Byte] = evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>)
 val s3 = io.file.readAll[Task](java.nio.file.Paths.get("/Users/tiger-macpro/basic/BasicProfile.scala"),)
   //> s3  : fs2.Stream[fs2.Task,Byte] = evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>)

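The signature of fs2.io.file.readAll is:

 def readAll[F[_]](path: Path, chunkSize: Int)(implicit F: Effect[F]): Stream[F, Byte] ={...}

readAll reads Byte data from a file in chunks (a read that returns fewer than chunkSize bytes signals the end of the file) and returns a Stream[F,Byte]. We still need to convert Byte to String and split the result into lines. fs2 provides the relevant functions in the text object: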

 object text {
   private val utf8Charset = Charset.forName("UTF-8")

   /** Converts UTF-8 encoded byte stream to a stream of `String`. */
   def utf8Decode[F[_]]: Pipe[F, Byte, String] =
     _.chunks.through(utf8DecodeC)

   /** Converts UTF-8 encoded `Chunk[Byte]` inputs to `String`. */
   def utf8DecodeC[F[_]]: Pipe[F, Chunk[Byte], String] = {
     /**
       * Returns the number of continuation bytes if `b` is an ASCII byte or a
       * leading byte of a multi-byte sequence, and -1 otherwise.
       */
     def continuationBytes(b: Byte): Int = {
       if      ((b & 0x80) == 0x00) 0 // ASCII byte
       else if ((b & 0xE0) == 0xC0) 1 // leading byte of a 2 byte seq
       else if ((b & 0xF0) == 0xE0) 2 // leading byte of a 3 byte seq
       else if ((b & 0xF8) == 0xF0) 3 // leading byte of a 4 byte seq
       else                        -1 // continuation byte or garbage
     }
 ...
   /** Encodes a stream of `String` in to a stream of bytes using the UTF-8 charset. */
   def utf8Encode[F[_]]: Pipe[F, String, Byte] =
     _.flatMap(s => Stream.chunk(Chunk.bytes(s.getBytes(utf8Charset))))

   /** Encodes a stream of `String` in to a stream of `Chunk[Byte]` using the UTF-8 charset. */
   def utf8EncodeC[F[_]]: Pipe[F, String, Chunk[Byte]] =
     _.map(s => Chunk.bytes(s.getBytes(utf8Charset)))

   /** Transforms a stream of `String` such that each emitted `String` is a line from the input. */
   def lines[F[_]]: Pipe[F, String, String] = {
 ...
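The byte classification rule in the excerpt above can be tried on its own with nothing but the Scala standard library. The following is a minimal sketch: the object name Utf8LeadByteSketch and the sample string are illustrative assumptions, and the function body simply follows the doc comment in the excerpt.

```scala
object Utf8LeadByteSketch {
  /** Number of continuation bytes that should follow `b`,
    * or -1 if `b` is itself a continuation byte (or garbage). */
  def continuationBytes(b: Byte): Int =
    if      ((b & 0x80) == 0x00) 0   // ASCII byte
    else if ((b & 0xE0) == 0xC0) 1   // leading byte of a 2 byte seq
    else if ((b & 0xF0) == 0xE0) 2   // leading byte of a 3 byte seq
    else if ((b & 0xF8) == 0xF0) 3   // leading byte of a 4 byte seq
    else                        -1   // continuation byte or garbage

  // Sample text (hypothetical): one 1-byte, one 2-byte, one 3-byte
  // and one 4-byte character, written with Unicode escapes.
  val sample: Array[Byte] = "a\u00e9\u4e2d\ud83d\ude00".getBytes("UTF-8")

  // Classification of every byte of the sample, in order.
  val classified: List[Int] = sample.map(continuationBytes).toList

  def main(args: Array[String]): Unit =
    println(classified.mkString(", "))
}
```

A decoder working on chunks can use this count to tell whether a chunk ends in the middle of a multi-byte character and must wait for the next chunk before emitting a String.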

utf8Encode, utf8Decode and lines are exactly the functions we need, and all of them are Pipes. We can attach them directly to a Stream with through:

 val startTime = System.currentTimeMillis         //> startTime  : Long = 1472444756321
 val s1lines = s1.through(text.utf8Decode).through(text.lines)
     .through(randomDelay( millis)).runFold(0)((b,_) => b + 1).unsafeRun
                                                   //> s1lines  : Int = 479
 println(s"reading s1 $s1lines lines in ${System.currentTimeMillis - startTime}ms")
                                                   //> reading s1 479 lines in 5370ms
 val startTime2 = System.currentTimeMillis        //> startTime2  : Long = 1472444761691
 val s2lines = s2.through(text.utf8Decode).through(text.lines)
     .through(randomDelay( millis)).runFold(0)((b,_) => b + 1).unsafeRun
                                                   //> s2lines  : Int = 174
 println(s"reading s2 $s2lines lines in ${System.currentTimeMillis - startTime2}ms")
                                                   //> reading s2 174 lines in 1923ms
 val startTime3 = System.currentTimeMillis        //> startTime3  : Long = 1472444763614
 val s3lines = s3.through(text.utf8Decode).through(text.lines)
     .through(randomDelay( millis)).runFold(0)((b,_) => b + 1).unsafeRun
                                                   //> s3lines  : Int = 174
 println(s"reading s3 $s3lines lines in ${System.currentTimeMillis - startTime3}ms")
                                                   //> reading s3 174 lines in 1928ms
 println(s"reading all three files ${s1lines+s2lines+s3lines} total lines in ${System.currentTimeMillis - startTime}ms")
                                                   //> reading all three files 827 total lines in 9221ms

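In the example above we used runFold to count the lines of each file, with randomDelay introducing random delays during reading. Reading and converting the strings of the three files one after another took 9221 ms for 827 lines in total.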
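The decode, split and fold steps can also be tried without touching the file system. The sketch below assumes fs2 0.9.x on the classpath; the object name LineCountSketch, the in-memory sample text and the pool size of 4 are illustrative assumptions, not values from the run above.

```scala
import fs2._

object LineCountSketch {
  // Assumption: an arbitrary pool size, needed for the Task instances.
  implicit val strategy: Strategy = Strategy.fromFixedDaemonPool(4)

  // In-memory stand-in (hypothetical) for the bytes io.file.readAll would emit.
  val bytes: Array[Byte] = "alpha\nbeta\ngamma".getBytes("UTF-8")
  val source: Stream[Task, Byte] = Stream.emits(bytes.toSeq)

  // Same shape as the pipeline above: decode, split into lines, count.
  val lineCount: Int =
    source
      .through(text.utf8Decode)
      .through(text.lines)
      .runFold(0)((n, _) => n + 1)
      .unsafeRun()

  def main(args: Array[String]): Unit =
    println(s"counted $lineCount lines")
}
```

runFold starts from 0 and adds 1 per emitted line, so the fold result is the line count regardless of how the bytes were chunked.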

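The signature of concurrent.join, the fs2 function for parallel computation, looks like this:

 def join[F[_],O](maxOpen: Int)(outer: Stream[F,Stream[F,O]])(implicit F: Async[F]): Stream[F,O] = {...}

The argument outer is a two-level Stream (a Stream of Streams) of type Stream[F,Stream[F,O]], so we first have to bring our data into that shape: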

 val lines1 = s1.through(text.utf8Decode).through(text.lines).through(randomDelay( millis))
   //> lines1  : fs2.Stream[fs2.Task,String] = evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>).flatMap(<function1>)
 val lines2 = s2.through(text.utf8Decode).through(text.lines).through(randomDelay( millis))
   //> lines2  : fs2.Stream[fs2.Task,String] = evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>).flatMap(<function1>)
 val lines3 = s3.through(text.utf8Decode).through(text.lines).through(randomDelay( millis))
   //> lines3  : fs2.Stream[fs2.Task,String] = evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>).flatMap(<function1>)
 val ss: Stream[Task,Stream[Task,String]] = Stream(lines1,lines2,lines3)
   //> ss  : fs2.Stream[fs2.Task,fs2.Stream[fs2.Task,String]] = Segment(Emit(Chunk(evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>).flatMap(<function1>), evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>).flatMap(<function1>), evalScope(Scope(Bind(Eval(Snapshot),<function1>))).flatMap(<function1>).flatMap(<function1>))))

The type of ss now meets our requirement. We can test the efficiency of the parallel computation:

 val ss_start = System.currentTimeMillis           //> ss_start  : Long = 1472449962698
 val ss_lines = fs2.concurrent.join(3)(ss).runFold(0)((b,_) => b + 1).unsafeRun
                                                   //> ss_lines  : Int = 827
 println(s"parallel reading all files ${ss_lines} total lines in ${System.currentTimeMillis - ss_start}ms")
                                                   //> parallel reading all files 827 total lines in 5173ms

The same number of lines was read in only 5173 ms; compared with the earlier 9221 ms, that is nearly twice as fast (about 1.8x).
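The shape that join expects can be seen in isolation with small in-memory streams. This is a minimal sketch assuming fs2 0.9.x; the object name JoinSketch, the sample numbers and the pool size are illustrative assumptions.

```scala
import fs2._

object JoinSketch {
  // Assumption: an arbitrary pool size; join needs an Async[Task] instance.
  implicit val strategy: Strategy = Strategy.fromFixedDaemonPool(4)

  // Three independent inner streams (hypothetical sample data).
  val evens: Stream[Task, Int] = Stream(2, 4, 6)
  val odds:  Stream[Task, Int] = Stream(1, 3, 5)
  val tens:  Stream[Task, Int] = Stream(10, 20, 30)

  // The outer stream emits the inner streams themselves.
  val outer: Stream[Task, Stream[Task, Int]] = Stream(evens, odds, tens)

  // Run up to 3 inner streams at once and fold the merged output.
  // The interleaving of elements is not deterministic, but the sum is.
  val total: Int =
    fs2.concurrent.join(3)(outer).runFold(0)(_ + _).unsafeRun()

  def main(args: Array[String]): Unit =
    println(s"sum of all elements: $total")
}
```

Because the merged order is unspecified, the sketch folds with an operation that does not depend on order, which is also why counting lines works in the measurement above.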

join(3)(ss) returns a merged Stream of type Stream[Task,String]. We can count how often vowels occur in this Stream. First we design the counting function:

 // true if c is a vowel
 def vowls(c: Char): Boolean = List('A','E','I','O','U').contains(c)
                                                   //> vowls: (c: Char)Boolean
 // implemented directly with the Scala standard library
 def pipeVowlsCount: Pipe[Task,String,Map[Char,Int]] =
   _.evalMap (text => Task.delay{
      text.toUpperCase.toList.filter(vowls).groupBy(s => s).mapValues(_.size)
      }.schedule((text.length / ).millis))       //> pipeVowlsCount: => fs2.Pipe[fs2.Task,String,Map[Char,Int]]

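Note that we used text => Task.delay{...}.schedule(d). We could just as well have written text => Thread.sleep(d), but that would make the code impure, so we use evalMap to keep the computation pure. Let's count the total number of vowels across all the strings: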

 import scalaz.{Monoid}
 // a Monoid instance for Map[Char,Int], used by runFold
 implicit object mapMonoid extends Monoid[Map[Char,Int]]  {
    def zero: Map[Char,Int] = Map()
    def append(m1: Map[Char,Int], m2: => Map[Char,Int]): Map[Char,Int] = {
      (m1.keySet ++ m2.keySet).map { k =>
        (k, m1.getOrElse(k, 0) + m2.getOrElse(k, 0))
      }.toMap
    }
 }
 val vc_start = System.currentTimeMillis           //> vc_start  : Long = 1472464772465
 val vowlsLine = fs2.concurrent.join(3)(ss).through(pipeVowlsCount)
     .runFold(Map[Char,Int]())(mapMonoid.append(_,_)).unsafeRun
   //> vowlsLine  : scala.collection.immutable.Map[Char,Int] = Map(E -> 3381, U -> 838, A -> 2361, I -> 2031, O -> 1824)
 println(s"parallel reading all files and counted vowls sequencially in ${System.currentTimeMillis - vc_start}ms")
   //> parallel reading all files and counted vowls sequencially in 10466ms

runFold needs a way to combine two Map[Char,Int] values; the Monoid[Map[Char,Int]] instance mapMonoid supplies it through append.
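The per-line counting and the map merge can be checked on small inputs with the standard library alone. In this sketch the names VowelMergeSketch, countVowels and mergeCounts and the two sample lines are illustrative assumptions; mergeCounts plays the role of mapMonoid.append without the scalaz dependency.

```scala
object VowelMergeSketch {
  val vowels: Set[Char] = Set('A', 'E', 'I', 'O', 'U')

  /** Counts each vowel in one line, ignoring case. */
  def countVowels(line: String): Map[Char, Int] =
    line.toUpperCase
      .filter(vowels)                          // keep vowels only
      .groupBy(identity)                       // group equal characters
      .map { case (c, cs) => c -> cs.length }  // group size = occurrences

  /** Adds the counts of two maps key by key; a missing key counts as 0. */
  def mergeCounts(m1: Map[Char, Int], m2: Map[Char, Int]): Map[Char, Int] =
    (m1.keySet ++ m2.keySet)
      .map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0)))
      .toMap

  // Hypothetical sample lines.
  val lines: List[String] = List("hello world", "functional streams")

  // Fold the per-line maps, starting from the empty map (the monoid zero).
  val total: Map[Char, Int] =
    lines.map(countVowels).foldLeft(Map.empty[Char, Int])(mergeCounts)

  def main(args: Array[String]): Unit =
    println(total.toList.sortBy(_._1).mkString(", "))
}
```

Merging is associative and the empty map is its identity, so partial results can be combined in whatever order the merged stream delivers them.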

那我们又如何实现统计功能的并行运算呢? fs2.concurrent.join(maxOpen)(...)函数能把一个Stream截成maxOpen数的子Stream，然后对这些子Stream进行并行运算。那么我们又如何转换Stream[F,Stream[F,O]]类型呢？我们必须把Stream[F,O]的O升格成Stream[F,O]。我们先用一个函数来把O转换成Map[Char,Int]，然后把这个函数升格成Stream[Task,Map[Char,Int]，这个可以用Stream.eval实现：

 def fVowlsCount(text: String): Map[Char,Int] =
   text.toUpperCase.toList.filter(vowls).groupBy(s => s).mapValues(_.size)
                                                   //> fVowlsCount: (text: String)Map[Char,Int]
 val parVowlsLine: Stream[Task,Stream[Task,Map[Char,Int]]] = fs2.concurrent.join(3)(ss)
     .map {text => Stream.eval(Task {fVowlsCount(text)}.schedule((text.length / ).millis))}
     //> parVowlsLine  : fs2.Stream[fs2.Task,fs2.Stream[fs2.Task,Map[Char,Int]]] = attemptEval(Task).flatMap(<function1>).flatMap(<function1>).mapChunks(<function1>)

Let's check the running time:

 val parvc_start = System.currentTimeMillis        //> parvc_start  : Long = 1472465844694
 fs2.concurrent.join()(parVowlsLine)
   .runFold(Map[Char,Int]())(mapMonoid.append(_,_)).unsafeRun
   //> res0: scala.collection.immutable.Map[Char,Int] = Map(E -> 3381, U -> 838, A -> 2361, I -> 2031, O -> 1824)
 println(s"parallel reading all files and counted vowls in ${System.currentTimeMillis - parvc_start}ms")
   //> parallel reading all files and counted vowls in 4984ms

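The fully parallel version takes only 4984 ms, whereas the sequential pipeline needs 10466 + (9221 - 5173) = 14514 ms, so the speedup is roughly threefold (14514 / 4984 ≈ 2.9).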
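The lift-then-join pattern used for the vowel count can be reduced to a few lines. The sketch below assumes fs2 0.9.x; the object name ParMapSketch, the helper slowLength, the sample words, the 50 ms delay and the pool sizes are all illustrative assumptions.

```scala
import fs2._
import scala.concurrent.duration._

object ParMapSketch {
  // Assumption: arbitrary pool sizes; schedule needs both implicits.
  implicit val strategy: Strategy = Strategy.fromFixedDaemonPool(4)
  implicit val scheduler: Scheduler = Scheduler.fromFixedDaemonPool(2)

  // A slow per-element computation (hypothetical): string length after 50 ms.
  def slowLength(s: String): Task[Int] =
    Task.delay(s.length).schedule(50.millis)

  val words: Stream[Task, String] = Stream("fs2", "stream", "join", "task")

  // Lift every element into its own single-element stream ...
  val lifted: Stream[Task, Stream[Task, Int]] =
    words.map(w => Stream.eval(slowLength(w)))

  // ... so that join can evaluate up to 4 of them concurrently.
  val total: Int =
    fs2.concurrent.join(4)(lifted).runFold(0)(_ + _).unsafeRun()

  def main(args: Array[String]): Unit =
    println(s"total length: $total")
}
```

With evalMap the delayed computations would run one after another; wrapping each in Stream.eval and joining lets the delays overlap, which is where the speedup measured above comes from.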

Below is the complete source code for this discussion:

 import fs2._
 import scala.language.{higherKinds,implicitConversions,postfixOps}
 import scala.concurrent.duration._

 object fs2Merge {
   implicit val strategy = Strategy.fromFixedDaemonPool()
   implicit val scheduler = Scheduler.fromFixedDaemonPool()

   def log[A](prompt: String): Pipe[Task,A,A] = _.evalMap { a => Task.delay{ println(s"$prompt>"); a }}
   def randomDelay[A](max: FiniteDuration): Pipe[Task,A,A] = _.evalMap { a =>
     val delay: Task[Int] = Task.delay { scala.util.Random.nextInt(max.toMillis.toInt) }
     delay.flatMap {d => Task.now(a).schedule(d.millis) }
   }

   val s1 = io.file.readAll[Task](java.nio.file.Paths.get("/Users/tiger-macpro/basic/BasicBackend.scala"),)
   val s2 = io.file.readAll[Task](java.nio.file.Paths.get("/Users/tiger-macpro/basic/DatabaseConfig.scala"),)
   val s3 = io.file.readAll[Task](java.nio.file.Paths.get("/Users/tiger-macpro/basic/BasicProfile.scala"),)

   val startTime = System.currentTimeMillis
   val s1lines = s1.through(text.utf8Decode).through(text.lines)
       .through(randomDelay( millis)).runFold(0)((b,_) => b + 1).unsafeRun
   println(s"reading s1 $s1lines lines in ${System.currentTimeMillis - startTime}ms")
   val startTime2 = System.currentTimeMillis
   val s2lines = s2.through(text.utf8Decode).through(text.lines)
       .through(randomDelay( millis)).runFold(0)((b,_) => b + 1).unsafeRun
   println(s"reading s2 $s2lines lines in ${System.currentTimeMillis - startTime2}ms")
   val startTime3 = System.currentTimeMillis
   val s3lines = s3.through(text.utf8Decode).through(text.lines)
       .through(randomDelay( millis)).runFold(0)((b,_) => b + 1).unsafeRun
   println(s"reading s3 $s3lines lines in ${System.currentTimeMillis - startTime3}ms")
   println(s"reading all three files ${s1lines+s2lines+s3lines} total lines in ${System.currentTimeMillis - startTime}ms")

   val lines1 = s1.through(text.utf8Decode).through(text.lines).through(randomDelay( millis))
   val lines2 = s2.through(text.utf8Decode).through(text.lines).through(randomDelay( millis))
   val lines3 = s3.through(text.utf8Decode).through(text.lines).through(randomDelay( millis))
   val ss: Stream[Task,Stream[Task,String]] = Stream(lines1,lines2,lines3)
   val ss_start = System.currentTimeMillis
   val ss_lines = fs2.concurrent.join(3)(ss).runFold(0)((b,_) => b + 1).unsafeRun
   println(s"parallel reading all files ${ss_lines} total lines in ${System.currentTimeMillis - ss_start}ms")

   // true if c is a vowel
   def vowls(c: Char): Boolean = List('A','E','I','O','U').contains(c)
   // implemented directly with the Scala standard library
   def pipeVowlsCount: Pipe[Task,String,Map[Char,Int]] =
     _.evalMap (text => Task.delay{
        text.toUpperCase.toList.filter(vowls).groupBy(s => s).mapValues(_.size)
        }.schedule((text.length / ).millis))

   import scalaz.{Monoid}
   // a Monoid instance for Map[Char,Int], used by runFold
   implicit object mapMonoid extends Monoid[Map[Char,Int]]  {
      def zero: Map[Char,Int] = Map()
      def append(m1: Map[Char,Int], m2: => Map[Char,Int]): Map[Char,Int] = {
        (m1.keySet ++ m2.keySet).map { k =>
          (k, m1.getOrElse(k, 0) + m2.getOrElse(k, 0))
        }.toMap
      }
   }

   val vc_start = System.currentTimeMillis
   val vowlsLine = fs2.concurrent.join(3)(ss).through(pipeVowlsCount)
       .runFold(Map[Char,Int]())(mapMonoid.append(_,_)).unsafeRun
   println(s"parallel reading all files and counted vowls sequencially in ${System.currentTimeMillis - vc_start}ms")

   def fVowlsCount(text: String): Map[Char,Int] =
     text.toUpperCase.toList.filter(vowls).groupBy(s => s).mapValues(_.size)
   val parVowlsLine: Stream[Task,Stream[Task,Map[Char,Int]]] = fs2.concurrent.join(3)(ss)
       .map {text => Stream.eval(Task {fVowlsCount(text)}.schedule((text.length / ).millis))}
   val parvc_start = System.currentTimeMillis
   fs2.concurrent.join()(parVowlsLine)
     .runFold(Map[Char,Int]())(mapMonoid.append(_,_)).unsafeRun
   println(s"parallel reading all files and counted vowls in ${System.currentTimeMillis - parvc_start}ms")
 }
