This post analyzes, based on the source code, how Spark Structured Streaming (Spark 2.4) looks up a DataSourceProvider. First, let's look at the code that reads a Kafka streaming source:

    SparkSession sparkSession = SparkSession.builder().getOrCreate();
    Dataset<Row> sourceDataset = sparkSession.readStream().format("kafka").option("xxx", "xxx").load();
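
For comparison, the same call chain in Scala looks roughly like this (a minimal sketch; the broker list and topic name are placeholders, and kafka.bootstrap.servers / subscribe are the usual option keys for the kafka source):

    // Minimal Scala sketch of the same call chain; option values are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val sourceDataset = spark.readStream
      .format("kafka")                                // "kafka" is resolved later by DataSource.lookupDataSource
      .option("kafka.bootstrap.servers", "host:9092") // placeholder broker list
      .option("subscribe", "some_topic")              // placeholder topic
      .load()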

sparkSession.readStream() returns a DataStreamReader.

DataStreamReader (https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala): the load() call in the code above is a method of this class.

The format argument "kafka" is kept in the DataStreamReader as its source field:

    @InterfaceStability.Evolving
    final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
      /**
       * Specifies the input data source format.
       *
       * @since 2.0.0
       */
      def format(source: String): DataStreamReader = {
        this.source = source
        this
      }
      ...
    }

The argument to DataStreamReader#format(source: String) is typically one of csv / text / json / jdbc / kafka / console / socket, etc.

DataStreamReader#load() method:

    /**
     * Loads input data stream in as a `DataFrame`, for data streams that don't require a path
     * (e.g. external key-value stores).
     *
     * @since 2.0.0
     */
    def load(): DataFrame = {
      if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
        throw new AnalysisException("Hive data source can only be used with tables, you can not " +
          "read files of Hive data source directly.")
      }

      val ds = DataSource.lookupDataSource(source, sparkSession.sqlContext.conf).newInstance()
      // We need to generate the V1 data source so we can pass it to the V2 relation as a shim.
      // We can't be sure at this point whether we'll actually want to use V2, since we don't know the
      // writer or whether the query is continuous.
      val v1DataSource = DataSource(
        sparkSession,
        userSpecifiedSchema = userSpecifiedSchema,
        className = source,
        options = extraOptions.toMap)
      val v1Relation = ds match {
        case _: StreamSourceProvider => Some(StreamingRelation(v1DataSource))
        case _ => None
      }
      ds match {
        case s: MicroBatchReadSupport =>
          val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
            ds = s, conf = sparkSession.sessionState.conf)
          val options = sessionOptions ++ extraOptions
          val dataSourceOptions = new DataSourceOptions(options.asJava)
          var tempReader: MicroBatchReader = null
          val schema = try {
            tempReader = s.createMicroBatchReader(
              Optional.ofNullable(userSpecifiedSchema.orNull),
              Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
              dataSourceOptions)
            tempReader.readSchema()
          } finally {
            // Stop tempReader to avoid side-effect thing
            if (tempReader != null) {
              tempReader.stop()
              tempReader = null
            }
          }
          Dataset.ofRows(
            sparkSession,
            StreamingRelationV2(
              s, source, options,
              schema.toAttributes, v1Relation)(sparkSession))
        case s: ContinuousReadSupport =>
          val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
            ds = s, conf = sparkSession.sessionState.conf)
          val options = sessionOptions ++ extraOptions
          val dataSourceOptions = new DataSourceOptions(options.asJava)
          val tempReader = s.createContinuousReader(
            Optional.ofNullable(userSpecifiedSchema.orNull),
            Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
            dataSourceOptions)
          Dataset.ofRows(
            sparkSession,
            StreamingRelationV2(
              s, source, options,
              tempReader.readSchema().toAttributes, v1Relation)(sparkSession))
        case _ =>
          // Code path for data source v1.
          Dataset.ofRows(sparkSession, StreamingRelation(v1DataSource))
      }
    }

val ds = DataSource.lookupDataSource(source, ...).newInstance() is where the source variable is used. To find out what ds actually is (it is not a Dataset, but an instance of whatever class the lookup resolves), we need to look at the implementation of DataSource.lookupDataSource(source, ...).
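
Before reading the implementation, the result can also be probed directly from a spark-shell. A small exploratory sketch (this relies on internal APIs, since DataSource and SQLConf live in internal packages, and assumes the spark-sql-kafka-0-10 jar is on the classpath):

    // Exploratory sketch (internal API, for illustration only).
    import org.apache.spark.sql.execution.datasources.DataSource
    import org.apache.spark.sql.internal.SQLConf

    val cls = DataSource.lookupDataSource("kafka", SQLConf.get)
    // With the kafka module deployed, cls should be
    // org.apache.spark.sql.kafka010.KafkaSourceProvider.
    val ds = cls.newInstance()   // ds is a provider instance, not a Dataset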

Analysis of DataSource.lookupDataSource(source, sparkSession.sqlContext.conf)

DataSource source file: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

The lookupDataSource method is defined in the companion object of DataSource:

    object DataSource extends Logging {

      ...

      /**
       * Class that were removed in Spark 2.0. Used to detect incompatibility libraries for Spark 2.0.
       */
      private val spark2RemovedClasses = Set(
        "org.apache.spark.sql.DataFrame",
        "org.apache.spark.sql.sources.HadoopFsRelationProvider",
        "org.apache.spark.Logging")

      /** Given a provider name, look up the data source class definition. */
      def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
        val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
          case name if name.equalsIgnoreCase("orc") &&
              conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
            classOf[OrcFileFormat].getCanonicalName
          case name if name.equalsIgnoreCase("orc") &&
              conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "hive" =>
            "org.apache.spark.sql.hive.orc.OrcFileFormat"
          case "com.databricks.spark.avro" if conf.replaceDatabricksSparkAvroEnabled =>
            "org.apache.spark.sql.avro.AvroFileFormat"
          case name => name
        }
        val provider2 = s"$provider1.DefaultSource"
        val loader = Utils.getContextOrSparkClassLoader
        val serviceLoader = ServiceLoader.load(classOf[DataSourceRegister], loader)

        try {
          serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
            // the provider format did not match any given registered aliases
            case Nil =>
              try {
                Try(loader.loadClass(provider1)).orElse(Try(loader.loadClass(provider2))) match {
                  case Success(dataSource) =>
                    // Found the data source using fully qualified path
                    dataSource
                  case Failure(error) =>
                    if (provider1.startsWith("org.apache.spark.sql.hive.orc")) {
                      throw new AnalysisException(
                        "Hive built-in ORC data source must be used with Hive support enabled. " +
                          "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
                          "'native'")
                    } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
                        provider1 == "com.databricks.spark.avro" ||
                        provider1 == "org.apache.spark.sql.avro") {
                      throw new AnalysisException(
                        s"Failed to find data source: $provider1. Avro is built-in but external data " +
                          "source module since Spark 2.4. Please deploy the application as per " +
                          "the deployment section of \"Apache Avro Data Source Guide\".")
                    } else if (provider1.toLowerCase(Locale.ROOT) == "kafka") {
                      throw new AnalysisException(
                        s"Failed to find data source: $provider1. Please deploy the application as " +
                          "per the deployment section of " +
                          "\"Structured Streaming + Kafka Integration Guide\".")
                    } else {
                      throw new ClassNotFoundException(
                        s"Failed to find data source: $provider1. Please find packages at " +
                          "http://spark.apache.org/third-party-projects.html",
                        error)
                    }
                }
              } catch {
                case e: NoClassDefFoundError => // This one won't be caught by Scala NonFatal
                  // NoClassDefFoundError's class name uses "/" rather than "." for packages
                  val className = e.getMessage.replaceAll("/", ".")
                  if (spark2RemovedClasses.contains(className)) {
                    throw new ClassNotFoundException(s"$className was removed in Spark 2.0. " +
                      "Please check if your library is compatible with Spark 2.0", e)
                  } else {
                    throw e
                  }
              }
            case head :: Nil =>
              // there is exactly one registered alias
              head.getClass
            case sources =>
              // There are multiple registered aliases for the input. If there is single datasource
              // that has "org.apache.spark" package in the prefix, we use it considering it is an
              // internal datasource within Spark.
              val sourceNames = sources.map(_.getClass.getName)
              val internalSources = sources.filter(_.getClass.getName.startsWith("org.apache.spark"))
              if (internalSources.size == 1) {
                logWarning(s"Multiple sources found for $provider1 (${sourceNames.mkString(", ")}), " +
                  s"defaulting to the internal datasource (${internalSources.head.getClass.getName}).")
                internalSources.head.getClass
              } else {
                throw new AnalysisException(s"Multiple sources found for $provider1 " +
                  s"(${sourceNames.mkString(", ")}), please specify the fully qualified class name.")
              }
          }
        } catch {
          case e: ServiceConfigurationError if e.getCause.isInstanceOf[NoClassDefFoundError] =>
            // NoClassDefFoundError's class name uses "/" rather than "." for packages
            val className = e.getCause.getMessage.replaceAll("/", ".")
            if (spark2RemovedClasses.contains(className)) {
              throw new ClassNotFoundException(s"Detected an incompatible DataSourceRegister. " +
                "Please remove the incompatible library from classpath or upgrade it. " +
                s"Error: ${e.getMessage}", e)
            } else {
              throw e
            }
        }
      }

      ...
    }

Its lookup flow is as follows (a hands-on sketch of steps 3 and 4 follows the list):

1) First, resolve the provider name through the predefined backwardCompatibilityMap in object DataSource;

2) If the name is not in the map, keep the original name;

3) Use a ServiceLoader to load all DataSourceRegister implementations on the classpath;

4) From the set loaded in 3), keep the ones whose shortName equals the provider name (case-insensitively);

5) Return the provider class. If no registered alias matches, the name is instead loaded as a fully qualified class name (trying provider1 and then provider1.DefaultSource); if several aliases match, the single internal org.apache.spark implementation is preferred, otherwise an error is raised.
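
Steps 3) and 4) are plain java.util.ServiceLoader usage: every jar that ships a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file contributes its implementations, and the requested format is matched against their shortName(). A minimal sketch that reproduces these two steps by hand (assuming the kafka module is on the classpath):

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._

    import org.apache.spark.sql.sources.DataSourceRegister

    // Load every registered DataSourceRegister implementation, then keep the ones
    // whose alias matches the requested format (case-insensitively).
    val loader = Thread.currentThread().getContextClassLoader
    val registered = ServiceLoader.load(classOf[DataSourceRegister], loader).asScala.toList
    val kafkaProviders = registered.filter(_.shortName().equalsIgnoreCase("kafka"))
    kafkaProviders.map(_.getClass.getName)
    // Expected when spark-sql-kafka-0-10 is deployed:
    // List(org.apache.spark.sql.kafka010.KafkaSourceProvider)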

The backwardCompatibilityMap used above is also defined in the DataSource companion object; it is essentially a predefined set of provider-name aliases.

    object DataSource extends Logging {

      /** A map to maintain backward compatibility in case we move data sources around. */
      private val backwardCompatibilityMap: Map[String, String] = {
        val jdbc = classOf[JdbcRelationProvider].getCanonicalName
        val json = classOf[JsonFileFormat].getCanonicalName
        val parquet = classOf[ParquetFileFormat].getCanonicalName
        val csv = classOf[CSVFileFormat].getCanonicalName
        val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
        val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
        val nativeOrc = classOf[OrcFileFormat].getCanonicalName
        val socket = classOf[TextSocketSourceProvider].getCanonicalName
        val rate = classOf[RateStreamProvider].getCanonicalName

        Map(
          "org.apache.spark.sql.jdbc" -> jdbc,
          "org.apache.spark.sql.jdbc.DefaultSource" -> jdbc,
          "org.apache.spark.sql.execution.datasources.jdbc.DefaultSource" -> jdbc,
          "org.apache.spark.sql.execution.datasources.jdbc" -> jdbc,
          "org.apache.spark.sql.json" -> json,
          "org.apache.spark.sql.json.DefaultSource" -> json,
          "org.apache.spark.sql.execution.datasources.json" -> json,
          "org.apache.spark.sql.execution.datasources.json.DefaultSource" -> json,
          "org.apache.spark.sql.parquet" -> parquet,
          "org.apache.spark.sql.parquet.DefaultSource" -> parquet,
          "org.apache.spark.sql.execution.datasources.parquet" -> parquet,
          "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
          "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
          "org.apache.spark.sql.hive.orc" -> orc,
          "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
          "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
          "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
          "org.apache.spark.ml.source.libsvm" -> libsvm,
          "com.databricks.spark.csv" -> csv,
          "org.apache.spark.sql.execution.streaming.TextSocketSourceProvider" -> socket,
          "org.apache.spark.sql.execution.streaming.RateSourceProvider" -> rate
        )
      }
      ...
    }
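
The effect of this map can be checked with the same exploratory call as before: a legacy name such as "com.databricks.spark.csv" is first rewritten to the built-in CSV class, and that fully qualified name is then loaded directly because no registered shortName matches it. A sketch (internal API again; the commented result is what would be expected):

    import org.apache.spark.sql.execution.datasources.DataSource
    import org.apache.spark.sql.internal.SQLConf

    // Legacy alias resolved to the built-in CSVFileFormat via backwardCompatibilityMap.
    DataSource.lookupDataSource("com.databricks.spark.csv", SQLConf.get)
    // Expected: class org.apache.spark.sql.execution.datasources.csv.CSVFileFormat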

Which class has shortName "kafka" and implements the DataSourceRegister interface?

The class that satisfies "shortName is kafka and implements DataSourceRegister" is KafkaSourceProvider (https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala; note this link points to master, and the branch-2.4 class mixes in a slightly different set of interfaces, but its shortName() is "kafka" in both). It is visible to the ServiceLoader because the kafka module registers it in its META-INF/services/org.apache.spark.sql.sources.DataSourceRegister resource file.

    /**
     * The provider class for all Kafka readers and writers. It is designed such that it throws
     * IllegalArgumentException when the Kafka Dataset is created, so that it can catch
     * missing options even before the query is started.
     */
    private[kafka010] class KafkaSourceProvider extends DataSourceRegister
        with StreamSourceProvider
        with StreamSinkProvider
        with RelationProvider
        with CreatableRelationProvider
        with TableProvider
        with Logging {
      import KafkaSourceProvider._

      override def shortName(): String = "kafka"
      ...
    }

Definition of the DataSourceRegister trait

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

    /**
     * Data sources should implement this trait so that they can register an alias to their data source.
     * This allows users to give the data source alias as the format type over the fully qualified
     * class name.
     *
     * A new instance of this class will be instantiated each time a DDL call is made.
     *
     * @since 1.5.0
     */
    @InterfaceStability.Stable
    trait DataSourceRegister {

      /**
       * The string that represents the format that this data source provider uses. This is
       * overridden by children to provide a nice alias for the data source. For example:
       *
       * {{{
       *   override def shortName(): String = "parquet"
       * }}}
       *
       * @since 1.5.0
       */
      def shortName(): String
    }
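
To make a class discoverable the same way, it is enough to implement DataSourceRegister together with one of the provider interfaces and to list the class in a META-INF/services file. A hypothetical sketch (the package com.example, the class MyCountSourceProvider and the alias "mycount" are made-up names; a streaming source would implement StreamSourceProvider or the V2 read-support interfaces instead of RelationProvider):

    package com.example

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

    // Hypothetical custom provider: format("mycount") would resolve to this class once it
    // is listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
    class MyCountSourceProvider extends DataSourceRegister with RelationProvider {

      // The alias users can pass to format(...)
      override def shortName(): String = "mycount"

      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation = {
        // Build and return a BaseRelation here; omitted in this sketch.
        throw new UnsupportedOperationException("sketch only")
      }
    }

The accompanying resource file src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister would contain the single line com.example.MyCountSourceProvider; KafkaSourceProvider is registered through this same mechanism in the kafka-0-10-sql module.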

Which classes implement DataSourceRegister?

Classes that implement DataSourceRegister include:

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamProvider.scala

https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/test/scala/org/apache/spark/sql/sources/fakeExternalSources.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/noop/NoopDataSource.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala

https://github.com/apache/spark/blob/branch-2.4/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala

https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketSourceProvider.scala
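
Instead of searching the source tree, the registrations actually visible at runtime (for whatever happens to be on the classpath) can be enumerated by iterating the ServiceLoader directly; a short sketch:

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._

    import org.apache.spark.sql.sources.DataSourceRegister

    // Print alias -> implementing class for every data source registered on the classpath.
    ServiceLoader.load(classOf[DataSourceRegister]).asScala.foreach { r =>
      println(s"${r.shortName()} -> ${r.getClass.getName}")
    }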
