From: johndcook.com/blog

For a set of positive probabilities pi summing to 1, their entropy is defined as

    H = -Σ pi log pi

(For this post, log will mean log base 2, not natural log.)

This post looks at a couple of questions about computing entropy. First, are there any numerical problems computing entropy directly from the equation above?
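As a baseline, here is a minimal sketch of the direct computation in Python. The function name and the use of NumPy are my choices for illustration, not from the original post.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a list of positive probabilities summing to 1."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```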

Second, imagine you don’t have the pi values directly but rather counts ni that sum to N. Then pi = ni/N. To apply the equation directly, you’d first need to compute N, then go back through the data again to compute entropy. If you have a large amount of data, could you compute the entropy in one pass?

To address the second question, note that

    -Σ (ni/N) log(ni/N) = log N - (1/N) Σ ni log ni,

so you can sum ni and ni log ni in the same pass.
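As an illustration, here is a sketch of what the one-pass algorithm could look like in Python; the function name is hypothetical and the body is just the identity above turned into code.

```python
import math

def entropy_one_pass(counts):
    """Entropy in bits from an iterable of positive counts, in one pass.

    Accumulates N = sum of ni and the sum of ni*log2(ni) together, then
    applies H = log2(N) - (1/N) * sum(ni * log2(ni)).
    """
    N = 0
    nlogn = 0.0
    for n in counts:
        N += n
        nlogn += n * math.log2(n)
    return math.log2(N) - nlogn / N

print(entropy_one_pass([2, 2]))  # 1.0 bit for two equal counts
```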

One of the things you learn in numerical analysis is to look carefully at subtractions. Subtracting two nearly equal numbers can result in a loss of precision. Could the two terms above be nearly equal? Maybe if the ni are ridiculously large. Not just astronomically large (astronomically large numbers, like the number of particles in the universe, are fine) but ridiculously large: numbers whose logarithms approach the limits of machine-representable numbers. (If we're only talking about numbers as big as the number of particles in the universe, their logs will be at most three-digit numbers.)
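As a quick back-of-the-envelope check of that last claim (mine, not from the post), taking 10^80 as a common estimate for the number of particles in the universe:

```python
import math

print(math.log2(10**80))  # about 265.75, a three-digit number
```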

Now to the problem of computing the sum of the ni log ni terms. Could the order of the terms matter? This also applies to the first question of the post if we look at summing the pi log pi terms. In general, you'll get better accuracy summing a lot of positive numbers by sorting them and adding from smallest to largest, and worse accuracy by summing from largest to smallest. If adding a sorted list gives essentially the same result when summed in either direction, summing the list in any other order should give essentially the same result as well.
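A small experiment along these lines might look like the following sketch. Here math.fsum serves as a correctly rounded reference, and Pareto-distributed terms are my stand-in for a list of positive numbers spanning many orders of magnitude.

```python
import math
import random

random.seed(42)
terms = sorted(random.paretovariate(1.0) for _ in range(10**6))

reference = math.fsum(terms)       # correctly rounded sum, for comparison
ascending = sum(terms)             # smallest to largest
descending = sum(reversed(terms))  # largest to smallest

print(ascending - reference, descending - reference)
```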

To test the methods discussed here, I used two sets of count data, one on the order of a million counts and the other on the order of a billion counts. Both data sets had approximately a power law distribution, with counts varying over seven or eight orders of magnitude. For each data set I computed the entropy four ways: two equations times two orders. That is, I converted the counts to probabilities and also used the counts directly, and in each case I summed smallest to largest and largest to smallest.
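The original post doesn't include its code, but the four-way comparison might be organized like this sketch; the function name and the synthetic power-law counts are mine.

```python
import math
import random

def entropy_four_ways(counts):
    """Entropy in bits computed four ways: two equations times two orders."""
    N = sum(counts)
    # Terms of -sum(pi log pi), negated so they are positive and
    # ascending order means smallest magnitude first.
    p_terms = sorted(-(n / N) * math.log2(n / N) for n in counts)
    # Terms of sum(ni log ni) for the one-pass equation.
    c_terms = sorted(n * math.log2(n) for n in counts)
    return (
        sum(p_terms),                               # probabilities, smallest to largest
        sum(reversed(p_terms)),                     # probabilities, largest to smallest
        math.log2(N) - sum(c_terms) / N,            # counts, smallest to largest
        math.log2(N) - sum(reversed(c_terms)) / N,  # counts, largest to smallest
    )

random.seed(1)
counts = [int(random.paretovariate(1.0)) + 1 for _ in range(10**6)]
print(entropy_four_ways(counts))
```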

For the smaller data set, all four methods produced the same answer to nine significant figures. For the larger data set, all four methods produced the same answer to seven significant figures. So at least for the kind of data I’m looking at, it doesn’t matter how you calculate entropy, and you might as well use the one-pass algorithm to get the result faster.
