a possible low-level optimization

http://www.1point3acres.com/bbs/thread-212960-1-1.html

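The second round was with a white guy. He opened with a question I still don't understand: roughly, given a vector<uint8_t> nums and a 256-element vector<int> counts, you iterate over nums and do counts[nums[i]]++; how would you optimize this? The hint was to make use of things like the CPU cache (I had no idea). Seeing that I was lost, he then gave me a 3sum problem, which I solved quickly.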

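#include <stddef.h>
#include <stdint.h>

uint8_t input[INPUT_SIZE];  /* INPUT_SIZE: a compile-time constant, value not given here */

uint32_t count[256];        /* one counter per possible byte value */

void count_it()
{
    for (size_t i = 0; i < sizeof(input) / sizeof(input[0]); i++) {
        ++count[input[i]];
    }
}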

how to optimize? possible points to consider:

a) target "count" array size is 4B*256=1KB, which can fit into L1 cache, so no need to worry about that;

b) input array access is sequential, which is actually cache friendly;

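c) updates to "count" could suffer false sharing if several cores wrote to it, but in this single-threaded loop the whole array stays in one core's L1 cache, so that's fine;

d) optimization 1: the loop could be unrolled to reduce the number of loop-condition checks;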

e) optimization 2: input array could be pre-fetched (i.e. insert PREFETCH instructions beforehand);

    for (size_t i = 0; i < sizeof(input) / sizeof(input[0]);) {
        // a typical cache line is 64 bytes; this assumes the input size is a multiple of 64
        __builtin_prefetch(&input[i + 64], 0, 3); // prefetch the next line for read, high locality
        for (int j = 0; j < 8; j++) {
            size_t k = i + j * 8;
            ++count[input[k]];
            ++count[input[k + 1]];
            ++count[input[k + 2]];
            ++count[input[k + 3]];
            ++count[input[k + 4]];
            ++count[input[k + 5]];
            ++count[input[k + 6]];
            ++count[input[k + 7]];
        }
        i += 64;
    }

(see https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Other-Builtins.html for __builtin_prefetch)
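The unrolled, prefetching loop above only works when the input length is a multiple of the block size. A minimal sketch of how (d) and (e) might be combined for an arbitrary length follows. The function name count_bytes and its pointer/length signature are assumptions for illustration, not from the original post, and __builtin_prefetch requires GCC or Clang. Whether it beats the plain loop depends on the compiler and CPU, so it should be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post):
 * adds the byte histogram of input[0..n) into count[0..256). */
void count_bytes(const uint8_t *input, size_t n, uint32_t count[256])
{
    assert(count != NULL && (input != NULL || n == 0));

    size_t i = 0;

    /* Main loop: one 64-byte block (a typical cache line) per iteration. */
    for (; i + 64 <= n; i += 64) {
        /* Hint the next block: read access (0), high temporal locality (3). */
        __builtin_prefetch(input + i + 64, 0, 3);

        /* Inner loop unrolled 8 times: 8 iterations x 8 increments = 64 bytes. */
        for (size_t k = i; k < i + 64; k += 8) {
            ++count[input[k]];
            ++count[input[k + 1]];
            ++count[input[k + 2]];
            ++count[input[k + 3]];
            ++count[input[k + 4]];
            ++count[input[k + 5]];
            ++count[input[k + 6]];
            ++count[input[k + 7]];
        }
    }

    /* Tail: the remaining n % 64 bytes, one at a time. */
    for (; i < n; i++) {
        ++count[input[i]];
    }
}
```

The caller is expected to zero count before the first call; calling it again on another buffer accumulates into the same counters.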

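f) optimization 3: multi-threading; if the threads share one "count" array, each increment must be atomic (e.g. a lock-prefixed instruction on x86);

g) optimization 4: vector extension CPU instructions: a "gather" instruction loads sparse locations (count[xxx]) into a zmm register (512 bits, i.e. 64 bytes or 16 32-bit integers), so 16 uint8_t inputs can be processed in one go; adding a constant 512-bit vector then increments each lane by 1, and the corresponding "scatter" instruction stores the updated counts back. Note that if the same value appears more than once among the 16 inputs, a plain gather/add/scatter records only one increment for it, so duplicates within a batch need separate handling (for example with the AVX-512CD conflict-detection instructions).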
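For the multi-threading idea in (f), one alternative to atomic increments on a shared array is to give each thread a private histogram and merge the results at the end. The sketch below uses OpenMP, which is an assumption here (the original post names no threading API), and the function name count_bytes_parallel is hypothetical. Build with -fopenmp on GCC or Clang; without that flag the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post):
 * adds the byte histogram of input[0..n) into count[0..256),
 * using one private histogram per thread. */
void count_bytes_parallel(const uint8_t *input, size_t n, uint32_t count[256])
{
    assert(count != NULL && (input != NULL || n == 0));

    #pragma omp parallel
    {
        /* Lives on each thread's own stack, so increments need no atomics
         * and different threads do not write the same cache lines. */
        uint32_t local[256] = {0};

        /* The iterations are split across the threads of the team. */
        #pragma omp for nowait
        for (size_t i = 0; i < n; i++) {
            ++local[input[i]];
        }

        /* Merge: one thread at a time adds its 256 counters to the result. */
        #pragma omp critical
        for (int v = 0; v < 256; v++) {
            count[v] += local[v];
        }
    }
}
```

The merge costs 256 additions per thread regardless of input size, so this only pays off for inputs large enough to amortize thread start-up.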
