nanoporetech/nanonet
nanonet | Remove extraneous currennt call. | 4 months ago | |
sample_data | Update example data to contain raw | a year ago | |
.gitignore | Get 2d caller to running on windows | a year ago | |
COPYRIGHT | Add open source licensing | 10 months ago | |
LICENSE.md | Add open source licensing | 10 months ago | |
MANIFEST.in | deb packaging, datafiles still missing | a year ago | |
Makefile | Dispense with repack_deb.sh | a year ago | |
README.md | Revert "Resolves #19 Add simulation" | 7 months ago | |
TRAIN_INSTALL.md | Add prerequisites to training. | 4 months ago | |
discover_tests.py | Unittests for nn.py | a year ago | |
nanoccompiler.py | Support Mingw-w64 compiler on Windows (#11) | a year ago | |
nanonetcall.spec | Add pyinstaller spec | a year ago | |
setup.py | Fix #51 (#52) | 3 months ago | |
stdeb.cfg | Dispense with repack_deb.sh | a year ago |
README.md
(c)2016: Oxford Nanopore Technologies, Limited
Nanonet
Nanonet provides recurrent neural network basecalling for Oxford Nanopore MinION data. It represents the first generation of such a basecaller from Oxford Nanopore Technologies, and is provided as a technology demonstrator. Nanonet is provided unsupported by Oxford Nanopore Technologies, see LICENSE.md for more information.
For training networks, Nanonet leverages currennt to run recurrent neural networks. Currennt is generally run with GPUs to aid performance but can be run in a CPU only environment. The basecaller does not require currennt, and is written in pure python with minimal requirements.
Installation
Nanonet contains implementations of both 1D and 2D basecalling, with OpenCL versions of each of these. By default, using the instructions in this section only the canonical 1D basecalling library will be compiled; OpenCL acceleration of 1D basecalling and any 2D basecalling support will not be configured. See later sections of this documented for setting up these components.
Requirements
Nanonet is a python-based commandline suite consisting of several programs for performing basecalling with recurrent neural networks. It has been developed using 64-bit Python 2.7 available from python.org.
In addition to Python the only required dependencies are h5py and numpy. These can be downloaded and installed/compiled automatically, though installing them from your system's package repository is generally preferable in the first instance. For Windows Christophe Golke maintains a repository of compiled python wheels at:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
For OSX homebrew is easiest:
brew tap homebrew/science
brew install hdf5
Easy Windows Installation
Python wheels for nanonet are available from the github page under the releases tab:
https://github.com/nanoporetech/nanonet-dev/releases
to install these run the following from a command prompt:
pip install nanonet-2.0.0-cp27-cp27m-win_amd64.whl
The wheel currently enables CPU implementataions of both 1D and 2D calling, and additionally the OpenCL implementation of 1D calling. To enable the latter please refer to the OpenCL section below. Wheels for other platforms may be made available in the future.
Setup
The basecalling component of nanonet should install quite trivially using the standard python mechanism on most platforms:
python setup.py install
The basecaller contains small amounts of C code for performing event detection. A C compiler is therefore required. Under Linux and OSX, one is likely installed already, for Windows one can download and install the Microsoft Visual C++ Compiler for Python 2.7 from:
https://www.microsoft.com/en-gb/download/details.aspx?id=44266
Most of these documentation assume you are using this compiler on Windows. The section Compiling with MinGW-w64explains how to use GCC based compilation on Windows.
Optional Watcher Component
Nanonet contains an optional component to watch a filesystem for read files as they are produced by MinKnow. This feature is not installed by default. To install it run
pip install -e .[watcher]
from the source directory, or simply
pip install watchdog
for any location. This will allow use of the --watch
option of the basecaller.
Peforming basecalling
Nanonet provides a single program for basecalling Oxford Nanopore Technolgies' reads from .fast5 files. The output is to stdout as fasta formatted sequences.
If you followed the instructions above the program nanonetcall
should be on your path. It requires a single argument:
nanonetcall {input_folder} > {output.fasta}
To test your installation several .fast5 files are provided in the sample_data
folder of the source distribution, as a concrete example the command:
nanonetcall --jobs 5 --chemistry r9 sample_data > basecalls.fa
will produced output along the following lines:
Basecalled 5 reads (25747 bases, 49340 events) in 36.6668331623s (wall time)
Profiling
---------
Feature generation: 1.08410215378
Load network: 0.0128238201141
Run network: 15.6343309879 (1.5990450771 kb/s, 3.13412835112 kev/s)
Decoding: 19.8838129044 (1.25730412574 kb/s, 2.46431608644 kev/s)
Filesystem watching
Nanonet has the ability to watch a filesystem as reads are produced. This behaviour is enabled with the --watch
option:
nanonetcall input_folder --watch 600 > basecalls.fa
where the option value is a timeout in seconds, when no new reads are seen for this time nanonetcall will exit.
Input files
nanonetcall operates from single-read .fast5 files as output by MinKnow. These should contain raw data; the event detection step and segmentation into template and complement sections will be performed by nanonet.
Using multiple CPUs
By default nanonetcall will use a maximum of one CPU. This can be altered through use of the --jobs
option. In using this option be aware that higher numbers will lead to increased memory usage.
2D Basecalling
The 2D basecalling library has a dependency on the boost C++ library. For this reason it is not compiled by default. If you know that you have a working boost installation you can simply use the following to enable 2D basecalling:
python setup.py install with2d
Boost can be installed on most Linux systems via the system package manager. For example on Ubuntu it should be sufficient to install the boost-python package:
sudo apt-get install libboost-python1.54-dev libboost-python1.54.0
You may elect to use a different version of boost if you wish. On OSX boost can again use homebrew:
brew install boost --with-python
brew install boost-python
On Windows the simplest method to install boost is to obtain a precompiled version from sourceforge:
https://sourceforge.net/projects/boost/files/boost-binaries/
nanonet has been tested with version 1.55.0 on windows. It is important to use the version compiled with the same version of the Microsoft Visual C compiler you are using. If you followed the instructions above to install the Microsoft Visual C++ Compiler for Python 2.7 you should download the package labelled msvc-9.0-64.exe
, i.e. the file available here:
https://sourceforge.net/projects/boost/files/boost-binaries/1.55.0/boost_1_55_0-msvc-9.0-64.exe/download
The above installer will by default install boost to c:\local\boost_1_55_0
. If you change this path you will also need to edit the setup.py
file in nanonet. To run the 2D basecaller you will need also to add the library location to your Path
environment variable. In Windows Powershell this can be done with:
$env:Path += ";c:\local\boost_1_55_0\lib64-msvc-9.0"
Once you have installed boost on your OS, the 2D basecalling components can be compiled and set up with:
python setup.py install with2d
Performing 2D basecalling currently requires use of a distinct program from the pure 1D basecaller. The interface of this program is much the same as nanonetcall
, for example a basic use would simply require:
nanonet2d --chemistry r9 sample_data calls
The second option here specifies a prefix for output fasta files; three files will be created: one each for template, complement and 2D basecalls.
OpenCL Support
Nanonet contains OpenCL accelerated versions of both 1D and 2D basecalling. Currently the implementations of these use different mechanisms for creating the OpenCL kernels: 1D basecalling acceleration is driven through pyopencl
whilst 2D acceleration directly interfaces with OpenCL libraries.
Configuring a working OpenCL environment depends heavily on your OS and device you wish to target. If you are unfamiliar with how to do this you are recommended to start with the pyopencl
documentation:
https://wiki.tiker.net/PyOpenCL
1D Acceleration
Once you have the pyopencl
examples working you should be able to run the accelerated version of 1D basecalling:
nanonetcall --chemistry r9 --platforms <VENDOR:DEVICE:1> --exc_opencl sample_data
where VENDOR
and DEVICE
will be machine dependent. To see available devices you can examine the output of:
nanonetcall --list_platforms
The --exc_opencl
option above instructs nanonet to use only the OpenCL device(s) listed on the command line. Without this option, CPU resources will also be used. You may wish to experiment with this option and the --jobs
option to achieve optimal throughput.
2D Acceleration
To enable OpenCL acceration on 2D basecalling it is necessary to compile additional components during the setup of nanonet by using the opencl2d
argument:
python setup.py install opencl2d
With a working OpenCL runtime and development environment 2D basecalling can be accelerated by simply adding and option on the commandline:
nanonet2d --opencl_2d --chemistry r9 sample_data calls
The program will automatically choose an OpenCL device to use, giving preference to GPU devices over CPU ones. It is not currently possible to use OpenCL acceleration for the 1D basecalling necessary for performing a 2D basecall.
Compiling with MinGW-w64
The Microsoft Visual C++ Compiler for Python 2.7 compiler is fairly old and can produce less than efficient code compared to more recent compilers. The following describes how one may instead compile nanonet using MinGW. This process is not recommended for those unfamiliar with compilation toolchains. We use specific, but not the latest, versions of MinGW and Boost in what follows. Be aware that these instructions may not work with later versions of these tools.
First download the following build of MinGW-w64:
https://sourceforge.net/projects/mingwbuilds/files/host-windows/releases/4.8.1/64-bit/threads-posix/seh/x64-4.8.1-release-posix-seh-rev5.7z/download
together with Boost 1.55.0:
https://sourceforge.net/projects/boost/files/boost/1.55.0/boost_1_55_0.7z/download
The files are packaged with 7zip, which you should install if you haven't previously. Extract both of the packages to:
C:\local\
Open Windows Powershell and run the following to build the boost-python library:
$env:Path += ';C:\local\mingw64\bin'
cd c:\local\boost_1_55_0\
.\bootstrap.bat gcc
.\b2.exe toolset=gcc address-model=64 link=shared define=MS_WIN64 --with-python stage
The distutils
module of Python 2.7 supports using MinGW as a compiler, although unfortunately it is likely to cause .dlls to be incorrectly linked when using MinGW-w64. For this reason and to streamline the build process, nanonet patches the class distutils.cygwinccompiler.Mingw32CCompiler
with its own version and manipulates some distutils
internals in order to force use of this class when a users elects to use MinGW-w64. All this means is that to compile and setup an in-place (development) install of nanonet run:
python setup.py develop --user with2d mingw
If you have and existing MinGW-w64 setup you can use the environment variables BOOST_ROOT
, BOOST_LIB
, and BOOST_PYTHON
to specify respectively the location of you boost install, relative location of the boost libraries, and the name of your boost-python library. For example the defaults are:
BOOST_ROOT = "c:\local\boost_1_55_0\"
BOOST_LIB = "stage\lib"
BOOST_PYTON = "boost_python-mgw48-mt-1_55"
Benchmarks
The results below are indicative and provided for reference only. The performance of nanonet should not be assumed to be equal to that of a production implementation.
AWS g2.2xlarge Ubuntu 16.04 Xeon E5-2670 @ 2.60GHz
- numpy is linked to a multithreaded blas, it appears best to set the number of threads to 1.
- 1D performance scales nicely with number of CPU processes
- 1D GPU performance is rough 4.5X for single concurrent kernel or approaching 6.5X for ten concurrent kernels.
- 2D GPU performance is 7X.
1D basecalling
$ nanonetcall more_sample_data/ > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 39.9055719376s (wall time)
Run network: 23.59 ( 1.278 kb/s, 2.653 kev/s)
Decoding: 14.53 ( 2.076 kb/s, 4.308 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 30.2339270115s (wall time)
Run network: 13.78 ( 2.189 kb/s, 4.544 kev/s)
Decoding: 14.74 ( 2.046 kb/s, 4.247 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --jobs 4 > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 7.72924900055s (wall time)
Run network: 13.84 ( 2.180 kb/s, 4.525 kev/s)
Decoding: 15.00 ( 2.011 kb/s, 4.174 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --platforms nvidia:0:1 --exc_opencl > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 8.73325586319s (wall time)
Run network: 3.83 ( 7.868 kb/s, 16.331 kev/s)
Decoding: 1.54 (19.623 kb/s, 40.730 kev/s)
2D basecalling
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls
Processed 20 reads in 214.572195053s (wall time)
Template Run network: 14.01 ( 2.153 kb/s, 4.469 kev/s)
Template Decoding: 15.76 ( 1.914 kb/s, 3.973 kev/s)
2D calling: 156.85 ( 0.188 kb/s)
$
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls --opencl_2d
Processed 20 reads in 78.750191927s (wall time)
Template Run network: 13.92 ( 2.167 kb/s, 4.498 kev/s)
Template Decoding: 15.50 ( 1.946 kb/s, 4.038 kev/s)
2D calling: 21.72 ( 1.355 kb/s)
Intel NUC5i7RYH Ubuntu 14.04 i7-5557U CPU @ 3.10GHz
- Again CPU performance better with single threaded blas
- 1D CPU marginally better than AWS machine, though relative speed of network and coding altered.
- Two physical cores: performance with 4 processes not significantly faster than 2.
- 1D Iris 6100 roughly 3.5X over CPU core. Concurrent kernel execution is not handled well.
- 2D GPU 5X over CPU.
1D basecalling
$ nanonetcall more_sample_data/ > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 34.7250430584s (wall time)
Run network: 16.35 ( 1.845 kb/s, 3.829 kev/s)
Decoding: 16.26 ( 1.855 kb/s, 3.851 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 28.1761329174s (wall time)
Run network: 10.19 ( 2.960 kb/s, 6.143 kev/s)
Decoding: 16.13 ( 1.870 kb/s, 3.881 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --jobs 2 >/dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 14.3247690201s (wall time)
Run network: 10.38 ( 2.907 kb/s, 6.033 kev/s)
Decoding: 16.11 ( 1.872 kb/s, 3.885 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --jobs 4 >/dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 12.9280588627s (wall time)
Run network: 18.68 ( 1.615 kb/s, 3.352 kev/s)
Decoding: 29.16 ( 1.034 kb/s, 2.147 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --platforms intel:0:1 --exc_opencl > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 10.3367981911s (wall time)
Run network: 5.54 ( 5.444 kb/s, 11.299 kev/s)
Decoding: 2.88 (10.474 kb/s, 21.739 kev/s)
2D basecalling
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls
Processed 20 reads in 180.179665089s (wall time)
Template Run network: 10.15 ( 2.970 kb/s, 6.165 kev/s)
Template Decoding: 16.59 ( 1.818 kb/s, 3.773 kev/s)
2D calling: 127.52 ( 0.231 kb/s)
$
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls --opencl_2d > /dev/null
Processed 20 reads in 80.0777461529s (wall time)
Template Run network: 10.15 ( 2.973 kb/s, 6.170 kev/s)
Template Decoding: 16.73 ( 1.803 kb/s, 3.742 kev/s)
2D calling: 26.19 ( 1.124 kb/s)
E5-2650 v4 @ 2.20GHz (x2, 24 cores total) GTX 1080 Ubuntu 14.04
- Timings likely a little off as GPU used by other processes
- Utilise very little (<200Mb) of GPU memory
1D basecalling
$ nanonetcall more_sample_data/ --platforms nvidia:0:20 --exc_opencl > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 3.88349485397s (wall time)
Run network: 1.32 (22.829 kb/s, 47.384 kev/s)
Decoding: 0.91 (33.268 kb/s, 69.051 kev/s)
2D basecalling
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls --opencl_2d
Processed 20 reads in 56.9682049751s (wall time)
Template Run network: 10.10 ( 2.986 kb/s, 6.199 kev/s)
Template Decoding: 11.60 ( 2.601 kb/s, 5.398 kev/s)
2D calling: 13.30 ( 2.214 kb/s)
$
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls --opencl_2d --jobs 4
Processed 20 reads in 15.9661970139s (wall time)
Template Run network: 10.19 ( 2.959 kb/s, 6.141 kev/s)
Template Decoding: 11.74 ( 2.570 kb/s, 5.334 kev/s)
2D calling: 15.96 ( 1.844 kb/s)
Training a network
The package provides also an interface to currennt for training networks from .fast5 files. The type of neural networks implemented by currennt require labelled input data. Nanonet does not provide a method for labelling data. It does however provide a hook to create labelled data. To install currennt please refer to the full installation instructions in TRAIN_INSTALL.md.
To run the trainer we specify training data, and validation data. The former will be used to train the network whilst the latter is used to check that the network does not become overtrained to the training data.
nanonettrain --train <training_data> --val <validation_data> \
--output <output_model_prefix> --model <input_model_spec>
An example input model can be found in nanonet/data/default_model.tmpl
. The only other consideration is that the size of the first ("input") layer of the network must correspond to the feature vectors created bynanonet.features.events_to_features
. The nanonettrain program will try to enforce these considerations. In contructing models one should assign the input layer a size of <n_features>
and the final two layers as <n_states>
, as in the example.
Training is an intensive process, even on a GPU expect it to take hours not minutes. It is not recommended to attempt training models without GPU support.
Trouble Shooting
If you performed a user install (setup.py install --user
) the nanonetcall
program may not be on your path. This is because the location into which setuptools installs programs for users is not often a default item in the user's path. On OSX the location is typically:
/Users/<username>/Library/Python/2.7/bin/
whilst on Ubuntu it is:
~/.local/bin
If nanonet
complains that it cannot locate the currennt
executable you will need to set the CURRENNT
environment variable to the location of the executable.
nanoporetech/nanonet的更多相关文章
- AI佳作解读系列(四)——数据增强篇
前言 在深度学习的应用过程中,数据的重要性不言而喻.继上篇介绍了数据合成(个人认为其在某种程度上可被看成一种数据增强方法)这个主题后,本篇聚焦于数据增强来介绍几篇杰作! (1)NanoNets : H ...
随机推荐
- ACM程序设计选修课——Problem E:(ds:图)公路村村通(优先队列或sort+克鲁斯卡尔+并查集优化)
畅通工程 Time Limit: 1000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others) Total Submi ...
- HDU——1405The Last Practice(试手map)
The Last Practice Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others ...
- BZOJ3930 [CQOI2015]选数 【容斥】
题目 我们知道,从区间[L,H](L和H为整数)中选取N个整数,总共有(H-L+1)^N种方案.小z很好奇这样选出的数的最大公约数的规律,他决定对每种方案选出的N个整数都求一次最大公约数,以便进一步研 ...
- ECharts学习总结(二)-----图表组件漏斗图(funnel)
今天在学习ECharts时,想要在ECharts图表的原生态Demo中抠出漏斗图,却不知如何下手,经过一番研究,特总结如下: 首先我们需要这样做 1.拷贝出两个js文件:esl.js 和echarts ...
- [USACO15FEB]Superbull (最小生成树)
题目链接 Solution 基本上就是个板子. 因为 \(n\) 很小,只有 \(2000\),所以直接暴力建图,然后跑最小生成树就好了. Code #include<bits/stdc++.h ...
- 马士兵hadoop第二课:hdfs集群集中管理和hadoop文件操作(转)
马士兵hadoop第一课:虚拟机搭建和安装hadoop及启动 马士兵hadoop第二课:hdfs集群集中管理和hadoop文件操作 马士兵hadoop第三课:java开发hdfs 马士兵hadoop第 ...
- C语言第三题
1.完成char *p,char *q 的字符串比较,相等返回1,错误返回-1 2.用户需要计算的2个20位数数字的加法,写程序实现让用户输入两个20位以上的数字并且相加,输出结果
- Java面试进阶部分集合
框架基础 反射:反射是Java开发的一类动态相关机制.因为本身Java语言并不是一款动态语言,如果我们想要得到程序动态的效果,因此便引入了反射机制这一概念. 怎么表达反射? 能用反射做什么? PS:某 ...
- XSD(XML Schema Definition)学习笔记
今天学习了XSD相关的知识,为了以后查找的方便,写一些笔记. 一.什么是XSD? 1.XSD全称:XML Schema Definition.XML Schema 的作用是定义 XML 文档的合法构建 ...
- python 之 处理excel表的xlwt模块学习记录
python 操作excel表的常用模块主要有2个: 1:xlrd:读取excel表 2:xlwt:创建并写入excel表 安装方法: 可以直接下载安装:https://pypi.python.org ...