Exploring Bulk Data Import into MySQL

I recently ran into a problem at work: how to import a large amount of data (100 MB+) into a remote MySQL server.

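Attempt 1:

　　Use Statement with executeBatch, importing 1,000 records per batch. This took about 12 s per 1,000 records, which is slow.

　　For 1M inserts this means 12,000 s, or roughly 3 hours 20 minutes. Along the way, network conditions, database load, and similar factors can push the load time up to 85 s per 1,000 records or even higher.

　　The result is poor.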
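A minimal sketch of what Attempt 1 might look like, together with the arithmetic behind the time estimate. The class and method names (StatementBatchSketch, insertAll, estimateSeconds) are my own, and the INSERT strings are assumed to be built elsewhere; only the 1,000-row batch size and the 12 s and 85 s measurements come from the text.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

final class StatementBatchSketch {
    static final int BATCH_SIZE = 1000; // batch size used in Attempt 1

    /** Total seconds for totalRows, given a measured time per batch. */
    static long estimateSeconds(long totalRows, int batchSize, long secondsPerBatch) {
        long batches = (totalRows + batchSize - 1) / batchSize; // round up
        return batches * secondsPerBatch;
    }

    /** Sends pre-built INSERT statements, flushing every BATCH_SIZE statements. */
    static void insertAll(Connection conn, List<String> insertStatements) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            int pending = 0;
            for (String sql : insertStatements) {
                stmt.addBatch(sql);
                if (++pending == BATCH_SIZE) {
                    stmt.executeBatch(); // one round of 1,000 statements
                    pending = 0;
                }
            }
            if (pending > 0) {
                stmt.executeBatch(); // flush the final partial batch
            }
        }
    }

    public static void main(String[] args) {
        // 1M rows at the two measured per-batch times from the text.
        System.out.println(estimateSeconds(1_000_000, BATCH_SIZE, 12) + " s at 12 s per batch");
        System.out.println(estimateSeconds(1_000_000, BATCH_SIZE, 85) + " s at 85 s per batch");
    }
}
```

At 85 s per batch the same 1M rows would take 85,000 s, which is close to a full day.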

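Attempt 2:

　　Use PreparedStatement. This approach requires the "shape" of the insert statement to be given up front.

　　Measured throughput with this approach alone was a few dozen rows per second.

　　Note: after setting rewriteBatchedStatements to true, all 780,000 rows were imported in under a minute. This approach works well.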

　　Code:

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.PreparedStatement;

 public class PreparedStatementTestMain {
     public static void main(String[] args) {
         // rewriteBatchedStatements=true is the Connector/J switch described in the note above.
         String url = "jdbc:mysql://remote-host/test?user=xxx&password=xxx&rewriteBatchedStatements=true";
         String sql = "insert into test values(?,?,?,?,?,?,?,?,?,?,?)";
         try {
             Class.forName("com.mysql.jdbc.Driver");
             // try-with-resources closes the connection, statement, and reader even on failure.
             try (Connection conn = DriverManager.getConnection(url);
                  PreparedStatement ps = conn.prepareStatement(sql);
                  BufferedReader in = new BufferedReader(new FileReader("xxxx"))) {
                 String line;
                 int count = 0;
                 while ((line = in.readLine()) != null) {
                     count += 1;
                     // A limit of -1 keeps trailing empty fields.
                     String[] values = line.split("\t", -1);
                     // values[0] is skipped: parameter i receives values[i], so each line
                     // must hold 12 tab-separated fields to fill all 11 placeholders.
                     for (int i = 1; i < values.length; i++) {
                         ps.setString(i, values[i]);
                     }
                     ps.addBatch();
                     System.out.println("Line " + count);
                 }
                 // Send everything that was queued with addBatch().
                 ps.executeBatch();
             }
         } catch (Exception e) {
             e.printStackTrace();
         }
     }
 }
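The listing queues every line of the file and calls executeBatch once at the end. For very large files it may be safer to flush in chunks so the pending batch stays bounded. The following is only a sketch under assumptions: ChunkedBatchLoader, connectionProps, load, and the chunkSize parameter are hypothetical names of mine, and unlike the listing it binds row[i] to parameter i+1 without skipping the first field.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Properties;

final class ChunkedBatchLoader {
    /** Connection properties with batch rewriting switched on. */
    static Properties connectionProps(String user, String password) {
        Properties p = new Properties();
        p.setProperty("user", user);
        p.setProperty("password", password);
        p.setProperty("rewriteBatchedStatements", "true");
        return p;
    }

    /** Inserts rows, sending the pending batch every chunkSize rows. Returns rows sent. */
    static int load(String url, Properties props, String insertSql,
                    Iterable<String[]> rows, int chunkSize) throws SQLException {
        int sent = 0;
        try (Connection conn = DriverManager.getConnection(url, props);
             PreparedStatement ps = conn.prepareStatement(insertSql)) {
            conn.setAutoCommit(false); // commit once per chunk instead of once per row
            int pending = 0;
            for (String[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setString(i + 1, row[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();
                if (++pending == chunkSize) {
                    ps.executeBatch();
                    conn.commit();
                    sent += pending;
                    pending = 0;
                }
            }
            if (pending > 0) { // final partial chunk
                ps.executeBatch();
                conn.commit();
                sent += pending;
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        Properties p = connectionProps("xxx", "xxx");
        System.out.println("rewriteBatchedStatements=" + p.getProperty("rewriteBatchedStatements"));
    }
}
```

Passing the switch through Properties rather than the URL is just a design choice; either form should reach the driver.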

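Attempt 3:

　　Use the mysqlimport tool. In testing, its speed was close to that of Attempt 2 with rewriteBatchedStatements enabled. The prerequisite is that the data must first be saved to a file.

Another idea:

　　Multi-threaded inserts.

　　Test: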
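mysqlimport is a command-line front end to the LOAD DATA statement, so the same file-based load can also be issued over JDBC. A rough sketch, assuming a tab-separated file on the client machine, a server that permits local_infile, and a Connector/J version that honors the allowLoadLocalInfile property. LoadDataSketch and its method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

final class LoadDataSketch {
    /** Builds a LOAD DATA LOCAL INFILE statement for a tab-separated file. */
    static String loadDataSql(String filePath, String table) {
        // Keep the sketch simple: refuse inputs that would need SQL escaping.
        if (filePath.indexOf('\'') >= 0 || filePath.indexOf('\\') >= 0) {
            throw new IllegalArgumentException("path must not contain quotes or backslashes");
        }
        if (!table.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("unexpected table name: " + table);
        }
        return "LOAD DATA LOCAL INFILE '" + filePath + "' INTO TABLE " + table
                + " FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'";
    }

    /** Runs the load on a fresh connection; needs a reachable server, so main does not call it. */
    static void load(String url, Properties props, String filePath, String table)
            throws SQLException {
        Properties p = new Properties();
        p.putAll(props);
        p.setProperty("allowLoadLocalInfile", "true"); // assumption: required by the driver in use
        try (Connection conn = DriverManager.getConnection(url, p);
             Statement stmt = conn.createStatement()) {
            stmt.execute(loadDataSql(filePath, table));
        }
    }

    public static void main(String[] args) {
        // Prints the statement only; the path here is a made-up example.
        System.out.println(loadDataSql("/tmp/test.txt", "test"));
    }
}
```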

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TestMultiThreadInsert {
    private static final String dbClassName = "com.mysql.jdbc.Driver";
    private static final String CONNECTION = "jdbc:mysql://host/";
    private static final String USER = "x";
    private static final String PASSWORD = "xxx";
    private static final int THREAD_NUM = 10;

    private static Connection connect() throws SQLException {
        Properties p = new Properties();
        p.put("user", USER);
        p.put("password", PASSWORD);
        return DriverManager.getConnection(CONNECTION, p);
    }

    private static void executeSQL(Connection conn, String sql) throws SQLException {
        // Close the statement after each call so handles do not accumulate.
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
        }
    }

    // Creates the test table if needed and empties it before each run.
    private static void ResetEnvironment() throws SQLException {
        try (Connection conn = connect()) {
            for (String query : new String[] {
                    "USE test",
                    "CREATE TABLE IF NOT EXISTS  MTI (ID INT AUTO_INCREMENT PRIMARY KEY,MASSAGE VARCHAR(9) NOT NULL)",
                    "TRUNCATE TABLE MTI"
            }) {
                executeSQL(conn, query);
            }
        }
    }

    // Each worker opens its own connection and inserts one row at a time until interrupted.
    private static void worker() {
        String value = "hello";
        try (Connection conn = connect()) {
            executeSQL(conn, "USE test");
            while (!Thread.interrupted()) {
                executeSQL(conn, String.format("INSERT INTO MTI VALUES (NULL,'%s')", value));
                System.out.println("Inserting " + value + " finished.");
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws ClassNotFoundException, SQLException, InterruptedException {
        Class.forName(dbClassName);
        ResetEnvironment();
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_NUM);
        for (int i = 0; i < THREAD_NUM; i++) {
            executor.submit(new Runnable() {
                public void run() {
                    worker();
                }
            });
        }
        // Let the workers run for 20 seconds, then interrupt them.
        Thread.sleep(20000);
        executor.shutdownNow();
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            System.err.println("Pool did not terminate");
        }
    }
}

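　　20 threads, each inserting one row at a time: 2,923 rows inserted in 20 seconds;

　　10 threads, each inserting one row at a time: 1,699 rows inserted in 20 seconds;

　　1 thread inserting one row at a time: 330 rows in 20 seconds.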
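To make these three measurements easier to compare, the small calculation below converts them into rows per second and into speedup over the single-thread run. Only the row counts and the 20-second window come from the test; the class and method names are mine.

```java
final class ThroughputMath {
    /** Rows inserted per second over the measurement window. */
    static double rowsPerSecond(int rows, int seconds) {
        return (double) rows / seconds;
    }

    /** How many times faster a run was than the baseline run. */
    static double speedup(int rows, int baselineRows) {
        return (double) rows / baselineRows;
    }

    public static void main(String[] args) {
        int[] threads = {1, 10, 20};
        int[] rows = {330, 1699, 2923}; // measured over 20 seconds each
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%d thread(s): %.2f rows/s, speedup %.2fx%n",
                    threads[i], rowsPerSecond(rows[i], 20), speedup(rows[i], rows[0]));
        }
    }
}
```

The speedups come out at roughly 5.1x for 10 threads and 8.9x for 20 threads, so throughput grows with the thread count but clearly less than linearly.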

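　　Prediction: combining multiple threads with PreparedStatement should increase insert throughput further.

　　However, multi-threaded inserts inevitably raise one issue: write locks.

　　The program above does show that multi-threaded inserts are feasible, but what is the logic behind them? This deserves a closer look.

　　In the code above, multiple threads correspond to multiple connections (see https://dev.mysql.com/doc/refman/5.5/en/connection-threads.html). Multithreading mainly increases the rate at which commands are submitted, rather than adding execution threads. How the inserts are actually executed depends on how InnoDB (the storage engine currently in use) handles data insertion.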
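The prediction about combining threads with PreparedStatement was not tested in this post. A sketch of how it could be tried follows: the rows are split into slices, and each slice is inserted as one batch over its own connection, matching the one-connection-per-thread model described here. ParallelBatchInsert, partition, insertSlice, and insertAll are hypothetical names, and the MTI table is the one from the test program.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelBatchInsert {
    /** Splits rows into at most `parts` contiguous slices of near-equal size. */
    static <T> List<List<T>> partition(List<T> rows, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<List<T>> slices = new ArrayList<>();
        int size = (rows.size() + parts - 1) / parts; // ceiling division
        for (int from = 0; from < rows.size(); from += size) {
            slices.add(rows.subList(from, Math.min(from + size, rows.size())));
        }
        return slices;
    }

    /** Inserts one slice as a single batch over its own connection. */
    static int insertSlice(String url, Properties props, List<String> slice) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, props);
             PreparedStatement ps = conn.prepareStatement("INSERT INTO MTI VALUES (NULL, ?)")) {
            for (String message : slice) {
                ps.setString(1, message);
                ps.addBatch();
            }
            ps.executeBatch();
            return slice.size();
        }
    }

    /** Runs one insertSlice task per slice and waits for all of them. */
    static void insertAll(String url, Properties props, List<String> messages, int threads)
            throws InterruptedException, ExecutionException {
        List<List<String>> slices = partition(messages, threads);
        if (slices.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(slices.size());
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> slice : slices) {
                Callable<Integer> task = () -> insertSlice(url, props, slice);
                results.add(pool.submit(task));
            }
            for (Future<Integer> f : results) {
                f.get(); // rethrows any failure from a worker
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this actually beats a single batched connection would have to be measured; the lock behavior discussed in this post may limit the gain.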

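　　To answer this question, I searched and found these possibly related keywords:

　　1. io_threads (https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-multiple_io_threads.html),

　　2. locks,

　　3. insert buffer (https://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-performance-change_buffering.html).

　　For the insert buffer, understanding the following sentence is key:

　　InnoDB uses the insert buffer to "trick" the database: for non-unique secondary indexes, modifications are not applied to the index leaf pages in real time. Instead, several updates to the same page are buffered and merged into a single update, turning random I/O into sequential I/O. This avoids the performance cost of random I/O and improves the database's write performance.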
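The idea of merging updates that hit the same page can be illustrated with a toy model. This is not how InnoDB is implemented, only a counting exercise: keys are mapped to pages by integer division (an assumption of mine, as is the figure of 100 keys per page), and the model compares one page write per change with one merged write per touched page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InsertBufferToy {
    static final int KEYS_PER_PAGE = 100; // hypothetical page capacity

    /** Leaf page that would hold the given non-negative key. */
    static int pageOf(int key) {
        return key / KEYS_PER_PAGE;
    }

    /** Without buffering, every change touches its leaf page immediately. */
    static int pageWritesUnbuffered(int[] keys) {
        return keys.length;
    }

    /** With buffering, changes are grouped by page and each page is written once. */
    static int pageWritesBuffered(int[] keys) {
        Map<Integer, List<Integer>> pending = new TreeMap<>(); // sorted by page number
        for (int key : keys) {
            pending.computeIfAbsent(pageOf(key), p -> new ArrayList<>()).add(key);
        }
        return pending.size(); // one merged write per touched page
    }

    public static void main(String[] args) {
        int[] keys = {5, 950, 17, 903, 42, 999}; // alternates between page 0 and page 9
        System.out.println("unbuffered page writes: " + pageWritesUnbuffered(keys));
        System.out.println("buffered page writes:   " + pageWritesBuffered(keys));
    }
}
```

In this example six scattered changes collapse into two page writes, and because the pending map is sorted by page number the merged writes go out in page order.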

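　　To understand that sentence, you first need to know what data structure InnoDB uses to store data.

　　B+Tree!

　　There are plenty of explanations of B+Tree online, so I will not repeat them here.

　　As for io_threads, multiple threads are used to handle reads and writes of data pages.

　　One question still has not been answered: locks!

　　(Official definition) locking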

　　The system of protecting a transaction from seeing or changing data that is being queried or changed by other transactions. The locking strategy must balance reliability and consistency of database operations (the principles of the ACID philosophy) against the performance needed for good concurrency. Fine-tuning the locking strategy often involves choosing an isolation level and ensuring all your database operations are safe and reliable for that isolation level.

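To improve read performance, InnoDB implements its own read-write lock (rw_lock). Its design principles are:
1. Multiple threads may read a variable in memory at the same time.
2. Only one thread may modify a variable in memory at any given time.
3. While any thread is reading the variable, no thread may write to it.
4. While a thread is modifying the variable, no thread may read it, and no thread other than the writer itself may write (the owning thread may acquire the lock recursively).
5. If the rw_lock is held in read mode and a writer is waiting, any later read request must wait until that earlier write has completed.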
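Rules 1 to 4 can be observed with Java's ReentrantReadWriteLock, used here purely as a stand-in for InnoDB's rw_lock (the two are different implementations). The sketch holds the lock on the calling thread and lets a second thread try to acquire it without waiting. Rule 5, where readers queue behind a waiting writer, is not demonstrated. All names are mine.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.BooleanSupplier;

final class RwLockRulesDemo {
    /** Tries to take the lock without waiting; releases it again on success. */
    static boolean canAcquire(Lock lock) {
        if (lock.tryLock()) {
            lock.unlock();
            return true;
        }
        return false;
    }

    /** Runs the attempt on a separate thread and returns its result. */
    static boolean fromOtherThread(BooleanSupplier attempt) {
        boolean[] result = new boolean[1];
        Thread t = new Thread(() -> result[0] = attempt.getAsBoolean());
        t.start();
        try {
            t.join(); // join makes the other thread's write to result visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    /** While this thread reads: {another thread can read, another thread can write}. */
    static boolean[] whileReading() {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        rw.readLock().lock();
        try {
            return new boolean[] {
                fromOtherThread(() -> canAcquire(rw.readLock())),  // rule 1
                fromOtherThread(() -> canAcquire(rw.writeLock()))  // rule 3
            };
        } finally {
            rw.readLock().unlock();
        }
    }

    /** While this thread writes: {other can read, other can write, owner can write again}. */
    static boolean[] whileWriting() {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        rw.writeLock().lock();
        try {
            return new boolean[] {
                fromOtherThread(() -> canAcquire(rw.readLock())),  // rule 4: no readers
                fromOtherThread(() -> canAcquire(rw.writeLock())), // rule 2: one writer
                canAcquire(rw.writeLock())                         // rule 4: recursive hold
            };
        } finally {
            rw.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(whileReading()));
        System.out.println(java.util.Arrays.toString(whileWriting()));
    }
}
```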

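　　Given these locks, how can multi-threaded writes still be used to improve efficiency?

　　One angle: make handing a mutex from one thread to the next more efficient!

　　How?

References: http://www.2cto.com/database/201411/352586.html

　　https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-latching.html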

　　　　On many platforms, Atomic operations can often be used to synchronize the actions of multiple threads more efficiently than Pthreads. Each operation to acquire or release a lock can be done in fewer CPU instructions, wasting less time when threads contend for access to shared data structures. This in turn means greater scalability on multi-core platforms.

　　　　On platforms where the GCC, Windows, or Solaris functions for atomic memory access are not available, InnoDB uses the traditional Pthreads method of implementing mutexes and read/write locks.
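As a loose analogy in Java (not InnoDB code), the two counters below are both safe under concurrent updates. One relies on an atomic increment, the other on a mutual-exclusion block, which stands in for the Pthreads-style approach in the quote. The sketch only shows that both give the correct total; it makes no claim about their relative speed. All names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

final class AtomicVsLockCounter {
    /** Counts with an atomic increment: no lock is taken. */
    static long countWithAtomic(int threads, int perThread) throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                counter.incrementAndGet();
            }
        });
        return counter.get();
    }

    /** Counts under a monitor: each increment acquires and releases a lock. */
    static long countWithLock(int threads, int perThread) throws InterruptedException {
        long[] counter = new long[1];
        Object monitor = new Object();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                synchronized (monitor) {
                    counter[0]++;
                }
            }
        });
        synchronized (monitor) {
            return counter[0];
        }
    }

    /** Starts the given number of threads running body and waits for all of them. */
    private static void runAll(int threads, Runnable body) throws InterruptedException {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(body);
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("atomic: " + countWithAtomic(4, 10_000));
        System.out.println("lock:   " + countWithLock(4, 10_000));
    }
}
```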

　　mutex

Informal abbreviation for "mutex variable". (Mutex itself is short for "mutual exclusion".) The low-level object that InnoDB uses to represent and enforce exclusive-access locks to internal in-memory data structures. Once the lock is acquired, any other process, thread, and so on is prevented from acquiring the same lock. Contrast with rw-locks, which InnoDB uses to represent and enforce shared-access locks to internal in-memory data structures. Mutexes and rw-locks are known collectively as latches.

rw-lock

The low-level object that InnoDB uses to represent and enforce shared-access locks to internal in-memory data structures following certain rules. Contrast with mutexes, which InnoDB uses to represent and enforce exclusive access to internal in-memory data structures. Mutexes and rw-locks are known collectively as latches.

rw-lock types include s-locks (shared locks), x-locks (exclusive locks), and sx-locks (shared-exclusive locks).

An s-lock provides read access to a common resource.
An x-lock provides write access to a common resource while not permitting inconsistent reads by other threads.
An sx-lock provides write access to a common resource while permitting inconsistent reads by other threads. sx-locks were introduced in MySQL 5.7 to optimize concurrency and improve scalability for read-write workloads.

The following matrix summarizes rw-lock type compatibility.

      S           SX          X
S     Compatible  Compatible  Conflict
SX    Compatible  Conflict    Conflict
X     Conflict    Conflict    Conflict
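The matrix is small enough to encode directly. The sketch below stores it as a lookup table so that a held mode and a requested mode can be checked for compatibility; RwLockCompat and its members are illustrative names only.

```java
final class RwLockCompat {
    enum Mode { S, SX, X }

    // Rows are the held mode, columns the requested mode, in the order S, SX, X.
    private static final boolean[][] COMPATIBLE = {
        // S      SX     X
        { true,  true,  false }, // S
        { true,  false, false }, // SX
        { false, false, false }, // X
    };

    /** True if a lock in mode `requested` can coexist with one held in mode `held`. */
    static boolean compatible(Mode held, Mode requested) {
        return COMPATIBLE[held.ordinal()][requested.ordinal()];
    }

    public static void main(String[] args) {
        for (Mode held : Mode.values()) {
            for (Mode requested : Mode.values()) {
                System.out.printf("%-2s vs %-2s: %s%n", held, requested,
                        compatible(held, requested) ? "Compatible" : "Conflict");
            }
        }
    }
}
```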

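Addendum:

Why does rewriteBatchedStatements improve speed so much?

　　One explanation: it lets multiple MySQL insert statements be packed into a single packet when communicating with the MySQL server, which greatly reduces network overhead.

　　Another explanation: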

　　Rewriting Batches

　　 “rewriteBatchedStatements=true”

　 Affects (Prepared)Statement.add/executeBatch()

　　 Core concept - remove latency

　　 Special treatment for prepared INSERT statements

　　——Mark Matthews - Sun Microsystems
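As far as I understand the Connector/J documentation, the "special treatment" for prepared INSERT statements means a batch of single-row inserts is rewritten into multi-row VALUES form, so many rows travel in one statement. The sketch below builds such a statement by hand just to show its shape. It is not the driver's code, and MultiRowInsertSketch and multiRowInsert are hypothetical names.

```java
final class MultiRowInsertSketch {
    /** Builds "INSERT INTO t VALUES (?,?),(?,?)..." for rowCount rows of columnCount parameters. */
    static String multiRowInsert(String table, int columnCount, int rowCount) {
        if (columnCount < 1 || rowCount < 1) {
            throw new IllegalArgumentException("columnCount and rowCount must be >= 1");
        }
        // One parenthesized group of placeholders, e.g. "(?,?,?)".
        StringBuilder group = new StringBuilder("(");
        for (int c = 0; c < columnCount; c++) {
            if (c > 0) {
                group.append(',');
            }
            group.append('?');
        }
        group.append(')');

        // Repeat the group once per row, separated by commas.
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table).append(" VALUES ");
        for (int r = 0; r < rowCount; r++) {
            if (r > 0) {
                sql.append(',');
            }
            sql.append(group);
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // Three rows of the 11-column table used in Attempt 2.
        System.out.println(multiRowInsert("test", 11, 3));
    }
}
```

With 1,000 rows per statement, the server receives one statement where it would otherwise receive 1,000, which fits the "remove latency" point in the slide.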

PreparedStatement vs. Statement

　　The database system precompiles the SQL statement (if the JDBC driver supports it). Once precompiled, the statement can be reused in later executions.
