Deduplicating tens of millions of rows: MySQL deduplication on 300+ million rows
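About 360 million rows need to be deduplicated. Because the data volume is so large, the approach is:
Load the data into one big table with load data infile, without any deduplication and without any constraints. Then split the data into a few dozen small tables and compare each small table against the big table to deduplicate it, which yields deduplicated small tables. Finally, hash a column of each deduplicated small table to get a two-digit number, create new tables named with those numbers, and insert each row of the deduplicated small tables into the new table whose hash number matches.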
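The load step and the two-digit hash can be sketched as follows. This is only an illustration: the file path, the field separator and the column types are assumptions, while the table name tbl_new, the columns imsi, number, ctype and the hash expression are taken from the procedures later in this post.

```sql
-- Staging table with no keys and no constraints.
-- Column types are assumptions; the source does not show the schema.
CREATE TABLE IF NOT EXISTS tbl_new (
  imsi   VARCHAR(32),
  number VARCHAR(32),
  ctype  VARCHAR(32)
) ENGINE=InnoDB;

-- Hypothetical file path and CSV layout; the path must be readable by the
-- server and allowed by secure_file_priv.
LOAD DATA INFILE '/tmp/imsi_all.csv'
INTO TABLE tbl_new
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(imsi, number, ctype);

-- Preview the bucket each row would go to:
-- last two hex digits of md5(imsi) -> decimal 0..255 -> modulo 100.
SELECT imsi,
       MOD(CONV(RIGHT(MD5(imsi), 2), 16, 10), 100) AS split
FROM tbl_new
LIMIT 10;
```

Because the last two hex digits cover 0 to 255, taking them modulo 100 does not spread rows evenly: buckets 0 to 55 each receive three of the 256 possible values, buckets 56 to 99 only two.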
1. The stored procedure for deduplication:
DELIMITER //
/* tblname: table-name suffix, used to pick the small table dynamically */
CREATE PROCEDURE create_imsi(IN tblname varchar(200))
begin
declare age int default 1;
declare done int(1) default 0;
declare v_imsi varchar(200);
/* declare the cursor */
declare cur_l cursor for select imsi from sqlstr;
/* NOT FOUND handler: set done once the cursor has no more rows */
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' set done=1;
drop view if exists sqlstr;
/* build the view definition */
set @tbl = CONCAT("create view sqlstr as select a.imsi from tbl_new a,(select imsi from phone_",tblname," group by imsi having count(imsi) > 1) b where a.imsi = b.imsi group by imsi");
/* execute the statement that creates the view */
PREPARE stmt FROM @tbl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
OPEN cur_l;
FETCH cur_l INTO v_imsi;
while (done <> 1)
do
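/* compare against the big table and delete the duplicated rows from the small table */
/* imsi values contain letters, so the value must be wrapped in single quotes */
set @del = CONCAT("delete from phone_",tblname," where imsi='",v_imsi,"'");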
PREPARE stmt1 FROM @del;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
FETCH cur_l INTO v_imsi;
end while;
close cur_l;
end//
DELIMITER ;
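As an alternative to deleting one imsi at a time through a cursor, the same rows can be removed with set-based statements. The sketch below is not the procedure from this post but a swapped-in technique (a temporary key table plus a multi-table DELETE); it assumes a small table named phone_01 (hypothetical suffix) and the big table tbl_new, and it keeps the procedure's behaviour of deleting every row of an affected imsi from the small table.

```sql
-- Collect the imsi values that occur more than once in the small table
-- and also exist in the big table (same condition as the sqlstr view).
DROP TEMPORARY TABLE IF EXISTS dup_imsi;
CREATE TEMPORARY TABLE dup_imsi AS
SELECT b.imsi
FROM (SELECT imsi
      FROM phone_01
      GROUP BY imsi
      HAVING COUNT(imsi) > 1) AS b
WHERE EXISTS (SELECT 1 FROM tbl_new AS a WHERE a.imsi = b.imsi);

-- Delete all matching rows from the small table in one statement.
DELETE p
FROM phone_01 AS p
JOIN dup_imsi AS d ON p.imsi = d.imsi;

DROP TEMPORARY TABLE dup_imsi;
```

Whether this is faster than the cursor loop depends on the indexes available on imsi; on tables of this size it may be worth testing on a single small table first.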
2. Insert into the new tables according to the hash:
DELIMITER //
CREATE PROCEDURE insert_imsi(IN tblname varchar(20))
begin
declare age int default 1;
declare done int(1) default 0;
declare done1 int(1) default 0;
declare v_imsi varchar(200);
declare v_e varchar(2000);
declare v_number varchar(3000);
declare v_ctype varchar(2000);
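/* one cursor row per bucket; without DISTINCT every source row would trigger a full insert of its bucket */
declare cur_l cursor for select distinct split from sqlstr;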
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' set done=1;
DECLARE CONTINUE HANDLER FOR 1146 set done1=3;
DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' set done1=1;
DECLARE CONTINUE HANDLER FOR SQLSTATE '42000' set done1=2;
DECLARE CONTINUE HANDLER FOR SQLSTATE 'HY000' set done1=3;
drop view if exists sqlstx;
drop view if exists sqlstr;
set @sqlstx = CONCAT("create view sqlstr as SELECT imsi,number,ctype,mod(conv(right(md5(imsi),2),16,10),100) split from imsi_phone_",tblname);
PREPARE stmt1 FROM @sqlstx;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1;
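OPEN cur_l;
FETCH cur_l INTO v_e;
WHILE done <> 1
DO
/* copy all rows of this bucket into the table named after the bucket number */
set @ins = concat("insert into imsi_",v_e,"(imsi,number,ctype) select imsi,number,ctype from sqlstr where split = '",v_e,"'");
PREPARE stmt3 FROM @ins;
EXECUTE stmt3;
DEALLOCATE PREPARE stmt3;
/* fetch at the end of the loop so the body never runs after the cursor is exhausted */
FETCH cur_l INTO v_e;
END WHILE;
close cur_l;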
end//
DELIMITER ;
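The insert procedure expects a target table imsi_<split> to exist for every bucket. A minimal sketch for creating them up front is shown below; the procedure name create_bucket_tables and the column types are hypothetical, keys are left out because the post does not show the schema, and it assumes the split values render as plain integers 0 to 99.

```sql
DELIMITER //
/* Hypothetical helper: creates imsi_0 ... imsi_99 if they do not exist yet */
CREATE PROCEDURE create_bucket_tables()
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < 100 DO
    /* column types are assumptions; add the keys your data requires */
    SET @ddl = CONCAT('CREATE TABLE IF NOT EXISTS imsi_', i,
                      ' (imsi VARCHAR(32), number VARCHAR(32), ctype VARCHAR(32))',
                      ' ENGINE=InnoDB');
    PREPARE stmt FROM @ddl;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL create_bucket_tables();
/* then, for each deduplicated small table (the suffix '01' is hypothetical): */
CALL insert_imsi('01');
```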
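Errors encountered:
1. ERROR 1243 (HY000) at line 1: Unknown prepared statement handler (stmt3) given to EXECUTE
2. ERROR 1054 (42S22) at line 1: Unknown column '000cdc41b2a02518' in 'where clause'
Error 1243 means that EXECUTE was given a statement name with no successfully prepared statement behind it, which is what happens when the preceding PREPARE fails. Error 1054 came from a statement of the form set @dat = concat("insert into imsi_",v_e,"(imsi,number,ctype) select imsi,number,ctype from imsi_phone_",tblname," where imsi='",v_imsi,"'"); which originally had no single quotes around the value after the equals sign. Because the column contains letters, the unquoted value was parsed as a column name.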
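Parameter tuning:
The tables use the InnoDB engine, so these optimizations target InnoDB:
1. Set innodb_flush_log_at_trx_commit to 0 to reduce flushing (the log is then written and flushed roughly once per second instead of at every commit, so a crash can lose about a second of transactions).
2. set sql_log_bin=0 to stop writing the binary log for the current session for the time being.
3. Set sync_binlog to 0 to reduce flushing.
4. Set innodb_buffer_pool_size as large as the server's memory allows.
5. set foreign_key_checks=0 to skip foreign key checks.
6. Drop unnecessary indexes; if the data contains duplicates, though, a primary key is still required.
7. innodb_change_buffer_max_size has an upper limit of 50; I set it to 40 here, because load inserts data and therefore benefits from the insert (change) buffer.
8. binlog_cache_size: if the binary log has to stay enabled, set this as large as possible.
9. innodb_flush_method: the flush mode, set to O_DIRECT.
10. innodb_io_capacity: governs how fast dirty pages are flushed; set it according to your disks.
11. Set innodb_log_buffer_size as large as possible.
12. Turn off unique checks: set unique_checks=0;
13. alter table tablename disable keys; makes the table stop maintaining its indexes, if it has any. Note that this statement only affects non-unique indexes on MyISAM tables; InnoDB does not support it and only returns a warning.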
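The dynamically settable items from the list above can be applied around the bulk load roughly as follows. This is a sketch, not a recommended production configuration: the restore values are the defaults of recent MySQL versions and should be replaced by whatever the server had before, and innodb_flush_method is left out because it cannot be changed at runtime.

```sql
-- Record the current global values so they can be restored afterwards.
SELECT @@GLOBAL.innodb_flush_log_at_trx_commit,
       @@GLOBAL.sync_binlog,
       @@GLOBAL.innodb_change_buffer_max_size;

-- Session-level switches (changing sql_log_bin needs elevated privileges).
SET SESSION sql_log_bin = 0;
SET SESSION foreign_key_checks = 0;
SET SESSION unique_checks = 0;

-- Global settings that can be changed without a restart.
SET GLOBAL innodb_flush_log_at_trx_commit = 0;
SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_change_buffer_max_size = 40;

-- ... run load data infile and the insert procedure here ...

-- Restore afterwards (assumed defaults; use the values recorded above).
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_change_buffer_max_size = 25;
SET SESSION unique_checks = 1;
SET SESSION foreign_key_checks = 1;
SET SESSION sql_log_bin = 1;
```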