Deduplicating Tens of Millions of Rows: MySQL Deduplication on Over 300 Million Rows

Published: 2025/3/19, Database
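There are roughly 360 million rows that need to be deduplicated. Because the data volume is so large, the approach is as follows:

Load the data into one large table with LOAD DATA INFILE, with no deduplication and no constraints. Then split the data into a few dozen small tables and deduplicate each small table by comparing it against the large table, which yields deduplicated small tables. For each deduplicated small table, hash a column and reduce the hash to a two-digit number, create the new tables in advance, and insert the deduplicated rows into the new table whose name carries that hash number.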
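The loading step could look roughly like the sketch below. The table name tbl_new and the column names imsi, number, and ctype appear in the procedures in this article; the column types, the file path, and the delimiters are assumptions made only for illustration.

```sql
-- Staging table with no keys or constraints, so the load does no duplicate checking.
-- Column types are assumptions; only the column names come from the article.
CREATE TABLE tbl_new (
    imsi   VARCHAR(200),
    number VARCHAR(64),
    ctype  VARCHAR(64)
) ENGINE=InnoDB;

-- Hypothetical file path and delimiters; adjust them to the actual dump format.
LOAD DATA INFILE '/tmp/imsi_dump.csv'
INTO TABLE tbl_new
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(imsi, number, ctype);
```

LOAD DATA INFILE reads the file on the server host, and the secure_file_priv setting may restrict which directory it can read from.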

1. Stored procedure for deduplication:

DELIMITER //

/* tblname selects the table dynamically: it is the suffix of phone_<tblname> */
CREATE PROCEDURE create_imsi(IN tblname varchar(200))
begin
    declare done int default 0;
    declare v_imsi varchar(200);

    /* Cursor over the imsi values to delete */
    declare cur_l cursor for select imsi from sqlstr;

    /* Handler: set done once the cursor has no more rows */
    DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' set done=1;

    drop view if exists sqlstr;

    /* Define the view: imsi values that occur more than once in the small table and also exist in the big table tbl_new */
    set @tbl = CONCAT("create view sqlstr as select a.imsi from tbl_new a,(select imsi from phone_",tblname," group by imsi having count(imsi) > 1) b where a.imsi = b.imsi group by a.imsi");

    /* Execute the view definition */
    PREPARE stmt FROM @tbl;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;

    OPEN cur_l;
    FETCH cur_l INTO v_imsi;
    while (done <> 1)
    do
        /* Compare against the big table and delete the duplicates from the small table.
           QUOTE() wraps the value in single quotes, because imsi values can contain letters. */
        set @del = CONCAT("delete from phone_",tblname," where imsi=",QUOTE(v_imsi));
        PREPARE stmt1 FROM @del;
        EXECUTE stmt1;
        DEALLOCATE PREPARE stmt1;
        FETCH cur_l INTO v_imsi;
    end while;
    close cur_l;
end//

DELIMITER ;
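The cursor loop in the deduplication procedure issues one DELETE per imsi value. As a possible alternative, the same selection can be written as a single set-based statement. This is only a sketch of my reading of the procedure's logic, not the author's method: it removes every row of the small table whose imsi occurs more than once there and also exists in tbl_new. The suffix 01 in phone_01 is a hypothetical example value for tblname.

```sql
-- Set-based variant of the per-row cursor delete (sketch; phone_01 is a hypothetical table name).
DELETE p
FROM phone_01 AS p
JOIN (
    -- imsi values that occur more than once in the small table;
    -- the GROUP BY forces this derived table to be materialized before the delete runs
    SELECT imsi
    FROM phone_01
    GROUP BY imsi
    HAVING COUNT(*) > 1
) AS dup ON dup.imsi = p.imsi
WHERE EXISTS (
    -- keep only values that are also present in the big table
    SELECT 1 FROM tbl_new AS t WHERE t.imsi = p.imsi
);
```

An index on imsi in both tables would matter a great deal for either variant at this data volume.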

2. Insert into the new tables according to the hash:

DELIMITER //

CREATE PROCEDURE insert_imsi(IN tblname varchar(20))
begin
    declare done int default 0;
    declare done1 int default 0;
    declare v_e varchar(2000);

    /* Cursor over the distinct bucket numbers, so each target table is filled only once */
    declare cur_l cursor for select distinct split from sqlstr;

    /* done: cursor exhausted; done1: records the last error class and lets the loop continue */
    DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' set done=1;
    DECLARE CONTINUE HANDLER FOR 1146 set done1=3;
    DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' set done1=1;
    DECLARE CONTINUE HANDLER FOR SQLSTATE '42000' set done1=2;
    DECLARE CONTINUE HANDLER FOR SQLSTATE 'HY000' set done1=3;

    drop view if exists sqlstx;
    drop view if exists sqlstr;

    /* View that adds the bucket number: last two hex digits of md5(imsi), converted to decimal, modulo 100 */
    set @sqlstx = CONCAT("create view sqlstr as SELECT imsi,number,ctype,mod(conv(right(md5(imsi),2),16,10),100) split from imsi_phone_",tblname);
    PREPARE stmt1 FROM @sqlstx;
    EXECUTE stmt1;
    DEALLOCATE PREPARE stmt1;

    OPEN cur_l;
    FETCH cur_l INTO v_e;
    WHILE done <> 1
    DO
        /* Copy all rows of this bucket into its table imsi_<bucket> */
        set @ins = concat("insert into imsi_",v_e,"(imsi,number,ctype) select imsi,number,ctype from sqlstr where split = '",v_e,"'");
        PREPARE stmt3 FROM @ins;
        EXECUTE stmt3;
        DEALLOCATE PREPARE stmt3;
        FETCH cur_l INTO v_e;
    END WHILE;
    close cur_l;
end//

DELIMITER ;
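To see what the bucket expression used in the insert procedure produces, it can help to run it on its own. The queries below are a sketch; imsi_phone_01 and imsi_7 are hypothetical example names following the imsi_phone_<tblname> and imsi_<bucket> patterns in the procedure.

```sql
-- Show the intermediate values of the bucket expression for a few rows.
SELECT imsi,
       RIGHT(MD5(imsi), 2)                         AS hex_tail,  -- last two hex digits, 00 to ff
       CONV(RIGHT(MD5(imsi), 2), 16, 10)           AS dec_tail,  -- same value in decimal, 0 to 255
       MOD(CONV(RIGHT(MD5(imsi), 2), 16, 10), 100) AS split      -- bucket number, 0 to 99
FROM imsi_phone_01
LIMIT 5;

-- Check how evenly the rows spread over the buckets.
SELECT MOD(CONV(RIGHT(MD5(imsi), 2), 16, 10), 100) AS split,
       COUNT(*) AS row_count
FROM imsi_phone_01
GROUP BY split
ORDER BY split;

-- Fill one bucket table directly, without a view or cursor (bucket 7 as an example).
INSERT INTO imsi_7 (imsi, number, ctype)
SELECT imsi, number, ctype
FROM imsi_phone_01
WHERE MOD(CONV(RIGHT(MD5(imsi), 2), 16, 10), 100) = 7;
```

Because the two hex digits cover 256 values and these are reduced modulo 100, buckets 0 to 55 each receive three of the possible hex tails while buckets 56 to 99 receive two, so the bucket sizes will be somewhat uneven.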

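Errors encountered:

1. ERROR 1243 (HY000) at line 1: Unknown prepared statement handler (stmt3) given to EXECUTE

2. ERROR 1054 (42S22) at line 1: Unknown column '000cdc41b2a02518' in 'where clause'

Cause: the value after the equals sign (=) was not wrapped in single quotes. The column contains letters, so an unquoted value such as 000cdc41b2a02518 is parsed as a column name. When PREPARE fails for this reason, stmt3 is never created, which would likely also explain the first error at the EXECUTE that follows. The statement with the quotes added:

set @dat = concat("insert into imsi_",v_e,"(imsi,number,ctype) select imsi,number,ctype from imsi_phone_",tblname," where imsi='",v_imsi,"'");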
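One way to avoid this class of quoting error is to pass the value as a parameter instead of concatenating it into the SQL text. The sketch below assumes a hypothetical table phone_01 and reuses the value from the error message. Only the table name has to be concatenated, since identifiers cannot be parameters.

```sql
-- Build the statement text with a ? placeholder for the value.
SET @sql = CONCAT('DELETE FROM phone_', '01', ' WHERE imsi = ?');
PREPARE stmt FROM @sql;

-- The value is bound as a string, so letters in it are never parsed as a column name.
SET @v_imsi = '000cdc41b2a02518';
EXECUTE stmt USING @v_imsi;

DEALLOCATE PREPARE stmt;
```

Inside a stored procedure, the local variable would first be copied into a user variable (for example SET @v_imsi = v_imsi;) before EXECUTE ... USING. If concatenation is preferred, QUOTE(v_imsi) adds the quotes and escapes the value.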

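Parameter tuning:

The tables use the InnoDB engine, so these optimizations target InnoDB:

1. innodb_flush_log_at_trx_commit: set to 0 to reduce log flushing.

2. set sql_log_bin=0: temporarily stop writing the binary log.

3. sync_binlog: set to 0 to reduce flushing.

4. innodb_buffer_pool_size: set as large as memory allows.

5. set foreign_key_checks=0: skip foreign key checks.

6. Drop unnecessary indexes. If the data contains duplicates, a primary key is still required.

7. innodb_change_buffer_max_size: the upper limit is 50; I set it to 40, because the load is an insert workload and benefits from a larger insert buffer.

8. binlog_cache_size: if the binary log must stay enabled, set this as large as possible.

9. innodb_flush_method: the flush mode; set it to O_DIRECT.

10. innodb_io_capacity: controls dirty page flushing; set it according to your disks.

11. innodb_log_buffer_size: set as large as possible.

12. unique_checks: disable the check with set unique_checks=0;

13. alter table tablename disable keys; tells the table to skip index maintenance, if it has indexes (this only takes effect for non-unique indexes on MyISAM tables; InnoDB ignores it).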
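Several of the settings in this list can be changed at runtime. The statements below are a sketch of how that might look around a bulk load; the values 0 and 40 come from the list, and which variables are dynamic depends on the MySQL version, so treat the selection as an assumption to check against your server.

```sql
-- Session-level switches: affect only the connection that runs the load.
SET SESSION sql_log_bin = 0;          -- do not write this session's changes to the binary log
SET SESSION foreign_key_checks = 0;   -- skip foreign key checks
SET SESSION unique_checks = 0;        -- relax uniqueness checks on secondary indexes

-- Global dynamic variables: affect the whole server and need the matching privilege.
SET GLOBAL innodb_flush_log_at_trx_commit = 0;
SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_change_buffer_max_size = 40;

-- ... run the load here ...

-- Restore the session switches afterwards; set the global variables back to their previous values.
SET SESSION unique_checks = 1;
SET SESSION foreign_key_checks = 1;
SET SESSION sql_log_bin = 1;
```

innodb_flush_method cannot be changed at runtime; it belongs in the server configuration file (innodb_flush_method = O_DIRECT under [mysqld]) and takes effect after a restart.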
