What is a good way to iterate over 1 million MySQL records in Node.js?



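1. Reading everything in at once does not seem right. Wouldn't memory be a problem?
2. Reading 1000 rows at a time: first run select count(*), work out how many selects are needed, then run them in a for loop. The resulting code is not very elegant, though.

I don't have much database experience. If anyone has a better suggestion, please share!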


18 replies • 2015-01-17 22:00:30 +08:00

forest520

2015-01-17 17:36:56 +08:00 via iPhone

Come on, 1 million rows is really not that many.

zjmdp

2015-01-17 17:38:18 +08:00

@forest520 So just read all 1 million rows in one go?

sleshep

2015-01-17 17:40:39 +08:00

What operation are you trying to perform?

zjmdp

2015-01-17 17:42:52 +08:00

@sleshep Pulling every user's device token for APNs.

msg7086

2015-01-17 18:08:27 +08:00

The problem is not reading the data, it is what you do with it afterwards.
MySQL can send results unbuffered, so 1 million rows is no problem at all.

minbaby

2015-01-17 18:13:26 +08:00

while True:
    # pseudocode: fetch one page of `limit` rows starting at `offset`
    data = select * from table limit offset, limit
    offset += limit
    if not data:
        break
    do_something(data)

Don't kill me... this is the only way I know how to do it.

minbaby

2015-01-17 18:13:42 +08:00

Damn, why did the indentation disappear?

zjmdp

2015-01-17 18:16:21 +08:00 via iPhone

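@minbaby I thought of this approach too, but Node.js is asynchronous I/O by nature, so this synchronous style is a bit awkward, and throughput is clearly limited.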

minbaby

2015-01-17 18:17:42 +08:00

@zjmdp I have never written Node, but if it were me, I would run do_something on multiple threads.

zjmdp

2015-01-17 18:25:03 +08:00 via iPhone

@minbaby Your approach would definitely work, it just feels a little more complex. I wonder whether MySQL has a built-in way to handle this situation.

EPr2hh6LADQWqRVH

2015-01-17 18:28:08 +08:00 via iPhone

on data

vivisidea

2015-01-17 20:19:39 +08:00

@zjmdp MySQL's ResultSet has a stream mode. With the limit/offset approach, queries get slower as the offset grows.

<code>
ResultSet

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

To enable this functionality, create a Statement instance in the following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                            java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.

</code>

http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html
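
If paging is still preferred over streaming, the slowdown at large offsets can usually be avoided by paging on an indexed key rather than on OFFSET (often called keyset or seek pagination). A minimal Node.js sketch, assuming an auto-increment `id` column with positive values; the `users` table, the `device_token` column, and the promise-returning `query(sql, params)` helper are all hypothetical stand-ins for whatever driver wrapper is in use:

```javascript
// Keyset pagination: remember the last id seen instead of using OFFSET.
// `query(sql, params)` is a hypothetical helper that resolves to an array of rows.
async function eachBatchById(query, batchSize, handleBatch) {
  let lastId = 0; // assumes ids are positive integers
  for (;;) {
    const rows = await query(
      'SELECT id, device_token FROM users WHERE id > ? ORDER BY id LIMIT ?',
      [lastId, batchSize]
    );
    if (rows.length === 0) break; // nothing left to read
    await handleBatch(rows);
    lastId = rows[rows.length - 1].id; // the next batch starts after this id
  }
}
```

Each query can seek directly to `id > lastId` through the primary key, so the cost per batch should stay roughly constant however far into the table the scan has progressed.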

lincanbin

2015-01-17 20:52:14 +08:00

Stream the reads, though it still depends on what you want to do with the data as you iterate.

invite

2015-01-17 21:04:06 +08:00

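Do you even need SELECT COUNT(*) for this?

Suppose each SELECT is meant to fetch A rows. Use LIMIT A + 1 in the query and check whether the result set actually contains A + 1 rows. If it does not, that was the last page and the loop can exit.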
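
The LIMIT A + 1 idea above could look roughly like this in Node.js. This is only a sketch: `query(sql, params)` is a hypothetical promise-returning helper, and the table and column names are invented for illustration. It still uses OFFSET, so it only removes the COUNT(*) query and does not address the cost of large offsets.

```javascript
// Fetch pageSize + 1 rows; the extra row only signals that another page exists.
// `query(sql, params)` is a hypothetical helper that resolves to an array of rows.
async function eachPage(query, pageSize, handlePage) {
  let offset = 0;
  let hasMore = true;
  while (hasMore) {
    const rows = await query(
      'SELECT id, device_token FROM users ORDER BY id LIMIT ?, ?',
      [offset, pageSize + 1]
    );
    hasMore = rows.length > pageSize;        // got the sentinel row?
    const page = rows.slice(0, pageSize);    // drop the sentinel before processing
    if (page.length > 0) await handlePage(page);
    offset += pageSize;
  }
}
```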

zjmdp

2015-01-17 21:14:26 +08:00 via iPhone

@lincanbin See reply #4 for the explanation.

@invite Right, count(*) is not needed.

zjmdp

2015-01-17 21:27:09 +08:00

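@invite But with your approach every select has to wait for the previous one to finish, whereas Node is asynchronous I/O by nature and can issue selects concurrently to improve throughput.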

lianghui

2015-01-17 21:37:21 +08:00

When I had to process an entire table in the past, and it had an auto-increment id such as uid, I did this:

select min(uid) as start, max(uid) as end from table_name;

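Then roughly estimate a range threshold (how many uids each batch should cover) so that every batch pulls a reasonable amount of data.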

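lower = start;

while (lower <= end)
{
    // upper bound of this batch, clamped to end so the last batch is not skipped
    upper = (lower + threshold - 1 <= end) ? lower + threshold - 1 : end;

    // to do: process rows with uid between lower and upper (inclusive)

    lower = upper + 1;
}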
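
Because the uid ranges are independent of each other, they can also be processed concurrently, which fits Node's asynchronous model. A rough sketch under stated assumptions: `idRanges` and `runRanges` are hypothetical helper names, and `worker` stands in for whatever runs a query such as `SELECT ... WHERE uid BETWEEN ? AND ?` for one range.

```javascript
// Split [minId, maxId] into inclusive ranges covering at most `size` ids each.
function idRanges(minId, maxId, size) {
  const ranges = [];
  for (let lo = minId; lo <= maxId; lo += size) {
    ranges.push([lo, Math.min(lo + size - 1, maxId)]);
  }
  return ranges;
}

// Run `worker(range)` over all ranges with at most `concurrency` in flight.
async function runRanges(ranges, concurrency, worker) {
  let next = 0; // index of the next unclaimed range, shared by all lanes
  async function lane() {
    while (next < ranges.length) {
      const range = ranges[next++];
      await worker(range);
    }
  }
  await Promise.all(Array.from({ length: concurrency }, lane));
}
```

Gaps in the id sequence only make some batches smaller than the threshold. Capping the concurrency keeps the number of simultaneous queries within what the connection pool and the database can comfortably handle.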

zjmdp

2015-01-17 22:00:30 +08:00

@lianghui That is another way to do it, and I happen to have an auto-increment id.

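After reading the node mysql documentation carefully, I found that it supports streaming queries, so naturally that is what I went with. It looks like a near-ideal solution for querying large result sets.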
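
For readers who land here, a streaming query with the `mysql` npm package might look roughly like the sketch below. It assumes the event interface described in that package's README (`'result'`, `'error'` and `'end'` events on the query object, plus `connection.pause()` and `connection.resume()`). `streamRows` and `handleRow` are hypothetical names, and creating the connection is left out.

```javascript
// Stream rows one at a time instead of buffering the whole result set.
// `connection` is assumed to be a connection from the `mysql` npm package;
// `handleRow` is a hypothetical per-row callback that may return a promise.
function streamRows(connection, sql, handleRow) {
  return new Promise((resolve, reject) => {
    let pending = 0;   // rows handed to handleRow but not yet finished
    let ended = false; // set once the driver reports the end of the result set
    const finish = () => { if (ended && pending === 0) resolve(); };

    connection.query(sql)
      .on('error', reject)
      .on('result', (row) => {
        pending += 1;
        connection.pause(); // stop reading from the socket while this row is handled
        Promise.resolve()
          .then(() => handleRow(row))
          .then(() => { pending -= 1; connection.resume(); finish(); }, reject);
        // If handleRow fails, the connection stays paused; the caller should destroy it.
      })
      .on('end', () => { ended = true; finish(); });
  });
}
```

A call could look like `streamRows(connection, 'SELECT device_token FROM users', sendPush)`, where the table, column and `sendPush` are placeholders. The same README also describes `query.stream()`, which exposes the rows as a readable stream for piping.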