A multithreading problem with Heritrix

zzxplayful

Tags: multithreading, Scheme, .net, Apache

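I am crawling from a single machine, so I want to hash the URLs that Heritrix schedules across multiple threads. After I finished writing my hashing ELFHashQueueAssignmentPolicy, the first run resolves only about 30 dns: entries and then the job ends on its own. On the second or third run, however, the crawl does run with multiple threads.
I have also already changed the corresponding places in the Heritrix.properties file and in AbstractFrontier. I hope you can take a look. Thank you.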
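To make the intended spreading concrete, here is a minimal standalone sketch of the bucketing step, written without any Heritrix types. The class name ElfBucketDemo, the helper queueKey, and the example.com URLs are assumptions made only for illustration; the hash function and the modulus of 100 follow the listing in this post.

```java
import java.util.Map;
import java.util.TreeMap;

class ElfBucketDemo {

    // ELF hash as used in the policy: computed in a long, masked to 31 bits.
    static long elfHash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            long x = hash & 0xF0000000L;
            if (x != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    // Hypothetical helper: queue key = hash of the full URI modulo the queue count.
    static String queueKey(String uri, int queueCount) {
        return Long.toString(elfHash(uri) % queueCount);
    }

    public static void main(String[] args) {
        // Illustrative URLs on a single host; not taken from a real crawl.
        String[] uris = {
            "http://example.com/",
            "http://example.com/a.html",
            "http://example.com/b.html",
            "http://example.com/news/1",
            "http://example.com/news/2",
        };
        Map<String, Integer> perQueue = new TreeMap<>();
        for (String uri : uris) {
            String key = queueKey(uri, 100);
            perQueue.merge(key, 1, Integer::sum);
            System.out.println(key + " <- " + uri);
        }
        System.out.println("distinct queues used: " + perQueue.size());
    }
}
```

Because the key depends on the whole URI rather than on the host, pages of one site can land in different queues, which is what lets several threads work on a single site at once.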

/*******************************************************************************
 * File description:
 *
 * Project: WebCrawler
 * File: ELFHashAssignmentPolicy.java
 * Package: com.hotct.heritrixExt.common.frontier
 *
 * Author: zhangzhenxin
 * Created at: 03:50:01 PM
 * Created on: 2007-10-30
 ******************************************************************************/
package com.hotct.heritrixExt.common.frontier;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.frontier.QueueAssignmentPolicy;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

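/**
 * Queue assignment policy that spreads URIs across a fixed number of queues
 * by the ELF hash of the URI, instead of using one queue per host.
 *
 * @author zhangzhenxin
 * @date 2007-10-30
 */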
public class ELFHashAssignmentPolicy extends QueueAssignmentPolicy {

    private static final Logger logger = Logger
            .getLogger(ELFHashAssignmentPolicy.class.getName());

    /** Queue key used when no key can be derived from the URI. */
    private static final String DEFAULT_CLASS_KEY = "default...";

    private static final String DNS = "dns";

    /** Number of hash buckets that non-DNS URIs are spread over. */
    private static final int QUEUE_COUNT = 100;

    /**
     * Returns the queue key for a URI. DNS URIs are keyed by host;
     * all other URIs are keyed by ELF hash of the URI modulo QUEUE_COUNT.
     */
    @Override
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        String scheme = cauri.getUURI().getScheme();
        String candidate = null;

        try {
            if (DNS.equals(scheme)) {
                if (cauri.getVia() != null) {
                    // Special handling for DNS, carried over from the
                    // hostname-based policy: key the lookup by the
                    // authority (host:port) of the triggering URI.
                    // Note: the triggering URI itself is keyed by hash
                    // below, so the two do not share a queue here.
                    UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                    candidate = viaUuri.getAuthorityMinusUserinfo();
                } else {
                    candidate = cauri.getUURI().getReferencedHost();
                }
            } else {
                // Spread every other URI over QUEUE_COUNT numbered queues.
                long hash = ELFHash(uri);
                candidate = Long.toString(hash % QUEUE_COUNT);
            }

            if (candidate == null || candidate.length() == 0) {
                candidate = DEFAULT_CLASS_KEY;
            }
        } catch (URIException e) {
            logger.log(Level.INFO,
                    "unable to extract class key; using default", e);
            candidate = DEFAULT_CLASS_KEY;
        }

        // ':' (as in host:port) is replaced in the queue key.
        return candidate.replace(':', '#');
    }

    /**
     * ELF hash of a string, computed in a long and masked to a
     * non-negative 31-bit value.
     */
    public static long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            // Fold the top four bits (28-31) back into the low bits.
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }

}
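The comment in the DNS branch says the lookup should be treated as the same class as the URI that triggered it, yet with hashed keys the triggering URI sits in a numbered queue while the lookup is keyed by host. One way to make the two coincide is to hash the via URI for DNS entries. The sketch below models this with plain strings instead of the Heritrix types; DnsKeySketch and classKey are hypothetical names, and whether this mismatch is related to the first-run behaviour described above is not established.

```java
class DnsKeySketch {

    // Same ELF hash as in the listing above.
    static long elfHash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            long x = hash & 0xF0000000L;
            if (x != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    // Hypothetical stand-in for getClassKey: scheme, uri and viaUri are plain
    // strings here; viaUri is null when nothing triggered the URI.
    static String classKey(String scheme, String uri, String viaUri, int queueCount) {
        if ("dns".equals(scheme)) {
            if (viaUri != null) {
                // Assumption: key the lookup exactly like the URI that triggered it.
                return Long.toString(elfHash(viaUri) % queueCount);
            }
            // No triggering URI: fall back to the host after "dns:".
            return uri.substring("dns:".length());
        }
        return Long.toString(elfHash(uri) % queueCount);
    }

    public static void main(String[] args) {
        String page = "http://example.com/index.html"; // illustrative URL
        String pageKey = classKey("http", page, null, 100);
        String dnsKey = classKey("dns", "dns:example.com", page, 100);
        System.out.println("page queue: " + pageKey);
        System.out.println("dns queue:  " + dnsKey);
        System.out.println("same queue: " + pageKey.equals(dnsKey));
    }
}
```

In the real policy the equivalent change would hash the string returned by cauri.flattenVia() in the DNS branch; that is untested here and offered only as something to try.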

2007-11-16 19:06
#1 D04540214 2008-04-06

I have run into the same problem. Has the original poster managed to solve it?
