`
zzxplayful
  • 浏览: 51303 次
  • 性别: Icon_minigender_1
  • 来自: 北京
社区版块
存档分类
最新评论

Heritrix多线程的问题

阅读更多
  我现在是用一台主机抓取数据,所以我想把Heritrix的链接散列到多个线程中,可是当我散列的ELFHashQueueAssignmentPolicy写好后,第一次执行的时候,只能解析出30个dns:任务就自动的结束了,可是,当第二次或是第三次的时候,就可以实现多个线程了
另外我已经把Heritrix.properties文件和AbstractFrontier中相应的位置都已经改了,希望您能帮我看看,谢谢了。


/*******************************************************************************
* 文件说明:
*
* 项目名: WebCrawler
* 文件名: ELFHashAssignmentPolicy.java
* 包名: com.hotct.heritrixExt.common.frontier
*
* 创建人: zhangzhenxin
* 创建时间: 下午03:50:01
* 创建日期: 2007-10-30
******************************************************************************/
package com.hotct.heritrixExt.common.frontier;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.frontier.HostnameQueueAssignmentPolicy;
import org.archive.crawler.frontier.QueueAssignmentPolicy;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

/**
* <h>类型描述</h>
*
* @author zhangzhenxin
* @date 2007-10-30
*/
public class ELFHashAssignmentPolicy extends QueueAssignmentPolicy {

private static final Logger logger = Logger
.getLogger(ELFHashAssignmentPolicy.class.getName());

private static String DEFAULT_CLASS_KEY = "default...";

private static final String DNS = "dns";
/**
*
*/
@Override
public String getClassKey(CrawlController controller, CandidateURI cauri) {
String uri = cauri.getUURI().toString();
String scheme = cauri.getUURI().getScheme();
String candidate = null;

try {
if (scheme.equals(DNS)) {
if (cauri.getVia() != null) {
// Special handling for DNS: treat as being
// of the same class as the triggering URI.
// When a URI includes a port, this ensures
// the DNS lookup goes atop the host:port
// queue that triggered it, rather than
// some other host queue
UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
candidate = viaUuri.getAuthorityMinusUserinfo();
// adopt scheme of triggering URI
scheme = viaUuri.getScheme();
} else {
candidate = cauri.getUURI().getReferencedHost();
}
} else {
// String uri = cauri.getUURI().toString();
long hash = ELFHash(uri);
candidate = Long.toString(hash % 100);
}

if (candidate == null || candidate.length() == 0) {
candidate = DEFAULT_CLASS_KEY;
}
} catch (URIException e) {
logger.log(Level.INFO,
"unable to extract class key; using default", e);
candidate = DEFAULT_CLASS_KEY;
}

return candidate.replace(':', '#');
}

public static long ELFHash(String str) {
long hash = 0;
long x = 0;
for (int i = 0; i < str.length(); i++) {
hash = (hash << 4) + str.charAt(i);
if ((x = hash & 0xF0000000L) != 0) {
hash ^= (x >> 24);
hash &= ~x;
}
}
return (hash & 0x7FFFFFFF);
}

}
分享到:
评论
1 楼 D04540214 2008-04-06  
我也遇到相同的问题 ,不知道lz有没有解决 ?

相关推荐

Global site tag (gtag.js) - Google Analytics