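There are three ways to use web-harvest:
1. Use the bundled IDE in graphical mode.
Open a configuration file and run it; the results are displayed once the run finishes.
2. Run it from the command line.
The syntax is: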
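java -jar webharvest_all_XX.jar [-h] config=<path> [workdir=<path>] [debug=yes|no]
    [proxyhost=<proxy server> [proxyport=<proxy server port>]]
    [proxyuser=<proxy username> [proxypassword=<proxy password>]]
    [proxynthost=<NT host name>]
    [proxyntdomain=<NT domain name>]
    [loglevel=<level>]
    [logpropsfile=<path>]
    [#var1=<value1> [#var2=<value2>…]]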
For example (using the test.xml configuration file in the home directory):
java -jar webharvest_all_1.jar config=/home/zhang/test.xml
java -jar webharvest_all_1.jar config=/home/zhang/test.xml proxyhost=218.25.86.254 proxyport=80
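The same command line can also be assembled from a program, which is handy when the proxy settings or the #var values change between runs. The following is only a minimal sketch: the class WebHarvestCommand and its build method are hypothetical names invented for this example and are not part of web-harvest. It only joins the key=value pairs described in the syntax above and does not validate them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of web-harvest): builds the argument list
// for the command-line syntax described above.
public class WebHarvestCommand {

    // jarPath and configPath are required; options (e.g. proxyhost, workdir)
    // and vars (emitted as #name=value) may be empty.
    public static List<String> build(String jarPath, String configPath,
                                     Map<String, String> options,
                                     Map<String, String> vars) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("config=" + configPath);
        for (Map.Entry<String, String> e : options.entrySet()) {
            cmd.add(e.getKey() + "=" + e.getValue());
        }
        for (Map.Entry<String, String> e : vars.entrySet()) {
            cmd.add("#" + e.getKey() + "=" + e.getValue());
        }
        return cmd;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("proxyhost", "218.25.86.254");
        options.put("proxyport", "80");
        List<String> cmd = build("webharvest_all_1.jar", "/home/zhang/test.xml",
                                 options, new LinkedHashMap<>());
        // Prints the second example command shown above.
        System.out.println(String.join(" ", cmd));
        // To launch it, the list could be handed to
        // new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a path contains spaces.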
3. Use the API from a Java program:
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;
import mypackage.MyXmlLibrary;  // user-supplied class, not part of web-harvest

public class WebHarvestTest {
    public static void main(String[] args) {
        // load the configuration file
        ScraperConfiguration config =
            new ScraperConfiguration("c:/wh/configs/news.xml");
        // second argument is the working directory
        Scraper scraper = new Scraper(config, "c:/wh/work/");
        // make values and objects available to the configuration
        scraper.addVariableToContext("username", "web-harvest");
        scraper.addVariableToContext("password", "web-harvest");
        scraper.addVariableToContext("myXmlLib", new MyXmlLibrary());
        scraper.setDebug(true);
        scraper.execute();
        // takes variable created during execution
        Variable articles = (Variable) scraper.getContext().get("articles");
        // do something with articles…
    }
}
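The example above registers an instance of MyXmlLibrary in the scraper context, but that class is something you have to write yourself. As a purely illustrative sketch, it could be an ordinary Java class such as the one below. The escape method is an assumption made up for this example, and the package mypackage declaration is left out so the file compiles on its own. How the object is reached from inside a configuration depends on the web-harvest version and is not shown here.

```java
// Hypothetical stand-in for mypackage.MyXmlLibrary; nothing here is
// required by or provided by web-harvest.
public class MyXmlLibrary {

    // Escapes the five characters that are special in XML text and attributes.
    public String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Quick check of the helper on its own, outside web-harvest.
        System.out.println(new MyXmlLibrary().escape("a < b & \"c\""));
    }
}
```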