
Requesting web page data with a Java Socket and hitting a 301 permanent redirect

2012-02-05 

Java code
package wadihu.crawl;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.Charset;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Queue;

/** Crawler class dedicated to downloading web pages; connects in non-blocking mode. */
public class CrawlOrder1 {
    private boolean shutdown = false; // used to control the Connector thread
    private Selector selector; // selector that channels are registered with
    private Queue<Target> targetLists = new LinkedList<Target>(); // task queue
    private Queue<Target> taskLists = new LinkedList<Target>(); // queue of tasks waiting to be crawled

    public CrawlOrder1() throws IOException {
        selector = Selector.open(); // open the selector
        RW rw = new RW();
        rw.start();
        System.out.println("Read/write thread started...");
        receiveTarget(); // read the URL tasks submitted by the user
    }

    /** Reads URL requests entered by the user. */
    public void receiveTarget() throws IOException {
        BufferedReader buf = new BufferedReader(new InputStreamReader(System.in));
        String msg = null;
        while ((msg = buf.readLine()) != null) {
            if (!msg.equals("bye")) {
                Target target = new Target(msg);
                addTarget(target);
            } else {
                shutdown = true;
                selector.wakeup();
                System.out.println("System stopped");
                break;
            }
        }
    }

    /**
     * Adds a task to the task queue.
     * @throws IOException
     */
    public void addTarget(Target target) throws IOException {
        synchronized (targetLists) {
            targetLists.add(target);
        }
        selector.wakeup();
    }

    /** Registers read/write events. */
    public void registerRW() {
        synchronized (targetLists) {
            while (targetLists.size() > 0) {
                Target target = targetLists.poll();
                try {
                    target.socketChannel.register(selector, SelectionKey.OP_WRITE | SelectionKey.OP_READ, target);
                } catch (ClosedChannelException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Handles read/write events once they are ready.
     * @throws IOException
     */
    public void processSelectdRWKeys() throws IOException {
        for (Iterator<?> it = selector.selectedKeys().iterator(); it.hasNext();) {
            SelectionKey selectionKey = (SelectionKey) it.next();
            it.remove();
            SocketChannel socketChannel = (SocketChannel) selectionKey.channel();
            if (selectionKey.isWritable()) {
                // Always requests "/" on the host the user typed in
                String head = "GET / HTTP/1.1\r\nHOST:" + socketChannel.socket().getInetAddress().getHostName() + "\r\n" + "Accept:*/*\r\n" + "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1;)\r\n\r\n";
                ByteBuffer buffer = ByteBuffer.wrap(head.getBytes());
                socketChannel.write(buffer);
                socketChannel.register(selector, SelectionKey.OP_READ);
            } else if (selectionKey.isReadable()) {
                ByteBuffer buffer = ByteBuffer.allocate(1024);
                int ret = socketChannel.read(buffer);
                if (ret < 0) {
                    socketChannel.close();
                    selectionKey.cancel();
                }
                buffer.flip();
                Charset ch = Charset.forName("gb2312");
                System.out.println(ch.decode(buffer));
            }
        }
    }

    /** Inner class that performs the reading and writing. */
    private class RW extends Thread {
        public void run() {
            while (!shutdown) {
                try {
                    registerRW();
                    if (selector.select(500) > 0) {
                        processSelectdRWKeys();
                    }
                } catch (ClosedChannelException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                selector.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        new CrawlOrder1();
    }
}

/** A single crawl task (top-level class). */
class Target {
    SocketAddress address;
    SocketChannel socketChannel;

    public Target(String host) throws IOException {
        address = new InetSocketAddress(InetAddress.getByName(host), 80);
        this.socketChannel = SocketChannel.open(address);
        this.socketChannel.configureBlocking(false);
    }
}



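[Solution]
I have written a spider as well. You need to check the status code that comes back; when you hit a redirect, just send another request to the URL given in the Location header.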
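
To make the idea above concrete, here is a minimal sketch of checking the status code and pulling out the Location header from the raw text a socket returns. The class and method names (RedirectSketch, statusCode, redirectTarget) are made up for illustration and are not part of the original post; the sketch also assumes the whole response head has already been read into one String.

```java
/** Hypothetical helper (not from the original post): inspects a raw HTTP response head. */
class RedirectSketch {

    /** Returns the code from a status line such as "HTTP/1.1 301 Moved Permanently", or -1 if malformed. */
    static int statusCode(String responseHead) {
        String statusLine = responseHead.split("\r\n", 2)[0];
        String[] parts = statusLine.split(" ", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns the Location header value for a 301/302/303/307/308 response, otherwise null. */
    static String redirectTarget(String responseHead) {
        int code = statusCode(responseHead);
        if (code != 301 && code != 302 && code != 303 && code != 307 && code != 308) {
            return null;
        }
        // Only look at the header section, never at the body.
        int end = responseHead.indexOf("\r\n\r\n");
        String headers = end >= 0 ? responseHead.substring(0, end) : responseHead;
        String[] lines = headers.split("\r\n");
        for (int i = 1; i < lines.length; i++) { // line 0 is the status line
            int colon = lines[i].indexOf(':');
            if (colon <= 0) {
                continue;
            }
            // Header names are case-insensitive in HTTP.
            if (lines[i].substring(0, colon).trim().equalsIgnoreCase("Location")) {
                return lines[i].substring(colon + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String head = "HTTP/1.1 301 Moved Permanently\r\n"
                + "Location: http://www.example.com/\r\n"
                + "Content-Length: 0\r\n\r\n";
        System.out.println(statusCode(head));
        System.out.println(redirectTarget(head));
    }
}
```

In the posted crawler, a check like this would go in the readable branch, before the decoded buffer is printed. A real spider should also cap the number of redirects it follows so that a redirect loop cannot run forever.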

However, if you download with Apache's HttpClient, it takes care of all of this for you.
[Solution]
Use the java.net.HttpURLConnection class instead of the Socket class. It wraps all of this up for you and can handle the redirect problem.
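
A rough sketch of what that could look like. HttpURLConnection follows redirects by default as long as the protocol stays the same (http to http). So that the example runs without touching the internet, it starts a throwaway local server with the JDK's built-in com.sun.net.httpserver.HttpServer; that server, the /old and /new paths, and all class and method names here are assumptions added purely for the demo. It needs Java 9 or later because of readAllBytes.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class (not from the original post). */
class HttpUrlConnectionSketch {

    /** Downloads a page; HttpURLConnection follows same-protocol redirects on its own. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(true); // already the default, stated for clarity
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    /** Starts a local test server: /old answers 301 with Location: /new, /new answers 200. */
    static HttpServer startDemoServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/old", exchange -> {
            exchange.getResponseHeaders().add("Location", "/new");
            exchange.sendResponseHeaders(301, -1); // -1 means no response body
            exchange.close();
        });
        server.createContext("/new", exchange -> {
            byte[] body = "final page".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Requests /old on the demo server and returns whatever body is finally delivered. */
    static String runDemo() {
        try {
            HttpServer server = startDemoServer();
            try {
                int port = server.getAddress().getPort();
                return fetch("http://127.0.0.1:" + port + "/old");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

For a real crawl only fetch is needed. Note that it decodes the body as UTF-8, whereas the code in the question decodes as gb2312; pick the charset that matches the site.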
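[Solution]
If you insist on using a raw socket, you need to know how the Location header is written in the HTTP protocol. And since you are using Java, don't worry too much about efficiency; otherwise you might as well use C.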
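
For anyone who does stay with raw sockets: the Location value may be an absolute URL or just a path, so it has to be resolved against the URL that was just requested, and the follow-up request must use the new host and path rather than the hardcoded "GET /". A minimal sketch under those assumptions; the names RedirectRequestSketch, resolve, portOf and buildGet are invented for illustration, the "Connection: close" header is my addition, and only plain http is covered (https would need TLS on top of the socket).

```java
import java.net.URI;

/** Hypothetical helper (not from the original post): prepares the follow-up request after a redirect. */
class RedirectRequestSketch {

    /** Resolves a Location header value against the URL that was just requested. */
    static URI resolve(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location);
    }

    /** Port to connect to: the explicit port if present, otherwise 80 (plain HTTP only). */
    static int portOf(URI target) {
        return target.getPort() == -1 ? 80 : target.getPort();
    }

    /** Builds the raw GET request text for the redirect target. */
    static String buildGet(URI target) {
        String path = target.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (target.getRawQuery() != null) {
            path += "?" + target.getRawQuery();
        }
        String host = target.getHost();
        if (target.getPort() != -1) {
            host += ":" + target.getPort(); // non-default port must appear in the Host header
        }
        return "GET " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Accept: */*\r\n"
                + "Connection: close\r\n\r\n"; // assumption: ask the server to close after one response
    }

    public static void main(String[] args) {
        URI next = resolve("http://example.com/a/b", "/new?id=1");
        System.out.println(next.getHost() + ":" + portOf(next));
        System.out.print(buildGet(next));
    }
}
```

The new socket then has to be opened to next.getHost() on portOf(next), since a redirect frequently points at a different host (for example from example.com to www.example.com).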

Java's strength lies in how convenient it makes building the architecture.

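Besides, the bottleneck for a spider written in Java is not CPU efficiency but bandwidth. I have written one: on a 100M link, using HttpURLConnection for everything, the bandwidth was saturated while CPU usage was only 20%.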
[Solution]
Bumping the thread, and here to learn.
