Node.js Crawler in Practice (Part 4)

July 22, 2021

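1. Redirects on Taobao and Tmall pages

Try fetching a Taobao page and print res.statusCode and res.headers.location:
res.statusCode is the status code of the response;

res.headers.location is the address the response redirects to.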
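The inspection step above can be sketched as follows. This is a minimal sketch, not the code used later in this post: `summarize` and `probe` are hypothetical helper names, and the URL in the usage comment is only an example.

```javascript
const http = require('http');
const https = require('https');

// Pull out the two fields the text prints.
// `summarize` is a hypothetical helper name, not part of any library.
function summarize(res) {
  return {
    statusCode: res.statusCode,
    // null when the server sent no Location header
    location: res.headers.location || null,
  };
}

// Send a GET request and report the two fields without reading the body.
function probe(sUrl, done) {
  const client = sUrl.startsWith('https:') ? https : http;
  const req = client.request(sUrl, (res) => {
    res.resume(); // discard the body so the socket is released
    done(null, summarize(res));
  });
  req.on('error', (err) => done(err));
  req.end();
}

// Usage (performs a real network request):
// probe('https://www.taobao.com/', (err, info) => console.log(err || info));
```

Keeping `summarize` separate from the request makes it possible to check the field handling with a plain object instead of a live response.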

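The status code turns out to be 302. Looking up the HTTP status codes gives:

302 Moved Temporarily

The requested resource temporarily responds from a different URI. Because the redirect is temporary, the client should keep sending future requests to the original address. The response is cacheable only when Cache-Control or Expires says so.

If the request is not a GET or HEAD request, the browser must not redirect automatically unless the user confirms, because the conditions of the request may have changed.

Note: although RFC 1945 and RFC 2068 do not allow the client to change the request method when redirecting, many existing browsers treat a 302 response as a 303 and fetch the URI given in Location with GET, ignoring the original method. Status codes 303 and 307 were added to make clear which reaction the server expects from the client.
What Taobao returns to us is only a temporary HTML page, not the real page that contains the data.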
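The note about request methods can be made concrete with a small sketch. It assumes the behavior that current browsers follow: 301 and 302 turn a POST into a GET, 303 turns everything except GET and HEAD into a GET, and 307 and 308 keep the method. `methodAfterRedirect` is a hypothetical helper name used only for illustration.

```javascript
// Return the method a browser-like client would use for the follow-up
// request, or null when the status code is not a redirect.
function methodAfterRedirect(statusCode, method) {
  const m = method.toUpperCase();
  switch (statusCode) {
    case 301:
    case 302:
      // Widespread behavior: a redirected POST is retried as GET.
      return m === 'POST' ? 'GET' : m;
    case 303:
      // "See Other": always fetch the new URI with GET (HEAD stays HEAD).
      return m === 'HEAD' ? 'HEAD' : 'GET';
    case 307:
    case 308:
      // These codes exist to guarantee the method is preserved.
      return m;
    default:
      return null;
  }
}

console.log(methodAfterRedirect(302, 'POST'));
console.log(methodAfterRedirect(307, 'POST'));
```

The crawler in this post only sends GET requests, so the distinction does not change its behavior, but it matters as soon as a form submission is crawled.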

Check whether the response is a redirect, and recursively follow it until the real page is found.

    if(res.statusCode == 302 || res.statusCode == 301){
        console.log(`Redirect #${index}:`,res.headers.location);
        GetUrl(res.headers.location,success)
    }
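Two things the recursive check leaves open are a relative Location header and a redirect loop. The sketch below shows one way to handle both. It is an assumption layered on top of the post's approach, not part of it: `nextUrl` is a hypothetical helper and the cap of 10 hops is an arbitrary choice.

```javascript
// Arbitrary upper bound on redirects (an assumption, tune as needed).
const MAX_REDIRECTS = 10;

// Decide where to go next. Returns the absolute URL to request,
// or null when the response is not a redirect.
function nextUrl(currentUrl, statusCode, location, hops) {
  const isRedirect = [301, 302, 303, 307, 308].includes(statusCode);
  if (!isRedirect || !location) {
    return null;
  }
  if (hops >= MAX_REDIRECTS) {
    throw new Error(`too many redirects (${hops})`);
  }
  // The WHATWG URL constructor resolves relative and protocol-relative
  // Location values against the URL that was just requested.
  return new URL(location, currentUrl).href;
}

console.log(nextUrl('https://detail.tmall.com/item.htm?id=1', 302, '/login?x=1', 0));
```

Inside the response callback, the result could be passed to the recursive call in place of the raw `res.headers.location`.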

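2. Transcoding

Once we have the real page, opening it shows text garbled by an encoding problem.
Bring in the gbk module.
gbk provides a method for converting between encodings:
gbk.toString('utf-8',data);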
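As an alternative to the gbk module, the same conversion can be sketched with the `TextDecoder` that ships with Node.js. This assumes a Node.js build with full ICU data (the default for official binaries in recent versions); without it, `new TextDecoder('gbk')` throws. `charsetFromContentType` and `decodeBody` are hypothetical helper names, and falling back to GBK when the header names no charset is an assumption that fits this page, not a general rule.

```javascript
// Read the charset from a Content-Type header value,
// e.g. "text/html; charset=GBK" -> "gbk".
function charsetFromContentType(contentType, fallback = 'gbk') {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : fallback;
}

// Decode the complete response body (a Buffer) into a JavaScript string.
function decodeBody(buffer, contentType) {
  const charset = charsetFromContentType(contentType);
  return new TextDecoder(charset).decode(buffer);
}

// 0xC4E3 0xBAC3 are the GBK bytes for the two characters of "你好".
const sample = Buffer.from([0xc4, 0xe3, 0xba, 0xc3]);
console.log(decodeBody(sample, 'text/html; charset=GBK'));
```

Decoding has to happen on the full concatenated Buffer: decoding chunk by chunk can split a two-byte GBK character across chunks.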
That's it.

###### Complete code


    var index = 0; // number of requests sent so far
    const fs = require('fs');
    const url = require('url');
    const gbk = require('gbk');
    
    GetUrl('https://detail.tmall.com/item.htm?spm=a230r.1.14.6.68624507tWuF7E&id=560257961625&cm_id=140105335569ed55e27b&abbucket=18&sku_properties=10004:709990523',(data)=>{
        // data is the raw Buffer of the final page; convert it to a UTF-8 string
        var html = gbk.toString('utf-8',data);
        console.log(html)
        //console.log('Finally made it out')
        //fs.writeFile('iponex.html',data);
        //console.log(str)
    })
    function GetUrl(sUrl,success){
        index++;
        var urlObj = url.parse(sUrl);
        // Pick the module that matches the protocol of the URL
        var http = urlObj.protocol == 'http:' ? require('http') : require('https');
    
        let req = http.request({
            'hostname':urlObj.hostname,
            'path':urlObj.path
        },res=>{
            if(res.statusCode == 200){
                // Collect raw Buffers; appending chunks to a string would
                // decode them as UTF-8 and corrupt the GBK bytes
                var arr = [];
                res.on('data',buffer=>{
                    arr.push(buffer);
                });
                res.on('end',()=>{
                    let b = Buffer.concat(arr);
    
                    success && success(b);
    
                })
            }
            else if(res.statusCode == 302 || res.statusCode == 301){
                console.log(`Redirect #${index}:`,res.headers.location);
                res.resume(); // discard the body of the temporary page
                GetUrl(res.headers.location,success)
            }
            //console.log(res.statusCode,res.headers.location)
        });
    
        // 'error' fires on network failures (DNS, refused connection, ...),
        // not on HTTP status codes such as 404
        req.on('error',(err)=>{
            console.log('Request failed:',err.message);
        })
        req.end();
    }

Original link: http://enofeng.github.io/2021/07/22/Nodejs爬虫实战（四）/

Copyright notice: please credit the source when reposting.
