Troubleshooting Scrapy Error: Host Header Issues and Solutions


Identifying the Problem

Today, while restoring an old project, I repeatedly hit a Scrapy error when downloading a specific link. The error message was as follows:

[]

I had encountered this error before but didn’t know how to resolve it. After Googling, some suggested adding a UA header, while others said it was a proxy issue. I tested the proxy specifically using curl and found that the proxy was OK.

Then I checked the User-Agent and tried several variations of the request headers, all to no avail:

headers = {
    'User-Agent': "xxxxx",
    'Content-Type': "text/plain; charset=UTF-8",
    'Host': "xxxx",
    'Cache-Control': "no-cache",
}

Eventually, I found that Host was incorrect. After commenting out Host, data collection worked normally. I then re-captured the packets of this request and found that the Host had changed. Since I’m not familiar with Host, I wrote an article to document it and to review the HTTP protocol.
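As a minimal sketch of the fix described above, the stale Host entry can be removed from the headers dict before the request is built, so that the HTTP client fills in Host from the URL itself. The helper name drop_host_header is hypothetical, and the header values are the same placeholders used earlier.

```python
def drop_host_header(headers):
    """Return a copy of the headers without any Host entry.

    Header names are case-insensitive, so 'host' and 'HOST' are dropped too.
    """
    return {name: value for name, value in headers.items()
            if name.lower() != "host"}


headers = {
    "User-Agent": "xxxxx",
    "Content-Type": "text/plain; charset=UTF-8",
    "Host": "xxxx",  # hard-coded value that no longer matches the target site
    "Cache-Control": "no-cache",
}

# The remaining headers are kept unchanged; Host is left for the client to set.
cleaned = drop_host_header(headers)
```

The cleaned dict can then be passed as the headers argument of scrapy.Request in place of the original one.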

Host Field

Explanation of Host

The Host field carries the domain name of the server (used for virtual hosting) and the TCP port number the server is listening on. If the requested port is the standard port for the service, the port number can be omitted. The field has been required since HTTP/1.1.

Host Example

Host: en.wikipedia.org:80
Host: en.wikipedia.org

Status: Permanent

Source: HTTP Header Field
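To make the port rule concrete, here is a small sketch that derives the Host value a client would normally send for a given URL, omitting the port when it is the standard one for the scheme. The function name host_header_for and the DEFAULT_PORTS table are illustrative assumptions, not part of Scrapy or any other library.

```python
from urllib.parse import urlsplit

# Standard ports per scheme; when the URL uses one of these, the port is omitted.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header_for(url):
    """Build the Host header value for a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # urlsplit strips the brackets from IPv6 literals; put them back.
        host = f"[{host}]"
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return host
    return f"{host}:{port}"


print(host_header_for("http://en.wikipedia.org:80/wiki/HTTP"))
print(host_header_for("http://www.6san.com:8080/"))
```

Comparing this derived value with a hard-coded Host header is a quick way to spot the kind of mismatch that caused the error in this article.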

This article explains it clearly: Network—An Article Explaining in Detail the Concept of the Request Header Host

https://blog.csdn.net/netdxy/article/details/51195560

Summary

  1. The Host field can be a domain name or an IP address, optionally followed by a port number, such as Host: www.6san.com:8080.
  2. The Host value can be set by the program, and some programs send a false Host to avoid interception by network operators or firewalls.
  3. In HTTP/1.1, the Host field can be empty; in HTTP/1.0, the Host field can be missing.
  4. HTTP response headers do not include a Host field, so packets matched by the http.host filter in Wireshark are all request packets.
  5. Because the Host field can be set by the program, its value can take many unusual forms, such as a Host header containing multiple ‘/’ characters or ending with a “.”.

In HTTP/1.1, the Host field cannot be missing; if it is, the server returns 400 Bad Request. It can, however, be present with an empty value.
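The difference between a missing Host field and an empty one is easier to see on the wire. The sketch below only builds the raw request bytes for the three cases and does not contact any server. The helper build_request is a hypothetical name, and how a particular server responds to each variant is something to verify against a server you control.

```python
def build_request(path, host=None, version="HTTP/1.1"):
    """Build a raw GET request.

    host=None omits the Host header entirely; host="" sends it with no value.
    """
    lines = [f"GET {path} {version}"]
    if host is not None:
        # rstrip() turns "Host: " into "Host:" when the value is empty.
        lines.append(f"Host: {host}".rstrip())
    lines.append("Connection: close")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


missing = build_request("/")                        # no Host line at all
empty = build_request("/", host="")                 # Host present but empty
normal = build_request("/", host="en.wikipedia.org")

for label, raw in (("missing", missing), ("empty", empty), ("normal", normal)):
    print(label, raw)
```

Per the HTTP/1.1 specification, a server is expected to answer the first variant with 400 Bad Request, while the same request sent with version="HTTP/1.0" is allowed to omit Host.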
