Identifying the Problem
Today, while restoring an old project, I repeatedly faced a Scrapy error when downloading a specific link. The error message was as follows:
[]
I had encountered this error before but didn't know how to resolve it. A Google search turned up two suggestions: some people recommended adding a UA header, while others said it was a proxy issue. I tested the proxy separately with curl and found that it was OK.
Then I checked the User-Agent and tried various values, all to no avail. These were the headers I was sending:
headers = {
    'User-Agent': "xxxxx",
    'Content-Type': "text/plain; charset=UTF-8",
    'Host': "xxxx",
    'Cache-Control': "no-cache",
}
Eventually, I found that the Host header was incorrect. After commenting out Host, data collection worked normally. I then re-captured the packets of this request and found that the Host had changed since the project was written. Since I'm not familiar with the Host header, I wrote this article to document it and to review the HTTP protocol.
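A minimal sketch of that fix, assuming the headers live in a plain dict like the one shown earlier: remove any hard-coded Host entry so the HTTP client can set it from the request URL. The helper name `without_host` and the sample values are hypothetical, not part of the original project.

```python
def without_host(headers):
    """Return a copy of `headers` with any Host entry removed.

    Header names are case-insensitive, so 'Host', 'host' and 'HOST' all match.
    """
    return {name: value for name, value in headers.items()
            if name.lower() != "host"}


# Placeholder values, mirroring the headers shown earlier.
headers = {
    "User-Agent": "xxxxx",
    "Content-Type": "text/plain; charset=UTF-8",
    "Host": "xxxx",  # stale value copied from an old packet capture
    "Cache-Control": "no-cache",
}

clean_headers = without_host(headers)
print(sorted(clean_headers))
```

The cleaned dict can then be passed as the `headers` argument of the request, for example `scrapy.Request(url, headers=clean_headers)`.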
Host Field
Explanation of Host
The domain name of the server (used for virtual hosting), as well as the Transmission Control Protocol port number the server is listening on. If the requested port corresponds to the standard port of the service, the port number can be omitted. It has been a required field since Hypertext Transfer Protocol version 1.1 (HTTP/1.1).
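To make the port rule concrete, here is a small sketch of how a client might derive the Host value from a URL, leaving the port out when it is the standard one for the scheme. The function name `host_header_for` and the `DEFAULT_PORTS` table are assumptions for illustration; real HTTP clients do this internally and may differ in detail.

```python
from urllib.parse import urlsplit

# Standard ports for the two schemes covered by this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header_for(url):
    """Build the Host header value a client would typically send for `url`."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals must keep their brackets in the Host header.
        host = f"[{host}]"
    port = parts.port
    # Omit the port when it is absent or is the standard port of the scheme.
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return host
    return f"{host}:{port}"


print(host_header_for("http://en.wikipedia.org:80/wiki/Main_Page"))  # en.wikipedia.org
print(host_header_for("http://www.6san.com:8080/"))                  # www.6san.com:8080
```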
Host Example
Host: en.wikipedia.org:80
Host: en.wikipedia.org
Status: Permanent
Source: HTTP Header Field
This article explains it clearly: "Network: An Article Explaining in Detail the Concept of the Request Header Host"
https://blog.csdn.net/netdxy/article/details/51195560
Summary
- The Host field can be a domain name or an IP address. A port number can follow the domain or IP, such as Host: www.6san.com:8080.
- The Host value can be set freely by the program, and some programs send a false Host to avoid interception by network operators or firewalls.
- In HTTP/1.1, the Host field can be empty; in HTTP/1.0, the Host field can be missing.
- HTTP response headers do not include the Host field, so the packets matched by the http.host filter in Wireshark are all requests.
- Because the Host field can be set freely by the program, its value can take many unusual forms, such as a Host header containing multiple "/" characters or ending with a ".".
In HTTP/1.1, the Host field cannot be missing, although it can be empty. If it is missing, the server returns 400 Bad Request.
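The rule can be sketched as a small check on a raw request, assuming a simplified parser that looks only at the request line and the header names. The function name `check_host` is hypothetical, and the 200 returned here just means "passed this check"; real servers validate far more than this.

```python
def check_host(raw_request: bytes) -> int:
    """Return 400 if an HTTP/1.1 request has no Host header, else 200.

    An empty Host value ("Host:") counts as present; HTTP/1.0 requests
    are allowed to omit the header altogether.
    """
    # Keep only the header section (everything before the blank line).
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    # The request line looks like "GET /path HTTP/1.1".
    version = lines[0].rsplit(" ", 1)[-1]
    header_names = [line.split(":", 1)[0].strip().lower()
                    for line in lines[1:] if ":" in line]
    if version == "HTTP/1.1" and "host" not in header_names:
        return 400
    return 200


print(check_host(b"GET / HTTP/1.1\r\n\r\n"))          # 400: Host missing
print(check_host(b"GET / HTTP/1.1\r\nHost:\r\n\r\n"))  # 200: Host present but empty
print(check_host(b"GET / HTTP/1.0\r\n\r\n"))          # 200: HTTP/1.0 may omit Host
```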