python - Why doesn't this code work with all URLs?
I'm very new to Python and am playing with some code. I'm trying to parse an HTML web page and extract some information from the parsed document:
    from urllib import request
    from bs4 import BeautifulSoup
    # some code here ...
    link = str(input("Enter url:"))
    sock = request.urlopen(link)
    pageText = sock.read()
    sock.close()
    # some code here ...
    file = open("C:/test.txt", 'w')
    file.write(pageText.decode("utf-8"))
    # some code here ...

I'm getting this error at the file.write() line, and I still have no clue how to fix it even after searching the internet:

    Traceback (most recent call last):
      File "C:/User/Monster/Processor Projects/TestProotia/TestFile", line 16, in <module>
        file.write(pageText.decode("utf-8"))
      File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 413334-413340: character maps to <undefined>
My code works perfectly for some sites, like www.google.com or www.flipkart.com, but gives an error for some URLs, like www.facebook.com and www.youtube.com. I think a possible reason it doesn't work for www.facebook.com and www.youtube.com is that they are developed in PHP or some other language and are not plain HTML web pages. Is that right?
The problem is that you are writing to a text file with the cp1252 encoding, but your data contains characters that are not present in cp1252. In Python, the open function takes an optional encoding argument for text files. The docs say that if you do not specify one:

    The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)

On Windows, the "preferred encoding" returned by that function is whatever is set as the default for your system. On US versions of Windows, if you have not changed the settings, the preconfigured default is "code page 1252", which is Microsoft's variation on IBM's variation on Latin-1. It can only handle 256 different characters (approximately, but not exactly, the same as the first 256 characters in Unicode). If your text contains any other character, you will get an error.
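To check which encoding a given machine would fall back to, and to reproduce the failure without touching the network, something like the following sketch may help. The sample string and the `fits` helper are made up for illustration; only the standard library is used.

```python
import locale

# The encoding that text-mode open() falls back to when none is given,
# per the docs quoted above (platform dependent: often cp1252 on US
# Windows, usually UTF-8 on Linux and macOS).
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# Hypothetical sample: ASCII text plus one character missing from cp1252.
sample = "plain text and \u2603"  # U+2603 SNOWMAN


def fits(text, encoding):
    """Return True if every character of text can be encoded."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(fits("plain text", "cp1252"))  # True
print(fits(sample, "cp1252"))        # False
print(fits(sample, "utf-8"))         # True
```

A page made up only of characters that cp1252 can represent behaves like the first call; a page with even one character outside it behaves like the second.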
The reason this works on some pages but not others is that some pages happen to contain nothing but common English characters, which fit in cp1252. If you really want to save a UTF-8 text file, you have to say so explicitly:
    # Write the decoded text back out as UTF-8.
    f = open('C:/test.txt', 'w', encoding='utf-8')
    f.write(pageText.decode('utf-8'))

If you want a cp1252 text file (or, rather, a file in whatever your system's default encoding is, which may be UTF-8 if you run your script on a Mac, or the Shift-JIS-based cp932 on a Japanese Windows box), with the characters that don't fit replaced by question marks, you can also do this:
    # System default encoding; unencodable characters become '?'.
    f = open('C:/test.txt', 'w', errors='replace')
    f.write(pageText.decode('utf-8'))

Or, of course, if you want cp1252 no matter what the system is set to, say:
    f = open('C:/test.txt', 'w', encoding='cp1252', errors='replace')
    f.write(pageText.decode('utf-8'))

If you just want the raw bytes exactly as they are, open the file in binary mode and don't decode the bytes in the first place:

    f = open('C:/test.txt', 'wb')
    f.write(pageText)

Of course, if you then open that file in a cp1252 (or Shift-JIS, etc.) text editor, it will look like mojibake. But that is no longer your program's fault. :)
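The effect of the errors argument can be checked in memory with str.encode, which accepts the same handler names as open(). The sketch below uses a made-up sample string and a temporary file rather than C:/test.txt; both are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical sample: 'é' exists in cp1252, the snowman (U+2603) does not.
text = "caf\u00e9 \u2603"

replaced = text.encode("cp1252", errors="replace")
ignored = text.encode("cp1252", errors="ignore")
utf8 = text.encode("utf-8")

print(replaced)  # b'caf\xe9 ?'
print(ignored)   # b'caf\xe9 '
print(utf8)      # b'caf\xc3\xa9 \xe2\x98\x83'

# The same handler applied through open(); the 'with' block closes the file.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="cp1252", errors="replace") as f:
    f.write(text)

# Read the bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    on_disk = f.read()

print(on_disk)  # b'caf\xe9 ?'
```

Both 'replace' and 'ignore' lose information, so they only make sense when the output really must be in a limited encoding; writing UTF-8 or raw bytes keeps every character.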
However, there's another problem here: you're assuming that every web page is UTF-8. That is not true. Pre-HTML5 web pages are actually Latin-1 by default, but they can specify a different encoding in the headers (or in a meta tag or, for XHTML, in the top-level XML tag). Specifically, try it with the Facebook page:

    >>> print(sock.getheader('content-type'))
    text/html; charset=utf-8

That's how you know that, in this case, the page is UTF-8.

For HTML5, determining the encoding is more involved. Ideally you would want to use a library that does this for you. (You are already using BeautifulSoup, and its "Unicode, Dammit" will work well in many common cases, and very well for pre-HTML5 pages, but a standards-correct implementation would be even better.)
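As a rough illustration of using the header instead of hard-coding UTF-8, the sketch below pulls the charset out of a Content-Type value and decodes with it. The helper names `charset_from_content_type` and `decode_page` are hypothetical, the Latin-1 fallback follows the pre-HTML5 default mentioned above, and meta tags inside the page are not inspected, so this is not a standards-correct implementation.

```python
from email.message import Message


def charset_from_content_type(content_type, default="latin-1"):
    """Return the charset parameter of a Content-Type value, or default."""
    if not content_type:
        return default
    # Let the stdlib header parser handle quoting and letter case.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_page(raw, content_type, default="latin-1"):
    """Decode raw page bytes using the declared charset, else the default."""
    charset = charset_from_content_type(content_type, default)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back rather than crash.
        return raw.decode(default, errors="replace")


# Offline check with made-up bytes instead of a real response.
print(decode_page("caf\u00e9".encode("utf-8"), "text/html; charset=utf-8"))
print(decode_page(b"caf\xe9", "text/html"))
```

With a real response from request.urlopen, the value of sock.getheader('content-type') could be passed as content_type; the response's headers object also offers get_content_charset() directly.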