windows - R: can't read unicode text files even when specifying the encoding
I am using R 3.1.1 on Windows 7 32-bit. I'm having a lot of trouble reading some text files on which I want to do text analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, says the file is "Unicode".)

The problem is that I can't read the file even when I specify that encoding. (The characters are from the standard Spanish Latin set, -ñáóé-, and should be handled easily by CP1252 or similar.)

    > Sys.getlocale()
    [1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
    > readLines("filename.txt")
    [1] "ÿþE" "" "" "" "" ...
    > readLines("filename.txt", encoding = "UTF-8")
    [1] "\xff\xfeE" "" "" "" "" ...
    > readLines("filename.txt", encoding = "UCS2LE")
    [1] "ÿþE" "" "" "" "" "" "" ...
    > readLines("filename.txt", encoding = "UCS-2")
    [1] "ÿþE" "" "" "" "" ...

Any ideas? Thank you!

Edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail in the same way.

After reading the documentation more carefully, I found the answer to my question. The encoding parameter of readLines only applies to the input strings. The documentation says:

    encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples.

The correct way to read a file with an unusual encoding is, then,

    filetext <- readLines(con <- file("UnicodeFile.txt", encoding = "UCS-2LE"))
    close(con)
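To see the difference between the two uses of the encoding, here is a minimal, self-contained sketch. It assumes a made-up two-line file written to a temporary path (neither is from the original post), and it assumes the session's native encoding can represent the Spanish characters. It uses "UTF-16LE", which R's connection code treats the same way as "UCS-2LE" for Windows "Unicode" files, including dropping a leading byte order mark.

```r
# Build a small UTF-16LE ("UCS-2 Little Endian") file with a byte order mark.
# The file contents and the temporary path are hypothetical, for illustration only.
path <- tempfile(fileext = ".txt")
payload <- iconv("Hola\nEspa\u00f1a\n", from = "UTF-8", to = "UTF-16LE",
                 toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), payload), path)

# encoding= on readLines() only marks the resulting strings; the bytes are not
# decoded, so the embedded NUL bytes cut every line short (warnings suppressed).
raw_lines <- suppressWarnings(readLines(path, encoding = "UTF-16LE"))

# Declaring the encoding on the connection makes R re-encode while reading.
con <- file(path, encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

print(lines)
unlink(path)
```

Comparing raw_lines with lines shows the same symptom as in the question: only the connection-level encoding produces readable text.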