How to Read Text Files in R
This page includes all the material you need to deal with strings in R. The section on regular expressions may be useful to understand the rest of the page, even if it is not necessary if you just need to perform some simple tasks.
This page may be useful to:
- perform statistical text analysis.
- collect data from an unformatted text file.
- deal with character variables.
In this page, we learn how to read a text file and how to use R functions for characters. There are two kinds of functions for characters: simple functions and regular expressions. Many functions are part of the standard R base package.

    help.search(keyword = "character", package = "base")

However, their names and syntax are not intuitive to all users. Hadley Wickham has developed the stringr package, which defines functions with similar behaviour but with names that are easier to remember and a much more systematic syntax[1].
- Keywords: text mining, natural language processing
- See the CRAN Task View on Natural Language Processing[2]
- See also the following packages tm, tau, languageR, scrapeR.
Reading and writing text files
R can read any text file using readLines() or scan(). It is possible to specify the encoding of the imported text file with readLines(). The entire contents of the text file can be read into an R object (e.g., a character vector). scan() is more flexible. The kind of data expected can be specified in the second argument (e.g., character(0) for a string).

    text <- readLines("file.txt", encoding = "UTF-8")
    scan("file.txt", character(0))               # separate each word
    scan("file.txt", character(0), quote = NULL) # get rid of quotes
    scan("file.txt", character(0), sep = ".")    # separate each sentence
    scan("file.txt", character(0), sep = "\n")   # separate each line

We can write the content of an R object into a text file using cat() or writeLines(). By default, cat() concatenates the elements of a vector, separated by spaces, when writing to the text file. You can change this by adding the options sep="\n" or fill=TRUE. The default encoding depends on your computer.

    cat(text, file = "file.txt", sep = "\n")
    writeLines(text, con = "file.txt", sep = "\n", useBytes = FALSE)

Before reading a text file, you can look at its properties. nlines() (parser package) and countLines() (R.utils package) count the number of lines in the file. count.chars() (parser package) counts the number of bytes and characters in each line of a file. You can also display a text file using file.show().
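The reading and writing functions described above can be tried together in a small round trip. This is a minimal sketch using only base R; the temporary file and the sample lines are illustrative assumptions, not part of the original page.

```r
# Round trip through a temporary file (path and sample text are illustrative).
path <- tempfile(fileext = ".txt")
lines_out <- c("first line", "second line", "third line")
writeLines(lines_out, con = path)

lines_in <- readLines(path, encoding = "UTF-8")             # one element per line
words_in <- scan(path, what = character(0), quiet = TRUE)   # one element per word
n_lines  <- length(lines_in)                                # line count with base R only

print(lines_in)
print(words_in)
unlink(path)                                                # remove the temporary file
```

Counting lines with length(readLines(...)) reads the whole file into memory, so it is only a convenient substitute for the package functions when the file is small.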

Character encoding
R provides functions to deal with various encoding schemes. This is useful if you deal with text files which have been created on another operating system, especially if the language is not English and has many accents and specific characters. For instance, the standard encoding scheme in Linux is "UTF-8" whereas the standard encoding scheme in Windows is "Latin1". The Encoding() function returns the encoding of a string. iconv() is similar to the unix command iconv and converts the encoding.
- iconvlist() gives the list of available encoding schemes on your computer.
- readLines(), scan() and file.show() also have an encoding option.
- is.utf8() (tau) tests if the encoding is "utf8".
- is.locale() (tau) tests if the encoding is the same as the default encoding on your computer.
- translate() (tau) translates the encoding into the current locale.
- fromUTF8() (descr) is less general than iconv().
- utf8ToInt() (base)
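As a rough illustration of the base functions listed above, the sketch below inspects one accented string. The sample string is an assumption chosen for the example, and the conversion assumes that the local iconv implementation knows the name "latin1".

```r
# "\u00e9" is an e with an acute accent; the sample string is illustrative.
x <- "H\u00e9"
code_points <- utf8ToInt(x)                 # Unicode code points of each character
bytes_utf8  <- nchar(x, type = "bytes")     # the accented letter takes two bytes in UTF-8

# Convert to Latin-1, where the same letter takes a single byte.
y <- iconv(x, from = "UTF-8", to = "latin1")
bytes_latin1 <- nchar(y, type = "bytes")

print(code_points)
print(c(utf8 = bytes_utf8, latin1 = bytes_latin1))
```

The byte counts differ although both strings show the same two characters, which is why the encoding has to be known before a file is read.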
Example
The following example was run under Windows. Thus, the default encoding is "latin1".

    > texte <- "Hé hé"
    > Encoding(texte)
    [1] "latin1"
    > texte2 <- iconv(texte, "latin1", "UTF-8")
    > Encoding(texte2)
    [1] "UTF-8"

Regular Expressions
A regular expression describes a specific pattern in a set of strings. For example, one could have the following pattern: 2 digits, 2 letters and 4 digits. R provides powerful functions to deal with regular expressions. Two types of regular expressions are used in R[3]:
- extended regular expressions, used by 'perl = FALSE' (the default),
- Perl-like regular expressions, used by 'perl = TRUE'.
There is also an option called 'fixed = TRUE', which can be considered as a literal regular expression. fixed() (stringr) is equivalent to fixed=TRUE in the standard regex functions. These functions are case sensitive by default. This can be changed by specifying the option ignore.case = TRUE.
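The effect of these options can be seen on a small vector. This is only a sketch with made-up sample strings, using grepl() from base R.

```r
x <- c("a.b", "aXb", "A.B")   # illustrative sample strings

regex_hit  <- grepl("a.b", x)                      # "." matches any character
fixed_hit  <- grepl("a.b", x, fixed = TRUE)        # "." is taken literally
nocase_hit <- grepl("a.b", x, ignore.case = TRUE)  # case-insensitive regular expression

print(rbind(regex_hit, fixed_hit, nocase_hit))
```

With fixed = TRUE only the string that really contains "a.b" matches, which is usually what is wanted when the pattern is plain text rather than a regular expression.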
If you are not a specialist in regular expressions, you may find glob2rx() useful. This function converts a "glob" (or "wildcard") pattern into the corresponding regular expression:

    > glob2rx("abc.*")
    [1] "^abc\\."

Functions which use regular expressions in R
- sub(), gsub(), str_replace() (stringr) make some substitutions in a string.
- grep(), str_extract() (stringr) extract some value.
- grepl(), str_detect() (stringr) detect the presence of a pattern.
- See also splitByPattern() (R.utils).
- See also gsubfn() in the gsubfn package.

Extended regular expressions (the default)
- "." stands for any character.
- "[ABC]" means A, B or C.
- "[A-Z]" means any upper-case letter between A and Z.
- "[0-9]" means any digit between 0 and 9.
Here is the list of metacharacters: '$ * + . ? [ ] ^ { } | ( ) \'. If you need to match one of those characters literally, precede it with a doubled backslash.
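The difference between an escaped and an unescaped metacharacter is easiest to see with gsub(). The sample string below is an assumption made for the illustration.

```r
x <- "price: 3.50$ (approx.)"   # illustrative sample string

unescaped <- gsub(".", "_", x)             # "." is a metacharacter: every character is replaced
no_dots   <- gsub("\\.", "_", x)           # "\\." matches literal dots only
cleaned   <- gsub("\\$|\\(|\\)", "", x)    # remove the dollar sign and the parentheses

print(c(unescaped, no_dots, cleaned))
```

The backslash is doubled because R first interprets the string literal and only then passes the single remaining backslash to the regular expression engine.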
Here are some classes of regular expressions:
For numbers:
- '[:digit:]' Digits: '0 1 2 3 4 5 6 7 8 9'.
For letters:
- '[:alpha:]' Alphabetic characters: '[:lower:]' and '[:upper:]'.
- '[:upper:]' Upper-case letters.
- '[:lower:]' Lower-case letters.
Note that the set of alphabetic characters includes accented letters such as é è ê, which are very common in some languages like French. Therefore, it is more general than "[A-Za-z]", which does not include letters with accents.
For other characters:
- '[:punct:]' Punctuation characters: '! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~'.
- '[:space:]' Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
- '[:blank:]' Blank characters: space and tab.
- '[:cntrl:]' Control characters.
For combinations of other classes:
- '[:alnum:]' Alphanumeric characters: '[:alpha:]' and '[:digit:]'.
- '[:graph:]' Graphical characters: '[:alnum:]' and '[:punct:]'.
- '[:print:]' Printable characters: '[:alnum:]', '[:punct:]' and space.
- '[:xdigit:]' Hexadecimal digits: '0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f'.
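In a pattern, a class name has to be placed inside a bracket expression, which is why it appears with doubled brackets such as "[[:digit:]]". The following sketch applies a few of the classes above to an ASCII sample string chosen for the illustration.

```r
x <- "Order #42:\tshipped, 2 items!"   # illustrative sample; "\t" is a tab

no_punct  <- gsub("[[:punct:]]", "", x)        # drop punctuation characters
one_space <- gsub("[[:space:]]+", " ", x)      # collapse whitespace runs into one space
digits    <- gsub("[^[:digit:]]", "", x)       # "^" inside brackets negates: keep digits only
is_hex    <- grepl("^[[:xdigit:]]+$", c("1A3f", "xyz"))

print(c(no_punct, one_space, digits))
print(is_hex)
```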
You can quantify the number of repetitions by adding one of the following after the regular expression:
- '?' The preceding item is optional and will be matched at most once.
- '*' The preceding item will be matched zero or more times.
- '+' The preceding item will be matched one or more times.
- '{n}' The preceding item is matched exactly 'n' times.
- '{n,}' The preceding item is matched 'n' or more times.
- '{n,m}' The preceding item is matched at least 'n' times, but not more than 'm' times.
You can also anchor the regular expression:
- '^' forces the regular expression to be at the start of the string.
- '$' forces the regular expression to be at the end of the string.
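Quantifiers and anchors are often combined. The sketch below uses invented sample values to show how the anchors change the result of the same quantified pattern.

```r
x <- c("7", "42", "2024", "12345", "a42")   # illustrative sample values

anchored   <- grepl("^[0-9]{2,4}$", x)   # two to four digits and nothing else
unanchored <- grepl("[0-9]{2,4}", x)     # a run of two to four digits anywhere in the string
stripped   <- sub("^0+", "", "000123")   # one or more zeros at the start of the string
optional_u <- grepl("^colou?r$", c("color", "colour", "colouur"))

print(rbind(anchored, unanchored))
print(stripped)
print(optional_u)
```

Without the anchors, "12345" and "a42" are accepted because a shorter run of digits inside them already satisfies the pattern.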
If you want to know more, have a look at the two following help files:

    > ?regexp # gives some general explanations
    > ?grep   # help file for grep(), regexpr(), sub(), gsub(), etc.

Perl-like regular expressions
It is also possible to use "perl-like" regular expressions. You just need to use the option perl=TRUE.
Examples
If you want to remove space characters in a string, you can use the \\s Perl macro. Note that sub() removes only the first match; use gsub() to remove all of them.

    sub('\\s', '', x, perl = TRUE)

See also
- Perl Programming/Regular Expressions
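Since this section is short, here is a slightly larger sketch of perl = TRUE with base R functions only. The sample strings are assumptions made for the illustration; the lookahead "(?=...)" is a Perl feature that the default extended syntax does not offer.

```r
x <- "  one   two\tthree  "   # illustrative sample string

collapsed <- gsub("\\s+", " ", x, perl = TRUE)         # collapse runs of whitespace
trimmed   <- gsub("^\\s+|\\s+$", "", x, perl = TRUE)   # remove leading and trailing whitespace

# Lookahead: digits that are directly followed by "kg" (the unit itself is not captured).
weights <- "10kg 20lb 30kg"
kg_only <- regmatches(weights, gregexpr("\\d+(?=kg)", weights, perl = TRUE))[[1]]

print(collapsed)
print(trimmed)
print(kg_only)
```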

Concatenating strings
- paste() concatenates strings.
- str_c() (stringr) does a similar job.
- cat() prints and concatenates strings.
Examples

    > paste("toto", "tata", sep = ' ')
    [1] "toto tata"
    > paste("toto", "tata", sep = ",")
    [1] "toto,tata"
    > str_c("toto", "tata", sep = ",")
    [1] "toto,tata"
    > x <- c("a", "b", "c")
    > paste(x, collapse = " ")
    [1] "a b c"
    > str_c(x, collapse = " ")
    [1] "a b c"
    > cat(c("a", "b", "c"), sep = "+")
    a+b+c

Splitting a string
- strsplit(): Split the elements of a character vector 'x' into substrings according to the matches to substring 'split' within them.
- See also str_split() (stringr).

    > unlist(strsplit("a.b.c", "\\."))
    [1] "a" "b" "c"

- tokenize() (tau) splits a string into tokens.

    > tokenize("abc defghk")
    [1] "abc"    " "      "defghk"

Counting the number of characters in a string
- nchar() gives the length of a string. Note that for non-ASCII encodings, there is more than one way to measure such a length.
- See also str_length() (stringr).

    > nchar("abcdef")
    [1] 6
    > nchar(NA)
    [1] NA
    > nchar("René")
    [1] 4
    > nchar("René", type = "bytes")
    [1] 5

Detecting the presence of a substring
Detecting a pattern in a string?
- grepl() returns a logical value (TRUE or FALSE).
- str_detect() (stringr) does a similar job.

    > string <- "23 mai 2000"
    > string2 <- "1 mai 2000"
    > regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
    > grepl(pattern = regexp, x = string)
    [1] TRUE
    > str_detect(string, regexp)
    [1] TRUE
    > grepl(pattern = regexp, x = string2)
    [1] FALSE

The first string matches the pattern. The second one does not, since it starts with a single digit whereas the pattern requires two.
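Because grepl() is vectorised, the same pattern can be used to filter a whole character vector. This is a small sketch; the extra sample strings are assumptions added for the illustration.

```r
dates  <- c("23 mai 2000", "1 mai 2000", "18 juin 2004", "no date here")  # illustrative
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"

ok      <- grepl(regexp, dates)   # one TRUE/FALSE per element
matched <- dates[ok]              # keep only the elements that match
n_match <- sum(ok)                # TRUE counts as 1

print(matched)
print(n_match)
```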
Counting the occurrence of each pattern in a string?
- textcnt() (tau) counts the occurrence of each pattern or each term in a text.

    > string <- "blabla 23 mai 2000 blabla 18 mai 2004"
    > textcnt(string, n = 1L, method = "string")
    blabla    mai
         2      2
    attr(,"class")
    [1] "textcnt"

- cpos() (cwhmisc) returns the position of a substring in a string.
- substring.location() (cwhmisc) does the same job but returns the first and the last position.

    > cpos("abcdefghijklmnopqrstuvwxyz", "p", start = 1)
    [1] 16
    > substring.location("abcdefghijklmnopqrstuvwxyz", "def")
    $first
    [1] 4
    $last
    [1] 6

- regexpr() returns the position of the regular expression. str_locate() (stringr) does the same job.
- gregexpr() is similar to regexpr() but the starting position of every match is returned. str_locate_all() (stringr) does the same job.

    > regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
    > string <- "blabla 23 mai 2000 blabla 18 mai 2004"
    > regexpr(pattern = regexp, text = string)
    [1] 8
    attr(,"match.length")
    [1] 11
    > gregexpr(pattern = regexp, text = string)
    [[1]]
    [1]  8 27
    attr(,"match.length")
    [1] 11 11
    > str_locate(string, regexp)
         start end
    [1,]     8  18
    > str_locate_all(string, regexp)
    [[1]]
         start end
    [1,]     8  18
    [2,]    27  37
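The positions returned by regexpr() and gregexpr() can be turned back into the matched text. The sketch below uses regmatches() from base R, which is not mentioned on this page, and also shows the equivalent computation by hand with substr().

```r
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
string <- "blabla 23 mai 2000 blabla 18 mai 2004"

first_pos   <- regexpr(regexp, string)
first_match <- regmatches(string, first_pos)                       # text of the first match
all_matches <- regmatches(string, gregexpr(regexp, string))[[1]]   # text of every match

# The same first match, computed from the start position and the match length.
start   <- as.integer(first_pos)
by_hand <- substr(string, start, start + attr(first_pos, "match.length") - 1)

print(first_match)
print(all_matches)
```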
- substr() takes a substring.
- str_sub() (stringr) is similar.

    > substr("simple text", 1, 3)
    [1] "sim"
    > str_sub("simple text", 1, 3)
    [1] "sim"

- first.word() (Hmisc) returns the first word of a string or expression.

    > first.word("abc def ghk")
    [1] "abc"

- grep() returns the elements that match the regular expression if value=T and their positions if value=F.

    > grep(pattern = regexp, x = string, value = T)
    [1] "23 mai 2000"
    > grep(pattern = regexp, x = string2, value = T)
    character(0)
    > grep(pattern = regexp, x = string, value = F)
    [1] 1
    > grep(pattern = regexp, x = string2, value = F)
    integer(0)

- str_extract(), str_extract_all(), str_match(), str_match_all() (stringr) and m() (caroline package) are similar to grep(). str_extract() returns a character vector and str_extract_all() a list of character vectors. str_match() returns a matrix, str_match_all() a list of matrices, and m() a data frame.

    > library("stringr")
    > regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
    > string <- "blabla 23 mai 2000 blabla 18 mai 2004"
    > str_extract(string, regexp)
    [1] "23 mai 2000"
    > str_extract_all(string, regexp)
    [[1]]
    [1] "23 mai 2000" "18 mai 2004"
    > str_match(string, regexp)
         [,1]          [,2] [,3]  [,4]
    [1,] "23 mai 2000" "23" "mai" "2000"
    > str_match_all(string, regexp)
    [[1]]
         [,1]          [,2] [,3]  [,4]
    [1,] "23 mai 2000" "23" "mai" "2000"
    [2,] "18 mai 2004" "18" "mai" "2004"
    > library("caroline")
    > m(pattern = regexp, vect = string, names = c("day", "month", "year"), types = rep("character", 3))
      day month year
    1  18   mai 2004

- Named capture regular expressions can be used to define column names in the regular expression (this also serves to document the regular expression). Install the namedCapture package via devtools::install_github("tdhock/namedCapture") to use str_match_all_named(). It uses the base function gregexpr(perl=TRUE) to parse a Perl-Compatible Regular Expression, and returns a list of match matrices with column names:

    > named.regexp <- paste0(
    +   "(?<day>[[:digit:]]{2})",
    +   " ",
    +   "(?<month>[[:alpha:]]+)",
    +   " ",
    +   "(?<year>[[:digit:]]{4})")
    > namedCapture::str_match_all_named(string, named.regexp)
    [[1]]
         day  month year
    [1,] "23" "mai" "2000"
    [2,] "18" "mai" "2004"

Making some substitution inside a string
Substituting a pattern in a string
- sub() makes a substitution.
- gsub() is similar to sub() but replaces all occurrences of the pattern, whereas sub() only replaces the first occurrence.
- str_replace() (stringr) is similar to sub(), str_replace_all() (stringr) is similar to gsub().
In the following example, we have a French date. The regular pattern is the following: 2 digits, a blank, some letters, a blank, 4 digits. We capture the 2 digits with the [[:digit:]]{2} expression, the letters with [[:alpha:]]+ and the 4 digits with [[:digit:]]{4}. Each of these three substrings is surrounded with parentheses. The first substring is stored in "\\1", the second one in "\\2" and the third one in "\\3".
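As a further sketch of the capture groups just described, several back-references can be combined in one replacement, for instance to reorder the parts of the date. The output formats chosen here are assumptions made for the illustration.

```r
string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"

reordered  <- sub(regexp, "\\3-\\2-\\1", string)   # year-month-day
month_year <- sub(regexp, "\\2 \\3", string)       # drop the day

print(reordered)
print(month_year)
```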

    string <- "23 mai 2000"
    regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
    sub(pattern = regexp, replacement = "\\1", x = string) # returns the first part of the regular expression
    sub(pattern = regexp, replacement = "\\2", x = string) # returns the second part
    sub(pattern = regexp, replacement = "\\3", x = string) # returns the third part

In the following example, we compare the outcome of sub() and gsub(). The first one removes the first space whereas the second one removes all spaces in the text.

    > text <- "abc def ghk"
    > sub(pattern = " ", replacement = "", x = text)
    [1] "abcdef ghk"
    > gsub(pattern = " ", replacement = "", x = text)
    [1] "abcdefghk"

Substituting characters in a string?
- chartr() substitutes characters in an expression. It stands for "character translation".
- replacechar() (cwhmisc) does the same job,
- as does str_replace_all() (stringr).

    > chartr(old = "a", new = "o", x = "baba")
    [1] "bobo"
    > chartr(old = "ab", new = "ot", x = "baba")
    [1] "toto"
    > replacechar("abc.def.ghi.jkl", ".", "_")
    [1] "abc_def_ghi_jkl"
    > str_replace_all("abc.def.ghi.jkl", "\\.", "_")
    [1] "abc_def_ghi_jkl"

Converting letters to lower or upper-case
- tolower() converts upper-case characters to lower-case.
- toupper() converts lower-case characters to upper-case.
- capitalize() (Hmisc) capitalizes the first letter of a string.
- See also cap(), capitalize(), lower(), lowerize() and CapLeading() in the cwhmisc package.

    > tolower("ABCdef")
    [1] "abcdef"
    > toupper("ABCdef")
    [1] "ABCDEF"
    > capitalize("abcdef")
    [1] "Abcdef"

Filling a string with some character
- padding() (cwhmisc) fills a string with some characters to fit a given length. See also str_pad() (stringr).
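Padding can also be done without any extra package. The sketch below is not taken from the original page: it uses formatC() and sprintf() from base R, both of which are vectorised, on sample values chosen for the illustration.

```r
nums <- c(1, 11, 111, 1111)   # illustrative sample values

zero_padded  <- formatC(nums, width = 3, flag = "0")   # pad numbers with leading zeros
zero_sprintf <- sprintf("%03d", as.integer(nums))      # same result with a format string
right_just   <- formatC("abc", width = 10)             # blanks added on the left
left_just    <- formatC("abc", width = 10, flag = "-") # blanks added on the right

print(zero_padded)
print(c(right_just, left_just))
```

Values that are already longer than the requested width, such as 1111 here, are returned unchanged rather than truncated.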

    > library("cwhmisc")
    > padding("abc", 10, " ", "center") # adds blanks such that the length of the string is 10.
    [1] " abc "
    > str_pad("abc", width = 10, side = "both", pad = "+")
    [1] "+++abc++++"
    > str_pad(c("1", "11", "111", "1111"), 3, side = "left", pad = "0")
    [1] "001"  "011"  "111"  "1111"

Note that str_pad() is very slow. For instance, for a vector of length 10,000, the computing time is very long. padding() does not seem to handle character vectors, but the best solution may be to use the sapply() and padding() functions together.

    > library("stringr")
    > library("cwhmisc")
    > a <- rep(1, 10^4)
    > system.time(b <- str_pad(a, 3, side = "left", pad = "0"))
       user  system elapsed
     50.968   0.208  73.322
    > system.time(c <- sapply(a, padding, space = 3, with = "0", to = "left"))
       user  system elapsed
      7.700   0.020  12.206

Removing leading and trailing spaces
- trimws() (memisc package) trims leading and trailing white spaces.
- trim() (gdata package) does the same job.
- See also str_trim() (stringr).
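Recent versions of R also ship a trimws() function in the base package, so no extra package is strictly needed. The sketch below assumes R 3.2.0 or later and uses a sample string chosen for the illustration; the last line shows an equivalent regular expression.

```r
x <- "  abc def  "   # illustrative sample string

both  <- trimws(x)                    # remove leading and trailing whitespace
left  <- trimws(x, which = "left")    # leading whitespace only
right <- trimws(x, which = "right")   # trailing whitespace only
regex <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)   # same result as 'both' with gsub()

print(c(both, left, right))
```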

    > library("memisc")
    > trimws(" abc def ")
    [1] "abc def"
    > library("gdata")
    > trim(" abc def ")
    [1] "abc def"
    > str_trim(" abd def ")
    [1] "abd def"

Comparing two strings
Assessing if they are identical
- == returns TRUE if both strings are the same and FALSE otherwise.

    > "abc" == "abc"
    [1] TRUE
    > "abc" == "abd"
    [1] FALSE

Computing distance between strings
A few packages implement the Levenshtein distance between two strings:
- adist() in base package utils
- stringMatch() in MiscPsycho
- stringdist() in stringdist
- levenshteinDist() in RecordLinkage
A benchmark comparing the speed of levenshteinDist() and stringdist() is available here: [1].
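Of the functions listed above, adist() needs no extra package. The following sketch computes all pairwise distances between two small vectors; the sample words and the normalisation by the longer string are assumptions made for the illustration, not a convention imposed by adist().

```r
a <- c("test", "kitten")      # illustrative sample words
b <- c("tester", "sitting")

d <- adist(a, b)              # 2 x 2 matrix: one row per element of a, one column per element of b

# One possible normalisation: divide by the length of the longer string of each pair.
longest    <- outer(nchar(a), nchar(b), pmax)
normalised <- d / longest

same_ignoring_case <- adist("LAZY", "lazy", ignore.case = TRUE)

print(d)
print(round(normalised, 2))
```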
Example with utils

    > adist("test", "tester")
    [1] 2

Example with MiscPsycho
stringMatch() (MiscPsycho) computes the Levenshtein distance. If normalize="YES", the Levenshtein distance is divided by the maximum length of each string.

    > library("MiscPsycho")
    > stringMatch("test", "tester", normalize = "NO", penalty = 1, case.sensitive = TRUE)
    [1] 2

Approximate matching
agrep() searches for approximate matches using the Levenshtein distance.
- If 'value = TRUE', this returns the value of the string.
- If 'value = FALSE', this returns the position of the string.
- max (short for max.distance) sets the maximal Levenshtein distance allowed for a match.

    > agrep(pattern = "laysy", x = c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
    [1] "1 lazy"
    > agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 3, value = TRUE)
    [1] "1 lazy"

Miscellaneous
- deparse(): Turn unevaluated expressions into character strings.
- char.expand() (base) expands a string with respect to a target.
- pmatch() (base) and charmatch() (base) seek matches for the elements of their first argument among those of their second.

    > pmatch(c("a", "b", "c", "d"), table = c("b", "c"), nomatch = 0)
    [1] 0 1 2 0

- make.unique() makes character strings unique. This is useful if you want to use a string as an identifier in your data.

    > make.unique(c("a", "a", "a"))
    [1] "a"   "a.1" "a.2"

References
- [1] Hadley Wickham, "stringr: modern, consistent string processing", The R Journal, December 2010, Vol 2/2, http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
- [2] http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
- [3] In old versions (< 2.10) we also had basic regular expressions in R:
  - extended regular expressions, used by extended = TRUE (the default),
  - basic regular expressions, as used by extended = FALSE (obsolete in R 2.10).
  Since basic regular expressions (extended = FALSE) are now obsolete, the extended option is obsolete in version 2.11.
Source: https://en.wikibooks.org/wiki/R_Programming/Text_Processing