03. Exploring the Corpus - Preprocessing and Simple Transforms


1. Exploring the Corpus

Use inspect() to check whether the document data was loaded correctly.

> inspect(docs[2])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
NULL
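As a minimal sketch of the same check on a corpus whose contents are known in advance, the snippet below builds a small in-memory corpus and inspects one document. The texts vector is hypothetical and exists only for illustration; content() comes from the NLP package that tm attaches.

```r
library(tm)

# Hypothetical three-line corpus, used only for illustration
texts <- c("First document.", "Second document.", "Third document.")
docs <- VCorpus(VectorSource(texts))

n_docs <- length(docs)        # number of documents in the corpus
inspect(docs[2])              # summary of a one-document sub-corpus
second <- content(docs[[2]])  # raw text of the second document
second
```

Comparing length(docs) and a few content() values against the source files is a quick way to confirm that nothing was dropped while loading.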

2. Preparing the Corpus

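Depending on the analysis, the text may need preprocessing first. getTransformations() lists the transformations that tm provides out of the box, such as removing numbers, punctuation, and extra whitespace. Lowercase conversion does not appear in this list; it is applied by wrapping base R's tolower() in content_transformer().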

> getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"

Transformations are applied with tm_map(), which the next section walks through.
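A minimal sketch of how tm_map() chains these transformations, assuming a one-document corpus built from a hypothetical sample sentence:

```r
library(tm)

# Hypothetical sample text, used only for illustration
docs <- VCorpus(VectorSource("We have 173 schools,   and 25 ELEMENTARY schools!"))

docs <- tm_map(docs, content_transformer(tolower))  # base R function, wrapped for tm
docs <- tm_map(docs, removeNumbers)                 # built-in: drop digits
docs <- tm_map(docs, removePunctuation)             # built-in: drop punctuation
docs <- tm_map(docs, stripWhitespace)               # built-in: collapse runs of whitespace

cleaned <- content(docs[[1]])
cleaned
```

Running stripWhitespace last cleans up the gaps left behind by the removed numbers and punctuation.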

3. Simple Transforms

The following example gives a quick walkthrough.

> library(NLP)
> library(tm) # load tm
> con <- file("./corpus/txt/corpustext.txt")
> lines <- readLines(con)
> close(con)
> rm(con)
> head(lines, 10) # show the first 10 lines
[1] "(Sample 1)"
[2] "STRICKLAND: Good morning@."
[3] "Marsha is on her way. @She called from the car phone I think. It sounded like the car phone, to let us know that she would be delayed."
[4] "I would like to welcome @two people who haven't been with us before."
[5] "Suzanne Clewell, we're @delighted to have you with us today. Suzanne, would you tell us a little bit about what you do?"
[6] "CLEWELL: Yes. I'm the @Coordinator for Reading Language Arts with the Montgomery County Public Schools which is the suburban district surrounding Washington. We have 173 schools and 25 elementary schools."
[7] "It's great to be here."
[8] "STRICKLAND: And I'll skip over to another member of the committee, but for her, this is her first meeting, too, Judith Langer. I think we all know her work, if we didn't know her."
[9] "Judith."
[10] "LANGER@: Hello. I'm delighted to be here." # The example below removes the @ characters visible in this output.
> lines <- head(lines, 10)
> doc <- Corpus(VectorSource(lines))
> summary(doc)
Length Class Mode
1 2 PlainTextDocument list
2 2 PlainTextDocument list
3 2 PlainTextDocument list
4 2 PlainTextDocument list
5 2 PlainTextDocument list
6 2 PlainTextDocument list
7 2 PlainTextDocument list
8 2 PlainTextDocument list
9 2 PlainTextDocument list
10 2 PlainTextDocument list
> doc[[1]]
<<PlainTextDocument (metadata: 7)>>
(Sample 1)
> doc[[2]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: Good morning@. # <- the @ is visible here
> doc[[3]]
<<PlainTextDocument (metadata: 7)>>
Marsha is on her way. @She called from the car phone I think. It sounded like the car phone, to let us know that she would be delayed.
> doc[[4]]
<<PlainTextDocument (metadata: 7)>>
I would like to welcome @two people who haven't been with us before.
> inspect(doc[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
(Sample 1)

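> toSpace <- content_transformer(function(x, pattern) gsub(pattern, "", x)) # content_transformer() wraps a function so that it modifies the content of an R object; despite its name, toSpace replaces the pattern with an empty string, i.e. deletes it
> doc <- tm_map(doc, toSpace, "@")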
> inspect(doc[2])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
STRICKLAND: Good morning. # the @ is gone
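The same idea can be sketched as a self-contained script that also verifies the result across every document instead of inspecting them one by one. The helper name removePattern is hypothetical, and fixed = TRUE is an added assumption so that the pattern is treated as a literal string rather than a regular expression.

```r
library(tm)

# Two of the sample lines from the transcript, used as a small test corpus
lines <- c("STRICKLAND: Good morning@.",
           "LANGER@: Hello. I'm delighted to be here.")
doc <- VCorpus(VectorSource(lines))

# Hypothetical helper: delete every literal occurrence of `pattern`
removePattern <- content_transformer(
  function(x, pattern) gsub(pattern, "", x, fixed = TRUE)
)
doc <- tm_map(doc, removePattern, "@")

# Collect the text of every document and check that no "@" is left
texts <- sapply(doc, content)
texts
leftover <- any(grepl("@", texts, fixed = TRUE))
leftover
```

Replacing the pattern with " " instead of "" would keep neighbouring words from being joined when the removed character sits between them; a later stripWhitespace pass can then tidy up the extra spaces.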


