[NLP] Text Preprocessing :: Tokenization

Python/Other 2020. 2. 27.

[ Text Preprocessing ]

Tokenization

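- Stripping out all punctuation and special characters can leave tokens that no longer carry their meaning.

- Points to watch out for

1. Punctuation and special characters should not simply be removed.

1) The word itself contains punctuation: Ph.D., KT&G

2) The special character carries meaning: $531, 17/05/31

2. Contractions and words that contain a space

1) Contractions: I'm = I am

2) A single word that contains a space: New York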
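
The caveats above can be illustrated with a small standard-library sketch. The sample sentence and the regular expressions are assumptions made up for this illustration; they are not part of NLTK and cover only the token shapes mentioned in the list.

```python
import re

# Hypothetical sample sentence that contains the problem cases listed above.
text = "Ph.D. students at KT&G paid $531 on 17/05/31."

# Naive approach: turn every non-alphanumeric character into a space.
naive = re.sub(r"[^A-Za-z0-9\s]", " ", text).split()

# Pattern-based approach: try the "special" token shapes first,
# then fall back to ordinary words. These patterns are illustrative only.
TOKEN_PATTERN = re.compile(
    r"""
    \d{2}/\d{2}/\d{2}           # dates such as 17/05/31
  | \$\s?\d+(?:\.\d+)?          # prices with a dollar sign
  | (?:[A-Za-z]+\.){2,}         # dotted abbreviations such as Ph.D.
  | [A-Za-z]+(?:&[A-Za-z]+)+    # names joined by an ampersand, such as KT&G
  | \w+                         # ordinary words
    """,
    re.VERBOSE,
)
careful = TOKEN_PATTERN.findall(text)

print(naive)    # Ph.D., KT&G, the price and the date are all broken apart
print(careful)  # the same items survive as single tokens
```

With the naive rule, Ph.D. becomes Ph and D, KT&G becomes KT and G, and the date falls apart into three numbers. A real tokenizer needs far more rules than this, which is one reason to rely on a library.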

- NLTK : provides tools for tokenizing English corpora

★ How should Don't and Mizy's be tokenized when a word contains an apostrophe?

Don't / Don t / Dont / Do n't

Mizy's / Mizy s / Mizy / Mizys
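
As a rough sketch, each candidate above corresponds to a different policy for the apostrophe. The function name `apostrophe_candidates` and the policy labels are hypothetical, chosen only for this illustration. The one remaining candidate for Mizy's, plain Mizy, would mean dropping the possessive entirely and is not generated here.

```python
import re

def apostrophe_candidates(word):
    """Return four ways a tokenizer might treat a word that contains an apostrophe."""
    return {
        # Leave the word untouched.
        "keep": word,
        # Treat the apostrophe as a separator.
        "split_on_apostrophe": word.replace("'", " "),
        # Delete the apostrophe and keep the letters together.
        "drop_apostrophe": word.replace("'", ""),
        # Split off a trailing clitic (n't, 's, 'm, ...) as its own token.
        "split_clitic": re.sub(r"(n't|'\w+)$", r" \1", word),
    }

for word in ("Don't", "Mizy's"):
    print(word, "->", apostrophe_candidates(word))
```

Which policy is right depends on the downstream task, so it is worth checking how a given tokenizer behaves before using it.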

word_tokenize

- Splits by meaning: contractions and possessives are separated into meaningful units (Do + n't, Mizy + 's)

from nltk.tokenize import word_tokenize  

print(word_tokenize("Don't you know that? Mizy's ice cream store has moved."))

['Do', "n't", 'you', 'know', 'that', '?', 'Mizy', "'s", 'ice', 'cream', 'store', 'has', 'moved', '.']

WordPunctTokenizer

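- Treats punctuation as separate tokens

- Splits at every punctuation mark, so the apostrophe itself becomes a token (Don + ' + t)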

from nltk.tokenize import WordPunctTokenizer  

print(WordPunctTokenizer().tokenize("Don't you know that? Mizy's ice cream store has moved."))

['Don', "'", 't', 'you', 'know', 'that', '?', 'Mizy', "'", 's', 'ice', 'cream', 'store', 'has', 'moved', '.']
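
The behaviour shown above can be approximated without NLTK. The NLTK documentation describes WordPunctTokenizer as splitting with the regular expression `\w+|[^\w\s]+`; the sketch below applies that pattern with the standard `re` module. The function name `word_punct_tokenize` is made up for this example.

```python
import re

def word_punct_tokenize(text):
    # A token is either a run of word characters,
    # or a run of characters that are neither word characters nor whitespace.
    return re.findall(r"\w+|[^\w\s]+", text)

sentence = "Don't you know that? Mizy's ice cream store has moved."
print(word_punct_tokenize(sentence))
```

Because the apostrophe is not a word character, it always ends up as its own token, which explains the Don / ' / t split.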

text_to_word_sequence

- Converts all letters to lowercase

- Removes punctuation such as . , ? ! but keeps the apostrophe (')

from tensorflow.keras.preprocessing.text import text_to_word_sequence

print(text_to_word_sequence("Don't you know that? Mizy's ice cream store has moved."))

["don't", 'you', 'know', 'that', "mizy's", 'ice', 'cream', 'store', 'has', 'moved']
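
The two rules above (lowercase, then filter) are simple enough to sketch in plain Python. This is a minimal approximation, not the Keras implementation: the function name `simple_word_sequence` is hypothetical, and the filter string is the default documented for `text_to_word_sequence`, copied here as an assumption.

```python
# Assumed default filter characters; note that the apostrophe is not among them.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character, then split.
    table = str.maketrans({ch: split for ch in filters})
    return [token for token in text.translate(table).split(split) if token]

print(simple_word_sequence("Don't you know that? Mizy's ice cream store has moved."))
print(simple_word_sequence("My email address is mizykk@icloud.com"))
```

Since both @ and . are in the filter list, the second call breaks the e-mail address into three pieces. That is a good reminder that this style of tokenization suits plain prose better than text containing addresses, prices, or dates.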

2. Sentence Tokenization

- A period can appear in places other than the end of a sentence.

ex) I'm going to see Mr. Jae. My e-mail address is mizykk@icloud.com, so contact me there.

sent_tokenize

from nltk.tokenize import sent_tokenize

text="She dreamed of autumn. Of chilly autumn winds and soft fall rains. She could even feel the cool moisture as the rain drops touched her face and ran down her cheeks."
print(sent_tokenize(text))

['She dreamed of autumn.', 'Of chilly autumn winds and soft fall rains.', 'She could even feel the cool moisture as the rain drops touched her face and ran down her cheeks.']

from nltk.tokenize import sent_tokenize

text="I'm going to Mr. Jae. My email address is mizykk@icloud.com so please contact here."
print(sent_tokenize(text))

["I'm going to Mr. Jae.", 'My email address is mizykk@icloud.com so please contact here.']
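
For contrast, here is what a hand-written rule does with the same text. This is a minimal standard-library sketch; the abbreviation list and both function names are assumptions made up for the illustration. sent_tokenize itself relies on NLTK's pre-trained Punkt model rather than on a fixed list like this one.

```python
import re

# Hypothetical and far from complete list of abbreviations that end with a period.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "ph.d."}

def naive_split(text):
    # Split after ., ! or ? whenever whitespace follows.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text):
    sentences = []
    for piece in naive_split(text):
        # If the previous piece ended with a known abbreviation,
        # the split was a false alarm, so glue this piece back on.
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

text = "I'm going to Mr. Jae. My email address is mizykk@icloud.com so please contact here."
print(naive_split(text))               # wrongly breaks after "Mr."
print(abbreviation_aware_split(text))  # keeps "Mr. Jae." together
```

The period inside the e-mail address causes no split in either version because no whitespace follows it, but the naive rule still cuts the first sentence in two after Mr.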

3. Part-of-Speech Tagging

- Words with the same spelling can mean very different things depending on their part of speech.

ex) run : a run (noun) / to run (verb)

from nltk.tokenize import word_tokenize

text="I'm actively looking for Ph.D. students. and you are a Ph.D. student."
print(word_tokenize(text))

['I', "'m", 'actively', 'looking', 'for', 'Ph.D.', 'students', '.', 'and', 'you', 'are', 'a', 'Ph.D.', 'student', '.']

from nltk.tag import pos_tag

x=word_tokenize(text)
print(pos_tag(x))

[('I', 'PRP'), ("'m", 'VBP'), ('actively', 'RB'), ('looking', 'VBG'), ('for', 'IN'), ('Ph.D.', 'NNP'), ('students', 'NNS'), ('.', '.'), ('and', 'CC'), ('you', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('Ph.D.', 'NNP'), ('student', 'NN'), ('.', '.')]

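- POS tag set used by NLTK : Penn Treebank POS Tags

PRP	personal pronoun
VBP	verb, non-3rd person singular present
RB	adverb
VBG	verb, gerund or present participle
IN	preposition or subordinating conjunction
NNP	proper noun, singular
NNS	noun, plural
NN	noun, singular or mass
CC	coordinating conjunction
DT	determiner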
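
Once tokens are tagged, the tags can be used to filter or annotate them. The sketch below works on the tagged output shown earlier, copied in by hand so that it runs without NLTK. The dictionary `TAG_DESCRIPTIONS` and the function `describe` are hypothetical helpers, not part of NLTK.

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [
    ('I', 'PRP'), ("'m", 'VBP'), ('actively', 'RB'), ('looking', 'VBG'),
    ('for', 'IN'), ('Ph.D.', 'NNP'), ('students', 'NNS'), ('.', '.'),
    ('and', 'CC'), ('you', 'PRP'), ('are', 'VBP'), ('a', 'DT'),
    ('Ph.D.', 'NNP'), ('student', 'NN'), ('.', '.'),
]

# Short descriptions for the Penn Treebank tags that appear in this example.
TAG_DESCRIPTIONS = {
    'PRP': 'personal pronoun',
    'VBP': 'verb, non-3rd person singular present',
    'RB': 'adverb',
    'VBG': 'verb, gerund or present participle',
    'IN': 'preposition or subordinating conjunction',
    'NNP': 'proper noun, singular',
    'NNS': 'noun, plural',
    'NN': 'noun, singular or mass',
    'CC': 'coordinating conjunction',
    'DT': 'determiner',
}

def describe(tagged_tokens):
    # Fall back to the raw tag when no description is known (e.g. punctuation).
    return [(word, TAG_DESCRIPTIONS.get(tag, tag)) for word, tag in tagged_tokens]

# Penn Treebank noun tags start with NN and verb tags start with VB.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(describe(tagged)[:3])
print(nouns)
print(verbs)
```

Filtering on a tag prefix is a common shortcut because the noun and verb tags share their first two letters.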
