Skip to content

Commit

Permalink
preprocessing
Browse files Browse the repository at this point in the history
  • Loading branch information
SangminJe committed Jun 19, 2021
1 parent 2ce0542 commit dd26f1a
Show file tree
Hide file tree
Showing 5 changed files with 2,022 additions and 0 deletions.
140 changes: 140 additions & 0 deletions study-presentation/je-june-19/Regex.rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
---
title: "Regex_정규표현식"
author: "SangminJe"
date: '2021 6 19 '
output:
html_document:
number_sections: true
fig_caption: true
toc: true
fig_width: 5
fig_height: 4
theme: cosmo
highlight: tango
code_folding: show
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
fig.align = "center")
```

# REGEX {.tabset .tabset-fade}

예시문
Hi there, Nice to meet you. And Hello there and hi.
I love grey(gray) color not a gry, graay and graaay.
Ya ya YaYaYa Ya
abcdefghijklmnopqrstuvwxyz
ABSCEFGHIJKLMNOPQRSTUVWZYZ
1234567890
.[]{}()\^$|?*+
010-898-0893
010-405-3412
02-878-8888
[email protected]
[email protected]
[email protected]
https://www.youtu.be/-ZClicWm0zM
https://youtu.be/-ZClicWm0zM
youtu.be/-ZClicWm0zM

- [정규표현식 연습링크](https://regexr.com/5mhou)
- [엘리의 드림코딩 - 정규표현식](https://www.youtube.com/watch?v=t3M6toIflyQ)


## Groups and ranges

- **\|** : OR
- Hi|Hello : Hi나 Hello에 매칭되는 문자열을 반환
- **\(\)** : 그룹
- (Hi|Hello)|(And) :Hi와 Hello는 Group1에 매칭되며, And는 Group2에 매칭됨
- gr(e|a)y : grey나 gray 모두 매칭되어 반환
- gr(?:e|a)y : 매칭은 되지만 Group1, Group2는 지정이 되지 않음
- **\[\]** 괄호 안의 어떤 문자든 반환
- gr[abcdef]y / gr[a-f]y : 대괄호 안의 모든 문자열을 반환한다
- [a-zA-Z0-9] : a~z 까지의 문자와, A~Z까지의 문자와 0~9가 들어간 모든 숫자를 반환
- **\^** : NOT
- [^a-zA-Z0-9] : 영어와 숫자 빼고 모든 문자를 반환

## Quantifier

- **\?** : 없거나 있거나
- gra?y : gray, gry 모두 반환
- **\*** : 없거나 있거나 많거나
- gra*y : gray, graaay, gry 모두 반환
- **\+** : 한개 이상
- gra+y : gray, graaay만 추가
- **\{n\}** : n개만 특정해서 반환
- gra{2}y : graay만 반환
- **{min,max}** : min ~ max개 사이의 문자열을 반환
- gra{2,3}y : graay, graaay를 반환
- gra{2,}y : graay, graaay를 반환 (최소값만 지정하여 2개 이상인 것들을 반환)

## Boundary type

- **\\b** : 단어 경계
- \\bYa : 단어 앞에서 쓰이는 Ya를 반환함
- Ya\\b : 단어 뒤에서만 쓰이는 Ya를 반환
- **\\B** : 단어 경계의 반대
- \\BYa : 단어 앞에서 쓰이는 Ya는 빼고 반환함
- Ya\\B : 단어 뒤에서만 쓰이는 Ya는 빼고 반환
- **\^,\$** : 문장 경계
- ^Ya : 문장에서 시작하는 Ya 반환
- Ya$ : 문장에서 끝나는 Ya 반환

## Character Classes

- **.** : 모든 문자열
- . : 모든 문자열을 선택 가능
- **\\** : Escape
- **\\d** : 숫자를 모두 반환
- **\\D** : 숫자를 제외하고 모두 반환
- **\\w** : 모든 word
- **\\W** : 모든 문자를 제외하고 반환
- **\\s** : space 반환
- **\\S** : space를 제외하고 반환


## Exercise

```{r}
library(tidyverse)
```


1. 전화번호만 선택하기

```{r}
sentence <- "Hi there, Nice to meet you. And Hello there and hi. I love grey(gray) color not a gry, graay and graaay. Ya ya YaYaYa Ya .[]{}()^$|?*+ 010-898-0893 010-405-3412 02-878-8888 [email protected] [email protected] [email protected] https://www.youtu.be/-ZClicWm0zM https://youtu.be/-ZClicWm0zM youtu.be/-ZClicWm0zM"
print(sentence)
str_extract_all(sentence, "[0-9]+[\\-|\\.][0-9]+[\\-|\\.][0-9]+") %>% unlist()
str_extract_all(sentence, '\\d{2,3}[.-]\\d{3,4}[.-]\\d{3,4}') %>% unlist()
```

2. 이메일 주소 가져오기

```{r}
str_extract_all(sentence,
"[a-zA-Z0-9._+-]+@[a-zA-Z0-9-_]+\\.[a-zA-Z0-9.]+") %>%
unlist()
```

3. 인터넷 주소 가져오기

```{r}
# 유튜브 아이디 가져오기
str_extract_all(sentence, "(https?:\\/\\/)?(www.)?youtu.be\\/([a-zA-Z0-9-]+)")
# 뒤에 유튜브 ID만 추출
str_match(sentence, "(?:https?:\\/\\/)?(?:www.)?youtu.be\\/([a-zA-Z0-9-]+)") %>% .[2]
```
51 changes: 51 additions & 0 deletions study-presentation/je-june-19/data/sample_submission.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
일자,중식계,석식계
2021-01-27,0,0
2021-01-28,0,0
2021-01-29,0,0
2021-02-01,0,0
2021-02-02,0,0
2021-02-03,0,0
2021-02-04,0,0
2021-02-05,0,0
2021-02-08,0,0
2021-02-09,0,0
2021-02-10,0,0
2021-02-15,0,0
2021-02-16,0,0
2021-02-17,0,0
2021-02-18,0,0
2021-02-19,0,0
2021-02-22,0,0
2021-02-23,0,0
2021-02-24,0,0
2021-02-25,0,0
2021-02-26,0,0
2021-03-02,0,0
2021-03-03,0,0
2021-03-04,0,0
2021-03-05,0,0
2021-03-08,0,0
2021-03-09,0,0
2021-03-10,0,0
2021-03-11,0,0
2021-03-12,0,0
2021-03-15,0,0
2021-03-16,0,0
2021-03-17,0,0
2021-03-18,0,0
2021-03-19,0,0
2021-03-22,0,0
2021-03-23,0,0
2021-03-24,0,0
2021-03-25,0,0
2021-03-26,0,0
2021-03-29,0,0
2021-03-30,0,0
2021-03-31,0,0
2021-04-01,0,0
2021-04-02,0,0
2021-04-05,0,0
2021-04-06,0,0
2021-04-07,0,0
2021-04-08,0,0
2021-04-09,0,0
Loading

0 comments on commit dd26f1a

Please sign in to comment.