SQL 실습 ③ 데이터 가공하기

2023. 3. 7. 00:58

※ PostgreSQL 프로그램과 Dbeaver 툴로 연습한 기록

→ 실행하면서 익혀보자!

2021년 6월 1일 하루동안의 주문 (자체 연습 데이터)

gmv_trend : 온라인 쇼핑몰거래액 테이블

online_order : 주문 테이블

item : 상품 테이블

category : 카테고리 테이블

user_info : 유저 테이블

1. 원하는 형식으로 컬럼 가공하기

a) 숫자를 문자열로 바꿔주기

select dt, cast(dt as varchar) as yyyymmdd
from online_order oo

→ dt를 '연도'+'월'+'일' 가변형 문자열로 타입 바꿔주기

b) 문자열 컬럼에서 일부만 잘라내기

select dt, left(cast(dt as varchar), 4) as yyyy,
substring(cast(dt as varchar),5,2) as mm,
right(cast(dt as varchar), 2) as dd
from online_order oo

→ dt로 '연도' / '월' / '일' 만 분리한 컬럼 만들기

left(컬럼명, n) : 컬럼값을 왼쪽 기준으로 n개를 가져옴
substring(컬럼명,n,m) : 컬럼값을 왼쪽 기준으로 n번째부터 m개의 글자를 가져옴
right(컬럼명, n) : 컬럼값을 오른쪽 기준으로 n개를 가져옴

c) yyyy-mm-dd 형식으로 이어주기

방법 ① ( concat함수 이용하기)

select dt, 
concat(
left(cast(dt as varchar), 4), '-',
substring(cast(dt as varchar),5,2), '-',
right(cast(dt as varchar), 2) ) as yyyymmdd
from online_order oo

2021-06-01 출력

방법 ② ( '||' 이용하기 )

select dt, 
left(cast(dt as varchar), 4) || '-' ||
substring(cast(dt as varchar),5,2) || '-' ||
right(cast(dt as varchar), 2) as yyyymmdd
from online_order oo

2021-06-01 출력

d) null 값인 경우 임의값으로 바꿔주기

select oo.userid, coalesce(oo.userid,0)
from online_order oo 
left join user_info ui on oo.userid = ui.userid

→ coalesce(컬럼값, value) : 컬럼의 값이 null이면 value값으로 전환(컬럼값과 value값은 동일한 타입여야한다.)

select coalesce(ui.gender, 'NA') as gender, 
coalesce(ui.age_band, 'NA') as age_band,
sum(oo.gmv) as gmv
from online_order oo 
left join user_info ui on oo.userid = ui.userid
group by 1,2
order by 1,2;

→ null값이 나오는 이유 : online_order의 userid값이 원래부터 null인 경우가 있기 때문

e) 내가 원하는 컬럼 추가해보기

select distinct case 
	when gender = 'M' then '남성' 
	when gender = 'F' then '여성' 
	else 'NA' 
	end as gender
from user_info ui
;

→ 엑셀의 if문과 같은 case when else end 구문 사용

f) 연령대 그룹 만들어보기 (20대, 30대, 40대)

select
case 	when ui.age_band = '20~24' then '20s'
		when ui.age_band = '25~29' then '20s'
		when ui.age_band = '30~34' then '30s'
		when ui.age_band = '35~39' then '30s'
		when ui.age_band = '40~44' then '40s'
		when ui.age_band = '45~49' then '40s'
		else 'NA'
		end as age_group
, sum(gmv) as gmv
from online_order oo 
left join user_info ui on oo.userid = ui.userid 
group by 1
order by 1
;

g) TOP3 카테고리와 그 외 상품의 매출액 비교하기

select c.cate1, sum(gmv) as gmv
from online_order oo 
join item i on oo.itemid = i.id 
join category c on i.category_id = c.id 
group by 1
order by 2 desc 
;

select 
case when cate1 in ('스커트', '티셔츠', '원피스') then 'TOP 3'
else '기타'
end as item_type
, sum(gmv) as gmv
from online_order oo 
join item i on oo.itemid = i.id 
join category c on i.category_id = c.id 
group by 1
order by 2 desc 
;

h) 특정 키워드가 담긴 상품과 그렇지 않은 상품의 매출 비교하기 (+item 개수도 같이 확인!) (고난이도)

select
case	when item_name like '%깜찍%' then '깜찍 컨셉'
		when item_name like '%시크%' then '시크 컨셉'
		when item_name like '%청순%' then '청순 컨셉'
		when item_name like '%기본%' then '기본'
		else '미분류'
		end as item_concept
, sum(gmv) as gmv 
from online_order oo 
join item i on oo.itemid = i.id 
group by 1
order by 2 desc 
;

→ 어떤 아이템이 우리 쇼핑몰에서 가장 핫할까?

→ '깜찍시크 ~' 이런 값을 가진 아이템도 생각하며

case when구문을 잘 생각해서 짜야한다

2. 날짜 관련 함수 활용하기

굉장히 중요하긴 한데, 오라클,포스트그리, 각 프로그램마다 다르다
외우는 것보다 어떠한 기능이 있는지를 배우고, 나중에 실무에 사용하는 프로그램에 따라 구글링하여 사용해보자

a) 오늘을 나타내는 기본 구문

select now();
select current_timestamp;

select current_date;

b) 날짜 형식에서 문자 형식으로 변환하기(암기)

select to_char(now(),'yyyymmdd'), to_char(now(),'yyyy-mm-dd'), to_char(now(), 'yyyy / mm / dd');

c) 날짜 더하기/빼기

select now() + interval '1 month';
select now() - interval '1 month';
select now() + interval '-1 month';
select now() + interval '-1 day';

→ interval과 year, month, day 사용

d) 날짜로부터 연도, 월, 주 확인하기

select date_part('month', now());

select date_part('day', now());

e) 최근 1년 동안의 매출액 확인하기

select *
from gmv_trend gt 
where cast(yyyy as varchar) || cast(mm as varchar)
		<= cast(date_part('year', now()-interval '1 year') as varchar)
		|| cast(date_part('month', now()-interval '1 month') as varchar)
order by 2,3
;

→ cast함수는 자료형을 바꿔주는 함수

3. 할인률, 판매가, 이익률 계산하기 (사칙연산)

select *,
discount / gmv as discount_rate
from online_order oo
;

→ 할인률이 다 '0'으로 나오는 이유는 정수 나누기 정수를 하였기 때문

select *,
cast(discount as numeric) / gmv as discount_rate,
gmv - discount as paid_amount,
cast(product_profit as numeric) / gmv as product_margin,
cast(total_profit as numeric) / gmv as total_margin
from online_order oo
;

→ cast 함수로 자료형을 바꿔준 후 정상적으로 나온다. (numeric은 실수 자료형)

select c.cate1 ,
cast(discount as numeric) / gmv as discount_rate,
gmv - discount as paid_amount,
cast(product_profit as numeric) / gmv as product_margin,
cast(total_profit as numeric) / gmv as total_margin
from online_order oo 
join item i on oo.itemid = i.id 
join category c on i.category_id = c.id
;

→ 상품테이블과 조인을 하고, 카테고리테이블과 조인하여 상세 데이터 확인

select c.cate1 ,
sum(cast(discount as numeric)) / sum(gmv) as discount_rate,
sum(gmv) - sum(discount) as paid_amount,
sum(cast(product_profit as numeric)) / sum(gmv) as product_margin,
sum(cast(total_profit as numeric)) / sum(gmv) as total_margin
from online_order oo 
join item i on oo.itemid = i.id 
join category c on i.category_id = c.id
group by 1
order by 3 desc 
;

→ 가장 로우레벨인 online_order를 기준으로 데이터가 보여지기 때문에 별로 의미가 없고, 이중에 cate1을 기준으로 그룹바이를 해보자

→ 그룹바이를 사용할때, 나눗셈에서 분자와 분모에 모두 집계함수를 씌워줘야 한다

select c.cate1 ,
round(sum(cast(discount as numeric)) / sum(gmv),2) * 100 || '%' as discount_rate,
sum(gmv) - sum(discount) as paid_amount,
round(sum(cast(product_profit as numeric)) / sum(gmv),2) * 100 || '%' as product_margin,
round(sum(cast(total_profit as numeric)) / sum(gmv)*100) || '%' as total_margin
from online_order oo 
join item i on oo.itemid = i.id 
join category c on i.category_id = c.id
group by 1
order by 3 desc 
;

→ round(실수, n): 소수점 이후 n+1번째에서 반올림하여 소수점 n번째까지 보여줌
→ postgreSQL외의 다른 SQL프로그램에서 숫자와 문자가 결합이 안될 때가 있다.
이럴땐, 문법지식을 활용여 cast(round(sum(cast(total_profit as numeric)) / sum(gmv)*100) as varchar) || '%' as total_margin 이런식으로 활용해보자

4. 고객 관점에서의 분석 (인당 평균 구매수량 / 인당 평균 구매금액)

예시)

--100명의 고객이 구매를 하였고, 총 판매수량이 200개

--인당 평균 구매수량 = 총 판매수량 / 총 고객 수

--인당 평균 구매금액 = 총 구매금액 / 총 고객 수

1) 인당 구매수량이 높은 상품은?

select i.item_name,
sum(unitsold) as unitsold,
count(distinct userid) as user_count,
round(sum(cast(unitsold as numeric)) / count(distinct userid),2) as avg_unitsold_per_customer,
round(sum(cast(gmv as numeric)) / count(distinct userid)) as avg_gmv_per_customer
from online_order oo 
join item i on oo.itemid = i.id 
group by 1
order by 4 desc
;

→ 동일한 유저가 물건을 여러개 구입한 경우에 단순히 count를 하게되면 각각 개별적으로 count하기 때문에 중복이 일어날 수 있다. (의미없는 데이터가 됨) ==> 따라서 distinct(중복배제)를 써준다!
→ 인당 구매수량이 높은 아이템 인사이트 도출해보기
ex) 이 상품은 매출액이 높진 않은데, 인당 구매율이 높다? 인사이트 도출

2) 인당 구매금액이 높은 성/연령대는? (인당 구매금액 관점에서 분석해보기)

select gender, age_band,
sum(gmv) as gmv,
count(distinct oo.userid) as user_count,
sum(gmv) / count(distinct oo.userid) as avg_gmv_per_customer
from online_order oo 
join user_info ui on oo.userid = ui.userid 
group by 1,2
order by 5 desc
;

→ left join을 생각할 수 있는데, online_order에 null값이 그리 많지 않다. 어차피 성/연령 정보로 그룹핑을 해야하고,

성/연령 정보가 없는 데이터는 이 케이스에서는 누락시켜도 상관이 없기 때문에 inner join을 써도 상관이없다.

'SQL' 카테고리의 다른 글

SQL 심화이론 (1)	2023.03.07
SQL 응용 문법 (0)	2023.03.07
SQL Join의 유형 / 주의사항 (0)	2023.03.02
SQL 실습 ② 주문 데이터 분석 (join 이용) (0)	2023.03.02
SQL 주의사항 (0)	2023.02.28

Attitude of learning