Topic Modeling, LDA 구현

이번 글에서는 말뭉치로부터 토픽을 추출하는 토픽모델링(Topic Modeling) 기법 가운데 하나인 잠재디리클레할당(Latent Dirichlet Allocation, LDA)을 파이썬 코드로 구현하는 법을 살펴보도록 하겠습니다. 이 글은 ‘밑바닥부터 시작하는 데이터 과학(조엘 그루스 지음, 인사이트 펴냄)’을 정리하였음을 먼저 밝힙니다. LDA 기법 자체에 대한 자세한 내용은 이곳을 참고하시면 좋을 것 같습니다. 그럼 시작하겠습니다.

수식

LDA 모델을 모두 정리하면 $d$번째 문서 $i$번째 단어의 토픽 $z_{d,i}$가 $j$번째에 할당될 확률은 다음과 같이 쓸 수 있습니다.

\[p({ z }_{d, i }=j|{ z }_{ -i },w)=\frac { { n }_{ d,k }+{ \alpha }_{ j } }{ \sum _{ i=1 }^{ K }{ ({ n }_{ d,i }+{ \alpha }_{ i }) } } \times \frac { { v }_{ k,{ w }_{ d,n } }+{ \beta }_{ { w }_{ d,n } } }{ \sum _{ j=1 }^{ V }{ { v }_{ k,j }+{ \beta }_{ j } } }=AB\]

위 수식의 표기를 정리한 표는 다음과 같습니다.

표기	내용
$n_{d,k}$	$k$번째 토픽에 할당된 $d$번째 문서의 단어 빈도
$v_{k,w_{d,n}}$	전체 말뭉치에서 $k$번째 토픽에 할당된 단어 $w_{d,n}$의 빈도
$w_{d,n}$	$d$번째 문서에 $n$번째로 등장한 단어
$α$	문서의 토픽 분포 생성을 위한 디리클레 분포 파라메터
$β$	토픽의 단어 분포 생성을 위한 디리클레 분포 파라메터
$K$	사용자가 지정하는 토픽 수
$V$	말뭉치에 등장하는 전체 단어 수
$A$	$d$번째 문서가 $k$번째 토픽과 맺고 있는 연관성 정도
$B$	$d$번째 문서의 $n$번째 단어($w_{d,n}$)가 $k$번째 토픽과 맺고 있는 연관성 정도

변수 선언

LDA 학습을 위해서는 변수를 먼저 선언해주어야 합니다. 다음과 같습니다.

from collections import Counter

# 각 토픽이 각 문서에 할당되는 횟수
# Counter로 구성된 리스트
# 각 Counter는 각 문서를 의미
document_topic_counts = [Counter() for _ in documents]

# 각 단어가 각 토픽에 할당되는 횟수
# Counter로 구성된 리스트
# 각 Counter는 각 토픽을 의미
topic_word_counts = [Counter() for _ in range(K)]

# 각 토픽에 할당되는 총 단어수
# 숫자로 구성된 리스트
# 각각의 숫자는 각 토픽을 의미함
topic_counts = [0 for _ in range(K)]

# 각 문서에 포함되는 총 단어수
# 숫자로 구성된 리스트
# 각각의 숫자는 각 문서를 의미함
document_lengths = map(len, documents)

# 단어 종류의 수
distinct_words = set(word for document in documents for word in document)
V = len(distinct_words)

# 총 문서의 수
D = len(documents)

코드 변수명이 조금 생소하실 것 같아서 원 논문의 notation과 비교한 표를 다음과 같이 만들었습니다.

원래 notation	code 변수명
$n_{d,k}$	document_topic_counts
$v_{k,w_{d,n}}$	topic_word_counts
$Σ_{i=1}^Kn_{d,i}$	document_lengths
$Σ_{j=1}^Vv_{k,j}$	topic_counts

예컨대 세번째 문서 가운데 토픽 1과 관련있는 단어수는 다음과 같습니다.

document_topic_counts[3][1]

nlp라는 단어가 토픽2와 연관지어 등장한 횟수는 다음과 같습니다.

topic_word_counts[2][‘nlp’]

새로운 topic 계산하기

$d$번째 문서 $i$번째 단어의 토픽 $z_{d,i}$가 $j$번째에 할당될 확률은 $A$와 $B$를 곱해 구합니다. 아래 코드에서 p_topic_given_document가 $A$, p_word_given_topic이 $B$입니다. topic_weight 함수는 이 둘을 곱한 값이라는 걸 알 수 있습니다.

def p_topic_given_document(topic, d, alpha=0.1):
    # 문서 d의 모든 단어 가운데 topic에 속하는
    # 단어의 비율 (alpha를 더해 smoothing)
    return ((document_topic_counts[d][topic] + alpha) /
            (document_lengths[d] + K * alpha))

def p_word_given_topic(word, topic, beta=0.1):
    # topic에 속한 단어 가운데 word의 비율
    # (beta를 더해 smoothing)
    return ((topic_word_counts[topic][word] + beta) /
            (topic_counts[topic] + V * beta))

def topic_weight(d, word, k):
    # 문서와 문서의 단어가 주어지면
    # k번째 토픽의 weight를 반환
    return p_word_given_topic(word, k) * p_topic_given_document(k, d)

$AB$를 구했으니 이를 바탕으로 샘플링을 하여 $z_{d,i}$에 새로운 topic을 할당할 수 있습니다. 그 코드는 다음과 같습니다.

def choose_new_topic(d, word):
    return sample_from([topic_weight(d, word, k) for k in range(K)])

import random
def sample_from(weights):
    # i를 weights[i] / sum(weights)
    # 확률로 반환
    total = sum(weights)
    # 0과 total 사이를 균일하게 선택
    rnd = total * random.random()
    # 아래 식을 만족하는 가장 작은 i를 반환
    # weights[0] + ... + weights[i] >= rnd
    for i, w in enumerate(weights):
        rnd -= w
        if rnd <= 0:
            return i

inference

다음과 같은 데이터가 주어졌다고 칩시다.

documents = [["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]]

우선 inference에 필요한 기초 데이터를 만듭니다. 토픽수 $K$ 등 하이퍼파라메터를 정하고, 각 단어를 임의의 토픽에 배정한 뒤 필요한 숫자를 세어봐야 합니다.

random.seed(0)

# topic 수 지정
K=4

# 각 단어를 임의의 토픽에 랜덤 배정
document_topics = [[random.randrange(K) for word in document]
                    for document in documents]

# 위와 같이 랜덤 초기화한 상태에서 
# AB를 구하는 데 필요한 숫자를 세어봄
for d in range(D):
    for word, topic in zip(documents[d], document_topics[d]):
        document_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

우리의 목표는 ‘토픽-단어’와 ‘문서-토픽’에 대한 결합확률분포(unknown)로부터 표본을 얻는 것이므로, 깁스샘플링을 수행하면 됩니다. iteration은 1000으로 설정했습니다.

for iter in range(1000):
    for d in range(D):
        for i, (word, topic) in enumerate(zip(documents[d],
                                              document_topics[d])):
            # 깁스 샘플링 수행을 위해
            # 샘플링 대상 word와 topic을 제외하고 세어봄
            document_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1
            document_lengths[d] -= 1

            # 깁스 샘플링 대상 word와 topic을 제외한 
            # 말뭉치 모든 word의 topic 정보를 토대로
            # 샘플링 대상 word의 새로운 topic을 선택
            new_topic = choose_new_topic(d, word)
            document_topics[d][i] = new_topic

            # 샘플링 대상 word의 새로운 topic을 반영해 
            # 말뭉치 정보 업데이트
            document_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1
            document_lengths[d] += 1

파일럿 실험 결과

inference 결과 첫번째 문서의 토픽 비중은 다음과 같습니다. 전체 7개 단어 가운데 0번째 토픽과 관련된 단어가 4개 1번째 토픽 단어가 3개입니다. 따라서 이 문서는 첫번째 토픽일 확률이 가장 높네요.

>> document_topic_counts[0]

Counter({0: 4, 2: 3, 1: 0, 3: 0})

첫번째 토픽의 단어 비중은 다음과 같습니다. Java, Big Data, Hadoop, deep learning 등 단어의 빈도(비중)가 높네요. 이 토픽은 대략 ‘Big Data’에 해당하는 주제인 것 같다는 느낌이 듭니다.

>> topic_word_counts[0]

Counter({‘Java’: 3, ‘Big Data’: 3, ‘Hadoop’: 2, ‘deep learning’: 2, ‘artificial intelligence’: 2, ‘C++’: 2, ‘neural networks’: 1, ‘Storm’: 1, ‘programming languages’: 1, ‘MapReduce’: 1, ‘Haskell’: 1, ‘probability’: 0, ‘Mahout’: 0, ‘NoSQL’: 0, ‘MySQL’: 0, ‘regression’: 0, ‘statistics’: 0, ‘Postgres’: 0, ‘Python’: 0, ‘mathematics’: 0, ‘Spark’: 0, ‘numpy’: 0, ‘pandas’: 0, ‘theory’: 0, ‘libsvm’: 0, ‘scipy’: 0, ‘R’: 0, ‘HBase’: 0, ‘decision trees’: 0, ‘MongoDB’: 0, ‘scikit-learn’: 0, ‘machine learning’: 0, ‘databases’: 0, ‘statsmodels’: 0, ‘support vector machines’: 0, ‘Cassandra’: 0})

for textmining

Topic Modeling, LDA 구현

09 Jul 2017 | LDA

수식

변수 선언

새로운 topic 계산하기

inference

파일럿 실험 결과

Comments

for textmining

Topic Modeling, LDA 구현

09 Jul 2017 | LDA

수식

변수 선언

새로운 topic 계산하기

inference

파일럿 실험 결과

Related Posts

Topic Modeling, LDA 01 Jun 2017

선형판별분석(Linear Discriminant Analysis) 21 Mar 2017

Probabilistic Latent Semantic Analysis 25 May 2017

Word Weighting(1) 28 Mar 2017

문서 유사도 측정 20 Apr 2017

idea of statistical semantics 10 Mar 2017

Comments