Natural Language Processing for Python¶
Embedding¶
- CharEmbedding
- PositionEmbedding
- WordEmbedding
Text classification¶
Available models¶
All of the following models include Dropout, Pooling, and Dense layers with hyperparameters tuned for reasonable performance across standard text classification tasks. If necessary, they are a good basis for further performance tuning (a minimal sketch of such a layer stack follows the list below).
- text_cnn
- text_rnn
- attention_rnn
- text_rcnn
- text_han
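The exact constructors for these models are not shown in this section, but the layer stack they share can be sketched directly in Keras. The snippet below is a minimal illustration in the spirit of text_cnn; the vocabulary size, sequence length, filter sizes, and dropout rate are placeholder assumptions, not the library's defaults.

```python
from tensorflow.keras import layers, models

def build_text_cnn(vocab_size=20000, max_len=200, embed_dim=300, num_classes=2):
    inputs = layers.Input(shape=(max_len,), dtype="int32")
    # Randomly initialized, trainable word embeddings.
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    # Parallel convolutions over 3-, 4-, and 5-token windows, each globally max-pooled.
    pooled = []
    for kernel_size in (3, 4, 5):
        conv = layers.Conv1D(128, kernel_size, activation="relu")(x)
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    x = layers.Concatenate()(pooled)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_text_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The parallel convolutions followed by global max pooling, dropout, and a dense classifier are the standard Kim-style CNN design usually meant by names like text_cnn; the recurrent and attention variants swap the convolutional block for RNN or attention layers over the same embedding and dense scaffolding.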
Examples¶
Choose a pre-trained word embedding by setting the embedding_type
and the corresponding embedding dimensions. Set embedding_type=None
to initialize the word embeddings randomly (but make sure to set trainable_embeddings=True
so you actually train the embeddings).
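The Keras snippet below is only a sketch of what the two settings correspond to underneath, not this library's API. The vocabulary size and the randomly generated stand-in matrix are assumptions for illustration; in practice the matrix would hold the chosen pre-trained vectors.

```python
import numpy as np
from tensorflow.keras import initializers, layers

vocab_size, embed_dim = 20000, 300

# Stand-in for a real pre-trained matrix (e.g. FastText or GloVe, 300 dimensions).
pretrained_matrix = np.random.rand(vocab_size, embed_dim)

# Pre-trained case: initialize from the matrix and freeze the layer
# (the equivalent of trainable_embeddings=False).
frozen_embedding = layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=initializers.Constant(pretrained_matrix),
    trainable=False,
)

# embedding_type=None case: random initialization, so the layer has to stay
# trainable (trainable_embeddings=True) for the vectors to actually be learned.
learned_embedding = layers.Embedding(vocab_size, embed_dim, trainable=True)
```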
FastText¶
Several pre-trained FastText embeddings are included. For now, we only have the word embeddings, not the n-gram features. All embeddings have 300 dimensions. The naming convention is illustrated in the snippet after the list below.
- English Vectors: e.g. fasttext.wn.1M.300d; check out all available embeddings
- Multilang Vectors: in the format fasttext.cc.LANG_CODE, e.g. fasttext.cc.en
- Wikipedia Vectors: in the format fasttext.wiki.LANG_CODE, e.g. fasttext.wiki.en
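The Multilang and Wikipedia identifiers follow the same fasttext.SOURCE.LANG_CODE pattern, so the name can be assembled from a language code. The helper below is a hypothetical illustration of that convention, not a function provided by the library.

```python
# Illustration of the naming convention only; this helper is not part of the library.
def fasttext_embedding_name(lang_code: str, source: str = "wiki") -> str:
    """Build identifiers such as 'fasttext.wiki.en' or 'fasttext.cc.ko' (all 300-dimensional)."""
    return f"fasttext.{source}.{lang_code}"

print(fasttext_embedding_name("en"))         # fasttext.wiki.en
print(fasttext_embedding_name("ko", "cc"))   # fasttext.cc.ko
```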
Dataset and Model¶
Reading Comprehension¶
Dataset¶
- HistoryQA: Joseon History Question Answering Dataset (SQuAD Style)
- KorQuAD: KorQuAD is a dataset built for Korean Machine Reading Comprehension. The answer to every question is a span within a paragraph of the corresponding Wikipedia article, and the dataset is constructed in the same way as the Stanford Question Answering Dataset (SQuAD) v1.0 (a sketch of this record format follows the list).
- SQuAD: Stanford Question Answering Dataset is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
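These datasets share the SQuAD-style layout, so a single record looks roughly like the sketch below (the title, id, and text are invented for illustration). The answer is stored as a text span together with its character offset into the context.

```python
# Hypothetical SQuAD v1.0-style record (values invented for illustration);
# KorQuAD and HistoryQA follow the same data -> paragraphs -> qas nesting.
squad_style_record = {
    "data": [{
        "title": "Gyeongbokgung",
        "paragraphs": [{
            "context": "Gyeongbokgung was the main royal palace of the Joseon dynasty, built in 1395.",
            "qas": [{
                "id": "example-0001",
                "question": "When was Gyeongbokgung built?",
                "answers": [{"text": "1395", "answer_start": 72}],
            }],
        }],
    }],
}

# answer_start is a character offset into the context: context[72:76] == "1395".
paragraph = squad_style_record["data"][0]["paragraphs"][0]
answer = paragraph["qas"][0]["answers"][0]
assert paragraph["context"][answer["answer_start"]:][:len(answer["text"])] == answer["text"]
```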
Model¶
- BiDAF: Bidirectional Attention Flow for Machine Comprehension + No Answer
- DrQA: Reading Wikipedia to Answer Open-Domain Questions
- DocQA: Simple and Effective Multi-Paragraph Reading Comprehension + No Answer
- QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Semantic Parsing¶
Dataset¶
- WikiSQL: A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL was released along with the paper Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning.
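As a rough illustration of what the pairs look like, the sketch below mimics the structure of a WikiSQL record, in which the SQL query is stored as indices into the table's columns and a small operator/aggregation vocabulary rather than as a raw string. The field names and values here are approximations for illustration, not taken from the dataset.

```python
# Illustrative record in the spirit of WikiSQL's JSON-lines format: a natural
# language question paired with a query expressed as column / aggregation /
# condition indices. Values are invented; see the dataset for the exact schema.
wikisql_style_record = {
    "table_id": "1-10015132-11",
    "question": "Which player attended Duke University?",
    "sql": {
        "sel": 0,                               # index of the selected column, e.g. "Player"
        "agg": 0,                               # 0 = no aggregation (others map to COUNT, MAX, ...)
        "conds": [[3, 0, "Duke University"]],   # (column index, operator index for '=', value)
    },
}
```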
Pretrained Vectors¶
- List on DataServer
English¶
Counter Fitting: Counter-fitting Word Vectors to Linguistic Constraints
- counter_fitted_glove.300d.txt
Cove: Learned in Translation: Contextualized Word Vectors (McCann et al. 2017)
- wmtlstm-b142a7f2.pth
fastText: Enriching Word Vectors with Subword Information
- fasttext.wiki.en.300d.txt
GloVe: GloVe: Global Vectors for Word Representation (a loading sketch for these .txt files follows the list)
- glove.6B.50d.txt
- glove.6B.100d.txt
- glove.6B.200d.txt
- glove.6B.300d.txt
- glove.840B.300d.txt
ELMo: Deep contextualized word representations
- elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5
- elmo_2x4096_512_2048cnn_2xhighway_options
Word2Vec: Distributed Representations of Words and Phrases and their Compositionality
- GoogleNews-vectors-negative300.txt
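The GloVe .txt files above use a simple plain-text layout: one token per line followed by its vector components separated by spaces. A minimal loader, assuming that layout and a placeholder path, might look like this:

```python
import numpy as np

def load_glove_txt(path):
    """Load GloVe-style vectors: one token per line followed by its float components."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors

# Placeholder path; point this at one of the files listed above.
embeddings = load_glove_txt("glove.6B.300d.txt")
print(embeddings["language"].shape)  # (300,)
```

Note that glove.840B.300d.txt is known to contain a few tokens with embedded spaces, so for that file a stricter parse that takes the last 300 fields of each line as the vector may be needed.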