Information Retrieval and Text Mining project
- News Insight
- News Classification
- Text Regression
https://www.youtube.com/watch?v=9PFZ0_C2Sxo&feature=share
- Collect fake-news datasets from various sources
- What kind of information can be mined from fake vs. real news?
- What methods can be used to analyze or compare them?
- What features does fake news have compared to real news?
- How do we extract those features or keywords?
- Sentiment words / sentiment analysis that may be useful
- Use part-of-speech to find the words commonly used in fake vs. real news, e.g. word clouds
- Semantic analysis
- Fake news classification and rating
- Characteristics; alerting users
https://docs.google.com/document/d/10-7H9bPJYQRMdOUdugDlWeifdpvoN9twXZGT-m1fhdc/edit?usp=sharing
https://docs.google.com/document/d/1I9SWihDkgXx1NCYCsY-0e_XDicAK346PqQu5wMaesd0/edit
https://docs.google.com/presentation/d/1lRDR40UfcLpdRUSnfMbi6eOsR_jjxFdOcKwa8HvxHh8/edit#slide=id.p
Outline:
- Motivation
- What we are doing
- solution insight
- solution regression / classification
Motivation: why do this? Because fake news is rampant, it influences audiences, and it is used to steer election opinion.
- Degree of fakeness
- Differences between real and fake news
- Categories of (fake) news
- Compare the performance of different methods
Solution
- TF-IDF for term weighting / tagging (a combined feature sketch follows this list)
- POS (part-of-speech tagging), e.g. OpenNLP, NLTK => (a) which words appear most often for each POS in each dataset; (b) which words to include for each POS in a dictionary built over the overall dataset
- Sentiment analysis, e.g. TextBlob
- Feature selection: keywords, class-discriminative power
- Usefulness of author and source; differences across categories
- Regression (ML and DL methods) / classification (IR methods)
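A minimal Python sketch of the feature side of this solution, assuming NLTK for POS tagging, TextBlob for sentiment, and scikit-learn's TfidfVectorizer for term weighting; the sample documents and the helper name pos_and_sentiment_features are illustrative only, not part of any dataset.

```python
from collections import Counter

import nltk
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_and_sentiment_features(text):
    """Count coarse POS tags and attach TextBlob polarity/subjectivity."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    counts = Counter(tag[:2] for _, tag in tagged)    # NN*, VB*, JJ*, RB* -> NN, VB, JJ, RB
    blob = TextBlob(text)
    return {
        "nouns": counts["NN"], "verbs": counts["VB"],
        "adjs": counts["JJ"], "advs": counts["RB"],
        "polarity": blob.sentiment.polarity,          # in [-1, 1]
        "subjectivity": blob.sentiment.subjectivity,  # in [0, 1]
    }

docs = ["BREAKING: shocking truth they do not want you to see!",
        "The committee released its annual budget report on Tuesday."]

tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_tfidf = tfidf.fit_transform(docs)                    # TF-IDF term features
extra = [pos_and_sentiment_features(d) for d in docs]  # POS counts + sentiment per document
print(X_tfidf.shape, extra)
```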
GOAL
- At the same dictionary size: what score do we get with no POS filtering vs. with a POS-specific dictionary? e.g. accuracy with a noun-only dictionary vs. a verb-only dictionary (see the sketch after this list)
- The insights found earlier should connect to the dictionaries produced at the end
- Fakeness degree and classification: each task's dataset serves as the other's testing set
- Split the timeline into three or five slices (before the election, within a week of the election, after the election) and track changes in topics, wording, and sentiment
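A hypothetical sketch of the same-dictionary-size comparison: an unrestricted vocabulary vs. a noun-only vocabulary of the same size, scored with the same classifier. The placeholder corpus and the choice of logistic regression are assumptions; swap in the merged dataset and the final model.

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def keep_nouns(text):
    """Tokenizer that keeps only noun tokens (NN*), giving a POS-filtered dictionary."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [w.lower() for w, t in tagged if t.startswith("NN")]

# placeholder corpus standing in for the merged dataset (0 = fake, 1 = real)
texts = ["shocking miracle cure they are hiding from you",
         "the senate passed the appropriations bill on tuesday",
         "you will not believe what this celebrity said",
         "the central bank kept interest rates unchanged"] * 5
labels = [0, 1, 0, 1] * 5

DICT_SIZE = 2000
for name, vec in [
    ("all tokens", TfidfVectorizer(max_features=DICT_SIZE)),
    ("nouns only", TfidfVectorizer(max_features=DICT_SIZE, tokenizer=keep_nouns)),
]:
    X = vec.fit_transform(texts)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```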
Dataset
- Split into ten categories (eight classes from the second dataset, two from the first): the third dataset's true and mostly-true go into the first dataset's True class; its barely-true, false, and pants-fire go into the first dataset's False class
- Strip punctuation and digits, lowercase everything, and keep only content (the longest attribute) and label (degree of fakeness or category); a preprocessing sketch follows this block
Merged dataset (text and label from the three datasets): https://drive.google.com/drive/u/2/folders/19CER5SrMU29n3UPAkQc2hPu3HA8vyqbc
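A sketch of the preprocessing above, assuming the raw files have already been read into (content, label) pairs; the row format and the mapping-table name are assumptions, not part of any dataset file.

```python
import re

# The third dataset's labels folded into the first dataset's True/False classes, as described above.
LIAR_TO_BINARY = {
    "true": 1, "mostly-true": 1,
    "barely-true": 0, "false": 0, "pants-fire": 0,
}

def clean_text(text):
    """Lowercase, strip punctuation and digits, and collapse runs of whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())

def preprocess(rows):
    """rows: iterable of (content, label) pairs; keep only cleaned content and a mapped label."""
    out = []
    for content, label in rows:
        label = LIAR_TO_BINARY.get(label, label)   # pass through labels that are already binary
        out.append((clean_text(content), label))
    return out

print(preprocess([("Pants on Fire! It was 100% FAKE.", "pants-fire")]))
# [('pants on fire it was fake', 0)]
```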
Method
Currently only the news content is used
- POS for each of the ten categories and for the overall dataset: https://drive.google.com/drive/folders/1C-6U9TcyUwgxzdArvAXPsnjx9yrPhxsh?usp=sharing
- Bar charts of sentiment analysis for the ten categories. Literature review: papers on the POS, sentiment, feature-selection, classification, and regression toolkits
- Word clouds and frequency plots for the ten categories => also build an overall version, filtering out the terms common across categories
- Three kinds of feature selection; TF-IDF for building the overall dictionary (a chi-square sketch follows this block)
- The 'bs' category carries little meaningful signal
Testing on Kaggle: https://www.kaggle.com/c/fake-news/submit
(to evaluate classification and regression quality)
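One possible instance of the feature-selection step: chi-square scores over TF-IDF features to surface the most class-discriminative terms for the overall dictionary. The notes list three kinds of feature selection without naming them, so chi-square here is an assumption, and the documents are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# placeholder documents and labels (0 = fake, 1 = real)
texts = ["you will not believe this shocking miracle cure",
         "the senate passed the appropriations bill today",
         "shocking secret the elites are hiding from you",
         "the ministry published its quarterly trade figures"]
labels = [0, 1, 0, 1]

vec = TfidfVectorizer(max_features=5000, stop_words="english")
X = vec.fit_transform(texts)

scores, _ = chi2(X, labels)                   # chi-square score of each term against the label
terms = np.array(vec.get_feature_names_out())
top = terms[np.argsort(scores)[::-1][:10]]    # the 10 most class-discriminative terms
print(top)                                    # candidate entries for the overall dictionary
```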
- https://www.kaggle.com/c/fake-news/data (title, author, text, true/false; crawled news articles)
- https://github.com/KaiDMML/FakeNewsNet/tree/master/Data (news source, headline, image, body_text, publish_data, etc.; contains both real and fake news; crawled news)
- https://www.kaggle.com/mrisdal/fake-news (fields: uuid - unique identifier, ord_in_thread, author - author of story, published - date published, title - title of the story, text - text of story, language - data from webhose.io, crawled - date the story was archived, site_url - site URL from BS detector, country - data from webhose.io, domain_rank - data from webhose.io, thread_title, spam_score - data from webhose.io, main_img_url - image from story, replies_count - number of replies, participants_count - number of participants, likes - number of Facebook likes, comments - number of Facebook comments, shares - number of Facebook shares, type - type of website (label from BS detector)) https://github.com/bs-detector/bs-detector
- https://github.com/GeorgeMcIntire/fake_real_news_dataset (csv file containing thousands of articles tagged as either real or fake)
- https://www.cs.ucsb.edu/~william/data/liar_dataset.zip (graded fakeness labels; UCSB) (statement, speaker, context, label, src)
- https://www.kaggle.com/jruvika/fake-news-detection (URLs, Headline, Body, Label (T/F))
- https://www.kaggle.com/c/fake-news-pair-classification-challenge/data (fake news classification)
- https://github.com/JasonKessler/fakeout (a complete project)
- datasets: https://data.world/datasets/fake-news , https://github.com/sumeetkr/AwesomeFakeNews
- preprocess ref: https://www.kaggle.com/rchitic17/fake-news , https://www.kaggle.com/michaleczuszek/fake-news-analysis
- https://www.ithome.com.tw/news/127214?fbclid=IwAR0oKz7wm0Ub0Kb5FDh9HAvjKX5tgidTtZrFRSY_kVsgQrue5_-K-5iSC-o
- https://www.ithome.com.tw/news/127201?fbclid=IwAR3_vIk3Pdvsem1d_uAWyaiZHUj8C51JLzene9jYOtc50KL31xgEHiHYfLQ
- Help users judge whether news is real or fake
- Learn fake-news patterns, wording characteristics, and article features
- News classification
- Words commonly used in real vs. fake news
- Crawling insight ( https://shift.newco.co/2016/11/09/What-I-Discovered-About-Trump-and-Clinton-From-Analyzing-4-Million-Facebook-Posts/ )
- Analysis ( https://towardsdatascience.com/i-trained-fake-news-detection-ai-with-95-accuracy-and-almost-went-crazy-d10589aa57c , http://nbviewer.jupyter.org/github/JasonKessler/fakeout/blob/master/Fake%20News%20Analysis.ipynb )
- Topic reference 1: http://www.im.ntu.edu.tw/~paton/courses.htm
- Topic reference 2: https://mega.nz/#!xwdEgAjb!FAVoAznYD7bE5rsoXc7isRJUlAbF0m8mamYe2RiCwMM
- Topic reference 3: https://mega.nz/#!UlNmXQIS!7dZhNx0Cy9-VyjlEI5GUO5zjIgYNJoe9dUAPaCNcowA
- Word clouds: https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis (a small word-cloud sketch follows this list)
- TextBlob sentiment analysis: https://nlp.stanford.edu/courses/cs224n/2009/fp/24.pdf (uses the NLTK movie_reviews corpus as training data) (https://stackoverflow.com/questions/34518570/how-are-sentiment-analysis-computed-in-blob/34519114#34519114)
- NLTK POS tagging (pos_tag): https://explosion.ai/blog/part-of-speech-pos-tagger-in-python (greedy averaged perceptron tagger?) (training data: Sections 00-18 of the Wall Street Journal portion of OntoNotes 5) (https://stackoverflow.com/questions/32016545/how-does-nltk-pos-tag-work)
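A small sketch of the word-cloud step referenced above, assuming the third-party wordcloud package (any term-frequency plot would serve equally well); the two documents are placeholders for the fake-news subset.

```python
from collections import Counter

from wordcloud import WordCloud   # third-party package: pip install wordcloud

# placeholder documents standing in for the fake-news subset
fake_texts = ["shocking miracle cure they hid from you",
              "you will not believe this shocking secret"]

freq = Counter(word for doc in fake_texts for word in doc.split())

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(freq)       # build the cloud from raw term frequencies
wc.to_file("fake_news_wordcloud.png")    # write it out as an image
```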
Datasets for sentiment analysis are available online.
The following is a list of a few open source sentiment analysis tools.
- GATE plugins
- SEAS(gsi-upm/SEAS)
- SAGA(gsi-upm/SAGA)
- Stanford Sentiment Analysis Module (Deeply Moving: Deep Learning for Sentiment Analysis)
- LingPipe (Sentiment Analysis Tutorial)
- TextBlob (Tutorial: Quickstart)
- Opinion Finder (OpinionFinder | MPQA)
- Clips pattern.en (pattern.en | CLiPS)
Open Source Dictionary or resources:
- SentiWordNet
- Bing Liu Datasets (Opinion Mining, Sentiment Analysis, Opinion Extraction)
- General Inquirer Dataset (General Inquirer Categories)
- MPQA Opinion Corpus (MPQA Resources)
- WordNet-Affect (WordNet Domains)
- SenticNet
- Emoji Sentiment Ranking
Direction: text classification or degree regression
Text classification (a toy TF-IDF + SVM sketch follows the paper list below)
- A novel text mining approach based on TF-IDF and Support Vector Machine for news classification https://ieeexplore.ieee.org/abstract/document/7569223
- TEXT CLASSIFICATION USING NAÏVE BAYES, VSM AND POS TAGGER https://pdfs.semanticscholar.org/43d0/0d394ff76c0a5426c37fe072038ac7ec7627.pdf
- Text categorization with Support Vector Machines: Learning with many relevant features https://link.springer.com/content/pdf/10.1007%2FBFb0026683.pdf
- Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles: http://snap.stanford.edu/mis2/files/MIS2_paper_2.pdf
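A toy sketch in the spirit of the TF-IDF + SVM approach cited above, not a reproduction of any of the papers; the documents and labels are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# placeholder documents and labels (0 = fake, 1 = real)
texts = ["shocking secret cure the media is hiding",
         "parliament approved the trade agreement yesterday",
         "click here to see what really happened",
         "the central bank left interest rates unchanged"]
labels = [0, 1, 0, 1]

clf = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LinearSVC(),
)
print(cross_val_score(clf, texts, labels, cv=2).mean())   # cross-validated accuracy
```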