EPF

Abstract

Our project will be focus on obtain, treat and represent data from a web forum called “Englishforum”, in an automatized way. For that we will be using the tools learned during the lectures and other new ones. This web forum does the same paper than reddit but on Switterland. The website appeared around 2005 and has reach the number of 68,000 users, since then. With an average of 179,958 visits per day, this website has become an information exchange node.

Data

As any forum, a lot of information is well organized and can be easily obtain by following a reusable pattern. As an example: Topic --> threads --> (Post + Commentaries). Where: Data structure:

Thread: Name of the thread.
User: Creator of the thread.
Views: Number of views of that thread
Replies: Number of replies in that thread
Location: Location of the user
Date: Date of creation of the thread
Posts: All posts of each thread
User_posts:
Since: Date of registration of the user
Exp: Level of experience of the user.
Thanked: Number of times been thanked of each user
Groaned: Number of times been groaned of each user
Reputation: Level of reputation of the user.

The potential of this data are the questions that we can answer form it, here are some examples:

Are new users from Switzerland or others countries, what do they look for in this forum, which is the average of active members compared with new members. How do they evolve in the forum?
Are the most active members relevant I the forum, or just spammers.
Finding which are the most important topics in the forum, are they relevant topics for daily life?
Is this website growing or it is stuck and dying?
How the social events affect in the forum?
Some Text mining.

Feasibility & Risks

Feasibility

The forum structure is well known. We can obtain accurate informatio. Data sparse: after the analysis, the treated data can be interpreted in different ways.

Risks

Get stuck in the process of finding possible relations. Data sparse: Some relations between the data couldn’t have enough measures. Just focus on a way of seeing the data Abuse of visualization or not well organized.

##Deliverables

For our first aim, we will develop an automatically download system that follows a pattern as the one shown before. After achieving this, we will have a well-organized continuous pipeline of data. Please notice that at this phase of the project the analyze of the data structure will be already be done. We will be saving it on others several structures, so in the future we will have the possibility of create new relations in an easy way. This could approximate our deliverables:

Setting up the working environment (Apache spark) and first tries downloading the data and working with it.
Making a continuous pipeline of data.
Plotting the actual data and making more advanced ways of data treatment (Machine learning tools).
A phase focus on plotting in the best way all the analysis made before.

Name		Name	Last commit message	Last commit date
Latest commit History 82 Commits
.ipynb_checkpoints		.ipynb_checkpoints
csv		csv
old_csv		old_csv
wordcloud		wordcloud
.gitignore		.gitignore
ADA.odp		ADA.odp
English_forum.ipynb		English_forum.ipynb
English_forum_visual.ipynb		English_forum_visual.ipynb
README.md		README.md
ch-cantons.topojson.json		ch-cantons.topojson.json
map_2.html		map_2.html
map_2.png		map_2.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EPF

Abstract

Data

Feasibility & Risks

Feasibility

Risks

About

Releases

Packages

Languages

alvarogg777/EPF

Folders and files

Latest commit

History

Repository files navigation

EPF

Abstract

Data

Feasibility & Risks

Feasibility

Risks

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages