Data Sharks
This report starts from the point at which we had finished watching all the tutorials and acquired the knowledge needed to begin the task we were assigned.
To begin with, we needed data to check the veracity of the hypothesis. After searching for some time, we found several trustworthy database organizations, but many of them offered an overwhelming amount of information. Due to hardware limitations, we concluded that not all indicators could be selected, so only around 30 of them would make it into the final analysis. Moreover, choosing the indicators was difficult because many of them had a lot of missing information, which forced us to inspect each table manually.
When the selection of databases was completed, we tried importing each database through Python code using its URL; however, not all databases had a direct link. For this reason, downloading the .csv files and bundling them into a .zip was a simpler approach. The first battle with the data was a really difficult process: each file had a different column layout, many columns came in different dtypes, and some rows had missing or blank values. A specific treatment was needed for each database provider, so all the files were named using the same convention: 'NameOfTheProvider - Indicator Name'. A long time went into writing two large methods, 'preprocess' and 'rename_value_column', which read the name of the .csv and apply a specific treatment depending on where the file was downloaded from.
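As a rough illustration of this filename-driven dispatch, a minimal sketch is shown below; the provider names, column layouts and exact signatures are assumptions for the example, not the project's actual code.

```python
import pandas as pd

def preprocess(csv_path: str) -> pd.DataFrame:
    """Read a 'Provider - Indicator Name.csv' file and normalize it to
    (country, year, indicator) columns."""
    filename = csv_path.rsplit("/", 1)[-1].removesuffix(".csv")
    provider, indicator = filename.split(" - ", 1)
    df = pd.read_csv(csv_path)

    if provider == "WorldBank":
        # Hypothetical wide format: one column per year.
        df = df.melt(id_vars=["Country Name"], var_name="year", value_name="value")
        df = df.rename(columns={"Country Name": "country"})
    elif provider == "OurWorldInData":
        # Hypothetical long format with different column names.
        df = df.rename(columns={"Entity": "country", "Year": "year", df.columns[-1]: "value"})
    else:
        raise ValueError(f"No treatment defined for provider '{provider}'")

    df["year"] = pd.to_numeric(df["year"], errors="coerce")
    return rename_value_column(df, indicator)

def rename_value_column(df: pd.DataFrame, indicator: str) -> pd.DataFrame:
    """Rename the generic 'value' column to the indicator's own name."""
    return df.rename(columns={"value": indicator})[["country", "year", indicator]]
```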
After a few issues, we managed to make the merge method work properly, resulting in a comprehensive database of around 70,000 rows and 34 columns, including those for the country and year.
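The merge itself can be pictured as an outer join of all the per-indicator tables on country and year; a minimal sketch, assuming each table already has the shape produced by the preprocessing step above:

```python
from functools import reduce
import pandas as pd

def merge_indicators(tables: list[pd.DataFrame]) -> pd.DataFrame:
    """Outer-merge per-indicator tables so missing values stay as NaN."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=["country", "year"], how="outer"),
        tables,
    )
```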
Once all of the data was placed into one dataframe, we could refine the integration, identify the total number of missing values for each country or period of years, and narrow our study to a sample with enough data.
In a first approach, we decided to ignore the entries older than 1990 and drop the countries with 15 or fewer indicators to study. This way, we reduced the size of our dataframe without compromising the quality of the data. This trimming process will be refined in the near future, as we finish the main part of the code and are able to focus more on the subject of study.
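A sketch of such a trimming step, assuming a wide dataframe with 'country', 'year' and one column per indicator (the helper itself is hypothetical, only the thresholds come from the text):

```python
def trim(df, min_year=1990, min_indicators=15):
    """Keep only recent entries and countries with enough indicators."""
    df = df[df["year"] >= min_year]
    indicator_cols = df.columns.difference(["country", "year"])
    # Count, per country, how many indicator columns have at least one value.
    counts = df.groupby("country")[indicator_cols].apply(lambda g: g.notna().any().sum())
    keep = counts[counts > min_indicators].index
    return df[df["country"].isin(keep)]
```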
At the same time, a model definition diagram was created; it still needs some improvements because the Python code is still in progress. The structure and size of the code made Jupyter Notebook no longer suitable for development, so we migrated our code to Visual Studio and divided methods, functions and classes into different modules.
This change allowed for the creation of a Poetry file and more agile sharing via GitHub, but it required considerable effort to adapt the code to the new environment. We detected that the code could be further restructured to make it more readable and tidy in this new environment, but that had to be postponed; instead, we focused on making the code work properly again.
Finally, we have spent some time planning the next steps in the treatment of the data. The null values are the first sensible target, and we decided to fill the blanks with values estimated through interpolation between the closest non-null values.
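With pandas, this gap-filling idea boils down to a per-country linear interpolation over the year axis; a minimal sketch, with column names as assumptions:

```python
def fill_gaps(df, indicator_cols):
    """Linearly interpolate each indicator within each country over time."""
    df = df.sort_values(["country", "year"])
    df[indicator_cols] = (
        df.groupby("country")[indicator_cols]
          .transform(lambda s: s.interpolate(method="linear"))
    )
    return df
```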
The second stage involved dealing with outliers. First of all, we had to identify which values were just anomalies due to explainable circumstances and which ones were true outliers. This process was done manually at first, and later with the help of a primitive algorithm that roughly estimates whether a value is part of a trend. Those that were not had to be treated, and we decided to apply the same criteria as with the null values and use interpolation methods.
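One way to picture the "is this value part of the trend?" check is to flag points that stray too far from a rolling median and then re-interpolate them; the window size and threshold below are illustrative choices, not the project's actual parameters:

```python
def smooth_outliers(series, window=5, threshold=3.0):
    """Replace points far from the local trend with interpolated values."""
    trend = series.rolling(window, center=True, min_periods=1).median()
    deviation = (series - trend).abs()
    scale = deviation.median() or 1.0   # fall back to 1.0 if the median deviation is zero
    outliers = deviation > threshold * scale
    return series.mask(outliers).interpolate(method="linear")
```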
This second week has been a really big challenge in terms of normalizing the data and structuring the code properly; however, we think huge progress has been made, and we have learned a lot about coding and about the best way to work as a team.
This week's main focus was to tidy up the code, design useful test cases and rearrange the code so it is more readable and easier to maintain. We also started to work with thresholds for the countries and indicators, and we defined a year range, first from 1990 to 2020 and later from 2000 to 2020. We then realized that we might need more sources of information, as we had to cut off a few indicators because there was not enough data available for the scope of our analysis.
During the restructuring of the code, a few bugs were found that, even though they did not affect the results, could make the code fail if the way the methods were used changed slightly.
Finally, we modified some constants and arguments in the code so it can treat and preprocess any other database we decide to use in the future, with only a few changes needed depending on how different it is.
In summary, we invested a considerable amount of time in making the code more robust, versatile and intuitive to use, all while still filtering which information was useful and which should be discarded.
With the code properly tuned, we started to write the notebook that would separate the countries by regions, while preparing the first batch of charts and tests for the newly formed groups.
We decided to create a .csv file for every region with all the countries it contained, but first we needed to define which countries belonged to which region. After searching for different ways of dividing them, we finally downloaded a small .csv from the World Bank that grouped the countries not just by continent, but also by more socio-economic criteria.
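The split into per-region files can be sketched as a simple groupby over that mapping; the filename and the 'Country'/'Region' column names below are assumptions:

```python
import pandas as pd

regions = pd.read_csv("world_bank_regions.csv")   # hypothetical filename
for region, group in regions.groupby("Region"):
    group[["Country"]].to_csv(f"regions/{region}.csv", index=False)
```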
We also ran a few tests with some clustering algorithms so we could group the countries based on more empirical evidence. Even though we managed to obtain satisfactory results, we decided to postpone the clustering analysis so we could first refine our study methods and have more time to read about the options that clustering algorithms offer to improve the results.
This week, we started to work with widgets that provide a more interactive and visual way of representing the information obtained from our study, such as showing the most correlated indicators based on a user-defined threshold of what counts as a significant correlation value.
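A minimal sketch of that threshold widget using ipywidgets; the dataframe name and the GDP target column are assumptions for the example:

```python
import ipywidgets as widgets
from IPython.display import display

def show_correlated(threshold=0.7):
    # Correlation of every indicator against GDP, filtered by the chosen threshold.
    corr = indicators_df.corr(numeric_only=True)["GDP"].drop("GDP")
    display(corr[corr.abs() >= threshold].sort_values(ascending=False))

widgets.interact(
    show_correlated,
    threshold=widgets.FloatSlider(min=0.0, max=1.0, step=0.05, value=0.7),
)
```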
On the other hand, we coded an aggregation algorithm that allowed us to infer the values of whole regions, or even the world, based on a sum of the absolute indicators and a weighted average of the relative ones. We decided to use the population of each country as the weight for this average, but any other absolute indicator, such as GDP, can also be used with a simple modification to the code.
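The aggregation rule can be sketched as follows; the column lists and the 'population' column name are assumptions, not the project's exact interface:

```python
import numpy as np

def aggregate_region(df, absolute_cols, relative_cols, weight_col="population"):
    """Sum absolute indicators; average relative ones weighted by population."""
    out = {col: df[col].sum() for col in absolute_cols}
    for col in relative_cols:
        valid = df[[col, weight_col]].dropna()
        out[col] = np.average(valid[col], weights=valid[weight_col])
    return out
```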
The last week before the demo was loaded with work, from creating new ways of representing the data to discovering bugs in the code that needed to be solved. After concluding that the aggregation weighted by population was not a good hypothesis, we ended up programming a simpler approach based on the median. Developing all these features allowed us to give a final answer in the demo about which indicators are the most correlated with GDP.
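The median-based replacement reduces to a plain groupby; assuming a long dataframe with a 'region' column, something like:

```python
# Median across each region's countries, per year and indicator.
regional_values = df.groupby(["region", "year"]).median(numeric_only=True)
```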
In the days before, we made sure that the demo presentation was up to expectations, avoiding the flaws of the first one, this time with a more formal approach and showing a lot of results.
After a great demo the week before, we started this one with motivation. In this period the clustering was started, and it gave us a lot of trouble because it was our first time experimenting with it. At first, there were problems deciding which algorithm to choose for the dimensionality reduction and which one for the clustering; in the end, t-SNE and Affinity Propagation were chosen.
The results of the clustering were not very promising: a lot of meaningless groups were generated, caused by having too many indicators grouped together. For this reason, we classified the indicators into three groups: equality, socio-demographics and economics. This perspective allowed for a better clustering, far more logical than the outcomes we had before.
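A hedged sketch of the t-SNE plus Affinity Propagation pipeline applied to one of the three indicator groups; the preprocessing details and parameter values are assumptions:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import AffinityPropagation

def cluster_countries(features):
    """features: countries x indicators matrix for one group, already imputed."""
    scaled = StandardScaler().fit_transform(features)
    embedded = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(scaled)
    labels = AffinityPropagation(random_state=0).fit_predict(embedded)
    return embedded, labels
```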
We added more indicators to the database to have more results; this task took some time because many of the dataframes had a lot of holes. In the eighth week, we optimized the clustering notebook to have a cleaner look and an easier way to change the parameters.
We started the time series tasks with a lot of difficulties; at first we had problems deciding what we really needed and where we were heading. After receiving some guidance from our mentor, we started an intense week of programming; this has been the hardest task we have been assigned. We made some calls during the sprint to consult Luis and check whether we were heading in the right direction or were lost.
After some more struggles with the time series task, we ended up finishing it. The main blocking points in this section were interacting with widgets in a more complex way and preparing the databases to be read. At first we thought this task would be easier, but preparing the display tables and all the stylers turned out to be really difficult. There are some details left to polish before the demo, but we are very close to finishing all the tasks in the assigned time.
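As a small illustration of the kind of styled display table involved, a pandas Styler can colour-code a table of correlation values; the dataframe name below is an assumption:

```python
# Colour-code a correlation table from -1 to 1 and round the displayed values.
styled = (
    correlations_df.style
    .background_gradient(cmap="RdYlGn", vmin=-1, vmax=1)
    .format("{:.2f}")
)
styled  # rendered as an HTML table in the notebook
```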
The end of the project is getting closer and we have been polishing all the code. The dashboard has been started, and we are having a difficult time deciding which technology to use to display all the data. We have spent some days searching for a solution. This week we also prepared for the second demo of the month.
This is the last entry of the diary. The dashboard has been a difficult task, because it had a lot of problems and bugs to solve. Moreover, in parallel, we have been watching some Spark tutorials to at least get an idea of the possibilities it opens up, as it becomes really necessary when working with bigger dataframes. As the final demo approaches, we are giving the final touches to the dashboard and preparing everything to show it at its best.