datadave edited this page Apr 1, 2014 · 3 revisions

##What's the best way for me to set up Python on my Windows machine?

The short answer: install Anaconda for Linux inside a VirtualBox virtual machine running Ubuntu.

####Additional Notes:

#####Conda for Windows is an option, but in the long run a Linux environment is preferred

The bulk of what we'll be working on will be in Python, which you can install directly on Windows with Conda. That said, it will be helpful in the long run to have access to a Linux environment on your main PC, as open source data science and development software simply runs better there. Other than Tableau, the bulk of the software we use tends to favor Linux/Mac environments. We've also seen reports that Anaconda on Windows can be buggy.

#####Instant Python in the cloud

Head over to www.wakari.io and set up an account. Bam! Instant SciPy setup in 2 minutes. Knowing this is in place will help take the edge off if you end up having any installation issues on Windows. It's basically a full-fledged Python environment in the cloud. It is underpowered in the free version, but sufficient to get you coding for the first weeks of class. You can also access the Ubuntu terminal prompt from Wakari and run commands there.
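Whichever route you take (a local Anaconda install or a cloud environment), a quick sanity check can confirm that the scientific stack is importable. The following is a minimal sketch, not an official course script: the package list and the `check_packages` helper are assumptions made for illustration, so adjust them to whatever the class actually requires.

```python
import importlib.util
import sys

# Hypothetical package list (an assumption, not the official course requirements).
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Report the interpreter version, then one line per package.
    print("Python", sys.version.split()[0])
    for name, found in check_packages(PACKAGES).items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Run it from a terminal inside the environment you plan to use for class. Any package reported as `MISSING` can usually be installed with `conda install <package>`; note that the scikit-learn library is imported under the name `sklearn`.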

##How much of the coursework involves students doing independent thinking and problem creation/definition, as opposed to using pre-defined (or pre-cleansed) data sets to solve a well-defined problem?

####Pre-Defined Data Sets

Each lesson and topic is covered using pre-defined data sets in the lectures and labs, with the topics approximately following those listed in our latest syllabus: https://github.com/datadave/GADS9-NYC-Spring2014/wiki/Syllabus

####Independent Thinking and Problem Creation/Definition

  • The last quarter of the course consists of student work toward their final projects, which are expected to demonstrate "thorough understanding of statistical techniques, data management, and the application of these in programming". Part of the project includes a technical paper and presentation with the goal of clear communication to a professional, technical audience.

Although the course formally focuses on the projects only at the end, students are expected to be thinking about the projects throughout the course and applying skills learned towards these projects as we go. Many past students have used or published their projects professionally, and we highly encourage practical applications.

  • Additionally, each learning unit is composed of multiple lessons (e.g. "Machine Learning" encompasses K-Means, classification, logit, etc.) and involves one "Mini Project", for which students have more leeway to use either a given prepared data set or a data set of their choice. Regardless of the data source, the focus is on following their own methods and approaches; no 'cookbook' is provided for these Mini Projects (of which there will be 3 or 4).

##Are the problems and exercises drawn from specific sets of test data to teach a technique (e.g. training corpus for a Bayesian spam filter) or are students expected to bring their own problem sets?

As noted above, students bring their own problem sets for the final projects. For initial classroom learning, we use consistent data sets to ensure that all students have a complete example to fully understand (and, more importantly, to explain). We try to choose sets that provide a blend of both 'typical' and 'varied' challenges.

##Is the course more focused on mathematical/statistical techniques, software solutions or problem solving?

The luxury of a 12-week survey course is that we can focus on both techniques and approaches. We'll see how this semester pans out depending on the class's skills, but my estimate would be 40% techniques, 40% problem solving and 20% software. While we'll be using Python (scikit-learn, SciPy, etc.), the units, the concepts, and our approach are all tool-agnostic. Software and environments take some time to set up, and we've decided Python gives the best blend of breadth, increasing relevance, and ability to turn analysis into products.

##What is the focus of the software dev in the class - is it focused on doing the analysis, data cleansing to facilitate analysis, manipulation/aggregation, etc.?

It varies from class to class depending on the topic. Aggregation is the focus of the aggregation class, analysis of the regression class, and so on. Data cleansing... always ;-)

##It looks like the class covers many of the core machine learning techniques that a data scientist should have in their toolkit - how much time is spent on helping the student understand when and why to apply one set of techniques over another?

Each topic will cover rules of thumb for when its techniques should be applied. Our plan is also to focus on this in the last quarter of the class, when we introduce 'ensemble' learning and provide overviews of the techniques covered.
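To give a flavor of what 'ensemble' learning means, here is a minimal sketch of one simple ensemble idea, majority voting, in plain Python. It is an illustration only and not necessarily the method taught in class: the `majority_vote` helper and the three lists of model predictions are made up for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority vote.

    `predictions` is a list of equal-length lists, one list per model.
    """
    combined = []
    # zip(*predictions) groups the votes of all models for each sample.
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three models on four messages.
model_a = ["spam", "ham", "spam", "ham"]
model_b = ["spam", "spam", "spam", "ham"]
model_c = ["ham", "ham", "spam", "spam"]

print(majority_vote([model_a, model_b, model_c]))
# prints ['spam', 'ham', 'spam', 'ham']
```

Each model is wrong on a different sample here, so the combined vote can be more reliable than any single model. With an odd number of models and two labels there are no ties; with more labels or an even number of models, a tie-breaking rule would be needed.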