Khan Academy Sushi Chef

Content integration script for the Khan Academy channels from https://khanacademy.org.

Step 0: Installation

  • Install pip if you don't have it already.
  • Install Python3 if you don't have it already
  • Install Git if you don't have it already
  • Open a terminal
  • Clone this repo, cd into it
  • Create a Python3 virtual env by running virtualenv -p python3 venv, then activate it by running source venv/bin/activate
  • Run pip install -r requirements.txt

Step 1: Obtaining an Authorization Token

You will need an authorization token to create a channel on Kolibri Studio. In order to obtain one:

  1. Create an account on Kolibri Studio.
  2. Navigate to the Tokens tab under your Settings page.
  3. Copy the given authorization token (you will need this for later).

Step 2: Running the chef

Running the KA sushi chef script takes two steps: loading some environment variables, then running a single command:

source venv/bin/activate

source credentials/proxy_list.env
source credentials/crowdinkeys.env

./sushichef.py --reset --token=<token> --thumbnails lang=<lang_code>

You'll need to replace <token> with the Studio authorization token obtained earlier and <lang_code> with a le_utils code for the channel (e.g. en, es, pt-BR, etc.). When running the KA chef command on a remote server, use nohup ... & so that the long-running chef process does not exit when you "hang up" the ssh session.

Implementation Overview

We use the Khan Academy TSV export data to build the content tree for a given language, which includes topics, videos, and exercises. We map each of these content kinds to our KhanNode objects, which then get mapped to the ricecooker data structures:

KA TSV exports --(A)--> KhanNode tree --(B)--> ContentNode tree

During processing step (A), slug-based filtering is applied to skip certain nodes (due to technical or licensing limitations). During processing step (B), the topic tree returned by the KA API is restructured to take advantage of Localized Topic Trees (LTTs); see SLUG_BLACKLIST and TOPIC_TREE_REPLACMENTS_PER_LANG in curation.py.
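
The slug-based filtering in step (A) can be pictured with the minimal sketch below. It is not the chef's actual code: the Node class, the filter_by_slug function, and all slugs are hypothetical stand-ins, and the real KhanNode objects and SLUG_BLACKLIST in curation.py may be structured differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical stand-in for a KhanNode; the real class has more fields."""
    slug: str
    kind: str  # e.g. "Topic", "Video", "Exercise"
    children: List["Node"] = field(default_factory=list)


def filter_by_slug(node: Node, blacklist: Set[str]) -> Optional[Node]:
    """Return a copy of the tree with blacklisted nodes removed."""
    if node.slug in blacklist:
        return None  # skipping a node also skips its whole subtree
    kept = [filter_by_slug(child, blacklist) for child in node.children]
    return Node(node.slug, node.kind, [c for c in kept if c is not None])


# Illustrative data only; these slugs are made up for the example.
tree = Node("root", "Topic", [
    Node("math", "Topic", [Node("intro-video", "Video")]),
    Node("partner-content", "Topic", [Node("some-video", "Video")]),
])
filtered = filter_by_slug(tree, blacklist={"partner-content"})
print([child.slug for child in filtered.children])
```

Building a filtered copy rather than mutating the tree in place keeps the original tree available for comparison while debugging.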

Code

Important chef code

sushichef.py          Main code for the content integration script
tsvkhan.py            Functions for loading data from the new KA TSV exports
constants.py          Constants, metadata, and settings used in the code
curation.py           Topic node replacements to organize the KA topic trees
crowdin.py            Obtain translations from CrowdIn
common_core_tags.py   Helper class to obtain the CCSSM tags for KA exercises
network.py            Robust HTTP requests that use caching

Debugging and reports code

graphql.py            Helper method to extract localized topic trees from the KA website
katrees.py            Generate report and print topic tree from the KA API
kolibridb.py          Generate report and print topic tree from Kolibri DBs

Each of these scripts can be called as standalone command line scripts.

Example 1. Get localized topic tree info from Khan Academy website:

./graphql.py --lang hi   # check the topics structure used on hi.khanacademy.org
./graphql.py --lang en --curriculum us-cc   # topic structure for us-cc variant

Use the output of this script to add or update the info in curation.py.

Example 2. Print the entire topic tree from the French TSV export (up to 3 levels) and also save the tree as an HTML file:

./katrees.py --print --printmaxlevel=3 --htmlexport --htmlmaxlevel=4   --lang fr

Example 3. Print the topic tree for the currently published Kolibri channel Khan Academy (Français), which has channel ID 878ec2e6f88c5c268b1be6f202833cd4:

./kolibridb.py --channel_id 878ec2e6f88c5c268b1be6f202833cd4 --printmaxlevel 3 --update

The output is similar to the tree output of Example 2, so the two can be compared when debugging (what we get from the TSV exports vs. what ends up in the final channel).
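
One possible way to do that comparison, assuming the two printouts have been captured as lists of text lines, is a unified diff from Python's standard difflib module. The function name diff_tree_printouts and the sample tree lines are made up for illustration and are not part of the chef code.

```python
import difflib
from typing import List


def diff_tree_printouts(tsv_lines: List[str], channel_lines: List[str]) -> List[str]:
    """Return unified-diff lines between two captured tree printouts."""
    return list(difflib.unified_diff(
        tsv_lines,
        channel_lines,
        fromfile="tsv_export",
        tofile="kolibri_channel",
        lineterm="",
    ))


# Illustrative printouts only; real output from katrees.py and kolibridb.py differs.
tsv_tree = ["Maths", "  Arithmétique", "  Algèbre"]
channel_tree = ["Maths", "  Arithmétique"]
for line in diff_tree_printouts(tsv_tree, channel_tree):
    print(line)
```

Lines prefixed with - appear only in the TSV export printout, and lines prefixed with + appear only in the channel printout.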

Example 4. The code in tsvkhan.py is also runnable as a standalone script:

./tsvkhan.py   # list the available TSV exports for all languages
./tsvkhan.py --kalang fr     # list the TSV exports available for French

KhanExercise

Each exercise has a list of assessment item IDs associated with it. To retrieve each assessment item, we request https://www.khanacademy.org/api/v1/assessment_items/{id}?lang={lang}. We include an assessment item only if the is_fully_translated field of the response indicates that its content is fully translated.
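
The selection rule can be sketched as follows. This is an illustration under stated assumptions rather than the chef's implementation: the function names are hypothetical, the HTTP call is replaced by an injected fetch_json callable so the example runs offline, and the response is assumed to be a JSON object with a boolean is_fully_translated key.

```python
from typing import Callable, Dict, List

# URL template taken from the paragraph above.
ASSESSMENT_URL = "https://www.khanacademy.org/api/v1/assessment_items/{id}?lang={lang}"


def assessment_item_url(item_id: str, lang: str) -> str:
    """Fill the URL template for one assessment item."""
    return ASSESSMENT_URL.format(id=item_id, lang=lang)


def translated_items(
    item_ids: List[str],
    lang: str,
    fetch_json: Callable[[str], Dict],
) -> List[Dict]:
    """Keep only the items whose response reports a full translation."""
    items = []
    for item_id in item_ids:
        data = fetch_json(assessment_item_url(item_id, lang))
        if data.get("is_fully_translated"):
            items.append(data)
    return items


# Fake responses standing in for the network; the IDs are made up.
FAKE_RESPONSES = {
    assessment_item_url("x1", "fr"): {"id": "x1", "is_fully_translated": True},
    assessment_item_url("x2", "fr"): {"id": "x2", "is_fully_translated": False},
}
kept = translated_items(["x1", "x2"], "fr", FAKE_RESPONSES.__getitem__)
print([item["id"] for item in kept])
```

Passing the fetch function as a parameter makes it easy to swap in a cached HTTP client for real runs.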

Data Mapping

Below is a table which shows the mapping from the Khan data structures to the Ricecooker data structures.

KA Data Structures    Ricecooker Data Structures
KhanTopic             nodes.TopicNode
KhanExercise          nodes.ExerciseNode
KhanAsessmentItem     questions.PerseusQuestion
KhanVideo             nodes.VideoNode
KhanArticle           Not Supported
KhanScratchpad        Not Supported
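
The same mapping can be written as a lookup table, shown here as a rough sketch. The dictionary and the ricecooker_target helper are hypothetical; the class names are kept as plain strings, spelled as in the table, so the example runs without ricecooker installed.

```python
from typing import Dict, Optional

# Names copied from the table above; None marks kinds that are not supported.
RICECOOKER_CLASS_FOR: Dict[str, Optional[str]] = {
    "KhanTopic": "nodes.TopicNode",
    "KhanExercise": "nodes.ExerciseNode",
    "KhanAsessmentItem": "questions.PerseusQuestion",
    "KhanVideo": "nodes.VideoNode",
    "KhanArticle": None,
    "KhanScratchpad": None,
}


def ricecooker_target(khan_class_name: str) -> Optional[str]:
    """Return the ricecooker class name for a KA class, or None if unsupported."""
    if khan_class_name not in RICECOOKER_CLASS_FOR:
        raise KeyError(f"Unknown KA data structure: {khan_class_name}")
    return RICECOOKER_CLASS_FOR[khan_class_name]


print(ricecooker_target("KhanVideo"))
print(ricecooker_target("KhanArticle"))
```

Raising on unknown names, instead of returning None, keeps "not supported" distinct from "not in the table".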

TODOs

Channel variants and Localized Topic Trees (LTTs)

The Khan Academy content is available under different topic structures. The KA content in English was originally organized around high-level Subjects. Later, an additional "Math by grade" topic structure was added that contains the same videos and lessons but is organized according to the US Common Core Math standards. Certain KA languages offer additional topic structures aligned to local grade levels, called Localized Topic Trees (LTTs). The KA website offers multiple top-level menu topics that can vary with both language and region. All the topic trees are available through the KA API, but they co-exist within the single tree structure that the API returns. This can be overwhelming for users, since the same content appears repeatedly and the organization is not what they expect.

The SLUG_BLACKLIST and TOPIC_TREE_REPLACMENTS_PER_LANG info in curation.py allows us to take advantage of these multiple topic trees and present Kolibri users with a topic tree structure that closely resembles the Khan Academy website. Each Kolibri channel is created from a (lang, variant) combination, where lang is one of the le_utils language codes and variant is one of the "curriculum" variants available for that language. For example, the Khan Academy English content comes in two variants: the US variant (us-cc) and the India variant (in-in). Here is a complete list of channels and the lang/variant command line options for them: